CN108447090B - Object posture estimation method and device and electronic equipment - Google Patents


Info

Publication number
CN108447090B
CN108447090B (application CN201611130138.9A)
Authority
CN
China
Prior art keywords
camera
attitude
cameras
pose
attitude estimation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611130138.9A
Other languages
Chinese (zh)
Other versions
CN108447090A (en)
Inventor
熊怀欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ricoh Co Ltd
Original Assignee
Ricoh Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ricoh Co Ltd filed Critical Ricoh Co Ltd
Priority to CN201611130138.9A priority Critical patent/CN108447090B/en
Publication of CN108447090A publication Critical patent/CN108447090A/en
Application granted granted Critical
Publication of CN108447090B publication Critical patent/CN108447090B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/59 Context or environment of the image inside of a vehicle, e.g. relating to seat occupancy, driver state or inner lighting conditions
    • G06V20/597 Recognising the driver's state or behaviour, e.g. attention or drowsiness

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Length Measuring Devices By Optical Means (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention provides an object posture estimation method and device and electronic equipment, belonging to the field of image and video processing. The object posture estimation method uses a plurality of cameras to perform posture estimation on the same posture-estimated object and comprises the following steps: determining the camera attitude difference between every two cameras in a system formed by the plurality of cameras and the posture-estimated object; performing attitude estimation separately on the views shot by the cameras to obtain a plurality of initial attitude estimation results, and selecting, according to the camera attitude differences, at least two initial attitude estimation results that can be fused; establishing an optimization objective function and its constraint conditions from the at least two initial attitude estimation results and solving them to obtain a correction increment for each initial attitude estimation result; and calculating a target attitude estimation result by using the correction increments. The technical scheme of the invention can improve the accuracy and the confidence of object attitude estimation.

Description

Object posture estimation method and device and electronic equipment
Technical Field
The present invention relates to the field of image and video processing, and in particular, to a method and an apparatus for estimating an object pose, and an electronic device.
Background
Analysis of driver behavior is central to safe driving systems, and estimation of the driver's head posture is the basis of fatigue and distraction detection; improving driver head posture estimation in order to reduce the likelihood of traffic accidents has therefore attracted many researchers. Vision-based pose estimation is widely used because it is non-intrusive and easy to deploy. Head posture typically has 3 degrees of freedom, which can be represented by the Pitch, Roll and Yaw angles and correspond to a unique rotation matrix. Vision-based 3D pose estimation is the estimation, from a captured 2D image, of the appropriate rotation and translation of the head relative to the camera coordinate system.
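To make the angle representation concrete, the sketch below (an illustrative example, not part of the claimed method) builds the rotation matrix for a given set of Pitch/Roll/Yaw angles; it assumes one particular axis convention and composition order (pitch about X, yaw about Y, roll about Z, composed as Rz * Ry * Rx), since conventions differ between systems.

```python
import numpy as np

def euler_to_rotation(pitch: float, yaw: float, roll: float) -> np.ndarray:
    """Rotation matrix for head-pose angles given in radians.
    Assumed convention: pitch about X, yaw about Y, roll about Z, composed as Rz @ Ry @ Rx."""
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(pitch), -np.sin(pitch)],
                   [0, np.sin(pitch),  np.cos(pitch)]])
    Ry = np.array([[ np.cos(yaw), 0, np.sin(yaw)],
                   [0, 1, 0],
                   [-np.sin(yaw), 0, np.cos(yaw)]])
    Rz = np.array([[np.cos(roll), -np.sin(roll), 0],
                   [np.sin(roll),  np.cos(roll), 0],
                   [0, 0, 1]])
    return Rz @ Ry @ Rx
```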
Most current head pose estimation methods can be classified into two types: model-based methods and image-appearance-based methods. The former are usually based on the correspondence between 2D face features and a 3D head model, or on the geometric properties of feature points. The latter estimate the pose by classification, regression or manifold embedding, and in most cases obtain only a relatively rough estimate.
Since the human head can be regarded as a cylinder or a sphere, when the head rotates through a large angle the pose estimation result based on a single camera is inaccurate, whichever method is adopted, because the field of view of a single camera is limited. To improve the accuracy and confidence of pose estimation, multi-view pose estimation methods based on multiple cameras have been proposed.
One prior approach to multi-view pose estimation is to select, from the multiple cameras, the view with the smallest Yaw angle for pose computation. The method relies on the assumption that the view with the smallest Yaw angle yields a more reliable and accurate pose estimate than the other views. Obviously, this assumption is not necessarily satisfied in a multi-view system, which is subject to various influencing factors. Another multi-view pose estimation approach employs a switching strategy from one camera to another, but this method again depends on the correctness of the estimate from the selected view.
The prior art also includes a multi-view pose estimation method that fuses the individual pose estimation results provided by each view through a weighted-average fusion operator. In fact, the estimates from different views have different confidences and accuracies, and no matter how the preset weights are chosen, the confidence of the final pose result cannot be effectively improved. Another prior-art multi-view method fuses the pose estimation results under different views by averaging probabilities, but this is only suitable for a rough estimate of the pose.
In summary, the accuracy, especially the confidence level, of the attitude estimation result obtained by the conventional multi-view attitude estimation is not really improved.
Disclosure of Invention
The invention aims to provide a method and a device for estimating the attitude of an object and electronic equipment.
To solve the above technical problem, embodiments of the present invention provide the following technical solutions:
in one aspect, a method for object pose estimation is provided, wherein pose estimation is performed on the same pose estimation object by using a plurality of cameras, the plurality of cameras can shoot views of the pose estimation object from different perspectives, and estimated pose object areas shot by two associated cameras in the plurality of cameras have overlapped parts, and the method comprises the following steps:
determining camera attitude differences between every two cameras in the system formed by the plurality of cameras and the attitude object to be estimated, wherein the camera attitude differences are fixed differences between attitude results obtained by different camera visual angles on the same attitude object to be estimated and are equal to rotation angles from one camera to the other camera along three orthogonal directions under a world coordinate system, and estimated attitude object areas shot by every two cameras respectively have overlapped parts;
respectively carrying out attitude estimation by using the views shot by the plurality of cameras to obtain a plurality of initial attitude estimation results;
selecting at least two initial attitude estimation results which can be fused from the initial attitude estimation results according to the camera attitude difference, wherein the difference distance between the attitude difference between the two initial attitude estimation results and the camera attitude difference of the corresponding camera is smaller than a preset first threshold value;
establishing an optimization objective function and a constraint condition thereof by using the at least two initial attitude estimation results, and solving to obtain a correction increment of each initial attitude estimation result, wherein the optimization objective function is to minimize an average back projection error of all feature points in the view corresponding to the corrected at least two initial attitude estimation results, the constraint condition is that the average back projection error of the feature points in each view corresponding to the corrected at least two initial attitude estimation results is smaller than a preset second threshold, a difference value between each corrected initial attitude estimation result and another corrected initial attitude estimation result in the at least two initial attitude estimation results is equal to a camera attitude difference of each corresponding pair of cameras, and the corrected initial attitude estimation result is the initial attitude estimation result plus the correction increment;
and calculating to obtain a target attitude estimation result, wherein the target attitude result is the sum of any one of the at least two initial attitude estimation results and the correction increment thereof, and converting the target attitude result into a world coordinate system of the estimated attitude object.
Preferably, any two cameras in the plurality of cameras are a first camera and a second camera, and the determining the camera pose difference between every two cameras in the system formed by the plurality of cameras and the pose-estimated object comprises:
establishing a world coordinate system by taking the estimated attitude object as a center, respectively carrying out position calibration on the first camera and the second camera to obtain respective external parameters, and obtaining a rotation matrix R_L corresponding to the first camera and a rotation matrix R_R corresponding to the second camera, wherein the rotation matrix is the rotation transformation from a camera coordinate system to the world coordinate system;
using the rotation matrix R_L and the rotation matrix R_R to calculate a rotation matrix R_LoR from the first camera to the second camera, wherein R_LoR = R_L * R_R;
decomposing R_LoR, through Euler angle calculation, into rotation angles in the three directions of pitch angle, roll angle and yaw angle to obtain the camera attitude difference between the first camera and the second camera.
Preferably, the method for respectively performing pose estimation by using the views shot by the plurality of cameras is to perform pose estimation of the estimated object on the views shot by the cameras based on a method in which 3D model points correspond to 2D feature points;
when the at least two initial attitude estimation results are A and B, A is obtained by performing attitude estimation on the view shot by the first camera, B is obtained by performing attitude estimation on the view shot by the second camera, and the respective correction increments of A and B are ΔA and ΔB, the establishing an optimization objective function and constraint conditions thereof by using the at least two initial attitude estimation results and solving to obtain the correction increment of each initial attitude estimation result includes:
establishing an optimization objective function (objective formula 1):
minimize over ΔA and ΔB:  (1/(2N)) * Σ_{i=1..N} [ FL(A+ΔA, i) + FR(B+ΔB, i) ];
The constraint conditions are as follows:
A + ΔA + CONST = B + ΔB    (constraint 1);
(1/N) * Σ_{i=1..N} FL(A+ΔA, i) < V  and  (1/N) * Σ_{i=1..N} FR(B+ΔB, i) < V    (constraint 2);
FL(A+ΔA, i) = || fL(A+ΔA, MP_i) - P_i(L) ||,
FR(B+ΔB, i) = || fR(B+ΔB, MP_i) - P_i(R) ||;
wherein CONST is the camera pose difference between the first camera and the second camera, MP_i is the i-th point of the 3D model in the object coordinate system, P_i(L) is the 2D feature point corresponding to the 3D model point MP_i in the view taken by the first camera, P_i(R) is the 2D feature point corresponding to the 3D model point MP_i in the view taken by the second camera, fL(.) is the projection transformation formula from 3D model points to 2D feature points of the first camera, fR(.) is the projection transformation formula from 3D model points to 2D feature points of the second camera, N is the number of feature points, and V is the preset second threshold;
if there are n fusion objects, the summation term of objective formula 1 has n addend terms for each i, constraint condition 2 correspondingly comprises n constraint formulas, and if the fusion objects are associated with m corresponding camera attitude differences, constraint condition 1 correspondingly comprises m constraint formulas;
and solving the optimization objective function by using an enumeration method or an iteration method to obtain ΔA and ΔB.
Preferably, the solving the optimization objective function by using an iterative method includes:
calculating a total correction amount ΔD = B - A - CONST;
letting n = 0, S = ΔD, A_0 = A,
Entering an iterative process:
while (S >= preset third threshold) {
n++;
selecting, from the three candidate solutions {A_{n-1}, A_{n-1}+S, A_{n-1}-S}, the best solution under the optimization objective function and constraint conditions as A_n, wherein each of the 3 candidate solutions Ci satisfies Ci >= min{A, A+ΔD} and Ci < max{A, A+ΔD};
S=S/2;
}
S is the step length of each iteration;
if S < the preset third threshold, stopping iteration;
thereby obtaining ΔA = A_n - A and ΔB = ΔA - ΔD.
An embodiment of the present invention further provides an apparatus for estimating a pose of an object, in which a plurality of cameras are used to perform pose estimation on the same pose object, the plurality of cameras can capture views of the pose object from different perspectives, and regions of the pose object to be estimated, captured by two cameras associated with each other, of the plurality of cameras have overlapping portions, the apparatus including:
the camera attitude difference determining module is used for determining the camera attitude difference between every two cameras in the system formed by the plurality of cameras and the estimated attitude object, the camera attitude difference is a fixed difference value between attitude results obtained by different camera visual angles on the same estimated attitude object, the fixed difference value is equal to a rotation angle from one camera to the other camera along three orthogonal directions under a world coordinate system, and the estimated attitude object areas shot by every two cameras respectively have overlapped parts;
the initial estimation module is used for respectively carrying out attitude estimation by using the views shot by the cameras to obtain a plurality of initial attitude estimation results;
the optimization solving module is used for selecting at least two initial attitude estimation results which can be fused from the initial attitude estimation results according to the camera attitude difference, and the difference distance between the attitude difference between the two initial attitude estimation results and the camera attitude difference of the corresponding camera is smaller than a preset first threshold value; establishing an optimization objective function and a constraint condition thereof by using the at least two initial attitude estimation results, and solving to obtain a correction increment of each initial attitude estimation result, wherein the optimization objective function is to minimize an average back projection error of all feature points in the view corresponding to the corrected at least two initial attitude estimation results, the constraint condition is that the average back projection error of the feature points in each view corresponding to the corrected at least two initial attitude estimation results is smaller than a preset second threshold, a difference value between each corrected initial attitude estimation result and another corrected initial attitude estimation result in the at least two initial attitude estimation results is equal to a camera attitude difference of each corresponding pair of cameras, and the corrected initial attitude estimation result is the initial attitude estimation result plus the correction increment;
the calculation module is used for calculating to obtain a target attitude estimation result, wherein the target attitude result is the sum of any one of the at least two initial attitude estimation results and the correction increment thereof;
and the conversion module is used for converting the target attitude result into a world coordinate system of the estimated attitude object after the target attitude estimation result is obtained through calculation.
Preferably, the camera pose difference determination module comprises:
a position calibration unit, configured to establish a world coordinate system with the estimated posture object as a center, perform position calibration on the first camera and the second camera respectively to obtain respective external parameters, and obtain a rotation matrix R_L corresponding to the first camera and a rotation matrix R_R corresponding to the second camera, wherein the rotation matrix is the rotation transformation from a camera coordinate system to the world coordinate system;
a calculation unit, configured to calculate a rotation matrix R_LoR from the first camera to the second camera by using the rotation matrix R_L and the rotation matrix R_R, wherein R_LoR = R_L * R_R;
a decomposition unit, configured to decompose R_LoR, through Euler angle calculation, into rotation angles in the three directions of pitch angle, roll angle and yaw angle to obtain the camera attitude difference between the first camera and the second camera.
Preferably, the method for respectively performing pose estimation by using the views shot by the plurality of cameras is to perform pose estimation of the estimated object on the views shot by the cameras based on a method in which 3D model points correspond to 2D feature points;
when the at least two initial attitude estimation results are A and B, A is obtained by performing attitude estimation on the view shot by the first camera, B is obtained by performing attitude estimation on the view shot by the second camera, and the respective correction increments of A and B are ΔA and ΔB, the optimization solving module includes:
a selection unit, configured to select at least two initial posture estimation results A and B that can be fused from the plurality of initial posture estimation results according to the camera posture difference;
an optimization objective function establishing unit for establishing an optimization objective function
minimize over ΔA and ΔB:  (1/(2N)) * Σ_{i=1..N} [ FL(A+ΔA, i) + FR(B+ΔB, i) ]    (objective formula 1);
The constraint conditions are as follows:
A + ΔA + CONST = B + ΔB    (constraint 1);
(1/N) * Σ_{i=1..N} FL(A+ΔA, i) < V  and  (1/N) * Σ_{i=1..N} FR(B+ΔB, i) < V    (constraint 2);
FL(A+ΔA, i) = || fL(A+ΔA, MP_i) - P_i(L) ||,
FR(B+ΔB, i) = || fR(B+ΔB, MP_i) - P_i(R) ||;
wherein CONST is the camera pose difference between the first camera and the second camera, MP_i is the i-th point of the 3D model in the object coordinate system, P_i(L) is the 2D feature point corresponding to the 3D model point MP_i in the view taken by the first camera, P_i(R) is the 2D feature point corresponding to the 3D model point MP_i in the view taken by the second camera, fL(.) is the projection transformation formula from 3D model points to 2D feature points of the first camera, fR(.) is the projection transformation formula from 3D model points to 2D feature points of the second camera, N is the number of feature points, and V is the preset second threshold; if there are n fusion objects, the summation term of objective formula 1 has n addend terms for each i, constraint condition 2 correspondingly comprises n constraint formulas, and if the fusion objects are associated with m corresponding camera attitude differences, constraint condition 1 correspondingly comprises m constraint formulas;
and the solving unit is used for solving the optimization objective function by using an enumeration method or an iteration method to obtain ΔA and ΔB.
Preferably, the solving unit is specifically configured to:
calculating a total correction amount ΔD = B - A - CONST;
letting n = 0, S = ΔD, A_0 = A,
Entering an iterative process:
while (S >= preset third threshold) {
n++;
selecting, from the three candidate solutions {A_{n-1}, A_{n-1}+S, A_{n-1}-S}, the best solution under the optimization objective function and constraint conditions as A_n, wherein each of the 3 candidate solutions Ci satisfies Ci >= min{A, A+ΔD} and Ci < max{A, A+ΔD};
S=S/2;
}
S is the step length of each iteration;
if S < the preset third threshold, stopping iteration;
thereby obtaining ΔA = A_n - A and ΔB = ΔA - ΔD.
An embodiment of the present invention further provides an electronic device for object pose estimation, where a plurality of cameras are used to perform pose estimation on the same pose estimation object, the plurality of cameras can capture views of the pose estimation object from different perspectives, and regions of the pose estimation object captured by two associated cameras in the plurality of cameras have overlapping portions, the electronic device including:
a processor; and
a memory having computer program instructions stored therein,
wherein the computer program instructions, when executed by the processor, cause the processor to perform the steps of:
determining camera attitude differences between every two cameras in the system formed by the plurality of cameras and the attitude object to be estimated, wherein the camera attitude differences are fixed differences between attitude results obtained by different camera visual angles on the same attitude object to be estimated and are equal to rotation angles from one camera to the other camera along three orthogonal directions under a world coordinate system, and estimated attitude object areas shot by every two cameras respectively have overlapped parts;
respectively carrying out attitude estimation by using the views shot by the plurality of cameras to obtain a plurality of initial attitude estimation results;
selecting at least two initial attitude estimation results which can be fused from the initial attitude estimation results according to the camera attitude difference, wherein the difference distance between the attitude difference between the two initial attitude estimation results and the camera attitude difference of the corresponding camera is smaller than a preset first threshold value;
establishing an optimization objective function and a constraint condition thereof by using the at least two initial attitude estimation results, and solving to obtain an increment of each initial attitude estimation result, wherein the optimization objective function is to minimize an average back projection error of all feature points in the views corresponding to the corrected at least two initial attitude estimation results, the constraint condition is that the average back projection error of the feature points in each view corresponding to the corrected at least two initial attitude estimation results is smaller than a preset second threshold, a difference value between each corrected initial attitude estimation result and another corrected initial attitude estimation result in the at least two initial attitude estimation results is equal to a camera attitude difference of each corresponding pair of cameras, and the corrected initial attitude estimation result is the initial attitude estimation result plus a correction increment;
and calculating to obtain a target attitude estimation result, wherein the target attitude result is the sum of any one of the at least two initial attitude estimation results and the correction increment thereof, and converting the target attitude result into a world coordinate system of the estimated attitude object.
The embodiment of the invention has the following beneficial effects:
In the above-described scheme, views of the posture-estimated object are taken from different perspectives by a plurality of cameras, and the posture-estimated object regions taken by two mutually associated cameras among the plurality of cameras have overlapping portions, so the views taken by the plurality of cameras are related to one another. Since the confidence and accuracy of the pose estimation result of any single view leave room for improvement, the invention uses the relevance among the multiple views to improve both. First, the pose estimation results of the views taken by the corresponding cameras are screened according to the camera pose differences between the cameras, and the view results that can be fused are selected from them. An optimization scheme is then established from these fusable initial pose estimation results and the corresponding camera pose differences; it balances the error of the initial pose estimation result of each view, the average projection error of each view and the error with respect to the camera pose differences, and thereby corrects the initial pose estimation result of every view participating in the fusion. In this way several related, not fully independent events are corrected into concurrent events, and it follows from probability theory that the accuracy and confidence of the pose estimation result are improved.
Drawings
FIG. 1 is a flow chart illustrating a method for object pose estimation according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart illustrating a process of determining a camera pose difference between two cameras of a plurality of cameras according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart illustrating a process of establishing an optimized objective function and constraint conditions thereof by using at least two initial attitude estimation results, and solving to obtain an increment of each initial attitude estimation result according to an embodiment of the present invention;
FIG. 4 is a block diagram of an apparatus for object pose estimation according to Example two of the present invention;
FIG. 5 is a block diagram of a camera pose difference determination module according to Example two of the present invention;
FIG. 6 is a block diagram of an optimization solving module according to Example two of the present invention;
FIG. 7 is a block diagram of an electronic device for object pose estimation according to Example three of the present invention;
FIG. 8 is a flowchart illustrating a method for object pose estimation according to Example four of the present invention;
FIG. 9 is a schematic diagram of Pitch (Pitch angle), Roll (Roll angle), and Yaw (Yaw);
FIG. 10 is a schematic diagram of a fourth embodiment of the present invention, in which two cameras symmetrically disposed on two sides of the head of a user in the forward direction are used to capture views of the head of the user;
FIG. 11 is a schematic diagram of the distribution of feature points on a human face;
FIG. 12 is a projection diagram of a 3D sphere with an equal-rotation-pitch grid onto a 2D plane.
Detailed Description
In order to make the technical problems, technical solutions and advantages to be solved by the embodiments of the present invention clearer, the following detailed description will be given with reference to the accompanying drawings and specific embodiments.
The embodiment of the invention provides a method, a device and electronic equipment for object attitude estimation aiming at the problem that the accuracy and the confidence coefficient of the attitude estimation result are low in the prior art.
Example one
The present embodiment provides a method for estimating a pose of an object, which performs pose estimation on the same pose object by using a plurality of cameras, wherein the plurality of cameras can capture views of the pose object from different perspectives, and regions of the pose object to be estimated captured by two associated cameras in the plurality of cameras have overlapping portions, as shown in fig. 1, and the method comprises:
step 101: determining camera attitude differences between every two cameras in the system formed by the plurality of cameras and the attitude object to be estimated, wherein the camera attitude differences are fixed differences between attitude results obtained by different camera visual angles on the same attitude object to be estimated and are equal to rotation angles from one camera to the other camera along three orthogonal directions under a world coordinate system, and the attitude object areas to be estimated, which are shot by every two cameras, have overlapped parts;
step 102: respectively carrying out attitude estimation by using the views shot by the plurality of cameras to obtain a plurality of initial attitude estimation results;
step 103: selecting at least two initial attitude estimation results which can be fused from the initial attitude estimation results according to the camera attitude difference, wherein the difference distance between the attitude difference between the two initial attitude estimation results and the camera attitude difference of the corresponding camera is smaller than a preset first threshold value; establishing an optimization objective function and a constraint condition thereof by using the at least two initial attitude estimation results, and solving to obtain a correction increment of each initial attitude estimation result, wherein the optimization objective function is to minimize an average back projection error of all feature points in the views corresponding to the corrected at least two initial attitude estimation results, the constraint condition is that the back projection error of the feature points in each view corresponding to the corrected at least two initial attitude estimation results is smaller than a preset second threshold, and a difference value between the corrected initial attitude estimation result and another initial attitude estimation result in the at least two initial attitude estimation results is equal to a camera attitude difference of each corresponding pair of cameras, and the corrected initial attitude estimation result refers to the initial attitude estimation result plus the correction increment;
step 104: and calculating to obtain a target attitude estimation result, wherein the target attitude result is the sum of any one of the at least two initial attitude estimation results and the correction increment thereof, and converting the target attitude result into an estimated attitude object world coordinate system.
In the present embodiment, views of the posture-estimated object are taken from different perspectives by a plurality of cameras, and the posture-estimated object regions taken by two mutually associated cameras among the plurality of cameras have overlapping portions, so the views taken by the plurality of cameras are related to one another. Since the confidence and accuracy of the pose estimation result of any single view leave room for improvement, the invention uses the relevance among the multiple views to improve both. First, the pose estimation results of the views taken by the corresponding cameras are screened according to the camera pose differences between the cameras, and the view results that can be fused are selected from them. An optimization scheme is then established from these fusable initial pose estimation results and the corresponding camera pose differences; it balances the error of the initial pose estimation result of each view, the average projection error of each view and the error with respect to the camera pose differences, and thereby corrects the initial pose estimation result of every view participating in the fusion. In this way several related, not fully independent events are corrected into concurrent events, and it follows from probability theory that the accuracy and confidence of the pose estimation result are improved.
Because the target pose estimation result under each view is the pose of the estimated-pose object in the camera coordinate system corresponding to that view, after the target pose estimation result is calculated it needs to be converted into the world coordinate system of the estimated-pose object before being output.
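As an illustration of this conversion (not part of the claims), the sketch below assumes the target pose is available as a rotation matrix expressed in the camera coordinate system and that the camera-to-world rotation obtained from extrinsic calibration (the rotation matrix R_L of the steps below) is known; the function name is hypothetical.

```python
import numpy as np

def pose_to_world(R_obj_in_cam: np.ndarray, R_cam_to_world: np.ndarray) -> np.ndarray:
    """Express an object rotation estimated in a camera coordinate system in the world
    coordinate system by composing it with the camera-to-world rotation from calibration."""
    return R_cam_to_world @ R_obj_in_cam
```

If the target pose is instead kept as (pitch, roll, yaw) angles, it would first be converted to a rotation matrix, composed as above, and then decomposed back into Euler angles.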
Further, any two cameras of the plurality of cameras are a first camera and a second camera, as shown in fig. 2, the determining the camera pose difference between any two cameras of the plurality of cameras comprises:
step 201: establishing a world coordinate system by taking the estimated attitude object as a center, respectively carrying out position calibration on the first camera and the second camera to obtain their external parameters, and obtaining a rotation matrix R_L corresponding to the first camera and a rotation matrix R_R corresponding to the second camera, wherein the rotation matrix is the rotation transformation from a camera coordinate system to the world coordinate system;
step 202: using the rotation matrix R_L and the rotation matrix R_R to calculate a rotation matrix R_LoR from the first camera to the second camera, wherein R_LoR = R_L * R_R;
Step 203: decomposing R_LoR, through Euler angle calculation, into rotation angles in the three directions of pitch angle, roll angle and yaw angle to obtain the camera attitude difference between the first camera and the second camera.
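A minimal sketch of steps 201 to 203, together with the fusability screening described in step 103, might look as follows. It is illustrative only: it uses SciPy's rotation utilities for the Euler decomposition, treats a pose as a numpy array of (pitch, roll, yaw) in degrees, uses hypothetical function names, and assumes one particular Euler axis order, since the text does not fix a convention.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def camera_pose_difference(R_L: np.ndarray, R_R: np.ndarray) -> np.ndarray:
    """Camera pose difference CONST between two calibrated cameras.
    R_L and R_R are the camera-to-world rotation matrices from extrinsic calibration;
    the relative rotation R_LoR = R_L * R_R is decomposed into Euler angles."""
    R_LoR = R_L @ R_R
    pitch, yaw, roll = Rotation.from_matrix(R_LoR).as_euler('xyz', degrees=True)
    return np.array([pitch, roll, yaw])

def can_fuse(pose_A: np.ndarray, pose_B: np.ndarray, const: np.ndarray, threshold1: float) -> bool:
    """Screening of step 103: two initial estimates are fusable when the distance between
    their pose difference and the camera pose difference is below the first threshold."""
    return float(np.linalg.norm((pose_B - pose_A) - const)) < threshold1
```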
Further, the method for respectively performing pose estimation by using the views shot by the plurality of cameras is to perform pose estimation of the estimated object on the views shot by the cameras based on a method in which 3D model points correspond to 2D feature points;
when the at least two initial pose estimation results are A and B, A is obtained by performing pose estimation on the view captured by the first camera, B is obtained by performing pose estimation on the view captured by the second camera, and the respective correction increments of A and B are ΔA and ΔB, as shown in fig. 3, the step 103 includes:
step 301: selecting at least two initial attitude estimation results which can be fused from the plurality of initial attitude estimation results according to the camera attitude difference;
step 302: establishing an optimization objective function and constraint conditions thereof by using the at least two initial attitude estimation results;
the optimization objective function is:
minimize over ΔA and ΔB:  (1/(2N)) * Σ_{i=1..N} [ FL(A+ΔA, i) + FR(B+ΔB, i) ]    (objective formula 1);
the constraint conditions are as follows:
A + ΔA + CONST = B + ΔB    (constraint 1);
(1/N) * Σ_{i=1..N} FL(A+ΔA, i) < V  and  (1/N) * Σ_{i=1..N} FR(B+ΔB, i) < V    (constraint 2);
FL(A+ΔA, i) = || fL(A+ΔA, MP_i) - P_i(L) ||,
FR(B+ΔB, i) = || fR(B+ΔB, MP_i) - P_i(R) ||;
wherein CONST is the camera pose difference between the first camera and the second camera, MP_i is the i-th point of the 3D model in the object coordinate system, P_i(L) is the 2D feature point corresponding to the 3D model point MP_i in the view taken by the first camera, P_i(R) is the 2D feature point corresponding to the 3D model point MP_i in the view taken by the second camera, fL(.) is the projection transformation formula from 3D model points to 2D feature points of the first camera, fR(.) is the projection transformation formula from 3D model points to 2D feature points of the second camera, N is the number of feature points, and V is the preset second threshold;
the above formula is explained by the existence of two fusion objects, further, when there are n fusion objects, the target formula 1 cumulative summation term has n summation terms for each i, meanwhile, the constraint condition formula 2 also has n corresponding constraint condition formulas, and meanwhile, if the fusion object is associated with m corresponding camera differences, the constraint condition 1 also has m corresponding constraint formulas.
Step 303: solving the optimization objective function by using an enumeration method or an iteration method to obtain ΔA and ΔB.
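To make the objective function and constraints above concrete, one way they could be evaluated is sketched below. This is illustrative only and not part of the claims: the projection functions fL(.) and fR(.) are assumed to be supplied as callables (for example wrapping a pinhole projection with each camera's intrinsic parameters), poses and increments are numpy arrays, and all names are hypothetical.

```python
import numpy as np

def backprojection_error(pose, model_points, image_points, project):
    """Average back-projection error of one view: the mean over i of
    ||project(pose, MP_i) - P_i||, where `project` plays the role of fL(.) or fR(.)."""
    errors = [np.linalg.norm(np.asarray(project(pose, mp)) - np.asarray(p))
              for mp, p in zip(model_points, image_points)]
    return float(np.mean(errors))

def objective(dA, dB, A, B, mpts, pts_L, pts_R, proj_L, proj_R):
    """Objective formula 1: average back-projection error over both corrected views."""
    return 0.5 * (backprojection_error(A + dA, mpts, pts_L, proj_L)
                  + backprojection_error(B + dB, mpts, pts_R, proj_R))

def constraints_ok(dA, dB, A, B, const, V, mpts, pts_L, pts_R, proj_L, proj_R, tol=1e-6):
    """Constraint 1 (corrected poses differ by CONST) and constraint 2 (per-view error below V)."""
    c1 = np.allclose(A + dA + const, B + dB, atol=tol)
    c2 = (backprojection_error(A + dA, mpts, pts_L, proj_L) < V
          and backprojection_error(B + dB, mpts, pts_R, proj_R) < V)
    return bool(c1 and c2)
```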
Further, the solving the optimization objective function by using an iterative method includes:
calculating a total correction amount ΔD = B - A - CONST;
letting n = 0, S = ΔD, A_0 = A,
Entering an iterative process:
while (S >= preset third threshold) {
n++;
selecting, from the three candidate solutions {A_{n-1}, A_{n-1}+S, A_{n-1}-S}, the best solution under the optimization objective function and constraint conditions as A_n, wherein each of the 3 candidate solutions Ci satisfies Ci >= min{A, A+ΔD} and Ci < max{A, A+ΔD};
S=S/2;
}
S is the step length of each iteration;
if S < the preset third threshold, stopping iteration;
thereby obtaining ΔA = A_n - A and ΔB = ΔA - ΔD.
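One possible realization of this iterative solution, as a sketch under the same assumptions and with hypothetical names, is the step-halving search below; `objective(dA, dB)` and `constraints_ok(dA, dB)` are the functions of the previous sketch with the camera and feature-point arguments already bound (for example via functools.partial).

```python
import numpy as np

def iterative_fusion(A, B, const, objective, constraints_ok, threshold3):
    """Iterative solution sketch: search for the corrected estimate A_n between A and A + dD,
    halving the step each round, then derive the correction increments dA and dB."""
    A, B, const = (np.asarray(x, dtype=float) for x in (A, B, const))
    dD = B - A - const                      # total correction amount
    S = dD.copy()                           # step of each iteration
    lo, hi = np.minimum(A, A + dD), np.maximum(A, A + dD)
    A_n = A.copy()

    def score(candidate):
        dA = candidate - A
        dB = dA - dD                        # keeps constraint 1 satisfied by construction
        return objective(dA, dB) if constraints_ok(dA, dB) else np.inf

    while np.linalg.norm(S) >= threshold3:
        candidates = [np.clip(c, lo, hi) for c in (A_n, A_n + S, A_n - S)]
        A_n = min(candidates, key=score)    # best candidate under the objective and constraints
        S = S / 2.0

    dA = A_n - A
    return dA, dA - dD                      # dB = dA - dD
```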
Example two
The present embodiment provides an apparatus for estimating a pose of an object, which performs pose estimation on the same pose object by using a plurality of cameras, wherein the plurality of cameras can capture views of the pose object from different perspectives, and regions of the pose object to be estimated captured by two associated cameras in the plurality of cameras have overlapping portions, as shown in fig. 4, the apparatus comprising:
a camera pose difference determining module 41, configured to determine a camera pose difference between two cameras in the system of the multiple cameras and the pose object to be estimated, where the camera pose difference is a fixed difference between pose results obtained for the same pose object to be estimated from different camera visual angles, and is equivalent to a rotation angle from one camera to the other camera along three orthogonal directions in a world coordinate system, and regions of the pose object to be estimated, which are captured by the two cameras, respectively, have overlapping portions;
an initial estimation module 42, configured to perform pose estimation on the views captured by the multiple cameras, respectively, to obtain multiple initial pose estimation results;
an optimization solving module 43, configured to select at least two initial pose estimation results that can be fused from the multiple initial pose estimation results according to the camera pose difference, where a difference distance between a pose difference between the two initial pose estimation results and a camera pose difference of a corresponding camera is smaller than a preset first threshold; establishing an optimization objective function and a constraint condition thereof by using the at least two initial attitude estimation results, and solving to obtain a correction increment of each initial attitude estimation result, wherein the optimization objective function is to minimize an average back projection error of all feature points in the view corresponding to the corrected at least two initial attitude estimation results, the constraint condition is that the average back projection error of the feature points in each view corresponding to the corrected at least two initial attitude estimation results is smaller than a preset second threshold, a difference value between each corrected initial attitude estimation result and another corrected initial attitude estimation result in the at least two initial attitude estimation results is equal to a camera attitude difference of each corresponding pair of cameras, and the corrected initial attitude estimation result is the initial attitude estimation result plus the correction increment;
a calculation module 44, configured to calculate a target attitude estimation result, where the target attitude estimation result is a sum of any one of the at least two initial attitude estimation results and a correction increment thereof;
and the conversion module 45 is used for converting the target posture result into a world coordinate system of the estimated posture object after the target posture estimation result is obtained through calculation.
In the present embodiment, views of the posture-estimated object are taken from different perspectives by a plurality of cameras, and the posture-estimated object regions taken by two mutually associated cameras among the plurality of cameras have overlapping portions, so the views taken by the plurality of cameras are related to one another. Since the confidence and accuracy of the pose estimation result of any single view leave room for improvement, the invention uses the relevance among the multiple views to improve both. First, the pose estimation results of the views taken by the corresponding cameras are screened according to the camera pose differences between the cameras, and the view results that can be fused are selected from them. An optimization scheme is then established from these fusable initial pose estimation results and the corresponding camera pose differences; it balances the error of the initial pose estimation result of each view, the average projection error of each view and the error with respect to the camera pose differences, and thereby corrects the initial pose estimation result of every view participating in the fusion. In this way several related, not fully independent events are corrected into concurrent events, and it follows from probability theory that the accuracy and confidence of the pose estimation result are improved.
Because the target posture estimation results under different views are the postures of the estimated posture object under the camera coordinate system corresponding to the view, the target posture estimation results need to be converted into the estimated posture object under the world coordinate system and then output after being calculated.
Further, as shown in fig. 5, the camera pose difference determination module 41 includes:
a position calibration unit 411, configured to establish a world coordinate system with the estimated posture object as a center, perform position calibration on the first camera and the second camera respectively to obtain respective external parameters, and obtain a rotation matrix R_L corresponding to the first camera and a rotation matrix R_R corresponding to the second camera, wherein the rotation matrix is the rotation transformation from a camera coordinate system to the world coordinate system;
a calculation unit 412, configured to calculate a rotation matrix R_LoR from the first camera to the second camera by using the rotation matrix R_L and the rotation matrix R_R, wherein R_LoR = R_L * R_R;
a decomposition unit 413, configured to decompose R_LoR, through Euler angle calculation, into rotation angles in the three directions of pitch angle, roll angle and yaw angle to obtain the camera attitude difference between the first camera and the second camera.
Further, the method for respectively performing pose estimation by using the views shot by the plurality of cameras is to perform pose estimation of the estimated object on the views shot by the cameras based on a method in which 3D model points correspond to 2D feature points;
when the at least two initial attitude estimation results are A and B, A is obtained by performing attitude estimation on the view captured by the first camera, B is obtained by performing attitude estimation on the view captured by the second camera, and the respective correction increments of A and B are ΔA and ΔB, as shown in fig. 6, the optimization solving module 43 includes:
a selection unit 431, configured to select at least two initial pose estimation results A and B that can be fused from the plurality of initial pose estimation results according to the camera pose difference;
an optimization objective function establishing unit 432 for establishing an optimization objective function
minimize over ΔA and ΔB:  (1/(2N)) * Σ_{i=1..N} [ FL(A+ΔA, i) + FR(B+ΔB, i) ]    (objective formula 1);
The constraint conditions are as follows:
A + ΔA + CONST = B + ΔB    (constraint 1);
(1/N) * Σ_{i=1..N} FL(A+ΔA, i) < V  and  (1/N) * Σ_{i=1..N} FR(B+ΔB, i) < V    (constraint 2);
FL(A+ΔA, i) = || fL(A+ΔA, MP_i) - P_i(L) ||,
FR(B+ΔB, i) = || fR(B+ΔB, MP_i) - P_i(R) ||;
wherein CONST is the camera pose difference between the first camera and the second camera, MP_i is the i-th point of the 3D model in the object coordinate system, P_i(L) is the 2D feature point corresponding to the 3D model point MP_i in the view taken by the first camera, P_i(R) is the 2D feature point corresponding to the 3D model point MP_i in the view taken by the second camera, fL(.) is the projection transformation formula from 3D model points to 2D feature points of the first camera, fR(.) is the projection transformation formula from 3D model points to 2D feature points of the second camera, N is the number of feature points, and V is the preset second threshold;
The above formulas are described for the case of two fusion objects; further, when there are n fusion objects, the summation term of objective formula 1 has n addend terms for each i, constraint condition 2 correspondingly comprises n constraint formulas, and if the fusion objects are associated with m corresponding camera pose differences, constraint condition 1 correspondingly comprises m constraint formulas.
And a solving unit 433, configured to solve the optimization objective function by using an enumeration method or an iteration method to obtain ΔA and ΔB.
Further, the solving unit 433 is specifically configured to:
calculating a total correction amount ΔD = B - A - CONST;
letting n = 0, S = ΔD, A_0 = A,
Entering an iterative process:
while (S >= preset third threshold) {
n++;
selecting, from the three candidate solutions {A_{n-1}, A_{n-1}+S, A_{n-1}-S}, the best solution under the optimization objective function and constraint conditions as A_n, wherein each of the 3 candidate solutions Ci satisfies Ci >= min{A, A+ΔD} and Ci < max{A, A+ΔD};
S=S/2;
}
S is the step length of each iteration;
if S < the preset third threshold, stopping iteration;
thereby obtaining ΔA = A_n - A and ΔB = ΔA - ΔD.
Example three
The present embodiment also provides an electronic device for object pose estimation, which performs pose estimation on the same pose estimation object by using a plurality of cameras, wherein the plurality of cameras can capture views of the pose estimation object from different perspectives, and regions of the pose estimation object captured by two associated cameras in the plurality of cameras have overlapping portions, as shown in fig. 7, and the electronic device 60 includes:
a processor 62; and
a memory 64, in which memory 64 computer program instructions are stored,
wherein the computer program instructions, when executed by the processor, cause the processor 62 to perform the steps of:
determining camera attitude differences between every two cameras in the system formed by the plurality of cameras and the attitude object to be estimated, wherein the camera attitude differences are fixed differences between attitude results obtained by different camera visual angles on the same attitude object to be estimated and are equal to rotation angles from one camera to the other camera along three orthogonal directions under a world coordinate system, and estimated attitude object areas shot by every two cameras respectively have overlapped parts;
respectively carrying out attitude estimation by using the views shot by the plurality of cameras to obtain a plurality of initial attitude estimation results;
selecting at least two initial attitude estimation results which can be fused from the initial attitude estimation results according to the camera attitude difference, wherein the difference distance between the attitude difference between the two initial attitude estimation results and the camera attitude difference of the corresponding camera is smaller than a preset first threshold value;
establishing an optimization objective function and a constraint condition thereof by using the at least two initial attitude estimation results, and solving to obtain an increment of each initial attitude estimation result, wherein the optimization objective function is to minimize an average back projection error of all feature points in the views corresponding to the corrected at least two initial attitude estimation results, the constraint condition is that the average back projection error of the feature points in each view corresponding to the corrected at least two initial attitude estimation results is smaller than a preset second threshold, a difference value between each corrected initial attitude estimation result and another corrected initial attitude estimation result in the at least two initial attitude estimation results is equal to a camera attitude difference of each corresponding pair of cameras, and the corrected initial attitude estimation result is the initial attitude estimation result plus a correction increment;
and calculating to obtain a target attitude estimation result, wherein the target attitude result is the sum of any one of the at least two initial attitude estimation results and the correction increment thereof, and converting the target attitude result into a world coordinate system of the estimated attitude object.
Further, as shown in fig. 7, the electronic device for object pose estimation further includes a network interface 61, an input device 63, a hard disk 65, and a display device 66.
The various interfaces and devices described above may be interconnected by a bus architecture. A bus architecture may be any architecture that may include any number of interconnected buses and bridges. Various circuits of one or more Central Processing Units (CPUs), represented in particular by processor 62, and one or more memories, represented by memory 64, are coupled together. The bus architecture may also connect various other circuits such as peripherals, voltage regulators, power management circuits, and the like. It will be appreciated that a bus architecture is used to enable communications among the components. The bus architecture includes a power bus, a control bus, and a status signal bus, in addition to a data bus, all of which are well known in the art and therefore will not be described in detail herein.
The network interface 61 may be connected to a network (e.g., the internet, a local area network, etc.), and may obtain relevant data from the network and store the relevant data in the hard disk 65.
The input device 63 may receive various commands input by an operator and send the commands to the processor 62 for execution. The input device 63 may include a keyboard or a pointing device (e.g., a mouse, a trackball, a touch pad, a touch screen, or the like).
The display device 66 may display the results of the instructions executed by the processor 62.
The memory 64 is used for storing programs and data necessary for operating the operating system, and data such as intermediate results in the calculation process of the processor 62.
It will be appreciated that the memory 64 in embodiments of the invention may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be a Read Only Memory (ROM), a Programmable Read Only Memory (PROM), an Erasable Programmable Read Only Memory (EPROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), or a flash memory. Volatile memory can be Random Access Memory (RAM), which acts as external cache memory. The memory 64 of the apparatus and methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
In some embodiments, memory 64 stores elements, executable modules or data structures, or a subset thereof, or an expanded set thereof as follows: an operating system 641 and application programs 642.
The operating system 641 includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, for implementing various basic services and processing hardware-based tasks. The application program 642 includes various application programs, such as a Browser (Browser), and is used for implementing various application services. A program implementing the method of an embodiment of the present invention may be included in the application program 642.
The processor 62 may determine, when the application program and the data stored in the memory 64 are called and executed, specifically, the program or the instruction stored in the application program 642, a camera pose difference between two cameras in the system of the multiple cameras and the object with the pose to be estimated, where the camera pose difference is a fixed difference between pose results obtained by different camera visual angles for the same object with the pose to be estimated, and is equivalent to a rotation angle from one camera to another camera along three orthogonal directions in a world coordinate system, and the regions of the object with the pose to be estimated, which are respectively captured by the two cameras, have overlapping portions; respectively carrying out attitude estimation by using the views shot by the plurality of cameras to obtain a plurality of initial attitude estimation results; selecting at least two initial attitude estimation results which can be fused from the initial attitude estimation results according to the camera attitude difference, wherein the difference distance between the attitude difference between the two initial attitude estimation results and the camera attitude difference of the corresponding camera is smaller than a preset first threshold value; establishing an optimization objective function and a constraint condition thereof by using the at least two initial attitude estimation results, and solving to obtain an increment of each initial attitude estimation result, wherein the optimization objective function is to minimize an average back projection error of all feature points in the views corresponding to the corrected at least two initial attitude estimation results, the constraint condition is that the average back projection error of the feature points in each view corresponding to the corrected at least two initial attitude estimation results is smaller than a preset second threshold, a difference value between each corrected initial attitude estimation result and another corrected initial attitude estimation result in the at least two initial attitude estimation results is equal to a camera attitude difference of each corresponding pair of cameras, and the corrected initial attitude estimation result is the initial attitude estimation result plus a correction increment; and calculating to obtain a target attitude estimation result, wherein the target attitude result is the sum of any one of the at least two initial attitude estimation results and the correction increment thereof, and converting the target attitude result into a world coordinate system of the estimated attitude object.
The method disclosed by the above embodiment of the present invention can be applied to the processor 62, or implemented by the processor 62. The processor 62 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by instructions in the form of hardware, integrated logic circuits, or software in the processor 62. The processor 62 may be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), an off-the-shelf programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present invention. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present invention may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in ram, flash memory, rom, prom, or eprom, registers, etc. storage media as is well known in the art. The storage medium is located in the memory 64, and the processor 62 reads the information in the memory 64 and performs the steps of the above method in combination with the hardware thereof.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or any combination thereof. For a hardware implementation, the processing units may be implemented within one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), general purpose processors, controllers, micro-controllers, microprocessors, other electronic units designed to perform the functions described herein, or a combination thereof.
For a software implementation, the techniques described herein may be implemented with modules (e.g., procedures, functions, and so on) that perform the functions described herein. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
Specifically, the processor 62 establishes a world coordinate system centered on the object whose attitude is estimated, performs position calibration on the first camera and the second camera respectively to obtain their external parameters, and thereby obtains a rotation matrix R_L corresponding to the first camera and a rotation matrix R_R corresponding to the second camera, where a rotation matrix is the rotation transformation from a camera coordinate system to the world coordinate system; calculates a rotation matrix R_LoR from the first camera to the second camera using R_L and R_R, where R_LoR = R_L * R_R; and decomposes R_LoR into a pitch angle, a roll angle and a yaw angle through Euler angle calculation to obtain the camera attitude difference between the first camera and the second camera.
Specifically, the processor 62 performs pose estimation on the views shot by the cameras based on a method of corresponding 3D model points to 2D feature points;
when the at least two initial attitude estimation results are A and B, where A is obtained by performing attitude estimation on the view shot by the first camera, B is obtained by performing attitude estimation on the view shot by the second camera, and the respective correction increments of A and B are ΔA and ΔB, establishing an optimization objective function and its constraint conditions using the at least two initial attitude estimation results and solving them to obtain the correction increment of each initial attitude estimation result includes:
establishing an optimization objective function
min_{ΔA, ΔB} (1/(2N)) * Σ_{i=1..N} [ F_L(A+ΔA, i) + F_R(B+ΔB, i) ]    (objective formula 1, N being the number of feature points)
The constraint conditions are as follows:
A + ΔA + CONST = B + ΔB    (constraint 1);
(1/N) * Σ_{i=1..N} F_L(A+ΔA, i) < V and (1/N) * Σ_{i=1..N} F_R(B+ΔB, i) < V    (constraint condition formula 2);
F_L(A+ΔA, i) = ||f_L(A+ΔA, MP_i) - P_i(L)||,
F_R(B+ΔB, i) = ||f_R(B+ΔB, MP_i) - P_i(R)||;
where CONST is the camera pose difference between the first camera and the second camera, MP_i is the i-th point of the 3D model in the object coordinate system, P_i(L) is the 2D feature point in the view shot by the first camera corresponding to the 3D model point MP_i, P_i(R) is the 2D feature point in the view shot by the second camera corresponding to the 3D model point MP_i, f_L(.) is the projection transformation formula from 3D model points to 2D feature points of the first camera, f_R(.) is the projection transformation formula from 3D model points to 2D feature points of the second camera, and V is the preset second threshold. The formulas above are written for the case of two fusion objects; more generally, when there are n fusion objects, the cumulative summation of objective formula 1 has n summation terms for each i, constraint condition formula 2 has n corresponding constraint formulas, and if the fusion objects are associated with m camera attitude differences, constraint 1 has m corresponding constraint formulas.
The optimization objective function is then solved by an enumeration method or an iterative method to obtain ΔA and ΔB.
In the present embodiment, views of the object whose posture is estimated are taken from different perspectives by a plurality of cameras, and the object regions captured by any two associated cameras among the plurality of cameras have overlapping portions, so there is a correlation between the views captured by the cameras. Because the confidence and accuracy of the pose estimation result of each single view need to be improved, the invention exploits the correlation among the multiple views to improve the confidence and accuracy of the pose estimation result. First, the pose estimation results of the views captured by the corresponding cameras are screened according to the camera pose differences between the cameras, and the view results that can be fused are selected. Then an optimization scheme is established from the initial pose estimation results that can be fused and the corresponding camera pose differences, balancing the error of the initial pose estimation result of each view, the average projection error of the views, and the error relative to the camera pose differences, so that the initial pose estimation result of each view participating in the fusion is well corrected. In this way, several correlated, not fully independent events are corrected into concurrent events, and probability theory shows that the accuracy and confidence of the pose estimation result are improved.
Example four
The method for estimating the object pose of the present invention is further described below by taking the pose-estimated object as the head of the user as an example. As shown in fig. 8, the method for estimating the object posture of the present embodiment includes the following steps:
step 801: determining a camera pose difference between the two cameras;
Generally, as shown in FIG. 9, the head pose of the user contains three degrees of freedom, which can be represented by the three orthogonal orientation angles Pitch, Roll and Yaw and correspond to a 3x3 rotation matrix. Object pose estimation based on a 3D model is to find the appropriate rotation matrix and translation transformation of the object from the 2D views taken by the camera.
Since the head of a person can be approximately regarded as a cylinder or a sphere, when the head rotates to a large inclination angle, the accuracy of the pose estimation result obtained from a single camera view of the user's head decreases because of the limited field of view of a single camera. To obtain a larger field of view, and in particular to track the continuous head pose variations of a driver in the cockpit, the present embodiment may deploy multiple cameras to observe the head orientation from different angles and better capture views of the user's head. One basic requirement of a multi-camera deployment is that the user head views taken from different viewing angles need to have pairwise overlapping face portions. Fig. 10 schematically shows head pose estimation using 2 cameras. In fig. 10, the 2 cameras are symmetrically distributed on the two front sides of the user's head and both face the user's face, forming a multi-view pose estimation system.
In a multi-view environment, each camera works independently and its pose estimation results have some randomness. When all cameras estimate the pose of the same object, they are correlated. The stable position relationship of the cameras can lead to certain relevance of head posture estimation results obtained under different view angles. The most direct representation of the correlation is that once the camera and object positions are determined, there is a stable angular difference in the attitude of the object observed from any two views, and the angular difference in the attitude of the same object observed from different views is referred to as the camera attitude difference in the present invention.
The camera attitude difference parameter is described below by taking the system shown in fig. 10 as an example. In fig. 10, the 2 cameras are symmetrically and horizontally distributed on the two front sides of the user's head, and in a top view the two cameras and the user's head form an included angle of 90 degrees. Thus, when the face is directed straight ahead, the left camera and the right camera give the user's head pose angles of -45° and 45°, respectively, in the Yaw direction. If the face then deflects to the left by 10°, the head pose angles given by the left and right cameras in the Yaw direction at the new moment are -35° and 55°, respectively. It can be seen that an angle difference of 90° is always maintained between the Yaw angles given by the left and right cameras; this 90° difference is the value of the camera attitude difference parameter in the Yaw direction in the present invention.
It should be noted that a camera pose difference only represents the pose difference that exists when 2 associated views describe the same object pose, i.e. the rotation angles in three degrees of freedom. In a multi-view system composed of n cameras, a camera pose difference exists between two cameras whenever the regions of the object to be estimated that they capture overlap, so the maximum number of camera pose differences in the multi-view system is n*(n-1)/2; for example, there is one camera pose difference between 2 cameras and there are 3 camera pose differences among 3 cameras.
After the positions of the cameras and the object are determined, the camera attitude difference can be calculated in advance through camera calibration. Specifically, the head pose of the user, i.e. the object pose, is equivalent to the external orientation of the camera in the world coordinate system of the object, i.e. it is determined by the rotation and translation transformation from the camera coordinate system to the world coordinate system. The camera extrinsic parameters obtained in camera calibration are exactly the rotation and translation matrices of the camera with respect to the world coordinate system. Therefore, a world coordinate system can be established with the user's head as the center, the position calibration of the left and right cameras can be completed to obtain a left-camera rotation matrix R_L and a right-camera rotation matrix R_R, and a rotation matrix R_LoR for transforming from the left camera to the right camera can then be calculated from R_L and R_R (R_LoR = R_L * R_R); finally, R_LoR can be decomposed into Pitch, Roll and Yaw angles through Euler angle calculation, giving the camera attitude difference parameters CONST(Pitch, Roll, Yaw) of the two cameras.
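For illustration only (this is a minimal sketch, not the claimed implementation; the use of NumPy and SciPy, and the "xyz" Euler-angle convention, are assumptions of the sketch), the camera attitude difference could be computed from the calibrated rotation matrices using the relation R_LoR = R_L * R_R given above:

import numpy as np
from scipy.spatial.transform import Rotation

def camera_attitude_difference(R_L, R_R):
    # R_L, R_R: rotation matrices of the left and right cameras obtained from
    # position calibration (rotation from camera coordinates to world coordinates)
    R_LoR = R_L @ R_R                      # composition as stated in the text above
    # Decompose into Euler angles; the axis order "xyz" is an assumption here
    pitch, roll, yaw = Rotation.from_matrix(R_LoR).as_euler("xyz", degrees=True)
    return pitch, roll, yaw                # camera attitude difference CONST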
Step 802: independently performing head posture estimation based on the view shot by each camera to obtain an initial head posture estimation result;
Under the multi-view system, each camera works independently and performs head pose estimation based on the view it captures to obtain an initial head pose estimation result. Each camera in this embodiment may employ a model-based approach to estimate the head pose from a single view. The model-based pose estimation method mainly estimates the pose of an object by applying the geometric correspondence between 3D model points and 2D image points. In a specific implementation, a series of face feature points can be obtained by using an ASM/AAM (Active Shape Model / Active Appearance Model) face alignment method; the more effective face feature points are generally distributed near the face contour, the eyebrows, the eyes, the nose and the mouth. Fig. 11 shows the distribution of 77 facial feature points on a face; some or all of these feature points can then be selected, and the face pose corresponding to the feature points is calculated by using the model-based pose estimation method POSIT (Pose from Orthography and Scaling with ITerations). POSIT can estimate the pose of an object from a single view and requires at least four non-coplanar points to complete the mapping of the 3D model to the 2D view, where the 3D model can be obtained by scanning a head model with a 3D scanner.
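As an illustrative sketch only (OpenCV's solvePnP is used here as a stand-in for the POSIT routine described above, and the feature detection step is assumed to have already produced matched 3D model points and 2D feature points), a single-view, model-based pose estimate might be obtained as follows:

import numpy as np
import cv2

def estimate_pose_single_view(model_points_3d, feature_points_2d, fx, fy, x0, y0):
    # model_points_3d: Nx3 3D model points (e.g. from a scanned head model), N >= 4
    # feature_points_2d: Nx2 detected facial feature points in the 2D view
    camera_matrix = np.array([[fx, 0.0, x0],
                              [0.0, fy, y0],
                              [0.0, 0.0, 1.0]])
    dist_coeffs = np.zeros(4)              # lens distortion is ignored in this sketch
    ok, rvec, tvec = cv2.solvePnP(np.asarray(model_points_3d, dtype=np.float64),
                                  np.asarray(feature_points_2d, dtype=np.float64),
                                  camera_matrix, dist_coeffs)
    R, _ = cv2.Rodrigues(rvec)             # rotation matrix of the pose P = [R | T]
    return R, tvec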
The accuracy of the POSIT pose estimation depends on how well the 3D model describes the current face and on the accuracy of the face feature point localization, and it is often influenced by the face pose angle. As the head rotation angle increases, the accuracy and confidence of the corresponding pose estimation result gradually decrease. In this sense, the pose estimation result under a single view can be regarded as a sample value drawn from a random distribution associated with the true pose of the object; different true poses correspond to different distributions, and the difference in distribution also implies a difference in confidence.
Step 803: judging whether an initial head posture estimation result which can be fused exists, if so, turning to a step 804, and if not, turning to a step 806, and directly utilizing a view shot by the existing camera to calculate to obtain a final head posture estimation result;
The accuracy and confidence of the attitude estimation result are enhanced by adopting multiple views, and the core of the method is to fuse the independent attitude estimation results of the multiple views. However, not every view provides a positive contribution to the improvement in accuracy: due to the influence of random factors, the attitude estimation results of some views may be wrong, so an appropriate fusion object must be selected before performing the fusion of multi-view attitude estimation.
The criterion for selecting a fusion object is that the distance between the attitude difference of two initial attitude estimation results and the camera attitude difference of the corresponding pair of cameras is smaller than the preset first threshold. The camera pose difference in the present invention represents a stable pose difference constraint between 2 views. If the difference of the pose estimation results of the current 2 views deviates greatly from the pre-computed camera pose difference, at least one of the 2 associated pose estimation results must be wrong. Otherwise, the pose estimation results of the 2 associated views can be selected as fusion objects.
Since the attitude is a three-dimensional vector with three degrees of freedom, the attitude estimation results of the 2 views can be directly subtracted component by component when calculating the attitude difference, and the Euclidean distance is used when comparing it with the camera attitude difference parameter. For example, if the pose estimation result obtained from view 1 is A and the pose estimation result obtained from view 2 is B, the current pose difference of these 2 views is Pd = B - A; with the corresponding camera pose difference denoted Pc, the pose estimation results A and B may be selected as fusion objects when the Euclidean distance ||Pd - Pc|| < threshold T (i.e. the preset first threshold).
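A minimal sketch of this selection test (the function name and the vector representation of the poses are assumptions of the sketch; the threshold corresponds to the preset first threshold T above):

import numpy as np

def can_fuse(A, B, Pc, first_threshold):
    # A, B: initial pose estimates (three rotation angles) from two associated views
    # Pc: pre-computed camera pose difference between the corresponding cameras
    Pd = np.asarray(B) - np.asarray(A)     # pose difference of the two estimates
    return np.linalg.norm(Pd - np.asarray(Pc)) < first_threshold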
Step 804: constructing a fusion optimization scheme by using the fusion initial head attitude estimation result, wherein the fusion optimization scheme comprises an optimization objective function and constraint conditions thereof;
in a multi-view system, if a plurality of fusion objects exist, the accuracy and the confidence of the system attitude estimation can be improved based on the fusion of an optimization scheme.
The core idea of the optimization objective function is to find an optimal self-adjustment for each fusion object so that, after adjustment, the fusion objects satisfy the associated camera pose difference constraint, without destroying the basis on which the pose is currently estimated from each view. This is because the pose estimation result from each view can be viewed as a sample from a particular distribution; if the sample values are fine-tuned so that 2 independent events become concurrent events once the camera pose difference is satisfied, without changing the distribution from which each sample comes (i.e. adhering to the constraints under which the pose is estimated from the view), then the confidence of the system is improved.
In the model-based pose estimation method, the pose of the object is estimated from the 2D view, and the pose estimation result is usually verified by 3D projective transformation, which can be defined as follows:
A series of rotations around the Pitch, Roll and Yaw directions can be represented as a 3x3 rotation matrix R, and the 3D object pose can be strictly described by the orientation rotation matrix R and the position T (a 3D translation vector) relative to the world coordinates, so that the pose P = [R | T] is a 3x4 matrix. Given a point (X, Y, Z) on the 3D object model in the object coordinate system, the position of its corresponding projection point (x, y) in the 2D view is defined as follows:
(x, y)^T = (x0 + fx*Xc/Zc, y0 + fy*Yc/Zc)^T    (1)
where (Xc, Yc, Zc)^T = [R | T] * (X, Y, Z, 1)^T.
Here (x0, y0) are the coordinates of the 2D view center point and (fx, fy) are the focal length parameters of the camera in the "x" and "y" directions.
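A minimal sketch of this projection (assuming the pinhole model of formula (1) with no lens distortion; the function name is an assumption of the sketch):

import numpy as np

def project_point(R, T, X, fx, fy, x0, y0):
    # R: 3x3 rotation matrix, T: 3D translation vector, X: 3D model point (X, Y, Z)
    Xc, Yc, Zc = R @ np.asarray(X) + np.asarray(T)   # point in camera coordinates
    return np.array([x0 + fx * Xc / Zc,
                     y0 + fy * Yc / Zc])             # formula (1)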
An adjustment of the rotation angles of the pose is reflected in the rotation matrix, which is then related to the feature points in the view through the 3D projection transformation. The distance between the 2D point obtained by projecting a 3D model point into a 2D view and the corresponding feature point detected in the 2D view is called the 3D projection error. The present embodiment verifies the correctness of the pose estimation result through the 3D projection error, that is, the projection error is used as a constraint condition of the pose estimation. Fig. 11 shows the projection points obtained after 5 3D model points are adjusted by 5° in the Yaw direction (the projection points are marked with larger white circles).
The pose estimation result for each view should be corrected subject to both the camera pose difference constraint and the view pose estimation constraints, with the optimization goal of minimizing the multi-view average projection error while keeping the projection error of each view below a preset second threshold. Taking two cameras symmetrically distributed on the two front sides of the user's head in the multi-view system as an example, the optimization scheme can be described by the following mathematical formulas:
min_{ΔA, ΔB} (1/(2N)) * Σ_{i=1..N} [ F_L(A+ΔA, i) + F_R(B+ΔB, i) ]    (N being the number of feature points)
the constraint conditions are as follows:
A+ΔA+CONST=B+ΔB
(1/N) * Σ_{i=1..N} F_L(A+ΔA, i) < V and (1/N) * Σ_{i=1..N} F_R(B+ΔB, i) < V
where F_L(.) / F_R(.) is the projection error calculation function of the left / right camera;
F_L(A+ΔA, i) = ||f_L(A+ΔA, MP_i) - P_i(L)||;
F_R(B+ΔB, i) = ||f_R(B+ΔB, MP_i) - P_i(R)||;
where CONST is the camera pose difference between the first camera and the second camera, MP_i is the i-th point of the 3D model in the object coordinate system, P_i(L) is the feature point in the view taken by the first camera corresponding to the 3D model point MP_i, P_i(R) is the feature point in the view taken by the second camera corresponding to the 3D model point MP_i, f_L(.) is the projection transformation formula from 3D model points to 2D feature points of the first camera, f_R(.) is the projection transformation formula from 3D model points to 2D feature points of the second camera, both defined with reference to formula (1); and V is the preset second threshold.
After independent pose estimation from a single view, the pose estimation results A(Yaw, Pitch, Roll) and B(Yaw, Pitch, Roll) of the object at the moment, expressed in the respective camera coordinates, are obtained from the views taken by the left and right cameras, and the camera pose difference CONST(Yaw, Pitch, Roll) between the 2 cameras is calculated in advance once the camera and object positions are determined. The optimization problem is to calculate the respective correction increments ΔA and ΔB of the pose estimation results A and B that satisfy the camera pose difference constraint A + ΔA + CONST = B + ΔB, and the optimization aim is to minimize the average projection error of all the feature points in the views of the left and right cameras
(1/(2N)) * Σ_{i=1..N} [ F_L(A+ΔA, i) + F_R(B+ΔB, i) ]
while keeping the projection error of each view smaller than the preset second threshold V.
The optimization objective function is to find an optimal correction that balances the various errors, including the structural error of the pose estimation of each view, the average projection error, and the error relative to the camera pose difference. It can be expected that the correction amount of each view is different. Fig. 12 is a projection diagram of a 3D sphere with an equal-rotation-pitch grid onto a 2D plane; each intersection point in the figure can be regarded as a 3D projection point, and the figure schematically illustrates the self-adjustment strategy of each fused view pose under the optimization goal. Even if the 3D sphere with the equal-rotation-pitch grid is rotated by the same angle, the projection errors after rotation differ with the different postures of the sphere (the projection error is largest at the center of the front, gradually decreases towards the left, right, upper and lower sides, and is smallest at the edge), which is similar to the head model. As can be further understood from fig. 12 and the optimization objective, of these 2 fusion objects a relatively large correction amount is preferentially given to the object with a large deflection angle (corresponding to the edge of the 3D sphere) and a smaller correction amount is given to the object with a smaller deflection angle (corresponding to the center position right in front of the 3D sphere), so as to minimize the overall average projection error without changing the total correction amount.
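To make the objective concrete, a sketch of the average back-projection error that the optimization minimizes is given below (the per-view "project" callbacks, which are assumed to implement formula (1) for each camera, and the vector representation of the poses are assumptions of the sketch):

import numpy as np

def view_projection_error(corrected_pose, model_points, feature_points, project):
    # corrected_pose: e.g. A + dA; project(pose, MP_i) maps a 3D model point into the view
    errors = [np.linalg.norm(project(corrected_pose, mp) - np.asarray(p))
              for mp, p in zip(model_points, feature_points)]
    return float(np.mean(errors))          # (1/N) * sum_i F(pose, i)

def objective(dA, dB, A, B, left_view, right_view):
    # left_view / right_view: (model_points, feature_points, project) triples per camera
    return 0.5 * (view_projection_error(A + dA, *left_view)
                  + view_projection_error(B + dB, *right_view))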
Step 805: solving the optimized objective function;
The simplest solving method is to discretize the total correction amount and enumerate all combinations of ΔA and ΔB that satisfy the constraint conditions; since the optimization only involves matrix operations and no view processing, this enumeration does not consume much computation.
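A minimal sketch of such an enumeration for a single rotation component is shown below (the search range, step size, and the "objective"/"constraints_ok" callbacks are assumptions standing in for the objective and constraint checks described above; "objective(dA, dB)" is assumed to evaluate the average back-projection error, for instance the function from the previous sketch with the per-view data already bound); ΔB is tied to ΔA through the constraint A + ΔA + CONST = B + ΔB:

import numpy as np

def enumerate_solution(A, B, CONST, objective, constraints_ok, step=0.5):
    dD = B - A - CONST                     # total correction amount
    best, best_cost = None, float("inf")
    for dA in np.arange(min(0.0, dD), max(0.0, dD) + step, step):
        dB = dA - dD                       # keeps A + dA + CONST == B + dB
        if not constraints_ok(dA, dB):
            continue
        cost = objective(dA, dB)
        if cost < best_cost:
            best, best_cost = (dA, dB), cost
    return best                            # None if no feasible combination was found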
Besides the enumeration method, an iterative method that searches for the correction amounts may be used. Taking the calculation of the respective adjustment amounts of the 2 cameras in the Yaw direction as an example, this iterative search can be expressed as the following process. Since rotation in the Yaw direction only affects the X coordinate of the projection, the projection error can be simplified to the difference in the X direction between the feature point and the projection point as the measure for evaluating the optimization target.
First, calculate the total correction amount ΔD = B - A - CONST;
Let n = 0, S = ΔD, A_0 = A,
Entering an iterative process:
while (S >= preset third threshold) {
n++;
From the three candidate solutions {A_(n-1), A_(n-1)+S, A_(n-1)-S}, select the best solution based on the optimization objective function and the constraint conditions as A_n, where each of the 3 candidate solutions Ci satisfies the condition Ci >= min{A, A+ΔD} and Ci < max{A, A+ΔD};
S=S/2;
}
S is the step length of each iteration;
if S < the preset third threshold, stopping iteration;
In this case, ΔA = A_n - A and ΔB = ΔA - ΔD are obtained.
In the above process, A and B are the object poses estimated from the views taken by the left camera and the right camera respectively. Each candidate solution in the iterative process represents a new pose estimation result obtained by correcting the pose estimation result A, and the corresponding corrected pose estimation result for B can be calculated from the camera pose difference associated with that candidate. Therefore, for each candidate solution the average projection error of the two cameras at that moment can be calculated; the solution with the minimum average projection error among the three candidates is selected as the better solution of the current iteration, and the next iteration is performed based on the selected solution. Since the iteration continues until the step size becomes smaller than the preset third threshold, the third threshold also represents the accuracy of the attitude estimation. The correction amounts of the rotation angles in the other two directions can be calculated with a similar scheme.
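A sketch of this iterative search for a single rotation component (the "objective(dA, dB)" callback and the stopping threshold are assumptions, e.g. the objective from the earlier sketch with the per-view data bound; the candidate bounds follow the condition given above, relaxed to be inclusive at the upper end):

def iterative_search(A, B, CONST, objective, third_threshold):
    # Searches for the correction dA of pose estimate A; dB then follows from the
    # camera pose difference constraint, dB = dA - dD.
    dD = B - A - CONST                     # total correction amount
    lo, hi = min(A, A + dD), max(A, A + dD)
    S = abs(dD)                            # step length of each iteration
    An = A
    while S >= third_threshold:
        candidates = [c for c in (An, An + S, An - S) if lo <= c <= hi]
        # keep the candidate giving the smallest average projection error
        An = min(candidates, key=lambda c: objective(c - A, (c - A) - dD))
        S = S / 2.0
    dA = An - A
    dB = dA - dD
    return dA, dB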
Step 806: and calculating to obtain a final head posture estimation result.
After ΔA and ΔB are obtained, the final head pose estimation result can be calculated: it is the sum of A and ΔA, or of B and ΔB.
The new attitude estimation results corrected by ΔA and ΔB satisfy the camera attitude difference constraint, so each of them independently yields the same attitude result relative to the position right in front of the object; 2 independent events thereby become concurrent events, and according to probability theory the confidence and precision of the system are improved.
Since the head pose estimation result calculated from an image is expressed in the camera coordinate system, the final pose estimation result needs to be converted from the camera coordinate system to the world coordinate system before being output, which is a known technique. In general, the world coordinate system is centered on the object (the user's head) and faces the forward direction.
In the present embodiment, two cameras are used to capture views of the user's head from different perspectives, and the head regions captured by the two cameras have overlapping portions, so there is a correlation between the views captured by the two cameras. Because the confidence and accuracy of the pose estimation result of each single view need to be improved, the invention exploits the correlation between the two views to improve the confidence and accuracy of the pose estimation result. First, the pose estimation results of the views captured by the two cameras are screened according to the camera pose difference between the cameras, and the view results that can be fused are selected. Then an optimization scheme is established from the initial pose estimation results that can be fused and the corresponding camera pose difference, balancing the error of the initial pose estimation result of each view, the average projection error of all fused views, and the error relative to the camera pose difference, so that the initial pose estimation result of each fused view is well corrected. In this way, several correlated, not fully independent events are corrected into concurrent events, and probability theory shows that the accuracy and confidence of the head pose estimation are improved.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (9)

1. A method of object pose estimation, wherein a same pose estimation object is pose estimated using a plurality of cameras capable of capturing views of the pose estimation object from different perspectives, and wherein estimated pose object regions captured by two associated cameras of the plurality of cameras have overlapping portions, the method comprising:
determining camera attitude differences between every two cameras in the system formed by the plurality of cameras and the attitude object to be estimated, wherein the camera attitude differences are fixed differences between attitude results obtained by different camera visual angles on the same attitude object to be estimated and are equal to rotation angles from one camera to the other camera along three orthogonal directions under a world coordinate system, and estimated attitude object areas shot by every two cameras respectively have overlapped parts;
respectively carrying out attitude estimation by using the views shot by the plurality of cameras to obtain a plurality of initial attitude estimation results;
selecting at least two initial attitude estimation results which can be fused from the initial attitude estimation results according to the camera attitude difference, wherein the difference distance between the attitude difference between the two initial attitude estimation results and the camera attitude difference of the corresponding camera is smaller than a preset first threshold value;
establishing an optimization objective function and a constraint condition thereof by using the at least two initial attitude estimation results, and solving to obtain a correction increment of each initial attitude estimation result, wherein the optimization objective function is to minimize an average back projection error of all feature points in the view corresponding to the corrected at least two initial attitude estimation results, the constraint condition is that the average back projection error of the feature points in each view corresponding to the corrected at least two initial attitude estimation results is smaller than a preset second threshold, a difference value between each corrected initial attitude estimation result and another corrected initial attitude estimation result in the at least two initial attitude estimation results is equal to a camera attitude difference of each corresponding pair of cameras, and the corrected initial attitude estimation result is the initial attitude estimation result plus the correction increment;
and calculating to obtain a target attitude estimation result, wherein the target attitude result is the sum of any one of the at least two initial attitude estimation results and the correction increment thereof, and converting the target attitude result into a world coordinate system of the estimated attitude object.
2. The method of object pose estimation according to claim 1, wherein any two cameras in the plurality of cameras are a first camera and a second camera, and wherein the determining the camera pose difference between two cameras in the plurality of cameras and the object composition system under estimation of pose comprises:
establishing a world coordinate system with the estimated attitude object as the center, respectively performing position calibration on the first camera and the second camera to obtain their respective external parameters, and obtaining a rotation matrix R_L corresponding to the first camera and a rotation matrix R_R corresponding to the second camera, the rotation matrix being the rotation transformation from a camera coordinate system to the world coordinate system;
calculating a rotation matrix R_LoR from the first camera to the second camera using the rotation matrix R_L and the rotation matrix R_R, wherein R_LoR = R_L * R_R;
decomposing R_LoR into a pitch angle, a roll angle and a yaw angle through Euler angle calculation to obtain the camera attitude difference between the first camera and the second camera.
3. The method of object pose estimation according to claim 1, wherein the method of pose estimation using the views taken by the plurality of cameras respectively is pose estimation of an estimated object for the views taken by the cameras based on a method of corresponding 3D models to 2D feature points;
when the at least two initial attitude estimation results are A and B, A is obtained by performing attitude estimation on the view shot by a first camera, B is obtained by performing attitude estimation on the view shot by a second camera, and the respective correction increments of A and B are ΔA and ΔB, establishing an optimization objective function and constraint conditions thereof by using the at least two initial attitude estimation results and solving to obtain the correction increment of each initial attitude estimation result includes:
establishing an optimization objective function
min_{ΔA, ΔB} (1/(2N)) * Σ_{i=1..N} [ F_L(A+ΔA, i) + F_R(B+ΔB, i) ], N being the number of feature points;
The constraint conditions are as follows:
A+ΔA+CONST=B+ΔB;
(1/N) * Σ_{i=1..N} F_L(A+ΔA, i) < V and (1/N) * Σ_{i=1..N} F_R(B+ΔB, i) < V;
F_L(A+ΔA, i) = ||f_L(A+ΔA, MP_i) - P_i(L)||,
F_R(B+ΔB, i) = ||f_R(B+ΔB, MP_i) - P_i(R)||;
wherein CONST is the camera pose difference between the first camera and the second camera, MP_i is the i-th point of the 3D model in the object coordinate system, P_i(L) is the 2D feature point in the view taken by the first camera corresponding to the 3D model point MP_i, P_i(R) is the 2D feature point in the view taken by the second camera corresponding to the 3D model point MP_i, f_L(.) is the projection transformation formula from 3D model points to 2D feature points of the first camera, f_R(.) is the projection transformation formula from 3D model points to 2D feature points of the second camera, and V is the preset second threshold;
and solving the optimization objective function by using an enumeration method or an iterative method to obtain ΔA and ΔB.
4. The method of object pose estimation according to claim 3, wherein said solving the optimized objective function using an iterative method comprises:
calculating the total correction amount ΔD = B - A - CONST;
letting n = 0, S = ΔD, A_0 = A,
Entering an iterative process:
while (S >= preset third threshold) {
n++;
From the three candidate solutions {A_(n-1), A_(n-1)+S, A_(n-1)-S}, select the best solution based on the optimization objective function and the constraint conditions as A_n, where each of the 3 candidate solutions Ci satisfies the condition Ci >= min{A, A+ΔD} and Ci < max{A, A+ΔD};
S=S/2;
}
S is the step length of each iteration;
if S < the preset third threshold, stopping iteration;
in this case, ΔA = A_n - A and ΔB = ΔA - ΔD are obtained.
5. An apparatus for object pose estimation, wherein the same pose estimation object is pose-estimated using a plurality of cameras, the plurality of cameras being capable of capturing views of the pose estimation object from different perspectives, and wherein estimated pose object regions captured by two associated cameras of the plurality of cameras have overlapping portions, the apparatus comprising:
the camera attitude difference determining module is used for determining the camera attitude difference between every two cameras in the system formed by the plurality of cameras and the estimated attitude object, the camera attitude difference is a fixed difference value between attitude results obtained by different camera visual angles on the same estimated attitude object, the fixed difference value is equal to a rotation angle from one camera to the other camera along three orthogonal directions under a world coordinate system, and the estimated attitude object areas shot by every two cameras respectively have overlapped parts;
the initial estimation module is used for respectively carrying out attitude estimation by using the views shot by the cameras to obtain a plurality of initial attitude estimation results;
the optimization solving module is used for selecting at least two initial attitude estimation results which can be fused from the initial attitude estimation results according to the camera attitude difference, and the difference distance between the attitude difference between the two initial attitude estimation results and the camera attitude difference of the corresponding camera is smaller than a preset first threshold value; establishing an optimization objective function and a constraint condition thereof by using the at least two initial attitude estimation results, and solving to obtain a correction increment of each initial attitude estimation result, wherein the optimization objective function is to minimize an average back projection error of all feature points in the view corresponding to the corrected at least two initial attitude estimation results, the constraint condition is that the average back projection error of the feature points in each view corresponding to the corrected at least two initial attitude estimation results is smaller than a preset second threshold, a difference value between each corrected initial attitude estimation result and another corrected initial attitude estimation result in the at least two initial attitude estimation results is equal to a camera attitude difference of each corresponding pair of cameras, and the corrected initial attitude estimation result is the initial attitude estimation result plus the correction increment;
the calculation module is used for calculating to obtain a target attitude estimation result, wherein the target attitude result is the sum of any one of the at least two initial attitude estimation results and the correction increment thereof;
and the conversion module is used for converting the target attitude result into a world coordinate system of the estimated attitude object after the target attitude estimation result is obtained through calculation.
6. The apparatus of object pose estimation according to claim 5, wherein the camera pose difference determination module comprises:
a position calibration unit for establishing a world coordinate system with the estimated attitude object as the center, respectively performing position calibration on the first camera and the second camera to obtain their respective external parameters, and obtaining a rotation matrix R_L corresponding to the first camera and a rotation matrix R_R corresponding to the second camera, the rotation matrix being the rotation transformation from a camera coordinate system to the world coordinate system;
a calculation unit for calculating a rotation matrix R_LoR from the first camera to the second camera using the rotation matrix R_L and the rotation matrix R_R, wherein R_LoR = R_L * R_R;
a decomposition unit for decomposing R_LoR into a pitch angle, a roll angle and a yaw angle through Euler angle calculation to obtain the camera attitude difference between the first camera and the second camera.
7. The apparatus of claim 5, wherein the method of performing pose estimation by using the views taken by the plurality of cameras respectively is to perform pose estimation of an estimated object on the views taken by the cameras based on a method of corresponding a 3D model to a 2D feature point;
when the at least two initial attitude estimation results are A and B, A is obtained by performing attitude estimation on the view shot by a first camera, B is obtained by performing attitude estimation on the view shot by a second camera, and the respective correction increments of A and B are ΔA and ΔB, the optimization solving module includes:
a selection unit configured to select at least two initial posture estimation results a and B that can be fused from the plurality of initial posture estimation results according to the camera posture difference;
an optimization objective function establishing unit for establishing an optimization objective function
min_{ΔA, ΔB} (1/(2N)) * Σ_{i=1..N} [ F_L(A+ΔA, i) + F_R(B+ΔB, i) ], N being the number of feature points;
The constraint conditions are as follows:
A+ΔA+CONST=B+ΔB;
(1/N) * Σ_{i=1..N} F_L(A+ΔA, i) < V and (1/N) * Σ_{i=1..N} F_R(B+ΔB, i) < V;
F_L(A+ΔA, i) = ||f_L(A+ΔA, MP_i) - P_i(L)||,
F_R(B+ΔB, i) = ||f_R(B+ΔB, MP_i) - P_i(R)||;
wherein CONST is the camera pose difference between the first camera and the second camera, MP_i is the i-th point of the 3D model in the object coordinate system, P_i(L) is the 2D feature point in the view taken by the first camera corresponding to the 3D model point MP_i, P_i(R) is the 2D feature point in the view taken by the second camera corresponding to the 3D model point MP_i, f_L(.) is the projection transformation formula from 3D model points to 2D feature points of the first camera, f_R(.) is the projection transformation formula from 3D model points to 2D feature points of the second camera, and V is the preset second threshold;
and a solving unit for solving the optimization objective function by using an enumeration method or an iterative method to obtain ΔA and ΔB.
8. The apparatus for object pose estimation according to claim 7, wherein the solving unit is specifically configured to:
calculating the total correction amount ΔD = B - A - CONST;
letting n = 0, S = ΔD, A_0 = A,
Entering an iterative process:
while (S >= preset third threshold) {
n++;
From the three candidate solutions {A_(n-1), A_(n-1)+S, A_(n-1)-S}, select the best solution based on the optimization objective function and the constraint conditions as A_n, where each of the 3 candidate solutions Ci satisfies the condition Ci >= min{A, A+ΔD} and Ci < max{A, A+ΔD};
S=S/2;
}
S is the step length of each iteration;
if S < the preset third threshold, stopping iteration;
in this case, ΔA = A_n - A and ΔB = ΔA - ΔD are obtained.
9. An electronic device for object pose estimation, wherein the same pose estimation object is pose-estimated by a plurality of cameras, the plurality of cameras can take views of the pose estimation object from different perspectives, and the estimated pose object regions taken by two associated cameras in the plurality of cameras have overlapping portions, the electronic device comprising:
a processor; and
a memory having computer program instructions stored therein,
wherein the computer program instructions, when executed by the processor, cause the processor to perform the steps of:
determining camera attitude differences between every two cameras in the system formed by the plurality of cameras and the attitude object to be estimated, wherein the camera attitude differences are fixed differences between attitude results obtained by different camera visual angles on the same attitude object to be estimated and are equal to rotation angles from one camera to the other camera along three orthogonal directions under a world coordinate system, and estimated attitude object areas shot by every two cameras respectively have overlapped parts;
respectively carrying out attitude estimation by using the views shot by the plurality of cameras to obtain a plurality of initial attitude estimation results;
selecting at least two initial attitude estimation results which can be fused from the initial attitude estimation results according to the camera attitude difference, wherein the difference distance between the attitude difference between the two initial attitude estimation results and the camera attitude difference of the corresponding camera is smaller than a preset first threshold value;
establishing an optimization objective function and a constraint condition thereof by using the at least two initial attitude estimation results, and solving to obtain a correction increment of each initial attitude estimation result, wherein the optimization objective function is to minimize an average back projection error of all feature points in the view corresponding to the corrected at least two initial attitude estimation results, the constraint condition is that the average back projection error of the feature points in each view corresponding to the corrected at least two initial attitude estimation results is smaller than a preset second threshold, a difference value between each corrected initial attitude estimation result and another corrected initial attitude estimation result in the at least two initial attitude estimation results is equal to a camera attitude difference of each corresponding pair of cameras, and the corrected initial attitude estimation result is the initial attitude estimation result plus the correction increment;
and calculating to obtain a target attitude estimation result, wherein the target attitude result is the sum of any one of the at least two initial attitude estimation results and the correction increment thereof, and converting the target attitude result into a world coordinate system of the estimated attitude object.
CN201611130138.9A 2016-12-09 2016-12-09 Object posture estimation method and device and electronic equipment Active CN108447090B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611130138.9A CN108447090B (en) 2016-12-09 2016-12-09 Object posture estimation method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611130138.9A CN108447090B (en) 2016-12-09 2016-12-09 Object posture estimation method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN108447090A CN108447090A (en) 2018-08-24
CN108447090B true CN108447090B (en) 2021-12-21

Family

ID=63190423

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611130138.9A Active CN108447090B (en) 2016-12-09 2016-12-09 Object posture estimation method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN108447090B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109829997A (en) * 2018-12-19 2019-05-31 新大陆数字技术股份有限公司 Staff attendance method and system
CN109785322B (en) * 2019-01-31 2021-07-02 北京市商汤科技开发有限公司 Monocular human body posture estimation network training method, image processing method and device
CN110569719B (en) * 2019-07-30 2022-05-17 中国科学技术大学 Animal head posture estimation method and system
CN111310617B (en) * 2020-02-03 2023-07-14 杭州飞步科技有限公司 Distraction driving detection method, distraction driving detection device and storage medium
CN113514049A (en) * 2020-04-10 2021-10-19 北京三快在线科技有限公司 Unmanned aerial vehicle attitude measurement method and device, unmanned aerial vehicle and storage medium
CN113643356B (en) * 2020-04-27 2024-05-28 北京达佳互联信息技术有限公司 Camera pose determination method, virtual object display method, device and electronic equipment
CN111597987B (en) * 2020-05-15 2023-09-01 阿波罗智能技术(北京)有限公司 Method, apparatus, device and storage medium for generating information
CN112361959B (en) * 2020-11-06 2022-02-22 西安新拓三维光测科技有限公司 Method and system for correcting coordinate of coding point for measuring motion attitude of helicopter blade and computer-readable storage medium
CN116035563B (en) * 2023-03-27 2023-06-09 北京大学第三医院(北京大学第三临床医学院) Photographic measurement system for abnormal head position
CN116524572B (en) * 2023-05-16 2024-01-26 北京工业大学 Face accurate real-time positioning method based on self-adaptive Hope-Net

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0937963A2 (en) * 1998-02-18 1999-08-25 Fuji Jukogyo Kabushiki Kaisha Apparatus for detecting an altitude of a flying object
EP2153409A4 (en) * 2007-05-22 2013-04-24 Metaio Gmbh Camera pose estimation apparatus and method for augmented reality imaging
CN102156537A (en) * 2010-02-11 2011-08-17 三星电子株式会社 Equipment and method for detecting head posture
CN102506757A (en) * 2011-10-10 2012-06-20 南京航空航天大学 Self-positioning method of binocular stereo measuring system in multiple-visual angle measurement
CN103926999A (en) * 2013-01-16 2014-07-16 株式会社理光 Palm opening and closing gesture recognition method and device and man-machine interaction method and device
CN103177443A (en) * 2013-03-07 2013-06-26 中国电子科技集团公司第十四研究所 SAR (synthetic aperture radar) target attitude angle estimation method based on randomized hough transformations
CN103528571A (en) * 2013-10-12 2014-01-22 上海新跃仪表厂 Monocular stereo vision relative position/pose measuring method
CN105844624A (en) * 2016-03-18 2016-08-10 上海欧菲智能车联科技有限公司 Dynamic calibration system, and combined optimization method and combined optimization device in dynamic calibration system
CN105654502A (en) * 2016-03-30 2016-06-08 广州市盛光微电子有限公司 Panorama camera calibration device and method based on multiple lenses and multiple sensors

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Rafael Muñoz-Salinas et al. Multi-camera head pose estimation. Machine Vision and Applications. 2012, vol. 23, 479–490. *

Also Published As

Publication number Publication date
CN108447090A (en) 2018-08-24

Similar Documents

Publication Publication Date Title
CN108447090B (en) Object posture estimation method and device and electronic equipment
US8659660B2 (en) Calibration apparatus and calibration method
US8755562B2 (en) Estimation apparatus, control method thereof, and program
CN110631554A (en) Robot posture determining method and device, robot and readable storage medium
Gallagher Using vanishing points to correct camera rotation in images
US11881000B2 (en) System and method for simultaneous consideration of edges and normals in image features by a vision system
Tang et al. Camera self-calibration from tracking of moving persons
JP6515039B2 (en) Program, apparatus and method for calculating a normal vector of a planar object to be reflected in a continuous captured image
JP2009237848A (en) Information processor, image processing method and computer program
WO2012147027A1 (en) Face location detection
WO2019157922A1 (en) Image processing method and device and ar apparatus
CN109902675B (en) Object pose acquisition method and scene reconstruction method and device
Shi et al. Extrinsic calibration and odometry for camera-LiDAR systems
JP6922348B2 (en) Information processing equipment, methods, and programs
CN112446917B (en) Gesture determination method and device
CN111951158B (en) Unmanned aerial vehicle aerial image splicing interruption recovery method, device and storage medium
Barreto et al. Fitting conics to paracatadioptric projections of lines
Long et al. Monocular-vision-based relative pose estimation of noncooperative spacecraft using multicircular features
JP2532985B2 (en) Three-dimensional image evaluation device
JP2010231350A (en) Person identifying apparatus, its program, and its method
WO2022018811A1 (en) Three-dimensional posture of subject estimation device, three-dimensional posture estimation method, and program
Lee et al. Fast and accurate self-calibration using vanishing point detection in manmade environments
CN111445513A (en) Plant canopy volume obtaining method and device based on depth image, computer equipment and storage medium
JPWO2020153264A1 (en) Calibration method and calibration equipment
KR20160098020A (en) Rectification method for stereo image and apparatus thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant