CN117934593A - Target object pose estimation method, system and computer readable storage medium

Info

Publication number: CN117934593A
Application number: CN202410095782.5A
Authority: CN
Original language: Chinese (zh)
Applicant/Assignee: Xidian University
Inventors: Song Rui (宋锐), Zhou Zhiyi (周智毅), Wang Duo (王铎), Li Jiaojiao (李娇娇), Su Liechao (苏烈超), Wang Qingyuan (王庆元), Huang Huang (黄煌)
Filing/priority date: 2024-01-23
Publication date: 2024-04-26
Legal status: Pending

Abstract

The invention discloses a target object pose estimation method, a target object pose estimation system, and a computer readable storage medium. According to the invention, the view with the higher inlier rate and the view with the lower inlier rate are determined from RGB images acquired at two views; valid inliers are screened from the view with the lower inlier rate according to a predetermined method; the inliers of the view with the higher inlier rate are then re-projected into the view with the lower inlier rate and merged with the screened valid inliers to form an inlier set; and the pose is estimated from the inlier set. The invention fuses 2D point clusters using RGB images alone, exploiting and fusing the valid 2D information in the images of both views; it makes full use of the information in each RGB image that is useful for pose estimation while reducing hardware cost, and improves the accuracy of pose estimation.

Description

Target object pose estimation method, system and computer readable storage medium
Technical Field
The invention relates to the field of computer vision, and in particular to a target object pose estimation method based on dual-view fusion.
Background
6D pose estimation is a widely used technique in computer vision and robotics that recovers the position and orientation of an object in three-dimensional space by analyzing images. It is important for 3D reconstruction, augmented reality, autonomous navigation, and related fields. However, the accuracy of 6D pose estimation is often degraded by illumination changes, noise, occlusion, weak target surface texture, and real-time constraints. Since the result directly affects the system's understanding of the target's position and orientation, it has become a key problem in computer vision and robotics.
Current 6D pose estimation techniques fall broadly into three types: holistic methods, keypoint-based methods, and dense methods.
Holistic methods aim to estimate the 3D position and orientation of the object directly from a single image. Traditional approaches (e.g., "Comparing Images Using the Hausdorff Distance" by Huttenlocher D. P. et al.; "Discriminative Mixture-of-Templates for Viewpoint Classification" by Gu C. et al.) rely primarily on template matching, which is sensitive to clutter and changes in appearance. As the field developed, CNNs showed remarkable robustness to environmental changes, so PoseNet (see "PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization" by Kendall A. et al.) introduced a CNN architecture that directly regresses the 6D camera pose from a single RGB image; however, locating objects directly in 3D is difficult due to the lack of depth information. To overcome this, PoseCNN (see "PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes" by Xiang Y. et al.) localizes objects in the 2D image and predicts their depth, thereby obtaining the 3D position.
Rather than recovering the pose directly from the image, keypoint-based methods adopt a two-stage approach: first predict the 2D keypoint clusters of the object, then compute the pose from the 2D-3D correspondences with a PnP algorithm. Detecting 2D keypoints is considerably easier than estimating 3D position and rotation directly. For richly textured objects, traditional methods such as "Object Recognition from Local Scale-Invariant Features" by Lowe D. G. detect local keypoints robustly, so the object pose can be estimated efficiently and accurately even in cluttered scenes and under severe occlusion. However, traditional methods struggle with texture-less objects and low-resolution images; to address this, "A Scalable, Accurate, Robust to Partial Occlusion Method for Predicting the 3D Poses of Challenging Objects without Using Depth" by Rad M. et al. defines a fixed set of keypoints and uses a CNN as the keypoint detector. Another class of methods (e.g., "6-DoF Object Pose from Semantic Keypoints" by Pavlakos G. et al.) outputs a pixel-wise heat map of the keypoints to cope with occlusion.
In dense methods, each pixel or pixel patch produces a prediction of the desired output, and the final result is obtained by voting in a generalized Hough voting scheme. Some works use random forests to predict the 3D object coordinates of each pixel and use geometric constraints to generate 2D-3D correspondence hypotheses. However, these methods require RGB-D data. Compared with sparse keypoints, object coordinates provide dense 2D-3D correspondences for pose estimation and are more robust to occlusion, but the larger output space makes them harder to predict than keypoints.
To address the shortcomings of single-view 6D pose estimation, multi-view pose refinement techniques have appeared in the field. These techniques show two main trends. First, they rely on pose hypotheses predicted independently for each monocular image. Second, they assume that the relative camera transformations between frames are known in advance; the known transformations are used to fuse the pose predictions of multiple views in a global coordinate system. One then either selects the pose hypothesis that aligns best with the other hypotheses, or refines the poses so that they align better in 3D space. In "6DoF Pose Estimation of Textureless Objects from Multiple RGB Frames" by Kaskman R. et al., a set of uncalibrated RGB images is used to create a scale-ambiguous 3D reconstruction of the scene. Although this method performs well, it involves many complex and time-consuming steps and requires a large number of frames to obtain a reliable reconstruction. CosyPose (see "CosyPose: Consistent Multi-view Multi-object 6D Pose Estimation" by Labbé Y. et al.) also relies on independent pose hypotheses for each frame; these are matched across frames in a RANSAC scheme to produce a unified object-level scene reconstruction and approximate relative camera poses, which are then jointly optimized by minimizing the multi-view re-projection error. "Multi-view Self-supervised Deep Learning for 6D Pose Estimation in the Amazon Picking Challenge" by Zeng A. et al. proposes a multi-view 6D pose estimation method based on 15 to 18 RGB-D images: a fully convolutional neural network segments each RGB image into 2D object regions, and the segmentation results and depth images are then fused into a single segmented point cloud. All of the above methods apply a deep neural network independently to each view, whereas MV6D by Duffhauss F. et al. (see "MV6D: Multi-View 6D Pose Estimation on RGB-D Frames Using a Deep Point-wise Voting Network") accepts a variable number of RGB-D frames within a single large network, extracts the relevant features, fuses them into a joint feature representation of the whole input scene, and then performs semantic segmentation and 3D keypoint prediction for each object.
Although single-view 6D pose estimation has matured considerably, many problems remain in practical 6D pose estimation systems. The current difficulties include low-texture objects, occlusion, and illumination. Existing single-view techniques cannot fully overcome the depth ambiguity, scale ambiguity, occlusion, and illumination problems caused by weak texture. Existing multi-view fusion techniques, meanwhile, mostly adopt RGB-D sensors and fuse 3D point clouds to compensate for single-view limitations, which raises both equipment cost and computational cost; or they fuse only at the R, T level (on SE(3)); or they use multiple views merely to resolve depth ambiguity. None of these methods effectively fuses the 2D data acquired at each view.
Disclosure of Invention
The invention aims, in view of all or part of the above problems, to provide a target object pose estimation scheme based on multi-view fusion, so as to overcome the high equipment and computational cost of existing multi-view fusion, or its restriction to decision-level fusion, and to increase the contribution of 2D data to 6D pose estimation.
The technical scheme adopted by the invention is as follows:
A target object pose estimation method, comprising:
determining, from RGB images acquired at two views, the view with the higher inlier rate and the view with the lower inlier rate;
screening valid inliers from the view with the lower inlier rate according to a predetermined method;
fusing the inliers of the view with the higher inlier rate with the screened valid inliers to form an inlier set;
and estimating the pose from the inlier set.
Further, determining, from the RGB images acquired at the two views, the view with the higher inlier rate and the view with the lower inlier rate includes:
detecting a 2D keypoint cluster in the RGB image acquired at each view, the 2D keypoint cluster comprising N groups of 2D pixel points in one-to-one correspondence with the N 3D keypoints of the target object, N being a positive integer;
performing a RANSAC PnP operation on the 2D keypoint cluster of each view to obtain the inlier rate of each view;
and comparing the inlier rates of the views to determine the view view_high with the higher inlier rate and the view view_low with the lower inlier rate.
Further, performing a RANSAC PnP operation on the 2D keypoint cluster of each view to obtain the inlier rate of each view includes:
performing a predetermined number of RANSAC PnP iterations on the 2D keypoint cluster of each view, each iteration comprising:
for each 3D keypoint of the target object, selecting a corresponding 2D pixel point from the 2D keypoint cluster of the view currently operated on, and calculating a pose from the correspondence between the 2D pixel points and the 3D keypoints of the target object;
re-projecting each 3D keypoint onto the image plane according to the calculated pose to obtain the corresponding first re-projected 2D pixel points;
calculating the Euclidean distance between each 2D pixel point in the 2D keypoint cluster and the corresponding first re-projected 2D pixel point, determining the 2D pixel points whose Euclidean distance satisfies a first condition as inliers, and calculating the inlier rate;
and selecting the highest inlier rate over the iterations as the inlier rate of the view, and taking the pose calculated from the inliers of that iteration as the pose at that view.
Further, screening valid inliers from the view with the lower inlier rate according to a predetermined method includes:
projecting the first re-projected 2D pixel points of the view view_high into the view view_low according to the relative pose between view_high and view_low to obtain second re-projected 2D pixel points;
and selecting, among the 2D pixel points in the 2D keypoint cluster of view_low, those whose relationship to the second re-projected 2D pixel points satisfies a second condition as the screened valid inliers.
Further, the second condition is: over the preceding iterations, the screened valid inliers account for the largest proportion of the original 2D keypoint cluster of view_low.
Further, selecting, among the 2D pixel points in the 2D keypoint cluster of view_low, those whose relationship to the second re-projected 2D pixel points satisfies the second condition as the screened valid inliers includes:
performing a predetermined number of iterations with the second re-projected 2D pixel points as the initial value of a first set, each iteration comprising:
selecting, from the 2D keypoint cluster of view_low, the group of not-yet-selected 2D pixel points nearest to the second re-projected 2D pixel points, and adding this group to the first set;
calculating the inlier rate of the first set;
and selecting the inliers added to the first set in the iteration with the highest inlier rate as the screened valid inliers.
Further, fusing the inliers of the view with the higher inlier rate with the screened valid inliers includes:
re-projecting each inlier of view_high into view_low to obtain third re-projected 2D pixel points in one-to-one correspondence;
and merging the third re-projected 2D pixel points with the screened valid inliers.
Further, after the pose of each view is calculated, the calculated pose is corrected.
The invention also provides a computer readable storage medium storing a computer program which, when run, performs the above target object pose estimation method.
The invention also provides a target object pose estimation system comprising a processor and the above computer readable storage medium connected to the processor; the processor runs the computer program in the computer readable storage medium to perform the target object pose estimation method.
In summary, owing to the adoption of the above technical scheme, the invention has the following beneficial effects:
The invention provides a dual-view fusion algorithm based on 2D predicted keypoint clusters: the 2D point clusters are fused using RGB images alone, without an RGB-D sensor, which reduces the hardware cost of the system. The method can be plugged into any pose estimator based on 2D keypoint cluster prediction and is therefore highly portable. The invention exploits and fuses the valid 2D information in the images of all views; even the useful information in a view with poor illumination or severe occlusion can be used effectively, improving pose estimation accuracy. Compared with single-view pose estimation, the results show improved accuracy under a range of experimental conditions.
Drawings
The invention will now be described by way of example and with reference to the accompanying drawings in which:
Fig. 1 is a flowchart of determining the views with the higher and lower inlier rates.
Fig. 2 is a schematic diagram of the transformation of world coordinates into a camera coordinate system.
Fig. 3 is a schematic diagram of the transformation of a camera coordinate system to an image physical coordinate system.
Fig. 4 is a schematic diagram of the transformation of the physical coordinate system of an image into the pixel coordinate system.
Fig. 5 is a flowchart of inlier screening and fusion.
Detailed Description
All of the features disclosed in this specification, or all of the steps in a method or process disclosed, may be combined in any combination, except for mutually exclusive features and/or steps.
Any feature disclosed in this specification (including any accompanying claims, abstract) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise. That is, each feature is one example only of a generic series of equivalent or similar features, unless expressly stated otherwise.
Example 1
A target object pose estimation method estimates the pose of a target object from RGB image information acquired at two views (e.g., by a binocular camera); the two views are different, and both acquire RGB images. The pose of the target object in the world coordinate system is described by N (N a positive integer) 3D keypoints, i.e., the world coordinates of the target object's 3D keypoints are known and fixed. The method comprises two stages. The first stage selects, of the two views, the view with the higher inlier rate, the other view then having the lower inlier rate; since there are only two views, "higher" and "lower" uniquely identify the views. The second stage screens, by a predetermined method, the information useful for pose estimation (the valid inliers in this text) out of the view with the lower inlier rate, fuses it with the inliers of the view with the higher inlier rate, and then estimates the pose.
Specifically, the method comprises the following steps:
S1, determining, from the RGB images acquired at the two views, the view with the higher inlier rate and the view with the lower inlier rate. In practice, once one is selected the other is uniquely determined.
This step, when embodied, comprises:
S1.1, detecting a 2D keypoint cluster in the RGB image acquired at each of the two views, the 2D keypoint cluster comprising N groups of 2D pixel points in one-to-one correspondence with the N 3D keypoints of the target object. The 2D keypoint cluster detection may be implemented with the 2D keypoint detection stage of a two-stage pose estimator.
S1.2, performing a RANSAC PnP operation on the 2D keypoint cluster of each view to obtain the inlier rate of each view.
This step is a multi-iteration process: a predetermined number of RANSAC PnP iterations is performed on the 2D keypoint cluster of each view, each iteration comprising:
Corresponding 2D pixel points are selected from the 2D keypoint cluster of the view currently operated on according to the random sample consensus principle, and the pose is calculated from the correspondence between the 2D pixel points and the 3D keypoints of the target object. For example, if the target object has 8 3D keypoints, the 2D keypoint cluster at the current view comprises 8 corresponding groups of 2D pixel points. In each iteration, one 2D pixel point is randomly selected from each group, yielding a set of 8 pixel points; combined with the 8 3D keypoints, the pose R and T is calculated from the 3D-2D correspondences with a PnP algorithm (e.g., EPnP, orthogonal iteration PnP, etc.).
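As an illustrative aside (this sketch is not part of the patent disclosure), the sampling-plus-PnP step of a single iteration might look as follows in Python with OpenCV; the names kp_clusters, keypoints_3d, K and rng are assumptions made for the example, and EPnP is invoked via cv2.solvePnP:

```python
import numpy as np
import cv2

def sample_and_solve_pnp(kp_clusters, keypoints_3d, K, rng):
    """One RANSAC iteration: select one candidate 2D pixel per 3D keypoint
    and compute a pose from the resulting 2D-3D correspondences with EPnP.

    kp_clusters  -- list of N arrays, each (M_i, 2): candidate pixels per keypoint
    keypoints_3d -- (N, 3) array of the object's 3D keypoints (world coordinates)
    K            -- (3, 3) camera intrinsic matrix
    """
    sampled = np.array([c[rng.integers(len(c))] for c in kp_clusters],
                       dtype=np.float64)
    ok, rvec, tvec = cv2.solvePnP(keypoints_3d.astype(np.float64), sampled,
                                  K, None, flags=cv2.SOLVEPNP_EPNP)
    R, _ = cv2.Rodrigues(rvec)  # rotation vector -> 3x3 rotation matrix
    return ok, R, tvec, sampled
```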
Each 3D keypoint is then re-projected onto the image plane according to the calculated pose, giving the corresponding first re-projected 2D pixel points; re-projecting the 8 3D keypoints yields 8 re-projected 2D pixel points.
The re-projection process is shown in Figs. 2-4. As shown in Fig. 2, the 3D keypoints are first converted from the world coordinate system into the camera coordinate system according to formula (1):

[X_c, Y_c, Z_c]^T = R · [X_w, Y_w, Z_w]^T + T    (1)

where (X_w, Y_w, Z_w) are the three-dimensional coordinates of a 3D keypoint in the world coordinate system, (X_c, Y_c, Z_c) its three-dimensional coordinates in the camera coordinate system, and R and T the rotation and translation of the pose.
Next, following Fig. 3, the similar-triangle theorem gives

ΔABO_c ∼ ΔoCO_c
ΔPBO_c ∼ ΔpCO_c

where f is a camera intrinsic parameter (the focal length), so that

x = f · X_c / Z_c,   y = f · Y_c / Z_c    (2)

The coordinates of a 3D keypoint in the camera coordinate system can be converted into the image physical coordinate system according to formula (2).
As shown in Fig. 4:

u = x / dx + u_0,   v = y / dy + v_0    (3)

where (u_0, v_0) are camera intrinsic parameters (the principal point) and dx, dy are the physical dimensions of a pixel. The coordinates of a 3D keypoint in the image physical coordinate system can be converted into the pixel coordinate system according to formula (3); the 3D keypoint coordinates are thereby re-projected to a 2D pixel point.
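For illustration, formulas (1)-(3) chain into the standard pinhole projection, with f, dx, dy, u_0 and v_0 folded into the intrinsic matrix K. A minimal numpy sketch (function and variable names are assumptions, not part of the patent text):

```python
import numpy as np

def project_keypoints(keypoints_3d, R, T, K):
    """Re-project 3D world-coordinate keypoints to 2D pixel coordinates.

    Applies formula (1) (world -> camera), then formulas (2)-(3), which
    together amount to multiplication by the intrinsic matrix
    K = [[f/dx, 0, u_0], [0, f/dy, v_0], [0, 0, 1]]."""
    Xc = keypoints_3d @ R.T + T.reshape(1, 3)  # formula (1): Xc = R*Xw + T
    uv = Xc @ K.T                              # formulas (2)-(3) combined
    return uv[:, :2] / uv[:, 2:3]              # divide by the depth Z_c
```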
The Euclidean distance between each 2D pixel point in the 2D keypoint cluster and the corresponding first re-projected 2D pixel point is then calculated, the 2D pixel points whose Euclidean distance satisfies the first condition are determined to be inliers, and the inlier rate is calculated. The 2D keypoint cluster has 8 groups of 2D pixel points; each group corresponds to one 3D keypoint and hence to one re-projected 2D pixel point, the first re-projected 2D pixel point. The Euclidean distance between each 2D pixel point in a group and the group's first re-projected 2D pixel point is calculated as:

d = sqrt((x_1 − x_2)² + (y_1 − y_2)²)    (4)

where (x_1, y_1) are the coordinates of the 2D pixel point and (x_2, y_2) those of the first re-projected 2D pixel point.
The 2D pixel points whose distance satisfies the first condition (e.g., is less than a first threshold) are determined to be inliers; conversely, those whose distance does not satisfy the first condition are outliers. The inliers of each group of 2D pixel points are determined by this screening, and the inlier rate of the iteration is calculated as the total number of inliers divided by the total number of 2D pixel points.
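A minimal sketch of this inlier test (the threshold value and all names are assumptions for illustration):

```python
import numpy as np

def inlier_mask(kp_clusters, reprojected, threshold):
    """Mark as inliers the detected 2D pixels whose Euclidean distance to
    the corresponding first re-projected pixel is below the threshold (the
    'first condition'), and return the overall inlier rate."""
    masks = []
    for pts, rp in zip(kp_clusters, reprojected):  # one group per 3D keypoint
        d = np.linalg.norm(np.asarray(pts) - np.asarray(rp), axis=1)  # formula (4)
        masks.append(d < threshold)
    total = sum(len(m) for m in masks)
    inlier_rate = sum(int(m.sum()) for m in masks) / total
    return masks, inlier_rate
```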
The iterations are repeated in this way; the highest inlier rate over the iterations is selected as the inlier rate of the view, PnP is computed from the inliers of that iteration, and the resulting pose is taken as the pose R and T at that view.
After the iterations, the pose R, T and the inlier rate of each view are obtained; the parameter T of each view may additionally be corrected using techniques such as lidar, ultrasonic ranging, or laser ranging. This is an optional refinement: it is not required for the basic function, but the closer the corrected T is to the true value, the better the subsequent inlier fusion works and the more accurate the pose estimate.
S1.3, comparing the inlier rates of the views and determining the view view_high with the higher inlier rate and the view view_low with the lower inlier rate. They may be referred to directly as the high-inlier-rate view view_high and the low-inlier-rate view view_low.
S2, screening valid inliers from the low-inlier-rate view view_low according to a predetermined method. Not all inliers in view_low are accurately projected; some may lie far from the correct projection, and such inliers carry information that is useless or even harmful for pose estimation, so the inliers containing useful information (called valid inliers) must be screened out for pose estimation.
As shown in Fig. 5, in a specific embodiment this step comprises:
S2.1, projecting the first re-projected 2D pixel points of view view_high into view view_low according to the relative pose between view_high and view_low, obtaining the second re-projected 2D pixel points.
The relative pose between view_high and view_low is denoted ΔR and ΔT. It can be determined when the two cameras are installed: for example, when a binocular camera acquires the RGB images, the pose relationship between the two cameras is fixed. Alternatively, it can be obtained by camera calibration. This is the relative pose between the two views in the present invention.
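If each camera's extrinsics (R_A, T_A) and (R_B, T_B) with respect to a shared world frame are available from calibration, the relative pose follows by composition. A short sketch under that assumption (names are illustrative):

```python
import numpy as np

def relative_pose(R_A, T_A, R_B, T_B):
    """Compose the relative pose of view B with respect to view A from each
    camera's extrinsics in a shared world frame: since Xc_B = R_B*Xw + T_B
    and Xw = R_A^T (Xc_A - T_A), it follows that
    dR = R_B @ R_A^T and dT = T_B - dR @ T_A."""
    dR = R_B @ R_A.T
    dT = T_B - dR @ T_A
    return dR, dT
```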
The process of re-projecting a 2D pixel point in view A into view B to obtain the corresponding re-projected 2D pixel point comprises the following steps:
First, the 2D pixel point in view_high is restored to the camera coordinate system of view_high. Combining formulas (1)-(3) gives:

Z_c · [u, v, 1]^T = K · [X_c, Y_c, Z_c]^T    (5)

where K is the camera intrinsic matrix. A known 2D pixel point (u, v) can thus be restored to the 3D coordinates (X_c, Y_c, Z_c) in the camera coordinate system, provided the depth Z_c is known; Z_c can be calculated by formula (1) from the R and T obtained by RANSAC PnP and the known world coordinates (X_w, Y_w, Z_w) of the 3D keypoint.
Then, through the transformation between the two camera coordinate systems:

[X_c_B, Y_c_B, Z_c_B]^T = ΔR · [X_c_A, Y_c_A, Z_c_A]^T + ΔT    (6)

the 3D coordinates (X_c_A, Y_c_A, Z_c_A) in the camera coordinate system of view A are converted to the 3D coordinates (X_c_B, Y_c_B, Z_c_B) in the camera coordinate system of view B. In this embodiment, view A corresponds to view_high and view B to view_low.
Finally, the 3D point in the camera coordinate system of view B is projected onto the image plane through formula (5) to obtain 2D pixel coordinates (u, v), completing the re-projection of a 2D pixel point from one view into the other.
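As a sketch of this cross-view re-projection chain (formulas (1), (5) and (6)), assuming the keypoints' world coordinates and view A's estimated pose supply the depth as described above; all names are assumptions:

```python
import numpy as np

def reproject_to_other_view(keypoints_3d, R_A, T_A, dR, dT, K_B):
    """Project keypoints associated with view A into view B.

    The depth needed to leave the pixel plane comes from formula (1): the
    keypoints' world coordinates and view A's estimated pose (R_A, T_A)
    give their camera-A coordinates, hence Z_c, directly."""
    Xc_A = keypoints_3d @ R_A.T + T_A.reshape(1, 3)  # formula (1): world -> camera A
    Xc_B = Xc_A @ dR.T + dT.reshape(1, 3)            # formula (6): camera A -> camera B
    uv = Xc_B @ K_B.T                                # formula (5): camera B -> pixels
    return uv[:, :2] / uv[:, 2:3]
```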
S2.2, selecting, among the 2D pixel points in the 2D keypoint cluster of view_low, those whose relationship to the second re-projected 2D pixel points satisfies a second condition as the screened valid inliers.
The second condition is: over the preceding iterations, the screened valid inliers account for the largest proportion of the original 2D keypoint cluster of view_low; that is, the iteration whose selected valid inliers form the largest proportion is the one confirmed to satisfy the second condition. In a specific embodiment, the method comprises:
performing a predetermined number of iterations with the second re-projected 2D pixel points as the initial value of a first set, each iteration comprising:
Selecting, from the 2D keypoint cluster of view view_low, the group of not-yet-selected 2D pixel points nearest to the second re-projected 2D pixel points, and adding this group to the first set. Continuing the example, there are 8 second re-projected 2D pixel points and 8 groups of 2D pixel points in the 2D keypoint cluster of view_low. In the first iteration, the 2D pixel point nearest to the group's corresponding second re-projected 2D pixel point is selected from each group and added to the first set, i.e., 8 points from the 8 groups. Likewise, in the second iteration the next-nearest 2D pixel points are selected and added to the first set, giving a cumulative 8 × 2 = 16 points from the 8 groups; the k-th iteration gives 8k selected 2D pixel points.
The inlier rate of the first set is then calculated; inliers are determined as in step S1. Since the 2D pixel points added to the first set are not necessarily inliers beneficial to pose estimation, orthogonal iteration PnP may be run on the current first set at each iteration, after which the inlier rate is computed by the same method as in S1.
After the predetermined number of iterations, the inliers added to the first set in the iteration with the highest inlier rate are selected as the valid inliers.
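A sketch of this screening loop, under the same assumptions as the earlier snippets (EPnP via OpenCV stands in for the PnP solve; all names are illustrative):

```python
import numpy as np
import cv2

def screen_valid_inliers(kp_clusters, second_reproj, keypoints_3d, K,
                         n_rounds, threshold):
    """Sketch of step S2.2: grow a candidate set around the second
    re-projected pixels and keep the inliers of the round that scores the
    highest inlier rate. kp_clusters holds view_low's N groups of candidate
    pixels; second_reproj is the (N, 2) array of second re-projected pixels."""
    remaining = [list(map(tuple, grp)) for grp in kp_clusters]
    first_set = [[] for _ in kp_clusters]
    best, best_rate = None, -1.0
    for _ in range(n_rounds):
        # add, per group, the not-yet-selected pixel nearest its anchor
        for g, anchor in enumerate(second_reproj):
            if not remaining[g]:
                continue
            pts = np.asarray(remaining[g])
            j = int(np.argmin(np.linalg.norm(pts - anchor, axis=1)))
            first_set[g].append(remaining[g].pop(j))
        # score the accumulated set: PnP on it, then the inlier test of S1
        pts2d = np.array([p for grp in first_set for p in grp], dtype=np.float64)
        pts3d = np.array([keypoints_3d[g] for g, grp in enumerate(first_set)
                          for _ in grp], dtype=np.float64)
        ok, rvec, tvec = cv2.solvePnP(pts3d, pts2d, K, None,
                                      flags=cv2.SOLVEPNP_EPNP)
        if not ok:
            continue
        proj, _ = cv2.projectPoints(pts3d, rvec, tvec, K, None)
        d = np.linalg.norm(proj.reshape(-1, 2) - pts2d, axis=1)
        rate = float((d < threshold).mean())
        if rate > best_rate:
            best_rate = rate
            keep = d < threshold
            best = (pts3d[keep], pts2d[keep])  # valid inliers of the best round
    return best, best_rate
```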
S3, fusing the inliers of the view with the higher inlier rate with the screened valid inliers to form an inlier set.
In a specific embodiment this step comprises two processes. First, the inliers in view_high are re-projected into view_low to obtain the third re-projected 2D pixel points; the re-projection proceeds as for the first re-projected 2D pixel points above, except that all inliers are re-projected rather than only 8 2D pixel points. Note that if there are multiple inliers in view_high, multiple points are likewise re-projected into view_low. Continuing the example of 8 groups of 2D pixel points, the 8 groups contain 8 groups of inliers, and all of them are projected into view_low to obtain 8 groups of third re-projected 2D pixel points. Second, the third re-projected 2D pixel points are merged with the inliers screened from (each group of 2D pixel points of) the 2D keypoint cluster of view_low, giving 8 finally determined groups of 2D pixel points that serve as the inlier set for pose estimation.
S4, estimating the pose from the inlier set.
For pose estimation from the inliers, an orthogonal iteration PnP algorithm is adopted: a final single solve over the known 3D keypoints and the inlier set yields the final estimate of the pose R and T.
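A sketch of this final solve; note that OpenCV provides no orthogonal-iteration PnP, so its iterative (Levenberg-Marquardt) PnP is used here as a stand-in, and the input names are assumptions:

```python
import numpy as np
import cv2

def estimate_final_pose(inlier_set_2d, inlier_set_3d, K):
    """Final solve of step S4 over the fused inlier set. The two input
    arrays pair each 2D inlier with its 3D keypoint; cv2's iterative PnP
    stands in for the orthogonal iteration variant named in the text."""
    ok, rvec, tvec = cv2.solvePnP(
        inlier_set_3d.astype(np.float64), inlier_set_2d.astype(np.float64),
        K, None, flags=cv2.SOLVEPNP_ITERATIVE)
    R, _ = cv2.Rodrigues(rvec)
    return ok, R, tvec  # the final pose estimate R, T
```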
In this way, the invention achieves low-cost, high-accuracy estimation of the target object's pose, which is significant for practical rigid-body pose estimation:
The significance of rigid-body 6D pose estimation is that the precise pose of an object can be obtained, supporting fine manipulation of the object; it is mainly applied in robotic grasping and augmented reality. In robotic grasping, the manipulator can be controlled to grasp only when the precise 6D pose of the object in space is available. In augmented reality, knowing an object's 6D pose allows virtual elements to be superimposed on the object and to keep their relative pose as the object moves. With the maturation of SLAM and related techniques, robots can already localize themselves very well in space, but interacting with objects in the environment requires 6D pose estimation of those objects.
Many problems remain in practical rigid-body 6D pose systems. The existing difficulties include the hard-to-estimate 6D pose of low-texture or texture-less objects, occlusion, illumination and color changes, and depth ambiguity. These problems stem mainly from the uncertainty and complexity of a single-view optical image with respect to texture, illumination, noise, and occlusion. A multi-view 6D pose estimation fusion algorithm is therefore needed to optimize the pose and improve the accuracy and robustness of the result. Given the broad application prospects and many open problems of rigid-body 6D pose estimation, optimizing a multi-view 6D pose estimation fusion algorithm for better accuracy, lower hardware cost, and less computation time is a challenging task.
A mature multi-view 6D pose estimation system can be widely used in application scenarios such as robotics and augmented reality. By using multiple cameras or sensors to obtain the pose information of an object, more accurate and comprehensive pose estimates can be provided. This can help align virtual objects with the real world more precisely in AR and VR applications, providing a more realistic and immersive experience. In robotics, it can help a robot perform autonomous navigation, precise grasping, manipulation, and interaction in complex environments by recognizing and tracking object poses. In three-dimensional reconstruction, observation from multiple views allows a scene to be reconstructed more accurately and the positions, orientations, and relationships of objects in the scene to be understood.
More broadly, multi-view 6D pose estimation has wide prospects and applications. It can advance robotics, enabling higher levels of autonomous navigation and manipulation, and improve industrial production efficiency and intelligent manufacturing. It can also drive progress in augmented and virtual reality, providing a more realistic and immersive user experience. In medicine, a multi-view 6D pose estimation system can assist surgery, rehabilitation, and other applications, improving the standard of care and patient outcomes. In short, the development of this technology brings innovation and opportunities to many industries and advances society. The invention improves pose estimation accuracy while reducing cost and has excellent application prospects.
The invention is not limited to the specific embodiments described above. The invention extends to any novel one, or any novel combination, of the features disclosed in this specification, as well as to any novel one, or any novel combination, of the steps of the method or process disclosed.

Claims (10)

1. A target object pose estimation method, characterized by comprising:
determining, from RGB images acquired at two views, the view with the higher inlier rate and the view with the lower inlier rate;
screening valid inliers from the view with the lower inlier rate according to a predetermined method;
fusing the inliers of the view with the higher inlier rate with the screened valid inliers to form an inlier set;
and estimating the pose from the inlier set.
2. The target object pose estimation method according to claim 1, wherein determining, from the RGB images acquired at two views, the view with the higher inlier rate and the view with the lower inlier rate comprises:
detecting a 2D keypoint cluster in the RGB image acquired at each view, the 2D keypoint cluster comprising N groups of 2D pixel points in one-to-one correspondence with the N 3D keypoints of the target object, N being a positive integer;
performing a RANSAC PnP operation on the 2D keypoint cluster of each view to obtain the inlier rate of each view;
and comparing the inlier rates of the views to determine the view view_high with the higher inlier rate and the view view_low with the lower inlier rate.
3. The target object pose estimation method according to claim 2, wherein performing a RANSAC PnP operation on the 2D keypoint cluster of each view to obtain the inlier rate of each view comprises:
performing a predetermined number of RANSAC PnP iterations on the 2D keypoint cluster of each view, each iteration comprising:
for each 3D keypoint of the target object, selecting a corresponding 2D pixel point from the 2D keypoint cluster of the view currently operated on, and calculating a pose from the correspondence between the 2D pixel points and the 3D keypoints of the target object;
re-projecting each 3D keypoint onto the image plane according to the calculated pose to obtain the corresponding first re-projected 2D pixel points;
calculating the Euclidean distance between each 2D pixel point in the 2D keypoint cluster and the corresponding first re-projected 2D pixel point, determining the 2D pixel points whose Euclidean distance satisfies a first condition as inliers, and calculating the inlier rate;
and selecting the highest inlier rate over the iterations as the inlier rate of the view, and taking the pose calculated from the inliers of that iteration as the pose at that view.
4. The target object pose estimation method according to claim 3, wherein screening valid inliers from the view with the lower inlier rate according to a predetermined method comprises:
projecting the first re-projected 2D pixel points of the view view_high into the view view_low according to the relative pose between view_high and view_low to obtain second re-projected 2D pixel points;
and selecting, among the 2D pixel points in the 2D keypoint cluster of view_low, those whose relationship to the second re-projected 2D pixel points satisfies a second condition as the screened valid inliers.
5. The target object pose estimation method according to claim 4, wherein the second condition is: over the preceding iterations, the screened valid inliers account for the largest proportion of the original 2D keypoint cluster of view_low.
6. The target object pose estimation method according to claim 4 or 5, wherein selecting, among the 2D pixel points in the 2D keypoint cluster of view_low, those whose relationship to the second re-projected 2D pixel points satisfies the second condition as the screened valid inliers comprises:
performing a predetermined number of iterations with the second re-projected 2D pixel points as the initial value of a first set, each iteration comprising:
selecting, from the 2D keypoint cluster of view_low, the group of not-yet-selected 2D pixel points nearest to the second re-projected 2D pixel points, and adding this group to the first set;
calculating the inlier rate of the first set;
and selecting the inliers added to the first set in the iteration with the highest inlier rate as the screened valid inliers.
7. The target object pose estimation method according to claim 6, wherein fusing the inliers of the view with the higher inlier rate with the screened valid inliers comprises:
re-projecting each inlier of view_high into view_low to obtain third re-projected 2D pixel points in one-to-one correspondence;
and merging the third re-projected 2D pixel points with the screened valid inliers.
8. The target object pose estimation method according to claim 3, wherein after the pose of each view is calculated, the calculated pose is further corrected.
9. A computer-readable storage medium having a computer program stored therein, characterized in that the computer program, when run, performs the target object pose estimation method according to any one of claims 1-8.
10. A target object pose estimation system, characterized in that the system comprises a processor and the computer-readable storage medium according to claim 9 connected to the processor; the processor runs the computer program in the computer-readable storage medium to perform the target object pose estimation method.