CN112116653B - Object posture estimation method for multiple RGB pictures - Google Patents

Info

Publication number
CN112116653B
Authority
CN
China
Prior art keywords
rgb
dimensional
objects
pictures
rgb pictures
Prior art date
Legal status
Active
Application number
CN202011316344.5A
Other languages
Chinese (zh)
Other versions
CN112116653A (en)
Inventor
Zhang Jianchi (张键驰)
Jia Kui (贾奎)
Guo Qingda (郭清达)
Chen Ke (陈轲)
Current Assignee
Cross Dimension (Shenzhen) Intelligent Digital Technology Co., Ltd.
Original Assignee
South China University of Technology SCUT
Priority date
Filing date
Publication date
Application filed by South China University of Technology (SCUT)
Priority to CN202011316344.5A
Publication of CN112116653A
Application granted
Publication of CN112116653B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)
  • Length Measuring Devices By Optical Means (AREA)

Abstract

The invention belongs to the field of three-dimensional computer vision and discloses an object posture estimation method for a plurality of RGB pictures, which comprises the following steps: performing target detection on each RGB picture to obtain two-dimensional bounding boxes of the target objects; interpolating and filtering the detection results to remove falsely detected bounding boxes and add missed ones; inputting all target objects into an object posture estimation network based on monocular RGB pictures to estimate the object class contained in each RGB picture and the three-dimensional coordinates and orientation of each object; calculating the correspondence between objects in every two RGB pictures and the camera rotation matrix between them; constructing and optimizing an object pose graph; and projecting the three-dimensional coordinates, orientation and three-dimensional model of each object from three dimensions to two dimensions to obtain new two-dimensional bounding boxes. Compared with existing methods, the method better handles occlusion and improves estimation accuracy.

Description

Object posture estimation method for multiple RGB pictures
Technical Field
The invention belongs to the technical field of three-dimensional computer vision, and particularly relates to an object posture estimation method for a plurality of RGB pictures.
Background
Smart manufacturing is gradually becoming an area of major concern. Compared with traditional manufacturing, smart manufacturing widely adopts artificial intelligence technology, and robots are expected to learn perception, planning and decision-making so that they can take over repetitive labor from humans. One of the key technologies involved is object posture estimation. Object posture estimation is an important component of many practical applications, from robot navigation and manipulation to augmented reality, and has significant theoretical research value and application value.
With the development of deep learning and three-dimensional computer vision, object posture estimation has made great progress in both theory and application, with approaches such as template matching, edge matching and matching based on 3D models; however, existing methods still have limitations. Accurately estimating object postures under stacking or severe occlusion remains an open problem, and estimating the posture of a target object from multiple viewing angles using multiple RGB pictures is a technical route for solving it.
Disclosure of Invention
Aiming at the technical problem that existing methods struggle to accurately estimate object postures under stacking or severe occlusion, the invention provides an object posture estimation method for a plurality of RGB pictures that exploits the relations among multiple RGB pictures captured from different viewing angles, thereby improving the accuracy of object posture estimation under stacking and severe occlusion.
The technical scheme provided by the invention is as follows: a method for estimating object postures of a plurality of RGB pictures comprises the following steps:
S1, selecting a plurality of RGB pictures from the same video, and performing target detection on each RGB picture to obtain the two-dimensional bounding boxes of the target objects contained in each RGB picture as the target detection result of that RGB picture;
S2, interpolating and filtering the target detection result of each RGB picture, removing falsely detected two-dimensional bounding boxes and adding missed two-dimensional bounding boxes;
S3, inputting all target objects in each RGB picture into an object posture estimation network based on monocular RGB pictures, and estimating the object class contained in each RGB picture and the three-dimensional coordinates and three-dimensional orientation of each object;
S4, calculating the correspondence between objects in every two RGB pictures and the camera rotation matrix between every two RGB pictures according to the estimation result of step S3;
S5, constructing an object pose graph using the correspondences between objects in the RGB pictures, the camera rotation matrices, and the three-dimensional coordinates and three-dimensional orientations of the objects, and optimizing the object pose graph;
S6, after the poses of all objects in each RGB picture are obtained, projecting from three dimensions to two dimensions using the three-dimensional coordinates, three-dimensional orientation and three-dimensional model corresponding to each object to obtain new two-dimensional bounding boxes;
S7, replacing the two-dimensional bounding boxes acquired in step S1 with the new two-dimensional bounding boxes of each RGB picture, and repeating steps S2 to S6 until the number of repetitions reaches a set threshold.
Compared with the prior art, the invention has the following beneficial effects:
the object posture estimation method based on the multiple RGB pictures well solves the problem of occlusion by utilizing the relation among the multiple RGB pictures under multiple different visual angles, and improves the accuracy of the object posture estimation technology under the conditions of stacking and serious occlusion.
Drawings
FIG. 1 is a flow chart of an object pose estimation method for a plurality of RGB pictures according to an embodiment of the present invention;
FIG. 2 is a flow chart of applying the random sample consensus (RANSAC) algorithm to calculate the relative camera rotation matrix between two RGB pictures;
FIG. 3 is a schematic diagram of the optimization effect of the object pose graph optimization algorithm.
Detailed Description
The following describes the technical solution of the present invention in detail with reference to the examples and fig. 1 to 3, but the embodiments of the present invention are not limited thereto.
Examples
As shown in fig. 1, the method for estimating the pose of an object based on a plurality of RGB pictures and a three-dimensional model according to the present invention includes the steps of:
s1, target detection: selecting a plurality of RGB pictures from the same video frame, and respectively carrying out target detection on each RGB picture to obtain a two-dimensional boundary frame of a target object contained in each RGB picture as a target detection result of each RGB picture.
Specifically, each RGB picture is input into a target detection algorithm, which may be any target detection algorithm in the prior art; this embodiment adopts the Fast R-CNN target detection method. The obtained target detection result, namely the two-dimensional bounding box of a target object, is represented as [x, y, w, h], where (x, y) is the pixel position of the center point of the object, and w and h are the width and height of the object in the RGB picture, respectively.
Furthermore, the existing training data set or the existing target detection network can be utilized to perform transfer learning on the target detection algorithm, so that the accuracy of target detection is improved.
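Purely as an illustration of step S1 (not text from the patent), the sketch below uses torchvision's pre-trained Faster R-CNN as a stand-in detector and converts its corner-format boxes into the [x, y, w, h] centre format used above; the detector choice, the function name detect_objects and the 0.5 score threshold are assumptions made for the example.

    # Illustrative stand-in for step S1: off-the-shelf detector, centre-format boxes.
    import torch
    import torchvision
    from torchvision.transforms.functional import to_tensor

    detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    detector.eval()

    def detect_objects(rgb_image, score_threshold=0.5):
        """Return one [class_id, x, y, w, h] per detection, (x, y) being the box centre in pixels."""
        with torch.no_grad():
            output = detector([to_tensor(rgb_image)])[0]
        boxes = []
        for box, label, score in zip(output["boxes"], output["labels"], output["scores"]):
            if score < score_threshold:
                continue
            x1, y1, x2, y2 = box.tolist()
            boxes.append([int(label), (x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1])
        return boxes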
S2, interpolation and filtering: the target detection result of each RGB picture is interpolated and filtered with a voting mechanism, removing falsely detected two-dimensional bounding boxes and adding missed two-dimensional bounding boxes.
According to the principle of scene content consistency, the object classes and object counts contained in all the RGB pictures are essentially the same. Based on this property, the target detection results of the RGB pictures are interpolated and filtered: falsely detected two-dimensional bounding boxes are removed and missed two-dimensional bounding boxes are recovered.
Further, the strategy for interpolating and filtering the target detection results in this step is as follows: if the target detection results of not less than half of the RGB pictures contain an object of a certain class, all the RGB pictures can be considered to contain that object, so a two-dimensional bounding box is added at the corresponding position of each RGB picture in which the object was not detected; given two pictures in which the object was detected with boxes [x1, y1, w1, h1] and [x2, y2, w2, h2], the corresponding position can be represented as [(x1+x2)/2, (y1+y2)/2, (w1+w2)/2, (h1+h2)/2]. If an object of a certain class is not detected in the target detection results of more than half of the RGB pictures, none of the RGB pictures is considered to contain that object, so the two-dimensional bounding box can be deleted at the corresponding position in each RGB picture in which the object was detected.
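The following sketch illustrates one possible reading of this voting strategy (an assumption-laden example, not the patent's code): at most one instance of each object class per picture is assumed, detections are stored as a dict per picture, and missed boxes are filled with the average of the boxes detected in the other pictures.

    # Sketch of the step-S2 voting filter under the assumptions stated above.
    from collections import defaultdict

    def vote_filter(detections_per_view):
        """detections_per_view: one dict {class_id: [x, y, w, h]} per RGB picture."""
        n_views = len(detections_per_view)
        votes = defaultdict(int)
        for det in detections_per_view:
            for cls in det:
                votes[cls] += 1

        filtered = [dict(det) for det in detections_per_view]
        for cls, count in votes.items():
            boxes = [det[cls] for det in detections_per_view if cls in det]
            if 2 * count >= n_views:                      # detected in at least half of the views
                mean_box = [sum(vals) / len(vals) for vals in zip(*boxes)]
                for det in filtered:
                    det.setdefault(cls, mean_box)         # interpolate the missed bounding box
            else:                                         # detected in fewer than half of the views
                for det in filtered:
                    det.pop(cls, None)                    # remove the false detections
        return filtered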
S3, object posture estimation based on monocular RGB pictures: all target objects in each RGB picture are input into an object posture estimation network based on monocular RGB pictures, so as to estimate the object class c contained in each RGB picture and the three-dimensional coordinates (x, y, z) and three-dimensional orientation (qw, qx, qy, qz) of each object.
The monocular RGB object posture estimation network can be any object posture estimation method based on a single RGB picture. This embodiment adopts a keypoint-voting-based method improved upon the typical PVNet network: only the RGB channels of the PVNet network are used and no depth image is required, only the RGB picture serves as input, and the network comprises 12 convolutional layers and two connection layers.
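For orientation only, the snippet below sketches the input/output contract assumed of such a network in the later steps; it is a small placeholder model, not the modified PVNet of this embodiment, and its layer sizes are arbitrary.

    # Placeholder pose-network interface: RGB crop in; class scores, translation
    # (x, y, z) and unit quaternion (qw, qx, qy, qz) out.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MonoPoseNet(nn.Module):
        def __init__(self, num_classes):
            super().__init__()
            self.backbone = nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
            )
            self.fc_class = nn.Linear(128, num_classes)
            self.fc_pose = nn.Linear(128, 7)              # 3 translation + 4 quaternion values

        def forward(self, crop):                          # crop: (B, 3, H, W) RGB patch
            feat = self.backbone(crop).flatten(1)
            logits = self.fc_class(feat)
            pose = self.fc_pose(feat)
            t, q = pose[:, :3], F.normalize(pose[:, 3:], dim=1)   # unit quaternion
            return logits, t, q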
S4, camera rotation matrix estimation: for any two RGB pictures, a random sample consensus (RANSAC) algorithm is used to calculate the correspondence between the objects in the two RGB pictures and the camera rotation matrix between them according to the estimation result of step S3; the flow is shown in Fig. 2.
The strategy adopted by the RANSAC algorithm is as follows: for any two RGB pictures, two objects are randomly selected from each picture according to the object pose estimation results obtained in step S3, under the assumption that the selected objects correspond across the two pictures and share the same object class c, so that the objects selected from the two RGB pictures form two pairs of objects with pairwise consistent categories. The relative camera transformation between the two RGB pictures is calculated from the two pairs of objects with pairwise consistent categories; the reciprocal of the sum of the pose deviations of all objects in the two RGB pictures under this relative camera transformation is taken as the confidence; and the relative camera transformation with the highest confidence, together with its object correspondence, is taken as the camera rotation matrix between the two RGB pictures and the correspondence between them.
Because the object poses estimated from any two RGB pictures deviate from each other, i.e. there are deviations in both the three-dimensional coordinates and the three-dimensional orientation, the method sums the three-dimensional coordinate deviations and the three-dimensional orientation deviations separately, adds the two sums to obtain the final deviation sum, and takes the reciprocal of the final deviation sum as the confidence. In the present invention, "objects with pairwise consistent categories" means: the images of the same object taken from two different camera viewing angles constitute a pair of objects with consistent categories.
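A simplified RANSAC-style loop corresponding to this strategy is sketched below for illustration; it assumes object poses given as 4x4 homogeneous matrices in each camera's frame, measures the orientation deviation with a Frobenius norm, and ignores object symmetry, none of which are choices taken from the patent text.

    # Sketch of the step-S4 hypothesise-and-score loop under the stated assumptions.
    import random
    import numpy as np

    def pose_deviation(P_a, P_b):
        """Translation deviation plus orientation deviation between two 4x4 poses."""
        dt = np.linalg.norm(P_a[:3, 3] - P_b[:3, 3])
        dR = np.linalg.norm(P_a[:3, :3] - P_b[:3, :3], ord="fro")
        return dt + dR

    def ransac_relative_transform(objs_i, objs_j, iterations=100):
        """objs_i, objs_j: lists of (class_id, 4x4 pose in camera frame)."""
        best_T, best_conf = None, 0.0
        for _ in range(iterations):
            cls, P_i = random.choice(objs_i)
            candidates = [P for c, P in objs_j if c == cls]
            if not candidates:
                continue
            P_j = random.choice(candidates)
            T_ij = P_i @ np.linalg.inv(P_j)               # hypothesised relative camera transform
            total = sum(pose_deviation(P_a, T_ij @ P_b)
                        for ca, P_a in objs_i for cb, P_b in objs_j if ca == cb)
            confidence = 1.0 / (total + 1e-9)             # reciprocal of the summed pose deviation
            if confidence > best_conf:
                best_T, best_conf = T_ij, confidence
        return best_T, best_conf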
Specifically, for every two RGB pictures, the method selects two pairs of objects with pairwise consistent categories from the two pictures, denoted (m, n) and (m', n'). Suppose objects of categories c1 and c2 are selected from one picture and objects of categories c1 and c2 are selected from the other picture; the two objects of category c1 form one pair with pairwise consistent categories, and the two objects of category c2 form the other pair. One pair of objects with pairwise consistent categories is used to establish the camera rotation matrix T_ij (also called the camera transformation matrix) between camera view i and camera view j. In order to resolve the pose ambiguity caused by object symmetry, T_ij is calculated as

    T_ij = P_i^m · S* · (P_j^n)^(-1)

where the right-hand side is a matrix product; S* is the symmetric transformation matrix that best aligns the target object within its set of symmetry transformations; i denotes one camera view, m denotes the target object under camera view i, and P_i^m denotes the pose of target object m under camera view i; j denotes the other camera view, n denotes the target object under camera view j, P_j^n denotes the pose of target object n under camera view j, and (P_j^n)^(-1) denotes the inverse of P_j^n. It is assumed that m and n correspond to the same object, i.e. they form a pair of objects with pairwise consistent categories.
After T_ij is calculated, the method uses the other pair of objects with pairwise consistent categories, (m', n'), to verify the camera rotation matrix:

    e = ‖ T_ij - P_i^m' · (P_j^n')^(-1) ‖

where m' denotes the second target object under camera view i and P_i^m' denotes its pose, n' denotes the second target object under camera view j and P_j^n' denotes its pose, and (P_j^n')^(-1) denotes the inverse of P_j^n'. When the deviation e is smaller than a preset range, the calculated camera rotation matrix T_ij is considered correct.
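The snippet below shows one way to implement the two formulas above (a sketch under assumptions: poses are 4x4 homogeneous matrices, candidate_symmetries is a user-supplied list of the object model's symmetry transforms including the identity, and the 0.1 tolerance is an arbitrary placeholder).

    # Establish T_ij from the first corresponding pair and verify it with the second pair.
    import numpy as np

    def establish_T(P_i_m, P_j_n, P_i_m2, P_j_n2, candidate_symmetries):
        """Enumerate symmetry candidates S, form T = P_i_m @ S @ inv(P_j_n), and keep
        the T with the smallest verification residual on the second pair."""
        best_T, best_res = None, np.inf
        for S in candidate_symmetries:
            T = P_i_m @ S @ np.linalg.inv(P_j_n)
            res = np.linalg.norm(T - P_i_m2 @ np.linalg.inv(P_j_n2))   # verification residual e
            if res < best_res:
                best_T, best_res = T, res
        return best_T, best_res

    # Usage: accept T only when the verification residual is below a preset range.
    # T, e = establish_T(P_i_m, P_j_n, P_i_m2, P_j_n2, [np.eye(4)])
    # accepted = e < 0.1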
S5, pose graph construction and optimization: an object pose graph is constructed using the correspondences between objects in the multiple RGB pictures, the camera rotation matrices, and the estimated three-dimensional coordinates and three-dimensional orientations of the objects, and the constructed object pose graph is optimized by bundle adjustment.
For multiple RGB pictures, only the camera moves while the objects stay still, so after eliminating the influence of the camera rotation transformation, the three-dimensional coordinates and three-dimensional orientation of the same object imaged from two different camera viewing angles do not change; in other words, the pose of the same object relative to the world coordinate system is the same across the RGB pictures. From the previous step, the correspondences between objects in the individual pictures are known, and inconsistent objects have been removed. Therefore, a unique and consistent scene model can be recovered by applying the relative rotation transformations to the objects and cameras and then performing a global joint optimization.
Specifically, the object pose graph constructed by the method takes the three-dimensional coordinates and three-dimensional orientation of each object class estimated in each RGB picture as vertices and the camera rotation matrix between each two RGB pictures as edges; with the minimum global pose consistency deviation as the optimization target, the camera rotation matrix between each two RGB pictures and the three-dimensional coordinates and three-dimensional orientation of each object in each RGB picture are optimized by bundle adjustment.
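As a generic illustration of such a joint refinement (not the patent's exact bundle adjustment), the sketch below parameterises each camera pose and each world-frame object pose by a rotation vector and a translation, and minimises the discrepancy between the per-view pose estimates from step S3 and the poses predicted by composing camera and object poses; scipy is assumed for the nonlinear least squares, and all names are introduced here for the example.

    # Joint refinement of camera poses and object poses against per-view observations.
    import numpy as np
    from scipy.optimize import least_squares
    from scipy.spatial.transform import Rotation

    def residuals(params, n_views, n_objects, observations):
        """observations: list of (view_k, object_o, R_obs (3x3), t_obs (3,)) from step S3."""
        cam_rv = params[:3 * n_views].reshape(n_views, 3)
        cam_t = params[3 * n_views:6 * n_views].reshape(n_views, 3)
        obj_rv = params[6 * n_views:6 * n_views + 3 * n_objects].reshape(n_objects, 3)
        obj_t = params[6 * n_views + 3 * n_objects:].reshape(n_objects, 3)
        res = []
        for k, o, R_obs, t_obs in observations:
            R_cam = Rotation.from_rotvec(cam_rv[k]).as_matrix()
            R_obj = Rotation.from_rotvec(obj_rv[o]).as_matrix()
            R_pred = R_cam @ R_obj                        # object orientation seen from view k
            t_pred = R_cam @ obj_t[o] + cam_t[k]          # object position seen from view k
            res.extend(t_pred - t_obs)                    # translation residual
            res.extend(Rotation.from_matrix(R_pred.T @ R_obs).as_rotvec())  # rotation residual
        return np.asarray(res)

    def optimise_pose_graph(cam_rv, cam_t, obj_rv, obj_t, observations):
        x0 = np.concatenate([cam_rv.ravel(), cam_t.ravel(), obj_rv.ravel(), obj_t.ravel()])
        sol = least_squares(residuals, x0, args=(len(cam_rv), len(obj_rv), observations))
        return sol.x                                      # refined camera and object parameters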
As shown in Fig. 3, after the bundle adjustment optimization, the position and orientation of an object remain unchanged when the shooting angle of the camera changes, i.e., the three-dimensional coordinates and three-dimensional orientation of the object stay fixed. In Fig. 3, image 1, image 2 and image 3 represent three shooting angles of the camera, R denotes a three-dimensional rotation matrix and t a three-dimensional translation vector; R1 and t1, R2 and t2, and R3 and t3 form the three camera poses, and the middle rectangular frame at each shooting angle represents the picture of the object being shot. Vertex X1 of the object corresponds to the points p_{1,1}, p_{1,2} and p_{1,3} in the three pictures, respectively.
S6, updating the two-dimensional bounding boxes: after the poses of all objects in each RGB picture are obtained, the three-dimensional coordinates, three-dimensional orientation and three-dimensional model corresponding to each object are projected from three dimensions to two dimensions to obtain new, tighter two-dimensional bounding boxes. The three-dimensional model may be a generic one.
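A minimal sketch of this projection step follows (assumptions: a pinhole camera with known 3x3 intrinsic matrix K, the object model given as an N x 3 array of points, and the optimised pose given as rotation R and translation t in the camera frame).

    # Project the posed 3D model into the image and take the bounding box of the projection.
    import numpy as np

    def project_to_bbox(model_points, R, t, K):
        """Return a new [x, y, w, h] bounding box from the projected model points."""
        cam_pts = (R @ model_points.T).T + t              # model points in camera coordinates
        uv = (K @ cam_pts.T).T
        uv = uv[:, :2] / uv[:, 2:3]                       # perspective division
        x1, y1 = uv.min(axis=0)
        x2, y2 = uv.max(axis=0)
        return [(x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1]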
S7, pose estimation iteration: and (4) replacing the original two-dimensional bounding box acquired in the step (S1) with a new two-dimensional bounding box of each RGB picture, and repeating the steps (S2-S6) until the repetition number reaches a certain set threshold value.
Furthermore, the invention does not strictly limit the threshold on the number of iterations; any number of repetitions may be used, and the method is applicable to any set of two or more RGB pictures.
The invention discloses an object posture estimation method for a plurality of RGB pictures. Through the steps of target detection, two-dimensional bounding box filtering and interpolation, object posture estimation based on monocular RGB pictures, camera rotation matrix estimation, pose graph construction and optimization, two-dimensional bounding box updating and pose estimation iteration, the method improves the accuracy and robustness of pose estimation under stacking or severe occlusion, effectively addresses the problems of existing methods, and can be widely applied in the technical field of three-dimensional computer vision.

Claims (9)

1. An object posture estimation method for a plurality of RGB pictures, characterized by comprising the following steps:
S1, selecting a plurality of RGB pictures from the same video, and performing target detection on each RGB picture to obtain the two-dimensional bounding boxes of the target objects contained in each RGB picture as the target detection result of that RGB picture;
S2, interpolating and filtering the target detection result of each RGB picture, removing falsely detected two-dimensional bounding boxes and adding missed two-dimensional bounding boxes;
S3, inputting all target objects in each RGB picture into an object posture estimation network based on monocular RGB pictures, and estimating the object class contained in each RGB picture and the three-dimensional coordinates and three-dimensional orientation of each object;
S4, calculating the correspondence between objects in every two RGB pictures and the camera rotation matrix between every two RGB pictures according to the estimation result of step S3;
S5, constructing an object pose graph using the correspondences between objects in the RGB pictures, the camera rotation matrices, and the three-dimensional coordinates and three-dimensional orientations of the objects, and optimizing the object pose graph;
S6, after the poses of all objects in each RGB picture are obtained, projecting from three dimensions to two dimensions using the three-dimensional coordinates, three-dimensional orientation and three-dimensional model corresponding to each object to obtain new two-dimensional bounding boxes;
S7, replacing the two-dimensional bounding boxes acquired in step S1 with the new two-dimensional bounding boxes of each RGB picture, and repeating steps S2 to S6 until the number of repetitions reaches a set threshold;
wherein step S4 is implemented with a random sample consensus (RANSAC) algorithm whose strategy is as follows:
for any two RGB pictures, two pairs of objects with pairwise consistent categories are randomly selected from the two RGB pictures according to the object pose estimation results obtained in step S3; the relative camera transformation between the two RGB pictures is calculated from the two pairs of objects with pairwise consistent categories; the reciprocal of the sum of the pose deviations of all objects in the two RGB pictures under the relative camera transformation is taken as the confidence; and the relative camera transformation with the highest confidence, together with its object correspondence, is taken as the camera rotation matrix between the two RGB pictures and the correspondence between them.
2. The object pose estimation method according to claim 1, wherein the strategy of step S2 for interpolating and filtering the target detection result is: if the target detection results of not less than half of the RGB pictures contain an object of a certain class, all the RGB pictures are considered to contain that object, and a two-dimensional bounding box is added at the corresponding position of each RGB picture in which the object was not detected; if an object of a certain class is not detected in the target detection results of more than half of the RGB pictures, none of the RGB pictures is considered to contain that object, and the two-dimensional bounding box is deleted at the corresponding position in each RGB picture in which the object was detected.
3. The object pose estimation method according to claim 1, wherein the object posture estimation network for monocular RGB pictures in step S3 is a keypoint-voting-based method improved upon the PVNet network, using only the RGB channels of the PVNet network and no depth image.
4. The object pose estimation method according to claim 1, wherein, in the random sample consensus algorithm of step S4, the three-dimensional coordinate deviations and the three-dimensional orientation deviations of the object poses estimated from any two RGB pictures are summed separately, the two sums are added to obtain the final deviation sum, and the reciprocal of the final deviation sum is taken as the confidence.
5. The object pose estimation method according to claim 1, wherein, in the random sample consensus algorithm of step S4, the camera rotation matrix is established using one pair of objects with pairwise consistent categories and verified using the other pair of objects with pairwise consistent categories.
6. The object pose estimation method of claim 5, wherein the camera rotation matrix T_ij established with a pair of objects having pairwise consistent categories is:

    T_ij = P_i^m · S* · (P_j^n)^(-1)

where the right-hand side is a matrix product; S* is the symmetric transformation matrix that best aligns the target object within its set of symmetry transformations; i denotes one camera view, m denotes the target object under camera view i, and P_i^m denotes the pose of target object m under camera view i; j denotes the other camera view, n denotes the target object under camera view j, P_j^n denotes the pose of target object n under camera view j, and (P_j^n)^(-1) denotes the inverse of P_j^n; objects m and n form a pair of objects having pairwise consistent categories.
7. The object pose estimation method according to claim 1, wherein step S5 optimizes the constructed object pose graph by using bundle adjustment.
8. The object pose estimation method according to claim 1, wherein the object pose graph constructed in step S5 takes the three-dimensional coordinates and three-dimensional orientation of each object class estimated in each RGB picture as vertices and the camera rotation matrix between each two RGB pictures as edges; with the minimum global pose consistency deviation as the optimization target, the camera rotation matrix between each two RGB pictures and the three-dimensional coordinates and three-dimensional orientation of each object in each RGB picture are optimized.
9. The object pose estimation method according to claim 1, wherein step S1 employs a Fast R-CNN target detection method.
CN202011316344.5A 2020-11-23 2020-11-23 Object posture estimation method for multiple RGB pictures Active CN112116653B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011316344.5A CN112116653B (en) 2020-11-23 2020-11-23 Object posture estimation method for multiple RGB pictures

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011316344.5A CN112116653B (en) 2020-11-23 2020-11-23 Object posture estimation method for multiple RGB pictures

Publications (2)

Publication Number Publication Date
CN112116653A 2020-12-22
CN112116653B 2021-03-30

Family

ID=73794490

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011316344.5A Active CN112116653B (en) 2020-11-23 2020-11-23 Object posture estimation method for multiple RGB pictures

Country Status (1)

Country Link
CN (1) CN112116653B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023015409A1 (en) * 2021-08-09 2023-02-16 百果园技术(新加坡)有限公司 Object pose detection method and apparatus, computer device, and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110020611A (en) * 2019-03-17 2019-07-16 浙江大学 A kind of more human action method for catching based on three-dimensional hypothesis space clustering
CN110119148A (en) * 2019-05-14 2019-08-13 深圳大学 A kind of six-degree-of-freedom posture estimation method, device and computer readable storage medium
CN111340867A (en) * 2020-02-26 2020-06-26 清华大学 Depth estimation method and device for image frame, electronic equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018208791A1 (en) * 2017-05-08 2018-11-15 Aquifi, Inc. Systems and methods for inspection and defect detection using 3-d scanning
CN108830150B (en) * 2018-05-07 2019-05-28 山东师范大学 One kind being based on 3 D human body Attitude estimation method and device
CN110533721B (en) * 2019-08-27 2022-04-08 杭州师范大学 Indoor target object 6D attitude estimation method based on enhanced self-encoder
CN111932678B (en) * 2020-08-13 2021-05-14 北京未澜科技有限公司 Multi-view real-time human motion, gesture, expression and texture reconstruction system

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110020611A (en) * 2019-03-17 2019-07-16 浙江大学 A kind of more human action method for catching based on three-dimensional hypothesis space clustering
CN110119148A (en) * 2019-05-14 2019-08-13 深圳大学 A kind of six-degree-of-freedom posture estimation method, device and computer readable storage medium
CN111340867A (en) * 2020-02-26 2020-06-26 清华大学 Depth estimation method and device for image frame, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Deep Mesh Reconstruction from Single RGB Images via Topology Modification Networks";Junyi Pan et al;《2019 IEEE/CVF International Conference on Computer Vision (ICCV)》;20191231;第9963-9972页 *
"应用摄像机位姿估计的点云初始配准";郭清达等;《光学精密工程》;20170630;第25卷(第6期);第1635-1644页 *

Also Published As

Publication number Publication date
CN112116653A (en) 2020-12-22

Similar Documents

Publication Publication Date Title
CN111563415B (en) Binocular vision-based three-dimensional target detection system and method
CN110853075B (en) Visual tracking positioning method based on dense point cloud and synthetic view
US6671399B1 (en) Fast epipolar line adjustment of stereo pairs
CN109461180A (en) A kind of method for reconstructing three-dimensional scene based on deep learning
CN106940704A (en) A kind of localization method and device based on grating map
CN109887030A (en) Texture-free metal parts image position and posture detection method based on the sparse template of CAD
KR102206108B1 (en) A point cloud registration method based on RGB-D camera for shooting volumetric objects
CN107767339B (en) Binocular stereo image splicing method
CN111998862B (en) BNN-based dense binocular SLAM method
CN112015275A (en) Digital twin AR interaction method and system
CN109325995B (en) Low-resolution multi-view hand reconstruction method based on hand parameter model
Feng et al. Deep depth estimation on 360 images with a double quaternion loss
CN112116653B (en) Object posture estimation method for multiple RGB pictures
JP7178803B2 (en) Information processing device, information processing device control method and program
CN113313740B (en) Disparity map and surface normal vector joint learning method based on plane continuity
CN116681839B (en) Live three-dimensional target reconstruction and singulation method based on improved NeRF
Wang et al. Perspective 3-D Euclidean reconstruction with varying camera parameters
KR102375135B1 (en) Apparatus and Method for Cailbrating Carmeras Loaction of Muti View Using Spherical Object
Kitt et al. Trinocular optical flow estimation for intelligent vehicle applications
CN113242419A (en) 2D-to-3D method and system based on static building
Liu et al. Binocular depth estimation using convolutional neural network with Siamese branches
Xu et al. Study on the method of SLAM initialization for monocular vision
CN112633300B (en) Multi-dimensional interactive image feature parameter extraction and matching method
CN112581494B (en) Binocular scene flow calculation method based on pyramid block matching
CN117593618B (en) Point cloud generation method based on nerve radiation field and depth map

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240221

Address after: 510641 Industrial Building, Wushan South China University of Technology, Tianhe District, Guangzhou City, Guangdong Province

Patentee after: Guangzhou South China University of Technology Asset Management Co.,Ltd.

Country or region after: China

Address before: 510640 No. five, 381 mountain road, Guangzhou, Guangdong, Tianhe District

Patentee before: SOUTH CHINA University OF TECHNOLOGY

Country or region before: China

TR01 Transfer of patent right

Effective date of registration: 20240410

Address after: 518057, Building 4, 512, Software Industry Base, No. 19, 17, and 18 Haitian Road, Binhai Community, Yuehai Street, Nanshan District, Shenzhen City, Guangdong Province

Patentee after: Cross dimension (Shenzhen) Intelligent Digital Technology Co.,Ltd.

Country or region after: China

Address before: 510641 Industrial Building, Wushan South China University of Technology, Tianhe District, Guangzhou City, Guangdong Province

Patentee before: Guangzhou South China University of Technology Asset Management Co.,Ltd.

Country or region before: China