CN113724329A - Object attitude estimation method, system and medium fusing plane and stereo information - Google Patents

Object attitude estimation method, system and medium fusing plane and stereo information

Info

Publication number
CN113724329A
CN113724329A (application number CN202111019985.9A)
Authority
CN
China
Prior art keywords
target
point cloud
image
depth map
dimensional point
Prior art date
Legal status
Pending
Application number
CN202111019985.9A
Other languages
Chinese (zh)
Inventor
何军
孙琪
蒋思为
何钰霖
Current Assignee
Renmin University of China
Original Assignee
Renmin University of China
Priority date
Filing date
Publication date
Application filed by Renmin University of China filed Critical Renmin University of China
Priority to CN202111019985.9A priority Critical patent/CN113724329A/en
Publication of CN113724329A publication Critical patent/CN113724329A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
        • G06: COMPUTING; CALCULATING OR COUNTING
            • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
                • G06T 7/00: Image analysis
                    • G06T 7/70: Determining position or orientation of objects or cameras
                        • G06T 7/73: Determining position or orientation of objects or cameras using feature-based methods
                • G06T 2207/00: Indexing scheme for image analysis or image enhancement
                    • G06T 2207/10: Image acquisition modality
                        • G06T 2207/10028: Range image; Depth image; 3D point clouds
                    • G06T 2207/20: Special algorithmic details
                        • G06T 2207/20081: Training; Learning
                        • G06T 2207/20084: Artificial neural networks [ANN]
            • G06F: ELECTRIC DIGITAL DATA PROCESSING
                • G06F 18/00: Pattern recognition
                    • G06F 18/20: Analysing
                        • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
                            • G06F 18/213: Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
                        • G06F 18/24: Classification techniques
                        • G06F 18/25: Fusion techniques

Abstract

The invention belongs to the technical field of image recognition and relates to a target pose estimation method, system and medium fusing planar and stereo information. The method comprises the following steps: acquiring a planar grayscale image, a depth map and a CAD model of a target scene; segmenting the target in the planar grayscale image and mapping the result onto the depth map; mapping the depth map into a three-dimensional point cloud and locating the target in the three-dimensional point cloud; and mapping the locating result in the three-dimensional point cloud onto the CAD model corresponding to the target, thereby estimating the pose of the target. The method fuses planar information with stereo information and estimates the target pose from multiple angles.

Description

Object attitude estimation method, system and medium fusing plane and stereo information
Technical Field
The invention relates to a method, system and medium for estimating a target pose by fusing planar and stereo information, belongs to the technical field of image recognition, and in particular relates to the technical field of target pose recognition in video images.
Background
With the popularization of two-dimensional image acquisition equipment, planar images have become increasingly easy to acquire. From a planar image, the edge information of an object can be computed quickly through changes in color gradient, which helps separate the target from the application scene. Meanwhile, as three-dimensional sensors are applied ever more widely, using three-dimensional point clouds to solve problems in a target scene has become increasingly common. As an important information carrier for product design and manufacturing in industry, the professional and refined three-dimensional CAD model also provides convenience for tasks such as object classification, grasping and pose estimation.
Most existing methods rely on a single two-dimensional or three-dimensional sensor. For a task as complex as pose estimation, methods that acquire only planar image information from a two-dimensional sensor are fast but limited: two-dimensional image descriptors tend to ignore the structural information of the target object. Consequently, target matching algorithms that use only planar images usually need templates of the target object prepared at many angles in advance, which involves a large preparation workload and yields poor accuracy. Conversely, methods that use only the stereo point cloud data obtained by a three-dimensional sensor as the descriptor obtain relatively complete information features, but they are slow and have difficulty coping with the heavy noise in scene point cloud data.
Disclosure of Invention
In view of the above problems, it is an object of the present invention to provide a method, system and medium capable of estimating a target pose from multiple angles by fusing plane information and stereo information.
In order to achieve the above purpose, the invention adopts the following technical scheme: a target pose estimation method fusing planar and stereo information, comprising the following steps: acquiring a planar grayscale image, a depth map and a CAD model of the target scene; segmenting the target in the planar grayscale image and mapping the result onto the depth map; mapping the depth map into a three-dimensional point cloud and locating the target in the three-dimensional point cloud; and mapping the locating result in the three-dimensional point cloud onto the CAD model corresponding to the target, thereby estimating the pose of the target.
Further, the preprocessed planar grayscale image is input into a convolutional neural network model for feature extraction to obtain the boundary of the target, thereby segmenting the target.
Further, the pretreatment method comprises the following steps: processing the single-channel plane gray level image into a three-channel image, labeling the obtained three-channel image, forming a data set by the label and the corresponding three-channel image, and dividing the data set into a training set, a verification set and a test set.
Further, feature extraction is performed on the preprocessed planar grayscale image using the convolutional neural network model; a region generation network (RPN) generates the boundary area of the target object in the planar grayscale image so as to determine the position of the target object, the boundary area is associated with the extracted features to establish a mapping relation between them, and a fixed-size feature map is generated within the boundary area of the target object.
Further, the method for positioning the target in the three-dimensional stereo point cloud comprises the following steps: and obtaining a target depth map according to the distance from the collector to the target scene, and converting the depth map into three-dimensional point cloud according to the collector parameters and a conversion formula.
Further, the target type of the target object is determined through the convolutional neural network according to the boundary area of the target object, and a corresponding CAD model is selected according to the target type.
Further, the method for estimating the attitude of the target comprises the following steps: extracting key points from the CAD model, calculating point pair characteristics of every two key points, and storing the point pair characteristics into a hash table; extracting key points from a target area in the three-dimensional point cloud, calculating point pair characteristics of every two key points, and inputting the point pair characteristics in the three-dimensional point cloud into a hash table for retrieval; if the similar point pair characteristics exist, calculating a pose transformation matrix according to the point pair characteristics, voting the pose transformation matrix according to the similarity degree of the point pair characteristics, and taking the transformation matrix with the largest number of votes as an initial pose transformation matrix.
And further, correcting the initial pose transformation matrix through ICP fine registration to obtain a final pose transformation matrix.
The invention also includes a target attitude estimation system fusing planar and stereoscopic information, comprising: the image acquisition module is used for acquiring a plane gray image, a depth map and a CAD model under a target scene; the model training module is used for segmenting the plane gray level image and corresponding the result to the depth map; the point cloud positioning module is used for mapping the depth map into a three-dimensional point cloud and positioning a target in the three-dimensional point cloud; and the attitude estimation module is used for determining a CAD model corresponding to the target according to the boundary area of the target object and mapping the positioning result in the three-dimensional point cloud to the CAD model corresponding to the target so as to estimate the attitude of the target.
The present invention also includes a computer readable storage medium having a computer program stored thereon, the computer program being executed by a processor to implement the steps of any of the above-mentioned target pose estimation method for merging planar and stereoscopic information.
Due to the adoption of the technical scheme, the invention has the following advantages:
1) The method combines the high precision and strong interpretability of traditional methods with the strong generalization ability of deep learning methods, realizes target classification, pose estimation and grasping, and simplifies the task steps;
2) By combining planar image information with three-dimensional geometric information for prediction, the method reduces the number of templates required for the target and, to a certain extent, alleviates the slow processing caused by predicting with stereo geometric information alone;
3) The invention does not require labeling the target pose in advance, thereby reducing the expenditure of manpower and material resources.
Drawings
Fig. 1 is a flowchart of a target pose estimation method for fusing plane and stereo information according to an embodiment of the present invention.
Detailed Description
The present invention is described in detail below through specific embodiments so that those skilled in the art can better understand its technical direction. It should be understood, however, that the detailed description is provided only for a better understanding of the invention and should not be taken as limiting it. In describing the present invention, it is to be understood that the terminology used is for the purpose of description only and is not intended to indicate or imply relative importance.
The method is mainly used for estimating target poses in industrial production, but the scheme of the invention can also be applied to fields such as dangerous-object detection and unsafe-behavior detection; it is not limited to the industrial field and can be used for target recognition in any three-dimensional scene.
The invention relates to a method, system and medium for estimating a target pose by fusing planar and stereo information. The target contour is mapped onto the depth map, the depth map is converted to obtain a three-dimensional point cloud, the position of the target in the three-dimensional point cloud is obtained, and the target is matched with its CAD model to obtain the pose information of the target. The method can fuse planar information with stereo information and estimate the target pose from multiple angles. The scheme of the application is explained in detail below through three embodiments with reference to the accompanying drawing.
Example one
The embodiment discloses a target attitude estimation method fusing plane and stereo information, and as shown in fig. 1, the method includes:
Acquiring a planar grayscale image, a depth map and a CAD model of the target scene.
A complete color image is composed of three channels: red, green and blue. Each of the three channel components can be displayed as a grayscale image in which different gray levels represent the proportion of that color in the image. Such images are typically displayed in gray scale from the darkest black to the brightest white, and an ordinary color image can be converted into a planar grayscale image by floating-point, integer, shift or averaging algorithms, as sketched below.
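By way of illustration only, the following Python sketch shows two of the conversion variants mentioned above (the floating-point weighted method and the averaging method); the specific luma weights and the choice of variant are assumptions, since the invention does not prescribe a particular conversion.

```python
import numpy as np

def rgb_to_gray_weighted(rgb: np.ndarray) -> np.ndarray:
    """Convert an H x W x 3 RGB image to a single-channel grayscale image
    using the common ITU-R BT.601 weights (floating-point method)."""
    weights = np.array([0.299, 0.587, 0.114])
    gray = rgb[..., :3] @ weights          # weighted sum over the channel axis
    return gray.astype(np.uint8)

def rgb_to_gray_average(rgb: np.ndarray) -> np.ndarray:
    """Averaging method: equal weight for the three channels."""
    return rgb[..., :3].mean(axis=-1).astype(np.uint8)
```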
A depth map, also called a range image, records the distance (depth) from the image collector to each point in the scene and thus reflects the surface geometry of the scene.
Segmenting the target in the planar grayscale image and mapping the result onto the depth map.
The preprocessed planar grayscale image is input into the convolutional neural network model for feature extraction. The convolutional neural network model preferably adopts a Mask RCNN model. Compared with the RCNN (Region-CNN) model, the Mask RCNN model introduces a feature pyramid network (FPN) that extracts features at different levels, so that the features carry both strong semantic information and strong spatial information; it outputs feature maps of different sizes simultaneously and improves the RPN network so that small-target features are extracted more effectively. The FPN is a general framework and can be combined with different basic feature extraction networks for feature extraction. The mask branch is specific to the Mask RCNN network: unlike Faster RCNN, which can only detect the region where the target is located, Mask RCNN additionally predicts a target mask, so the added segmentation allows the result to be displayed intuitively and effectively.
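As an illustrative sketch only, target segmentation with a Mask RCNN model using an FPN backbone could look as follows. It relies on the pretrained torchvision model rather than the network actually trained on the data set of this embodiment, and the 0.5 confidence threshold is an assumption.

```python
import torch
import torchvision

# Pretrained Mask R-CNN with a ResNet-50 + FPN backbone (a stand-in for the
# model trained on the labeled three-channel grayscale data set described below).
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

def segment_targets(image_3ch, score_thresh=0.5):
    """image_3ch: H x W x 3 uint8 array; returns kept boxes, labels and masks."""
    img = torch.from_numpy(image_3ch).permute(2, 0, 1).float() / 255.0
    with torch.no_grad():
        out = model([img])[0]           # dict with 'boxes', 'labels', 'scores', 'masks'
    keep = out["scores"] > score_thresh
    return out["boxes"][keep], out["labels"][keep], out["masks"][keep]
```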
The preprocessing method comprises the following steps: processing the single-channel planar grayscale image into a three-channel image, labeling the obtained three-channel image, forming a data set from the labels and the corresponding three-channel images, and dividing the data set into a training set, a verification set and a test set.
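A minimal sketch of this preprocessing step is given below; the 70/15/15 split ratio is an assumption, as the invention does not fix the proportions.

```python
import numpy as np

def to_three_channels(gray: np.ndarray) -> np.ndarray:
    """Replicate a single-channel H x W grayscale image into an H x W x 3 image."""
    return np.repeat(gray[..., np.newaxis], 3, axis=-1)

def split_dataset(num_samples: int, train=0.7, val=0.15, seed=0):
    """Return index arrays for the training, verification (validation) and test subsets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(num_samples)
    n_train, n_val = int(train * num_samples), int(val * num_samples)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
```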
Feature extraction is performed on the preprocessed planar grayscale image using the convolutional neural network model; the region generation network (RPN) generates the boundary area of the target object in the planar grayscale image so as to determine the position of the target object, the boundary area is associated with the extracted features to establish a mapping relation between them, and a fixed-size feature map is generated within the boundary area of the target object.
Mapping the depth map into a three-dimensional point cloud and locating the target in the three-dimensional point cloud.
The three-dimensional point cloud can well represent the geometric characteristics and the spatial position of a target, and is widely applied to the fields of 3D reconstruction, robotics, automatic driving and the like. In the process of actually acquiring point cloud data, due to factors such as equipment and environment, only a local area of an object to be detected can be scanned by one-time scanning, so that scanning needs to be performed at different angles, and then a plurality of scanned point clouds with different viewing angles are registered to restore a complete appearance of the object to be detected.
The method for locating the target in the three-dimensional point cloud is as follows: the target depth map is obtained according to the distance from the collector to the target scene, and the depth map is converted into a three-dimensional point cloud according to the collector parameters and a conversion formula, as sketched below.
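The conversion referred to above can be realized with the standard pinhole back-projection; a sketch is given below, where fx, fy, cx, cy denote the collector intrinsics and depth_scale is an assumed unit conversion (for example millimetres to metres), neither of which is specified by the invention.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy, depth_scale=0.001):
    """depth: H x W depth map; returns an N x 3 array of valid 3D points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth.astype(np.float64) * depth_scale
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]                        # drop pixels with no depth reading
```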
Determining the CAD model corresponding to the target according to the boundary area of the target object, and mapping the locating result in the three-dimensional point cloud onto that CAD model, thereby estimating the pose of the target.
The target type of the target object is determined by the convolutional neural network according to the boundary area of the target object, and the corresponding CAD model is selected according to the target type. Key points are extracted from the CAD model, the point pair feature of every two key points is calculated, and the point pair features are stored in a hash table; key points are likewise extracted from the target area in the three-dimensional point cloud, their pairwise point pair features are calculated, and these features are looked up in the hash table. If similar point pair features exist, a pose transformation matrix is calculated from the point pairs, the pose transformation matrices are voted on according to the degree of similarity of the point pair features, and the transformation matrix with the most votes is taken as the initial pose transformation matrix.
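For two oriented points p1, p2 with normals n1, n2 and d = p2 - p1, the point pair feature is commonly defined as F = (||d||, angle(n1, d), angle(n2, d), angle(n1, n2)). The sketch below computes this feature and builds the hash table from the model key points; the quantization steps are assumptions, not values specified by the invention.

```python
import numpy as np
from collections import defaultdict

def point_pair_feature(p1, n1, p2, n2):
    """Point pair feature for two points with unit normals."""
    d = p2 - p1
    dist = np.linalg.norm(d)
    d_unit = d / (dist + 1e-12)
    ang = lambda a, b: np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))
    return dist, ang(n1, d_unit), ang(n2, d_unit), ang(n1, n2)

def build_hash_table(points, normals, dist_step=0.01, angle_step=np.deg2rad(12)):
    """Quantize every pairwise feature into a key and store the pair indices.
    The O(N^2) loop is acceptable here because only key points are used."""
    table = defaultdict(list)
    for i in range(len(points)):
        for j in range(len(points)):
            if i == j:
                continue
            f = point_pair_feature(points[i], normals[i], points[j], normals[j])
            key = (int(f[0] / dist_step),) + tuple(int(a / angle_step) for a in f[1:])
            table[key].append((i, j))   # model pairs sharing this feature vote together
    return table
```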
The initial pose transformation matrix is then corrected through ICP fine registration to obtain the final pose transformation matrix. The ICP (Iterative Closest Point) algorithm mainly proceeds as follows: the initial pose transformation obtained from the initial registration, namely the rotation matrix R and the translation vector T, is applied to the CAD model of the target to serve as the initial point set of the fine registration; this initial point set is the matching point set of the ICP algorithm, denoted P, and the target point set in the scene is denoted Q. For each point in the target point set Q, the nearest point in the matching point set P is found, giving the point set P'. According to the correspondence between the matching point set P' and the target point set Q, a new pose transformation, namely a new rotation matrix R' and a new translation vector T', is solved by the least squares method and SVD decomposition; the new rotation matrix R' and translation vector T' are applied to the CAD model of the target to obtain a new matching point set P', and the average distance between the new matching point set P' and the target point set Q is calculated. If the average distance falls below a threshold or the number of iterations exceeds the given maximum, the loop stops; otherwise the matching point set is redefined and the above steps are repeated until the termination condition is met. Through this loop, the pose transformation that best aligns the model with the target point set in the scene is obtained.
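A minimal sketch of this ICP fine-registration step using the Open3D library is given below. The choice of Open3D, the correspondence distance and the iteration cap are assumptions; the invention only requires an iterative closest point refinement of the initial pose obtained from the point pair feature voting.

```python
import numpy as np
import open3d as o3d

def refine_pose(model_pts, scene_pts, init_transform, max_dist=0.01, max_iter=50):
    """model_pts, scene_pts: N x 3 arrays; init_transform: 4 x 4 initial pose
    from the voting stage. Returns the refined 4 x 4 pose of the CAD model."""
    src = o3d.geometry.PointCloud()
    src.points = o3d.utility.Vector3dVector(model_pts)
    dst = o3d.geometry.PointCloud()
    dst.points = o3d.utility.Vector3dVector(scene_pts)
    result = o3d.pipelines.registration.registration_icp(
        src, dst, max_dist, init_transform,
        o3d.pipelines.registration.TransformationEstimationPointToPoint(),
        o3d.pipelines.registration.ICPConvergenceCriteria(max_iteration=max_iter))
    return result.transformation
```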
Example two
Based on the same inventive concept, the embodiment discloses a target attitude estimation system fusing plane and three-dimensional information, which comprises:
the image acquisition module is used for acquiring a plane gray image, a depth map and a CAD model under a target scene;
the model training module is used for segmenting the plane gray level image and corresponding the result to the depth map;
the point cloud positioning module is used for mapping the depth map into a three-dimensional point cloud and positioning a target in the three-dimensional point cloud;
and the attitude estimation module is used for determining a CAD model corresponding to the target according to the boundary area of the target object and mapping the positioning result in the three-dimensional point cloud to the CAD model corresponding to the target so as to estimate the attitude of the target.
EXAMPLE III
Based on the same inventive concept, the present embodiment discloses a computer-readable storage medium having a computer program stored thereon, the computer program being executed by a processor to implement any of the above-mentioned steps of the method for target pose estimation for merging planar and stereoscopic information.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting the same, and although the present invention is described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that: modifications and equivalents may be made to the embodiments of the invention without departing from the spirit and scope of the invention, which is to be covered by the claims. The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A target attitude estimation method fusing plane and stereo information is characterized by comprising the following steps:
acquiring a plane gray image, a depth map and a CAD model under a target scene;
segmenting the plane gray level image, and corresponding the result to a depth map;
mapping the depth map into the three-dimensional point cloud, and positioning a target in the three-dimensional point cloud;
and determining a CAD model corresponding to the target according to the boundary area of the target object, and mapping the positioning result in the three-dimensional point cloud to the CAD model corresponding to the target, so as to estimate the posture of the target.
2. The method of estimating the pose of an object fusing planar and stereoscopic information according to claim 1, wherein the preprocessed planar gray image is input into a convolutional neural network model for feature extraction to obtain the boundary of the object, thereby segmenting the object.
3. The method for estimating the object pose fusing planar and stereoscopic information according to claim 2, wherein the preprocessing method comprises: processing a single-channel plane gray image into a three-channel image, labeling the obtained three-channel image, forming a data set by the label and the corresponding three-channel image together, and dividing the data set into a training set, a verification set and a test set.
4. The method of estimating an object pose fusing planar and stereoscopic information according to claim 2, wherein the preprocessed planar gray image is feature extracted using a convolutional neural network model; generating a boundary area of a target object in the plane gray-scale image through an area generation network so as to determine the position of the target object, corresponding the boundary area to the extracted features, establishing a mapping relation between the boundary area and the features, and generating a feature map with a fixed size in the boundary area of the target object.
5. The method of object pose estimation fusing planar and volumetric information according to claim 4,
the method for positioning the target in the three-dimensional point cloud comprises the following steps: and obtaining a target depth map according to the distance from the collector to the target scene, and converting the depth map into three-dimensional point cloud according to the collector parameters and a conversion formula.
6. The method of claim 2, wherein the target type of the target object is determined by a convolutional neural network according to the boundary region of the target object, and the corresponding CAD model is selected according to the target type.
7. The method for estimating the attitude of an object fusing planar and stereoscopic information according to claim 6, wherein the method for estimating the attitude of the object comprises: extracting key points from the CAD model, calculating point pair characteristics of every two key points, and storing the point pair characteristics into a hash table; extracting key points from a target area in the three-dimensional point cloud, calculating point pair characteristics of every two key points, and inputting the point pair characteristics in the three-dimensional point cloud into the hash table for retrieval; if the similar point pair characteristics exist, calculating a pose transformation matrix according to the point pair characteristics, voting the pose transformation matrix according to the similarity degree of the point pair characteristics, and taking the transformation matrix with the largest number of votes as an initial pose transformation matrix.
8. The method for estimating the pose of an object fusing planar and stereoscopic information according to claim 7, wherein the initial pose transformation matrix is modified by ICP fine registration to obtain a final pose transformation matrix.
9. An object pose estimation system for fusing planar and stereoscopic information, comprising:
the image acquisition module is used for acquiring a plane gray image, a depth map and a CAD model under a target scene;
the model training module is used for segmenting the plane gray level image and corresponding the result to the depth map;
the point cloud positioning module is used for mapping the depth map into the three-dimensional point cloud and positioning a target in the three-dimensional point cloud;
and the attitude estimation module is used for determining a CAD model corresponding to the target according to the boundary area of the target object and mapping the positioning result in the three-dimensional point cloud to the CAD model corresponding to the target so as to estimate the attitude of the target.
10. A computer-readable storage medium, having stored thereon a computer program for execution by a processor to perform the steps of the method for object pose estimation for merging planar and volumetric information according to any of claims 1-8.
CN202111019985.9A 2021-09-01 2021-09-01 Object attitude estimation method, system and medium fusing plane and stereo information Pending CN113724329A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111019985.9A CN113724329A (en) 2021-09-01 2021-09-01 Object attitude estimation method, system and medium fusing plane and stereo information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111019985.9A CN113724329A (en) 2021-09-01 2021-09-01 Object attitude estimation method, system and medium fusing plane and stereo information

Publications (1)

Publication Number Publication Date
CN113724329A true CN113724329A (en) 2021-11-30

Family

ID=78680441

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111019985.9A Pending CN113724329A (en) 2021-09-01 2021-09-01 Object attitude estimation method, system and medium fusing plane and stereo information

Country Status (1)

Country Link
CN (1) CN113724329A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019157924A1 (en) * 2018-02-13 2019-08-22 视辰信息科技(上海)有限公司 Real-time detection method and system for three-dimensional object
WO2020073936A1 (en) * 2018-10-12 2020-04-16 腾讯科技(深圳)有限公司 Map element extraction method and apparatus, and server
CN109801337A (en) * 2019-01-21 2019-05-24 同济大学 A kind of 6D position and orientation estimation method of Case-based Reasoning segmentation network and iteration optimization
CN110322512A (en) * 2019-06-28 2019-10-11 中国科学院自动化研究所 In conjunction with the segmentation of small sample example and three-dimensional matched object pose estimation method
US20210192271A1 (en) * 2019-12-23 2021-06-24 Beijing Institute Of Technology Method and Apparatus for Pose Planar Constraining on the Basis of Planar Feature Extraction
CN112233181A (en) * 2020-10-29 2021-01-15 深圳市广宁股份有限公司 6D pose recognition method and device and computer storage medium
CN113128610A (en) * 2021-04-26 2021-07-16 苏州飞搜科技有限公司 Industrial part pose estimation method and system

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
JINGJING FAN et al., "A registration method of point cloud to CAD model based on edge matching", Optik, vol. 219, pages 1-8 *
WEIXIN_44494725, "3D recognition and pose estimation based on keypoint local features: Hough voting", pages 1-3, Retrieved from the Internet <URL:https://blog.csdn.net/weixin_44494725/article/details/104455048> *
刘鹏翔 et al., "Monocular camera 6D pose estimation based on generative adversarial networks", Machine Design & Research, vol. 36, no. 6, pages 72-78 *
张恒 et al., "Medical-knowledge-enhanced multi-task learning model for tumor staging", CAAI Transactions on Intelligent Systems, vol. 16, no. 4, pages 739-745 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116630394A (en) * 2023-07-25 2023-08-22 山东中科先进技术有限公司 Multi-mode target object attitude estimation method and system based on three-dimensional modeling constraint
CN116630394B (en) * 2023-07-25 2023-10-20 山东中科先进技术有限公司 Multi-mode target object attitude estimation method and system based on three-dimensional modeling constraint

Similar Documents

Publication Publication Date Title
CN108549873B (en) Three-dimensional face recognition method and three-dimensional face recognition system
Fan et al. Pothole detection based on disparity transformation and road surface modeling
CN110264416B (en) Sparse point cloud segmentation method and device
CN108764048B (en) Face key point detection method and device
CN106156778B (en) The method of known object in the visual field of NI Vision Builder for Automated Inspection for identification
US20180165511A1 (en) Three-dimensional facial recognition method and system
Kuo et al. 3D object detection and pose estimation from depth image for robotic bin picking
CN110852182B (en) Depth video human body behavior recognition method based on three-dimensional space time sequence modeling
CN111723721A (en) Three-dimensional target detection method, system and device based on RGB-D
CN108010123B (en) Three-dimensional point cloud obtaining method capable of retaining topology information
CN110334762B (en) Feature matching method based on quad tree combined with ORB and SIFT
CN108573231B (en) Human body behavior identification method of depth motion map generated based on motion history point cloud
CN110751097B (en) Semi-supervised three-dimensional point cloud gesture key point detection method
CN104537689B (en) Method for tracking target based on local contrast conspicuousness union feature
CN108256454B (en) Training method based on CNN model, and face posture estimation method and device
CN111998862B (en) BNN-based dense binocular SLAM method
CN112070782A (en) Method and device for identifying scene contour, computer readable medium and electronic equipment
CN108550165A (en) A kind of image matching method based on local invariant feature
CN113267761B (en) Laser radar target detection and identification method, system and computer readable storage medium
CN106599806A (en) Local curved-surface geometric feature-based human body action recognition method
CN112200056B (en) Face living body detection method and device, electronic equipment and storage medium
CN111626241A (en) Face detection method and device
CN113724329A (en) Object attitude estimation method, system and medium fusing plane and stereo information
CN113269147A (en) Three-dimensional detection method and system based on space and shape, and storage and processing device
CN110514140B (en) Three-dimensional imaging method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination