CN110910452B - Low-texture industrial part pose estimation method based on deep learning


Info

Publication number
CN110910452B
CN110910452B
Authority
CN
China
Prior art keywords
pose
industrial part
network
deep learning
texture
Prior art date
Legal status
Active
Application number
CN201911172167.5A
Other languages
Chinese (zh)
Other versions
CN110910452A (en)
Inventor
庄春刚
赵恒�
李少飞
沈逸超
Current Assignee
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN201911172167.5A
Publication of CN110910452A
Application granted
Publication of CN110910452B

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis
    • G06T7/70: Determining position or orientation of objects or cameras
    • G06T7/73: Determining position or orientation of objects or cameras using feature-based methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00: Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T17/20: Finite element generation, e.g. wire-frame surface description, tesselation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis
    • G06T7/0002: Inspection of images, e.g. flaw detection
    • G06T7/0004: Industrial image inspection
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00: Indexing scheme for image analysis or image enhancement
    • G06T2207/30: Subject of image; Context of image processing
    • G06T2207/30108: Industrial image inspection
    • G06T2207/30164: Workpiece; Machine component
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30: Computing systems specially adapted for manufacturing

Abstract

The invention discloses a low-texture industrial part pose estimation method based on deep learning, which relates to the technical field of computer vision and comprises the following steps: first, carrying out three-dimensional modeling on the industrial part of which the pose is required to be estimated, constructing a physical simulation environment, and generating a data set of the industrial part in different poses in the simulation environment; second, carrying out instance segmentation and cropping on the data set; and finally, establishing a pose estimation sub-network and a pose refinement sub-network based on deep learning to obtain the pose of the low-texture industrial part. According to the invention, three-dimensional modeling is carried out on the industrial parts, a pose estimation sub-network and a pose refinement sub-network based on deep learning are established, and the RGB image, the depth image, the original point cloud and the new point cloud rendered from the initial pose are used as the respective inputs, so that recognition of low-texture industrial parts with reflective surfaces is greatly improved; the method has important application value for the grasping of scattered industrial parts.

Description

Low-texture industrial part pose estimation method based on deep learning
Technical Field
The invention relates to the technical field of computer vision, in particular to a low-texture industrial part pose estimation method based on deep learning.
Background
Computer vision techniques play an important role in the perception of unstructured scenes by robots. Visual images are an effective means of acquiring real-world information: features relevant to a given task, such as object position, angle and posture, are extracted by a visual perception algorithm so that the robot can execute the corresponding operations and complete the designated task. For industrial robot sorting, scene data can now be acquired with visual sensors, but recognizing the target object in the scene and estimating its position and orientation, so that the gripping position and gripping path of the industrial robot can be calculated, has become the core problem.
In recent years, with the rapid development of deep learning, pose estimation based on deep learning has become the mainstream approach in the field of pose estimation. However, most existing deep-learning pose estimation algorithms depend on information such as the color and texture of the object surface; they perform poorly on industrial parts with low texture and reflective surfaces, which hinders efficient automatic part sorting.
Therefore, those skilled in the art are dedicated to developing a low-texture industrial part pose estimation method based on deep learning, which simulates the real scene by combining a physics engine and a graphics engine, performs three-dimensional modeling of the industrial part using UV mapping, establishes a pose estimation sub-network and a pose refinement sub-network based on deep learning, constructs the corresponding data set, and obtains an object pose estimation algorithm through continuous iteration, so as to improve the robot's ability to sort scattered parts.
Disclosure of Invention
In view of the above-mentioned drawbacks of the prior art, the present invention aims to solve the problem that existing pose estimation algorithms mostly depend on information such as the color and texture of the object surface, so that their recognition of parts with low texture and reflective surfaces is poor.
In order to achieve the above object, the present invention provides a low texture industrial part pose estimation method based on deep learning, which is characterized in that the method comprises the following steps:
step 1, carrying out three-dimensional modeling on an industrial part of which the pose is required to be estimated, constructing a physical simulation environment, and generating a data set of the industrial part in different poses in the simulation environment;
step 2, performing instance segmentation and cropping on the data set;
and step 3, establishing a pose estimation sub-network and a pose refinement sub-network based on deep learning to obtain the pose of the low-texture industrial part.
Further, the three-dimensional modeling in step 1 is based on UV mapping, i.e. the surface texture of the industrial part is mapped onto the surface of the three-dimensional model by means of a two-dimensional map.
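In standard texture-mapping terms (the notation here is illustrative and not taken from the patent), the UV map is a parameterization of the model surface by texture coordinates, and the color of a surface point is looked up in the texture image through that parameterization:

    \phi : S \subset \mathbb{R}^3 \to [0,1]^2, \qquad p \mapsto (u, v), \qquad \mathrm{color}(p) = I(\phi(p))

where S is the model surface and I is the two-dimensional texture image placed in the UV texture space.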
Further, the physical simulation environment in step 1 simulates a real scene by combining a physics engine and a graphics engine.
Further, the data set in the step 1 includes an RGB map, a depth map, a category of the industrial part, a bounding box of the industrial part, and a mask of the industrial part.
Further, in step 2, feature extraction is performed on the cropped RGB image and depth image; the feature maps extracted from the RGB map and the depth map together form a feature map of size 64×H×W; an initial pose is predicted from this feature map using a loss function defined over the feature points, wherein N is the number of feature points, [R|t] is the true pose, [R̂|t̂] is the predicted pose, and x_i is the coordinate of the i-th three-dimensional point on the model.
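Based on the symbol definitions above, an average point-to-point distance (ADD-style) loss is the natural reading; a hedged reconstruction in LaTeX:

    L = \frac{1}{N} \sum_{i=1}^{N} \left\| (R x_i + t) - (\hat{R} x_i + \hat{t}\,) \right\|_2

i.e. the mean distance between model points transformed by the true pose and by the predicted pose.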
Further, the pose estimation sub-network in the step 3 takes the RGB map and the depth map as inputs; the RGB map is an RGB image within a minimum bounding box region containing a single industrial part; the depth map is a depth image within a minimum bounding box region that includes a single industrial part.
Further, the pose refinement sub-network in the step 3 takes an original point cloud and a new point cloud rendered by the initial pose as inputs.
Further, the original point cloud is computed from the mask area of the single industrial part on the depth map and is denoted P_0, wherein (x_w, y_w, z_w) are the coordinates of a feature point in the camera coordinate system, (u, v) are its coordinates in the pixel coordinate system, z_c is the depth value of the feature point, and u_0, v_0, dx, dy, f are the intrinsic parameters of the camera.
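With the intrinsic parameters listed above, the usual pinhole-camera relation would give each point of P_0 as (a sketch, assuming dx and dy are the pixel sizes along the u and v axes):

    x_w = \frac{(u - u_0)\, dx \, z_c}{f}, \qquad y_w = \frac{(v - v_0)\, dy \, z_c}{f}, \qquad z_w = z_c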
Further, the new point cloud is obtained by calculating the optimal pose in the initial pose set, denoted [R_0|t_0], performing projection rendering of the reconstructed model to obtain the depth map under the optimal pose [R_0|t_0], and computing from it the point cloud under this pose, denoted P_1.
Further, the pose refinement sub-network in step 3 refines the pose through multiple iterations until a pose meeting the precision requirement is obtained, wherein, in the iteration formula, [R̂|t̂] is the final pose, M is the number of iterations, and [R̂_{i+1}|t̂_{i+1}] is the pose predicted at the (i+1)-th iteration;
the loss function is defined over the feature points, wherein N is the number of feature points, [R|t] is the true pose, [R̂|t̂] is the final pose obtained by the current iteration, and x_i is the coordinate of the i-th three-dimensional point on the model.
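One plausible reading of the iteration, assuming each iteration predicts a residual transform that is composed onto the previous estimate (with poses treated as homogeneous transforms), is:

    [\hat{R} \mid \hat{t}\,] = [\hat{R}_M \mid \hat{t}_M] \cdot [\hat{R}_{M-1} \mid \hat{t}_{M-1}] \cdots [\hat{R}_1 \mid \hat{t}_1] \cdot [R_0 \mid t_0]

and the refinement loss would then take the same average point-distance form as the estimation loss, with the composed pose in place of the single-shot prediction.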
Compared with the prior art, the implementation of the invention has at least the following beneficial technical effects:
according to the low-texture industrial part pose estimation method based on deep learning, three-dimensional modeling is carried out on industrial parts, a pose estimation sub-network and a pose refinement sub-network based on deep learning are established, an RGB image, a depth image, an original point cloud and a new point cloud obtained by initial pose rendering are respectively used as inputs, the recognition effect on the low-texture industrial parts with reflective surfaces is greatly improved, and the method has important application value on grabbing industrial scattered parts.
The conception, specific structure, and technical effects of the present invention will be further described with reference to the accompanying drawings to fully understand the objects, features, and effects of the present invention.
Drawings
FIG. 1 is a flow chart of a method for estimating pose of a low-texture industrial part based on deep learning according to a preferred embodiment of the present invention;
FIG. 2 is a flow chart of a physical simulation build dataset provided by a preferred embodiment of the present invention;
FIG. 3 is a diagram of part models and simulation results before and after mapping provided by a preferred embodiment of the present invention;
FIG. 4 is a flow chart of a pose estimation sub-network provided by a preferred embodiment of the present invention;
FIG. 5 is a block diagram of a pose estimation sub-network provided by a preferred embodiment of the present invention;
FIG. 6 is a flow chart of a pose refinement sub-network provided by a preferred embodiment of the present invention;
FIG. 7 is a block diagram of a pose refinement sub-network provided by a preferred embodiment of the present invention;
FIG. 8 is a diagram showing the result of the pose estimation process according to a preferred embodiment of the present invention;
FIG. 9 is a diagram of the final result of pose estimation provided by a preferred embodiment of the present invention.
Detailed Description
The following description of the preferred embodiments of the present invention refers to the accompanying drawings, which make the technical contents thereof more clear and easy to understand. The present invention may be embodied in many different forms of embodiments and the scope of the present invention is not limited to only the embodiments described herein.
In the drawings, like structural elements are referred to by like reference numerals and components having similar structure or function are referred to by like reference numerals. The dimensions and thickness of each component shown in the drawings are arbitrarily shown, and the present invention is not limited to the dimensions and thickness of each component. The thickness of the components is exaggerated in some places in the drawings for clarity of illustration.
As shown in fig. 1, a flowchart of a low-texture industrial part pose estimation method based on deep learning according to a preferred embodiment of the present invention is provided, the method includes the following steps:
step 1, carrying out three-dimensional modeling on an industrial part of which the pose is required to be estimated, constructing a physical simulation environment, and generating a data set of the industrial part in different poses in the simulation environment;
When three-dimensional modeling is performed on the industrial part, as shown in fig. 2, simulation of the real scene is realized by combining a physics engine and a graphics engine, and the rendering result is brought close to the real scene by further using attributes such as illumination and texture; the physics engine endows the rigid objects in the scene with real physical properties, including mass and collision properties, and, together with a mesh-based fast collision detection algorithm, gives the scene the same physical behavior as the real environment;
in order to bring the industrial part and the modeling environment close to the real world, the invention applies UV mapping to map the surface texture of the industrial part onto the surface of the three-dimensional model by means of a two-dimensional map: the three-dimensional mesh model is unwrapped in a reasonable manner so that its surface is spread onto a two-dimensional plane; this two-dimensional space is called the UV texture space and provides the mapping relation between the object surface and the texture image, and by placing a preset texture image in the UV texture space, the position on the texture image corresponding to each point of the model surface is determined;
in the physical simulation environment, a data set of the industrial part in different poses is generated by letting the parts fall freely, wherein the data set comprises an RGB image, a depth image, the category of the industrial part, the bounding box of the industrial part and the mask of the industrial part.
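A minimal sketch of such a drop-simulation data generation step, using pybullet as a stand-in physics engine (the embodiment described later uses Blender); the asset name "part.urdf", the part counts and the drop height are illustrative assumptions, and the rendering of the RGB/depth/mask annotations is omitted:

    import random
    import pybullet as p
    import pybullet_data

    def generate_scene_poses(num_parts_range=(3, 8), drop_height=0.3):
        """Drop a random number of parts and return their settled poses."""
        p.connect(p.DIRECT)                                   # headless simulation
        p.setAdditionalSearchPath(pybullet_data.getDataPath())
        p.setGravity(0, 0, -9.81)
        p.loadURDF("plane.urdf")                              # floor of the stacking area
        part_ids = []
        for _ in range(random.randint(*num_parts_range)):
            # random position above the stacking area, random Euler angles (radians)
            pos = [random.uniform(-0.1, 0.1), random.uniform(-0.1, 0.1), drop_height]
            orn = p.getQuaternionFromEuler([random.uniform(0, 6.2832) for _ in range(3)])
            part_ids.append(p.loadURDF("part.urdf", pos, orn))
        for _ in range(240):                                  # let the parts fall and settle
            p.stepSimulation()
        poses = [p.getBasePositionAndOrientation(i) for i in part_ids]
        p.disconnect()
        # the RGB map, depth map, category, bounding box and mask of each part would
        # then be rendered from these poses by the graphics engine
        return poses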
Step 2, performing instance segmentation and cropping on the data set;
in order to acquire the bounding boxes and masks of single parts in an image, the image captured by the camera is subjected to instance segmentation using an existing algorithm, the RGB image and the depth image are cropped according to the segmentation result, and the cropped result is input into the pose estimation sub-network: first, features are extracted from the cropped RGB image and depth image separately, where the cropped RGB image is 3×H×W and the cropped depth image is 1×H×W, with H and W denoting the height and width of the cropped image; second, after feature extraction each feature map has size 32×H×W, and the two feature maps are concatenated to obtain a feature map of size 64×H×W; finally, an initial pose is predicted for each feature point through several fully connected layers, with a loss function defined over the feature points, wherein N is the number of feature points, [R|t] is the true pose, [R̂|t̂] is the predicted pose, and x_i is the coordinate of the i-th three-dimensional point on the model.
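A minimal PyTorch sketch of the data flow just described (32-channel per-pixel features from each crop, concatenation to 64 channels, per-point pose regression) together with an ADD-style loss; the backbones, layer widths and the quaternion-plus-translation output are illustrative assumptions rather than the network actually claimed:

    import torch
    import torch.nn as nn

    class PoseEstimationSubNetwork(nn.Module):
        """Sketch: per-feature-point initial pose from cropped RGB and depth."""
        def __init__(self):
            super().__init__()
            self.rgb_feat = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
            self.depth_feat = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU())
            # "fully connected layers" applied per feature point, i.e. 1x1 convolutions
            self.head = nn.Sequential(
                nn.Conv2d(64, 128, 1), nn.ReLU(),
                nn.Conv2d(128, 7, 1),   # per point: quaternion (4) + translation (3)
            )

        def forward(self, rgb, depth):
            feat = torch.cat([self.rgb_feat(rgb), self.depth_feat(depth)], dim=1)  # 64 x H x W
            out = self.head(feat)
            quat, trans = out[:, :4], out[:, 4:]
            return quat / quat.norm(dim=1, keepdim=True), trans

    def add_loss(R_pred, t_pred, R_true, t_true, model_points):
        """Mean distance between model points under the predicted and true poses."""
        pred = model_points @ R_pred.T + t_pred     # (N, 3)
        true = model_points @ R_true.T + t_true
        return (pred - true).norm(dim=1).mean()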
Step 3, establishing a pose estimation sub-network and a pose refinement sub-network based on deep learning to obtain the pose of the low-texture industrial part;
to address the low-texture characteristic of industrial parts, the invention provides a pose estimation sub-network and a pose refinement sub-network, wherein the pose estimation sub-network takes an RGB image and a depth image as inputs, the RGB image being the RGB image within the minimum bounding box region containing a single industrial part and the depth image being the depth image within that same region; the pose refinement sub-network takes the original point cloud and the new point cloud rendered from the initial pose as inputs;
for the pose estimation sub-network, the cropped result is input into the network: features are first extracted from the cropped RGB image and depth image separately, a feature map of size 64×H×W is obtained after feature extraction, and an initial pose is finally predicted for each feature point through several fully connected layers;
for the pose refinement sub-network, the original point cloud is computed from the mask area of the single industrial part on the depth map and is denoted P_0, wherein (x_w, y_w, z_w) are the coordinates of a feature point in the camera coordinate system, (u, v) are its coordinates in the pixel coordinate system, z_c is the depth value of the feature point, and u_0, v_0, dx, dy, f are the intrinsic parameters of the camera;
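A short numpy sketch of this mask-to-point-cloud computation, assuming the pinhole back-projection sketched earlier (dx and dy as pixel sizes); the function name and argument layout are illustrative:

    import numpy as np

    def depth_mask_to_point_cloud(depth, mask, u0, v0, dx, dy, f):
        """Back-project masked depth pixels into camera-frame 3D points (P_0)."""
        v_idx, u_idx = np.nonzero(mask)            # pixel coordinates inside the part mask
        z_c = depth[v_idx, u_idx]
        x_w = (u_idx - u0) * dx * z_c / f
        y_w = (v_idx - v0) * dy * z_c / f
        return np.stack([x_w, y_w, z_c], axis=1)   # (num_points, 3)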
the pose estimation sub-network predicts an initial pose for each feature point; the optimal pose in the initial pose set is obtained through a pose clustering algorithm and denoted [R_0|t_0]; projection rendering of the reconstructed model is performed to obtain the depth map under the optimal pose [R_0|t_0], and the new point cloud under this pose is computed from it and denoted P_1. The pose refinement sub-network takes the original point cloud P_0 and the rendered new point cloud P_1 as inputs and predicts a more accurate pose; rendering is performed again with the current pose to obtain a new point cloud P_2, which is input into the pose refinement sub-network together with the original point cloud P_0, and this iterative refinement is repeated until a pose meeting the precision requirement is obtained, wherein, in the iteration formula, [R̂|t̂] is the final pose, M is the number of iterations, and [R̂_{i+1}|t̂_{i+1}] is the pose predicted at the (i+1)-th iteration;
the loss function is defined over the feature points, wherein N is the number of feature points, [R|t] is the true pose, [R̂|t̂] is the final pose obtained by the current iteration, and x_i is the coordinate of the i-th three-dimensional point on the model.
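A schematic Python sketch of this refinement loop; refine_net and render_point_cloud are illustrative placeholders for the refinement sub-network and the projection-rendering step, and treating each prediction as a residual 4x4 transform composed onto the current estimate is an assumption:

    def iterative_refinement(refine_net, render_point_cloud, P0, T0, num_iters):
        """Refine an initial 4x4 pose T0 over a fixed iteration budget."""
        T = T0
        for _ in range(num_iters):
            P_new = render_point_cloud(T)     # new point cloud rendered under the current pose
            delta = refine_net(P0, P_new)     # predicted residual transform (4x4)
            T = delta @ T                     # compose onto the current estimate
        return T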
Examples
In this embodiment, the software platform on which the whole deep learning method is implemented mainly comprises the physics simulation engine Blender and the deep learning framework PyTorch, and the computer hardware is an NVIDIA GeForce GTX 1080 Ti graphics card.
Step 1: as shown in fig. 1 and fig. 2, three-dimensional modeling is performed on the industrial part of which the pose is required to be estimated, a physical simulation environment is constructed by combining a physics engine and a graphics engine, UV mapping is applied to map the surface texture of the industrial part onto the surface of the three-dimensional model by means of a two-dimensional map, and a data set of the industrial part in different poses is generated in the simulation environment, specifically including an RGB map, a depth map, and the category, bounding box and mask of each industrial part; about 1000 pictures are generated through physical simulation, containing about 3000 instances of each industrial part. The specific generation process is as follows: first, the number and types of parts in the current scene are randomly initialized, the corresponding part models are selected and their poses initialized accordingly, a random position range is set above the stacking area of the scene, and the positions are generated from several uniformly distributed position points with random offsets superimposed, so that a certain distance is kept between the parts; the orientations are expressed as Euler angles and take arbitrary values in the range of 0-360 degrees. Second, the stacked scene formed by the parts falling under gravity is computed by the physical simulation model. Finally, a light source is randomly placed within a specified range, its brightness is adjusted, and the corresponding camera pose is determined, thereby forming the rendering configuration of one sample. FIG. 3 shows the three-dimensional model of a part before and after mapping and the generated RGB and depth maps;
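A small sketch of the random initialization just described; the grid of uniformly distributed position points and the offset range are illustrative assumptions:

    import random

    def init_part_poses(num_parts, grid_points, max_offset=0.02):
        """Initial poses above the stacking area: grid point plus random offset, random Euler angles."""
        poses = []
        for i in range(num_parts):
            gx, gy, gz = grid_points[i % len(grid_points)]   # uniformly distributed position points
            position = (gx + random.uniform(-max_offset, max_offset),
                        gy + random.uniform(-max_offset, max_offset),
                        gz)
            euler_deg = tuple(random.uniform(0.0, 360.0) for _ in range(3))
            poses.append((position, euler_deg))
        return poses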
Step 2: fig. 4 and fig. 5 are, respectively, a flow chart and a structure diagram of the pose estimation sub-network provided by this embodiment. The image captured by the camera is subjected to instance segmentation using an existing algorithm, the RGB image and the depth image are cropped according to the segmentation result, and the cropped result is input into the pose estimation sub-network: first, features are extracted from the cropped RGB image and depth image separately, where the cropped RGB image is 3×H×W and the cropped depth image is 1×H×W, with H and W denoting the height and width of the cropped image; second, after feature extraction each feature map has size 32×H×W, and the two feature maps are concatenated to obtain a feature map of size 64×H×W; finally, an initial pose is predicted for each feature point through several fully connected layers;
Step 3: the pose estimation sub-network predicts an initial pose for each feature point; the optimal pose in the initial pose set is obtained through a pose clustering algorithm and denoted [R_0|t_0]; projection rendering of the reconstructed model is performed to obtain the depth map under the optimal pose [R_0|t_0], and the new point cloud under this pose is computed from it and denoted P_1. Fig. 6 and fig. 7 are, respectively, a flow chart and a structure diagram of the pose refinement sub-network provided by this embodiment. The pose refinement sub-network takes the original point cloud P_0 and the rendered new point cloud P_1 as inputs and predicts a more accurate pose; rendering is performed again with the current pose to obtain a new point cloud P_2, which is input into the pose refinement sub-network together with the original point cloud P_0, and this iterative refinement is repeated until a pose meeting the precision requirement is obtained. Fig. 8 shows intermediate results of the pose estimation algorithm of this embodiment, where the first column is the acquired scene picture, the second column is the image obtained by re-projecting a single three-dimensional model according to the pose predicted by the pose estimation sub-network, the third column is the image obtained by re-projecting a single three-dimensional model according to the pose predicted by the pose refinement sub-network, and the fourth column shows the contours of the projected models of the second and third columns on one image. Fig. 9 shows the final result of the pose estimation algorithm of this embodiment, where the first row is the acquired scene picture and the second row is the image obtained by re-projecting the three-dimensional models according to the estimated poses. In this embodiment, the three-dimensional models are re-projected onto the image according to the predicted poses, and the contours of the projected models from the two stages, pose estimation and pose refinement, are displayed on the image at the same time, which shows that the pose refinement sub-network improves the final pose accuracy.
The foregoing describes in detail preferred embodiments of the present invention. It should be understood that numerous modifications and variations can be made in accordance with the concepts of the invention without requiring creative effort by one of ordinary skill in the art. Therefore, all technical solutions which can be obtained by logic analysis, reasoning or limited experiments based on the prior art by the person skilled in the art according to the inventive concept shall be within the scope of protection defined by the claims.

Claims (5)

1. A low-texture industrial part pose estimation method based on deep learning, which is characterized by comprising the following steps:
step 1, carrying out three-dimensional modeling on an industrial part of which the pose is required to be estimated, constructing a physical simulation environment, and generating a data set of the industrial part in different poses in the simulation environment; the data set comprises an RGB map, a depth map, a category of the industrial part, a bounding box of the industrial part and a mask of the industrial part;
step 2, performing instance segmentation and cropping on the data set;
step 3, establishing a pose estimation sub-network and a pose refinement sub-network based on deep learning to obtain the pose of the low-texture industrial part; the pose estimation sub-network takes the RGB map and the depth map as inputs; the pose estimation sub-network performs feature extraction on the cropped RGB image and depth image to obtain a feature map, and predicts an initial pose from the feature map; the pose refinement sub-network takes the original point cloud and the new point cloud rendered from the initial pose as inputs; the pose refinement sub-network refines through repeated iterations until a pose meeting the precision requirement is obtained;
the original point cloud is obtained by calculating mask areas of single industrial parts on the depth map and is marked as P 0 The calculation formula is as follows:
wherein ,(xw ,y w ,z w ) The coordinates of the feature points in the camera coordinate system, (u, v) the coordinates of the feature points in the pixel coordinate system, and z c For depth values of feature points, u 0 ,v 0 Dx, dy, f are internal references of the camera;
the new point cloud is recorded as [ R ] by calculating the optimal pose in the initial pose set 0 |t 0 ]Performing projection rendering on the model reconstruction to obtain the optimal pose [ R ] 0 |t 0 ]The depth map below is calculated and a new point cloud under the pose is calculated and is recorded as P 1
In the iterative formula of the pose refinement sub-network, [R̂|t̂] is the final pose, M is the number of iterations, and [R̂_{i+1}|t̂_{i+1}] is the pose predicted at the (i+1)-th iteration; in the loss function, N is the number of feature points, [R|t] is the true pose, [R̂|t̂] is the final pose obtained by the current iteration, and x_i is the coordinate of the i-th three-dimensional point on the model.
2. The method for estimating the pose of a low-texture industrial part based on deep learning according to claim 1, wherein the three-dimensional modeling in step 1 is based on UV mapping, i.e. the surface texture of the industrial part is mapped onto the surface of the three-dimensional model by means of a two-dimensional map.
3. The method for estimating the pose of a low-texture industrial part based on deep learning according to claim 1, wherein the physical simulation environment in step 1 simulates a real scene by combining a physics engine and a graphics engine.
4. The deep learning based low-texture industrial part pose estimation method of claim 1, wherein the initial pose is predicted from the feature map using a loss function in which N is the number of feature points, [R|t] is the true pose, [R̂|t̂] is the predicted pose, and x_i is the coordinate of the i-th three-dimensional point on the model.
5. The deep learning based low texture industrial part pose estimation method according to claim 1, wherein the RGB map is an RGB image within a minimum bounding box area containing a single industrial part; the depth map is a depth image within a minimum bounding box region that includes a single industrial part.
CN201911172167.5A 2019-11-26 2019-11-26 Low-texture industrial part pose estimation method based on deep learning Active CN110910452B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911172167.5A CN110910452B (en) 2019-11-26 2019-11-26 Low-texture industrial part pose estimation method based on deep learning


Publications (2)

Publication Number Publication Date
CN110910452A CN110910452A (en) 2020-03-24
CN110910452B true CN110910452B (en) 2023-08-25

Family

ID=69819648

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911172167.5A Active CN110910452B (en) 2019-11-26 2019-11-26 Low-texture industrial part pose estimation method based on deep learning

Country Status (1)

Country Link
CN (1) CN110910452B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112734727A (en) * 2021-01-11 2021-04-30 安徽理工大学 Apple picking method based on improved deep neural network
CN112801977B (en) * 2021-01-28 2022-11-22 青岛理工大学 Assembly body part relative pose estimation and monitoring method based on deep learning
CN113538569B (en) * 2021-08-11 2023-06-02 广东工业大学 Weak texture object pose estimation method and system
CN113793472B (en) * 2021-09-15 2023-01-20 应急管理部沈阳消防研究所 Image type fire detector pose estimation method based on feature depth aggregation network
CN114332211B (en) * 2022-01-06 2022-12-13 南京航空航天大学 Part pose calculation method based on edge reconstruction and dense fusion network


Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106251353A (en) * 2016-08-01 2016-12-21 上海交通大学 Weak texture workpiece and the recognition detection method and system of three-dimensional pose thereof
CN109003325A (en) * 2018-06-01 2018-12-14 网易(杭州)网络有限公司 A kind of method of three-dimensional reconstruction, medium, device and calculate equipment
CN109308737A (en) * 2018-07-11 2019-02-05 重庆邮电大学 A kind of mobile robot V-SLAM method of three stage point cloud registration methods
CN109102547A (en) * 2018-07-20 2018-12-28 上海节卡机器人科技有限公司 Robot based on object identification deep learning model grabs position and orientation estimation method
CN109472820A (en) * 2018-10-19 2019-03-15 清华大学 Monocular RGB-D camera real-time face method for reconstructing and device
CN109523629A (en) * 2018-11-27 2019-03-26 上海交通大学 A kind of object semanteme and pose data set generation method based on physical simulation
CN109685891A (en) * 2018-12-28 2019-04-26 鸿视线科技(北京)有限公司 3 d modeling of building and virtual scene based on depth image generate system
CN109816725A (en) * 2019-01-17 2019-05-28 哈工大机器人(合肥)国际创新研究院 A kind of monocular camera object pose estimation method and device based on deep learning
CN110310362A (en) * 2019-06-24 2019-10-08 中国科学院自动化研究所 High dynamic scene three-dimensional reconstruction method, system based on depth map and IMU
CN110322512A (en) * 2019-06-28 2019-10-11 中国科学院自动化研究所 In conjunction with the segmentation of small sample example and three-dimensional matched object pose estimation method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Dewei Zou et al. An Improved Method for Model-Based Training, Detection and Pose Estimation of Texture-Less 3D Objects in Occlusion Scenes. ScienceDirect, 2019, pp. 541-546. *

Also Published As

Publication number Publication date
CN110910452A (en) 2020-03-24

Similar Documents

Publication Publication Date Title
CN110910452B (en) Low-texture industrial part pose estimation method based on deep learning
Park et al. Pix2pose: Pixel-wise coordinate regression of objects for 6d pose estimation
CN109003325B (en) Three-dimensional reconstruction method, medium, device and computing equipment
CN110637305B (en) Learning to reconstruct 3D shapes by rendering many 3D views
JP5778237B2 (en) Backfill points in point cloud
EP2080167B1 (en) System and method for recovering three-dimensional particle systems from two-dimensional images
EP3188033A1 (en) Reconstructing a 3d modeled object
US20200057831A1 (en) Real-time generation of synthetic data from multi-shot structured light sensors for three-dimensional object pose estimation
Riegler et al. Connecting the dots: Learning representations for active monocular depth estimation
JP7294788B2 (en) Classification of 2D images according to the type of 3D placement
JP2021163503A (en) Three-dimensional pose estimation by two-dimensional camera
KR102223484B1 (en) System and method for 3D model generation of cut slopes without vegetation
JP2020160812A (en) Region extraction device and program
Thalhammer et al. SyDPose: Object detection and pose estimation in cluttered real-world depth images trained using only synthetic data
US10885708B2 (en) Automated costume augmentation using shape estimation
EP3846123A1 (en) 3d reconstruction with smooth maps
KR20080058366A (en) Modeling micro-structure for feature extraction
CN116205978A (en) Method, device, equipment and storage medium for determining mapping image of three-dimensional target object
To et al. Bas-relief generation from face photograph based on facial feature enhancement
JP2021176078A (en) Deep layer learning and feature detection through vector field estimation
CN112132845A (en) Three-dimensional model unitization method and device, electronic equipment and readable medium
Akizuki et al. ASM-Net: Category-level Pose and Shape Estimation Using Parametric Deformation.
CN115375847A (en) Material recovery method, three-dimensional model generation method and model training method
US11657506B2 (en) Systems and methods for autonomous robot navigation
Fan et al. Collaborative three-dimensional completion of color and depth in a specified area with superpixels

Legal Events

Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant