CN111906782B - Intelligent robot grabbing method based on three-dimensional vision - Google Patents


Info

Publication number
CN111906782B
Authority
CN
China
Prior art keywords
grabbing
point
network
grasping
input
Prior art date
Legal status
Active
Application number
CN202010652696.1A
Other languages
Chinese (zh)
Other versions
CN111906782A (en)
Inventor
Lan Xuguang (兰旭光)
Zhao Binglei (赵冰蕾)
Zhang Hanbo (张翰博)
Zheng Nanning (郑南宁)
Current Assignee
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date
Filing date
Publication date
Application filed by Xian Jiaotong University
Priority to CN202010652696.1A
Publication of CN111906782A
Application granted
Publication of CN111906782B
Legal status: Active
Anticipated expiration

Classifications

    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 Programme-controlled manipulators
    • B25J9/16 Programme controls
    • B25J9/1694 Programme controls characterised by use of sensors other than normal servo-feedback from position, speed or acceleration sensors, perception control, multi-sensor controlled systems, sensor fusion
    • B25J9/1697 Vision controlled systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10028 Range image; Depth image; 3D point clouds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30108 Industrial image inspection
    • G06T2207/30164 Workpiece; Machine component

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an intelligent robot grasping method based on three-dimensional vision. Taking an observation point cloud containing a target object as input, a point grasp confidence evaluation network built on a deep convolutional network evaluates the grasping confidence of each point in the observation point cloud to obtain points suitable for serving as grasp centers. Taking the features of the grasp regions as input, a regional grasp detection network based on a grasp anchor mechanism detects object grasps. Taking the fused features of the gripper closing region and the grasp region as input, a grasp refinement network refines the detected grasps. The grasp with the highest grasp quality score is selected, and its position and orientation in the robot coordinate system are obtained through a coordinate transformation so as to plan the gripper pose of the robot. The invention enables the robot to accurately grasp different kinds of objects in unstructured environments, improving the safety and reliability of intelligent robot manipulation and interaction with the outside world.

Description

Intelligent robot grabbing method based on three-dimensional vision
Technical Field
The invention belongs to the field of computer vision and intelligent robots, and particularly relates to an intelligent robot grabbing method based on three-dimensional vision.
Background
Intelligent robot grasping plays a vital role in robot manipulation and interaction with the outside world. Current machine-vision-based robot grasping algorithms fall mainly into model-based and data-driven approaches. Compared with traditional model-based grasping methods, data-driven methods can, in principle, grasp objects in unstructured environments. However, owing to the uncertainty caused by the varied shapes and structures of different objects and by sensor noise, reliably grasping different kinds of objects in an unstructured environment remains a difficult task. Most existing data-driven grasping methods generate a rectangular grasp region from RGB, depth, or RGB-D images, which oversimplifies the pose of a parallel-jaw gripper in three-dimensional space; such methods neglect the geometric information of the object surface and grasp quality metrics, and therefore have difficulty finding the optimal grasp. A three-dimensional-vision-based grasping method can learn a more robust representation of the robot grasp from the point cloud; compared with the rectangular grasp representation, a grasp representation in three-dimensional space can more accurately describe gripper poses with high grasp quality. More accurate grasp detection in three-dimensional space in turn enables more reliable object grasping. Therefore, how to accurately detect grasps of high quality in three-dimensional space, and thus obtain a robot grasping algorithm that can reliably grasp different kinds of objects in unstructured environments, is a prominent open problem.
Disclosure of Invention
The invention aims to overcome the above defects by providing an intelligent robot grasping method based on three-dimensional vision that can accurately detect grasps of high quality in three-dimensional space, so that the robot can reliably grasp different kinds of objects in an unstructured environment, thereby ensuring the reliability and safety of robot manipulation and interaction with the outside world.
To achieve this purpose, the invention adopts the following technical scheme:
An intelligent robot grasping method based on three-dimensional vision comprises the following steps:
taking an observation point cloud containing a target object as input, and evaluating the grasping confidence of each point in the observation point cloud with a point grasp confidence evaluation network built on a deep convolutional network, to obtain points suitable for serving as grasp centers;
taking the features of grasp regions as input, and detecting object grasps with a regional grasp detection network based on a grasp anchor mechanism;
taking the fused features of the gripper closing region and the grasp region as input, and refining the detected grasps with a grasp refinement network to obtain refined grasps;
and selecting the grasp with the highest grasp quality score, and obtaining its position and orientation in the robot coordinate system through a coordinate transformation, so as to plan the gripper pose of the robot and complete the robot operation.
As a further improvement of the invention, the observation point cloud of the current scene is acquired by a Kinect sensor.
As a further improvement of the invention, the grasp confidence evaluation network is established as follows:
the input observation point cloud is fed into a trained point cloud feature extraction network; the training loss is the classification loss of the grasping confidence of each point, and the network parameters are optimized by minimizing this loss function to obtain the grasp confidence evaluation network model.
As a further improvement of the invention, the specific process for obtaining the points suitable for serving as grasp centers is as follows:
the trained point cloud feature extraction network encodes the entire input point cloud into group features, and distance-based interpolation decodes the group features back into point-wise features, completing the feature extraction of the input point cloud; the extracted per-point features are then segmented by a trained binary segmentation network, and every point predicted as a positive class is suitable for serving as a grasp center.
As a further improvement of the invention, the object grasp detection predicts the corresponding grasps through the regional grasp detection network, using the points predicted as suitable grasp centers as regression centers.
As a further improvement of the invention, the specific process of predicting the corresponding grasps is as follows:
selecting, by farthest point sampling, k1 regression points that cover as many different structures as possible, to obtain k1 grasp regions; introducing grasp references with preset grasp directions in each grasp region; taking the features of each grasp region as input, regressing the deviations of the grasp position, direction, and angle relative to the preset grasp references through a max pooling layer and a multilayer perceptron; and combining the regressed deviations with the classification results of the preset references to obtain k1 detected grasp proposals.
As a further improvement of the invention, the classification and regression of the grasp references first require matching the labeled grasps with the preset grasp references; when the difference between the grasp directions of a preset reference and a labeled grasp is smaller than a specified threshold, the match succeeds, the matched preset reference is assigned a positive label, and the residual to be regressed is the difference between the labeled grasp and the preset reference corresponding to the positive label. The training loss comprises the classification loss of the preset grasp references and the regression loss of the deviations of the grasp position, direction, and angle relative to the preset references. The network is trained by minimizing this loss function to obtain the model of the regional grasp detection network.
As a further improvement of the invention, the refinement yields more accurate grasps; the specific process is as follows:
from the obtained grasp proposals, the k2 proposals whose corresponding gripper closing regions contain more than a certain number of points are selected for refinement; the points in each selected closing region are converted from the world coordinate system to the grasp coordinate system by a coordinate transformation, and the features of the k2 closing regions are obtained through a multilayer perceptron; the closing-region features are fused with the grasp-region features to obtain the corresponding fused features; taking the fused features as input, the deviations of the grasp position, direction, and angle relative to the grasp proposals are regressed through a max pooling layer and a multilayer perceptron, finally yielding k2 refined predicted grasps.
As a further improvement of the invention, preferably, the fused features of the grasp-region features and the gripper closing-region features are taken as input and fed into a max pooling layer and a multilayer perceptron; during training, the k2 predicted grasp proposals are first classified, and a predicted proposal is assigned a positive label if it is close to the corresponding labeled grasp; when regressing the grasp deviations, only the position, direction, and angle of the proposals assigned positive labels are optimized.
As a further improvement of the invention, a proposal is considered close to a labeled grasp when the differences in grasp direction and grasp angle are less than 2π/9 and π/3, respectively.
Compared with the prior art, the invention has the following advantages:
the intelligent robot grabbing method based on the three-dimensional vision is based on observation point cloud, grabbing parts with high grabbing quality indexes can be detected by providing a grabbing area, a grabbing anchor point mechanism and a grabbing part optimization network, so that the precision of three-dimensional grabbing part detection is greatly improved, meanwhile, due to the arrangement of point cloud feature extraction network parameter sharing, the forward propagation time of the whole network is not increased, and the real-time performance of the algorithm is improved.
Based on a deep learning algorithm, the point capture confidence evaluation network, the area capture part detection network and the capture part optimization network extract network parameters by sharing point cloud characteristics, so that the real-time performance of the algorithm is improved; the grabbing position with high grabbing quality index can be detected by using the grabbing area, the grabbing anchor point mechanism and the grabbing position optimization network, and the detection precision of the three-dimensional grabbing position is improved. Based on the three-dimensional point cloud observed by the depth camera, the invention can ensure that the robot can accurately grab different types of objects in an unstructured environment so as to improve the safety and reliability of the intelligent robot operation and the interaction with the outside.
Drawings
The conception, specific structure, and technical effects of the present invention are further described below with reference to the accompanying drawings, so that its objects, features, and effects can be fully understood.
FIG. 1 is a process framework diagram of the present invention;
FIG. 2 is a schematic diagram of the point grasp confidence evaluation network of the present invention;
FIG. 3 is a schematic diagram of the regional grasp detection network of the present invention;
FIG. 4 is a schematic diagram of the grasp refinement network of the present invention;
FIG. 5 is a schematic diagram of visualized grasp detection results.
Detailed Description
In order to enable those skilled in the art to better understand the technical solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without inventive effort shall fall within the protection scope of the present invention.
The invention discloses an intelligent robot grasping method based on three-dimensional vision, which comprises the following steps:
Taking an observation point cloud containing the target object as input, the grasping confidence of each point in the observation point cloud is evaluated with a point grasp confidence evaluation network built on a deep convolutional network, to obtain points suitable for serving as grasp centers. Taking the features of the grasp regions as input, object grasps are detected with a regional grasp detection network based on a grasp anchor mechanism.
Taking the fused features of the gripper closing region and the grasp region as input, the detected grasps are refined with a grasp refinement network to obtain more accurate grasps.
The grasp with the highest grasp quality score is selected, and its position and orientation in the robot coordinate system are obtained through a coordinate transformation, so as to plan the gripper pose of the robot and complete the robot operation.
As shown in FIG. 1, the intelligent robot grasping method based on three-dimensional vision comprises the following steps:
Step one: acquire an observation point cloud P of the current scene with a Kinect sensor;
Step two: evaluate the grasping confidence of each point in the observation point cloud P of the current scene with the point grasp confidence evaluation network, so as to obtain points suitable for serving as grasp centers.
As shown in FIG. 2, the specific process is as follows:
The trained point cloud feature extraction network encodes the entire input point cloud into group features, and distance-based interpolation decodes the group features back into point-wise features, completing the feature extraction of the input point cloud. The extracted per-point features are then segmented by a trained binary segmentation network, and every point predicted as a positive class is suitable for serving as a grasp center;
Preferably, the input observation point cloud P is fed into the trained point cloud feature extraction network. To train the point grasp confidence evaluation network, an index of point grasp confidence is added to the generated grasp dataset; intuitively, it represents the density of grasps with high quality scores near each point of the observation point cloud. The training loss is the classification loss of the grasping confidence of each point, and the network parameters are optimized by minimizing this loss function to obtain the grasp confidence evaluation network model.
Step three: predict the corresponding grasps through the regional grasp detection network, using the points predicted in step two as suitable grasp centers as the regression centers.
As shown in FIG. 3, the specific process is as follows: since an object admits multiple grasps, not every positive point from step two needs to serve as a regression point; instead, k1 regression points covering as many different structures as possible are selected by farthest point sampling, giving k1 grasp regions (spherical regions centered on these points); grasp references with preset grasp directions are introduced in each grasp region; taking the features of each grasp region as input, the deviations of the grasp position, direction, and angle relative to the preset grasp references are regressed through a max pooling layer and a multilayer perceptron; and the regressed deviations are combined with the classification results of the preset references to obtain k1 detected grasp proposals.
Preferably, k1 points are selected as regression points from the positive points obtained by the binary classification in step two, and the grasp-region features are the point cloud features extracted from the grasp regions centered on these k1 points. The grasp-region features are taken as input, fed into a max pooling layer and a multilayer perceptron, and k1 grasp proposals are output.
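A sketch of how the k1 regression centers, their spherical grasp regions, and the region head might be built is given below; the farthest point sampling loop, the region radius, the per-region sample count, the number of preset references per region, and the head sizes are illustrative assumptions, not values taken from the patent.

```python
# Minimal sketch (illustrative): pick k1 regression centers with farthest
# point sampling among the predicted positive points, group a spherical
# grasp region around each center, and score/regress preset references
# with a max-pool + MLP head.
import torch

def farthest_point_sampling(points, k):          # points: (M, 3)
    idx = torch.zeros(k, dtype=torch.long)
    dist = torch.full((points.shape[0],), float('inf'))
    idx[0] = torch.randint(points.shape[0], (1,)).item()
    for i in range(1, k):
        d = ((points - points[idx[i - 1]]) ** 2).sum(dim=1)
        dist = torch.minimum(dist, d)
        idx[i] = torch.argmax(dist)              # farthest from the chosen set
    return idx

def group_grasp_region(points, centers, radius=0.04, nsample=64):
    # returns (k, nsample, 3) center-relative region points
    regions = []
    for c in centers:
        d = ((points - c) ** 2).sum(dim=1).sqrt()
        inside = torch.nonzero(d < radius).squeeze(1)
        if inside.numel() == 0:
            inside = torch.zeros(1, dtype=torch.long)
        pick = inside[torch.randint(inside.numel(), (nsample,))]
        regions.append(points[pick] - c)
    return torch.stack(regions)

class RegionGraspHead(torch.nn.Module):
    """Max-pool + MLP head scoring A preset grasp references per region and
    regressing their position/direction/angle residuals (sizes assumed)."""
    def __init__(self, num_anchors=8, feat_dim=128):
        super().__init__()
        self.point_mlp = torch.nn.Sequential(
            torch.nn.Conv1d(3, feat_dim, 1), torch.nn.ReLU(),
            torch.nn.Conv1d(feat_dim, feat_dim, 1), torch.nn.ReLU())
        # per anchor: 1 classification logit + 3 pos + 3 dir + 1 angle residual
        self.fc = torch.nn.Linear(feat_dim, num_anchors * 8)

    def forward(self, region_pts):               # (k, nsample, 3)
        f = self.point_mlp(region_pts.transpose(1, 2))   # (k, C, nsample)
        g = f.max(dim=2).values                  # region feature by max pooling
        return self.fc(g).view(region_pts.shape[0], -1, 8)

positive_pts = torch.rand(500, 3)                # points classified positive
centers = positive_pts[farthest_point_sampling(positive_pts, k=16)]
regions = group_grasp_region(positive_pts, centers)   # (16, 64, 3)
proposals = RegionGraspHead()(regions)                 # (16, 8, 8) raw outputs
```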
As a preferred embodiment, to achieve higher grasp localization accuracy, classification and regression based on grasp references are adopted in the present application instead of regressing the grasps directly. Classification and regression based on the grasp references first require matching the labeled grasps with the preset grasp references. When the difference between the grasp directions of a preset reference and a labeled grasp is smaller than a specified threshold, the match succeeds, the matched preset reference is assigned a positive label, and the residual to be regressed is the difference between the labeled grasp and the preset reference corresponding to the positive label. The training loss comprises the classification loss of the preset grasp references and the regression loss of the deviations of the grasp position, direction, and angle relative to the preset references. The network is trained by minimizing this loss function to obtain the model of the regional grasp detection network. In this step, the detection precision is improved by extracting grasp-region features and by the proposed grasp anchor mechanism; compared with direct regression from single-point features, the precision is improved by 5.79% on the grasp dataset constructed in this application.
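The anchor-matching step can be illustrated as follows. The number of preset references per region, their directions, the exact form of the residuals, and the angular threshold are assumptions, since the patent only states that a match succeeds when the direction difference is below a specified threshold.

```python
# Minimal sketch of grasp-anchor matching for training the regional grasp
# detection network: a labeled grasp is assigned to preset references whose
# approach direction is close enough, and the regression target is the
# residual to that reference.  Threshold and residual layout are assumed.
import math
import torch

def angle_between(a, b):                        # a, b: (..., 3) unit vectors
    cos = (a * b).sum(dim=-1).clamp(-1.0, 1.0)
    return torch.acos(cos)

def match_anchors(anchor_dirs, gt_pos, gt_dir, gt_angle, center,
                  dir_threshold=math.pi / 6):
    """anchor_dirs: (A, 3) preset approach directions of one grasp region.
    Returns per-anchor positive labels and residual regression targets."""
    ang = angle_between(anchor_dirs, gt_dir.unsqueeze(0))     # (A,)
    labels = (ang < dir_threshold).float()                    # positive anchors
    # residuals: position offset w.r.t. the region center, direction offset
    # w.r.t. the anchor direction, and the in-plane rotation angle itself
    pos_res = (gt_pos - center).unsqueeze(0).expand(len(anchor_dirs), 3)
    dir_res = gt_dir.unsqueeze(0) - anchor_dirs
    ang_res = torch.full((len(anchor_dirs),), gt_angle)
    return labels, pos_res, dir_res, ang_res

anchors = torch.nn.functional.normalize(torch.randn(8, 3), dim=1)
labels, pos_res, dir_res, ang_res = match_anchors(
    anchors, gt_pos=torch.rand(3), gt_dir=torch.tensor([0.0, 0.0, 1.0]),
    gt_angle=0.3, center=torch.rand(3))
# training would combine a classification loss over `labels` with a
# regression loss on the residuals of the positively labeled anchors only
```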
Step four: refine the grasps predicted in step three through the grasp refinement network to obtain more accurate grasps.
As shown in FIG. 4, the specific process is as follows:
Compared with the grasp region, the information contained in the gripper closing region of a grasp proposal obtained in step three is closer to the real grasp; from the obtained grasp proposals, the k2 proposals whose corresponding gripper closing regions contain more than 50 points are selected for refinement; the points in each selected closing region are converted from the world coordinate system to the grasp coordinate system by a coordinate transformation, and the features of the k2 closing regions are obtained through a multilayer perceptron; the closing-region features are fused with the grasp-region features to obtain the corresponding fused features; taking the fused features as input, the deviations of the grasp position, direction, and angle relative to the proposals obtained in step three are regressed through a max pooling layer and a multilayer perceptron, finally yielding k2 refined predicted grasps.
Preferably, the fused features of the grasp-region features and the gripper closing-region features are taken as input and fed into a max pooling layer and a multilayer perceptron. During training, the application first classifies the k2 predicted grasp proposals; a predicted proposal is assigned a positive label if it is close to the corresponding labeled grasp (the differences in grasp direction and angle are less than 2π/9 and π/3, respectively). When regressing the grasp deviations, only the position, direction, and angle of the proposals assigned positive labels are optimized.
The final loss function thus consists of the classification loss of the grasp proposals and the regression loss of the positively labeled proposals. Training of the network is completed by minimizing this loss function with stochastic gradient descent. The grasp refinement network improves performance by 1.11% on the grasp dataset constructed in this application.
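A sketch of the positive-label test used here, with the 2π/9 and π/3 thresholds stated above; the function name and tensor shapes are illustrative.

```python
# Minimal sketch of the positive-label test for training the grasp
# refinement network: a predicted proposal counts as "close" to a labeled
# grasp when its direction difference is below 2*pi/9 and its angle
# difference is below pi/3 (thresholds taken from the text).
import math
import torch

def is_close_grasp(pred_dir, pred_angle, gt_dir, gt_angle):
    cos = torch.dot(pred_dir, gt_dir) / (pred_dir.norm() * gt_dir.norm())
    dir_diff = torch.acos(cos.clamp(-1.0, 1.0))
    ang_diff = torch.abs(pred_angle - gt_angle)
    return bool(dir_diff < 2 * math.pi / 9) and bool(ang_diff < math.pi / 3)

print(is_close_grasp(torch.tensor([0.0, 0.0, 1.0]), torch.tensor(0.2),
                     torch.tensor([0.0, 0.1, 1.0]), torch.tensor(0.5)))  # True
# refinement loss = proposal classification loss + regression loss on the
# position/direction/angle residuals of positively labeled proposals only
```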
Step five: from the predicted grasps output by the grasp refinement network, select the grasp with the highest grasp quality score, and obtain its position and orientation in the robot coordinate system through a coordinate transformation, so as to plan the gripper pose for the robot operation and complete the manipulation task.
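As a sketch, this final coordinate transformation can be carried out with a homogeneous transform obtained from hand-eye calibration; the calibration matrix T_robot_camera below is assumed to be known and is not part of the patent's disclosure.

```python
# Minimal sketch (hand-eye calibration matrix assumed known): map the
# selected grasp's position and orientation from the camera/world frame
# into the robot base frame before planning the gripper pose.
import numpy as np

def grasp_to_robot_frame(T_robot_camera, grasp_pos, grasp_rot):
    """T_robot_camera: 4x4 homogeneous transform, camera frame -> robot base.
    grasp_pos: (3,) position, grasp_rot: (3, 3) orientation in camera frame."""
    T_grasp = np.eye(4)
    T_grasp[:3, :3] = grasp_rot
    T_grasp[:3, 3] = grasp_pos
    T_out = T_robot_camera @ T_grasp
    return T_out[:3, 3], T_out[:3, :3]          # position, orientation in robot frame

T = np.eye(4)                                    # identity stands in for a real calibration
pos, rot = grasp_to_robot_frame(T, np.array([0.1, 0.0, 0.5]), np.eye(3))
```

Any equivalent pose representation (for example quaternions) would serve equally well; the method only requires that the chosen grasp end up expressed in the robot coordinate system before gripper motion planning.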
Supplemental simulation example
Step six: the performance of the method on the three-dimensional grasp detection task is evaluated with the effective grasp ratio as the metric. The effective grasp ratio is computed as follows: first, all positively labeled grasps output by the grasp refinement network are collected, and their number is denoted k; these grasps are then converted to the object coordinate system according to the mapping between the object coordinate system in the dataset and the world coordinate system, and the grasp quality score of each converted grasp on the corresponding object is computed; the number kT of positive grasps is counted according to a set grasp quality score threshold; the effective grasp ratio is the ratio of kT to k.
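The effective grasp ratio itself reduces to a simple count, sketched below; the quality threshold value is a placeholder, since the text only refers to "a set grasp quality score threshold".

```python
# Minimal sketch of the effective grasp ratio: among the k positively
# labeled grasps output by the refinement network, count those whose quality
# score on the object (after mapping into the object frame) exceeds a chosen
# threshold, and report kT / k.
def effective_grasp_ratio(quality_scores, threshold=0.5):
    """quality_scores: grasp-quality scores of the k output grasps, already
    computed in the object coordinate system; the threshold is an assumed value."""
    k = len(quality_scores)
    if k == 0:
        return 0.0
    k_t = sum(1 for q in quality_scores if q >= threshold)
    return k_t / k

print(effective_grasp_ratio([0.9, 0.7, 0.3, 0.8], threshold=0.5))  # -> 0.75
```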
The effective grasp ratio of the invention on the test set reaches 92.47%, far exceeding previous grasp detection algorithms in three-dimensional space. The grasp detection performance is reported separately for objects that appear in the training set and objects that do not; the results are shown in Table 1.
TABLE 1
Objects in the training set      Mustard bottle   Gelatin box   Banana      Peach
Effective grasp ratio            94.41%           99.30%        99.92%      87.28%
Objects not in the training set  Candy box        Pudding box   Golf ball   Hammer
Effective grasp ratio            87.06%           97.45%        85.76%      72.44%
FIG. 5 visualizes the grasp detection results for these objects: the objects in the first row appear in the training set, the objects in the second row do not, blue marks positive grasps, and red marks negative grasps.
In conclusion, based on a deep learning algorithm, the point grasp confidence evaluation network, the regional grasp detection network, and the grasp refinement network share the parameters of the point cloud feature extraction network, which improves the real-time performance of the algorithm; the grasp regions, the grasp anchor mechanism, and the grasp refinement network allow grasps with high quality scores to be detected, which improves the precision of three-dimensional grasp detection. Based on the three-dimensional point cloud observed by a depth camera, the invention enables the robot to accurately grasp different kinds of objects in an unstructured environment, improving the safety and reliability of intelligent robot manipulation and interaction with the outside world.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in such a combination, it should be considered within the scope of this specification.
The above examples only show some embodiments of the present invention, and their description is relatively specific and detailed, but this should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.
It is to be understood that the above description is intended to be illustrative, and not restrictive. Many embodiments and many applications other than the examples provided will be apparent to those of skill in the art upon reading the above description. The scope of the present teachings should, therefore, be determined not with reference to the above description, but with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled. The disclosures of all articles and references, including patent applications and publications, are incorporated herein by reference for all purposes. The omission from the foregoing claims of any aspect of subject matter disclosed herein is not a disclaimer of such subject matter, nor should it be regarded that the applicant did not consider such subject matter to be part of the disclosed subject matter.

Claims (8)

1. An intelligent robot grasping method based on three-dimensional vision, characterized by comprising the following steps:
taking an observation point cloud containing a target object as input, and evaluating the grasping confidence of each point in the observation point cloud with a point grasp confidence evaluation network built on a deep convolutional network, to obtain points suitable for serving as grasp centers;
wherein the grasp confidence evaluation network is established as follows:
feeding the input observation point cloud into a trained point cloud feature extraction network, wherein the training loss is the classification loss of the grasping confidence of each point, and the network parameters are optimized by minimizing this loss function to obtain the grasp confidence evaluation network model;
and the points suitable for serving as grasp centers are obtained by the following specific process:
encoding the entire input point cloud into group features through the trained point cloud feature extraction network, and decoding the group features into point-wise features by distance-based interpolation, thereby completing the feature extraction of the input point cloud; then segmenting the extracted per-point features through a trained binary segmentation network, each point predicted as a positive class being suitable for serving as a grasp center;
taking the features of grasp regions as input, and detecting object grasps with a regional grasp detection network based on a grasp anchor mechanism;
taking the fused features of the gripper closing region and the grasp region as input, and refining the detected grasps with a grasp refinement network to obtain refined grasps;
and selecting the grasp with the highest grasp quality score, and obtaining its position and orientation in the robot coordinate system through a coordinate transformation, so as to plan the gripper pose of the robot and complete the robot operation.
2. The method of claim 1,
the observation point cloud of the current scene is acquired by a Kinect sensor.
3. The method of claim 1,
the object grasp detection predicts the corresponding grasps through the regional grasp detection network, using the points predicted as suitable grasp centers as regression centers.
4. The method of claim 3,
the specific process of predicting the corresponding grasps is as follows:
selecting, by farthest point sampling, k1 regression points that cover as many different structures as possible, to obtain k1 grasp regions; introducing grasp references with preset grasp directions in each grasp region; taking the features of each grasp region as input, regressing the deviations of the grasp position, direction, and angle relative to the preset grasp references through a max pooling layer and a multilayer perceptron; and combining the regressed deviations with the classification results of the preset references to obtain k1 detected grasp proposals.
5. The method of claim 4,
the classification and regression of the grasp references first require matching the labeled grasps with the preset grasp references; when the difference between the grasp directions of a preset reference and a labeled grasp is smaller than a specified threshold, the match succeeds, the matched preset reference is assigned a positive label, and the residual to be regressed is the difference between the labeled grasp and the preset reference corresponding to the positive label; the training loss comprises the classification loss of the preset grasp references and the regression loss of the deviations of the grasp position, direction, and angle relative to the preset references; and the network is trained by minimizing this loss function to obtain the model of the regional grasp detection network.
6. The method of claim 1,
the refinement yields more accurate grasps by the following specific process:
selecting, from the obtained grasp proposals, the k2 proposals whose corresponding gripper closing regions contain more than a certain number of points for refinement; converting the points in each selected closing region from the world coordinate system to the grasp coordinate system by a coordinate transformation, and obtaining the features of the k2 closing regions through a multilayer perceptron; fusing the closing-region features with the grasp-region features to obtain the corresponding fused features; and taking the fused features as input, regressing the deviations of the grasp position, direction, and angle relative to the grasp proposals through a max pooling layer and a multilayer perceptron, to finally obtain k2 refined predicted grasps.
7. The method of claim 6,
the fused features of the grasp-region features and the gripper closing-region features are taken as input and fed into a max pooling layer and a multilayer perceptron; during training, the k2 predicted grasp proposals are first classified, and a predicted proposal is assigned a positive label if it is close to the corresponding labeled grasp; and when regressing the grasp deviations, only the position, direction, and angle of the proposals assigned positive labels are optimized.
8. The method of claim 7,
a proposal is close to a labeled grasp when the differences in grasp direction and grasp angle are less than 2π/9 and π/3, respectively.
CN202010652696.1A 2020-07-08 2020-07-08 Intelligent robot grabbing method based on three-dimensional vision Active CN111906782B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010652696.1A CN111906782B (en) 2020-07-08 2020-07-08 Intelligent robot grabbing method based on three-dimensional vision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010652696.1A CN111906782B (en) 2020-07-08 2020-07-08 Intelligent robot grabbing method based on three-dimensional vision

Publications (2)

Publication Number Publication Date
CN111906782A CN111906782A (en) 2020-11-10
CN111906782B true CN111906782B (en) 2021-07-13

Family

ID=73227691

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010652696.1A Active CN111906782B (en) 2020-07-08 2020-07-08 Intelligent robot grabbing method based on three-dimensional vision

Country Status (1)

Country Link
CN (1) CN111906782B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112297013B (en) * 2020-11-11 2022-02-18 浙江大学 Robot intelligent grabbing method based on digital twin and deep neural network
CN113065392A (en) * 2021-02-24 2021-07-02 苏州盈科电子有限公司 Robot tracking method and device
CN113674348B (en) * 2021-05-28 2024-03-15 中国科学院自动化研究所 Object grabbing method, device and system
CN115249333B (en) * 2021-06-29 2023-07-11 达闼科技(北京)有限公司 Grabbing network training method, grabbing network training system, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108247635A (en) * 2018-01-15 2018-07-06 北京化工大学 A kind of method of the robot crawl object of deep vision
CN109079786A (en) * 2018-08-17 2018-12-25 上海非夕机器人科技有限公司 Mechanical arm grabs self-learning method and equipment
CN109159113A (en) * 2018-08-14 2019-01-08 西安交通大学 A kind of robot manipulating task method of view-based access control model reasoning
CN109919151A (en) * 2019-01-30 2019-06-21 西安交通大学 A kind of robot vision reasoning grasping means based on ad-hoc network
CN110211180A (en) * 2019-05-16 2019-09-06 西安理工大学 A kind of autonomous grasping means of mechanical arm based on deep learning
CN108021891B (en) * 2017-12-05 2020-04-14 广州大学 Vehicle environment identification method and system based on combination of deep learning and traditional algorithm

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6983524B2 (en) * 2017-03-24 2021-12-17 キヤノン株式会社 Information processing equipment, information processing methods and programs
JP6937995B2 (en) * 2018-04-05 2021-09-22 オムロン株式会社 Object recognition processing device and method, and object picking device and method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108021891B (en) * 2017-12-05 2020-04-14 广州大学 Vehicle environment identification method and system based on combination of deep learning and traditional algorithm
CN108247635A (en) * 2018-01-15 2018-07-06 北京化工大学 A kind of method of the robot crawl object of deep vision
CN109159113A (en) * 2018-08-14 2019-01-08 西安交通大学 A kind of robot manipulating task method of view-based access control model reasoning
CN109079786A (en) * 2018-08-17 2018-12-25 上海非夕机器人科技有限公司 Mechanical arm grabs self-learning method and equipment
CN109919151A (en) * 2019-01-30 2019-06-21 西安交通大学 A kind of robot vision reasoning grasping means based on ad-hoc network
CN110211180A (en) * 2019-05-16 2019-09-06 西安理工大学 A kind of autonomous grasping means of mechanical arm based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Robot grasping method for multi-object stacking scenes based on visual reasoning (基于视觉推理的机器人多物体堆叠场景抓取方法); Lan Xuguang et al.; Scientia Sinica Technologica (《中国科学:技术科学》); 2018-11-23; Vol. 48, No. 12, pp. 1341-1356 *

Also Published As

Publication number Publication date
CN111906782A (en) 2020-11-10

Similar Documents

Publication Publication Date Title
CN111906782B (en) Intelligent robot grabbing method based on three-dimensional vision
CN113450408B (en) Irregular object pose estimation method and device based on depth camera
CN112297013B (en) Robot intelligent grabbing method based on digital twin and deep neural network
CN111080693A (en) Robot autonomous classification grabbing method based on YOLOv3
CN109159113B (en) Robot operation method based on visual reasoning
CN115816460B (en) Mechanical arm grabbing method based on deep learning target detection and image segmentation
CN110929795B (en) Method for quickly identifying and positioning welding spot of high-speed wire welding machine
CN109284779A (en) Object detection method based on deep full convolution network
CN114299150A (en) Depth 6D pose estimation network model and workpiece pose estimation method
CN114693661A (en) Rapid sorting method based on deep learning
WO2023124734A1 (en) Object grabbing point estimation method, apparatus and system, model training method, apparatus and system, and data generation method, apparatus and system
CN110969660A (en) Robot feeding system based on three-dimensional stereoscopic vision and point cloud depth learning
CN115330734A (en) Automatic robot repair welding system based on three-dimensional target detection and point cloud defect completion
CN111598172A (en) Dynamic target grabbing posture rapid detection method based on heterogeneous deep network fusion
Laili et al. Custom grasping: A region-based robotic grasping detection method in industrial cyber-physical systems
Kayhan et al. Hallucination in object detection—a study in visual part verification
Yu et al. A novel robotic pushing and grasping method based on vision transformer and convolution
Cheng et al. Anchor-based multi-scale deep grasp pose detector with encoded angle regression
CN117058476A (en) Target detection method based on random uncertainty
CN113420839B (en) Semi-automatic labeling method and segmentation positioning system for stacking planar target objects
Shi et al. A fast workpiece detection method based on multi-feature fused SSD
Li et al. Robot vision model based on multi-neural network fusion
Yu et al. A cascaded deep learning framework for real-time and robust grasp planning
CN113359738A (en) Mobile robot path planning method based on deep learning
Suzui et al. Toward 6 dof object pose estimation with minimum dataset

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant