CN112489117B - Robot grabbing pose detection method based on domain migration under single-view-point cloud - Google Patents

Robot grabbing pose detection method based on domain migration under single-view-point cloud

Info

Publication number
CN112489117B
Authority
CN
China
Prior art keywords
point cloud
pose
grabbing
domain
view
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011418811.5A
Other languages
Chinese (zh)
Other versions
CN112489117A (en)
Inventor
钱堃
景星烁
柏纪伸
赵永强
施克勤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University
Priority to CN202011418811.5A
Publication of CN112489117A
Application granted
Publication of CN112489117B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
                • G06T 7/00 Image analysis
                    • G06T 7/70 Determining position or orientation of objects or cameras
                • G06T 2207/00 Indexing scheme for image analysis or image enhancement
                    • G06T 2207/10 Image acquisition modality
                        • G06T 2207/10028 Range image; Depth image; 3D point clouds
                    • G06T 2207/20 Special algorithmic details
                        • G06T 2207/20024 Filtering details
                        • G06T 2207/20081 Training; Learning
                        • G06T 2207/20112 Image segmentation details
                        • G06T 2207/20132 Image cropping

Abstract

The invention discloses a robot grabbing pose detection method based on domain migration under single-view-point cloud, which comprises the following steps: 1) acquiring the single-view point cloud of the robot grabbing scene with a depth camera; 2) preprocessing the collected point cloud; 3) sampling uniformly and randomly on the target point cloud, computing local frames, and obtaining candidate grabbing poses; 4) defining a new coordinate system at the gripper center and encoding each grabbing pose into a multi-channel projection image; 5) constructing a grabbing pose evaluation model that takes the multi-channel grabbing images as input and, based on a generative adversarial network, realizes unsupervised domain-adaptive migration from the simulation domain to the physical domain; 6) constructing a large-scale simulation object data set and a real object data set, automatically labeling them with a force-closure-based grabbing detection method, and forming a training set and a test set. The method relieves the cost of data acquisition and labeling through unsupervised domain migration and generalizes to unknown and irregular objects.

Description

Robot grabbing pose detection method based on domain migration under single-view-point cloud
Technical Field
The invention relates to the field of grabbing detection in robot operation skill learning, and in particular to a robot grabbing pose detection method based on domain migration under single-view-point cloud.
Background
With the development of artificial intelligence, dramatic progress has also occurred in the field of robotics. At the present stage an intelligent robot is expected to perceive its environment and interact with it; in robot operation skill learning, environment perception is an indispensable part, and giving robots this perception capability is a long-term goal of computer vision and robotics. Among robot operation skills, grabbing can bring great benefit to society, for example completing pick-and-place tasks that involve heavy human labor, or helping disabled or elderly people with daily grabbing tasks, so it is the most basic and most important skill. The key technology is detection of the robot grabbing pose. For 6-DoF grabbing pose detection, the traditional pose estimation approach needs to compute the pose of the object in the scene in advance and determine the grabbing pose from a known CAD model in a model library, yet objects in real scenes are usually unknown and accurate CAD models are not easy to obtain. The current mainstream approach is two-stage grabbing detection: candidate grabbing poses are first generated and then evaluated, mainly with point-cloud-based deep learning methods. Such methods still face difficulties and challenges: the point cloud acquired by the sensor is a single-view cloud that is noisy and largely incomplete, so grabbing detection algorithms are difficult to generalize; the data sets used for training are obtained through cumbersome reconstruction, some scenes are difficult to acquire, the amount of data is limited, large-scale construction and labeling are difficult, and large-scale data sets generated in simulation environments are not fully utilized.
A typical example of the conventional model-matching-based methods is the ROS grasping framework proposed by Chitta et al. (see "Chitta S, Jones E G, Ciocarlie M and Hsiao K, Perception, planning, and execution for mobile manipulation in unstructured environments, IEEE Robotics and Automation Magazine 2012"), which registers the CAD model of a known object onto the point cloud and then plans a feasible grasping route. The most classical two-stage algorithm was proposed by ten Pas et al. (see "ten Pas A and Platt R, Using geometry to detect grasp poses in 3D point clouds, Proceedings of the International Symposium on Robotics Research 2015"): a series of grasp candidates is first sampled under geometric constraints, the projected images are then encoded with HOG features, and a support vector machine evaluates the grasps. Subsequently, ten Pas et al. optimized this method (see "ten Pas A, Gualtieri M, Saenko K and Platt R, Grasp pose detection in point clouds, IJRR 2017"), replacing the classifier applied to the pose projection images with a deep learning method and improving grabbing detection performance. Many later studies build on this method, but sample labeling and the domain migration problem in point-cloud-based algorithms are not considered.
At present, robot grabbing pose detection schemes that address domain migration are still lacking; the few migration methods used in robotics live inside two-dimensional image detection frameworks, and some rely on domain augmentation strategies. In computer vision, transfer learning has developed rapidly, and in the current wave of deep learning it focuses mainly on domain adaptation of features. Solving the migration problem in grabbing detection with such advanced domain adaptation techniques can therefore greatly reduce the cost of sample collection and labeling, make full use of cheap training data from simulation environments, and improve the generalization of the model.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the above problems in the prior art, the invention provides a robot grabbing pose detection method based on domain migration under single-view-point cloud, which realizes unsupervised feature adaptation from the simulation domain to the real-object domain, thereby reducing the cost of sample labeling, making full use of simulation data and improving the generalization of the model to new objects.
The technical scheme is as follows: the invention adopts the following technical scheme:
a robot grabbing pose detection method based on domain migration under single-view-point cloud comprises the following steps:
step 1, acquiring an original point cloud of a desktop grabbing scene through a depth camera;
step 2, preprocessing the original scene point cloud, where the processing flow comprises spatial cropping, plane extraction and outlier filtering, to obtain the target point cloud set of the object to be grabbed;
step 3, calculating candidate grabbing poses from the target point cloud: uniform random sampling is first performed in the target point cloud set, a Darboux local frame is computed in the neighborhood of each sampling point to construct a basic pose, the poses are then expanded by searching a two-dimensional grid of rotations and translations, the deepest grabbing pose is computed and its validity judged, and the final candidate grabbing poses are obtained;
step 4, encoding the candidate grabbing poses: the point cloud coordinate system is converted to the center of the gripper and each grabbing pose is encoded into a multi-channel image by projection;
step 5, constructing a domain-migration grabbing pose evaluation model with a deep learning framework: the simulation data set and the real-object data set are labeled by force closure and collision detection, and the model is trained to predict a probability for each multi-channel image;
step 6, ranking the predicted probabilities of all candidate grabbing poses and selecting the pose with the highest probability as the pose for the robot to grab.
The scene data in the step 1 is point cloud data acquired under a single visual angle fixed by a depth camera, and the scene type is the grabbing of a single object on a desktop by a robot.
The spatial cropping operation cuts away everything except the desktop and the object point cloud, the plane extraction operation removes the desktop points, and the outlier filtering removes the outliers left after cropping and plane extraction.
The uniform random sampling in step 3 generates random numbers with a congruential method to obtain point cloud indices.
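As a concrete illustration of this sampling step, the following minimal Python sketch draws point cloud indices with a linear congruential generator; the modulus, multiplier, increment and seed shown are illustrative values, not taken from the patent:

```python
def lcg_indices(num_points, num_samples, seed=1, a=1103515245, c=12345, m=2 ** 31):
    # Linear congruential generator: x_{k+1} = (a * x_k + c) mod m.
    # Each draw is mapped onto [0, num_points) to index the target point cloud.
    x, indices = seed, []
    for _ in range(num_samples):
        x = (a * x + c) % m
        indices.append(x % num_points)
    return indices

sample_idx = lcg_indices(num_points=20000, num_samples=100)  # fixed number of samples
```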
In step 3, the neighborhood used to compute the Darboux local frame centered at each sampling point when constructing the basic pose is the region inside a sphere of radius r.
The candidate grabbing pose acquiring process in the step 3 comprises the following steps:
(1) Randomly and uniformly sampling on the target point cloud to obtain a sampling point set;
(2) Calculating a basic frame in the neighborhood of each sampling point to obtain a basic pose;
(3) Carrying out pose expansion by rotating and translating a basic pose;
(4) Performing collision detection on each expanded pose, and calculating the deepest grabbing pose;
(5) Eliminating invalid grabbing candidates by judging whether the gripper closing region corresponding to the deepest grabbing pose contains points or not.
In step 5, the simulation data set is the 3D-Net data set, and the real-object data set is obtained by three-dimensional reconstruction with the KinectFusion algorithm.
The domain-migration grabbing pose evaluation model in step 5 comprises a weight-adaptive basic convolutional network, lightweight fully-connected layers and a generative adversarial network; model training adopts an alternating optimization mode, and the training steps are as follows:
(1) The input data comprises a simulation data set and a real object data set, and forward calculation is carried out on all the networks;
(2) Updating a discriminator D, calculating the cross entropy loss of the simulation data set, the domain discrimination loss of the simulation data, the domain discrimination loss of the simulation generated data and the domain discrimination loss of the real object generated data, and adding a gradient penalty item to inhibit mode collapse;
(3) Updating a generator G, and using domain discrimination loss and label classification cross entropy loss after the simulation data set passes through the generator and the discriminator;
(4) Updating a classifier C, and only using the classified cross entropy loss of the simulation data set after passing through the F network and the C network;
(5) Updating the feature extractor F, using the simulation classification loss, the label classification cross-entropy loss of the simulation-generated data and the domain discrimination loss of the real-object-generated data.
The loss function used to optimize the discriminator D is:

    L_D = (1/N) Σ_n [ L_ce(D_c(s_n), y_n) + L_dom(D_rf(s_n)) + L_dom(D_rf(G(F_csn))) + L_dom(D_rf(G(F_ctn))) ] + L_gp

where N represents the number of samples in the batch, D_c and D_rf represent the label classification branch and the domain discrimination branch of the discriminator, s_n is a simulation sample with grabbing label y_n, F_csn and F_ctn represent the merged simulation-feature input and the merged real-object-feature input of the generator, L_ce is the cross-entropy classification loss, L_dom is the domain discrimination loss and L_gp is the gradient penalty term;

the loss function used to optimize the generator G is:

    L_G = (1/N) Σ_n [ L_dom(D_rf(G(F_csn))) + L_ce(D_c(G(F_csn)), y_n) ]

the loss function used to optimize the classifier C is:

    L_C = (1/N) Σ_n L_ce(C(f_sn), y_n)

where f_sn represents the simulation-feature input of the classifier;

the loss function used to optimize the feature extractor F is:

    L_F = L_C + α · (1/N) Σ_n L_ce(D_c(G(F_csn)), y_n) + γ · (1/N) Σ_n L_dom(D_rf(G(F_ctn)))

where α and γ represent the balance weights of the two loss parts, respectively.
In the feature extraction network, an SE (squeeze-and-excitation) structure is added so that channel weights adapt automatically and model performance improves; multi-layer convolution operations reduce the number of fully-connected layers, which effectively suppresses over-fitting.
Beneficial effects: compared with the prior art, the robot grabbing pose detection method based on domain migration under single-view-point cloud has the following advantages:
1. The single-view point cloud is acquired with a fixed-viewpoint RGB-D sensor, which simplifies sensor installation and reduces cost; the algorithm detects 6-DoF grabbing poses, meets the requirements of three-dimensional robot grabbing tasks, and is more practical than planar grabbing.
2. The only required input is the single-view point cloud; the algorithm copes with incomplete, noisy data, candidate grasp generation is simple and effective, and encoding the grabbing pose as a multi-channel image allows a deep learning model to be used effectively for grabbing stability prediction.
3. To address the difficulty of collecting real-object-domain samples, the method adopts domain adaptation techniques from transfer learning: the model is trained simultaneously on the simulation-domain data set and an unlabeled real-object-domain data set, adversarial learning with a generative adversarial network extracts domain-adaptive features, and grabbing stability prediction on real-object-domain data is realized. Massive simulation data can thus be used effectively, the physical domain needs no labeling, complex scanning-and-reconstruction tasks are avoided, generalization improves, and the method is economical and practical.
4. An adaptive-weight method is used in the domain-adaptive grabbing detection model to recalibrate the convolutional features, the number of fully-connected layers is reduced, and over-fitting during model training is effectively suppressed.
Drawings
FIG. 1 is an overall flow chart of the disclosed method;
FIG. 2 is a schematic diagram of point cloud pre-processing;
FIG. 3 is a schematic diagram of candidate grabbing pose encoding;
FIG. 4 is an architecture diagram of the domain-migration grabbing detection model;
FIG. 5 shows a grabbing pose and the corresponding three-channel grabbing image;
FIG. 6 shows simulation-like images generated by the generator in the domain-migration grabbing detection model.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described below with reference to the accompanying drawings.
As shown in fig. 1, the overall flowchart of the disclosed method, the invention discloses a robot grabbing pose detection method based on domain migration under single-view-point cloud, which mainly comprises six steps: step 1, acquiring a point cloud image of the grabbing scene from a fixed, single viewpoint; step 2, point cloud preprocessing, extracting the target-object point cloud by spatial cropping, plane extraction and outlier filtering; step 3, candidate grabbing pose generation: sampling uniformly and randomly on the target point cloud set, computing a basic frame in a spherical neighborhood of each sampling point, expanding the basic poses by rotation and translation, and computing the deepest grabbing pose of each expanded pose to obtain the candidate grabbing poses; step 4, encoding each candidate pose into a multi-channel image by projection, with the coordinate system changed from the original point cloud frame to the gripper center; step 5, constructing a domain-migration grabbing evaluation model, training simultaneously on the labeled simulation domain and the unlabeled real-object-domain data set, and applying the trained model directly to real-object-domain multi-channel images to obtain grabbing confidence predictions; step 6, selecting the pose corresponding to the image with the highest confidence as the final pose executed by the robot.
Implementing the invention requires an RGB-D depth sensor and a GPU; the specific implementation uses a desktop computer with a GeForce 2080 GPU and a Kinect V1 depth camera.
The method disclosed by the invention specifically comprises the following steps:
step 1, point cloud data under a single object desktop grabbing scene are obtained;
and (3) acquiring point clouds of the captured scene by using an RGB-D depth camera, and setting the point clouds to be acquired under a single fixed visual angle, wherein only point cloud information is required to be utilized in the method.
Step 2, point cloud preprocessing;
As shown in fig. 2, the whole scene point cloud is first cropped to remove the surrounding clutter, the desktop points are then removed by plane extraction, and finally outlier filtering removes the remaining outliers.
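A minimal sketch of this preprocessing chain, written with the Open3D library, is shown below; the file name, crop-box bounds, RANSAC parameters and outlier-filter settings are illustrative assumptions rather than the patent's values:

```python
import numpy as np
import open3d as o3d

pcd = o3d.io.read_point_cloud("scene.pcd")     # single-view scene cloud from the depth camera

# 1) Spatial cropping: keep only the tabletop workspace in front of the camera.
box = o3d.geometry.AxisAlignedBoundingBox(np.array([-0.4, -0.4, 0.3]),
                                          np.array([0.4, 0.4, 1.2]))
pcd = pcd.crop(box)

# 2) Plane extraction: fit the table plane with RANSAC and discard its inliers.
_, table_idx = pcd.segment_plane(distance_threshold=0.01, ransac_n=3, num_iterations=1000)
objects = pcd.select_by_index(table_idx, invert=True)

# 3) Outlier filtering: drop sparse points left over from cropping and plane removal.
objects, _ = objects.remove_statistical_outlier(nb_neighbors=20, std_ratio=2.0)
```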
Step 3, generating candidate grabbing poses;
a schematic diagram of the gripper coordinate system is shown in fig. 3. And uniformly and randomly sampling on the target point cloud set, estimating a basic frame by taking a sampling point as a center, expanding by rotating and translating, and calculating the deepest grabbing so as to obtain the final candidate grabbing pose.
The step 3 specifically comprises the following 5 sub-steps, and the specific implementation method is as follows:
(311) Uniformly and randomly sampling by randomly generating point cloud index numbers on the target point cloud, wherein the sampling quantity is fixed;
(312) For each sampling point a spherical neighborhood of radius r is established, all points inside the sphere are collected, and the basic frame (Darboux frame) is computed from the matrix

    M(p) = Σ_{q ∈ B_r(p) ∩ C} n(q) n(q)^T

where p denotes the sampling point, n(q) and n(q)^T denote the surface normal vector of a neighboring point q and its transpose, C denotes the target point cloud, and B_r(p) denotes the sphere of radius r centered on the sampling point. The eigenvector corresponding to the minimum eigenvalue of M(p) estimates the normal vector v_1(p) of the sampling point, the eigenvector corresponding to the largest eigenvalue estimates the minimum principal curvature direction v_3(p), and the maximum principal curvature direction v_2(p) is obtained by orthogonality. This yields the basic frame F(p) = [v_1(p), v_2(p), v_3(p)], i.e. the initial orientation of the gripper, while the initial position of the gripper is the coordinate of the sampling point p, so the initial gripper pose is h(p) = [p_x, p_y, p_z, v_1(p), v_2(p), v_3(p)];
(313) Each initial gripper pose is expanded by rotation about the z-axis and translation along the y-axis; a two-dimensional grid (φ, y) is constructed and searched, and every expanded gripper pose h_{φ,y}(p) is obtained by right-multiplying the initial gripper pose by the corresponding transformation matrix;
(314) Each expanded pose is moved along the positive x direction to find the minimum x* satisfying the collision-free condition, which gives the new, deepest grabbing pose;
(315) Invalid poses are eliminated by checking whether the gripper closing region of each new pose contains points, yielding all candidate poses (a schematic sketch of the basic-frame computation in sub-step (312) follows this list).
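The basic-frame computation referenced in sub-step (312) can be sketched in NumPy as follows; the neighborhood radius is an assumed value, the surface normals are assumed to be precomputed, and the axis assignment follows the patent text:

```python
import numpy as np

def basic_frame(points, normals, p, radius=0.02):
    # Neighborhood B_r(p): points of the target cloud inside a sphere of radius r around p.
    nbr = normals[np.linalg.norm(points - p, axis=1) < radius]
    M = nbr.T @ nbr                        # M(p) = sum_q n(q) n(q)^T, a 3x3 matrix
    w, V = np.linalg.eigh(M)               # eigenvalues in ascending order
    v1 = V[:, 0]                           # minimum eigenvalue -> normal axis v1(p) (per the patent text)
    v3 = V[:, 2]                           # maximum eigenvalue -> minimum principal curvature axis v3(p)
    v2 = np.cross(v3, v1)                  # remaining axis obtained by orthogonality
    return np.column_stack([v1, v2, v3])   # F(p) = [v1(p), v2(p), v3(p)]

# Initial gripper pose at a sampled point p: position p plus orientation F(p),
# i.e. h(p) = [p_x, p_y, p_z, v1(p), v2(p), v3(p)].
```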
Step 4, as shown in fig. 4, each candidate grabbing pose is encoded as follows: the point cloud inside the gripper closing region of the pose is extracted, the coordinate system is transformed to take the current grabbing pose as reference, and the points are projected along several axial directions to encode a multi-channel image.
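A simplified sketch of such an encoding is given below for a single projection axis (the patent projects along several axes); the image size, region extent and channel definitions are illustrative assumptions:

```python
import numpy as np

def encode_projection(points_gripper, size=60, extent=0.10):
    """Project the points inside the gripper closing region, already expressed in the
    gripper-centered frame, onto one image plane with three illustrative channels."""
    img = np.zeros((size, size, 3), dtype=np.float32)
    uv = ((points_gripper[:, :2] / extent + 0.5) * (size - 1)).astype(int)   # x, y -> pixels
    uv = np.clip(uv, 0, size - 1)
    for (u, v), z in zip(uv, points_gripper[:, 2]):
        img[v, u, 0] = 1.0                       # channel 0: occupancy of the projection
        img[v, u, 1] = max(img[v, u, 1], z)      # channel 1: height along the projection axis
        img[v, u, 2] += 1.0                      # channel 2: point density
    img[..., 2] /= max(img[..., 2].max(), 1.0)   # normalize the density channel
    return img
```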
Step 5, as shown in fig. 5, a grabbing pose evaluation model based on domain migration is constructed. In the training stage, domain-adaptive features are extracted from simulation-domain and physical-domain data jointly with a generative adversarial network: the features extracted from the two domains are constrained so that the images produced by the generator from these features (shown in fig. 6) align with the simulation domain, mapping the two domains to the same distribution. In the testing stage only the F network and the C network are needed: multi-channel grabbing images from the physical domain are taken as input, probability prediction is performed by the domain-migration grabbing pose evaluation model, and the confidence that each candidate pose can be grasped successfully is obtained.
In step 5 there are two training data sets: one is an object set constructed from the 3D-Net model library, used as the simulation domain, and the other is an object set obtained by three-dimensional reconstruction with the KinectFusion algorithm, used as the real domain. Candidate sampling, encoding and the other operations above are applied to both, and grabbing labels are produced by force-closure detection.
In addition, the optimization mode during model training adopts an alternate optimization mode, and the training process comprises the following substeps:
(511) Inputting data comprising a simulation data set and a real object data set, and then carrying out forward calculation on the network;
(512) The discriminator D is updated: the cross-entropy loss and the domain discrimination loss of the simulation data set and the domain adversarial losses of the generated data are computed, and a gradient penalty term is added to suppress mode collapse; the optimized loss is

    L_D = (1/N) Σ_n [ L_ce(D_c(s_n), y_n) + L_dom(D_rf(s_n)) + L_dom(D_rf(G(F_csn))) + L_dom(D_rf(G(F_ctn))) ] + L_gp

where N represents the number of samples in the batch, D_c and D_rf represent the discriminator label classification branch and domain discrimination branch, s_n is a simulation sample with grabbing label y_n, F_csn and F_ctn represent the merged simulation-feature input and the merged real-domain-feature input of the generator, L_ce is the cross-entropy classification loss, L_dom is the domain discrimination loss and L_gp is the gradient penalty term;
(513) The generator G is updated using the domain discrimination loss and the label classification loss obtained after the simulation data pass through the generator and the discriminator; the optimized loss is

    L_G = (1/N) Σ_n [ L_dom(D_rf(G(F_csn))) + L_ce(D_c(G(F_csn)), y_n) ]

(514) The classifier C is updated using only the classification loss of the simulation data after the F network and the C network; the optimized loss is

    L_C = (1/N) Σ_n L_ce(C(f_sn), y_n)

where f_sn represents the simulation-feature input of the classifier;
(515) The feature extractor F is updated using the simulation classification loss and the domain adversarial loss of the data generated from the real domain, with multi-loss weighting; the loss is

    L_F = L_C + α · (1/N) Σ_n L_ce(D_c(G(F_csn)), y_n) + γ · (1/N) Σ_n L_dom(D_rf(G(F_ctn)))

where α and γ represent the balance weights of the two loss parts, respectively (a schematic sketch of one such alternating update follows this list).
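The following PyTorch sketch illustrates one alternating update under several simplifying assumptions: F_net, G, C and D are assumed nn.Module instances (D returning a pair of label logits and a real/fake logit), the domain discrimination terms use binary cross-entropy, the gradient penalty term is omitted, the generator is fed features directly rather than the merged feature inputs of the patent, and alpha and gamma are illustrative weights:

```python
import torch
import torch.nn.functional as F_nn

def train_step(F_net, G, C, D, opt_D, opt_G, opt_C, opt_F,
               sim_x, sim_y, real_x, alpha=0.1, gamma=0.1):
    f_sim, f_real = F_net(sim_x), F_net(real_x)            # forward pass on both domains

    # (512) update discriminator D: classify simulation labels and tell real simulation
    # images apart from images generated from simulation and real-object features.
    opt_D.zero_grad()
    cls_s, rf_s = D(sim_x)
    _, rf_gs = D(G(f_sim.detach()))
    _, rf_gt = D(G(f_real.detach()))
    loss_D = (F_nn.cross_entropy(cls_s, sim_y)
              + F_nn.binary_cross_entropy_with_logits(rf_s, torch.ones_like(rf_s))
              + F_nn.binary_cross_entropy_with_logits(rf_gs, torch.zeros_like(rf_gs))
              + F_nn.binary_cross_entropy_with_logits(rf_gt, torch.zeros_like(rf_gt)))
    loss_D.backward(); opt_D.step()                        # gradient penalty omitted in this sketch

    # (513) update generator G: generated simulation images should fool D and keep their labels.
    opt_G.zero_grad()
    cls_gs, rf_gs = D(G(f_sim.detach()))
    loss_G = (F_nn.binary_cross_entropy_with_logits(rf_gs, torch.ones_like(rf_gs))
              + F_nn.cross_entropy(cls_gs, sim_y))
    loss_G.backward(); opt_G.step()

    # (514) update classifier C with the simulation classification loss only.
    opt_C.zero_grad()
    loss_C = F_nn.cross_entropy(C(f_sim.detach()), sim_y)
    loss_C.backward(); opt_C.step()

    # (515) update feature extractor F: simulation classification loss plus the weighted
    # classification loss of simulation-generated data and domain loss of real-generated data.
    opt_F.zero_grad()
    cls_gs, _ = D(G(f_sim))
    _, rf_gt = D(G(f_real))
    loss_F = (F_nn.cross_entropy(C(f_sim), sim_y)
              + alpha * F_nn.cross_entropy(cls_gs, sim_y)
              + gamma * F_nn.binary_cross_entropy_with_logits(rf_gt, torch.ones_like(rf_gt)))
    loss_F.backward(); opt_F.step()
```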
In addition, an SE (squeeze-and-excitation) structure is added to the feature extraction network so that channel weights adapt automatically and model performance improves; multi-layer convolution operations reduce the number of fully-connected layers, effectively suppressing over-fitting.
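The SE structure referred to here is the standard squeeze-and-excitation block; a PyTorch sketch is given below, where the reduction ratio is an assumed value:

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: pool the feature map globally, learn per-channel weights,
    and rescale the channels of the convolutional output (adaptive channel weighting)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                     # squeeze: global spatial average
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())   # excitation

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                            # reweight each channel
```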
Step 6, the candidates are ranked by the probabilities predicted by the model of step 5, and the pose corresponding to the grabbing image with the highest probability of stably grabbing the object is selected as the grabbing pose executed by the robot.
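Selecting the execution pose from the predicted probabilities then reduces to a ranking step; a minimal sketch with illustrative variable names:

```python
import numpy as np

def select_best_grasp(poses, probs):
    """Return the candidate pose with the highest predicted grasp-success probability."""
    order = np.argsort(probs)[::-1]        # rank candidates from most to least confident
    return poses[order[0]], float(probs[order[0]])
```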

Claims (9)

1. A robot grabbing pose detection method based on domain migration under single-view-point cloud is characterized by comprising the following steps:
step 1, acquiring an original point cloud of a desktop grabbing scene through a depth camera;
step 2, preprocessing the original scene point cloud, where the processing flow comprises spatial cropping, plane extraction and outlier filtering, to obtain the target point cloud set of the object to be grabbed;
step 3, calculating candidate grabbing poses from the target point cloud: uniform random sampling is first performed in the target point cloud set, a Darboux local frame is computed in the neighborhood of each sampling point to construct a basic pose, the poses are then expanded by searching a two-dimensional grid of rotations and translations, the deepest grabbing pose is computed and its validity judged, and the final candidate grabbing poses are obtained;
step 4, encoding the candidate grabbing poses: the point cloud coordinate system is converted to the center of the gripper and each grabbing pose is encoded into a multi-channel image by projection;
step 5, constructing a domain-migration grabbing pose evaluation model with a deep learning framework: the simulation data set and the real-object data set are labeled by force closure and collision detection, and the model is trained to predict a probability for each multi-channel image;
step 6, ranking the predicted probabilities of all candidate grabbing poses and selecting the pose with the highest probability as the pose for the robot to grab.
2. The method for detecting the grabbing pose of the robot based on the domain migration under the single-view-point cloud according to claim 1, wherein the scene data in the step 1 is point cloud data acquired under a single view angle fixed by a depth camera, and the scene type is grabbing of a single object on a desktop by the robot.
3. The method for detecting the grabbing pose of the robot based on the domain migration under the single-view-point cloud as claimed in claim 1, wherein the spatial cropping operation cuts away everything except the desktop and the object point cloud, the plane extraction operation removes the desktop points, and the outlier filtering removes the outliers left after cropping and plane extraction.
4. The method for detecting the grabbing pose of the robot based on the domain migration under the single-view-point cloud as claimed in claim 1, wherein the uniform random sampling in the step 3 is to use a congruence method to generate a random number to obtain the point cloud index.
5. The method for detecting the grabbing pose of the robot based on the domain migration under the single-view-point cloud according to claim 1, wherein, in step 3, the neighborhood used to compute the Darboux local frame centered at each sampling point when constructing the basic pose is the region inside a sphere of radius r.
6. The method for detecting the grabbing pose of the robot based on the domain migration under the single-view-point cloud according to claim 1, wherein the candidate grabbing pose obtaining process in the step 3 is as follows:
(1) Randomly and uniformly sampling on the target point cloud to obtain a sampling point set;
(2) Calculating a basic frame in the neighborhood of each sampling point to obtain a basic pose;
(3) Carrying out pose expansion by rotating and translating a basic pose;
(4) Performing collision detection on each expanded pose, and calculating the deepest grabbing pose;
(5) Eliminating invalid grabbing candidates by judging whether the gripper closing region corresponding to the deepest grabbing pose contains points or not.
7. The method for detecting the grabbing pose of the robot based on the domain migration under the single-view-point cloud as claimed in claim 1, wherein the simulation data set in step 5 is the 3D-Net data set, and the real-object data set is obtained by three-dimensional reconstruction with the KinectFusion algorithm.
8. The method for detecting the grabbing pose of the robot based on the domain migration under the single-view-point cloud according to claim 1, wherein the domain-migration grabbing pose evaluation model in step 5 comprises a weight-adaptive basic convolutional network, lightweight fully-connected layers and a generative adversarial network, an alternating optimization mode is adopted during model training, and the training step comprises:
(1) The input data comprises a simulation data set and a real object data set, and forward calculation is carried out on all the networks;
(2) Updating a discriminator D, calculating the cross entropy loss of the simulation data set, the domain discrimination loss of the simulation data, the domain discrimination loss of the simulation generated data and the domain discrimination loss of the real object generated data, and adding a gradient penalty item to inhibit mode collapse;
(3) Updating a generator G, and using domain discrimination loss and label classification cross entropy loss after the simulation data set passes through the generator and the discriminator;
(4) Updating the classifier C, and only using the classified cross entropy loss of the simulation data set after passing through the F network and the C network;
(5) Updating the feature extractor F, using the simulation classification loss, the label classification cross-entropy loss of the simulation-generated data and the domain discrimination loss of the real-object-generated data.
9. The method for detecting the grabbing pose of the robot based on the domain migration under the single-view-point cloud according to claim 8, wherein the loss function used to optimize the discriminator D is:

    L_D = (1/N) Σ_n [ L_ce(D_c(s_n), y_n) + L_dom(D_rf(s_n)) + L_dom(D_rf(G(F_csn))) + L_dom(D_rf(G(F_ctn))) ] + L_gp

where N represents the number of samples in the batch, D_c and D_rf represent the discriminator label classification branch and domain discrimination branch, s_n is a simulation sample with grabbing label y_n, F_csn and F_ctn represent the merged simulation-feature input and the merged real-object-feature input of the generator, L_ce is the cross-entropy classification loss, L_dom is the domain discrimination loss and L_gp is the gradient penalty term;

the loss function used to optimize the generator G is:

    L_G = (1/N) Σ_n [ L_dom(D_rf(G(F_csn))) + L_ce(D_c(G(F_csn)), y_n) ]

the loss function used to optimize the classifier C is:

    L_C = (1/N) Σ_n L_ce(C(f_sn), y_n)

where f_sn represents the simulation-feature input of the classifier;

the loss function used to optimize the feature extractor F is:

    L_F = L_C + α · (1/N) Σ_n L_ce(D_c(G(F_csn)), y_n) + γ · (1/N) Σ_n L_dom(D_rf(G(F_ctn)))

where α and γ represent the balance weights of the two loss parts, respectively.
CN202011418811.5A 2020-12-07 2020-12-07 Robot grabbing pose detection method based on domain migration under single-view-point cloud Active CN112489117B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011418811.5A CN112489117B (en) 2020-12-07 2020-12-07 Robot grabbing pose detection method based on domain migration under single-view-point cloud

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011418811.5A CN112489117B (en) 2020-12-07 2020-12-07 Robot grabbing pose detection method based on domain migration under single-view-point cloud

Publications (2)

Publication Number Publication Date
CN112489117A CN112489117A (en) 2021-03-12
CN112489117B true CN112489117B (en) 2022-11-18

Family

ID=74940410

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011418811.5A Active CN112489117B (en) 2020-12-07 2020-12-07 Robot grabbing pose detection method based on domain migration under single-view-point cloud

Country Status (1)

Country Link
CN (1) CN112489117B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113191387B (en) * 2021-03-27 2024-03-29 西北大学 Cultural relic fragment point cloud classification method combining unsupervised learning and data self-enhancement
CN113345100B (en) * 2021-05-19 2023-04-07 上海非夕机器人科技有限公司 Prediction method, apparatus, device, and medium for target grasp posture of object
CN113297988B (en) * 2021-05-28 2024-03-22 东南大学 Object attitude estimation method based on domain migration and depth completion
CN113674348B (en) * 2021-05-28 2024-03-15 中国科学院自动化研究所 Object grabbing method, device and system
CN113763476B (en) * 2021-09-09 2023-12-01 西交利物浦大学 Object grabbing method, device and storage medium
CN114083535B (en) * 2021-11-18 2023-06-13 清华大学 Physical measurement method and device for grasping gesture quality of robot
CN116703895B (en) * 2023-08-02 2023-11-21 杭州灵西机器人智能科技有限公司 Small sample 3D visual detection method and system based on generation countermeasure network

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110363815A (en) * 2019-05-05 2019-10-22 东南大学 The robot that Case-based Reasoning is divided under a kind of haplopia angle point cloud grabs detection method
CN111046948B (en) * 2019-12-10 2022-04-22 浙江大学 Point cloud simulation and deep learning workpiece pose identification and robot feeding method
CN111652928B (en) * 2020-05-11 2023-12-15 上海交通大学 Object grabbing pose detection method in three-dimensional point cloud

Also Published As

Publication number Publication date
CN112489117A (en) 2021-03-12

Similar Documents

Publication Publication Date Title
CN112489117B (en) Robot grabbing pose detection method based on domain migration under single-view-point cloud
CN110598554B (en) Multi-person posture estimation method based on counterstudy
CN108491880B (en) Object classification and pose estimation method based on neural network
Gao et al. Dynamic hand gesture recognition based on 3D hand pose estimation for human–robot interaction
Romero et al. Hands in action: real-time 3D reconstruction of hands in interaction with objects
Wang et al. Graspness discovery in clutters for fast and accurate grasp detection
Hagelskjær et al. Pointvotenet: Accurate object detection and 6 dof pose estimation in point clouds
CN107067410B (en) Manifold regularization related filtering target tracking method based on augmented samples
CN111199207B (en) Two-dimensional multi-human body posture estimation method based on depth residual error neural network
Rashid et al. Language embedded radiance fields for zero-shot task-oriented grasping
CN115830652B (en) Deep palm print recognition device and method
Huu et al. Proposing recognition algorithms for hand gestures based on machine learning model
Laili et al. Custom grasping: A region-based robotic grasping detection method in industrial cyber-physical systems
Peng et al. A self-supervised learning-based 6-DOF grasp planning method for manipulator
Kwan et al. Gesture recognition for initiating human-to-robot handovers
Wu et al. A cascaded CNN-based method for monocular vision robotic grasping
Ikram et al. Real time hand gesture recognition using leap motion controller based on CNN-SVM architechture
Lin et al. Target recognition and optimal grasping based on deep learning
Zhou et al. 6-D object pose estimation using multiscale point cloud transformer
Bergström et al. Integration of visual cues for robotic grasping
Yu et al. Pointnet++ gpd: 6-dof grasping pose detection method based on object point cloud
Li et al. Grasping Detection Based on YOLOv3 Algorithm
Pradeep et al. Recognition of Indian Classical Dance Hand Gestures
Wang et al. A saliency detection model combined local and global features
Zhang et al. Robotic grasp detection using effective graspable feature selection and precise classification

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant