CN111652928A - Method for detecting object grabbing pose in three-dimensional point cloud - Google Patents

Method for detecting object grabbing pose in three-dimensional point cloud

Info

Publication number
CN111652928A
CN111652928A (application CN202010390619.3A)
Authority
CN
China
Prior art keywords
grabbing
point cloud
pose
layer
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010390619.3A
Other languages
Chinese (zh)
Other versions
CN111652928B (en)
Inventor
王晨曦 (Wang Chenxi)
方浩树 (Fang Haoshu)
苟铭浩 (Gou Minghao)
卢策吾 (Lu Cewu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202010390619.3A priority Critical patent/CN111652928B/en
Publication of CN111652928A publication Critical patent/CN111652928A/en
Application granted granted Critical
Publication of CN111652928B publication Critical patent/CN111652928B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/70 Determining position or orientation of objects or cameras
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J 19/00 Accessories fitted to manipulators, e.g. for monitoring, for viewing; Safety devices combined with or specially adapted for use in connection with manipulators
    • B25J 19/02 Sensing devices
    • B25J 19/04 Viewing devices
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J 9/00 Programme-controlled manipulators
    • B25J 9/16 Programme controls
    • B25J 9/1612 Programme controls characterised by the hand, wrist, grip control
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J 9/00 Programme-controlled manipulators
    • B25J 9/16 Programme controls
    • B25J 9/1694 Programme controls characterised by use of sensors other than normal servo-feedback from position, speed or acceleration sensors, perception control, multi-sensor controlled systems, sensor fusion
    • B25J 9/1697 Vision controlled systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/60 Type of objects
    • G06V 20/64 Three-dimensional objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10028 Range image; Depth image; 3D point clouds

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Robotics (AREA)
  • General Physics & Mathematics (AREA)
  • Mechanical Engineering (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Orthopedic Medicine & Surgery (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

A method for detecting object grabbing poses in a three-dimensional point cloud: object grabbing poses arranged from sample images serve as a training set for training an end-to-end object grabbing-pose detection model, which then identifies the three-dimensional point cloud data to be detected and outputs candidate grabbing-pose scores, thereby achieving object grabbing-pose detection. Through end-to-end full-scene training and testing, the invention closely couples the global and local feature relations in the point cloud, improving detection accuracy while optimizing running speed.

Description

Method for detecting object grabbing pose in three-dimensional point cloud
Technical Field
The invention relates to a technology in the field of image processing, in particular to a method for detecting an object grabbing pose in three-dimensional point cloud.
Background
Object grabbing is a fundamental problem in robotics, with broad application prospects in manufacturing, construction, services and other industries. The most critical step in grabbing is detecting the grabbing pose (the pose of the gripper in space) for a given visual scene (e.g., an image or a point cloud).
Existing grabbing-pose technology follows two technical routes. The first generates the grabbing pose indirectly by estimating the pose of the object in space; however, the prediction is very sensitive to the accuracy of the object-pose estimate, so prediction accuracy drops sharply. The second estimates the grabbing pose directly in the scene without knowing the object's pose information, and can be realized by reinforcement learning or deep learning. Reinforcement-learning-based methods iteratively correct and evaluate the current pose of the manipulator so that it gradually approaches the object and finally produces a reliable grabbing pose. Deep-learning-based methods obtain a large number of grabbing-pose candidates by grid sampling on the point cloud, encode each grabbing pose into a 2D image, and use a CNN to judge whether grabbing is possible. A further improved technique encodes the grabbing pose into a three-dimensional point cloud; although this raises classification accuracy, the candidate poses are still produced by grid sampling, so the computational cost is huge.
Disclosure of Invention
Aiming at the above shortcomings of the prior art, the invention provides a method for detecting the object grabbing pose in a three-dimensional point cloud. Through end-to-end full-scene training and testing, it closely couples the global and local feature relations in the point cloud, thereby optimizing running speed while improving detection accuracy.
The invention is realized by the following technical scheme:
The invention relates to a method for detecting the object grabbing pose in a three-dimensional point cloud: object grabbing poses arranged from sample images serve as a training set for training an end-to-end object grabbing-pose detection model, which then identifies the three-dimensional point cloud data to be detected and outputs candidate grabbing-pose scores, thereby realizing object grabbing-pose detection.
The arranging is as follows: RGB-D images containing different object combinations and different placements, used for object grabbing-pose detection, are obtained from an existing image library as sample images, and a point-cloud scene with corresponding training labels is synthesized.
The end-to-end object grabbing-pose detection model comprises: a candidate grabbing point prediction module that processes encoded spatial information, a spatial transformation module that generates candidate grabbing-pose features, a grabbing-parameter prediction module and a grabbing-affinity prediction module, wherein: the candidate grabbing point prediction module contains a PointNet++ model that processes the point-cloud scene and predicts candidate grabbing point locations and principal-axis directions; the spatial transformation module crops the point cloud near each candidate grabbing pose and converts it into the jaw coordinate system; the grabbing-parameter prediction module predicts the in-plane (pivoting) rotation angle, the gripping width and the grabbing score; and the grabbing-affinity prediction module judges the robustness of the grabbing pose.
The object grabbing pose detection is preferably further realized by performing threshold judgment on candidate grabbing pose scores.
Technical effects
The invention solves the overall problem that three-dimensional point-cloud scene information cannot be effectively utilized in object grabbing-pose detection.
Compared with the prior art, the method encodes the grabbing pose in the form of a three-dimensional point cloud, which carries far more information than a two-dimensional encoding; it provides an end-to-end full-scene training and testing process that avoids undifferentiated grid sampling of candidate poses and closely couples the global and local feature relations in the point cloud, improving detection accuracy while optimizing running speed; and it achieves the best performance on the large general-purpose grabbing dataset GraspNet.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic view of a jaw coordinate system;
FIG. 3 is a schematic diagram illustrating the effect of the present invention.
Detailed Description
As shown in fig. 1, the present embodiment relates to a method for detecting an object capture pose in a three-dimensional point cloud, which includes the following steps:
Step 1: data preprocessing: RGB-D images containing different object combinations and different placements, used for object grabbing-pose detection, are obtained from an existing image library, and a point-cloud scene with corresponding training labels is synthesized, specifically as follows:
Step 1.1: synthesize a point cloud using the camera intrinsic parameters, wherein: the camera intrinsics are the scale factors fx and fy of the camera along the u and v axes of the image coordinate system, the principal-point coordinates (cx, cy) of the image coordinate system, and the image depth-value scaling s. Denoting the coordinates of a point in the image coordinate system by (u, v), its depth value by d, and its three-dimensional coordinates in the camera coordinate system by (x, y, z), the RGB-D image is converted into a point-cloud scene in the camera coordinate system by the pinhole back-projection
z = d / s, x = (u - cx) · z / fx, y = (v - cy) · z / fy.
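For illustration, the back-projection of step 1.1 can be sketched as follows in NumPy; the function name and the assumption that invalid pixels carry zero depth are illustrative, not part of the filing.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy, s):
    """Back-project a depth image (H x W raw depth values) into an N x 3
    point cloud in the camera coordinate system, assuming the standard
    pinhole model: z = d / s, x = (u - cx) * z / fx, y = (v - cy) * z / fy."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth.astype(np.float32) / s                 # depth scaling
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]                  # drop zero-depth (invalid) pixels
```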
Step 1.2: adding a plurality of training labels for the point cloud scene, specifically comprising: and obtaining 25600 point cloud scenes with capture pose labels by using the object type labels, the object poses in the point cloud, the capture poses on each object and the scores of the capture poses.
As shown in fig. 2, a grabbing pose consists of a translation, a principal-axis direction, an in-plane (pivoting) rotation angle and a gripping width; the principal axis is along the z axis of the jaw coordinate system, the pivoting angle is the in-plane rotation about the principal axis, and the gripping width is the distance between the two jaw fingers along the y axis.
Step 2: constructing an end-to-end object grabbing pose detection model, wherein the model comprises the following steps: the end-to-end object grabbing pose detection model comprises the following steps: the device comprises a candidate grabbing point prediction module for processing coded spatial information, a spatial transformation module for generating candidate grabbing pose characteristics, a grabbing parameter prediction module and a grabbing affinity prediction module, wherein: the candidate grabbing point prediction module predicts candidate grabbing point positions and main shaft directions, the space transformation module cuts point clouds near candidate grabbing poses and converts the point clouds into a clamping jaw coordinate system, the grabbing parameter prediction module predicts an axial rotation angle, a clamping width and grabbing scores of grabbing, and the grabbing affinity prediction module judges robustness of the grabbing poses.
The candidate grabbing point prediction module comprises: a base network unit and a prediction unit, wherein: the base network unit encodes the 20000 multiplied by 3 point cloud, and the prediction unit predicts the appropriate grabbing point location and clamping jaw main shaft direction in the scene by using the encoded point cloud.
The base network unit adopts a mode recorded in a document 'PointNet + +: deep hierarchical Feature Learning on Point set in a Metric Space' (NIPS 2017) by Charles R.Qi and the like to realize a PointNet + + three-dimensional deep Learning network, and the input of the network is a network which comprises a plurality of objects and has the size of N × C1Scene point cloud (N represents the number of points, C)1Number of characteristic channels representing points), and obtaining the size of N × C after a plurality of times of downsampling, 1 × 1 convolution, maximum neighborhood pooling and characteristic jump splicing2And further processed to size M × C using the farthest point sampling2The point set of (2). Wherein:
N=20000,M=1024,C1=3,C2=256。
the prediction unit takes the point set output by the base network unit as input and has the size of (C)2V +2, V +2), the first 2 dimensions of each sampling point are two categories whether the point can be grabbed, the last V dimensions represent the prediction of the grabbing principal axis direction, the process is to uniformly sample V viewing angles from a unit circle with the point as the center, and select the viewing angle with the highest score as the candidate principal axis direction. The model was taken as V300.
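The description only fixes V = 300 approximately uniformly distributed view directions on a unit sphere; the Fibonacci-spiral construction sketched below is one common way to obtain such a set and is an assumption, as are the function and variable names.

```python
import numpy as np

def sample_view_directions(num_views=300):
    """Approximately uniform directions on the unit sphere via a Fibonacci
    spiral; the exact sampling scheme is not specified in the description."""
    idx = np.arange(num_views, dtype=np.float32)
    phi = np.arccos(1.0 - 2.0 * (idx + 0.5) / num_views)   # polar angle
    theta = np.pi * (1.0 + 5.0 ** 0.5) * idx                # golden-angle azimuth
    views = np.stack([np.sin(phi) * np.cos(theta),
                      np.sin(phi) * np.sin(theta),
                      np.cos(phi)], axis=1)
    return views  # (V, 3); the candidate principal axis is the top-scoring view

# The prediction head then outputs (2 + V) channels per sampled point:
# 2 graspability logits followed by V per-view scores.
```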
The cutting is as follows: and the space transformation module cuts out a point cloud block in a cylindrical space according to the candidate grabbing point position and the main shaft direction, the cylindrical main shaft is along the main shaft direction of the clamping jaw, and the center of the bottom surface circle is the candidate grabbing point.
The rotary shaftThe method comprises the following steps: converting the point cloud block after cutting into a clamping jaw coordinate system, and enabling v to be [ v ]1,v2,v3]Then the transformation matrix of the point cloud coordinates is represented as o ═ o1,o2,o3]Wherein:
Figure BDA0002485602430000031
each candidate shears a cylinder point cloud corresponding to a different jaw depth, wherein: the radius of the bottom surface of the cylinder is 0.05m, and the depth is 0.01m, 0.02m, 0.03m and 0.04m in sequence. Processing all candidate grabbing poses to obtain 4M pieces of shearing point clouds, and sampling each piece of point cloud to NsPoint, finally 4M × N is obtaineds×C1The point cloud array of (1), wherein: n is a radical ofs=64。
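A minimal sketch of the cylinder cropping and jaw-frame conversion follows, assuming the rotation matrix may be any orthonormal basis whose z axis follows the jaw principal axis (the filing gives the matrix only as a formula image); all function and variable names are illustrative.

```python
import numpy as np

def crop_and_transform(cloud, grasp_point, axis, radius=0.05, depth=0.04, num_sample=64):
    """Crop the points inside a cylinder whose axis follows the jaw principal
    axis and whose base circle is centred at the grasp point, then express
    them in the jaw coordinate system and resample to a fixed size."""
    axis = axis / np.linalg.norm(axis)
    rel = cloud - grasp_point
    along = rel @ axis                                    # signed distance along the axis
    radial = np.linalg.norm(rel - np.outer(along, axis), axis=1)
    mask = (along >= 0) & (along <= depth) & (radial <= radius)
    block = rel[mask]

    # Orthonormal basis whose third column is the principal axis (one valid choice).
    helper = np.array([0.0, 0.0, 1.0]) if abs(axis[2]) < 0.99 else np.array([1.0, 0.0, 0.0])
    x_axis = np.cross(helper, axis); x_axis /= np.linalg.norm(x_axis)
    y_axis = np.cross(axis, x_axis)
    rot = np.stack([x_axis, y_axis, axis], axis=1)        # columns o1, o2, o3
    block = block @ rot                                   # coordinates in the jaw frame

    if len(block) == 0:
        return np.zeros((num_sample, 3), dtype=np.float32)
    idx = np.random.choice(len(block), num_sample, replace=len(block) < num_sample)
    return block[idx].astype(np.float32)
```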
The grabbing-parameter prediction module comprises: a multilayer perceptron, a max-pooling layer, a first fully connected layer F1, a first batch-normalization layer B1, a first activation layer R1, a second fully connected layer F2, a second batch-normalization layer B2, a second activation layer R2 and a third fully connected layer F3.
The grabbing-parameter prediction module takes the cropped and converted point cloud as input. The multilayer perceptron, with channel sizes (C3, C4, C5), produces a feature point-cloud array of size 4M × Ns × C5; the max-pooling layer P1 reduces it to a point-cloud global feature array of size 4M × C5; this then passes in turn through the first fully connected layer F1, the first batch-normalization layer B1, the first activation layer R1, the second fully connected layer F2, the second batch-normalization layer B2, the second activation layer R2 and the third fully connected layer F3, giving pose-parameter predictions of size 4M × C8.
The multilayer perceptron is three groups of 1 × 1 convolution, batch-normalization and activation layers; the activation layer of each group uses the ReLU function. The numbers of convolution output channels and batch-normalization units are C3 = 64 in the first group, C4 = 128 in the second group and C5 = 256 in the third group.
The first fully connected layer F1 has C6 = 128 output channels, the second fully connected layer F2 has C7 = 128, and the third fully connected layer F3 has C8 = 36.
The first and second activation layers R1 and R2 use the ReLU function; the first and second batch-normalization layers B1 and B2 each have 128 units.
The grabbing-parameter prediction module outputs predicted values of the grabbing parameters, specifically: the in-plane (pivoting) rotation angle range (0° to 180°) is divided into 12 classes at 15° intervals; the 36 output channels represent, for each angle, its score, the corresponding gripping width and the predicted grabbing-pose score at that angle; the angle with the highest score, together with its gripping width and grabbing score, forms the final prediction.
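A PyTorch sketch of this head under the stated channel sizes (C3 to C8) follows; layer ordering matches the description, while anything the text does not specify (e.g., the absence of dropout, the exact input feature channels) is an assumption.

```python
import torch
import torch.nn as nn

class GraspParameterHead(nn.Module):
    """Grasp-parameter prediction head: a shared per-point MLP (64, 128, 256)
    realised as 1x1 convolutions, max pooling over the Ns points of each
    cropped cylinder, then fully connected layers of 128, 128 and 36 channels."""

    def __init__(self, in_channels=3):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv1d(in_channels, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 256, 1), nn.BatchNorm1d(256), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Linear(256, 128), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Linear(128, 128), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Linear(128, 36),            # 12 angle bins x (angle score, width, grasp score)
        )

    def forward(self, x):                  # x: (B, C1, Ns) cropped, converted points
        feat = self.mlp(x)                 # (B, 256, Ns) per-point features
        feat = torch.max(feat, dim=2).values   # max pool over points -> (B, 256)
        return self.fc(feat)               # (B, 36) pose-parameter predictions
```

The grabbing-affinity prediction module described next has the same structure, with its last three fully connected layers sized 128, 64 and 12.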
The grabbing-affinity prediction module comprises: a multilayer perceptron, a second max-pooling layer, a fourth fully connected layer F4, a fourth batch-normalization layer B4, a fourth activation layer R4, a fifth fully connected layer F5, a fifth batch-normalization layer B5, a fifth activation layer R5 and a sixth fully connected layer F6.
The grabbing-affinity prediction module takes the converted point cloud as input. The multilayer perceptron, with channel sizes (C10, C11, C12), produces a feature point-cloud array of size 4M × Ns × C12; the max-pooling layer P2 reduces it to a point-cloud global feature array of size 4M × C12; this then passes in turn through the fourth fully connected layer F4, the fourth batch-normalization layer B4, the fourth activation layer R4, the fifth fully connected layer F5, the fifth batch-normalization layer B5, the fifth activation layer R5 and the sixth fully connected layer F6, giving grabbing-affinity predictions of size 4M × C15.
The multilayer perceptron is three groups of 1 × 1 convolution, batch-normalization and activation layers; the activation layer of each group uses the ReLU function. The numbers of convolution output channels and batch-normalization units are C10 in the first group, C11 = 128 in the second group and C12 = 256 in the third group.
The fourth fully connected layer F4 has C13 = 128 output channels, the fifth fully connected layer F5 has C14 = 64, and the sixth fully connected layer F6 has C15 = 12.
The fourth and fifth activation layers R4 and R5 use the ReLU function; the fourth and fifth batch-normalization layers B4 and B5 each have 128 units.
The output of the grabbing-affinity prediction module is the grabbing affinity at each of the 12 angles, and the affinity at the predicted angle is taken as the final prediction, wherein: the grabbing affinity represents the maximum perturbation of the parameters under which the jaw can still grab the object at the current grabbing pose.
And step 3: training an end-to-end object grabbing pose detection model, and specifically comprises the following steps:
step 3.1: and initializing the parameters to be trained in the model by using Gaussian distribution sampling with the average value of 0 and the standard deviation of 0.01.
Step 3.2: inputting 25600 point cloud scenes with object grabbing pose labels obtained in the step 1 into the model as training samples for training, transforming the training samples in two stages in the step 2.1 and the step 2.2, and transmitting the transformed training samples to an output layer to obtain a sampling point graspable confidence coefficient predicted value { ciThe predicted values of the grabbing fractions(s) corresponding to the directions of the main shafts of the clamping jaws at different visual angles are (i is the serial number of a sampling point)ijH (i is sampling point serial number, j is view angle serial number), and a predicted value of the fraction of the rotation angle around the axis R is RijI is a point cloud block number, j is a depth number), and a clamping width predicted value WijCapturing pose score predicted values (S) by using a point cloud block serial number i and a depth serial number jijAnd (i is a point cloud block serial number, j is a depth serial number), capturing the affinity predicted value (T)ijAnd (i is the point cloud block serial number and j is the depth serial number).
The training sample comprises: scene point cloud P, sampling point snatchable confidence label
Figure BDA00024856024300000518
(i is the number of sampling points), and grabbing fraction labels corresponding to the directions of the main shafts of the clamping jaws at different visual angles
Figure BDA00024856024300000516
(i is sampling point serial number, j is visual angle serial number), and the label is divided according to the rotating angle around the shaft
Figure BDA00024856024300000513
(i is point cloud block serial number, j is visual angle serial number), and clamping width label
Figure BDA00024856024300000515
(i is point cloud block serial number, j is visual angle serial number), and grabbing pose score labels
Figure BDA00024856024300000514
(i is point cloud block serial number, j is visual angle serial number), and grabbing affinity labels
Figure BDA00024856024300000517
(i is the point cloud block number, j is the view angle number).
Step 3.3: adjusting model parameters by using a multi-target joint loss function in combination with a Back Propagation (BP) algorithm, wherein the multi-target joint loss function comprises: candidate grab point loss function LAGrabbing pose parameter loss function LRAnd grasping the affinity loss function LF
The candidate grabbing point loss function is
Figure BDA0002485602430000051
Wherein:
Figure BDA0002485602430000052
Figure BDA0002485602430000053
if and only if there is a label with an angle less than 5 deg. to the predicted principal axis direction is 1,
Figure BDA0002485602430000054
softmax loss function representing the confidence that a sample point can grab,
Figure BDA0002485602430000055
smooth L representing the principal axis direction and label on the graspable point1Regression loss function, λ1=0.5。
The grabbing pose parameter loss function is
Figure BDA0002485602430000056
Figure BDA0002485602430000057
Wherein:
Figure BDA0002485602430000058
sigmoid cross entropy loss function representing angle classification when the grab depth is d,
Figure BDA0002485602430000059
smooth L representing grabbing pose score when grabbing depth is d1The function of the regression loss is used,
Figure BDA00024856024300000510
smooth L representing grip width at grip depth d1Regression loss function, λ2=1.0,λ3=0.2。
The grabbing affinity loss function is
Figure BDA00024856024300000511
Wherein:
Figure BDA00024856024300000512
smooth L representing grab affinity1A regression loss function.
The target function of the back propagation BP algorithm is L ═ LA({ci},{si})+αLR({Rij},{Sij},{Wij})+βLF({TijH) α ═ 0.5 and β ═ 0.1.
In this embodiment, the learning rate of the back propagation BP algorithm is 0.001 at the initial value, the whole training data set is iterated 90 times, and the learning rates sequentially become 0.0001, 0.00001, and 0.000001 after 40, 60, and 80 iterations.
Step 4: perform object grabbing-pose detection with the trained end-to-end object grabbing-pose detection model: 7680 RGB-D images are used; for each image to be detected, a scene point cloud is synthesized by the method of step 1.1 and input into the model, and the predicted object grabbing poses are obtained through layer-by-layer computation.
The model evaluation criterion used in this embodiment is: the predicted grabbing poses are screened by non-maximum suppression and assigned to the closest object; given a surface friction coefficient μ, the complete object scan model is used to check whether the grab would succeed, and the corresponding average precision APμ is computed; μ is increased stepwise from 0.2 to 1.0 in intervals of 0.2, the corresponding APμ values are computed, and their mean is taken as the final AP score.
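A sketch of the final-score computation follows, covering only the averaging step described above; how each APμ itself is obtained from the grasp-success check against the full object scan is left abstract, and the numbers in the usage comment are illustrative only.

```python
import numpy as np

def final_ap(ap_at_mu):
    """Average per-friction-coefficient AP values into the final score.
    `ap_at_mu` maps a friction coefficient mu to the AP measured with that mu."""
    mus = [0.2, 0.4, 0.6, 0.8, 1.0]   # mu increased from 0.2 to 1.0 in steps of 0.2
    return float(np.mean([ap_at_mu[mu] for mu in mus]))

# Example (illustrative numbers only):
# final_ap({0.2: 0.10, 0.4: 0.20, 0.6: 0.28, 0.8: 0.33, 1.0: 0.36}) -> 0.254
```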
By the model evaluation criterion described above, the method achieves the best results on the large-scale general-purpose grabbing dataset GraspNet. In a practical experiment the system was run with the PyTorch computing framework on a single NVIDIA RTX 2080 GPU and tested on the GraspNet data, with the following results: the best performance is achieved on the test data of all three difficulty levels, the AP reaching 27.56/29.88 on the Seen difficulty (results on data collected with RealSense/Kinect cameras respectively, likewise below), 26.11/27.84 on the Unseen difficulty, and 10.55/11.51 on the Novel difficulty.
Compared with the prior art, the method does not rely on object pose information, which speeds up prediction; through steps such as jaw principal-axis direction prediction and grabbing-affinity prediction it greatly improves the accuracy of grabbing-pose detection, achieving the best performance on the large-scale general-purpose grabbing dataset GraspNet.
The foregoing embodiments may be modified in many different ways by those skilled in the art without departing from the spirit and scope of the invention, which is defined by the appended claims; all changes that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims (11)

1. A method for detecting the object grabbing pose in a three-dimensional point cloud, characterized in that an end-to-end object grabbing-pose detection model is trained on a training set obtained by arranging object grabbing poses in sample images, and the three-dimensional point cloud data to be detected is then identified to obtain candidate grabbing-pose scores, thereby realizing object grabbing-pose detection;
the end-to-end object grabbing-pose detection model comprises: a candidate grabbing point prediction module for processing encoded spatial information, a spatial transformation module for generating candidate grabbing-pose features, a grabbing-parameter prediction module and a grabbing-affinity prediction module.
2. The method for detecting the object grabbing pose in the three-dimensional point cloud according to claim 1, wherein when the candidate grabbing point prediction module predicts the main shaft direction of the clamping jaw, a unit sphere is generated by taking a grabbing point as a sphere center, the sphere is discretized into 300 uniformly distributed viewpoints, and the direction of the viewpoint pointing to the sphere center represents the main shaft direction of the clamping jaw, so that the prediction of the main shaft direction is converted into the classification problem of the viewpoint; the candidate grabbing point prediction module is internally provided with a PointNet + + model for processing a point cloud scene and carries out candidate grabbing point location and main shaft direction prediction.
3. The method for detecting the object grabbing pose in the three-dimensional point cloud of claim 1, wherein the space transformation module is used for clipping the point cloud near the candidate grabbing pose and converting the point cloud into a clamping jaw coordinate system, and the grabbing parameter prediction module is used for predicting the grabbing pivoting angle, the clamping width and the grabbing score.
4. The method for detecting the object grabbing pose in the three-dimensional point cloud of claim 1, wherein the grabbing affinity prediction module outputs the affinity of candidate grabbing poses, namely the maximum disturbance range when the object can be grabbed after the predicted pose is disturbed, the grabbing affinity represents the robustness of the grabbing pose, and the stronger the grabbing affinity, the stronger the robustness of the predicted grabbing pose.
5. The method for detecting the object grabbing pose in the three-dimensional point cloud according to claim 1, characterized in that the arranging is as follows: RGB-D images for object grabbing-pose detection, containing different object combinations and different placements, are obtained from an existing image library as sample images, and a point-cloud scene with corresponding training labels is synthesized.
6. The method for detecting the object capture pose in the three-dimensional point cloud of claim 1, wherein the object capture pose detection is further realized by performing threshold judgment on candidate capture pose scores.
7. The method for detecting the object grabbing pose in the three-dimensional point cloud according to claim 5, characterized in that the arranging comprises:
step 1.1: synthesizing a point cloud using the camera intrinsic parameters, wherein: the camera intrinsics are the scale factors fx and fy of the camera along the u and v axes of the image coordinate system, the principal-point coordinates (cx, cy) of the image coordinate system, and the image depth-value scaling s; denoting the coordinates of a point in the image coordinate system by (u, v), its depth value by d, and its three-dimensional coordinates in the camera coordinate system by (x, y, z), the RGB-D image is converted into a point-cloud scene in the camera coordinate system by the pinhole back-projection z = d / s, x = (u - cx) · z / fx, y = (v - cy) · z / fy;
step 1.2: adding training labels to the point-cloud scene, specifically: using the object category labels, the object poses in the point cloud, the grabbing poses on each object and the scores of those grabbing poses, 25600 point-cloud scenes with grabbing-pose labels are obtained.
8. The method for detecting the object grabbing pose in the three-dimensional point cloud according to claim 1 or 2, characterized in that the candidate grabbing point prediction module comprises a base network unit and a prediction unit, wherein: the base network unit encodes the 20000 × 3 point cloud, and the prediction unit uses the encoded point cloud to predict suitable grabbing point locations and jaw principal-axis directions in the scene.
9. The method for detecting the object grabbing pose in the three-dimensional point cloud according to claim 3, characterized in that the cropping is as follows: the spatial transformation module crops a point-cloud block inside a cylindrical space according to the candidate grabbing point location and principal-axis direction; the cylinder axis is along the jaw principal-axis direction and the centre of its base circle is the candidate grabbing point;
the conversion is as follows: the cropped point-cloud block is converted into the jaw coordinate system; let the principal-axis direction be v = [v1, v2, v3], then the rotation matrix of the point-cloud coordinates is written as O = [o1, o2, o3], an orthonormal matrix constructed from v so that the jaw principal axis becomes the z axis of the jaw coordinate system;
for each candidate, a cylindrical point cloud is cropped at each of several jaw depths, wherein: the cylinder base radius is 0.05 m and the depths are 0.01 m, 0.02 m, 0.03 m and 0.04 m in turn; processing all candidate grabbing poses yields 4M cropped point clouds, each point cloud is sampled to Ns points, finally giving a point-cloud array of size 4M × Ns × C1, where: Ns = 64.
10. The method for detecting the object grabbing pose in the three-dimensional point cloud according to claim 1 or 3, characterized in that the grabbing-parameter prediction module comprises: a multilayer perceptron, a max-pooling layer, a first fully connected layer F1, a first batch-normalization layer B1, a first activation layer R1, a second fully connected layer F2, a second batch-normalization layer B2, a second activation layer R2 and a third fully connected layer F3;
the grabbing-parameter prediction module takes the cropped and converted point cloud as input; the multilayer perceptron, with channel sizes (C3, C4, C5), produces a feature point-cloud array of size 4M × Ns × C5; the max-pooling layer P1 reduces it to a point-cloud global feature array of size 4M × C5, which then passes in turn through the first fully connected layer F1, the first batch-normalization layer B1, the first activation layer R1, the second fully connected layer F2, the second batch-normalization layer B2, the second activation layer R2 and the third fully connected layer F3, giving pose-parameter predictions of size 4M × C8.
11. The method for detecting the object grabbing pose in the three-dimensional point cloud according to claim 1 or 4, characterized in that the grabbing-affinity prediction module comprises: a multilayer perceptron, a second max-pooling layer, a fourth fully connected layer F4, a fourth batch-normalization layer B4, a fourth activation layer R4, a fifth fully connected layer F5, a fifth batch-normalization layer B5, a fifth activation layer R5 and a sixth fully connected layer F6;
the grabbing-affinity prediction module takes the converted point cloud as input; the multilayer perceptron, with channel sizes (C10, C11, C12), produces a feature point-cloud array of size 4M × Ns × C12; the max-pooling layer P2 reduces it to a point-cloud global feature array of size 4M × C12, which then passes in turn through the fourth fully connected layer F4, the fourth batch-normalization layer B4, the fourth activation layer R4, the fifth fully connected layer F5, the fifth batch-normalization layer B5, the fifth activation layer R5 and the sixth fully connected layer F6, giving grabbing-affinity predictions of size 4M × C15.
CN202010390619.3A 2020-05-11 2020-05-11 Object grabbing pose detection method in three-dimensional point cloud Active CN111652928B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010390619.3A CN111652928B (en) 2020-05-11 2020-05-11 Object grabbing pose detection method in three-dimensional point cloud

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010390619.3A CN111652928B (en) 2020-05-11 2020-05-11 Object grabbing pose detection method in three-dimensional point cloud

Publications (2)

Publication Number Publication Date
CN111652928A true CN111652928A (en) 2020-09-11
CN111652928B CN111652928B (en) 2023-12-15

Family

ID=72349479

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010390619.3A Active CN111652928B (en) 2020-05-11 2020-05-11 Object grabbing pose detection method in three-dimensional point cloud

Country Status (1)

Country Link
CN (1) CN111652928B (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112489126A (en) * 2020-12-10 2021-03-12 浙江商汤科技开发有限公司 Vehicle key point information detection method, vehicle control method and device and vehicle
CN112489117A (en) * 2020-12-07 2021-03-12 东南大学 Robot grabbing pose detection method based on domain migration under single-view-point cloud
CN112720459A (en) * 2020-12-02 2021-04-30 达闼机器人有限公司 Target object grabbing method and device, storage medium and electronic equipment
CN112801988A (en) * 2021-02-02 2021-05-14 上海交通大学 Object grabbing pose detection method based on RGBD and deep neural network
CN112894815A (en) * 2021-01-25 2021-06-04 西安工业大学 Method for detecting optimal position and posture for article grabbing by visual servo mechanical arm
CN113345100A (en) * 2021-05-19 2021-09-03 上海非夕机器人科技有限公司 Prediction method, apparatus, device, and medium for target grasp posture of object
CN113674348A (en) * 2021-05-28 2021-11-19 中国科学院自动化研究所 Object grabbing method, device and system
CN114211490A (en) * 2021-12-17 2022-03-22 中山大学 Robot arm gripper pose prediction method based on Transformer model
WO2022156749A1 (en) * 2021-01-22 2022-07-28 熵智科技(深圳)有限公司 Workpiece grabbing method and apparatus, and computer device and storage medium
CN115082795A (en) * 2022-07-04 2022-09-20 梅卡曼德(北京)机器人科技有限公司 Virtual image generation method, device, equipment, medium and product
CN115213721A (en) * 2022-09-21 2022-10-21 江苏友邦精工实业有限公司 A turnover positioning manipulator for automobile frame machining
CN116494253A (en) * 2023-06-27 2023-07-28 北京迁移科技有限公司 Target object grabbing pose acquisition method and robot grabbing system
CN116758122A (en) * 2023-08-14 2023-09-15 国网智能电网研究院有限公司 Power transmission line iron tower pose registration method and device based on cross-source point cloud


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106737692A (en) * 2017-02-10 2017-05-31 杭州迦智科技有限公司 A kind of mechanical paw Grasp Planning method and control device based on depth projection
CN108171748A (en) * 2018-01-23 2018-06-15 哈工大机器人(合肥)国际创新研究院 A kind of visual identity of object manipulator intelligent grabbing application and localization method
CN109102547A (en) * 2018-07-20 2018-12-28 上海节卡机器人科技有限公司 Robot based on object identification deep learning model grabs position and orientation estimation method
CN110363815A (en) * 2019-05-05 2019-10-22 东南大学 The robot that Case-based Reasoning is divided under a kind of haplopia angle point cloud grabs detection method
CN110796700A (en) * 2019-10-21 2020-02-14 上海大学 Multi-object grabbing area positioning method based on convolutional neural network
CN110909644A (en) * 2019-11-14 2020-03-24 南京理工大学 Method and system for adjusting grabbing posture of mechanical arm end effector based on reinforcement learning
CN110969660A (en) * 2019-12-17 2020-04-07 浙江大学 Robot feeding system based on three-dimensional stereoscopic vision and point cloud depth learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
何若涛 (He Ruotao): "3D object detection and pose estimation for robot manipulation" (面向机器人操作的三维物体检测与位姿估计), Master's thesis, Guangdong University of Technology, Information Science and Technology series, no. 02, pages 19-47 *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112720459A (en) * 2020-12-02 2021-04-30 达闼机器人有限公司 Target object grabbing method and device, storage medium and electronic equipment
CN112489117A (en) * 2020-12-07 2021-03-12 东南大学 Robot grabbing pose detection method based on domain migration under single-view-point cloud
CN112489126B (en) * 2020-12-10 2023-09-19 浙江商汤科技开发有限公司 Vehicle key point information detection method, vehicle control method and device and vehicle
CN112489126A (en) * 2020-12-10 2021-03-12 浙江商汤科技开发有限公司 Vehicle key point information detection method, vehicle control method and device and vehicle
WO2022156749A1 (en) * 2021-01-22 2022-07-28 熵智科技(深圳)有限公司 Workpiece grabbing method and apparatus, and computer device and storage medium
CN112894815A (en) * 2021-01-25 2021-06-04 西安工业大学 Method for detecting optimal position and posture for article grabbing by visual servo mechanical arm
CN112894815B (en) * 2021-01-25 2022-09-27 西安工业大学 Method for detecting optimal position and posture for article grabbing by visual servo mechanical arm
CN112801988A (en) * 2021-02-02 2021-05-14 上海交通大学 Object grabbing pose detection method based on RGBD and deep neural network
CN113345100A (en) * 2021-05-19 2021-09-03 上海非夕机器人科技有限公司 Prediction method, apparatus, device, and medium for target grasp posture of object
CN113674348A (en) * 2021-05-28 2021-11-19 中国科学院自动化研究所 Object grabbing method, device and system
CN113674348B (en) * 2021-05-28 2024-03-15 中国科学院自动化研究所 Object grabbing method, device and system
CN114211490B (en) * 2021-12-17 2024-01-05 中山大学 Method for predicting pose of manipulator gripper based on transducer model
CN114211490A (en) * 2021-12-17 2022-03-22 中山大学 Robot arm gripper pose prediction method based on Transformer model
CN115082795A (en) * 2022-07-04 2022-09-20 梅卡曼德(北京)机器人科技有限公司 Virtual image generation method, device, equipment, medium and product
CN115213721A (en) * 2022-09-21 2022-10-21 江苏友邦精工实业有限公司 A turnover positioning manipulator for automobile frame machining
CN116494253A (en) * 2023-06-27 2023-07-28 北京迁移科技有限公司 Target object grabbing pose acquisition method and robot grabbing system
CN116494253B (en) * 2023-06-27 2023-09-19 北京迁移科技有限公司 Target object grabbing pose acquisition method and robot grabbing system
CN116758122B (en) * 2023-08-14 2023-11-14 国网智能电网研究院有限公司 Power transmission line iron tower pose registration method and device based on cross-source point cloud
CN116758122A (en) * 2023-08-14 2023-09-15 国网智能电网研究院有限公司 Power transmission line iron tower pose registration method and device based on cross-source point cloud

Also Published As

Publication number Publication date
CN111652928B (en) 2023-12-15

Similar Documents

Publication Publication Date Title
CN111652928B (en) Object grabbing pose detection method in three-dimensional point cloud
CN111007073A (en) Method and system for online detection of part defects in additive manufacturing process
CN110223345B (en) Point cloud-based distribution line operation object pose estimation method
CN112836734A (en) Heterogeneous data fusion method and device and storage medium
CN110795990B (en) Gesture recognition method for underwater equipment
CN111899301A (en) Workpiece 6D pose estimation method based on deep learning
CN111667535B (en) Six-degree-of-freedom pose estimation method for occlusion scene
CN111768388A (en) Product surface defect detection method and system based on positive sample reference
CN108305278B (en) Image matching correlation improvement method in ORB-SLAM algorithm
CN111639571B (en) Video action recognition method based on contour convolution neural network
CN107220601B (en) Target capture point prediction method based on online confidence degree discrimination
CN114266891A (en) Railway operation environment abnormity identification method based on image and laser data fusion
CN114170174A (en) CLANet steel rail surface defect detection system and method based on RGB-D image
CN114581782B (en) Fine defect detection method based on coarse-to-fine detection strategy
CN115115859A (en) Long linear engineering construction progress intelligent identification and analysis method based on unmanned aerial vehicle aerial photography
CN117315025A (en) Mechanical arm 6D pose grabbing method based on neural network
CN116205907A (en) Decorative plate defect detection method based on machine vision
CN113269830B (en) 6D pose estimation method and device based on geometric constraint cooperative attention network
Luo et al. Grasp detection based on faster region cnn
CN114548253A (en) Digital twin model construction system based on image recognition and dynamic matching
CN113752255A (en) Mechanical arm six-degree-of-freedom real-time grabbing method based on deep reinforcement learning
CN113034575A (en) Model construction method, pose estimation method and object picking device
CN113177969B (en) Point cloud single-target tracking method of candidate seeds based on motion direction change
Lin et al. Robotic grasp detection by rotation region CNN
CN115018910A (en) Method and device for detecting target in point cloud data and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant