CN111080693A - Robot autonomous classification grabbing method based on YOLOv3 - Google Patents

Robot autonomous classification grabbing method based on YOLOv3

Info

Publication number
CN111080693A
CN111080693A (application CN201911159864.7A)
Authority
CN
China
Prior art keywords
target object
yolov
coordinate
target
yolov3
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911159864.7A
Other languages
Chinese (zh)
Inventor
王太勇
冯志杰
韩文灯
彭鹏
张凌雷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN201911159864.7A
Publication of CN111080693A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J11/00 Manipulators not otherwise provided for
    • B25J11/008 Manipulators for service tasks
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J19/00 Accessories fitted to manipulators, e.g. for monitoring, for viewing; Safety devices combined with or specially adapted for use in connection with manipulators
    • B25J19/02 Sensing devices
    • B25J19/021 Optical sensing devices
    • B25J19/023 Optical sensing devices including video camera means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/64 Three-dimensional objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10028 Range image; Depth image; 3D point clouds

Abstract

The invention discloses a robot autonomous classification grabbing method based on YOLOv3. The method comprises the steps of: collecting and constructing a sample data set of target objects; training a YOLOv3 target detection network to obtain a target object recognition model; collecting a color image and a depth image of the target object; processing the color image with the trained YOLOv3 target detection network to obtain the category information and position information of the target object to be grabbed, and further processing with the depth image to obtain the point cloud information of the target object; computing the minimum bounding box of the point cloud, calculating the principal direction of the point cloud with the PCA algorithm, calibrating the X-, Y- and Z-axis coordinate data of the target object, and calculating the six-degree-of-freedom pose of the target object relative to the robot coordinate system. By adopting the YOLOv3 algorithm and estimating the object grabbing pose through point cloud preprocessing, PCA and related methods, the invention enables the robot to grab target objects in a classified manner.

Description

Robot autonomous classification grabbing method based on YOLOv3
Technical Field
The invention relates to a robot autonomous classified grabbing method, and in particular to a robot autonomous classified grabbing method based on YOLOv3.
Background
At present, China's population is aging rapidly and labor is in short supply, so the demand for service robots keeps growing. However, the unstructured environments in which service robots work raise many technical problems, a major one being autonomous grabbing in an unstructured environment. Grabbing is one of the main ways a robot interacts with the real world and is an urgent problem to be solved. Unlike industrial robots that grab workpieces in a structured environment, autonomous grabbing by service robots in unstructured environments faces many challenges, such as dynamic surroundings, illumination variation and mutual occlusion between objects; above all, unstructured environments contain many unknown objects in addition to known ones. The mature grabbing planning methods used on most industrial robots rely on obtaining object models in advance to build a database, or on executing fixed actions under a pre-programmed routine. For a service robot working in an unstructured environment, obtaining models of all objects to be grabbed in advance is not practical, so the robot must be able to perform fast, stable and reliable grabbing planning on unknown objects online. The invention adopts a computer-vision approach: a camera captures a color image and a depth image of the target object, and a target detection method then identifies and locates the target object, yielding its category and its position represented by a rectangular box. The specific pose of the object is then obtained through image processing and point cloud algorithms, and the object is grabbed by the mechanical arm. For target object recognition, traditional algorithms generally use image-processing methods such as edge extraction, SURF and SIFT to extract features from the image and then match them against a template. However, such algorithms are easily affected by the working environment, are sensitive to illumination and to object shape and size, and have poor robustness and weak generalization capability.
Disclosure of Invention
To solve the above technical problems in the prior art, the invention provides a robot autonomous classified grabbing method based on YOLOv3 that has better robustness and strong generalization capability.
The technical scheme adopted by the invention to solve the technical problems in the prior art is as follows: a robot autonomous classification grabbing method based on YOLOv3, comprising the steps of: collecting and constructing a sample data set of target objects; training a YOLOv3 target detection network with the sample data set to obtain a target object recognition model; collecting a color image and a depth image of the target object to be grabbed; processing the color image with the trained YOLOv3 target detection network to obtain the category information and position information of the target object to be grabbed, and further processing with the depth image to obtain the point cloud information of the target object to be grabbed; computing the minimum bounding box of the point cloud information, calculating the principal direction of the point cloud with the PCA algorithm, calibrating the X-, Y- and Z-axis coordinate data of the target object to be grabbed, and calculating the six-degree-of-freedom pose of the target object to be grabbed relative to the robot coordinate system.
Further, the step of collecting and constructing the target object sample data set comprises:
step a, using a Kinect camera to acquire images of various target objects and acquiring images of various target object combinations;
step b, constructing a sample data set conforming to the YOLOv3 target detection network and dividing it into a training set, a validation set and a test set at a ratio of 5:1:1 (a minimal split sketch is given below).
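For illustration only, a minimal Python sketch of such a 5:1:1 split is given below; the file names and the use of a simple random shuffle are assumptions made here, not part of the invention.

    import random

    def split_dataset(image_paths, seed=0):
        """Shuffle annotated image paths and split them 5:1:1 into train/val/test."""
        paths = list(image_paths)
        random.Random(seed).shuffle(paths)
        n = len(paths)
        n_train = n * 5 // 7          # 5 parts out of 5 + 1 + 1
        n_val = n // 7                # 1 part
        train = paths[:n_train]
        val = paths[n_train:n_train + n_val]
        test = paths[n_train + n_val:]
        return train, val, test

    # Illustrative file names only.
    train_set, val_set, test_set = split_dataset(
        ["images/%04d.jpg" % i for i in range(700)])
    print(len(train_set), len(val_set), len(test_set))  # e.g. 500 100 100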
Further, the step of training the YOLOv3 target detection network is as follows: construct a YOLOv3 target detection network containing the Darknet-53 network; first input the sample data set into the Darknet-53 neural network, perform down-sampling by changing the stride of the convolution kernels in the Darknet-53 neural network, and at the same time concatenate the up-sampled results of a middle layer and an output layer of the YOLOv3 target detection network, obtaining three feature maps of different scales.
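As a rough illustration of down-sampling with stride-2 convolutions followed by up-sampling and concatenation into three feature scales, a PyTorch sketch is given below. It is a toy stand-in whose layer counts and channel widths do not match the real Darknet-53; only the structural idea is shown.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ToyMultiScaleBackbone(nn.Module):
        """Toy stand-in for Darknet-53: stride-2 convolutions replace pooling,
        and up-sampled deep feature maps are concatenated with shallower ones."""
        def __init__(self):
            super().__init__()
            self.stage1 = nn.Conv2d(3, 32, 3, stride=2, padding=1)     # 1/2 resolution
            self.stage2 = nn.Conv2d(32, 64, 3, stride=2, padding=1)    # 1/4
            self.stage3 = nn.Conv2d(64, 128, 3, stride=2, padding=1)   # 1/8
            self.stage4 = nn.Conv2d(128, 256, 3, stride=2, padding=1)  # 1/16
            self.stage5 = nn.Conv2d(256, 512, 3, stride=2, padding=1)  # 1/32

        def forward(self, x):
            f1 = F.relu(self.stage1(x))
            f2 = F.relu(self.stage2(f1))
            f3 = F.relu(self.stage3(f2))   # large feature map (1/8)
            f4 = F.relu(self.stage4(f3))   # medium feature map (1/16)
            f5 = F.relu(self.stage5(f4))   # small feature map (1/32)
            # Up-sample the deeper map and splice it with the next-shallower one,
            # as YOLOv3 does when building its three detection scales.
            f4_cat = torch.cat([f4, F.interpolate(f5, scale_factor=2)], dim=1)
            f3_cat = torch.cat([f3, F.interpolate(f4_cat, scale_factor=2)], dim=1)
            return f5, f4_cat, f3_cat      # three feature maps of different scales

    maps = ToyMultiScaleBackbone()(torch.randn(1, 3, 416, 416))
    print([m.shape for m in maps])  # 13x13, 26x26 and 52x52 grids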
Furthermore, when image information of a combination of multiple target objects is processed, category labeling and position labeling are performed on the objects in the image with an image annotation tool.
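As an aside, annotations written by common image annotation tools (for example labelImg, which produces Pascal-VOC-style XML files) can be converted into the plain-text form expected by YOLOv3 training scripts. The sketch below assumes that XML layout and uses an illustrative subset of the class list; it is not part of the invention.

    import xml.etree.ElementTree as ET

    CLASSES = ["banana", "apple", "carambola", "cherry", "grape", "strawberry"]  # illustrative subset

    def voc_to_yolo(xml_path):
        """Convert one Pascal-VOC-style annotation into YOLO lines:
        'class_id x_center y_center width height', all normalized to [0, 1]."""
        root = ET.parse(xml_path).getroot()
        w = float(root.find("size/width").text)
        h = float(root.find("size/height").text)
        lines = []
        for obj in root.iter("object"):
            cls = CLASSES.index(obj.find("name").text)
            box = obj.find("bndbox")
            xmin, ymin = float(box.find("xmin").text), float(box.find("ymin").text)
            xmax, ymax = float(box.find("xmax").text), float(box.find("ymax").text)
            lines.append("%d %.6f %.6f %.6f %.6f" % (
                cls, (xmin + xmax) / 2 / w, (ymin + ymax) / 2 / h,
                (xmax - xmin) / w, (ymax - ymin) / h))
        return lines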
Further, the position information of the target object is represented by a rectangular frame; the rectangular box calculation method is as follows:
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · exp(t_w)
b_h = p_h · exp(t_h)
in the formula:
b_x is the X-direction coordinate of the center point of the object bounding box predicted by YOLOv3;
b_y is the Y-direction coordinate of the center point of the object bounding box predicted by YOLOv3;
b_w is the X-direction width of the object bounding box predicted by YOLOv3;
b_h is the Y-direction height of the object bounding box predicted by YOLOv3;
c_x is the X-direction coordinate of the upper-left corner of the grid cell on the feature map;
c_y is the Y-direction coordinate of the upper-left corner of the grid cell on the feature map;
t_x is the X-coordinate offset value of the target object predicted by YOLOv3;
t_y is the Y-coordinate offset value of the target object predicted by YOLOv3;
t_w is the lateral scale scaling value of the target object predicted by YOLOv3;
t_h is the longitudinal scale scaling value of the target object predicted by YOLOv3;
p_w is the preset lateral dimension of the anchor box on the feature map;
p_h is the preset longitudinal dimension of the anchor box on the feature map;
σ is the sigmoid function.
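For clarity, a minimal Python sketch of how these quantities combine is given below; the grid cell, anchor size and stride values are illustrative only and are not parameters of the invention.

    import math

    def decode_box(tx, ty, tw, th, cx, cy, pw, ph, stride):
        """Decode YOLOv3-style raw outputs (t*) of one grid cell into a box.
        cx, cy: top-left corner of the cell in grid units; pw, ph: anchor size
        in pixels; stride: pixels per grid cell on this feature map."""
        sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
        bx = (sigmoid(tx) + cx) * stride      # center x in pixels
        by = (sigmoid(ty) + cy) * stride      # center y in pixels
        bw = pw * math.exp(tw)                # box width in pixels
        bh = ph * math.exp(th)                # box height in pixels
        return bx, by, bw, bh

    # Illustrative values: cell (6, 4) on a 13x13 map of a 416x416 image (stride 32),
    # with an assumed anchor of 116x90 pixels.
    print(decode_box(0.2, -0.1, 0.05, 0.1, 6, 4, 116, 90, 32))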
Further, the point cloud information of the target object is calculated by traversing the pixel mask of the target object within the rectangular box and combining it with the depth image; the calculation formula is as follows:
x_w = (u - u_0) · z_c / f
y_w = (v - v_0) · z_c / f
z_w = z_c
in the formula:
x_w is the X-direction coordinate of the object in the camera coordinate system;
y_w is the Y-direction coordinate of the object in the camera coordinate system;
z_w is the Z-direction coordinate of the object in the camera coordinate system;
z_c is the depth of the object in the camera coordinate system;
u is the horizontal coordinate of the pixel point in the pixel coordinate system;
v is the vertical coordinate of the pixel point in the pixel coordinate system;
u_0 is the horizontal pixel coordinate of the image center;
v_0 is the vertical pixel coordinate of the image center;
f is the focal length of the camera;
where u_0, v_0 and f are camera parameters obtained by calibrating the camera.
Further, the RGB-D image information of the target object is acquired by a vision sensor.
Further, a depth image of the target object is acquired by the Kinect depth camera.
Further, the six-degree-of-freedom pose of the target object relative to the camera coordinate system is calculated from the obtained X-, Y- and Z-axis coordinates of the target object, and the six-degree-of-freedom pose of the target object in the robot base coordinate system is obtained by combining it with the hand-eye calibration result.
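As a hedged illustration of this last step: once hand-eye calibration yields the homogeneous transform from the camera frame to the robot base frame, the object pose is mapped by a single matrix product. The numeric matrices below are placeholders, not calibration results of the invention.

    import numpy as np

    def camera_pose_to_base(T_base_cam, T_cam_obj):
        """Map an object pose expressed in the camera frame into the robot base frame.
        Both arguments are 4x4 homogeneous transforms."""
        return T_base_cam @ T_cam_obj

    # Placeholder hand-eye result: camera 0.5 m above the base, looking straight down.
    T_base_cam = np.array([[1.0,  0.0,  0.0, 0.0],
                           [0.0, -1.0,  0.0, 0.0],
                           [0.0,  0.0, -1.0, 0.5],
                           [0.0,  0.0,  0.0, 1.0]])
    # Placeholder object pose in the camera frame (identity rotation, 0.3 m in front).
    T_cam_obj = np.eye(4)
    T_cam_obj[2, 3] = 0.3

    T_base_obj = camera_pose_to_base(T_base_cam, T_cam_obj)
    print(T_base_obj[:3, 3])  # object position in the base frame: [0. 0. 0.2]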
The invention has the following advantages and positive effects: the method adopts the YOLOv3 target detection algorithm from computer vision to identify and locate the target object. The algorithm is mature, and its accuracy and speed are higher than those of previous target detection algorithms, so it is very suitable for classified grabbing by robots with high real-time requirements. By analyzing the YOLOv3 target detection results and estimating the object grabbing pose through point cloud preprocessing, PCA and related methods, classified grabbing of the target object by the robot is achieved, realizing automatic grabbing by the mechanical arm. The method has good robustness and strong generalization capability.
Drawings
FIG. 1 is a schematic workflow diagram of the present invention.
Detailed Description
For further understanding of the contents, features and effects of the present invention, the following embodiments are enumerated in conjunction with the accompanying drawings, and the following detailed description is given:
the English Chinese in this application is explained as follows:
YOLOv3: a single-stage target detection algorithm proposed by Joseph Redmon in 2018;
PCA: principal component analysis;
Darknet-53: a deep convolutional neural network used to extract image features; the core module of the YOLOv3 algorithm;
Kinect: a vision sensor that can obtain RGB information and depth information of an object;
RGB: a three-channel color image;
RGB-D: a collective name for three-channel color images together with depth images;
R-CNN: region-based convolutional neural network, a target detection algorithm proposed by Ross Girshick et al. in 2014;
Fast R-CNN: a faster region-based convolutional neural network, a target detection algorithm proposed by Ross Girshick et al. in 2015 to improve the detection speed of R-CNN;
SSD: a single-stage multi-category target detector, a target detection algorithm proposed by Wei Liu et al. in 2016;
cell: a grid cell;
ROS: the Robot Operating System, a highly flexible software framework for writing robot software;
message: a communication mechanism in the ROS robot operating system;
anchor: an anchor box;
confidence: a confidence level.
referring to fig. 1, a robot autonomous classification grabbing method based on YOLOv3 collects and constructs a sample data set of a target object; training a YOLOv3 target detection network by using the sample data set to obtain a target object recognition model; collecting a color image and a depth image of a target object to be grabbed; processing the color image by adopting a trained YOLOv3 target detection network to obtain the category information and the position information of a target object to be grabbed, introducing a depth image for further processing to obtain point cloud information of the target object to be grabbed; and (3) solving the point cloud information by a minimum bounding box, calculating the main direction of the point cloud by combining a PCA algorithm, calibrating X, Y, Z-axis coordinate data of the target object to be grabbed, and calculating the six-degree-of-freedom pose of the target object to be grabbed relative to a robot coordinate system. A target object recognition model based on a YOLOv3 target detection network is built, a YOLOv3 target detection network is trained by using a sample data set of the collected and built target object, and category information, position information and the like of the target object are obtained.
Preferably, the step of acquiring and constructing the target object sample data set may be as follows:
step a, a Kinect camera can be used to acquire images of various target objects, and images of various target object combinations can also be acquired;
step b, a sample data set conforming to the YOLOv3 target detection network can be constructed and divided into a training set, a validation set and a test set at a ratio of 5:1:1.
Preferably, the step of training the YOLOv3 target detection network may be as follows: a YOLOv3 target detection network containing the Darknet-53 network can be constructed; the sample data set can first be input into the Darknet-53 neural network, down-sampling can be performed by changing the stride of the convolution kernels in the Darknet-53 neural network, and at the same time the up-sampled results of a middle layer and an output layer of the YOLOv3 target detection network can be concatenated, obtaining three feature maps of different scales.
Preferably, when processing image information of a combination of multiple target objects, an image annotation tool can be used to perform category annotation and position annotation on the objects in the image.
Preferably, the position information of the target object may be represented by a rectangular frame; the rectangular box calculation method can be shown as the following formula:
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · exp(t_w)
b_h = p_h · exp(t_h)
in the formula:
b_x is the X-direction coordinate of the center point of the object bounding box predicted by YOLOv3;
b_y is the Y-direction coordinate of the center point of the object bounding box predicted by YOLOv3;
b_w is the X-direction width of the object bounding box predicted by YOLOv3;
b_h is the Y-direction height of the object bounding box predicted by YOLOv3;
c_x is the X-direction coordinate of the upper-left corner of the grid cell on the feature map;
c_y is the Y-direction coordinate of the upper-left corner of the grid cell on the feature map;
t_x is the X-coordinate offset value of the target object predicted by YOLOv3;
t_y is the Y-coordinate offset value of the target object predicted by YOLOv3;
t_w is the lateral scale scaling value of the target object predicted by YOLOv3;
t_h is the longitudinal scale scaling value of the target object predicted by YOLOv3;
p_w is the preset lateral dimension of the anchor box on the feature map;
p_h is the preset longitudinal dimension of the anchor box on the feature map;
σ is the sigmoid function; it can be used to compress t_x and t_y into the interval [0, 1], preventing excessive offset of the predicted center.
Preferably, the point cloud information of the target object can be calculated by traversing the pixel mask of the target object in the rectangular frame and combining the depth image; the calculation method can be shown as the following formula:
x_w = (u - u_0) · z_c / f
y_w = (v - v_0) · z_c / f
z_w = z_c
in the formula:
x_w is the X-direction coordinate of the object in the camera coordinate system;
y_w is the Y-direction coordinate of the object in the camera coordinate system;
z_w is the Z-direction coordinate of the object in the camera coordinate system;
z_c is the depth of the object in the camera coordinate system;
u is the horizontal coordinate of the pixel point in the pixel coordinate system;
v is the vertical coordinate of the pixel point in the pixel coordinate system;
u_0 is the horizontal pixel coordinate of the image center;
v_0 is the vertical pixel coordinate of the image center;
f is the focal length of the camera;
where u_0, v_0 and f are camera parameters which can be obtained by calibrating the camera.
Preferably, the RGB-D image information of the target object may be acquired with a vision sensor.
Preferably, the depth image of the target object may be acquired by a Kinect depth camera.
Preferably, the six-degree-of-freedom pose of the target object relative to the camera coordinate system can be calculated from the obtained X-, Y- and Z-axis coordinates of the target object, and the six-degree-of-freedom pose of the target object in the robot base coordinate system can then be obtained by combining it with the hand-eye calibration result.
The working process and working principle of the invention are further explained below with reference to the preferred embodiments of the invention:
the YOLOv3 target detection network is an algorithm with the strongest comprehensive performance in the current target detection, the algorithm is mature, the precision is high, the speed is high, and the method is well applied to the robot field and the unmanned field at present. In short, the Prior detection (Prior detection) system of YOLOv3 reuses the classifier or locator for performing the detection task. Applying the model to multiple locations and scales of the feature map improves the accuracy of the identification of small objects. And further performing boundary box regression on the anchor boxes with higher scores by a target scoring method. Furthermore, the network uses a completely different approach to other object detection methods. A neural network is applied to the entire image, which divides the image into different regions, thus predicting the bounding box and probability of each block region, which will be weighted by the predicted probability. In contrast to classifier-based systems, it looks at the entire image under test, so its prediction exploits global information in the image. Unlike R-CNN, which requires thousands of single target images, it predicts through a single network evaluation. This makes Yolov3 very Fast, typically 1000 times faster than R-CNN and 100 times faster than Fast R-CNN. It is also more accurate than SSD single-stage detectors and is about three times faster than SSD. In view of its excellent performance, and excellent real-time performance.
Firstly, acquiring position information and category information of a target object
In the invention, a model based on the YOLOv3 target detection network is adopted for target recognition. The recognized objects are various types of fruit (13 types in total, including bananas, apples, carambola, cherries, grapes and strawberries). The specific steps can be as follows:
1. Various fruits are photographed using a Kinect camera, and different combinations of fruits are also photographed.
2. A data set conforming to the YOLOv3 network is made and divided into a training set, a validation set and a test set at a ratio of 5:1:1.
3. Model training is carried out on the YOLOv3 target detection network using the training set and validation set. The input first passes through the deep CNN network Darknet-53, which performs down-sampling by changing the stride of its convolution kernels; at the same time, the up-sampled results of a middle layer and a later layer of the network are concatenated, so that three feature maps of different scales are obtained, enabling detection of objects of different sizes. This mainly improves the recognition and positioning accuracy for small objects such as strawberries and cherries.
4. The three feature maps are divided into small grids (cells) of corresponding sizes, and three boxes (bounding boxes) are predicted for each grid.
5. Before prediction, logistic regression is used to score the objectness of each box, i.e. how likely that position is to contain a target; unnecessary anchor boxes are eliminated and the best anchor box is selected for the subsequent bounding-box regression, which reduces the amount of computation.
6. Each box contains five basic parameters (x, y, w, h, confidence) and category information. From the network outputs (t_x, t_y, t_w, t_h, t_o), the (x, y, w, h, confidence) of the object can be calculated by formula 1 below:
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · exp(t_w)
b_h = p_h · exp(t_h)          (formula 1)
In formula 1:
b_x is the X-direction coordinate of the center point of the object bounding box predicted by YOLOv3;
b_y is the Y-direction coordinate of the center point of the object bounding box predicted by YOLOv3;
b_w is the X-direction width of the object bounding box predicted by YOLOv3;
b_h is the Y-direction height of the object bounding box predicted by YOLOv3;
c_x is the X-direction coordinate of the upper-left corner of the grid cell on the feature map;
c_y is the Y-direction coordinate of the upper-left corner of the grid cell on the feature map;
t_x is the X-coordinate offset value of the target object predicted by YOLOv3;
t_y is the Y-coordinate offset value of the target object predicted by YOLOv3;
t_w is the lateral scale scaling value of the target object predicted by YOLOv3;
t_h is the longitudinal scale scaling value of the target object predicted by YOLOv3;
p_w is the preset lateral dimension of the anchor box on the feature map;
p_h is the preset longitudinal dimension of the anchor box on the feature map;
σ is the sigmoid function.
7. A message is established in ROS, and the position information, category information and confidence of the objects in the picture recognized by YOLOv3 are published (an illustrative publishing sketch is given below).
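The following sketch (ROS 1 / rospy) illustrates such publishing; the topic name and the use of a plain JSON string instead of a custom message type are simplifying assumptions made here, not the message definition of the invention.

    import json
    import rospy
    from std_msgs.msg import String

    def publish_detections(detections):
        """Publish YOLOv3 detection results (class, box, confidence) on a ROS topic."""
        rospy.init_node('yolo_detection_publisher')
        pub = rospy.Publisher('/yolo/detections', String, queue_size=10)  # illustrative topic name
        rate = rospy.Rate(10)  # publish at 10 Hz
        while not rospy.is_shutdown():
            pub.publish(String(data=json.dumps(detections)))
            rate.sleep()

    if __name__ == '__main__':
        publish_detections([
            {"class": "apple", "confidence": 0.93, "box": [210, 145, 80, 78]},
            {"class": "banana", "confidence": 0.88, "box": [402, 160, 150, 60]},
        ])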
Secondly, acquiring a point cloud picture of the target object
After the picture is recognized by the YOLOv3 algorithm, the position information and category information of the objects in the picture are output. Combining the RGB image and the depth image of the picture, the specified regions are cropped from both, i.e. the target object is extracted from the RGB image and the depth image, and the point cloud of the region where the target object is located is computed from the pixel information of the RGB image and the depth information of the depth image. Before the point cloud is obtained, intrinsic calibration of the Kinect camera and registration of the depth image to the RGB image are required. The specific steps can be as follows:
1. Perform intrinsic calibration and RGB/depth image registration of the Kinect camera using a Kinect toolkit in the ROS system.
2. In ROS, subscribe to the RGB image and depth image topics of the Kinect camera and receive the published messages.
3. Extract the corresponding RGB image and depth image from the received ROS messages, obtaining the RGB image and depth image corresponding to the target object.
4. Traverse each point in the RGB image and the depth image and calculate its spatial position with formula (2), where u_0, v_0 and f are camera intrinsics and u and v are pixel coordinates. Add the obtained spatial coordinates of each pixel to the point cloud, thereby constructing the point cloud information of the target object (a vectorized sketch of this step follows the variable list below).
x_w = (u - u_0) · z_c / f
y_w = (v - v_0) · z_c / f
z_w = z_c          (formula 2)
In formula 2:
x_w is the X-direction coordinate of the object in the camera coordinate system;
y_w is the Y-direction coordinate of the object in the camera coordinate system;
z_w is the Z-direction coordinate of the object in the camera coordinate system;
z_c is the depth of the object in the camera coordinate system;
u is the horizontal coordinate of the pixel point in the pixel coordinate system;
v is the vertical coordinate of the pixel point in the pixel coordinate system;
u_0 is the horizontal pixel coordinate of the image center;
v_0 is the vertical pixel coordinate of the image center;
f is the focal length of the camera;
where u_0, v_0 and f are camera parameters which can be obtained by calibrating the camera.
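The following Python/NumPy sketch illustrates step 4 above in vectorized form, applying formula (2) to every pixel at once; the intrinsic values and the synthetic depth image are placeholders, not calibration results of the invention.

    import numpy as np

    def depth_to_point_cloud(depth, u0, v0, f, rgb=None):
        """Back-project a depth image (in meters) into camera-frame 3D points using
        x_w = (u - u0) * z_c / f, y_w = (v - v0) * z_c / f, z_w = z_c."""
        h, w = depth.shape
        u, v = np.meshgrid(np.arange(w), np.arange(h))
        z = depth
        x = (u - u0) * z / f
        y = (v - v0) * z / f
        points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
        valid = points[:, 2] > 0                   # drop pixels with no depth reading
        points = points[valid]
        if rgb is not None:
            colors = rgb.reshape(-1, 3)[valid]
            return points, colors
        return points

    # Placeholder intrinsics and a synthetic 480x640 depth image of a flat surface at 0.8 m.
    depth = np.full((480, 640), 0.8)
    cloud = depth_to_point_cloud(depth, u0=320.0, v0=240.0, f=525.0)
    print(cloud.shape)  # (307200, 3)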
Thirdly, estimating the six-degree-of-freedom pose of the target object based on PCA.
PCA (Principal Component Analysis) is one of the most widely used data dimensionality reduction algorithms. Its main idea is to map n-dimensional features onto k dimensions; these k dimensions are completely new orthogonal features, also called principal components, reconstructed from the original n-dimensional features. The PCA algorithm uses the covariance matrix to measure how dispersed the sample set is in different directions. The task of PCA is to find, in order, a set of mutually orthogonal axes in the original space, where the choice of the new axes depends strongly on the data itself. The first new axis is chosen as the direction of largest variance in the original data; the second axis is chosen, among directions orthogonal to the first, as the one with the largest variance; the third axis is the direction of largest variance among those orthogonal to the first two, and so on, until n such axes are obtained. The principal component analysis (PCA) method can therefore be used to extract the principal directions of the point cloud, giving the principal directions of the point cloud object. The obtained principal directions of the point cloud are then converted into the quaternion information required for grabbing by the mechanical arm. The specific steps can be as follows (a consolidated sketch of these steps is given after the list):
1. Denoise the obtained point cloud and filter out outliers and noise points.
2. Perform conditional filtering on the point cloud, removing the points of the plane on which the object rests and keeping the point cloud information of the object.
3. Sparsify the point cloud: using a voxel grid, keep one point per voxel to represent the others, thereby down-sampling the point cloud data; the voxel size can be adjusted to change the down-sampling ratio. This reduces the data volume of the point cloud while preserving the object's features, reducing the amount of computation and improving efficiency.
4. Perform PCA on the point cloud from the preceding steps to obtain the principal-direction coordinate system of the object point cloud.
5. Calculate the minimum bounding box of the point cloud data and the six-degree-of-freedom pose of the target point cloud in the camera coordinate system.
6. Transform the point cloud pose from the camera coordinate system into the robot coordinate system to obtain the six-degree-of-freedom pose of the target object in the robot coordinate system.
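A condensed Python sketch of steps 1-5 using the Open3D library is given below for illustration; the outlier-removal thresholds, plane-segmentation parameters and voxel size are assumptions, and the invention does not prescribe this particular library.

    import numpy as np
    import open3d as o3d

    def estimate_object_pose(points):
        """Steps 1-5: denoise, remove the supporting plane, down-sample, run PCA,
        and return the object's rotation, center and extents in the camera frame."""
        pcd = o3d.geometry.PointCloud()
        pcd.points = o3d.utility.Vector3dVector(points)

        # 1. Remove outliers / noise points (illustrative thresholds).
        pcd, _ = pcd.remove_statistical_outlier(nb_neighbors=20, std_ratio=2.0)

        # 2. Remove the plane the object rests on (RANSAC plane segmentation).
        _, inliers = pcd.segment_plane(distance_threshold=0.01,
                                       ransac_n=3, num_iterations=200)
        pcd = pcd.select_by_index(inliers, invert=True)

        # 3. Voxel-grid down-sampling: one representative point per 5 mm voxel.
        pcd = pcd.voxel_down_sample(voxel_size=0.005)

        # 4. PCA: eigenvectors of the covariance matrix give the principal directions.
        pts = np.asarray(pcd.points)
        center = pts.mean(axis=0)
        eigvals, eigvecs = np.linalg.eigh(np.cov((pts - center).T))
        R = eigvecs[:, ::-1]                  # columns ordered by decreasing variance
        if np.linalg.det(R) < 0:              # keep a right-handed frame
            R[:, 2] *= -1

        # 5. Extents in the principal frame give the bounding box of the object.
        extents = (pts - center) @ R
        size = extents.max(axis=0) - extents.min(axis=0)
        return R, center, size

The rotation matrix R and center returned here can then be converted to the quaternion expected by the mechanical arm and transformed into the robot base frame as described in step 6.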
The above-mentioned embodiments are intended only to illustrate the technical ideas and features of the present invention, and their purpose is to enable those skilled in the art to understand the contents of the present invention and carry it out; they do not limit the scope of the present invention, and any equivalent changes or modifications made within the spirit of the present invention shall fall within its scope.

Claims (9)

1. A robot autonomous classification grabbing method based on YOLOv3, characterized by comprising the steps of: collecting and constructing a sample data set of target objects; training a YOLOv3 target detection network with the sample data set to obtain a target object recognition model; collecting a color image and a depth image of the target object to be grabbed; processing the color image with the trained YOLOv3 target detection network to obtain the category information and position information of the target object to be grabbed, and further processing with the depth image to obtain the point cloud information of the target object to be grabbed; and computing the minimum bounding box of the point cloud information, calculating the principal direction of the point cloud with the PCA algorithm, calibrating the X-, Y- and Z-axis coordinate data of the target object to be grabbed, and calculating the six-degree-of-freedom pose of the target object to be grabbed relative to the robot coordinate system.
2. The YOLOv3-based robot autonomous classified grabbing method of claim 1, wherein the step of collecting and constructing the target object sample data set is as follows:
step a, using a Kinect camera to acquire images of various target objects and acquiring images of various target object combinations;
step b, constructing a sample data set conforming to the YOLOv3 target detection network and dividing it into a training set, a validation set and a test set at a ratio of 5:1:1.
3. The YOLOv3-based robot autonomous classification grabbing method according to claim 1, wherein the step of training the YOLOv3 target detection network is as follows: constructing a YOLOv3 target detection network comprising the Darknet-53 network; first inputting the sample data set into the Darknet-53 neural network, performing down-sampling by changing the stride of the convolution kernels in the Darknet-53 neural network, and at the same time concatenating the up-sampled results of a middle layer and an output layer of the YOLOv3 target detection network to obtain three feature maps of different scales.
4. The YOLOv3-based robot autonomous classified grabbing method according to claim 1, wherein, when processing image information of a combination of multiple target objects, category labeling and position labeling are performed on the objects in the image with an image annotation tool.
5. The YOLOv3-based robot autonomous classified grabbing method of claim 1, wherein the position information of the target object is represented by a rectangular box, calculated as follows:
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · exp(t_w)
b_h = p_h · exp(t_h)
in the formula:
b_x is the X-direction coordinate of the center point of the object bounding box predicted by YOLOv3;
b_y is the Y-direction coordinate of the center point of the object bounding box predicted by YOLOv3;
b_w is the X-direction width of the object bounding box predicted by YOLOv3;
b_h is the Y-direction height of the object bounding box predicted by YOLOv3;
c_x is the X-direction coordinate of the upper-left corner of the grid cell on the feature map;
c_y is the Y-direction coordinate of the upper-left corner of the grid cell on the feature map;
t_x is the X-coordinate offset value of the target object predicted by YOLOv3;
t_y is the Y-coordinate offset value of the target object predicted by YOLOv3;
t_w is the lateral scale scaling value of the target object predicted by YOLOv3;
t_h is the longitudinal scale scaling value of the target object predicted by YOLOv3;
p_w is the preset lateral dimension of the anchor box on the feature map;
p_h is the preset longitudinal dimension of the anchor box on the feature map;
σ is the sigmoid function.
6. The YOLOv3-based robot autonomous classification grabbing method according to claim 5, wherein the point cloud information of the target object is calculated by traversing the pixel mask of the target object in the rectangular box in combination with the depth image, using the following formula:
x_w = (u - u_0) · z_c / f
y_w = (v - v_0) · z_c / f
z_w = z_c
in the formula:
x_w is the X-direction coordinate of the object in the camera coordinate system;
y_w is the Y-direction coordinate of the object in the camera coordinate system;
z_w is the Z-direction coordinate of the object in the camera coordinate system;
z_c is the depth of the object in the camera coordinate system;
u is the horizontal coordinate of the pixel point in the pixel coordinate system;
v is the vertical coordinate of the pixel point in the pixel coordinate system;
u_0 is the horizontal pixel coordinate of the image center;
v_0 is the vertical pixel coordinate of the image center;
f is the focal length of the camera;
where u_0, v_0 and f are camera parameters obtained by calibrating the camera.
7. The YOLOv3-based robot autonomous classified grabbing method of claim 1, wherein a vision sensor is used to acquire RGB-D image information of the target object.
8. The YOLOv3-based robot autonomous classified grabbing method according to claim 1, wherein the depth image of the target object is acquired by a Kinect depth camera.
9. The YOLOv3-based robot autonomous classified grabbing method according to claim 7 or 8, wherein the six-degree-of-freedom pose of the target object relative to the camera coordinate system is calculated from the obtained X-, Y- and Z-axis coordinates of the target object, and the six-degree-of-freedom pose of the target object in the robot base coordinate system is obtained by combining the hand-eye calibration result.
CN201911159864.7A 2019-11-22 2019-11-22 Robot autonomous classification grabbing method based on YOLOv3 Pending CN111080693A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911159864.7A CN111080693A (en) 2019-11-22 2019-11-22 Robot autonomous classification grabbing method based on YOLOv3

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911159864.7A CN111080693A (en) 2019-11-22 2019-11-22 Robot autonomous classification grabbing method based on YOLOv3

Publications (1)

Publication Number Publication Date
CN111080693A true CN111080693A (en) 2020-04-28

Family

ID=70311401

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911159864.7A Pending CN111080693A (en) 2019-11-22 2019-11-22 Robot autonomous classification grabbing method based on YOLOv3

Country Status (1)

Country Link
CN (1) CN111080693A (en)

Cited By (39)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111553949A (en) * 2020-04-30 2020-08-18 张辉 Positioning and grabbing method for irregular workpiece based on single-frame RGB-D image deep learning
CN111598172A (en) * 2020-05-18 2020-08-28 东北大学 Dynamic target grabbing posture rapid detection method based on heterogeneous deep network fusion
CN111645080A (en) * 2020-05-08 2020-09-11 覃立万 Intelligent service robot hand-eye cooperation system and operation method
CN111783537A (en) * 2020-05-29 2020-10-16 哈尔滨莫迪科技有限责任公司 Two-stage rapid grabbing detection method based on target detection characteristics
CN111975783A (en) * 2020-08-31 2020-11-24 广东工业大学 Robot grabbing detection method and system
CN112070736A (en) * 2020-09-01 2020-12-11 上海电机学院 Object volume vision measurement method combining target detection and depth calculation
CN112183485A (en) * 2020-11-02 2021-01-05 北京信息科技大学 Deep learning-based traffic cone detection positioning method and system and storage medium
CN112396655A (en) * 2020-11-18 2021-02-23 哈尔滨工程大学 Point cloud data-based ship target 6D pose estimation method
CN112529948A (en) * 2020-12-25 2021-03-19 南京林业大学 Mature pomegranate positioning method based on Mask R-CNN and 3-dimensional sphere fitting
CN112614182A (en) * 2020-12-21 2021-04-06 广州熙锐自动化设备有限公司 Method for identifying machining position based on deep learning, storage device and mobile terminal
CN112733640A (en) * 2020-12-29 2021-04-30 武汉中海庭数据技术有限公司 Traffic indicator lamp positioning and extracting method and system based on point cloud high-precision map
CN112720477A (en) * 2020-12-22 2021-04-30 泉州装备制造研究所 Object optimal grabbing and identifying method based on local point cloud model
CN112750163A (en) * 2021-01-19 2021-05-04 武汉理工大学 Port ship shore power connection method and system and computer readable storage medium
CN112884825A (en) * 2021-03-19 2021-06-01 清华大学 Deep learning model-based grabbing method and device
CN112927297A (en) * 2021-02-20 2021-06-08 华南理工大学 Target detection and visual positioning method based on YOLO series
CN112936275A (en) * 2021-02-05 2021-06-11 华南理工大学 Mechanical arm grabbing system based on depth camera and control method
CN113111712A (en) * 2021-03-11 2021-07-13 稳健医疗用品股份有限公司 AI identification positioning method, system and device for bagged product
CN113129449A (en) * 2021-04-16 2021-07-16 浙江孔辉汽车科技有限公司 Vehicle pavement feature recognition and three-dimensional reconstruction method based on binocular vision
CN113246140A (en) * 2021-06-22 2021-08-13 沈阳风驰软件股份有限公司 Multi-model workpiece disordered grabbing method and device based on camera measurement
CN113284129A (en) * 2021-06-11 2021-08-20 梅卡曼德(北京)机器人科技有限公司 Box pressing detection method and device based on 3D bounding box
CN113524194A (en) * 2021-04-28 2021-10-22 重庆理工大学 Target grabbing method of robot vision grabbing system based on multi-mode feature deep learning
CN113537096A (en) * 2021-07-21 2021-10-22 常熟理工学院 ROS-based AGV forklift storage tray identification and auxiliary positioning method and system
CN113627478A (en) * 2021-07-08 2021-11-09 深圳市优必选科技股份有限公司 Target detection method, target detection device and robot
CN113723389A (en) * 2021-08-30 2021-11-30 广东电网有限责任公司 Method and device for positioning strut insulator
CN113723217A (en) * 2021-08-09 2021-11-30 南京邮电大学 Object intelligent detection method and system based on yolo improvement
CN113808205A (en) * 2021-08-31 2021-12-17 华南理工大学 Rapid dynamic target grabbing method based on detection constraint
CN113984037A (en) * 2021-09-30 2022-01-28 电子科技大学长三角研究院(湖州) Semantic map construction method based on target candidate box in any direction
CN114170521A (en) * 2022-02-11 2022-03-11 杭州蓝芯科技有限公司 Forklift pallet butt joint identification positioning method
CN114683251A (en) * 2022-03-31 2022-07-01 上海节卡机器人科技有限公司 Robot grabbing method and device, electronic equipment and readable storage medium
CN114723827A (en) * 2022-04-28 2022-07-08 哈尔滨理工大学 Grabbing robot target positioning system based on deep learning
CN114897999A (en) * 2022-04-29 2022-08-12 美的集团(上海)有限公司 Object pose recognition method, electronic device, storage medium, and program product
CN114926527A (en) * 2022-06-08 2022-08-19 哈尔滨理工大学 Mechanical arm grabbing pose detection method under complex background
CN115170911A (en) * 2022-09-06 2022-10-11 浙江大学湖州研究院 Human body key part positioning system and method based on image recognition
CN115249333A (en) * 2021-06-29 2022-10-28 达闼科技(北京)有限公司 Grab network training method and system, electronic equipment and storage medium
CN115272791A (en) * 2022-07-22 2022-11-01 仲恺农业工程学院 Multi-target detection positioning method for tea based on YoloV5
CN115578608A (en) * 2022-12-12 2023-01-06 南京慧尔视智能科技有限公司 Anti-interference classification method and device based on millimeter wave radar point cloud
CN115922738A (en) * 2023-03-09 2023-04-07 季华实验室 Electronic component grabbing method, device, equipment and medium in stacking scene
CN116596996A (en) * 2023-05-26 2023-08-15 河北农业大学 Method and system for acquiring spatial pose information of apple fruits
WO2023165161A1 (en) * 2022-05-09 2023-09-07 青岛理工大学 Multi-task convolution-based object grasping and positioning identification algorithm and system, and robot

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108171748A (en) * 2018-01-23 2018-06-15 哈工大机器人(合肥)国际创新研究院 A kind of visual identity of object manipulator intelligent grabbing application and localization method
CN109102547A (en) * 2018-07-20 2018-12-28 上海节卡机器人科技有限公司 Robot based on object identification deep learning model grabs position and orientation estimation method
CN109635697A (en) * 2018-12-04 2019-04-16 国网浙江省电力有限公司电力科学研究院 Electric operating personnel safety dressing detection method based on YOLOv3 target detection
CN110363158A (en) * 2019-07-17 2019-10-22 浙江大学 A kind of millimetre-wave radar neural network based cooperates with object detection and recognition method with vision

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108171748A (en) * 2018-01-23 2018-06-15 哈工大机器人(合肥)国际创新研究院 A kind of visual identity of object manipulator intelligent grabbing application and localization method
CN109102547A (en) * 2018-07-20 2018-12-28 上海节卡机器人科技有限公司 Robot based on object identification deep learning model grabs position and orientation estimation method
CN109635697A (en) * 2018-12-04 2019-04-16 国网浙江省电力有限公司电力科学研究院 Electric operating personnel safety dressing detection method based on YOLOv3 target detection
CN110363158A (en) * 2019-07-17 2019-10-22 浙江大学 A kind of millimetre-wave radar neural network based cooperates with object detection and recognition method with vision

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIU WANMENG; TONG CHUANGMING; WANG TONG; PENG PENG: "Analysis of wideband radar sea clutter characteristics based on an electromagnetic scattering model" *
JIAO TIANCHI et al.: "Object detection method combining inverted residual blocks and YOLOv3" *

Cited By (53)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111553949A (en) * 2020-04-30 2020-08-18 张辉 Positioning and grabbing method for irregular workpiece based on single-frame RGB-D image deep learning
CN111645080A (en) * 2020-05-08 2020-09-11 覃立万 Intelligent service robot hand-eye cooperation system and operation method
CN111598172A (en) * 2020-05-18 2020-08-28 东北大学 Dynamic target grabbing posture rapid detection method based on heterogeneous deep network fusion
CN111598172B (en) * 2020-05-18 2023-08-29 东北大学 Dynamic target grabbing gesture rapid detection method based on heterogeneous depth network fusion
CN111783537A (en) * 2020-05-29 2020-10-16 哈尔滨莫迪科技有限责任公司 Two-stage rapid grabbing detection method based on target detection characteristics
CN111975783B (en) * 2020-08-31 2021-09-03 广东工业大学 Robot grabbing detection method and system
CN111975783A (en) * 2020-08-31 2020-11-24 广东工业大学 Robot grabbing detection method and system
CN112070736A (en) * 2020-09-01 2020-12-11 上海电机学院 Object volume vision measurement method combining target detection and depth calculation
CN112070736B (en) * 2020-09-01 2023-02-24 上海电机学院 Object volume vision measurement method combining target detection and depth calculation
CN112183485A (en) * 2020-11-02 2021-01-05 北京信息科技大学 Deep learning-based traffic cone detection positioning method and system and storage medium
CN112183485B (en) * 2020-11-02 2024-03-05 北京信息科技大学 Deep learning-based traffic cone detection positioning method, system and storage medium
CN112396655A (en) * 2020-11-18 2021-02-23 哈尔滨工程大学 Point cloud data-based ship target 6D pose estimation method
CN112614182A (en) * 2020-12-21 2021-04-06 广州熙锐自动化设备有限公司 Method for identifying machining position based on deep learning, storage device and mobile terminal
CN112614182B (en) * 2020-12-21 2023-04-28 广州熙锐自动化设备有限公司 Deep learning-based method for identifying machining position, storage device and mobile terminal
CN112720477A (en) * 2020-12-22 2021-04-30 泉州装备制造研究所 Object optimal grabbing and identifying method based on local point cloud model
CN112720477B (en) * 2020-12-22 2024-01-30 泉州装备制造研究所 Object optimal grabbing and identifying method based on local point cloud model
CN112529948A (en) * 2020-12-25 2021-03-19 南京林业大学 Mature pomegranate positioning method based on Mask R-CNN and 3-dimensional sphere fitting
CN112733640A (en) * 2020-12-29 2021-04-30 武汉中海庭数据技术有限公司 Traffic indicator lamp positioning and extracting method and system based on point cloud high-precision map
CN112750163A (en) * 2021-01-19 2021-05-04 武汉理工大学 Port ship shore power connection method and system and computer readable storage medium
CN112936275A (en) * 2021-02-05 2021-06-11 华南理工大学 Mechanical arm grabbing system based on depth camera and control method
CN112927297A (en) * 2021-02-20 2021-06-08 华南理工大学 Target detection and visual positioning method based on YOLO series
CN113111712A (en) * 2021-03-11 2021-07-13 稳健医疗用品股份有限公司 AI identification positioning method, system and device for bagged product
CN112884825A (en) * 2021-03-19 2021-06-01 清华大学 Deep learning model-based grabbing method and device
CN112884825B (en) * 2021-03-19 2022-11-04 清华大学 Deep learning model-based grabbing method and device
CN113129449A (en) * 2021-04-16 2021-07-16 浙江孔辉汽车科技有限公司 Vehicle pavement feature recognition and three-dimensional reconstruction method based on binocular vision
CN113524194A (en) * 2021-04-28 2021-10-22 重庆理工大学 Target grabbing method of robot vision grabbing system based on multi-mode feature deep learning
CN113284129A (en) * 2021-06-11 2021-08-20 梅卡曼德(北京)机器人科技有限公司 Box pressing detection method and device based on 3D bounding box
CN113246140B (en) * 2021-06-22 2021-10-15 沈阳风驰软件股份有限公司 Multi-model workpiece disordered grabbing method and device based on camera measurement
CN113246140A (en) * 2021-06-22 2021-08-13 沈阳风驰软件股份有限公司 Multi-model workpiece disordered grabbing method and device based on camera measurement
CN115249333A (en) * 2021-06-29 2022-10-28 达闼科技(北京)有限公司 Grab network training method and system, electronic equipment and storage medium
CN113627478A (en) * 2021-07-08 2021-11-09 深圳市优必选科技股份有限公司 Target detection method, target detection device and robot
CN113537096A (en) * 2021-07-21 2021-10-22 常熟理工学院 ROS-based AGV forklift storage tray identification and auxiliary positioning method and system
CN113537096B (en) * 2021-07-21 2023-08-15 常熟理工学院 AGV forklift warehouse position tray identification and auxiliary positioning method and system based on ROS
CN113723217A (en) * 2021-08-09 2021-11-30 南京邮电大学 Object intelligent detection method and system based on yolo improvement
CN113723389A (en) * 2021-08-30 2021-11-30 广东电网有限责任公司 Method and device for positioning strut insulator
CN113808205A (en) * 2021-08-31 2021-12-17 华南理工大学 Rapid dynamic target grabbing method based on detection constraint
CN113808205B (en) * 2021-08-31 2023-07-18 华南理工大学 Rapid dynamic target grabbing method based on detection constraint
CN113984037A (en) * 2021-09-30 2022-01-28 电子科技大学长三角研究院(湖州) Semantic map construction method based on target candidate box in any direction
CN113984037B (en) * 2021-09-30 2023-09-12 电子科技大学长三角研究院(湖州) Semantic map construction method based on target candidate frame in any direction
CN114170521A (en) * 2022-02-11 2022-03-11 杭州蓝芯科技有限公司 Forklift pallet butt joint identification positioning method
CN114683251A (en) * 2022-03-31 2022-07-01 上海节卡机器人科技有限公司 Robot grabbing method and device, electronic equipment and readable storage medium
CN114723827A (en) * 2022-04-28 2022-07-08 哈尔滨理工大学 Grabbing robot target positioning system based on deep learning
CN114897999A (en) * 2022-04-29 2022-08-12 美的集团(上海)有限公司 Object pose recognition method, electronic device, storage medium, and program product
CN114897999B (en) * 2022-04-29 2023-12-08 美的集团(上海)有限公司 Object pose recognition method, electronic device, storage medium, and program product
WO2023165161A1 (en) * 2022-05-09 2023-09-07 青岛理工大学 Multi-task convolution-based object grasping and positioning identification algorithm and system, and robot
CN114926527A (en) * 2022-06-08 2022-08-19 哈尔滨理工大学 Mechanical arm grabbing pose detection method under complex background
CN115272791A (en) * 2022-07-22 2022-11-01 仲恺农业工程学院 Multi-target detection positioning method for tea based on YoloV5
CN115170911A (en) * 2022-09-06 2022-10-11 浙江大学湖州研究院 Human body key part positioning system and method based on image recognition
CN115578608B (en) * 2022-12-12 2023-02-28 南京慧尔视智能科技有限公司 Anti-interference classification method and device based on millimeter wave radar point cloud
CN115578608A (en) * 2022-12-12 2023-01-06 南京慧尔视智能科技有限公司 Anti-interference classification method and device based on millimeter wave radar point cloud
CN115922738A (en) * 2023-03-09 2023-04-07 季华实验室 Electronic component grabbing method, device, equipment and medium in stacking scene
CN116596996A (en) * 2023-05-26 2023-08-15 河北农业大学 Method and system for acquiring spatial pose information of apple fruits
CN116596996B (en) * 2023-05-26 2024-01-30 河北农业大学 Method and system for acquiring spatial pose information of apple fruits

Similar Documents

Publication Publication Date Title
CN111080693A (en) Robot autonomous classification grabbing method based on YOLOv3
CN111340797B (en) Laser radar and binocular camera data fusion detection method and system
CN109344701B (en) Kinect-based dynamic gesture recognition method
CN109583483B (en) Target detection method and system based on convolutional neural network
CN114202672A (en) Small target detection method based on attention mechanism
CN111462120B (en) Defect detection method, device, medium and equipment based on semantic segmentation model
CN107545263B (en) Object detection method and device
CN109447979B (en) Target detection method based on deep learning and image processing algorithm
CN112836734A (en) Heterogeneous data fusion method and device and storage medium
CN108711172B (en) Unmanned aerial vehicle identification and positioning method based on fine-grained classification
CN111738344A (en) Rapid target detection method based on multi-scale fusion
CN111553949A (en) Positioning and grabbing method for irregular workpiece based on single-frame RGB-D image deep learning
CN115439458A (en) Industrial image defect target detection algorithm based on depth map attention
CN111598172B (en) Dynamic target grabbing gesture rapid detection method based on heterogeneous depth network fusion
CN115330734A (en) Automatic robot repair welding system based on three-dimensional target detection and point cloud defect completion
CN111008576A (en) Pedestrian detection and model training and updating method, device and readable storage medium thereof
CN110909656B (en) Pedestrian detection method and system integrating radar and camera
CN116071315A (en) Product visual defect detection method and system based on machine vision
CN111626241A (en) Face detection method and device
CN113487610B (en) Herpes image recognition method and device, computer equipment and storage medium
CN111507249A (en) Transformer substation nest identification method based on target detection
CN113076889B (en) Container lead seal identification method, device, electronic equipment and storage medium
CN114331961A (en) Method for defect detection of an object
CN113780145A (en) Sperm morphology detection method, sperm morphology detection device, computer equipment and storage medium
CN111709269B (en) Human hand segmentation method and device based on two-dimensional joint information in depth image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 2020-04-28