CN114155301A - Robot target positioning and grabbing method based on Mask R-CNN and binocular camera - Google Patents

Robot target positioning and grabbing method based on Mask R-CNN and binocular camera

Info

Publication number
CN114155301A
Authority
CN
China
Prior art keywords
target
mask
target object
image
camera
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111401496.XA
Other languages
Chinese (zh)
Inventor
周登科
史凯特
汤鹏
于傲
郑开元
张亚平
李哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Three Gorges Corp
Original Assignee
China Three Gorges Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Three Gorges Corp filed Critical China Three Gorges Corp
Priority to CN202111401496.XA
Publication of CN114155301A
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/80 Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/12 Edge-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/194 Segmentation; Edge detection involving foreground-background segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras

Abstract

A robot target positioning and grabbing method based on Mask R-CNN and a binocular camera comprises the following steps: step 1: calibrating the camera; step 2: identifying and segmenting the target; step 3: positioning the target; step 4: calculating the pose of the target; step 5: grabbing the target. In step 2, the RGB image collected by the binocular camera is fed into a pre-trained convolutional neural network Mask R-CNN model, which outputs the position and the Mask of the target object in the image. In step 3, the RGB image and the DEPTH image collected by the camera are aligned, and the average distance from the pixel points in the Mask area of the target object to the binocular camera is taken as the distance from the target object to the camera. In step 4, the pixel coordinates and depth information are converted into the robot base coordinate system through robot hand-eye calibration, the angle of each joint is solved through robot inverse kinematics, and the robot is driven to move to complete the grabbing and carrying task.

Description

Robot target positioning and grabbing method based on Mask R-CNN and binocular camera
Technical Field
The invention belongs to the technical field of robots, and particularly relates to a robot target positioning and grabbing method based on Mask R-CNN and a binocular camera.
Background
A robot is a bionic intelligent machine that generally has moving, sensing, acting and coordinating capabilities; equipped with a perception camera and a mechanical arm, it can complete specific recognition and action tasks and can recognize and grab specified target objects. The common target grabbing method at present is to search for the target with a camera carried by the robot, extract the target object by template matching or another image processing method once the target is found, calculate the distance between the target object and the camera with a binocular camera or a lidar carried by the robot, and finally control the mechanical arm to grab the target object.
In this method, different sensors are used separately for detecting and ranging the target object, which increases the equipment requirements and the robot's load. In target detection, traditional image processing methods have large errors and cannot accurately determine the edge information of the target object; their processing speed is also low, so the action of the mechanical arm is delayed and the requirement of grabbing the target in real time is difficult to meet. In target ranging, the background information is complex and noise points are not completely filtered out, so the distance measurement precision is poor.
Disclosure of Invention
The invention aims to solve the technical problems of large target detection error and low distance measurement precision of existing robots in complex environments, and provides a method that enables the robot to accurately position and grab a target object and reduces the error rate of the robot's mechanical arm in grabbing the target object.
A robot target positioning and grabbing method based on Mask R-CNN and a binocular camera comprises the following steps:
step 1: calibrating the camera;
step 2: identifying and segmenting the target;
step 3: positioning a target;
step 4: calculating the pose of the target;
step 5: grabbing the target;
in step 2, the RGB image collected by the binocular camera is used as an input image and sent into a pre-trained convolutional neural network Mask R-CNN model, a detection frame and a Mask of a target object in the image are output through the model, a target area is extracted by carrying out pixel point segmentation on the target object, and background interference information is filtered.
In step 2, identifying and segmenting the target object by using a Mask R-CNN model, wherein the model construction and identification steps are as follows:
2-1) acquiring a target object data set, and acquiring images of the target object from different environments, different angles, different brightness and different postures according to the type of the target object to be captured;
2-2) data enhancement: because few data set samples are available, the data set is expanded by combining traditional image geometric transformation data enhancement with generative data enhancement using a GAN; for the traditional geometric transformation method, the acquired data set is expanded through operations such as brightness transformation, noise addition, random cropping, horizontal flipping, image tilting, rotation and image scaling;
2-3) labeling the target data set, and labeling the acquired image by using an image labeling tool;
2-4) optimizing the Mask R-CNN network model;
2-5) carrying out model transfer learning training: using a transfer learning method, the self-made data set is loaded into the optimized network model together with a model pre-trained on the COCO data set so as to speed up model convergence, and the model is iteratively trained with parameter optimization to generate the target detection and segmentation model;
2-6) detecting and segmenting the target, intercepting an RGB image from a video stream shot by a binocular camera, transmitting the RGB image into a Mask R-CNN model, identifying the type and the position of a target object to be captured through the model, segmenting the target, and outputting a Mask region of the target.
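For illustration, a minimal inference sketch in Python is given below; it uses torchvision's stock maskrcnn_resnet50_fpn as a stand-in for the optimized network described in step 2-4) (the patent's model uses a trimmed backbone and modified anchors), and the weight file name, class count and score threshold are assumptions rather than values from the patent.

```python
import torch
import torchvision
from torchvision.transforms import functional as F

def detect_targets(rgb_image, weights_path="target_maskrcnn.pth", score_thr=0.7):
    """Return boxes, binary masks and labels for target objects in one RGB frame."""
    # num_classes=2: background + a single grabbed-object class (an assumption).
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(num_classes=2)
    model.load_state_dict(torch.load(weights_path, map_location="cpu"))
    model.eval()

    tensor = F.to_tensor(rgb_image)            # HxWx3 uint8 -> 3xHxW float in [0, 1]
    with torch.no_grad():
        output = model([tensor])[0]

    keep = output["scores"] > score_thr        # drop low-confidence detections
    boxes = output["boxes"][keep]
    masks = output["masks"][keep, 0] > 0.5     # binarize per-pixel mask probabilities
    return boxes, masks, output["labels"][keep]
```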
In the step 2-4), when optimizing the Mask R-CNN network model, the following steps are adopted:
(1) modifying the feature extraction network, and improving the speed of target identification by reducing network layers;
(2) modifying the RPN region proposal network: the anchor sizes are modified so that the model concentrates its computation on specified proportions, anchor boxes exceeding the size of the original image are eliminated, and the remainder are further screened by non-maximum suppression (NMS) to obtain the regions of interest;
(3) modifying the loss function. The Mask R-CNN loss function is

L = L_cls + L_box + L_mask

where L_cls is the classification loss function, L_box is the detection loss function and L_mask is the segmentation loss function. A boundary loss function is added to L_mask, using a distance loss to regularize the position, shape and continuity of the segmentation so that it lies closer to the target boundary. The optimized loss function L_mask-edge is

L_mask-edge = L_mask + α · L_edge

where L_edge is the boundary loss function, y denotes the annotated target edge, ŷ denotes the predicted boundary, α is a weight coefficient, B is the boundary of the segmentation result, and M_dist is the distance transform of the ground-truth segmentation boundary.
In step 1, calibrating a camera to acquire three-dimensional spatial position information through two-dimensional image information; the method specifically comprises the following steps:
1-1) making a calibration plate;
1-2) collecting images, changing the position and the angle of a calibration plate relative to a camera, and shooting a plurality of pictures of the calibration plate from different angles, different positions and different postures by using the camera to be calibrated;
1-3) detecting the calibration board corner points to obtain their pixel coordinate values, and calculating their physical coordinate values from the known checkerboard square size and the origin of the world coordinate system;
1-4) solving the internal parameters and the external parameters of the camera.
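A rough sketch of steps 1-1) to 1-4) using OpenCV's Zhang-style checkerboard calibration is shown below; the board dimensions, square size and image folder are illustrative assumptions.

```python
import glob
import cv2
import numpy as np

pattern = (9, 6)         # inner corners per row/column of the checkerboard (assumed)
square_size = 0.025      # checkerboard square edge in metres (assumed)

# Physical corner coordinates in the board's own plane (Z = 0).
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * square_size

obj_points, img_points = [], []
for path in glob.glob("calib_images/*.png"):         # assumed image folder
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        corners = cv2.cornerSubPix(
            gray, corners, (11, 11), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
        obj_points.append(objp)
        img_points.append(corners)

# Intrinsics (camera matrix, distortion) and per-view extrinsics (R, t).
ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
```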
In step 3, when positioning the target, the DEPTH image acquired by the binocular camera is aligned with the RGB image, and the average distance from the pixel points of the target Mask area in the RGB image to the infrared lens of the binocular camera is calculated to obtain the distance from the target object to the camera. Meanwhile, the width of the target is measured using the binocular camera ranging principle, and it is judged whether the opening width of the mechanical arm's gripper can grasp the target object.
In step 3, the method specifically comprises the following steps:
3-1) acquiring an RGB image and a DEPTH image;
3-2) aligning the DEPTH image and the RGB image, so that pixel points in the RGB image correspond to the DEPTH image target points one by one;
3-3) filtering the background information of the target image, and filtering the background area of the image according to the Mask area of the target object output in the step 2-4);
3-4) calculating the target distance: the distance from each pixel point in the Mask area of the target object to the binocular camera is calculated, and the average of these pixel distances is taken as the distance from the target object to the camera;
3-5) calculating the target width: the maximum width of the target object edge in the DEPTH image, corresponding to the Mask region edge pixels, is taken as the width of the target object, and the calculated width value is used to judge whether the target is within the grabbing range of the mechanical arm.
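A minimal sketch of steps 3-4) and 3-5) follows, assuming the DEPTH image is already aligned to the RGB image and stored in millimetres; the focal length argument and the gripper opening limit are illustrative assumptions.

```python
import numpy as np

def target_distance_and_width(depth_mm, mask, fx, max_gripper_width_mm=85.0):
    """Average depth over the Mask area and estimate target width from its pixel extent."""
    valid = mask & (depth_mm > 0)                  # ignore holes / invalid depth pixels
    distance = float(depth_mm[valid].mean())       # mean distance of Mask pixels, in mm

    cols = np.where(valid.any(axis=0))[0]          # leftmost and rightmost Mask columns
    pixel_width = cols[-1] - cols[0] + 1
    width = pixel_width * distance / fx            # pinhole model: width = w_px * Z / fx

    # max_gripper_width_mm is an assumed gripper opening, not a value from the patent.
    return distance, width, width <= max_gripper_width_mm
```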
In step 4, to grab the target object, the pose of the target object in the camera coordinate system must be converted into its pose in the robot arm base coordinate system. This conversion is realized through hand-eye calibration: a nine-point calibration method determines the coordinate conversion relation between the camera coordinate system and the robot base coordinate system, so that the position of the target workpiece relative to the robot base coordinate system can be calculated and the robot can be guided to grab it.
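The coordinate conversion can be sketched as follows, assuming the 4x4 camera-to-base homogeneous transform produced by the hand-eye calibration is already available; the function and variable names are illustrative, not from the patent.

```python
import numpy as np

def pixel_to_base(u, v, depth_m, K, T_base_cam):
    """Back-project a pixel with depth into the camera frame, then map it into the robot base frame."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    # Pinhole back-projection: pixel coordinates + depth -> 3D point in camera coordinates.
    p_cam = np.array([(u - cx) * depth_m / fx,
                      (v - cy) * depth_m / fy,
                      depth_m,
                      1.0])
    # T_base_cam is the assumed 4x4 homogeneous transform from hand-eye calibration.
    return (T_base_cam @ p_cam)[:3]
```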
In step 5, the robot is driven to the area where the target is located, and the distance between the robot and the target object is adjusted so that the target object can be captured by the gripper at the end of the mechanical arm; the angle of each joint is solved through robot inverse kinematics, and finally the target object is grabbed by controlling the rotation angles of the mechanical arm joints.
Compared with the prior art, the invention has the following technical effects:
1) the robot simultaneously acquires the RGB image and the DEPTH image of the target object through the binocular camera, and the positioning and grabbing of the robot mechanical arm on the target object are realized by simultaneously processing the RGB image and the DEPTH image;
2) in target object identification and segmentation, an improved Mask R-CNN-based target identification and segmentation method is provided, and the speed and precision of target identification are improved by modifying the model's feature extraction network, region proposal network and loss function.
3) in target positioning, an algorithm fusing the binocular camera with deep learning is provided: automatic identification and pixel-level segmentation of the target object are achieved, background noise information is filtered out through the Mask segmentation, and the distance information of the pixel points in the Mask area is calculated in combination with the DEPTH image, which improves target positioning precision and supports accurate grabbing by the subsequent mechanical arm.
Drawings
The invention is further illustrated by the following examples in conjunction with the accompanying drawings:
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a flowchart of a method for constructing a Mask R-CNN model according to the present invention;
FIG. 3 is a flowchart of a depth camera-based target ranging method according to the present invention.
Detailed Description
As shown in the flowchart of fig. 1, the robot target positioning and grabbing method based on Mask R-CNN and a binocular camera specifically includes the following steps:
Step 1, calibrating the camera. Three-dimensional spatial position information can be recovered from two-dimensional image information only when the camera's internal and external parameters are known, so these parameters are obtained by calibrating the camera. The internal and external parameters of the camera can be obtained directly with Zhang Zhengyou's calibration method, as shown in the following formula:
s [u v 1]^T = A [R T] [X_w Y_w Z_w 1]^T

A =
| α  γ  u_0 |
| 0  β  v_0 |
| 0  0   1  |

where A is the internal parameter matrix of the camera, R is the rotation matrix and T the translation vector of the external parameters, α = f/dx, β = f/dy, f is the focal length, dx and dy are the width and height of a pixel respectively, γ represents the skew of the pixel axes in the x and y directions, and (u_0, v_0) is the reference point (principal point).
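As a small numeric illustration of this pinhole model (not calibration values from the patent), the following sketch projects a 3D point in the camera frame to pixel coordinates with an assumed intrinsic matrix A.

```python
import numpy as np

alpha, beta, gamma = 615.0, 615.0, 0.0   # f/dx, f/dy and skew (assumed values)
u0, v0 = 320.0, 240.0                    # principal point (assumed values)
A = np.array([[alpha, gamma, u0],
              [0.0,   beta,  v0],
              [0.0,   0.0,   1.0]])

P_cam = np.array([0.10, -0.05, 0.80])    # 3D point in camera coordinates, in metres (assumed)
uvw = A @ P_cam                          # homogeneous image coordinates
u, v = uvw[:2] / uvw[2]                  # perspective division gives the pixel coordinates
```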
Step 2, identifying and segmenting the target. The method uses a RealSense D435i structured light camera as the robot's visual perception camera. The RealSense D435i simultaneously acquires RGB images and DEPTH images; the RGB image is input into the pre-trained convolutional neural network Mask R-CNN model, which outputs the position of the target object in the image, and pixel-point segmentation of the target object extracts the target area. The flow of the Mask R-CNN model construction method provided by the invention is shown in figure 2.
Step 3, positioning the target. The DEPTH image is aligned with the RGB image, and the average distance from the pixel points of the target Mask area in the RGB image to the infrared lens of the binocular camera is calculated to obtain the distance of the target object. The flow of the target positioning method based on the RealSense binocular camera is shown in fig. 3.
Step 4, calculating the pose of the target. To know the coordinate position of the target object relative to the robot arm, the conversion between the robot coordinate system and the target object coordinate system must first be established. A spatial coordinate system is established with the shoulder of the robot mechanical arm as the coordinate origin, and the coordinate values of the target object in the camera-centred coordinate system are converted into coordinate values in this coordinate system. Hand-eye calibration is performed with a nine-point calibration method to determine the coordinate conversion relation between the camera coordinate system and the robot base coordinate system, so that the position of the target workpiece relative to the robot base coordinate system can be calculated.
Step 5, grabbing the target. The mobile robot reaches the area where the target is located and the distance between the robot and the target object is adjusted so that the target object lies within the capture range of the gripper at the end of the mechanical arm; the relation between every two links, or their pose relative to the base reference coordinate system, is described by the homogeneous transformation matrices of the link coordinate systems. In this implementation an eye-in-hand system is adopted: the transformation matrix from the camera coordinate system to the manipulator end-effector coordinate system is obtained by the corresponding hand-eye calibration, the angles of all joints are solved through robot inverse kinematics, and finally the target object is grabbed by controlling the rotation angles of the joints of the UR manipulator.
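A hedged outline of this grabbing step is sketched below; solve_ik and robot are hypothetical stand-ins for the UR arm's inverse-kinematics solver and driver interface, not APIs named in the patent.

```python
import numpy as np

def grasp_target(target_base_xyz, solve_ik, robot, approach_height=0.10):
    """Move above the target in the base frame, descend to it, then close the gripper."""
    target = np.asarray(target_base_xyz, dtype=float)
    pre_grasp = target + np.array([0.0, 0.0, approach_height])   # approach from above (assumed strategy)
    for point in (pre_grasp, target):
        joint_angles = solve_ik(point)     # hypothetical inverse-kinematics call
        robot.move_joints(joint_angles)    # hypothetical UR driver call
    robot.close_gripper()                  # hypothetical gripper command
```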
In step 2, the Mask R-CNN model construction and detection specifically comprise the following steps:
Step 2.1, acquiring the target object data set: images of the target object are collected in different environments and at different angles, brightness levels and postures according to the type of target object to be grabbed.
Step 2.2, data enhancement: because few data set samples are available, the data set is expanded by combining traditional image geometric transformation with generative data enhancement using a GAN; for the traditional geometric transformations, the acquired data set is expanded through operations such as brightness transformation, noise addition, random cropping, horizontal flipping, image tilting, rotation and image scaling.
Step 2.3, labeling the target data set: the acquired images are annotated with the image labeling tool LabelMe; the purpose of labeling is to improve the model's detection precision on the target object through supervised training.
Step 2.4, optimizing the Mask R-CNN network model:
(1) The feature extraction network is changed to ResNet50. Because the binocular camera carried by the robot is close to the target object and the object is clear, a deeper feature extraction network is not needed, and reducing the number of network layers increases the speed of target identification.
(2) The RPN region proposal network is modified. The anchor box sizes are set to 32 × 32, 64 × 64, 128 × 128, 256 × 256 and 512 × 512 so that the network can adapt to target items of more shapes, and the aspect ratios are modified to 1:1, 1:2 and 1:3. Considering the opening width of the mechanical gripper and the fact that most of the grabbed targets are upright articles, the anchor sizes are modified to better fit the target proportions; concentrating the model on these three ratios reduces excess computation and saves training and testing memory. Anchor boxes exceeding the size of the original image are then eliminated, and the remainder are further screened by non-maximum suppression (NMS) to obtain the regions of interest.
(3) The loss function is modified. To further improve the accuracy of the mask segmentation, an edge loss is added to the mask branch so that the edges of the segmentation result are more accurate. The Mask R-CNN loss function is

L = L_cls + L_box + L_mask

In the Mask R-CNN segmentation task, L_cls is the classification loss function, L_box is the detection loss function and L_mask is the average binary cross-entropy loss function. Because the segmentation task depends only on region information, the prediction of the boundary is neglected, the boundary accuracy of the final segmentation result is low, and the distance precision obtained later from the target's Mask region suffers. The method therefore adds a boundary loss function to L_mask, which regularizes the position, shape and continuity of the segmentation with a distance loss so that it lies closer to the target boundary. The optimized loss function L_mask-edge is

L_mask-edge = L_mask + α · L_edge

where L_edge is the boundary loss function, y denotes the annotated target edge, ŷ denotes the predicted boundary, α is a weight coefficient, B is the boundary of the segmentation result, and M_dist is the distance transform of the ground-truth segmentation boundary.
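A rough PyTorch sketch of an edge-regularized mask loss of this kind is given below; it assumes per-instance mask probability maps and a precomputed distance transform of the ground-truth boundary, and the exact formulation used in the patent may differ.

```python
import torch
import torch.nn.functional as F

def mask_edge_loss(pred_mask, gt_mask, gt_dist, alpha=0.5):
    """L_mask-edge = L_mask + alpha * L_edge (distance-weighted boundary penalty, assumed form).

    pred_mask: predicted mask probabilities in [0, 1], gt_mask: binary ground truth (float),
    gt_dist: distance transform of the ground-truth boundary, all with the same spatial shape.
    """
    l_mask = F.binary_cross_entropy(pred_mask, gt_mask)   # average binary cross-entropy

    # Soft boundary of the prediction: gradient magnitude of the mask map.
    dx = pred_mask[..., :, 1:] - pred_mask[..., :, :-1]
    dy = pred_mask[..., 1:, :] - pred_mask[..., :-1, :]
    boundary = F.pad(dx.abs(), (0, 1)) + F.pad(dy.abs(), (0, 0, 0, 1))

    # Penalize predicted boundary pixels by their distance to the ground-truth edge.
    l_edge = (boundary * gt_dist).mean()
    return l_mask + alpha * l_edge
```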
Step 2.5, performing model transfer learning training, loading the self-made data set into the optimized network model by using a transfer learning method, and simultaneously loading a model pre-trained by using a COCO data set, so as to improve model convergence, and performing iterative training on the model through parameter optimization to generate a target detection and segmentation model;
and 2.6, detecting and segmenting the target, intercepting an RGB image from a video stream shot by a RealSense D435i binocular camera, transmitting the RGB image into a Mask R-CNN model, identifying the type and the position of a target object to be captured through the model, segmenting the target, and outputting a Mask region of the target.
In step 3, the target positioning based on the RealSense D435i binocular depth camera specifically comprises the following steps:
Step 3.1, acquiring the RGB image and the DEPTH image from the RealSense D435i camera.
Step 3.2, aligning the DEPTH image with the RGB image so that the pixel points in the RGB image correspond one-to-one to the DEPTH image target points.
Step 3.3, filtering the background information of the target image: the background area of the image is filtered out according to the Mask area of the target object output in step 2.6.
Step 3.4, calculating the target distance: the distance from each pixel point in the Mask area of the target object to the binocular camera is calculated, and the average of these pixel distances is taken as the distance from the target object to the camera.
Step 3.5, calculating the target width: the maximum width of the target object edge in the DEPTH image, corresponding to the Mask region edge pixels, is taken as the width of the target object, and the calculated width value is used to judge whether the target is within the grabbing range of the mechanical arm.
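Steps 3.1 and 3.2 can be sketched with the pyrealsense2 SDK as follows; the stream resolutions and frame rate are illustrative assumptions, and the depth values are returned in the device's depth units (about 1 mm per unit for the D435i by default).

```python
import numpy as np
import pyrealsense2 as rs

pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)   # assumed resolution/FPS
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
pipeline.start(config)
align = rs.align(rs.stream.color)          # map depth pixels onto the colour image

try:
    frames = align.process(pipeline.wait_for_frames())
    color = np.asanyarray(frames.get_color_frame().get_data())   # H x W x 3, BGR
    depth = np.asanyarray(frames.get_depth_frame().get_data())   # H x W, uint16 depth units
finally:
    pipeline.stop()
```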

Claims (8)

1. A robot target positioning and grabbing method based on Mask R-CNN and a binocular camera is characterized by comprising the following steps:
step 1: calibrating the camera;
step 2: identifying and segmenting the target;
step 3: positioning a target;
step 4: calculating the pose of the target;
step 5: grabbing the target;
in step 2, the RGB image collected by the binocular structured light camera is used as an input image and sent into a pre-trained convolutional neural network Mask R-CNN model, a detection frame and a Mask of a target object in the image are output through the model, a target area is extracted by carrying out pixel point segmentation on the target object, and background interference information is filtered.
2. The method of claim 1, wherein in step 2, the target object is identified and segmented by using Mask R-CNN model, and the model construction and identification steps are as follows:
2-1) acquiring a target object data set, and acquiring the data set of the target object from different environments, different angles, different brightness and different postures according to the type of the target object to be captured;
2-2) data enhancement, namely expanding the data set by combining traditional image geometric transformation data enhancement with generative data enhancement using a GAN; for the traditional image geometric transformation method, the acquired data set is expanded through operations such as brightness transformation, noise addition, random cropping, horizontal flipping, image tilting, rotation and image scaling;
2-3) labeling the target data set, and labeling the acquired image by using an image labeling tool;
2-4) optimizing a Mask R-CNN network model;
2-5) carrying out model transfer learning training, loading the self-made data set into the optimized network model by using a transfer learning method, and simultaneously loading a model pre-trained by using a COCO data set, so as to improve model convergence, and carrying out iterative training on the model by parameter optimization to generate a target recognition and segmentation model;
2-6) identifying and segmenting the target, intercepting an RGB image from a video stream shot by a binocular camera, transmitting the RGB image into a Mask R-CNN model, identifying the type and the position of a target object to be captured through the model, segmenting the target, and outputting a Mask region of the target.
3. The method according to claim 2, wherein in step 2-4), when performing Mask R-CNN network model optimization, the following steps are adopted:
(1) modifying the feature extraction network, and improving the speed of target identification by reducing network layers;
(2) modifying the RPN region proposal network: the anchor sizes are modified so that the model concentrates its computation on specified proportions, anchor boxes exceeding the size of the original image are eliminated, and the remainder are further screened by non-maximum suppression (NMS) to obtain the regions of interest;
(3) modifying the loss function; the Mask R-CNN loss function is

L = L_cls + L_box + L_mask

where L_cls is the classification loss function, L_box is the detection loss function and L_mask is the segmentation loss function; a boundary loss function is added to L_mask, using a distance loss to regularize the position, shape and continuity of the segmentation so that it lies closer to the target boundary; the optimized loss function L_mask-edge is

L_mask-edge = L_mask + α · L_edge

where L_edge is the boundary loss function, y denotes the annotated target edge, ŷ denotes the predicted boundary, α is a weight coefficient, B is the boundary of the segmentation result, and M_dist is the distance transform of the ground-truth segmentation boundary.
4. The method according to claim 1, wherein in step 1, calibration of the camera is performed to obtain three-dimensional spatial position information from two-dimensional image information, and the method specifically comprises the following steps:
1-1) making a calibration plate;
1-2) collecting images, changing the position and the angle of a calibration plate relative to a camera, and shooting a plurality of pictures of the calibration plate from different angles, different positions and different postures by using the camera to be calibrated;
1-3) detecting the calibration board corner points to obtain their pixel coordinate values, and calculating their physical coordinate values from the known checkerboard square size and the origin of the world coordinate system;
1-4) solving the internal parameters and the external parameters of the camera.
5. The method of claim 1, wherein in step 3, the DEPTH image acquired by the binocular camera is aligned with the RGB image, the average distance between a target Mask area pixel point in the RGB image and an infrared lens of the binocular camera is calculated to obtain the distance of the target object, the maximum width of the edge of the target object in the DEPTH image corresponding to the Mask area edge pixel is calculated as the width of the target object, and the calculated width value is used for judging whether the target is in a mechanical arm clamping range.
6. The method according to claim 5, characterized in that in step 3, it comprises in particular the steps of:
4-1) acquiring an RGB image and a DEPTH image;
4-2) aligning the DEPTH image and the RGB image, so that pixel points in the RGB image correspond to the DEPTH image target points one by one;
4-3) filtering the background information of the target image, and filtering the background area of the image according to the Mask area of the target object output in the step 2-4);
4-4) calculating a target distance, calculating the distance between a pixel point in a Mask area of a target object and an infrared lens of a binocular camera, and solving the average value of the distances as the distance between the target object and the camera lens;
4-5) calculating the target width, calculating the maximum width of the edge of the target object in the DEPTH image corresponding to the Mask region edge pixels as the width of the target object, and judging whether the target is in the grabbing range of the mechanical arm according to the calculated width value.
7. The method according to claim 1, wherein in step 4, when calculating the target pose, a space coordinate system is established with the arm shoulder of the robot arm as the coordinate origin by using a hand-eye calibration method, and coordinate values of the target object in a coordinate system with the camera as the origin are converted into coordinate values in the coordinate system, wherein the hand-eye calibration method is used for calibration, and the conversion relationship between the camera coordinate system and the robot end coordinate system can be determined.
8. The method according to claim 1, wherein in step 5, the robot is driven to reach the area where the target is located, the distance between the robot and the target object is adjusted so that the target object is within a capturable range of a gripper at the end of the robot arm, the angular postures of the joints are solved through inverse kinematics of the robot, and finally the target object is grabbed by controlling the rotation angles of the joints of the robot arm.
CN202111401496.XA 2021-11-19 2021-11-19 Robot target positioning and grabbing method based on Mask R-CNN and binocular camera Pending CN114155301A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111401496.XA CN114155301A (en) 2021-11-19 2021-11-19 Robot target positioning and grabbing method based on Mask R-CNN and binocular camera

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111401496.XA CN114155301A (en) 2021-11-19 2021-11-19 Robot target positioning and grabbing method based on Mask R-CNN and binocular camera

Publications (1)

Publication Number Publication Date
CN114155301A true CN114155301A (en) 2022-03-08

Family

ID=80457291

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111401496.XA Pending CN114155301A (en) 2021-11-19 2021-11-19 Robot target positioning and grabbing method based on Mask R-CNN and binocular camera

Country Status (1)

Country Link
CN (1) CN114155301A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114359394A (en) * 2022-03-17 2022-04-15 季华实验室 Binocular vision positioning method and device, electronic equipment and storage medium
CN114387268A (en) * 2022-03-22 2022-04-22 中国长江三峡集团有限公司 Bolt looseness detection method and device
CN114683301A (en) * 2022-03-22 2022-07-01 四川正狐智慧科技有限公司 Inspection robot for pig farm, robot system and working method of system
CN115816460A (en) * 2022-12-21 2023-03-21 苏州科技大学 Manipulator grabbing method based on deep learning target detection and image segmentation
CN116758136A (en) * 2023-08-21 2023-09-15 杭州蓝芯科技有限公司 Real-time online identification method, system, equipment and medium for cargo volume
CN116758136B (en) * 2023-08-21 2023-11-10 杭州蓝芯科技有限公司 Real-time online identification method, system, equipment and medium for cargo volume

Similar Documents

Publication Publication Date Title
CN114155301A (en) Robot target positioning and grabbing method based on Mask R-CNN and binocular camera
CN108555908B (en) Stacked workpiece posture recognition and pickup method based on RGBD camera
CN107767423B (en) mechanical arm target positioning and grabbing method based on binocular vision
CN112070818B (en) Robot disordered grabbing method and system based on machine vision and storage medium
CN107471218B (en) Binocular vision-based hand-eye coordination method for double-arm robot
CN110580725A (en) Box sorting method and system based on RGB-D camera
CN110211180A (en) A kind of autonomous grasping means of mechanical arm based on deep learning
CN108416428B (en) Robot vision positioning method based on convolutional neural network
CN111199556B (en) Indoor pedestrian detection and tracking method based on camera
CN111862201A (en) Deep learning-based spatial non-cooperative target relative pose estimation method
CN110425996A (en) Workpiece size measurement method based on binocular stereo vision
CN110378325A (en) A kind of object pose recognition methods during robot crawl
CN112560704B (en) Visual identification method and system for multi-feature fusion
CN110733039A (en) Automatic robot driving method based on VFH + and vision auxiliary decision
CN113643280A (en) Plate sorting system and method based on computer vision
CN115816460A (en) Manipulator grabbing method based on deep learning target detection and image segmentation
CN113822810A (en) Method for positioning workpiece in three-dimensional space based on machine vision
CN114750154A (en) Dynamic target identification, positioning and grabbing method for distribution network live working robot
CN114882109A (en) Robot grabbing detection method and system for sheltering and disordered scenes
CN114494463A (en) Robot sorting method and device based on binocular stereoscopic vision technology
Li et al. A mobile robotic arm grasping system with autonomous navigation and object detection
CN115861780B (en) Robot arm detection grabbing method based on YOLO-GGCNN
Gao et al. An automatic assembling system for sealing rings based on machine vision
CN114187312A (en) Target object grabbing method, device, system, storage medium and equipment
CN117021099A (en) Human-computer interaction method oriented to any object and based on deep learning and image processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination