CN212724028U - Vision robot grasping system - Google Patents

Vision robot grasping system

Info

Publication number
CN212724028U
Authority
CN
China
Prior art keywords
feature extraction
model
module
grabbing
robot
Prior art date
Legal status
Active
Application number
CN202021517844.0U
Other languages
Chinese (zh)
Inventor
高振清
秦志民
文博宇
杜艳平
Current Assignee
Beijing Institute of Graphic Communication
Original Assignee
Beijing Institute of Graphic Communication
Priority date
Filing date
Publication date
Application filed by Beijing Institute of Graphic Communication filed Critical Beijing Institute of Graphic Communication
Priority to CN202021517844.0U
Application granted
Publication of CN212724028U

Abstract

The utility model provides a vision robot grasping system, comprising a workbench, an image acquisition device, a computing device and a robot. The image acquisition device is positioned above the workbench and is used for acquiring an initial image of a target to be grabbed. The computing device is in communication connection with the image acquisition device and is used for receiving the initial image, inputting it into a pre-trained multi-scale feature extraction model and computing a processed image, from which the grabbing posture information of the target to be grabbed is acquired. The robot is in communication connection with the computing device and is used for receiving control signals sent by the computing device based on the grabbing posture information and, based on these control signals, adjusting to the corresponding grabbing position to grab the target to be grabbed. The vision robot grasping system can rapidly identify and locate the target object and predict its grasp, completes the identification and grabbing tasks of the vision robot in combination with robot motion control, and improves the real-time performance and stability of the system.

Description

Vision robot grasping system
Technical Field
The utility model relates to the technical field of robots, and in particular to a vision robot grasping system.
Background
Traditional vision robots handle the target detection problem with hand-crafted feature descriptions; as the number of target object types increases, feature extraction becomes increasingly cumbersome, the amount of computation grows exponentially, and the real-time performance of robot operation suffers.
Traditional vision robots also adopt a "teaching" mode for tasks such as locating and grasping a target object. This mode has no generalization ability: when the position or shape of the object to be grasped changes, the robot cannot adjust automatically and the operation fails. Poor flexibility and insufficient stability are therefore problems that the traditional vision robot needs to solve.
SUMMARY OF THE UTILITY MODEL
The technical problem to be solved by the utility model is how to improve the flexibility and reliability with which a robot grasps articles; to this end, the utility model provides a vision robot grasping system.
The vision robot grasping system according to the utility model comprises:
the workbench is used for placing an object to be grabbed;
the image acquisition device is positioned above the workbench and used for acquiring an initial image of the target to be grabbed;
the computing device is in communication connection with the image acquisition device and is used for receiving the initial image, inputting the initial image into a pre-trained multi-scale feature extraction model, and computing to obtain a processed image so as to acquire the grabbing posture information of the target to be grabbed;
and the robot is in communication connection with the computing device and is used for receiving a control signal sent by the computing device based on the grabbing attitude information and adjusting to a corresponding grabbing position to grab the target to be grabbed based on the control signal.
In the vision robot grasping system according to the utility model, after the initial image of the target to be grabbed acquired by the image acquisition device is input into the pre-trained multi-scale feature extraction model, it can be processed to obtain the grabbing posture information of the robot, so that the robot is controlled to grab the target to be grabbed automatically and efficiently. The grasping system can realize rapid identification, positioning and grasp prediction of the target object, completes the identification and grabbing tasks of the vision robot in combination with robot kinematics control, and effectively improves the real-time performance and stability of the system.
According to some embodiments of the invention, the robot comprises:
a six-axis mechanical arm;
and the mechanical arm driver is in communication connection with the computing device and the six-axis mechanical arm, receives a control instruction sent by the computing device, and controls the six-axis mechanical arm to move to the grabbing position to grab the target to be grabbed based on the control instruction.
In some embodiments of the present invention, the six-axis mechanical arm switches between the grabbing position and an initial position, and automatically resets to the initial position each time a grab is completed.
According to some embodiments of the invention, the computing device is a computer, and the image acquisition device is a depth camera.
In some embodiments of the present invention, the workbench has a placement area for placing the object to be grabbed, and the image acquisition device is located directly above the placement area.
According to some embodiments of the invention, the grasping system further comprises: an adjusting device, wherein the image acquisition device is arranged on the adjusting device, and the adjusting device adjusts the height and the angle of the image acquisition device.
In some embodiments of the present invention, the target to be grabbed includes: looped cables, express boxes, storage boxes, scissors, screwdrivers, toothbrushes and screws.
According to some embodiments of the invention, the computing device comprises:
the convolutional neural network feature extraction module is used for building a convolutional neural network feature extraction model based on a Darknet-53 skeleton;
the standard convolution separation module is in communication connection with the convolutional neural network feature extraction module and is used for calling a convolution layer creation function to separate separable standard convolutions in the convolutional neural network feature extraction module into unit convolutions to form a base model;
the multi-scale feature extraction module is in communication connection with the standard convolution separation module and is used for connecting different convolution layers in the base model in a jump connection mode to construct a multi-scale feature extraction model;
and the model training module is in communication connection with the multi-scale feature extraction module and is used for training the multi-scale feature extraction model to obtain the pre-trained multi-scale feature extraction model.
According to some embodiments of the invention, the model training module comprises:
the data set creating module is used for acquiring a multi-scale data set;
and the training module is in communication connection with the data set creating module and is used for training the multi-scale feature extraction model by adopting at least one of transfer learning, parallel operation and GPU acceleration methods based on the multi-scale data set.
According to some embodiments of the present invention, the model training module further comprises:
and the model evaluation optimization module is in communication connection with the training module and is used for evaluating and optimizing the trained multi-scale feature extraction model.
Drawings
Fig. 1 is a schematic view of a vision robot gripping system according to an embodiment of the present invention;
fig. 2 is a flowchart of a method for grabbing by a visual robot according to an embodiment of the present invention.
Reference numerals:
the gripping system (100) is provided with,
a workbench 10, an image acquisition device 20, a computing device 30 and a robot 40.
Detailed Description
To further illustrate the technical means and effects of the present invention adopted to achieve the predetermined objects, the present invention will be described in detail below with reference to the accompanying drawings and preferred embodiments.
As shown in fig. 1, the vision robot grasping system 100 according to the embodiment of the present invention includes: a table 10, an image acquisition device 20, a computing device 30, and a robot 40.
The workbench 10 is used for placing an object to be grabbed, and the image acquisition device 20 is located above the workbench 10 and used for acquiring an initial image of the object to be grabbed.
The calculating device 30 is in communication connection with the image obtaining device 20, and is configured to receive the initial image, input the initial image into a pre-trained multi-scale feature extraction model, and calculate to obtain a processed image, so as to obtain the grabbing posture information of the object to be grabbed.
The robot 40 is in communication connection with the computing device 30, and is configured to receive a control signal sent by the computing device 30 based on the grasping posture information, and adjust to a corresponding grasping position based on the control signal to grasp the object to be grasped.
The "acquiring the grabbing attitude information of the object to be grabbed based on the processed image" may be understood as acquiring coordinate information of a frame to be grabbed and a frame to be grabbed of the processed image, and converting and calculating the coordinate information through a coordinate system to acquire the grabbing attitude information of the robot.
For example, based on processing the image, four vertices of the frame to be grabbed are obtainedThe coordinate information of (2): (x)1,y1),(x2,y2), (x3,y3),(x4,y4);
Calculating the grabbing attitude information (X) of the robot according to the following formula0,Y0,H0,W00):
[The formulas are given in the original only as equation images; they compute X0, Y0, H0, W0 and θ0 from the four vertex coordinates (x1, y1) to (x4, y4).]
wherein (X0, Y0) are the coordinates of the center of the frame to be grasped, H0 is the maximum opening height of the robot's two parallel fingers, W0 is the finger width of the two-finger parallel gripper, and θ0 is the angle of the frame to be grasped relative to the horizontal plane.
Thereby, the grasping posture information (X0, Y0, H0, W0, θ0) of the robot can be calculated from the obtained information, and the robot can be adjusted to the corresponding grasping position to grasp the object to be grabbed.
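Since the formula images are not reproduced in the text, the following is only a sketch of the standard conversion from a four-vertex grasp rectangle to the five-parameter pose. The function name, the assumed ordering of the vertices around the rectangle, and the choice of which edge corresponds to the finger width are illustrative assumptions, not details taken from the patent.

```python
import math

def rect_to_grasp_pose(vertices):
    """Convert four grasp-rectangle vertices [(x1, y1), ..., (x4, y4)] into the
    five-parameter pose (X0, Y0, H0, W0, theta0). The vertices are assumed to be
    ordered around the rectangle so that (x1, y1)->(x2, y2) is a finger edge and
    (x2, y2)->(x3, y3) spans the gripper opening."""
    (x1, y1), (x2, y2), (x3, y3), (x4, y4) = vertices
    x0 = (x1 + x2 + x3 + x4) / 4.0           # centre of the grasp frame
    y0 = (y1 + y2 + y3 + y4) / 4.0
    w0 = math.hypot(x2 - x1, y2 - y1)        # finger width of the parallel gripper
    h0 = math.hypot(x3 - x2, y3 - y2)        # maximum opening height of the two fingers
    theta0 = math.atan2(y2 - y1, x2 - x1)    # angle relative to the horizontal
    return x0, y0, h0, w0, theta0

# Example: an axis-aligned 40 x 20 rectangle centred at (50, 30)
print(rect_to_grasp_pose([(30, 20), (70, 20), (70, 40), (30, 40)]))
```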
In the deep-learning-based vision robot grasping system 100 according to the utility model, after the initial image of the target to be grabbed acquired by the image acquisition device 20 is input into the pre-trained multi-scale feature extraction model, it can be processed to obtain the grabbing posture information of the robot 40, so that the robot 40 is controlled to grab the target to be grabbed automatically and efficiently. The grasping system 100 can realize rapid identification, positioning and grasp prediction of the target object, completes the identification and grabbing tasks of the vision robot 40 in combination with the motion control of the robot 40, and effectively improves the real-time performance and stability of the system.
According to some embodiments of the present invention, as shown in fig. 1, the robot 40 includes: six-axis robotic arms and robotic arm drivers.
The mechanical arm driver is in communication connection with the computing device 30 and the six-axis mechanical arm, receives a control instruction sent by the computing device 30, and controls the six-axis mechanical arm to move to a grabbing position to grab an object to be grabbed based on the control instruction.
In some embodiments of the utility model, the six-axis mechanical arm switches between the grabbing position and an initial position, and automatically resets to the initial position each time a grab is completed.
It should be noted that when the robot is controlled to perform a grabbing task, the grasp-frame information must first be obtained from the acquired original image of the target to be grabbed and then converted into the grabbing posture information of the robot; resetting the robot to the originally set position facilitates this calculation and conversion of the coordinate information.
According to some embodiments of the present invention, the computing device is a computer and the image acquisition device is a depth camera.
In some embodiments of the present invention, the working table 10 has a placement area for placing the object to be grasped, and the image acquiring device 20 is located directly above the placement area. Thereby, image acquisition of the object to be captured by the image acquisition device 20 is facilitated.
According to some embodiments of the present invention, the grasping system 100 further includes: the adjusting device, the image acquisition device 20 is arranged on the adjusting device, and the adjusting device adjusts the height and the angle of the image acquisition device 20. For example, the adjusting device may be a stand capable of rotating up and down, the image capturing device 20 is disposed at an end of the stand, and the height and angle of the image capturing device 20 can be conveniently adjusted by the stand.
In some embodiments of the present invention, the object to be grabbed includes: annular cable, express delivery box, receiver, scissors, screwdriver, toothbrush and screw. That is, the grasping system 100 may be used to grasp looped cables, express cassettes, storage boxes, scissors, screwdrivers, toothbrushes, screws, and the like. It is to be understood that the above-mentioned objects to be grabbed are only for illustration and should not be construed as limiting the present invention.
According to some embodiments of the present invention, the computing device 30 comprises:
the convolutional neural network feature extraction module is used for building a convolutional neural network feature extraction model based on a Darknet-53 skeleton;
the standard convolution separation module is in communication connection with the convolutional neural network feature extraction module and is used for calling a convolution layer creation function to separate separable standard convolutions in the convolutional neural network feature extraction module into unit convolutions to form a base model;
the multi-scale feature extraction module is in communication connection with the standard convolution separation module and is used for connecting different convolution layers in the base model in a jump connection mode to construct a multi-scale feature extraction model;
and the model training module is in communication connection with the multi-scale feature extraction module and is used for training the multi-scale feature extraction model so as to obtain a pre-trained multi-scale feature extraction model.
According to some embodiments of the utility model, the model training module includes:
the data set creating module is used for acquiring a multi-scale data set;
and the training module is in communication connection with the data set creating module and is used for training the multi-scale feature extraction model by adopting at least one of transfer learning, parallel operation and GPU acceleration methods based on the multi-scale data set.
According to some embodiments of the utility model, the model training module still includes:
and the model evaluation optimization module is in communication connection with the training module and is used for evaluating and optimizing the trained multi-scale feature extraction model.
The process of grasping a target to be grabbed using the vision robot grasping system of the utility model includes:
the method comprises the steps that an image acquisition device acquires an initial image of a target to be grabbed, wherein the target to be grabbed is placed on a workbench;
the image acquisition device sends the initial image to the computing device, so that the computing device inputs the initial image into a pre-trained multi-scale feature extraction model, a processed image is obtained through computing, and the grabbing attitude information of the target to be grabbed is acquired based on the processed image;
and the robot grabs the target to be grabbed based on the grabbing posture information.
It should be noted that the pre-trained multi-scale feature extraction model may be constructed by a system in the computing device 30, and specifically, the computing device 30 includes:
the convolutional neural network feature extraction module can build a convolutional neural network feature extraction model through a TensorFlow platform based on a Darknet-53 framework;
the standard convolution separation module can separate separable standard convolution in the convolution neural network feature extraction model into unit convolution by calling a convolution layer creating function to form a base model;
the multi-scale feature extraction module can adopt a jump connection mode to connect different convolution layers in the base model to construct a multi-scale feature extraction model;
the data set creating module can acquire a multi-scale data set; for example, a picture to be trained is acquired by using a depth camera or by using an existing data set in the cloud; that is to say, the picture to be trained can be obtained by shooting through the depth camera, and the picture to be trained can also be obtained through the existing data of the cloud. And then, carrying out real grabbing frame marking on the picture to be trained, and carrying out clipping and/or rotation processing on the picture to be trained to amplify the picture to be trained so as to obtain a multi-scale data set.
The training module can train the built multi-scale feature extraction model through a data set, and at least one of transfer learning, parallel operation and GPU acceleration methods is used in the training process to improve the model training speed; it should be noted that after the picture to be trained is obtained, the picture to be trained needs to be labeled with a real capture frame. In order to amplify the pictures to be trained, the pictures to be trained can be cut to obtain pictures to be trained with different sizes; or rotating the picture to be trained to obtain pictures to be trained with different rotation angles; of course, the image to be trained can be cut and rotated to increase the number of the images to be trained.
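As an illustration of the crop-and-rotate augmentation described above, a minimal OpenCV-based sketch follows. The crop ratio, the fixed rotation angles and the use of OpenCV are assumptions made for illustration; in practice the annotated grasp frames would have to be transformed together with the images.

```python
import cv2

def augment(image, crop_ratio=0.9, angles=(90, 180, 270)):
    """Amplify one training picture by cropping and rotating it."""
    h, w = image.shape[:2]
    samples = [image]

    # Centre crop to crop_ratio of the original size, then resize back.
    ch, cw = int(h * crop_ratio), int(w * crop_ratio)
    top, left = (h - ch) // 2, (w - cw) // 2
    samples.append(cv2.resize(image[top:top + ch, left:left + cw], (w, h)))

    # Rotations about the image centre at a few fixed angles.
    for angle in angles:
        m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
        samples.append(cv2.warpAffine(image, m, (w, h)))
    return samples
```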
The model evaluation and optimization module can evaluate and optimize the trained multi-scale feature extraction model; when the processed image computed by the multi-scale feature extraction model meets the preset requirements, training of the model is complete.
The built multi-scale feature extraction model can undergo transfer learning on the commonly used public Pascal VOC data set, which contains 20 object categories.
The model building process is specifically as follows: a convolutional neural network feature extraction model is built on the TensorFlow 1.15 deep learning platform; the convolution-layer creation function conv2d_fixed_padding(inputs, filters1, 1) is called to separate the separable standard convolution into unit convolutions, pairing a 1 x 1 convolution with a 3 x 3 convolution, i.e. conv2d_fixed_padding(inputs, filters2, 3); together these build the basic modules required by the network model, and the basic modules jointly form the base model.
On top of the base model, jump (skip) connections are added by defining a shortcut equal to the block input and feeding it into the next layer, thereby connecting different convolution layers. The base model generates meta-features that serve as input to a secondary model; model stacking is completed by setting the number of repetitions, input pixels, number of convolution kernels and stride of each basic module, and 53 convolution layers at different scales are stacked to build the required model framework, which is used to extract image features and learn the weights. A sketch of such a basic module is given below.
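The following is a minimal sketch of such a basic module, written with tf.keras (available on the TensorFlow 1.15 platform mentioned above): a 1 x 1 unit convolution followed by a 3 x 3 unit convolution, with a jump connection adding the block input back in. The helper names, the layer hyper-parameters and the small stacking example are illustrative assumptions; the patent does not give the exact layer configuration.

```python
import tensorflow as tf  # the patent builds the model on the TensorFlow 1.15 platform

def conv_unit(x, filters, kernel_size, strides=1):
    """One 'unit convolution': convolution + batch norm + LeakyReLU,
    loosely mirroring the conv2d_fixed_padding helper mentioned above."""
    x = tf.keras.layers.Conv2D(filters, kernel_size, strides=strides,
                               padding="same", use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    return tf.keras.layers.LeakyReLU(alpha=0.1)(x)

def base_block(x, filters):
    """Darknet-53-style basic module: a separated 1x1 + 3x3 convolution pair,
    with a jump (skip) connection adding the block input back in."""
    shortcut = x                       # block input kept as the skip connection
    x = conv_unit(x, filters, 1)       # 1 x 1 unit convolution
    x = conv_unit(x, filters * 2, 3)   # 3 x 3 unit convolution
    return tf.keras.layers.Add()([shortcut, x])

# Illustrative stacking of the first stage; the real model stacks 53 conv layers.
inputs = tf.keras.Input(shape=(416, 416, 3))
x = conv_unit(inputs, 32, 3)
x = conv_unit(x, 64, 3, strides=2)     # downsample before the residual block
x = base_block(x, 32)                  # output channels (64) match the shortcut
model = tf.keras.Model(inputs, x)
```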
In the weight optimization process, cross entropy is selected as a loss function. The cross entropy loss function is defined as follows:
C = -(1/n) Σ_x [ y·ln a + (1 - y)·ln(1 - a) ]
where C denotes loss, y denotes actual value, a denotes output value, n denotes total number of samples, and x denotes sample.
An adaptive gradient descent optimizer is used as the main method for weight updating, and finally the normalized exponential function softmax outputs the maximum probability value as the final prediction. During training the weights are stored in the checkpoint folder; the other folders include a dataset folder, a network model folder and several configuration and description folders. In use, the required files and parameters can be modified for the specific application scenario to obtain the optimal configuration. In this way, the multi-scale feature extraction model is constructed by adding separated convolutions and jump connections on top of the Darknet-53 network structure. A sketch of the loss and optimizer setup follows.
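A minimal sketch of this training head is shown below in the TF 1.x graph style mentioned above. Adam is assumed as the "adaptive gradient descent optimizer", and the multi-class softmax cross-entropy is used in place of the binary form given earlier; both are illustrative choices rather than details stated in the text.

```python
import tensorflow as tf  # TF 1.x graph-style API (TensorFlow 1.15)

def classification_head(features, labels, num_classes, learning_rate=1e-3):
    """Softmax output, cross-entropy loss and an adaptive optimizer.
    Adam is assumed here; the text only says 'adaptive gradient descent
    optimizer'. 'features' is the flattened model output, 'labels' one-hot."""
    logits = tf.layers.dense(features, num_classes)
    probs = tf.nn.softmax(logits)                 # normalised exponential output
    loss = tf.reduce_mean(                        # cross-entropy loss C
        tf.nn.softmax_cross_entropy_with_logits_v2(labels=labels, logits=logits))
    train_op = tf.train.AdamOptimizer(learning_rate).minimize(loss)
    prediction = tf.argmax(probs, axis=-1)        # maximum-probability class
    return loss, train_op, prediction
```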
The training process for the built model specifically comprises the following steps:
First, the Pascal VOC 2007 and Pascal VOC 2012 data sets are downloaded. Second, the script provided by Darknet for processing VOC labels is used to generate the labels of the VOC data sets; the data sets and label categories are substituted by modifying the script files, and the data configuration file, model configuration file, label configuration file and data formats are modified accordingly. The pre-trained network model parameters are then downloaded and the transfer training is started. A sketch of the VOC-to-Darknet label conversion is given below.
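For illustration, converting one Pascal VOC annotation into the Darknet label format (one "class_id x_center y_center width height" line per object, all values normalised to [0, 1]) can be sketched as follows. The function is a stand-in for the Darknet-provided label script mentioned above, not the patent's own code.

```python
import xml.etree.ElementTree as ET

# The 20 Pascal VOC object categories used for transfer learning.
VOC_CLASSES = [
    "aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat",
    "chair", "cow", "diningtable", "dog", "horse", "motorbike", "person",
    "pottedplant", "sheep", "sofa", "train", "tvmonitor",
]

def voc_xml_to_darknet_label(xml_path):
    """Convert one Pascal VOC annotation file into Darknet label lines of the
    form 'class_id x_center y_center width height', normalised to [0, 1]."""
    root = ET.parse(xml_path).getroot()
    w = float(root.find("size/width").text)
    h = float(root.find("size/height").text)
    lines = []
    for obj in root.iter("object"):
        cls = obj.find("name").text
        if cls not in VOC_CLASSES:
            continue
        box = obj.find("bndbox")
        xmin, xmax = float(box.find("xmin").text), float(box.find("xmax").text)
        ymin, ymax = float(box.find("ymin").text), float(box.find("ymax").text)
        lines.append("%d %.6f %.6f %.6f %.6f" % (
            VOC_CLASSES.index(cls),
            (xmin + xmax) / 2 / w, (ymin + ymax) / 2 / h,
            (xmax - xmin) / w, (ymax - ymin) / h))
    return lines
```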
The vision robot gripping system 100 according to the present invention is described in detail in one specific embodiment with reference to the accompanying drawings. It is to be understood that the following description is only exemplary, and not restrictive of the invention.
The vision robot grasping system 100 provided by the utility model mainly addresses the insufficient real-time performance of traditional visual detection methods during target detection and classification, and the poor stability of traditional robots during grasp prediction.
As shown in fig. 1, the vision robot gripping system 100 includes: a table 10, an image acquisition device 20, a computing device 30, and a robot 40.
The workbench 10 is used for placing an object to be grabbed, and the image acquisition device 20 is located above the workbench 10 and used for acquiring an initial image of the object to be grabbed.
The calculating device 30 is in communication connection with the image obtaining device 20, and is configured to receive the initial image, input the initial image into a pre-trained multi-scale feature extraction model, and calculate to obtain a processed image, so as to obtain the grabbing posture information of the object to be grabbed.
The robot 40 is in communication connection with the computing device 30, and is configured to receive a control signal sent by the computing device 30 based on the grasping posture information, and adjust to a corresponding grasping position based on the control signal to grasp the object to be grasped.
The "acquiring the grabbing attitude information of the object to be grabbed based on the processed image" may be understood as acquiring coordinate information of a frame to be grabbed and a frame to be grabbed of the processed image, and converting and calculating the coordinate information through a coordinate system to acquire the grabbing attitude information of the robot.
For example, based on the processed image, coordinate information of four vertices of the frame to be grasped is acquired: (x)1,y1),(x2,y2), (x3,y3),(x4,y4);
Calculating the grabbing attitude information (X) of the robot according to the following formula0,Y0,H0,W00):
[The formulas are given in the original only as equation images; they compute X0, Y0, H0, W0 and θ0 from the four vertex coordinates (x1, y1) to (x4, y4).]
wherein (X0, Y0) are the coordinates of the center of the frame to be grasped, H0 is the maximum opening height of the robot's two parallel fingers, W0 is the finger width of the two-finger parallel gripper, and θ0 is the angle of the frame to be grasped relative to the horizontal plane.
Thereby, the grasping posture information (X0, Y0, H0, W0, θ0) of the robot can be calculated from the obtained information, and the robot can be adjusted to the corresponding grasping position to grasp the object to be grabbed.
As shown in fig. 1, the robot 40 includes: six-axis robotic arms and robotic arm drivers.
The mechanical arm driver is in communication connection with the computing device 30 and the six-axis mechanical arm, receives a control instruction sent by the computing device 30, and controls the six-axis mechanical arm to move to a grabbing position to grab an object to be grabbed based on the control instruction.
The six-axis mechanical arm switches between the grabbing position and an initial position, and automatically resets to the initial position each time a grab is completed.
It should be noted that when the robot is controlled to perform a grabbing task, the grasp-frame information must first be obtained from the acquired original image of the target to be grabbed and then converted into the grabbing posture information of the robot; resetting the robot to the originally set position facilitates this calculation and conversion of the coordinate information.
The computing device is a computer, and the image acquisition device is a depth camera.
The table 10 has a placement area where an object to be grasped is placed, and the image pickup device 20 is located directly above the placement area. Thereby, image acquisition of the object to be captured by the image acquisition device 20 is facilitated.
The grasping system 100 further includes: the adjusting device, the image acquisition device 20 is arranged on the adjusting device, and the adjusting device adjusts the height and the angle of the image acquisition device 20. For example, the adjusting device may be a stand capable of rotating up and down, the image capturing device 20 is disposed at an end of the stand, and the height and angle of the image capturing device 20 can be conveniently adjusted by the stand.
The object to be grabbed includes: looped cables, express boxes, storage boxes, scissors, screwdrivers, toothbrushes and screws. That is, the grasping system 100 may be used to grasp looped cables, express boxes, storage boxes, scissors, screwdrivers, toothbrushes, screws, and the like. It is to be understood that the above-mentioned objects to be grabbed are only for illustration and should not be construed as limiting the present invention.
The computing device 30 comprises:
the convolutional neural network feature extraction module is used for building a convolutional neural network feature extraction model based on a Darknet-53 skeleton;
the standard convolution separation module is in communication connection with the convolutional neural network feature extraction module and is used for calling a convolution layer creation function to separate separable standard convolutions in the convolutional neural network feature extraction module into unit convolutions to form a base model;
the multi-scale feature extraction module is in communication connection with the standard convolution separation module and is used for connecting different convolution layers in the base model in a jump connection mode to construct a multi-scale feature extraction model;
and the model training module is in communication connection with the multi-scale feature extraction module and is used for training the multi-scale feature extraction model so as to obtain a pre-trained multi-scale feature extraction model.
The model training module comprises:
the data set creating module is used for acquiring a multi-scale data set;
and the training module is in communication connection with the data set creating module and is used for training the multi-scale feature extraction model by adopting at least one of transfer learning, parallel operation and GPU acceleration methods based on the multi-scale data set.
The model training module further comprises:
and the model evaluation optimization module is in communication connection with the training module and is used for evaluating and optimizing the trained multi-scale feature extraction model.
Before a target object is grasped, the multi-scale feature extraction model is built and trained in advance. Acquisition of the training data set is completed using an existing data set in the cloud and by automatically collecting images with the depth camera; during preprocessing, the data set is annotated with LabelImg software and augmented by cropping and rotation, the annotated ground-truth frames are clustered, and the main types of grasp frames are selected as candidate frames.
A multi-scale feature extraction model is then constructed to extract features from the created data set. During model construction, a single-stage detection model framework is built on the basis of the Darknet-53 skeleton by combining ideas such as separated convolution, rotated convolution and jump connection; an IOU value is set as the matching criterion, and the cross entropy function is used as the loss function.
In the training process, methods such as transfer learning, parallel operation, GPU acceleration and the like are used for accelerating the training speed.
Based on the Cornell Dataset, five-fold cross-validation and ablation experiments are carried out to evaluate recognition speed, generalization and robustness.
To grasp an object, the pose of the target object in the mechanical-arm coordinate system must be determined; the object pose is then converted through coordinate transformation into a redefined grasp frame, which only needs to regress a correction value for the frame.
Specifically, with reference to fig. 2, the step of inputting an initial image into a pre-trained multi-scale feature extraction model by a computer to obtain the grasping posture information of the target to be grasped is as follows:
the method comprises the following steps: multi-scale dataset creation. Firstly, the image information of a target to be captured is obtained by a depth camera, the images are constructed into a target detection data set, and the data set is amplified through cutting and rotation. And preprocessing the amplified data set, and manually labeling the name of the target object and the bounding box in the image by using labelImg software. The marked part in the image is defined as a positive sample, and the unmarked part is defined as a negative sample.
Step two: and (5) a multi-scale feature extraction model. And constructing a multi-scale feature extraction model to extract the features of the created data set.
Step three: and (5) training a model. And training the built multi-scale feature extraction model, and accelerating the training speed by using methods such as transfer learning, parallel operation, GPU acceleration and the like in the training process.
The method comprises the following steps that migration learning is conducted, wherein the format of an ImageNet data set is converted into a format required by training of the darknet by utilizing darknet/script/ImageNet _ laber, so that label of each picture is generated and stored in a labels folder; creating a file data/ImageNet/ImageNet.name, and writing 1000 classes of ImageNet into the file; creating a file data/ImageNet/ImageNet.data, specifying the number of categories, the positions of a training set and a test set, and the like; modifying the network structure, and newly building yolov 3-ImageNet.cfg; modify utils/dataset.py; and downloading the weight weights weight/yolov3. pt of the pre-training, and starting the migration training.
Step four: and evaluating optimization. After the model training is finished, taking a Cornell Dataset (Cornell Dataset) as a reference, performing a five-fold intersection and ablation experiment, and evaluating the recognition speed, the generalization and the robustness of the optimized model.
The five-fold cross-validation algorithm comprises the following steps (a minimal sketch in code follows the list):
a. randomly dividing the Cornell Dataset into 5 packets;
b. using one packet as the test set each time and the remaining 4 packets as the training set;
c. finally, taking the average of the 5 resulting classification rates as the true resolution of the model.
Step five: feasibility analysis. If the result is valid, the next step is carried out; otherwise, the process returns to step two.
Step six: pose conversion. To grasp an object, the pose of the target object in the mechanical-arm coordinate system must be determined, and the object pose is then converted into a redefined grasp frame through coordinate transformation.
The coordinate transformation and capture box redefinition calculation is as follows:
(1) Coordinate conversion. A transformation from the pixel coordinate system to the camera coordinate system is established through camera calibration, and a transformation from the camera coordinate system to the mechanical-arm coordinate system is established through hand-eye calibration. The concrete implementation steps are as follows:
a. acquiring the camera's intrinsic matrix and distortion parameters using Zhang Zhengyou's calibration method;
b. calibrating the extrinsic matrix that converts between image coordinates and world coordinates;
c. setting N feature points (N > 3), calculating their world coordinates, moving the working end of the mechanical arm to each feature point and recording the end coordinates, yielding N groups of data;
d. calculating the rotation matrix and translation matrix between the two sets of data, where the world coordinates of the feature points form set A and the arm-end coordinates form set B (a sketch of this computation is given after this section);
(2) Grasp frame redefinition.
To obtain the grasp pose of the target object, the coordinates of the four vertices of the regular grasp frame given in the training data set are converted into a five-parameter representation (X0, Y0, H0, W0, θ0); this five-parameter representation gives the position and orientation of the two parallel fingers when gripping the object.
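The computation of the rotation and translation between point sets A and B in step d is not spelled out in the text; the SVD-based (Kabsch) solution sketched below is a common choice and is given here only as an assumed illustration.

```python
import numpy as np

def solve_rigid_transform(A, B):
    """Estimate rotation R and translation t with B ≈ R @ A + t, where A are the
    feature points' world coordinates and B the recorded arm-end coordinates
    (steps c-d of the hand-eye calibration above)."""
    A = np.asarray(A, dtype=float)   # N x 3 world coordinates
    B = np.asarray(B, dtype=float)   # N x 3 arm-end coordinates
    ca, cb = A.mean(axis=0), B.mean(axis=0)
    H = (A - ca).T @ (B - cb)                     # covariance of the centred sets
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                      # guard against a reflection
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = cb - R @ ca
    return R, t
```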
Step seven: multi-scale real-time object detection.
In summary, the grasping system 100 provided by the utility model can realize rapid identification, positioning and grasp prediction of a target object, completes the identification and grabbing tasks of the vision robot in combination with related techniques such as robot motion control, and effectively improves the real-time performance and stability of the system.
The technical means and functions of the present invention to achieve the intended purpose will be understood more deeply and concretely through the description of the embodiments, however, the attached drawings are only for reference and illustration, and are not intended to limit the present invention.

Claims (10)

1. A visual robotic grasping system, comprising:
the workbench is used for placing an object to be grabbed;
the image acquisition device is positioned above the workbench and used for acquiring an initial image of the target to be grabbed;
the computing device is in communication connection with the image acquisition device and is used for receiving the initial image, inputting the initial image into a pre-trained multi-scale feature extraction model, and computing to obtain a processed image so as to acquire the grabbing posture information of the target to be grabbed;
and the robot is in communication connection with the computing device and is used for receiving a control signal sent by the computing device based on the grabbing attitude information and adjusting to a corresponding grabbing position to grab the target to be grabbed based on the control signal.
2. The visual robotic grasping system according to claim 1, wherein the robot includes:
a six-axis mechanical arm;
and the mechanical arm driver is in communication connection with the computing device and the six-axis mechanical arm, receives a control instruction sent by the computing device, and controls the six-axis mechanical arm to move to the grabbing position to grab the target to be grabbed based on the control instruction.
3. The visual robotic gripper system of claim 2, wherein the six-axis robotic arm switches between a gripping position and an initial position to which the six-axis robotic arm automatically resets each time a grip is completed.
4. The visual robotic grasping system according to claim 1, wherein the computing device is a computer and the image acquisition device is a depth camera.
5. The vision robot gripping system according to claim 1, wherein the table has a placement area where the object to be gripped is placed, and the image acquisition device is located directly above the placement area.
6. The visual robotic grasping system according to claim 1, characterized in that the grasping system further includes: an adjusting device, wherein the image acquisition device is arranged on the adjusting device, and the height and the angle of the image acquisition device are adjusted by the adjusting device.
7. The visual robotic grasping system according to claim 1, wherein the object to be grasped includes: looped cables, express boxes, storage boxes, scissors, screwdrivers, toothbrushes and screws.
8. The visual robotic gripper system according to any one of claims 1-7, wherein the computing device comprises:
the convolutional neural network feature extraction module is used for building a convolutional neural network feature extraction model based on a Darknet-53 skeleton;
the standard convolution separation module is in communication connection with the convolution neural network feature extraction module and is used for calling a convolution layer creation function to separate separable standard convolutions in the convolution neural network feature extraction module into unit convolutions to form a base model;
the multi-scale feature extraction module is in communication connection with the standard convolution separation module and is used for connecting different convolution layers in the base model in a jump connection mode to construct a multi-scale feature extraction model;
and the model training module is in communication connection with the multi-scale feature extraction module and is used for training the multi-scale feature extraction model to obtain the pre-trained multi-scale feature extraction model.
9. The visual robotic grasping system according to claim 8, wherein the model training module includes:
the data set creating module is used for acquiring a multi-scale data set;
and the training module is in communication connection with the data set creating module and is used for training the multi-scale feature extraction model by adopting at least one of transfer learning, parallel operation and GPU acceleration methods based on the multi-scale data set.
10. The visual robotic grasping system according to claim 9, wherein the model training module further includes:
and the model evaluation optimization module is in communication connection with the training module and is used for evaluating and optimizing the trained multi-scale feature extraction model.
CN202021517844.0U 2020-07-28 2020-07-28 Vision robot grasping system Active CN212724028U (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202021517844.0U CN212724028U (en) 2020-07-28 2020-07-28 Vision robot grasping system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202021517844.0U CN212724028U (en) 2020-07-28 2020-07-28 Vision robot grasping system

Publications (1)

Publication Number Publication Date
CN212724028U true CN212724028U (en) 2021-03-16

Family

ID=74911633

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202021517844.0U Active CN212724028U (en) 2020-07-28 2020-07-28 Vision robot grasping system

Country Status (1)

Country Link
CN (1) CN212724028U (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113239786A (en) * 2021-05-11 2021-08-10 重庆市地理信息和遥感应用中心 Remote sensing image country villa identification method based on reinforcement learning and feature transformation


Similar Documents

Publication Publication Date Title
CN111723782A (en) Deep learning-based visual robot grabbing method and system
CN109483554B (en) Robot dynamic grabbing method and system based on global and local visual semantics
CN110692082B (en) Learning device, learning method, learning model, estimating device, and clamping system
CN111590611B (en) Article classification and recovery method based on multi-mode active perception
CN110580725A (en) Box sorting method and system based on RGB-D camera
CN111046948B (en) Point cloud simulation and deep learning workpiece pose identification and robot feeding method
CN109584298B (en) Robot-oriented autonomous object picking task online self-learning method
CN108748149B (en) Non-calibration mechanical arm grabbing method based on deep learning in complex environment
WO2012052615A1 (en) Method for the filtering of target object images in a robot system
CN111923053A (en) Industrial robot object grabbing teaching system and method based on depth vision
CN111145257B (en) Article grabbing method and system and article grabbing robot
CN110969660A (en) Robot feeding system based on three-dimensional stereoscopic vision and point cloud depth learning
CN113610921A (en) Hybrid workpiece grabbing method, device and computer-readable storage medium
CN115816460B (en) Mechanical arm grabbing method based on deep learning target detection and image segmentation
CN115070781B (en) Object grabbing method and two-mechanical-arm cooperation system
CN212724028U (en) Vision robot grasping system
CN112947458B (en) Robot accurate grabbing method based on multi-mode information and computer readable medium
WO2024067006A1 (en) Disordered wire sorting method, apparatus, and system
CN116984269A (en) Gangue grabbing method and system based on image recognition
CN115861780B (en) Robot arm detection grabbing method based on YOLO-GGCNN
CN114347028B (en) Robot tail end intelligent grabbing method based on RGB-D image
CN114193440B (en) Robot automatic grabbing system and method based on 3D vision
CN113762159B (en) Target grabbing detection method and system based on directional arrow model
CN112288819B (en) Multi-source data fusion vision-guided robot grabbing and classifying system and method
CN111331599A (en) Automatic directional article grabbing method and system based on mechanical arm

Legal Events

Date Code Title Description
GR01 Patent grant