WO2022142297A1 - A robot grasping system and method based on few-shot learning - Google Patents

A robot grasping system and method based on few-shot learning

Info

Publication number
WO2022142297A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
network
processing module
few
net3
Prior art date
Application number
PCT/CN2021/108568
Other languages
French (fr)
Inventor
Qujiang LEI
Guangchao GUI
Xiuhao LI
Yuhe Wang
Jintao JIN
Rongqiang LIU
Zhonghui DENG
Weijun Wang
Original Assignee
Guangzhou Institute Of Advanced Technology, Chinese Academy Of Sciences
Priority date
Filing date
Publication date
Application filed by Guangzhou Institute Of Advanced Technology, Chinese Academy Of Sciences
Publication of WO2022142297A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/215Motion-based segmentation
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J15/00Gripping heads and other end effectors
    • B25J15/08Gripping heads and other end effectors having finger members
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/08Programme-controlled manipulators characterised by modular constructions
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1602Programme controls characterised by the control system, structure, architecture
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1694Programme controls characterised by use of sensors other than normal servo-feedback from position, speed or acceleration sensors, perception control, multi-sensor controlled systems, sensor fusion
    • B25J9/1697Vision controlled systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30204Marker

Landscapes

  • Engineering & Computer Science (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Automation & Control Theory (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Manipulator (AREA)

Abstract

The disclosure relates to a robot grasping system and method based on few-shot learning. The system includes an image acquisition module, an image processing module, and an action processing module. The image acquisition module includes a depth-of-field camera for capturing images; the image processing module includes a U-net3+ network structure for processing images; the action processing module includes a ROS system and a corresponding package for converting image information into control information for controlling the motor. The disclosure proposes a grasping algorithm built on a segmentation network optimised for occlusion situations, using a U-net3+ network enhanced with dropblock and the Mish activation function, and designs a grasping system for few-shot learning under occlusion, resulting in improved accuracy and reduced training time across different networks and situations and allowing the network to perform better in computation.

Description

A Robot Grasping System and Method Based on Few-Shot Learning
Technical Field
The disclosure belongs to the field of robot learning technology and particularly relates to a robot grasping system and method based on Few-Shot Learning.
Background Technique
With the progress of technology, robots have become an indispensable part of daily production and life. In the intelligent industry, robot grasping, as the most basic robot function, has been well realised for the positioning and grasping of single, unobstructed objects. However, in actual production the objects to be grasped often occlude each other, for example when apples are grasped for crating during fruit production and transportation. Current intelligent robots do not handle this situation well. Therefore, grasping by intelligent robots when target objects occlude each other has become an urgent problem to be solved.
The key aspect of the vision-based robot grasping task is image segmentation: fast and accurate grasping can be achieved only if the target position is located accurately and effectively. Image segmentation methods are mainly divided into traditional methods and deep learning methods. Traditional methods are strongly influenced by the quality of the captured image and place high demands on it, requiring a clear distinction between background and object and proper image contrast, and they segment by colour or texture features.
With the research and development of deep learning, deep learning has gradually shown excellent ability and adaptability in the field of machine vision, and problems that are difficult to solve with traditional methods can often be solved satisfactorily with deep learning. Commonly used deep learning segmentation methods include U-net, R-CNN, and other neural networks. However, deep learning often requires a large amount of data to train the network, labelling the training set consumes a lot of time, and production tasks do not allow long debugging periods, so achieving good results with as small a data set as possible has become a focus of research. In summary, there are few studies on robot grasping in multi-target, occluded scenes. Yet multiple objects occluding each other is a common situation in production environments, and solving this problem with deep learning methods requires networks that achieve highly accurate image segmentation.
Summary of the Invention
In view of the above, it is necessary to provide a robot grasping system and method based on small sample learning that uses a U-net3+ network enhanced with the characteristics of dropblock and the Mish activation function to improve segmentation under occlusion, thus enabling the robot to better handle grasping tasks in occluded scenes.
To achieve the above purpose, the present invention is realized according to the following technical solutions.
On the one hand, the present invention provides a robot grasping system based on small sample learning, comprising an image acquisition module, an image processing module, and an action processing module.
The image acquisition module includes a depth-of-field camera for capturing images.
The image processing module includes a U-net3+ network structure for processing images and completing recognition, localization, and segmentation tasks.
The action processing module includes a ROS (Robot Operating System) system and a corresponding package for converting image information into control information for controlling the motor.
Further, the U-net3+ network structure includes an encoding part and a decoding part; the decoding part is used to achieve the extraction of contextual information, and the encoding part is used to achieve precise localization of the target according to the extracted results. The encoding part consists of a stack of convolutional and pooling layers, and the decoding part consists of a stack of up-sampling, convolutional and BN layers. The input of each convolutional layer of the decoding part is a fusion of all the outputs of the encoding part: before fusing the input of each layer, the outputs of the encoding part need to be adjusted to the same size as this layer by up-sampling or pooling, and are then fused and fed to the convolutional layer as the input of this layer for convolution.
Further, the U-net3+ network structure further comprises a dropblock module for enhancing the network's recognition capability against occluded objects.
Further, the activation function of the U-net3+ network structure is a Mish activation function for optimizing the network to obtain better accuracy and generalization capability; the Mish function is expressed by the formula
Mish(x) = x × tanh(ln(1 + e^x))            (1)
where Mish is the activation function, tanh is the hyperbolic tangent function, ln(x) is the natural logarithm, and e^x is the exponential function with base e.
On the other hand, the present invention provides a robot grasping method based on small sample learning, comprising the following steps.
Step S1: an image acquisition module performs image acquisition of objects with different degrees of occlusion as well as different objects.
Step S2: the image processing module performs pre-processing of the acquired images, including image sharpening and image Gaussian filtering processing.
Step S3: the image processing module uses labelme to annotate the image, generates a json file, and then uses the official api to generate a mask image in png format; the obtained mask image data set is divided into two parts: the training set and the validation set, and the loss function Dice Loss is used to measure the similarity of the two sets.
Step S4: using the image training set, the U-net3+ network of the image processing module is trained using a learning rate with an initial value of 0.000001; the network is tested using the image test set to complete tasks such as recognition, localization and segmentation.
Step S5: the action processing module converts the image information into control information for controlling the robot and controls the robot to complete the grasping action.
Further, in step S3, the annotation of the image comprises classification annotation according to the type of object in the image, and classification annotation according to the occlusion of the object in the image; the classification annotation according to the occlusion of the object in the image comprises two different types of occluded and unoccluded.
The advantages and positive effects of the present invention compared to the prior art include at least the following.
The present invention proposes a grasping algorithm based on a segmentation network optimized for occlusion situations: it uses a U-net3+ network and enhances it with the characteristics of dropblock and the Mish activation function to improve segmentation under occlusion, designs a segmentation neural network for small sample situations, and designs a grasping system for multiple targets under occlusion. Using the Mish activation function to optimize the network results in improved accuracy and reduced training time across different networks and situations, allows the network to perform better in computation, and allows better input information to be passed into the network, thus giving the network better accuracy and generalization.
Description of the accompanying figures
To illustrate more clearly the technical solutions in the embodiments of the present invention, the following is a brief description of the accompanying drawings required for the description of the embodiments. It is obvious that the accompanying drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained from these drawings by a person of ordinary skill in the art without any creative effort.
Fig. 1 shows a schematic diagram of the robotic gripping system of the present invention.
Fig. 2 shows a schematic diagram of the third layer structure of the decoding part of a U-net3+ network of an embodiment of the present invention.
Fig. 3 shows a schematic diagram of the Mish function curve of the U-net3+ network of one embodiment of the present invention.
Fig. 4 is a schematic diagram of the network improvement comparison of one embodiment of the present invention.
Specific embodiments
To make the above-mentioned objects, features, and advantages of the present invention more obvious and understandable, the technical solutions of the present invention will be described in detail below in conjunction with the accompanying drawings and specific embodiments. It is to be noted that the embodiments described are only a part of the embodiments of the present invention and not all of them, and based on the embodiments in the present invention, all other embodiments obtained without creative labour by a person of ordinary skill in the art fall within the scope of protection of the present invention.
It should be noted that the specific parameters or quantities used in the embodiments of the invention are only a few possible or preferable sets of combinations used in the embodiments of the invention, but this should not be understood as a limitation of the scope of the patent of the invention; it should be noted that for a person of ordinary skill in the art, there are several variations and improvements that can be made without departing from the conception of the invention, and these fall within the scope of protection. Therefore, the scope of protection of the present invention shall be governed by the appended claims.
Example 1
The present invention proposes a grasping algorithm based on a segmentation network optimised for occlusion situations. The main body of the algorithm is a U-net3+ network, which has achieved good results in the field of medical image segmentation, and the invention proposes to enhance this network using the properties of dropblock and the Mish activation function to improve segmentation under occlusion.
The main elements of the invention are: the design of a segmentation neural network for the small sample case; the design of a grasping system for multiple targets in the occlusion case; and the optimisation of the network using the Mish activation function.
Figure 1 provides a schematic diagram of the robot grasping system of the present invention. As shown in Figure 1, the present invention provides a robot grasping system based on small sample learning, comprising an image acquisition module, an image processing module, and an action processing module.
The image acquisition module comprises a depth-of-field camera for capturing images;
the image processing module comprises a U-net3+ network structure for processing images;
the action processing module comprises a ROS (Robot Operating System) system and a corresponding package for converting the image information into control information for controlling the motor.
In order to better solve the robot grasping problem in the case of object occlusion, the invention adds a dropblock layer to the U-net3+ structure to optimise the performance of the network when objects occlude each other. The main structure of the U-net3+ encoding part is a stack of convolutional and pooling layers, with 3×3 convolutional kernels in each layer, activated using the ReLU function, with 64, 128, 256, 512, and 1024 convolutional kernels in the successive layers. The decoding part consists of a stack of up-sampling, convolutional, and BN layers, each with 3×3 convolutional kernels activated using the ReLU function, with the number of convolutional kernels corresponding to the encoding part and each convolution followed by a BN layer. The input to each convolutional layer of the decoding part is a fusion of all the outputs of the encoding part: before fusion, the outputs of the encoding part are up-sampled or pooled to the same size as the layer, then concatenated and fed to the convolutional layer as the input of this layer for convolution.
Figure 2 gives a schematic diagram of the structure of the third layer of the decoding part of the U-net3+ network of one embodiment of the present invention. As shown in Figure 2, taking the third layer of the decoding part as an example, the input of this layer is the fusion of the output of the previous layer with the outputs of the encoding part, and the outputs of the encoding part first need to be adjusted to the designed input size of this layer. The fused result is fed into the network as the input of this layer for computation.
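For illustration only, the following PyTorch sketch shows how such a decoder layer could be assembled; the per-branch channel count of 64, the use of bilinear interpolation for up-sampling, and the ReLU placed after the fusion convolution are assumptions of this sketch rather than details taken from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderFusionBlock(nn.Module):
    """Sketch of one U-net3+ decoder layer (cf. the third layer in Figure 2).
    Each input feature map (the encoder outputs plus the previous decoder
    output) is adjusted to this layer's spatial size by pooling or up-sampling,
    projected to a common channel count, concatenated, and passed through a
    3x3 convolution followed by BN and an activation."""

    def __init__(self, input_channels, branch_channels=64):
        super().__init__()
        # one 3x3 projection per incoming feature map (channel counts assumed)
        self.projections = nn.ModuleList(
            [nn.Conv2d(c, branch_channels, kernel_size=3, padding=1) for c in input_channels]
        )
        fused = branch_channels * len(input_channels)
        self.fuse = nn.Sequential(
            nn.Conv2d(fused, fused, kernel_size=3, padding=1),
            nn.BatchNorm2d(fused),
            nn.ReLU(inplace=True),
        )

    def forward(self, feats, target_size):
        resized = []
        for feat, proj in zip(feats, self.projections):
            if tuple(feat.shape[-2:]) != tuple(target_size):
                if feat.shape[-1] > target_size[-1]:
                    # larger maps from shallower layers are pooled down to this size
                    feat = F.adaptive_max_pool2d(feat, target_size)
                else:
                    # smaller maps from deeper layers are up-sampled to this size
                    feat = F.interpolate(feat, size=target_size,
                                         mode="bilinear", align_corners=False)
            resized.append(proj(feat))
        # the fused result becomes this layer's input to the convolution
        return self.fuse(torch.cat(resized, dim=1))
```

For the third decoder layer of Figure 2, input_channels would collect the channel counts of the encoder stages together with that of the previous decoder output.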
In addition, the original network structure does not have a dropblock module, so in order to address target occlusion, a dropblock layer is introduced to enhance the network's ability to recognise occluded objects. The dropblock is added after each convolutional layer in the encoder, and its main function is to randomly discard the information in contiguous regions of the feature map so as to improve the feature extraction ability of the whole network; the data after dropblock processing closely resembles the data of a partially occluded object. The network structure is thus optimised for the segmentation of occluded objects.
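A minimal DropBlock sketch is given below to make the occlusion analogy concrete; the drop probability, block size, and mask-rescaling heuristic are illustrative assumptions, as the patent does not specify them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DropBlock2d(nn.Module):
    """Minimal DropBlock sketch: during training, contiguous block_size x block_size
    regions of the feature map are zeroed, which resembles the information loss
    caused by partial occlusion of the target object."""

    def __init__(self, drop_prob=0.1, block_size=7):
        super().__init__()
        self.drop_prob = drop_prob
        self.block_size = block_size

    def forward(self, x):
        if not self.training or self.drop_prob == 0.0:
            return x
        # seed probability chosen so that roughly drop_prob of activations end up dropped
        gamma = self.drop_prob / (self.block_size ** 2)
        seeds = (torch.rand_like(x) < gamma).float()
        # grow every seed into a block_size x block_size zeroed region
        block_mask = 1.0 - F.max_pool2d(
            seeds, kernel_size=self.block_size, stride=1, padding=self.block_size // 2
        )
        # rescale so the expected activation magnitude is preserved
        return x * block_mask * block_mask.numel() / block_mask.sum().clamp(min=1.0)
```

In the improved network, an instance of this module would follow each convolutional layer of the encoder.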
The Mish function curve of the U-net3+ network of one embodiment of the invention is given in Figure 3. As shown in Figure 3, in order to optimise the network, the invention also modifies the activation function of the network by introducing the Mish activation function, which results in improved accuracy and reduced training time in different networks and under different circumstances. The formula for the Mish function is expressed as
Mish(x) = x × tanh(ln(1 + e^x))          (1)
Because the Mish function is unbounded above (positive values can reach any height), it avoids the saturation caused by capping. The small negative values it permits also give better performance: theoretically, a slight allowance for negative values enables better gradient flow than the hard zero boundary of ReLU. Most critically, the Mish function is smooth, allowing better input information to be passed into the network and resulting in better accuracy and generalisation capability.
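The Mish activation of formula (1) can be implemented in a few lines; the sketch below uses softplus to compute ln(1 + e^x) in a numerically stable way (recent PyTorch releases also provide a built-in nn.Mish).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Mish(nn.Module):
    """Mish activation from formula (1): Mish(x) = x * tanh(ln(1 + e^x)).
    softplus(x) computes ln(1 + e^x) without overflowing for large x."""

    def forward(self, x):
        return x * torch.tanh(F.softplus(x))

# Example of replacing ReLU with Mish in a conv + BN block of the network:
conv_block = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),
    Mish(),
)
```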
By adding the dropblock module to the U-net3+ network and replacing the activation function of the network with the Mish activation function, the improved network is obtained; a comparison of the partial structure of the improved network is shown in Figure 4, where the original U-net3+ network structure is shown on the left and the improved network structure on the right.
Example 2
The present invention also provides a robot grasping method based on small sample learning, comprising the following steps.
Step S1: The image acquisition module performs image acquisition of objects with different degrees of occlusion as well as different objects. The device platform uses a Daheng GigE Vision TL HD camera to acquire images of the stacked objects, and to reflect reality as closely as possible, different objects are used as subjects in the images.
Step S2: To make the images clearer and easier to use, the image processing module pre-processes the captured images, including image sharpening and image Gaussian filtering processing.
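As one possible realisation of this pre-processing, the sketch below applies Gaussian filtering followed by a sharpening kernel using OpenCV; the kernel size, sigma, and sharpening coefficients are assumptions, since the patent only names the two operations.

```python
import cv2
import numpy as np

def preprocess(image: np.ndarray) -> np.ndarray:
    """Gaussian filtering to suppress sensor noise, then a simple sharpening
    kernel to restore edge contrast before annotation and segmentation."""
    blurred = cv2.GaussianBlur(image, (5, 5), 0)
    sharpen_kernel = np.array([[0, -1, 0],
                               [-1, 5, -1],
                               [0, -1, 0]], dtype=np.float32)
    return cv2.filter2D(blurred, -1, sharpen_kernel)
```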
Step S3: The image processing module uses labelme to annotate the images, generates json files, and then uses the official api to generate mask images in png format; the resulting mask image data set is divided into two parts: the training set and the validation set, and the loss function Dice Loss is used to measure the similarity of the two sets.
In order to enable the robot to better handle the grasping task under occlusion, a new approach was adopted for annotating the dataset: images are annotated not only according to the type of object but also according to the occlusion situation, with each object type classified into occluded and unoccluded.
The resulting mask image dataset was divided into two parts: a training set of 60 images and a validation set of 20 images. In the experiments it was found that the loss functions commonly used for classification did not effectively describe the real state of network training; there were often cases where the accuracy and loss values looked good but the segmentation results were unsatisfactory. After several experiments, it was decided to use the loss function Dice Loss, which is commonly used in medical image segmentation. The value domain of Dice Loss is [0, 1]; it measures the similarity of two sets, and the smaller the value, the more similar the two sets are. The specific formula is shown below:
Dice Loss = 1 - 2|X ∩ Y| / (|X| + |Y|)
where X is the set of predicted classification values of the image pixels and Y is the set of true classification values of the image pixels.
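A possible implementation of this loss is sketched below; the smoothing constant added to keep the denominator non-zero and the treatment of predictions as per-pixel foreground probabilities are assumptions of the sketch.

```python
import torch

def dice_loss(pred: torch.Tensor, target: torch.Tensor, smooth: float = 1.0) -> torch.Tensor:
    """Dice Loss = 1 - 2|X ∩ Y| / (|X| + |Y|), averaged over the batch.
    pred holds predicted per-pixel probabilities and target the binary
    ground-truth mask; both are flattened per sample before computing overlap."""
    pred = pred.reshape(pred.size(0), -1)
    target = target.reshape(target.size(0), -1)
    intersection = (pred * target).sum(dim=1)
    dice = (2.0 * intersection + smooth) / (pred.sum(dim=1) + target.sum(dim=1) + smooth)
    return 1.0 - dice.mean()
```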
Step S4: Using the image training set, the U-net3+ network of the image processing module is trained with a learning rate whose initial value is 0.000001, and the network is then tested using the image test set. In order to obtain accurate classification results and to avoid situations where the network reports a high accuracy while the actual results have a large error, a very small learning rate with an initial value of 0.000001 is used, and the network is trained for 100 epochs.
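A minimal training-loop sketch for this step is shown below; the choice of the Adam optimiser, the sigmoid output activation, and the loader name are assumptions, since the patent fixes only the initial learning rate of 0.000001 and the 100 epochs, and dice_loss refers to the function sketched above.

```python
import torch

def train(model, train_loader, dice_loss, epochs=100, lr=1e-6):
    """Train the segmentation network with a very small initial learning rate
    (0.000001) for 100 epochs, minimising Dice Loss on the mask training set."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # optimiser choice is an assumption
    for _ in range(epochs):
        model.train()
        for images, masks in train_loader:
            images, masks = images.to(device), masks.to(device)
            optimizer.zero_grad()
            probs = torch.sigmoid(model(images))  # per-pixel foreground probabilities
            loss = dice_loss(probs, masks)
            loss.backward()
            optimizer.step()
    return model
```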
Step S5: The action processing module converts the image information into control information for controlling the robot, and controls the robot to complete the grasping action.
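The patent does not detail how the image information is converted into control information; the following hypothetical sketch shows one way such a module could turn a segmentation mask into a grasp target pose published on a ROS topic. The topic name, the helper pixel_to_base_pose that maps a pixel and its depth reading into the robot base frame, and the use of the mask centroid as the grasp point are all assumptions.

```python
import numpy as np
import rospy
from geometry_msgs.msg import PoseStamped

class GraspTargetPublisher:
    """Turns a segmentation mask into a grasp pose and publishes it on a topic
    that a downstream motion package can consume."""

    def __init__(self, topic="/grasp_target"):
        self.pub = rospy.Publisher(topic, PoseStamped, queue_size=1, latch=True)

    def publish(self, mask, depth, pixel_to_base_pose):
        ys, xs = np.nonzero(mask)                     # pixels of the selected object
        u, v = int(xs.mean()), int(ys.mean())         # mask centroid as the grasp point
        pose = pixel_to_base_pose(u, v, depth[v, u])  # assumed camera-to-base transform helper
        self.pub.publish(pose)
```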
The advantages and positive effects of the present invention over the prior art include at least the following.
The present invention proposes a grasping algorithm based on a segmentation network optimised for occlusion situations: it uses the U-net3+ network and enhances it with the characteristics of dropblock and the Mish activation function to improve the segmentation effect under occlusion, designs a segmentation neural network for small sample situations, and designs a grasping system for multiple targets under occlusion. Using the Mish activation function to optimise the network results in improved accuracy and reduced training time for different networks and situations, allows the network to perform better in computation, and allows better input information to be passed into the network, thus giving the network better accuracy and generalisation.
The above-described embodiments express only several embodiments of the present invention, which are described in more specific and detailed terms, but should not be construed as limiting the scope of the patent of the present invention. It should be noted that, for a person of ordinary skill in the art, several variations and improvements can be made without departing from the conception of the present invention, and these fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be governed by the appended claims.

Claims (6)

  1. A robot grasping system based on few-shot learning, comprising:
    an image acquisition module, an image processing module, and an action processing module;
    the image acquisition module includes a depth-of-field camera for capturing images;
    the image processing module includes a U-net3+ network structure for processing images and completing recognition, localization, and segmentation tasks;
    the action processing module includes a ROS system and a corresponding package for converting image information into control information for controlling the motor.
  2. A robot grasping system based on few-shot learning as claimed in claim 1, wherein, the U-net3+ network structure includes an encoding part and a decoding part, the decoding part is used to achieve extraction of contextual information, and the encoding part is used to achieve precise localization of the target according to the extracted results; the encoding part consists of a stack of convolutional and pooling layers, and the decoding part consists of a stack of up-sampling, convolutional and BN layers; the input of the convolutional layer of the decoding part is a fusion of all the outputs of the encoding part; before fusing the input of each layer, the outputs of the encoding part need to be adjusted to the same size as this layer by up-sampling or pooling, and are then fused and fed to the convolutional layer as the input of this layer for convolution.
  3. A robot grasping system based on few-shot learning as claimed in claim 2, wherein, the U-net3+ network structure further comprises a dropblock module for enhancing the network's recognition capability against occluded objects.
  4. A robot grasping system based on few-shot learning as claimed in claim 3, wherein, the activation function of the U-net3+ network structure is a Mish activation function for optimizing the network to obtain better accuracy and generalization capability; the Mish function is expressed by the formula
    Mish(x) = x × tanh(ln(1 + e^x))  (1)
    where Mish is the activation function, tanh is the hyperbolic tangent function, ln(x) is the natural logarithm, and e^x is the exponential function with base e.
  5. A robot grasping method based on few-shot learning, comprising:
    Step S1: an image acquisition module performing image acquisition of objects with different degrees of occlusion as well as different objects;
    Step S2: the image processing module performs pre-processing of the acquired images, including image sharpening and image Gaussian filtering processing;
    Step S3: the image processing module uses labelme to annotate the image, generates a json file, and then uses the official api to generate a mask image in png format; the obtained mask image data set is divided into two parts: the training set and the validation set, and the loss function Dice Loss is used to measure the similarity of the two sets;
    Step S4: using the image training set, the U-net3+ network of the image processing module is trained using a learning rate with an initial value of 0.000001; the network is tested using the image test set to complete tasks such as recognition, localization and segmentation;
    Step S5: the action processing module converts the image information into control information for controlling the robot and controls the robot to complete the grasping action.
  6. A robot grasping method based on few-shot learning as claimed in claim 5, wherein, in step S3, the annotation of the image comprises classification annotation according to the type of object in the image, and classification annotation according to the occlusion of the object in the image; the classification annotation according to the occlusion of the object in the image comprises two different types of occluded and unoccluded.
PCT/CN2021/108568 2021-01-04 2021-07-27 A robot grasping system and method based on few-shot learning WO2022142297A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110004574.6 2021-01-04
CN202110004574.6A CN114723775A (en) 2021-01-04 2021-01-04 Robot grabbing system and method based on small sample learning

Publications (1)

Publication Number Publication Date
WO2022142297A1 true WO2022142297A1 (en) 2022-07-07

Family

ID=82234481

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/108568 WO2022142297A1 (en) 2021-01-04 2021-07-27 A robot grasping system and method based on few-shot learning

Country Status (2)

Country Link
CN (1) CN114723775A (en)
WO (1) WO2022142297A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116452936A (en) * 2023-04-22 2023-07-18 安徽大学 Rotation target detection method integrating optics and SAR image multi-mode information

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115631401A (en) * 2022-12-22 2023-01-20 广东省科学院智能制造研究所 Robot autonomous grabbing skill learning system and method based on visual perception

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109584298A (en) * 2018-11-07 2019-04-05 上海交通大学 Object manipulator picks up the automatic measure on line method of task from master object
US20200086483A1 (en) * 2018-09-15 2020-03-19 X Development Llc Action prediction networks for robotic grasping
CN111898699A (en) * 2020-08-11 2020-11-06 海之韵(苏州)科技有限公司 Automatic detection and identification method for hull target
CN112136505A (en) * 2020-09-07 2020-12-29 华南农业大学 Fruit picking sequence planning method based on visual attention selection mechanism

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200086483A1 (en) * 2018-09-15 2020-03-19 X Development Llc Action prediction networks for robotic grasping
CN109584298A (en) * 2018-11-07 2019-04-05 上海交通大学 Object manipulator picks up the automatic measure on line method of task from master object
CN111898699A (en) * 2020-08-11 2020-11-06 海之韵(苏州)科技有限公司 Automatic detection and identification method for hull target
CN112136505A (en) * 2020-09-07 2020-12-29 华南农业大学 Fruit picking sequence planning method based on visual attention selection mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HUIMIN HUANG; LANFEN LIN; RUOFENG TONG; HONGJIE HU; QIAOWEI ZHANG; YUTARO IWAMOTO; XIANHUA HAN; YEN-WEI CHEN; JIAN WU: "UNet 3+: A Full-Scale Connected UNet for Medical Image Segmentation", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 19 April 2020 (2020-04-19), 201 Olin Library Cornell University Ithaca, NY 14853 , XP081647951 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116452936A (en) * 2023-04-22 2023-07-18 安徽大学 Rotation target detection method integrating optics and SAR image multi-mode information
CN116452936B (en) * 2023-04-22 2023-09-29 安徽大学 Rotation target detection method integrating optics and SAR image multi-mode information

Also Published As

Publication number Publication date
CN114723775A (en) 2022-07-08

Similar Documents

Publication Publication Date Title
WO2020108362A1 (en) Body posture detection method, apparatus and device, and storage medium
CN109344701B (en) Kinect-based dynamic gesture recognition method
Makhmudkhujaev et al. Facial expression recognition with local prominent directional pattern
CN110032925B (en) Gesture image segmentation and recognition method based on improved capsule network and algorithm
WO2022142297A1 (en) A robot grasping system and method based on few-shot learning
JP2009211178A (en) Image processing apparatus, image processing method, program and storage medium
CN108648216B (en) Visual odometer implementation method and system based on optical flow and deep learning
US20210279453A1 (en) Methods and systems for computerized recognition of hand gestures
CN109977834B (en) Method and device for segmenting human hand and interactive object from depth image
Rao et al. Neural network classifier for continuous sign language recognition with selfie video
CN112001317A (en) Lead defect identification method and system based on semantic information and terminal equipment
Han et al. Pupil center detection based on the UNet for the user interaction in VR and AR environments
CN112766028A (en) Face fuzzy processing method and device, electronic equipment and storage medium
CN112329662B (en) Multi-view saliency estimation method based on unsupervised learning
WO2024060909A1 (en) Expression recognition method and apparatus, and device and medium
Wang Automatic and robust hand gesture recognition by SDD features based model matching
CN112329510A (en) Cross-domain metric learning system and method
Tsai et al. Deep Learning Based AOI System with Equivalent Convolutional Layers Transformed from Fully Connected Layers
CN117036658A (en) Image processing method and related equipment
Ravinder et al. An approach for gesture recognition based on a lightweight convolutional neural network
CN117274761B (en) Image generation method, device, electronic equipment and storage medium
Bong et al. Application of Fixed-Radius Hough Transform In Eye Detection.
CN117372437B (en) Intelligent detection and quantification method and system for facial paralysis
Jiang et al. Unsupervised Deep Homography Estimation based on Transformer
Li et al. Scalenet-improve cnns through recursively rescaling objects

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21913048

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21913048

Country of ref document: EP

Kind code of ref document: A1