WO2022142297A1 - A robot grasping system and method based on few-shot learning - Google Patents

A robot grasping system and method based on few-shot learning

Info

Publication number
WO2022142297A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
network
processing module
few
net3
Prior art date
Application number
PCT/CN2021/108568
Other languages
French (fr)
Inventor
Qujiang LEI
Guangchao GUI
Xiuhao LI
Yuhe Wang
Jintao JIN
Rongqiang LIU
Zhonghui DENG
Weijun Wang
Original Assignee
Guangzhou Institute Of Advanced Technology, Chinese Academy Of Sciences
Priority date
Filing date
Publication date
Application filed by Guangzhou Institute Of Advanced Technology, Chinese Academy Of Sciences
Publication of WO2022142297A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/215Motion-based segmentation
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J15/00Gripping heads and other end effectors
    • B25J15/08Gripping heads and other end effectors having finger members
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/08Programme-controlled manipulators characterised by modular constructions
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1602Programme controls characterised by the control system, structure, architecture
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1694Programme controls characterised by use of sensors other than normal servo-feedback from position, speed or acceleration sensors, perception control, multi-sensor controlled systems, sensor fusion
    • B25J9/1697Vision controlled systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30204Marker

Landscapes

  • Engineering & Computer Science (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Automation & Control Theory (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Manipulator (AREA)

Abstract

The disclosure relates to a robot grasping system and method based on few-shot learning. The system includes an image acquisition module, an image processing module, and an action processing module. The image acquisition module includes a depth-of-field camera for capturing images; the image processing module includes a U-net3+ network structure for processing images; the action processing module includes a ROS system and a corresponding package for converting image information into control information for controlling the motor. The disclosure proposes a grasping algorithm built on a segmentation network optimised for occlusion situations, using a U-net3+ network enhanced with dropblock and the Mish activation function, and designs a grasping system for few-shot learning under occlusion, resulting in improved accuracy and reduced training time across different networks and situations and allowing the network to perform better in computation.

Description

A Robot Grasping System and Method Based on Few-Shot Learning
Technical Field
The disclosure belongs to the field of robot learning technology and particularly relates to a robot grasping system and method based on Few-Shot Learning.
Background Technique
With the progress of technology, robots have become an indispensable part of daily production and life. In the intelligent industry, robot grasping, as the most basic robot function, has been well realised for the positioning and grasping of single, unobstructed objects. However, in actual production the objects to be grasped often occlude each other, for example when apples are grasped for crating during fruit production and transportation. Current intelligent robots do not handle this situation well. Therefore, grasping by intelligent robots when target objects occlude each other has become an urgent problem to be solved.
The key aspect of the vision-based robot grasping task is image segmentation: fast and accurate grasping can be achieved only if the target position is located accurately and effectively. Image segmentation methods are mainly divided into traditional methods and deep learning methods. Traditional methods are strongly influenced by the quality of the captured image and place high demands on it, requiring a clear distinction between background and object and proper image contrast, and they segment by colour or texture features.
With the research and development of deep learning, deep learning has gradually shown excellent ability and adaptability in the field of machine vision, and problems that are difficult to solve with traditional methods can often be solved satisfactorily with deep learning. Commonly used deep learning segmentation methods include U-net, R-CNN, and other neural networks. However, deep learning often requires a large amount of data to train the network, labelling the training set consumes a lot of time, and production tasks do not allow long debugging periods, so achieving good results with as small a data set as possible has become a focus of research. In summary, there are few studies on robot grasping in multi-target, occluded scenes. Yet multiple objects occluding each other is a common situation in production environments, and solving this problem with deep learning methods requires networks that achieve highly accurate image segmentation.
Summary of the Invention
In view of the above, it is necessary to provide a robot grasping system and method based on small sample learning that uses a U-net3+ network enhanced with the characteristics of dropblock and the Mish activation function to improve segmentation under occlusion, thus enabling the robot to better handle grasping tasks in occluded scenes.
To achieve the above purpose, the present invention is realized according to the following technical solutions.
On the one hand, the present invention provides a robot grasping system based on small sample learning, comprising an image acquisition module, an image processing module, and an action processing module.
The image acquisition module includes a depth-of-field camera for capturing images.
The image processing module includes a U-net3+ network structure for processing images and completing recognition, localization, and segmentation tasks.
The action processing module includes a ROS (Robot Operating System) system and a corresponding package for converting image information into control information for controlling the motor.
Further, the U-net3+ network structure includes an encoding part and a decoding part; the decoding part is used to achieve the extraction of contextual information, and the encoding part is used to achieve precise localization of the target according to the extracted results. The encoding part consists of a stack of convolutional and pooling layers, and the decoding part consists of a stack of up-sampling, convolutional and BN layers. The input of each convolutional layer of the decoding part is a fusion of all the outputs of the encoding part: before fusing the input of each layer, the outputs of the encoding part need to be adjusted to the same size as this layer by up-sampling or pooling, and are then fused and fed to the convolutional layer as the input of this layer for convolution.
Further, the U-net3+ network structure further comprises a dropblock module for enhancing the network's recognition capability against occluded objects.
Further, the activation function of the U-net3+ network structure is a Mish activation function for optimizing the network to obtain better accuracy and generalization capability; the Mish function is expressed by the formula
Mish(x) = x × tanh(ln(1 + e^x))            (1)
where Mish is the activation function, tanh is the hyperbolic tangent function, ln(x) is the natural logarithm, and e^x is the exponential function with base e.
On the other hand, the present invention provides a robot grasping method based on small sample learning, comprising the following steps.
Step S1: an image acquisition module performs image acquisition of objects with different degrees of occlusion as well as different objects.
Step S2: the image processing module performs pre-processing of the acquired images, including image sharpening and image Gaussian filtering processing.
Step S3: the image processing module uses labelme to annotate the image, generates a json file, and then uses the official api to generate a mask image in png format; the obtained mask image data set is divided into two parts: the training set and the validation set, and the loss function Dice Loss is used to measure the similarity of the two sets.
Step S4: using the image training set, the U-net3+ network of the image processing module is trained using a learning rate with an initial value of 0.000001; the network is tested using the image test set to complete tasks such as recognition, localization and segmentation.
Step S5: the action processing module converts the image information into control information for controlling the robot and controls the robot to complete the grasping action.
Further, in step S3, the annotation of the image comprises classification annotation according to the type of object in the image, and classification annotation according to the occlusion of the object in the image; the classification annotation according to the occlusion of the object in the image comprises two different types of occluded and unoccluded.
The advantages and positive effects of the present invention compared to the prior art include at least the following.
The present invention proposes a grasping algorithm based on a segmentation network optimized for occlusion situations: it uses a U-net3+ network and enhances it with the characteristics of dropblock and the Mish activation function to improve segmentation under occlusion, designs a segmentation neural network for small sample situations, and designs a grasping system for multiple targets under occlusion. Using the Mish activation function to optimize the network results in improved accuracy and reduced training time across different networks and situations, allows the network to perform better in computation, and allows better input information to be passed into the network, thus giving the network better accuracy and generalization.
Description of the accompanying figures
To illustrate more clearly the technical solutions in the embodiments of the present invention, the following is a brief description of the accompanying drawings required for the description of the embodiments. It is obvious that the accompanying drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained from these drawings by a person of ordinary skill in the art without any creative effort.
Fig. 1 shows a schematic diagram of the robotic gripping system of the present invention.
Fig. 2 shows a schematic diagram of the third layer structure of the decoding part of a U-net3+ network of an embodiment of the present invention.
Fig. 3 shows a schematic diagram of the Mish function curve of the U-net3+ network of one embodiment of the present invention.
Fig. 4 is a schematic diagram of the network improvement comparison of one embodiment of the present invention.
Specific embodiments
To make the above-mentioned objects, features, and advantages of the present invention more obvious and understandable, the technical solutions of the present invention will be described in detail below in conjunction with the accompanying drawings and specific embodiments. It is to be noted that the embodiments described are only a part of the embodiments of the present invention and not all of them, and based on the embodiments in the present invention, all other embodiments obtained without creative labour by a person of ordinary skill in the art fall within the scope of protection of the present invention.
It should be noted that the specific parameters or quantities used in the embodiments of the invention are only a few possible or preferable sets of combinations used in the embodiments of the invention, but this should not be understood as a limitation of the scope of the patent of the invention; it should be noted that for a person of ordinary skill in the art, there are several variations and improvements that can be made without departing from the conception of the invention, and these fall within the scope of protection. Therefore, the scope of protection of the present invention shall be governed by the appended claims.
Example 1
The present invention proposes a grasping algorithm based on a segmentation network optimised for occlusion situations. The main body of the algorithm is a U-net3+ network, which has achieved good results in the field of medical image segmentation, and the invention proposes to enhance this network using the properties of dropblock and the Mish activation function to improve segmentation under occlusion.
The main elements of the invention are: the design of a segmentation neural network for the small sample case; the design of a grasping system for multiple targets in the occlusion case; and the optimisation of the network using the Mish activation function.
Figure 1 provides a schematic diagram of the robot grasping system of the present invention. As shown in Figure 1, the present invention provides a robot grasping system based on small sample learning, comprising an image acquisition module, an image processing module, and an action processing module.
The image acquisition module comprises a depth-of-field camera for capturing images;
the image processing module comprises a U-net3+ network structure for processing images;
the action processing module comprises a ROS (Robot Operating System) system and a corresponding package for converting the image information into control information for controlling the motor.
In order to better solve the robot grasping problem in the case of object occlusion, the invention adds a dropblock layer to the U-net3+ structure to optimise the performance of the network when objects occlude each other. The main structure of the U-net3+ encoding part is a stack of convolutional and pooling layers, with 3×3 convolutional kernels in each layer, activated using the ReLU function, with 64, 128, 256, 512, and 1024 convolutional kernels in the successive layers. The decoding part consists of a stack of up-sampling, convolutional, and BN layers, each with 3×3 convolutional kernels activated using the ReLU function, with the number of convolutional kernels corresponding to the encoding part and each convolution followed by a BN layer. The input to each convolutional layer of the decoding part is a fusion of all the outputs of the encoding part: before fusion, the outputs of the encoding part are up-sampled or pooled to the same size as the layer, then concatenated and fed to the convolutional layer as the input of this layer for convolution.
Figure 2 gives a schematic diagram of the structure of the third layer of the decoding part of the U-net3+ network of one embodiment of the present invention. As shown in Figure 2, taking the third layer of the decoding part as an example, the input of this layer is the fusion of the output of the previous layer with the outputs of the encoding part, and the outputs of the encoding part first need to be adjusted to the designed input size of this layer. The fused result is fed into the network as the input of this layer for computation.
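For illustration only, the following PyTorch sketch shows how such a decoder layer could be assembled; the per-branch channel count of 64, the use of bilinear interpolation for up-sampling, and the ReLU placed after the fusion convolution are assumptions of this sketch rather than details taken from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderFusionBlock(nn.Module):
    """Sketch of one U-net3+ decoder layer (cf. the third layer in Figure 2).
    Each input feature map (the encoder outputs plus the previous decoder
    output) is adjusted to this layer's spatial size by pooling or up-sampling,
    projected to a common channel count, concatenated, and passed through a
    3x3 convolution followed by BN and an activation."""

    def __init__(self, input_channels, branch_channels=64):
        super().__init__()
        # one 3x3 projection per incoming feature map (channel counts assumed)
        self.projections = nn.ModuleList(
            [nn.Conv2d(c, branch_channels, kernel_size=3, padding=1) for c in input_channels]
        )
        fused = branch_channels * len(input_channels)
        self.fuse = nn.Sequential(
            nn.Conv2d(fused, fused, kernel_size=3, padding=1),
            nn.BatchNorm2d(fused),
            nn.ReLU(inplace=True),
        )

    def forward(self, feats, target_size):
        resized = []
        for feat, proj in zip(feats, self.projections):
            if tuple(feat.shape[-2:]) != tuple(target_size):
                if feat.shape[-1] > target_size[-1]:
                    # larger maps from shallower layers are pooled down to this size
                    feat = F.adaptive_max_pool2d(feat, target_size)
                else:
                    # smaller maps from deeper layers are up-sampled to this size
                    feat = F.interpolate(feat, size=target_size,
                                         mode="bilinear", align_corners=False)
            resized.append(proj(feat))
        # the fused result becomes this layer's input to the convolution
        return self.fuse(torch.cat(resized, dim=1))
```

For the third decoder layer of Figure 2, input_channels would collect the channel counts of the encoder stages together with that of the previous decoder output.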
In addition, the original network structure does not have a dropblock module, so in order to address target occlusion, a dropblock layer is introduced to enhance the network's ability to recognise occluded objects. The dropblock is added after each convolutional layer in the encoder, and its main function is to randomly discard the information in contiguous regions of the feature map so as to improve the feature extraction ability of the whole network; the data after dropblock processing closely resembles the data of a partially occluded object. The network structure is thus optimised for the segmentation of occluded objects.
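A minimal DropBlock sketch is given below to make the occlusion analogy concrete; the drop probability, block size, and mask-rescaling heuristic are illustrative assumptions, as the patent does not specify them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DropBlock2d(nn.Module):
    """Minimal DropBlock sketch: during training, contiguous block_size x block_size
    regions of the feature map are zeroed, which resembles the information loss
    caused by partial occlusion of the target object."""

    def __init__(self, drop_prob=0.1, block_size=7):
        super().__init__()
        self.drop_prob = drop_prob
        self.block_size = block_size

    def forward(self, x):
        if not self.training or self.drop_prob == 0.0:
            return x
        # seed probability chosen so that roughly drop_prob of activations end up dropped
        gamma = self.drop_prob / (self.block_size ** 2)
        seeds = (torch.rand_like(x) < gamma).float()
        # grow every seed into a block_size x block_size zeroed region
        block_mask = 1.0 - F.max_pool2d(
            seeds, kernel_size=self.block_size, stride=1, padding=self.block_size // 2
        )
        # rescale so the expected activation magnitude is preserved
        return x * block_mask * block_mask.numel() / block_mask.sum().clamp(min=1.0)
```

In the improved network, an instance of this module would follow each convolutional layer of the encoder.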
The Mish function curve of the U-net3+ network of one embodiment of the invention is given in Figure 3. As shown in Figure 3, in order to optimise the network, the invention also modifies the activation function of the network by introducing the Mish activation function, which results in improved accuracy and reduced training time in different networks and under different circumstances. The formula for the Mish function is expressed as
Mish(x) = x × tanh(ln(1 + e^x))          (1)
Because the Mish function is unbounded above (positive values can reach any height), it avoids the saturation caused by capping. The small negative values it permits also give better performance: theoretically, a slight allowance for negative values enables better gradient flow than the hard zero boundary of ReLU. Most critically, the Mish function is smooth, allowing better input information to be passed into the network and resulting in better accuracy and generalisation capability.
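The Mish activation of formula (1) can be implemented in a few lines; the sketch below uses softplus to compute ln(1 + e^x) in a numerically stable way (recent PyTorch releases also provide a built-in nn.Mish).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Mish(nn.Module):
    """Mish activation from formula (1): Mish(x) = x * tanh(ln(1 + e^x)).
    softplus(x) computes ln(1 + e^x) without overflowing for large x."""

    def forward(self, x):
        return x * torch.tanh(F.softplus(x))

# Example of replacing ReLU with Mish in a conv + BN block of the network:
conv_block = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),
    Mish(),
)
```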
By adding the dropblock module to the U-net3+ network and replacing the activation function of the network with the Mish activation function, the improved network is obtained; a comparison of the partial structure of the improved network is shown in Figure 4, where the original U-net3+ network structure is shown on the left and the improved network structure on the right.
Example 2
The present invention also provides a robot grasping method based on small sample learning, comprising the following steps.
Step S1: The image acquisition module performs image acquisition of objects with different degrees of occlusion as well as different objects. The device platform uses a Daheng GigE Vision TL HD camera to acquire images of the stacked objects, and to reflect reality as closely as possible, different objects are used as subjects in the images.
Step S2: To make the images clearer and easier to use, the image processing module pre-processes the captured images, including image sharpening and image Gaussian filtering processing.
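As one possible realisation of this pre-processing, the sketch below applies Gaussian filtering followed by a sharpening kernel using OpenCV; the kernel size, sigma, and sharpening coefficients are assumptions, since the patent only names the two operations.

```python
import cv2
import numpy as np

def preprocess(image: np.ndarray) -> np.ndarray:
    """Gaussian filtering to suppress sensor noise, then a simple sharpening
    kernel to restore edge contrast before annotation and segmentation."""
    blurred = cv2.GaussianBlur(image, (5, 5), 0)
    sharpen_kernel = np.array([[0, -1, 0],
                               [-1, 5, -1],
                               [0, -1, 0]], dtype=np.float32)
    return cv2.filter2D(blurred, -1, sharpen_kernel)
```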
Step S3: The image processing module uses labelme to annotate the images, generates json files, and then uses the official api to generate mask images in png format; the resulting mask image data set is divided into two parts: the training set and the validation set, and the loss function Dice Loss is used to measure the similarity of the two sets.
In order to enable the robot to better handle the grasping task under occlusion, a new approach was adopted for annotating the dataset: images are annotated not only according to the type of object but also according to the occlusion situation, with each object type classified into occluded and unoccluded.
The resulting mask image dataset was divided into two parts: a training set of 60 images and a validation set of 20 images. In the experiments it was found that the loss functions commonly used for classification did not effectively describe the real state of network training; there were often cases where the accuracy and loss values looked good but the segmentation results were unsatisfactory. After several experiments, it was decided to use the loss function Dice Loss, which is commonly used in medical image segmentation. The value domain of Dice Loss is [0, 1]; it measures the similarity of two sets, and the smaller the value, the more similar the two sets are. The specific formula is shown below:
Dice Loss = 1 - 2|X ∩ Y| / (|X| + |Y|)
where X is the set of predicted classification values of the image pixels and Y is the set of true classification values of the image pixels.
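A possible implementation of this loss is sketched below; the smoothing constant added to keep the denominator non-zero and the treatment of predictions as per-pixel foreground probabilities are assumptions of the sketch.

```python
import torch

def dice_loss(pred: torch.Tensor, target: torch.Tensor, smooth: float = 1.0) -> torch.Tensor:
    """Dice Loss = 1 - 2|X ∩ Y| / (|X| + |Y|), averaged over the batch.
    pred holds predicted per-pixel probabilities and target the binary
    ground-truth mask; both are flattened per sample before computing overlap."""
    pred = pred.reshape(pred.size(0), -1)
    target = target.reshape(target.size(0), -1)
    intersection = (pred * target).sum(dim=1)
    dice = (2.0 * intersection + smooth) / (pred.sum(dim=1) + target.sum(dim=1) + smooth)
    return 1.0 - dice.mean()
```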
Step S4: Using the image training set, the U-net3+ network of the image processing module is trained with a learning rate whose initial value is 0.000001, and the network is then tested using the image test set. In order to obtain accurate classification results and to avoid situations where the network reports a high accuracy while the actual results have a large error, a very small learning rate with an initial value of 0.000001 is used, and the network is trained for 100 epochs.
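A minimal training-loop sketch for this step is shown below; the choice of the Adam optimiser, the sigmoid output activation, and the loader name are assumptions, since the patent fixes only the initial learning rate of 0.000001 and the 100 epochs, and dice_loss refers to the function sketched above.

```python
import torch

def train(model, train_loader, dice_loss, epochs=100, lr=1e-6):
    """Train the segmentation network with a very small initial learning rate
    (0.000001) for 100 epochs, minimising Dice Loss on the mask training set."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # optimiser choice is an assumption
    for _ in range(epochs):
        model.train()
        for images, masks in train_loader:
            images, masks = images.to(device), masks.to(device)
            optimizer.zero_grad()
            probs = torch.sigmoid(model(images))  # per-pixel foreground probabilities
            loss = dice_loss(probs, masks)
            loss.backward()
            optimizer.step()
    return model
```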
Step S5: The action processing module converts the image information into control information for controlling the robot, and controls the robot to complete the grasping action.
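The patent does not detail how the image information is converted into control information; the following hypothetical sketch shows one way such a module could turn a segmentation mask into a grasp target pose published on a ROS topic. The topic name, the helper pixel_to_base_pose that maps a pixel and its depth reading into the robot base frame, and the use of the mask centroid as the grasp point are all assumptions.

```python
import numpy as np
import rospy
from geometry_msgs.msg import PoseStamped

class GraspTargetPublisher:
    """Turns a segmentation mask into a grasp pose and publishes it on a topic
    that a downstream motion package can consume."""

    def __init__(self, topic="/grasp_target"):
        self.pub = rospy.Publisher(topic, PoseStamped, queue_size=1, latch=True)

    def publish(self, mask, depth, pixel_to_base_pose):
        ys, xs = np.nonzero(mask)                     # pixels of the selected object
        u, v = int(xs.mean()), int(ys.mean())         # mask centroid as the grasp point
        pose = pixel_to_base_pose(u, v, depth[v, u])  # assumed camera-to-base transform helper
        self.pub.publish(pose)
```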
The advantages and positive effects of the present invention over the prior art include at least the following.
The present invention proposes a grasping algorithm based on a segmentation network optimised for occlusion situations: it uses the U-net3+ network and enhances it with the characteristics of dropblock and the Mish activation function to improve the segmentation effect under occlusion, designs a segmentation neural network for small sample situations, and designs a grasping system for multiple targets under occlusion. Using the Mish activation function to optimise the network results in improved accuracy and reduced training time for different networks and situations, allows the network to perform better in computation, and allows better input information to be passed into the network, thus giving the network better accuracy and generalisation.
The above-described embodiments express only several embodiments of the present invention, which are described in more specific and detailed terms, but should not be construed as limiting the scope of the patent of the present invention. It should be noted that, for a person of ordinary skill in the art, several variations and improvements can be made without departing from the conception of the present invention, and these fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be governed by the appended claims.

Claims (6)

  1. A robot grasping system based on few-shot learning, comprising:
    an image acquisition module, an image processing module, and an action processing module;
    the image acquisition module includes a depth-of-field camera for capturing images;
    the image processing module includes a U-net3+ network structure for processing images and completing recognition, localization, and segmentation tasks;
    the action processing module includes a ROS system and a corresponding package for converting image information into control information for controlling the motor.
  2. A robot grasping system based on few-shot learning as claimed in claim 1, wherein, the U-net3+ network structure includes an encoding part and a decoding part, the decoding part is used to achieve extraction of contextual information, and the encoding part is used to achieve precise localization of the target according to the extracted results; the encoding part consists of a stack of convolutional and pooling layers, and the decoding part consists of a stack of up-sampling, convolutional and BN layers; the input of the convolutional layer of the decoding part is a fusion of all the outputs of the encoding part; before fusing the input of each layer, the outputs of the encoding part need to be adjusted to the same size as this layer by up-sampling or pooling, and are then fused and fed to the convolutional layer as the input of this layer for convolution.
  3. A robot grasping system based on few-shot learning as claimed in claim 2, wherein, the U-net3+ network structure further comprises a dropblock module for enhancing the network's recognition capability against occluded objects.
  4. A robot grasping system based on few-shot learning as claimed in claim 3, wherein, the activation function of the U-net3+ network structure is a Mish activation function for optimizing the network to obtain better accuracy and generalization capability; the Mish function is expressed by the formula
    Mish(x) = x × tanh(ln(1 + e^x))  (1)
    where Mish is the activation function, tanh is the hyperbolic tangent function, ln(x) is the natural logarithm, and e^x is the exponential function with base e.
  5. A robot grasping method based on few-shot learning, comprising:
    Step S1: an image acquisition module performing image acquisition of objects with different degrees of occlusion as well as different objects;
    Step S2: the image processing module performs pre-processing of the acquired images, including image sharpening and image Gaussian filtering processing;
    Step S3: the image processing module uses labelme to annotate the image, generates a json file, and then uses the official api to generate a mask image in png format; the obtained mask image data set is divided into two parts: the training set and the validation set, and the loss function Dice Loss is used to measure the similarity of the two sets;
    Step S4: using the image training set, the U-net3+ network of the image processing module is trained using a learning rate with an initial value of 0.000001; the network is tested using the image test set to complete tasks such as recognition, localization and segmentation;
    Step S5: the action processing module converts the image information into control information for controlling the robot and controls the robot to complete the grasping action.
  6. A robot grasping method based on few-shot learning as claimed in claim 5, wherein, in step S3, the annotation of the image comprises classification annotation according to the type of object in the image, and classification annotation according to the occlusion of the object in the image; the classification annotation according to the occlusion of the object in the image comprises two different types of occluded and unoccluded.
PCT/CN2021/108568 2021-01-04 2021-07-27 A robot grasping system and method based on few-shot learning WO2022142297A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110004574.6 2021-01-04
CN202110004574.6A CN114723775A (en) 2021-01-04 2021-01-04 Robot grabbing system and method based on small sample learning

Publications (1)

Publication Number Publication Date
WO2022142297A1 true WO2022142297A1 (en) 2022-07-07

Family

ID=82234481

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/108568 WO2022142297A1 (en) 2021-01-04 2021-07-27 A robot grasping system and method based on few-shot learning

Country Status (2)

Country Link
CN (1) CN114723775A (en)
WO (1) WO2022142297A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116452936A (en) * 2023-04-22 2023-07-18 安徽大学 Rotation target detection method integrating optics and SAR image multi-mode information

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115631401A (en) * 2022-12-22 2023-01-20 广东省科学院智能制造研究所 Robot autonomous grabbing skill learning system and method based on visual perception

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109584298A (en) * 2018-11-07 2019-04-05 上海交通大学 Object manipulator picks up the automatic measure on line method of task from master object
US20200086483A1 (en) * 2018-09-15 2020-03-19 X Development Llc Action prediction networks for robotic grasping
CN111898699A (en) * 2020-08-11 2020-11-06 海之韵(苏州)科技有限公司 Automatic detection and identification method for hull target
CN112136505A (en) * 2020-09-07 2020-12-29 华南农业大学 Fruit picking sequence planning method based on visual attention selection mechanism

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200086483A1 (en) * 2018-09-15 2020-03-19 X Development Llc Action prediction networks for robotic grasping
CN109584298A (en) * 2018-11-07 2019-04-05 上海交通大学 Object manipulator picks up the automatic measure on line method of task from master object
CN111898699A (en) * 2020-08-11 2020-11-06 海之韵(苏州)科技有限公司 Automatic detection and identification method for hull target
CN112136505A (en) * 2020-09-07 2020-12-29 华南农业大学 Fruit picking sequence planning method based on visual attention selection mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HUIMIN HUANG; LANFEN LIN; RUOFENG TONG; HONGJIE HU; QIAOWEI ZHANG; YUTARO IWAMOTO; XIANHUA HAN; YEN-WEI CHEN; JIAN WU: "UNet 3+: A Full-Scale Connected UNet for Medical Image Segmentation", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 19 April 2020 (2020-04-19), 201 Olin Library Cornell University Ithaca, NY 14853 , XP081647951 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116452936A (en) * 2023-04-22 2023-07-18 安徽大学 Rotation target detection method integrating optics and SAR image multi-mode information
CN116452936B (en) * 2023-04-22 2023-09-29 安徽大学 Rotation target detection method integrating optics and SAR image multi-mode information

Also Published As

Publication number Publication date
CN114723775A (en) 2022-07-08

Similar Documents

Publication Publication Date Title
WO2020108362A1 (en) Body posture detection method, apparatus and device, and storage medium
CN109344701B (en) Kinect-based dynamic gesture recognition method
Makhmudkhujaev et al. Facial expression recognition with local prominent directional pattern
CN110032925B (en) Gesture image segmentation and recognition method based on improved capsule network and algorithm
WO2022142297A1 (en) A robot grasping system and method based on few-shot learning
JP2009211178A (en) Image processing apparatus, image processing method, program and storage medium
CN108648216B (en) Visual odometer implementation method and system based on optical flow and deep learning
US20210279453A1 (en) Methods and systems for computerized recognition of hand gestures
CN109977834B (en) Method and device for segmenting human hand and interactive object from depth image
Rao et al. Neural network classifier for continuous sign language recognition with selfie video
CN112001317A (en) Lead defect identification method and system based on semantic information and terminal equipment
Han et al. Pupil center detection based on the UNet for the user interaction in VR and AR environments
CN112766028A (en) Face fuzzy processing method and device, electronic equipment and storage medium
CN112329662B (en) Multi-view saliency estimation method based on unsupervised learning
WO2024060909A1 (en) Expression recognition method and apparatus, and device and medium
Wang Automatic and robust hand gesture recognition by SDD features based model matching
CN112329510A (en) Cross-domain metric learning system and method
Tsai et al. Deep Learning Based AOI System with Equivalent Convolutional Layers Transformed from Fully Connected Layers
CN117036658A (en) Image processing method and related equipment
Ravinder et al. An approach for gesture recognition based on a lightweight convolutional neural network
CN117274761B (en) Image generation method, device, electronic equipment and storage medium
Bong et al. Application of Fixed-Radius Hough Transform In Eye Detection.
CN117372437B (en) Intelligent detection and quantification method and system for facial paralysis
Jiang et al. Unsupervised Deep Homography Estimation based on Transformer
Li et al. Scalenet-improve cnns through recursively rescaling objects

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21913048

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21913048

Country of ref document: EP

Kind code of ref document: A1