CN116091571A - Method and equipment for carrying out instance segmentation on picture - Google Patents

Method and equipment for carrying out instance segmentation on picture

Info

Publication number
CN116091571A
Authority
CN
China
Prior art keywords
channel image
neural network
camera
frame gray
depth information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111305773.7A
Other languages
Chinese (zh)
Inventor
丁凯 (Ding Kai)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Robert Bosch GmbH
Original Assignee
Robert Bosch GmbH
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Robert Bosch GmbH filed Critical Robert Bosch GmbH
Priority to CN202111305773.7A priority Critical patent/CN116091571A/en
Priority to PCT/EP2022/079193 priority patent/WO2023078686A1/en
Publication of CN116091571A publication Critical patent/CN116091571A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/90Determination of colour characteristics
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/10Image acquisition
    • G06V10/12Details of acquisition arrangements; Constructional details thereof
    • G06V10/14Optical characteristics of the device performing the acquisition or on the illumination arrangements
    • G06V10/141Control of illumination
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/10Image acquisition
    • G06V10/16Image acquisition using multiple overlapping images; Image stitching
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30244Camera pose

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a method for training a neural network to perform instance segmentation on pictures, the method comprising: acquiring multiple grayscale frames of the same scene based on different settings of a camera; combining the multiple grayscale frames into a multi-channel image; and training the neural network for instance segmentation using the multi-channel image. The invention also relates to an apparatus for training a neural network to perform instance segmentation on a picture, a method and an apparatus for performing instance segmentation on a picture using a trained neural network, a robot grasping method and system, a computer storage medium, and a computer program product.

Description

Method and equipment for carrying out instance segmentation on picture
Technical Field
The present invention relates to the field of image segmentation, and more particularly to a method and apparatus for training a neural network to perform instance segmentation on a picture, a method and apparatus for performing instance segmentation on a picture using a trained neural network, a robot grasping method and system, a computer storage medium, and a computer program product.
Background
In robot grasping applications, the objects to be grasped are randomly placed or stacked on a tray. For a robot to grasp an object, the pose of the object must be detected, for example with a three-dimensional camera. There are many kinds of three-dimensional cameras: time-of-flight (ToF) cameras, structured-light cameras, stereo cameras, and so on. To determine the pose (e.g., the six-dimensional pose) of an object, instance segmentation of a two-dimensional image (e.g., an RGB image) of the scene containing the object is required.
However, some three-dimensional cameras, such as ToF cameras, provide only grayscale images. It will be appreciated that an RGB image has three channels (red, green and blue), whereas a grayscale image has only one channel. Deep-learning instance segmentation performs poorly on such single-channel grayscale images, which in turn adversely affects the subsequent determination of the object's pose, robotic grasping, and so on.
Disclosure of Invention
According to an aspect of the present invention, there is provided a method of training a neural network to perform instance segmentation on a picture, the method comprising: acquiring multiple grayscale frames of the same scene based on different settings of a camera; combining the multiple grayscale frames into a multi-channel image; and training the neural network for instance segmentation using the multi-channel image.
Additionally or alternatively to the above, the method may further comprise: acquiring depth information of the scene using the camera.
Additionally or alternatively to the above, in the above method, combining the multiple grayscale frames into a multi-channel image comprises: combining the multiple grayscale frames and the depth information into a multi-channel image.
Additionally or alternatively to the above, in the above method, the neural network is a Mask region convolutional neural network (Mask R-CNN), and the multi-channel image is a four-channel image composed of three grayscale frames, corresponding to three different settings of the camera, and the depth information.
Additionally or alternatively to the above, in the above method, the different settings of the camera comprise different exposure times and amplitude gains, wherein when imaging a dark object of reflectivity f1, an exposure time t1 and an amplitude gain g1 are used, and when imaging an object of reflectivity f2, an exposure time t2 and an amplitude gain g2 are used, wherein f1 < f2, t1 > t2, and g1 > g2.
According to another aspect of the present invention, there is provided an apparatus for training a neural network to perform instance segmentation on a picture, the apparatus comprising: a first acquisition device for acquiring multiple grayscale frames of the same scene based on different settings of a camera; combining means for combining the multiple grayscale frames into a multi-channel image; and training means for training the neural network for instance segmentation using the multi-channel image.
Additionally or alternatively to the above, the apparatus further comprises: a second acquisition device for acquiring depth information of the scene using the camera.
Additionally or alternatively to the above, in the above apparatus, the combining means is configured to combine the multiple grayscale frames and the depth information into a multi-channel image.
Additionally or alternatively to the above, in the above apparatus, the neural network is a Mask region convolutional neural network (Mask R-CNN), and the multi-channel image is a four-channel image composed of three grayscale frames, corresponding to three different settings of the camera, and the depth information.
Additionally or alternatively to the above, in the above apparatus, the different settings of the camera comprise different exposure times and amplitude gains, wherein when imaging a dark object of reflectivity f1, an exposure time t1 and an amplitude gain g1 are used, and when imaging an object of reflectivity f2, an exposure time t2 and an amplitude gain g2 are used, wherein f1 < f2, t1 > t2, and g1 > g2.
According to yet another aspect of the present invention, there is provided a method of performing instance segmentation on an image, the method comprising: acquiring multiple grayscale frames of the same scene based on different settings of a camera; combining the multiple grayscale frames into a multi-channel image; and performing instance segmentation on the multi-channel image using a trained neural network.
Additionally or alternatively to the above, the method may further comprise: acquiring depth information of the scene using the camera, for example a ToF camera.
Additionally or alternatively to the above, in the above method, combining the multiple grayscale frames into a multi-channel image comprises: combining the multiple grayscale frames and the depth information into a multi-channel image.
Additionally or alternatively to the above, in the above method, the trained neural network is a Mask region convolutional neural network (Mask R-CNN), and the multi-channel image is a four-channel image composed of three grayscale frames, corresponding to three different settings of the camera, and the depth information.
Additionally or alternatively to the above, in the above method, the different settings of the camera comprise different exposure times and amplitude gains, wherein when imaging a dark object of reflectivity f1, an exposure time t1 and an amplitude gain g1 are used, and when imaging an object of reflectivity f2, an exposure time t2 and an amplitude gain g2 are used, wherein f1 < f2, t1 > t2, and g1 > g2. That is, when imaging a dark object of low reflectivity, a longer exposure time and a higher gain should be used, whereas a shorter exposure time and normal gain are suitable for highly reflective objects.
According to still another aspect of the present invention, there is provided a robot grasping method, the method comprising: performing the above method of instance segmentation on an image of a scene containing an object to be grasped, captured by a three-dimensional camera mounted on the robot, so as to obtain a point cloud segmentation result of the object to be grasped; and performing a grasping task based on the point cloud segmentation result.
According to still another aspect of the present invention, there is provided an apparatus for performing instance segmentation on an image, the apparatus comprising: a first acquisition device for acquiring multiple grayscale frames of the same scene based on different settings of a camera; combining means for combining the multiple grayscale frames into a multi-channel image; and segmentation means for performing instance segmentation on the multi-channel image using a trained neural network.
Additionally or alternatively to the above, the apparatus further comprises: a second acquisition device for acquiring depth information of the scene using the camera.
Additionally or alternatively to the above, in the above apparatus, the combining means is configured to combine the multiple grayscale frames from the first acquisition device and the depth information from the second acquisition device into a multi-channel image.
Additionally or alternatively to the above, in the above apparatus, the segmentation means may perform instance segmentation using a Mask region convolutional neural network (Mask R-CNN), and the multi-channel image may be a four-channel image composed of three grayscale frames, corresponding to three different settings of the camera, and the depth information.
Additionally or alternatively to the above, in the above apparatus, the different settings of the camera comprise different exposure times and amplitude gains, wherein when imaging a dark object of reflectivity f1, an exposure time t1 and an amplitude gain g1 are used, and when imaging an object of reflectivity f2, an exposure time t2 and an amplitude gain g2 are used, wherein f1 < f2, t1 > t2, and g1 > g2. That is, when imaging a dark object of low reflectivity, a longer exposure time and a higher gain should be used, whereas a shorter exposure time and normal gain are suitable for highly reflective objects.
According to yet another aspect of the present invention, there is provided a robot grasping system comprising: an apparatus for performing instance segmentation on an image as described above, the apparatus being configured to perform instance segmentation on a scene image containing an object to be grasped, captured by a three-dimensional camera mounted on the robot, so as to obtain a point cloud segmentation result of the object to be grasped; and a grasping device for performing a grasping task based on the point cloud segmentation result.
According to yet another aspect of the present invention there is provided a computer storage medium comprising instructions which, when executed, perform a method as previously described.
According to a further aspect of the invention there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method as described above.
According to embodiments of the invention, multiple grayscale frames of the same scene are acquired under different camera settings and combined into a multi-channel image on which instance segmentation is performed. This enlarges the dynamic range of the captured data and greatly improves the accuracy of instance segmentation, which in turn helps ensure a high success rate when a robot grasps objects based on this instance segmentation scheme.
Drawings
The above and other objects and advantages of the present invention will become more fully apparent from the following detailed description taken in conjunction with the accompanying drawings, in which identical or similar elements are designated by the same reference numerals.
FIG. 1 shows a flow diagram of a method of training a neural network to perform instance segmentation on a picture, according to one embodiment of the invention;
FIG. 2 shows a schematic structural diagram of an apparatus for training a neural network to perform instance segmentation on a picture, according to one embodiment of the invention;
FIG. 3 shows a flow diagram of a method for performing instance segmentation on a picture using a trained neural network, according to one embodiment of the invention;
FIG. 4 shows a schematic structural diagram of an apparatus for performing instance segmentation on a picture using a trained neural network, according to one embodiment of the invention;
FIG. 5 shows a flow diagram of a robot grasping method according to one embodiment of the invention; and
FIG. 6 shows a schematic structural diagram of a robot grasping system according to an embodiment of the invention.
Detailed Description
Hereinafter, instance segmentation schemes and robot grasping schemes according to various exemplary embodiments of the present invention will be described in detail with reference to the accompanying drawings.
Fig. 1 shows a flow diagram of a method 1000 of training a neural network to perform instance segmentation on a picture, according to one embodiment of the invention. As shown in fig. 1, the method 1000 includes the following steps:
in step S110, multiple grayscale frames of the same scene are acquired based on different settings of the camera;
in step S120, the multiple grayscale frames are combined into a multi-channel image; and
in step S130, the neural network is trained for instance segmentation using the multi-channel image.
The term "instance segmentation," also called image instance segmentation, builds on object detection by further separating the foreground of each detected object from the background, achieving object separation at the pixel level. Instance segmentation is used in scenarios such as object detection, face detection, expression recognition, medical image processing and computer-aided diagnosis, video surveillance and object tracking, and shelf-vacancy recognition in retail settings. In one embodiment, "instance segmentation" first frames the different instances in an image automatically with an object detection method and then labels the pixels within each instance region with a semantic segmentation method. Unlike instance segmentation, semantic segmentation does not distinguish between different instances of the same class. For example, when there are two cats in an image, semantic segmentation predicts all pixels of both cats as a single "cat" class, whereas instance segmentation must distinguish which pixels belong to the first cat and which belong to the second.
In one embodiment, although not shown in FIG. 1, the method 1000 described above may further include: acquiring depth information of the scene using the camera. In one or more embodiments, the camera is a ToF (time-of-flight) camera.
ToF stands for Time of Flight. Time-of-flight 3D imaging continuously emits light pulses toward the target, receives the light returned from the object with a sensor, and obtains the distance to the target by measuring the round-trip time of the light pulses. The principle is broadly similar to that of a 3D laser sensor, except that a 3D laser sensor scans point by point, whereas a ToF camera obtains depth information for the entire image at once. A ToF camera resembles an ordinary machine-vision imaging setup and consists of a light source, optics, a sensor, control circuitry, processing circuitry, and other units. Compared with the superficially similar binocular stereo system, which also performs non-contact three-dimensional measurement, the ToF camera has a fundamentally different 3D imaging mechanism: binocular stereo obtains depth by matching left and right image pairs and then triangulating, whereas a ToF camera obtains the target distance directly from the emitted and reflected light.
ToF technology uses active light detection. Unlike ordinary illumination, the purpose of the ToF illumination unit is not to light the scene but to measure distance from the change between the emitted and the reflected light signals. The illumination unit therefore modulates the light at high frequency before emitting it, for example pulsed light from an LED or a laser diode, with modulation frequencies of up to 100 MHz; a depth value can then be recovered per pixel, as sketched in the example below.
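For a continuous-wave ToF sensor of this kind, the depth of each pixel is typically recovered from the phase shift between the emitted and reflected signal. The following is a minimal illustrative sketch in Python; the modulation frequency and phase value used here are assumptions for illustration and are not taken from this disclosure:

```python
import numpy as np

C = 299_792_458.0  # speed of light in m/s

def depth_from_phase(phase_shift_rad, mod_freq_hz):
    """Per-pixel depth of a continuous-wave ToF sensor: d = c * phi / (4 * pi * f)."""
    return C * np.asarray(phase_shift_rad) / (4.0 * np.pi * mod_freq_hz)

# Example: at a 100 MHz modulation frequency, a measured phase shift of pi/2
# corresponds to a depth of about 0.37 m (unambiguous range c / (2 f) = 1.5 m).
print(depth_from_phase(np.pi / 2, 100e6))
```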
Compared with a stereo camera or a triangulation system, a ToF camera is compact and well suited to applications that require a portable, small camera. A ToF camera can also compute depth information quickly in real time, at frame rates from tens of fps up to about 100 fps. In addition, because ToF depth calculation is not affected by the surface shading and texture of the object, three-dimensional measurement can be performed very accurately. Moreover, the depth accuracy of ToF does not degrade with distance and remains roughly at the centimeter level, which matters for applications with a large range of motion.
Despite the advantages described above, ToF cameras provide only single-channel grayscale images, which makes them a poor training dataset for a neural network performing instance segmentation. Therefore, in step S110, multiple grayscale frames of the same scene are acquired based on different settings of the camera. In one or more embodiments, the different settings of the camera include different exposure times and amplitude gains. For example, when imaging dark objects of low reflectivity, a longer exposure time and a higher gain should be used, whereas a shorter exposure time and normal gain are suitable for highly reflective objects. In robot grasping applications there may be many different objects to grasp, including dark objects and objects of both high and low reflectivity. Acquiring multiple frames (e.g., 3 frames) of the same scene with different combinations of exposure time and amplitude gain therefore helps improve dynamic range compared with a single acquired frame, as sketched in the capture example below.
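As an illustration of how such frames might be acquired, the sketch below loops over three (exposure time, amplitude gain) settings. The `ToFCamera`-style interface, its method names, and the concrete exposure and gain values are assumptions made for illustration only and do not correspond to any particular camera SDK:

```python
import numpy as np

# Assumed (exposure time in microseconds, amplitude gain) pairs: a long exposure
# with high gain for dark / low-reflectivity objects, a medium setting, and a
# short exposure with normal gain for highly reflective objects.
CAPTURE_SETTINGS = [(4000, 8.0), (1500, 3.0), (500, 1.0)]

def capture_gray_frames(camera, settings=CAPTURE_SETTINGS):
    """Acquire one grayscale frame of the same static scene per camera setting."""
    frames = []
    for exposure_us, gain in settings:
        camera.set_exposure_us(exposure_us)   # hypothetical driver call
        camera.set_amplitude_gain(gain)       # hypothetical driver call
        frames.append(np.asarray(camera.grab_gray(), dtype=np.float32))
    return frames  # list of H x W grayscale arrays
```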
In one embodiment, step S120 includes combining the multiple grayscale frames and the depth information into a multi-channel image. For example, when the neural network to be trained is a Mask region convolutional neural network (Mask R-CNN), the multi-channel image may be a four-channel image composed of three grayscale frames, corresponding to three different settings of the camera, and the depth information. That is, the three grayscale frames occupy three channels and the depth information occupies the fourth channel. In one embodiment, the gray values of the multiple grayscale frames and the depth information may be assembled and stored in a four-channel image format to build the training dataset for subsequently training the neural network, for example as in the sketch following this paragraph.
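A minimal sketch of assembling such a four-channel sample with NumPy, assuming three grayscale frames and a depth map of the same resolution (the normalization constants are illustrative assumptions):

```python
import numpy as np

def build_four_channel_image(gray_frames, depth_map, max_depth_m=2.0):
    """Stack three grayscale frames and one depth map into an H x W x 4 array."""
    assert len(gray_frames) == 3, "expects one frame per camera setting"
    grays = [np.clip(g.astype(np.float32) / 255.0, 0.0, 1.0) for g in gray_frames]
    depth = np.clip(depth_map.astype(np.float32) / max_depth_m, 0.0, 1.0)
    return np.dstack(grays + [depth])  # channels: gray_1, gray_2, gray_3, depth

# Usage with random data standing in for real captures:
rng = np.random.default_rng(0)
sample = build_four_channel_image(
    [rng.integers(0, 256, (480, 640)) for _ in range(3)],
    rng.uniform(0.3, 1.8, (480, 640)),
)
print(sample.shape)  # (480, 640, 4)
```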
Of course, those skilled in the art will appreciate that the multi-channel image is not limited to a four-channel image. In one embodiment, the multi-channel image may be a three-channel image (e.g., two grayscale frames plus the depth information).
In step S130, the neural network is trained for instance segmentation using the multi-channel image. In one embodiment, the neural network to be trained is a Mask region convolutional neural network (Mask R-CNN).
Specifically, Mask R-CNN is a two-stage framework: the first stage scans the image and generates proposals (i.e., regions that may contain an object), and the second stage classifies the proposals and generates bounding boxes and masks. Mask R-CNN extends Faster R-CNN, a popular object detection framework, into an instance segmentation framework. One possible way to adapt such a network to the four-channel input described above is sketched below.
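The disclosure does not prescribe a particular implementation. As one possible sketch, a torchvision Mask R-CNN could be adapted to a four-channel (three grayscale channels plus depth) input by widening the first backbone convolution and the input normalization; the layer path, normalization statistics, and toy training example below are assumptions tied to torchvision's ResNet-50 FPN backbone, not part of this patent:

```python
import torch
import torch.nn as nn
from torchvision.models.detection import maskrcnn_resnet50_fpn

def four_channel_mask_rcnn(num_classes):
    """Mask R-CNN adapted to a 4-channel (3 x gray + depth) input. Sketch only."""
    model = maskrcnn_resnet50_fpn(weights=None, num_classes=num_classes)
    # Replace the stem convolution so it accepts 4 input channels instead of 3.
    model.backbone.body.conv1 = nn.Conv2d(4, 64, kernel_size=7, stride=2,
                                          padding=3, bias=False)
    # The internal transform normalizes per channel; give it 4 values (assumed).
    model.transform.image_mean = [0.5, 0.5, 0.5, 0.5]
    model.transform.image_std = [0.25, 0.25, 0.25, 0.25]
    return model

# One toy training step with a single synthetic sample (background + 1 class).
model = four_channel_mask_rcnn(num_classes=2)
images = [torch.rand(4, 480, 640)]
targets = [{
    "boxes": torch.tensor([[100.0, 100.0, 200.0, 200.0]]),
    "labels": torch.tensor([1]),
    "masks": torch.zeros(1, 480, 640, dtype=torch.uint8),
}]
model.train()
losses = model(images, targets)  # dict of classification, box, and mask losses
```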
Turning to fig. 2, fig. 2 shows an apparatus 2000 for training a neural network to perform instance segmentation on a picture. The apparatus 2000 comprises a first acquisition device 210, combining means 220 and training means 230. The first acquisition device 210 is configured to acquire multiple grayscale frames of the same scene based on different settings of the camera; the combining means 220 is configured to combine the multiple grayscale frames into a multi-channel image; and the training means 230 is configured to train the neural network for instance segmentation using the multi-channel image.
The term "instance segmentation," also called image instance segmentation, builds on object detection by further separating the foreground of each detected object from the background, achieving object separation at the pixel level. Instance segmentation is used in scenarios such as object detection, face detection, expression recognition, medical image processing and computer-aided diagnosis, video surveillance and object tracking, and shelf-vacancy recognition in retail settings. In one embodiment, "instance segmentation" first frames the different instances in an image automatically with an object detection method and then labels the pixels within each instance region with a semantic segmentation method. Unlike instance segmentation, semantic segmentation does not distinguish between different instances of the same class. For example, when there are two cats in an image, semantic segmentation predicts all pixels of both cats as a single "cat" class, whereas instance segmentation must distinguish which pixels belong to the first cat and which belong to the second.
In one embodiment, although not shown in FIG. 2, the apparatus 2000 may further include: a second acquisition device for acquiring depth information of the scene using the camera. Based on this depth information, the training dataset can be better constructed for training the neural network.
In one or more embodiments, the camera is a ToF (time-of-flight) camera. ToF stands for Time of Flight. Time-of-flight 3D imaging continuously emits light pulses toward the target, receives the light returned from the object with a sensor, and obtains the distance to the target by measuring the round-trip time of the light pulses. The principle is broadly similar to that of a 3D laser sensor, except that a 3D laser sensor scans point by point, whereas a ToF camera obtains depth information for the entire image at once. A ToF camera resembles an ordinary machine-vision imaging setup and consists of a light source, optics, a sensor, control circuitry, processing circuitry, and other units. Compared with the superficially similar binocular stereo system, which also performs non-contact three-dimensional measurement, the ToF camera has a fundamentally different 3D imaging mechanism: binocular stereo obtains depth by matching left and right image pairs and then triangulating, whereas a ToF camera obtains the target distance directly from the emitted and reflected light.
ToF technology uses active light detection. Unlike ordinary illumination, the purpose of the ToF illumination unit is not to light the scene but to measure distance from the change between the emitted and the reflected light signals. The illumination unit therefore modulates the light at high frequency before emitting it, for example pulsed light from an LED or a laser diode, with modulation frequencies of up to 100 MHz.
Compared with a stereo camera or a triangulation system, a ToF camera is compact and well suited to applications that require a portable, small camera. A ToF camera can also compute depth information quickly in real time, at frame rates from tens of fps up to about 100 fps. In addition, because ToF depth calculation is not affected by the surface shading and texture of the object, three-dimensional measurement can be performed very accurately. Moreover, the depth accuracy of ToF does not degrade with distance and remains roughly at the centimeter level, which matters for applications with a large range of motion.
Despite the advantages described above, ToF cameras provide only single-channel grayscale images, which hampers neural network learning. Thus, the first acquisition device 210 acquires multiple grayscale frames of the same scene based on different settings of the camera. In one or more embodiments, the different settings of the camera include different exposure times and amplitude gains. For example, when imaging dark objects of low reflectivity, a longer exposure time and a higher gain should be used, whereas a shorter exposure time and normal gain are suitable for highly reflective objects. In robot grasping applications there may be many different objects to grasp, including dark objects and objects of both high and low reflectivity. Acquiring multiple frames (e.g., 3 frames) of the same scene with different combinations of exposure time and amplitude gain therefore helps improve the dynamic range compared with a single acquired frame.
In one embodiment, the combining means 220 is configured to combine the multiple grayscale frames and the depth information into a multi-channel image. In one embodiment, the multi-channel image is a four-channel image composed of three grayscale frames, corresponding to three different settings of the camera, and the depth information. That is, the three grayscale frames occupy three channels and the depth information occupies the fourth channel. The combining means 220 may be configured to assemble and store the gray values of the grayscale frames and the depth information in a four-channel image format for subsequent deep learning (instance segmentation).
Of course, those skilled in the art will appreciate that the multi-channel image is not limited to a four-channel image. In one embodiment, the multi-channel image may be a three-channel image (e.g., two grayscale frames plus the depth information).
The training means 230 is used to train the neural network for instance segmentation using the multi-channel image. In one embodiment, the training means 230 may be configured to train a Mask region convolutional neural network (Mask R-CNN) for instance segmentation using the multi-channel image. Specifically, Mask R-CNN is a two-stage framework: the first stage scans the image and generates proposals (i.e., regions that may contain an object), and the second stage classifies the proposals and generates bounding boxes and masks. Mask R-CNN extends Faster R-CNN, a popular object detection framework, into an instance segmentation framework.
Referring to fig. 3, a flow diagram of a method 3000 for performing instance segmentation on a picture using a trained neural network is shown, according to one embodiment of the invention. As shown in fig. 3, the method 3000 includes the following steps:
in step S310, multiple grayscale frames of the same scene are acquired based on different settings of the camera;
in step S320, the multiple grayscale frames are combined into a multi-channel image; and
in step S330, instance segmentation is performed on the multi-channel image using a trained neural network.
In one embodiment, although not shown in fig. 3, the method 3000 may further include: acquiring depth information of the scene using the camera. In one or more embodiments, the camera is a ToF (time-of-flight) camera. ToF stands for Time of Flight: time-of-flight 3D imaging continuously emits light pulses toward the target, receives the light returned from the object with a sensor, and obtains the distance to the target by measuring the round-trip time of the light pulses. Compared with a stereo camera or a triangulation system, a ToF camera is compact and well suited to applications that require a portable, small camera. A ToF camera can also compute depth information quickly in real time, at frame rates from tens of fps up to about 100 fps. In addition, because ToF depth calculation is not affected by the surface shading and texture of the object, three-dimensional measurement can be performed very accurately. Moreover, the depth accuracy of ToF does not degrade with distance and remains roughly at the centimeter level, which matters for applications with a large range of motion.
Despite the advantages described above, ToF cameras provide only single-channel grayscale images, which is detrimental to instance segmentation using neural networks. Therefore, in step S310, multiple grayscale frames of the same scene are acquired based on different settings of the camera. In one or more embodiments, the different settings of the camera include different exposure times and amplitude gains. For example, when imaging dark objects of low reflectivity, a longer exposure time and a higher gain should be used, whereas a shorter exposure time and normal gain are suitable for highly reflective objects. In robot grasping applications there may be many different objects to grasp, including dark objects and objects of both high and low reflectivity. Acquiring multiple frames (e.g., 3 frames) of the same scene with different combinations of exposure time and amplitude gain therefore helps improve dynamic range compared with a single acquired frame.
In one embodiment, step S320 includes combining the multiple grayscale frames and the depth information into a multi-channel image. For example, when the trained neural network is a Mask region convolutional neural network (Mask R-CNN), the multi-channel image may be a four-channel image composed of three grayscale frames, corresponding to three different settings of the camera, and the depth information. That is, the three grayscale frames occupy three channels and the depth information occupies the fourth channel. In one embodiment, the gray values of the multiple grayscale frames and the depth information may be assembled and stored in a four-channel image format so that instance segmentation can subsequently be performed on the multi-channel image with the trained neural network.
Of course, those skilled in the art will appreciate that the multi-channel image is not limited to a four-channel image. In one embodiment, the multi-channel image may be a three-channel image (e.g., two grayscale frames plus the depth information).
In step S330, instance segmentation is performed on the multi-channel image using a trained neural network. In one embodiment, the trained neural network is a Mask region convolutional neural network (Mask R-CNN). Specifically, Mask R-CNN is a two-stage framework: the first stage scans the image and generates proposals (i.e., regions that may contain an object), and the second stage classifies the proposals and generates bounding boxes and masks. Mask R-CNN extends Faster R-CNN, a popular object detection framework, into an instance segmentation framework. An illustrative inference sketch for a four-channel input is given below.
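Continuing the illustrative torchvision sketch introduced for training, inference on a four-channel image might look as follows. The model construction is repeated here so the snippet is self-contained; trained weights would normally be loaded, and the confidence and mask thresholds are assumed values:

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

# Stand-in for the trained four-channel Mask R-CNN (trained weights would
# normally be loaded here, e.g. via model.load_state_dict(...)).
model = maskrcnn_resnet50_fpn(weights=None, num_classes=2)
model.backbone.body.conv1 = torch.nn.Conv2d(4, 64, kernel_size=7, stride=2,
                                            padding=3, bias=False)
model.transform.image_mean = [0.5] * 4
model.transform.image_std = [0.25] * 4
model.eval()

four_channel = torch.rand(4, 480, 640)  # 3 grayscale channels + 1 depth channel
with torch.no_grad():
    pred = model([four_channel])[0]

keep = pred["scores"] > 0.5              # assumed confidence threshold
masks = pred["masks"][keep, 0] > 0.5     # one boolean H x W mask per instance
print(masks.shape)
```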
In addition, the neural network may be trained as described above with reference to fig. 1, and that description is not repeated here.
Turning to fig. 4, an apparatus 4000 for performing instance segmentation on an image is shown in accordance with one embodiment of the present invention. The apparatus 4000 comprises a first acquisition device 410, combining means 420 and segmentation means 430. The first acquisition device 410 is configured to acquire multiple grayscale frames of the same scene based on different settings of the camera; the combining means 420 is configured to combine the multiple grayscale frames into a multi-channel image; and the segmentation means 430 is configured to perform instance segmentation on the multi-channel image using a trained neural network.
In one embodiment, although not shown in fig. 4, the apparatus 4000 may further include: a second acquisition device for acquiring depth information of the scene using the camera. Based on this depth information, the trained neural network can be better utilized for instance segmentation. In one or more embodiments, the camera is a ToF (time-of-flight) camera.
In one embodiment, the combining means 420 is configured to combine the multiple grayscale frames and the depth information into a multi-channel image. In one embodiment, the multi-channel image is a four-channel image composed of three grayscale frames, corresponding to three different settings of the camera, and the depth information. That is, the three grayscale frames occupy three channels and the depth information occupies the fourth channel. The combining means 420 may be configured to assemble and store the gray values of the grayscale frames and the depth information in a four-channel image format for subsequent deep learning (instance segmentation).
Of course, those skilled in the art will appreciate that the multi-channel image is not limited to a four-channel image. In one embodiment, the multi-channel image may be a three-channel image (e.g., two grayscale frames plus the depth information).
The segmentation means 430 is used to perform instance segmentation on the multi-channel image using a trained neural network. In one embodiment, the segmentation means 430 may be configured to perform instance segmentation using a trained Mask region convolutional neural network (Mask R-CNN); that is, in this embodiment, the trained neural network is a Mask R-CNN. Mask R-CNN is a two-stage framework: the first stage scans the image and generates proposals (i.e., regions that may contain an object), and the second stage classifies the proposals and generates bounding boxes and masks. Mask R-CNN extends Faster R-CNN, a popular object detection framework, into an instance segmentation framework.
Referring to fig. 5, a flow diagram of a robot grasping method 5000 according to one embodiment of the invention is shown. As shown in fig. 5, the robot grasping method 5000 includes the following steps:
in step S510, instance segmentation is performed on a scene image containing an object to be grasped, captured by a three-dimensional camera mounted on the robot, so as to obtain a point cloud segmentation result of the object to be grasped; and
in step S520, a grasping task is performed based on the point cloud segmentation result.
In one or more embodiments, the instance segmentation of the scene image in step S510 may employ the method 3000 described above in connection with fig. 3, which is not repeated here. In one embodiment, a segmentation mask of the object to be grasped may be obtained by instance segmentation of the scene image captured by the three-dimensional camera. Then, by mapping between the image and the depth map, a point cloud segmentation result of the object can be obtained, as sketched in the example below.
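As a sketch of how a segmentation mask and a depth map could be mapped to an object point cloud, the following assumes a simple pinhole camera model with known intrinsics; the intrinsic parameters and synthetic data are placeholders for illustration:

```python
import numpy as np

def mask_to_point_cloud(mask, depth, fx, fy, cx, cy):
    """Back-project the depth pixels selected by a boolean instance mask
    into 3D camera coordinates using a pinhole model."""
    v, u = np.nonzero(mask & (depth > 0))  # pixel rows/columns inside the mask
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.column_stack((x, y, z))      # N x 3 points in meters

# Placeholder intrinsics and synthetic data for illustration only.
depth = np.full((480, 640), 0.8, dtype=np.float32)
mask = np.zeros((480, 640), dtype=bool)
mask[200:260, 300:360] = True
points = mask_to_point_cloud(mask, depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
print(points.shape)  # (3600, 3)
```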
The term "point cloud" refers to the set of points obtained by sampling the spatial coordinates of points on the object surface. In one embodiment, the "point cloud data" may include two-dimensional coordinates (XY) or three-dimensional coordinates (XYZ), laser reflection intensity, color information (RGB), and the like.
In step S520, a grasping task is performed based on the point cloud segmentation result. For example, a control scheme for the robot's mechanical arm is obtained by further processing the point cloud segmentation result, and the arm is then controlled to perform the grasping task based on that scheme; one possible way to derive a grasp pose from the segmented point cloud is sketched below.
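The patent does not specify how the control scheme is derived; as one illustrative possibility, a simple grasp pose could be computed from the segmented point cloud by taking its centroid and principal axes, as in the sketch below (the approach itself and all values are assumptions for illustration):

```python
import numpy as np

def simple_grasp_pose(points):
    """Illustrative only: grasp at the centroid of the object point cloud, with
    the approach direction along the axis of least extent and the gripper
    closing direction along the intermediate principal axis."""
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid, full_matrices=False)
    approach_axis = vt[2]  # least-variance direction (assumed approach)
    closing_axis = vt[1]   # intermediate direction (assumed closing axis)
    return centroid, approach_axis, closing_axis

# Synthetic elongated point cloud standing in for a segmented object.
rng = np.random.default_rng(1)
cloud = rng.normal(size=(500, 3)) * np.array([0.05, 0.02, 0.01]) + [0.4, 0.0, 0.6]
center, approach, closing = simple_grasp_pose(cloud)
print(center, approach, closing)
```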
Fig. 6 provides a robot grasping system 6000. As shown in fig. 6, the robot grasping system 6000 includes: an instance segmentation device 610 and a grasping device 620, wherein the instance segmentation device 610 is configured to perform instance segmentation on a scene image containing an object to be grasped, captured by a three-dimensional camera mounted on the robot, so as to obtain a point cloud segmentation result of the object to be grasped, and the grasping device 620 is configured to perform a grasping task based on the point cloud segmentation result.
In one embodiment, the instance segmentation device 610 may obtain a segmentation mask of the object to be grasped by performing instance segmentation on the scene image captured by the three-dimensional camera. The instance segmentation device 610 may then obtain a point cloud segmentation result of the object by mapping between the image and the depth map. In the context of the present invention, the term "point cloud" refers to the set of points obtained by sampling the spatial coordinates of points on the object surface. In one embodiment, the "point cloud data" may include two-dimensional coordinates (XY) or three-dimensional coordinates (XYZ), laser reflection intensity, color information (RGB), and the like.
The grasping device 620 is configured to perform a grasping task based on the point cloud segmentation result. In one embodiment, the grasping device 620 is configured to obtain a control scheme for the robot's mechanical arm by further processing the point cloud segmentation result, and then to control the arm to perform the grasping task based on that scheme.
In addition, those skilled in the art will readily appreciate that the method 1000 of training a neural network to perform instance segmentation on a picture, the method 3000 of performing instance segmentation on a picture using a trained neural network, and the robot grasping method 5000 provided in one or more embodiments of the present invention may be implemented by a computer program. For example, the computer program may be embodied in a computer program product which, when executed by a processor, implements the method 1000, the method 3000, or the robot grasping method 5000 of one or more embodiments of the invention. As another example, when a computer storage medium (e.g., a USB disk) storing the computer program is connected to a computer, the computer program can be executed to perform the method 1000, the method 3000, or the robot grasping method 5000 of one or more embodiments of the present invention.
In summary, the instance segmentation method and apparatus according to embodiments of the present invention acquire multiple grayscale frames of the same scene under different camera settings (e.g., exposure time and amplitude gain) and combine them into a multi-channel image on which instance segmentation is performed, thereby enlarging the dynamic range of the captured data and greatly improving the accuracy of instance segmentation. This in turn helps ensure a high success rate when a robot grasps objects using a grasping solution based on this instance segmentation method or apparatus.
For example, acquiring multiple frames (e.g., 3 frames) of the same scene with different settings (combinations of exposure time and amplitude gain) helps improve dynamic range compared with a single acquired frame: when imaging dark objects of low reflectivity, a longer exposure time and a higher gain should be used, whereas a shorter exposure time and normal gain are suitable for highly reflective objects.
While the above description describes only some of the embodiments of the present invention, those of ordinary skill in the art will appreciate that the present invention can be embodied in many other forms without departing from the spirit or scope thereof. Accordingly, the present examples and embodiments are to be considered as illustrative and not restrictive, and the invention is intended to cover various modifications and substitutions without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (24)

1. A method of training a neural network to perform instance segmentation on a picture, the method comprising:
acquiring multiple grayscale frames of the same scene based on different settings of a camera;
combining the multiple grayscale frames into a multi-channel image; and
training the neural network for instance segmentation using the multi-channel image.
2. The method of claim 1, further comprising:
acquiring depth information of the scene using the camera.
3. The method of claim 2, wherein combining the multiple grayscale frames into a multi-channel image comprises:
combining the multiple grayscale frames and the depth information into a multi-channel image.
4. The method of claim 3, wherein the neural network is a Mask region convolutional neural network (Mask R-CNN), and the multi-channel image is a four-channel image composed of three grayscale frames, corresponding to three different settings of the camera, and the depth information.
5. The method of claim 1, wherein the different settings of the camera comprise different exposure times and amplitude gains, wherein when imaging a dark object of reflectivity f1, an exposure time t1 and an amplitude gain g1 are used, and when imaging an object of reflectivity f2, an exposure time t2 and an amplitude gain g2 are used, wherein f1 < f2, t1 > t2, and g1 > g2.
6. An apparatus for training a neural network to perform instance segmentation on a picture, the apparatus comprising:
a first acquisition device for acquiring multiple grayscale frames of the same scene based on different settings of a camera;
combining means for combining the multiple grayscale frames into a multi-channel image; and
training means for training the neural network for instance segmentation using the multi-channel image.
7. The apparatus of claim 6, further comprising:
a second acquisition device for acquiring depth information of the scene using the camera.
8. The apparatus of claim 7, wherein the combining means is configured to combine the multiple grayscale frames and the depth information into a multi-channel image.
9. The apparatus of claim 8, wherein the neural network is a Mask region convolutional neural network (Mask R-CNN), and the multi-channel image is a four-channel image composed of three grayscale frames, corresponding to three different settings of the camera, and the depth information.
10. The apparatus of claim 6, wherein the different settings of the camera comprise different exposure times and amplitude gains, wherein when imaging a dark object of reflectivity f1, an exposure time t1 and an amplitude gain g1 are used, and when imaging an object of reflectivity f2, an exposure time t2 and an amplitude gain g2 are used, wherein f1 < f2, t1 > t2, and g1 > g2.
11. A method of performing instance segmentation on an image, the method comprising:
acquiring multiple grayscale frames of the same scene based on different settings of a camera;
combining the multiple grayscale frames into a multi-channel image; and
performing instance segmentation on the multi-channel image using a trained neural network.
12. The method of claim 11, further comprising:
acquiring depth information of the scene using the camera.
13. The method of claim 12, wherein combining the multiple grayscale frames into a multi-channel image comprises:
combining the multiple grayscale frames and the depth information into a multi-channel image.
14. The method of claim 13, wherein the trained neural network is a Mask region convolutional neural network (Mask R-CNN), and the multi-channel image is a four-channel image composed of three grayscale frames, corresponding to three different settings of the camera, and the depth information.
15. The method of claim 11, wherein the different settings of the camera comprise different exposure times and amplitude gains, wherein when imaging a dark object of reflectivity f1, an exposure time t1 and an amplitude gain g1 are used, and when imaging an object of reflectivity f2, an exposure time t2 and an amplitude gain g2 are used, wherein f1 < f2, t1 > t2, and g1 > g2.
16. A robot grasping method, the method comprising:
performing the method of any one of claims 11 to 15 on a scene image containing an object to be grasped, captured by a three-dimensional camera mounted on the robot, so as to obtain a point cloud segmentation result of the object to be grasped; and
performing a grasping task based on the point cloud segmentation result.
17. An apparatus for performing instance segmentation on an image, the apparatus comprising:
a first acquisition device for acquiring multiple grayscale frames of the same scene based on different settings of a camera;
combining means for combining the multiple grayscale frames into a multi-channel image; and
segmentation means for performing instance segmentation on the multi-channel image using a trained neural network.
18. The apparatus of claim 17, further comprising:
a second acquisition device for acquiring depth information of the scene using the camera.
19. The apparatus of claim 18, wherein the combining means is configured to combine the multiple grayscale frames from the first acquisition device and the depth information from the second acquisition device into a multi-channel image.
20. The apparatus of claim 19, wherein the segmentation means performs instance segmentation using a Mask region convolutional neural network (Mask R-CNN), and the multi-channel image is a four-channel image composed of three grayscale frames, corresponding to three different settings of the camera, and the depth information.
21. The apparatus of claim 17, wherein the different settings of the camera comprise different exposure times and amplitude gains, wherein when imaging a dark object of reflectivity f1, an exposure time t1 and an amplitude gain g1 are used, and when imaging an object of reflectivity f2, an exposure time t2 and an amplitude gain g2 are used, wherein f1 < f2, t1 > t2, and g1 > g2.
22. A robot grasping system, the system comprising:
the apparatus of any one of claims 17 to 21, configured to perform instance segmentation on a scene image containing an object to be grasped, using a three-dimensional camera mounted on the robot, so as to obtain a point cloud segmentation result of the object to be grasped; and
a grasping device for performing a grasping task based on the point cloud segmentation result.
23. A computer storage medium comprising instructions which, when executed, perform the method of any one of claims 1 to 5, 11 to 16.
24. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1 to 5, 11 to 16.
CN202111305773.7A 2021-11-05 2021-11-05 Method and equipment for carrying out instance segmentation on picture Pending CN116091571A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111305773.7A CN116091571A (en) 2021-11-05 2021-11-05 Method and equipment for carrying out instance segmentation on picture
PCT/EP2022/079193 WO2023078686A1 (en) 2021-11-05 2022-10-20 Method and device for performing instance segmentation of picture

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111305773.7A CN116091571A (en) 2021-11-05 2021-11-05 Method and equipment for carrying out instance segmentation on picture

Publications (1)

Publication Number Publication Date
CN116091571A true CN116091571A (en) 2023-05-09

Family

ID=84358280

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111305773.7A Pending CN116091571A (en) 2021-11-05 2021-11-05 Method and equipment for carrying out instance segmentation on picture

Country Status (2)

Country Link
CN (1) CN116091571A (en)
WO (1) WO2023078686A1 (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113506305B (en) * 2021-06-09 2023-10-24 西交利物浦大学 Image enhancement method, semantic segmentation method and device for three-dimensional point cloud data

Also Published As

Publication number Publication date
WO2023078686A1 (en) 2023-05-11

Similar Documents

Publication Publication Date Title
CN108370438B (en) Range gated depth camera assembly
US20200334843A1 (en) Information processing apparatus, control method for same, non-transitory computer-readable storage medium, and vehicle driving support system
JP6858650B2 (en) Image registration method and system
JP7151805B2 (en) LEARNING DATA GENERATION DEVICE, LEARNING DATA GENERATION METHOD, AND PROGRAM
US9772405B2 (en) Backfilling clouds of 3D coordinates
EP4145338A1 (en) Target detection method and apparatus
US20180089501A1 (en) Computer implemented method of detecting the distance of an object from an image sensor
EP3709266A1 (en) Human-tracking methods, apparatuses, systems, and storage media
US10936900B2 (en) Color identification using infrared imaging
US10679369B2 (en) System and method for object recognition using depth mapping
US11143879B2 (en) Semi-dense depth estimation from a dynamic vision sensor (DVS) stereo pair and a pulsed speckle pattern projector
CN112189147A (en) Reduced power operation of time-of-flight cameras
CN108475429B (en) System and method for segmentation of three-dimensional microscope images
GB2562037A (en) Three-dimensional scene reconstruction
WO2021114776A1 (en) Object detection method, object detection device, terminal device, and medium
KR101696086B1 (en) Method and apparatus for extracting object region from sonar image
CN116091571A (en) Method and equipment for carrying out instance segmentation on picture
JP2010237976A (en) Light source information obtaining device, shading detection device, shading removal device, and those methods and programs
Pang et al. Generation of high speed CMOS multiplier-accumulators
US20230141945A1 (en) Quantifying biotic damage on plants, by separating plant-images and subsequently operating a convolutional neural network
CN108527366B (en) Robot following method and device based on depth of field distance
JP2021149691A (en) Image processing system and control program
CA3148404A1 (en) Information processing device, data generation method, and non-transitory computer-readable medium storing program
WO2022213288A1 (en) Depth image processing method and apparatus, and storage medium
US20230237735A1 (en) Method For Generating Point Cloud Data And Data Generating Apparatus

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication