CN116091571A - Method and equipment for carrying out instance segmentation on picture - Google Patents

Method and equipment for carrying out instance segmentation on picture

Info

Publication number
CN116091571A
Authority
CN
China
Prior art keywords
channel image
neural network
camera
frame gray
depth information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111305773.7A
Other languages
Chinese (zh)
Inventor
丁凯 (Ding Kai)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Robert Bosch GmbH
Original Assignee
Robert Bosch GmbH
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Robert Bosch GmbH filed Critical Robert Bosch GmbH
Priority to CN202111305773.7A priority Critical patent/CN116091571A/en
Priority to PCT/EP2022/079193 priority patent/WO2023078686A1/en
Publication of CN116091571A publication Critical patent/CN116091571A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/90Determination of colour characteristics
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/10Image acquisition
    • G06V10/12Details of acquisition arrangements; Constructional details thereof
    • G06V10/14Optical characteristics of the device performing the acquisition or on the illumination arrangements
    • G06V10/141Control of illumination
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/10Image acquisition
    • G06V10/16Image acquisition using multiple overlapping images; Image stitching
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30244Camera pose

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a method for training a neural network to perform instance segmentation on pictures, the method comprising: acquiring multiple grayscale frames of the same scene based on different settings of a camera; combining the multiple grayscale frames into a multi-channel image; and training the neural network for instance segmentation using the multi-channel image. The invention also relates to an apparatus for training a neural network to perform instance segmentation on a picture, a method and an apparatus for performing instance segmentation on a picture using a trained neural network, a robot grasping method and system, a computer storage medium, and a computer program product.

Description

Method and equipment for carrying out instance segmentation on picture
Technical Field
The present invention relates to the field of image segmentation, and more particularly to a method and apparatus for training a neural network to perform instance segmentation on a picture, a method and apparatus for performing instance segmentation on a picture using a trained neural network, a robot grasping method and system, a computer storage medium, and a computer program product.
Background
In robot grasping applications, the objects to be grasped are randomly placed or stacked on a tray. For a robot to grasp an object, the pose of the object must be detected, for example with a three-dimensional camera. There are many kinds of three-dimensional cameras: time-of-flight (ToF) cameras, structured-light cameras, stereo cameras, and so on. To determine the pose (e.g., the six-dimensional pose) of an object, instance segmentation of a two-dimensional image (e.g., an RGB image) of the scene containing the object is required.
However, some three-dimensional cameras, such as ToF cameras, provide only grayscale images. It will be appreciated that an RGB image has three channels (red, green and blue), whereas a grayscale image has only one channel. Deep-learning instance segmentation performs poorly on such single-channel grayscale images, which in turn adversely affects the subsequent determination of the object's pose, robotic grasping, and so on.
Disclosure of Invention
According to an aspect of the present invention, there is provided a method of training a neural network to perform instance segmentation on a picture, the method comprising: acquiring multiple grayscale frames of the same scene based on different settings of a camera; combining the multiple grayscale frames into a multi-channel image; and training the neural network for instance segmentation using the multi-channel image.
Additionally or alternatively to the above, the method may further comprise: acquiring depth information of the scene using the camera.
Additionally or alternatively to the above, in the above method, combining the multiple grayscale frames into a multi-channel image comprises: combining the multiple grayscale frames and the depth information into a multi-channel image.
Additionally or alternatively to the above, in the above method, the neural network is a Mask region convolutional neural network (Mask R-CNN), and the multi-channel image is a four-channel image composed of three grayscale frames, corresponding to three different settings of the camera, and the depth information.
Additionally or alternatively to the above, in the above method, the different settings of the camera comprise different exposure times and amplitude gains, wherein when imaging a dark object of reflectivity f1, an exposure time t1 and an amplitude gain g1 are used, and when imaging an object of reflectivity f2, an exposure time t2 and an amplitude gain g2 are used, wherein f1 < f2, t1 > t2, and g1 > g2.
According to another aspect of the present invention, there is provided an apparatus for training a neural network to perform instance segmentation on a picture, the apparatus comprising: a first acquisition device for acquiring multiple grayscale frames of the same scene based on different settings of a camera; combining means for combining the multiple grayscale frames into a multi-channel image; and training means for training the neural network for instance segmentation using the multi-channel image.
Additionally or alternatively to the above, the apparatus further comprises: a second acquisition device for acquiring depth information of the scene using the camera.
Additionally or alternatively to the above, in the above apparatus, the combining means is configured to combine the multiple grayscale frames and the depth information into a multi-channel image.
Additionally or alternatively to the above, in the above apparatus, the neural network is a Mask region convolutional neural network (Mask R-CNN), and the multi-channel image is a four-channel image composed of three grayscale frames, corresponding to three different settings of the camera, and the depth information.
Additionally or alternatively to the above, in the above apparatus, the different settings of the camera comprise different exposure times and amplitude gains, wherein when imaging a dark object of reflectivity f1, an exposure time t1 and an amplitude gain g1 are used, and when imaging an object of reflectivity f2, an exposure time t2 and an amplitude gain g2 are used, wherein f1 < f2, t1 > t2, and g1 > g2.
According to yet another aspect of the present invention, there is provided a method of performing instance segmentation on an image, the method comprising: acquiring multiple grayscale frames of the same scene based on different settings of a camera; combining the multiple grayscale frames into a multi-channel image; and performing instance segmentation on the multi-channel image using a trained neural network.
Additionally or alternatively to the above, the method may further comprise: acquiring depth information of the scene using the camera, for example a ToF camera.
Additionally or alternatively to the above, in the above method, combining the multiple grayscale frames into a multi-channel image comprises: combining the multiple grayscale frames and the depth information into a multi-channel image.
Additionally or alternatively to the above, in the above method, the trained neural network is a Mask region convolutional neural network (Mask R-CNN), and the multi-channel image is a four-channel image composed of three grayscale frames, corresponding to three different settings of the camera, and the depth information.
Additionally or alternatively to the above, in the above method, the different settings of the camera comprise different exposure times and amplitude gains, wherein when imaging a dark object of reflectivity f1, an exposure time t1 and an amplitude gain g1 are used, and when imaging an object of reflectivity f2, an exposure time t2 and an amplitude gain g2 are used, wherein f1 < f2, t1 > t2, and g1 > g2. That is, when imaging a dark object of low reflectivity, a longer exposure time and a higher gain should be used, whereas a shorter exposure time and normal gain are suitable for highly reflective objects.
According to still another aspect of the present invention, there is provided a robot grasping method, the method comprising: performing the above method of instance segmentation on an image of a scene containing an object to be grasped, captured by a three-dimensional camera mounted on the robot, so as to obtain a point cloud segmentation result of the object to be grasped; and performing a grasping task based on the point cloud segmentation result.
According to still another aspect of the present invention, there is provided an apparatus for performing instance segmentation on an image, the apparatus comprising: a first acquisition device for acquiring multiple grayscale frames of the same scene based on different settings of a camera; combining means for combining the multiple grayscale frames into a multi-channel image; and segmentation means for performing instance segmentation on the multi-channel image using a trained neural network.
Additionally or alternatively to the above, the apparatus further comprises: a second acquisition device for acquiring depth information of the scene using the camera.
Additionally or alternatively to the above, in the above apparatus, the combining means is configured to combine the multiple grayscale frames from the first acquisition device and the depth information from the second acquisition device into a multi-channel image.
Additionally or alternatively to the above, in the above apparatus, the segmentation means may perform instance segmentation using a Mask region convolutional neural network (Mask R-CNN), and the multi-channel image may be a four-channel image composed of three grayscale frames, corresponding to three different settings of the camera, and the depth information.
Additionally or alternatively to the above, in the above apparatus, the different settings of the camera comprise different exposure times and amplitude gains, wherein when imaging a dark object of reflectivity f1, an exposure time t1 and an amplitude gain g1 are used, and when imaging an object of reflectivity f2, an exposure time t2 and an amplitude gain g2 are used, wherein f1 < f2, t1 > t2, and g1 > g2. That is, when imaging a dark object of low reflectivity, a longer exposure time and a higher gain should be used, whereas a shorter exposure time and normal gain are suitable for highly reflective objects.
According to yet another aspect of the present invention, there is provided a robot grasping system comprising: an apparatus for performing instance segmentation on an image as described above, the apparatus being configured to perform instance segmentation on a scene image containing an object to be grasped, captured by a three-dimensional camera mounted on the robot, so as to obtain a point cloud segmentation result of the object to be grasped; and a grasping device for performing a grasping task based on the point cloud segmentation result.
According to yet another aspect of the present invention there is provided a computer storage medium comprising instructions which, when executed, perform a method as previously described.
According to a further aspect of the invention there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method as described above.
According to embodiments of the invention, multiple grayscale frames of the same scene are acquired under different camera settings and combined into a multi-channel image on which instance segmentation is performed. This enlarges the dynamic range of the captured data and greatly improves the accuracy of instance segmentation, which in turn helps ensure a high success rate when a robot grasps objects based on this instance segmentation scheme.
Drawings
The above and other objects and advantages of the present invention will become more fully apparent from the following detailed description taken in conjunction with the accompanying drawings, in which identical or similar elements are designated by the same reference numerals.
FIG. 1 shows a flow diagram of a method of training a neural network to perform instance segmentation on a picture, according to one embodiment of the invention;
FIG. 2 shows a schematic structural diagram of an apparatus for training a neural network to perform instance segmentation on a picture, according to one embodiment of the invention;
FIG. 3 shows a flow diagram of a method for performing instance segmentation on a picture using a trained neural network, according to one embodiment of the invention;
FIG. 4 shows a schematic structural diagram of an apparatus for performing instance segmentation on a picture using a trained neural network, according to one embodiment of the invention;
FIG. 5 shows a flow diagram of a robot grasping method according to one embodiment of the invention; and
FIG. 6 shows a schematic structural diagram of a robot grasping system according to an embodiment of the invention.
Detailed Description
Hereinafter, instance segmentation schemes and robot grasping schemes according to various exemplary embodiments of the present invention will be described in detail with reference to the accompanying drawings.
Fig. 1 shows a flow diagram of a method 1000 of training a neural network to perform instance segmentation on a picture, according to one embodiment of the invention. As shown in fig. 1, the method 1000 includes the following steps:
in step S110, multiple grayscale frames of the same scene are acquired based on different settings of the camera;
in step S120, the multiple grayscale frames are combined into a multi-channel image; and
in step S130, the neural network is trained for instance segmentation using the multi-channel image.
The term "instance segmentation," also called image instance segmentation, builds on object detection by further separating the foreground of each detected object from the background, achieving object separation at the pixel level. Instance segmentation is used in scenarios such as object detection, face detection, expression recognition, medical image processing and computer-aided diagnosis, video surveillance and object tracking, and shelf-vacancy recognition in retail settings. In one embodiment, "instance segmentation" first frames the different instances in an image automatically with an object detection method and then labels the pixels within each instance region with a semantic segmentation method. Unlike instance segmentation, semantic segmentation does not distinguish between different instances of the same class. For example, when there are two cats in an image, semantic segmentation predicts all pixels of both cats as a single "cat" class, whereas instance segmentation must distinguish which pixels belong to the first cat and which belong to the second.
In one embodiment, although not shown in FIG. 1, the method 1000 described above may further include: acquiring depth information of the scene using the camera. In one or more embodiments, the camera is a ToF (time-of-flight) camera.
ToF stands for Time of Flight. Time-of-flight 3D imaging continuously emits light pulses toward the target, receives the light returned from the object with a sensor, and obtains the distance to the target by measuring the round-trip time of the light pulses. The principle is broadly similar to that of a 3D laser sensor, except that a 3D laser sensor scans point by point, whereas a ToF camera obtains depth information for the entire image at once. A ToF camera resembles an ordinary machine-vision imaging setup and consists of a light source, optics, a sensor, control circuitry, processing circuitry, and other units. Compared with the superficially similar binocular stereo system, which also performs non-contact three-dimensional measurement, the ToF camera has a fundamentally different 3D imaging mechanism: binocular stereo obtains depth by matching left and right image pairs and then triangulating, whereas a ToF camera obtains the target distance directly from the emitted and reflected light.
ToF technology uses active light detection. Unlike ordinary illumination, the purpose of the ToF illumination unit is not to light the scene but to measure distance from the change between the emitted and the reflected light signals. The illumination unit therefore modulates the light at high frequency before emitting it, for example pulsed light from an LED or a laser diode, with modulation frequencies of up to 100 MHz; a depth value can then be recovered per pixel, as sketched in the example below.
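For a continuous-wave ToF sensor of this kind, the depth of each pixel is typically recovered from the phase shift between the emitted and reflected signal. The following is a minimal illustrative sketch in Python; the modulation frequency and phase value used here are assumptions for illustration and are not taken from this disclosure:

```python
import numpy as np

C = 299_792_458.0  # speed of light in m/s

def depth_from_phase(phase_shift_rad, mod_freq_hz):
    """Per-pixel depth of a continuous-wave ToF sensor: d = c * phi / (4 * pi * f)."""
    return C * np.asarray(phase_shift_rad) / (4.0 * np.pi * mod_freq_hz)

# Example: at a 100 MHz modulation frequency, a measured phase shift of pi/2
# corresponds to a depth of about 0.37 m (unambiguous range c / (2 f) = 1.5 m).
print(depth_from_phase(np.pi / 2, 100e6))
```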
Compared with a stereo camera or a triangulation system, a ToF camera is compact and well suited to applications that require a portable, small camera. A ToF camera can also compute depth information quickly in real time, at frame rates from tens of fps up to about 100 fps. In addition, because ToF depth calculation is not affected by the surface shading and texture of the object, three-dimensional measurement can be performed very accurately. Moreover, the depth accuracy of ToF does not degrade with distance and remains roughly at the centimeter level, which matters for applications with a large range of motion.
Despite the advantages described above, ToF cameras provide only single-channel grayscale images, which makes them a poor training dataset for a neural network performing instance segmentation. Therefore, in step S110, multiple grayscale frames of the same scene are acquired based on different settings of the camera. In one or more embodiments, the different settings of the camera include different exposure times and amplitude gains. For example, when imaging dark objects of low reflectivity, a longer exposure time and a higher gain should be used, whereas a shorter exposure time and normal gain are suitable for highly reflective objects. In robot grasping applications there may be many different objects to grasp, including dark objects and objects of both high and low reflectivity. Acquiring multiple frames (e.g., 3 frames) of the same scene with different combinations of exposure time and amplitude gain therefore helps improve dynamic range compared with a single acquired frame, as sketched in the capture example below.
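As an illustration of how such frames might be acquired, the sketch below loops over three (exposure time, amplitude gain) settings. The `ToFCamera`-style interface, its method names, and the concrete exposure and gain values are assumptions made for illustration only and do not correspond to any particular camera SDK:

```python
import numpy as np

# Assumed (exposure time in microseconds, amplitude gain) pairs: a long exposure
# with high gain for dark / low-reflectivity objects, a medium setting, and a
# short exposure with normal gain for highly reflective objects.
CAPTURE_SETTINGS = [(4000, 8.0), (1500, 3.0), (500, 1.0)]

def capture_gray_frames(camera, settings=CAPTURE_SETTINGS):
    """Acquire one grayscale frame of the same static scene per camera setting."""
    frames = []
    for exposure_us, gain in settings:
        camera.set_exposure_us(exposure_us)   # hypothetical driver call
        camera.set_amplitude_gain(gain)       # hypothetical driver call
        frames.append(np.asarray(camera.grab_gray(), dtype=np.float32))
    return frames  # list of H x W grayscale arrays
```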
In one embodiment, step S120 includes combining the multiple grayscale frames and the depth information into a multi-channel image. For example, when the neural network to be trained is a Mask region convolutional neural network (Mask R-CNN), the multi-channel image may be a four-channel image composed of three grayscale frames, corresponding to three different settings of the camera, and the depth information. That is, the three grayscale frames occupy three channels and the depth information occupies the fourth channel. In one embodiment, the gray values of the multiple grayscale frames and the depth information may be assembled and stored in a four-channel image format to build the training dataset for subsequently training the neural network, for example as in the sketch following this paragraph.
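A minimal sketch of assembling such a four-channel sample with NumPy, assuming three grayscale frames and a depth map of the same resolution (the normalization constants are illustrative assumptions):

```python
import numpy as np

def build_four_channel_image(gray_frames, depth_map, max_depth_m=2.0):
    """Stack three grayscale frames and one depth map into an H x W x 4 array."""
    assert len(gray_frames) == 3, "expects one frame per camera setting"
    grays = [np.clip(g.astype(np.float32) / 255.0, 0.0, 1.0) for g in gray_frames]
    depth = np.clip(depth_map.astype(np.float32) / max_depth_m, 0.0, 1.0)
    return np.dstack(grays + [depth])  # channels: gray_1, gray_2, gray_3, depth

# Usage with random data standing in for real captures:
rng = np.random.default_rng(0)
sample = build_four_channel_image(
    [rng.integers(0, 256, (480, 640)) for _ in range(3)],
    rng.uniform(0.3, 1.8, (480, 640)),
)
print(sample.shape)  # (480, 640, 4)
```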
Of course, those skilled in the art will appreciate that the multi-channel image is not limited to a four-channel image. In one embodiment, the multi-channel image may be a three-channel image (e.g., two grayscale frames plus the depth information).
In step S130, the neural network is trained for instance segmentation using the multi-channel image. In one embodiment, the neural network to be trained is a Mask region convolutional neural network (Mask R-CNN).
Specifically, Mask R-CNN is a two-stage framework: the first stage scans the image and generates proposals (i.e., regions that may contain an object), and the second stage classifies the proposals and generates bounding boxes and masks. Mask R-CNN extends Faster R-CNN, a popular object detection framework, into an instance segmentation framework. One possible way to adapt such a network to the four-channel input described above is sketched below.
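The disclosure does not prescribe a particular implementation. As one possible sketch, a torchvision Mask R-CNN could be adapted to a four-channel (three grayscale channels plus depth) input by widening the first backbone convolution and the input normalization; the layer path, normalization statistics, and toy training example below are assumptions tied to torchvision's ResNet-50 FPN backbone, not part of this patent:

```python
import torch
import torch.nn as nn
from torchvision.models.detection import maskrcnn_resnet50_fpn

def four_channel_mask_rcnn(num_classes):
    """Mask R-CNN adapted to a 4-channel (3 x gray + depth) input. Sketch only."""
    model = maskrcnn_resnet50_fpn(weights=None, num_classes=num_classes)
    # Replace the stem convolution so it accepts 4 input channels instead of 3.
    model.backbone.body.conv1 = nn.Conv2d(4, 64, kernel_size=7, stride=2,
                                          padding=3, bias=False)
    # The internal transform normalizes per channel; give it 4 values (assumed).
    model.transform.image_mean = [0.5, 0.5, 0.5, 0.5]
    model.transform.image_std = [0.25, 0.25, 0.25, 0.25]
    return model

# One toy training step with a single synthetic sample (background + 1 class).
model = four_channel_mask_rcnn(num_classes=2)
images = [torch.rand(4, 480, 640)]
targets = [{
    "boxes": torch.tensor([[100.0, 100.0, 200.0, 200.0]]),
    "labels": torch.tensor([1]),
    "masks": torch.zeros(1, 480, 640, dtype=torch.uint8),
}]
model.train()
losses = model(images, targets)  # dict of classification, box, and mask losses
```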
Turning to fig. 2, fig. 2 shows an apparatus 2000 for training a neural network to perform instance segmentation on a picture. The apparatus 2000 comprises a first acquisition device 210, combining means 220 and training means 230. The first acquisition device 210 is configured to acquire multiple grayscale frames of the same scene based on different settings of the camera; the combining means 220 is configured to combine the multiple grayscale frames into a multi-channel image; and the training means 230 is configured to train the neural network for instance segmentation using the multi-channel image.
The term "instance segmentation," also called image instance segmentation, builds on object detection by further separating the foreground of each detected object from the background, achieving object separation at the pixel level. Instance segmentation is used in scenarios such as object detection, face detection, expression recognition, medical image processing and computer-aided diagnosis, video surveillance and object tracking, and shelf-vacancy recognition in retail settings. In one embodiment, "instance segmentation" first frames the different instances in an image automatically with an object detection method and then labels the pixels within each instance region with a semantic segmentation method. Unlike instance segmentation, semantic segmentation does not distinguish between different instances of the same class. For example, when there are two cats in an image, semantic segmentation predicts all pixels of both cats as a single "cat" class, whereas instance segmentation must distinguish which pixels belong to the first cat and which belong to the second.
In one embodiment, although not shown in FIG. 2, the apparatus 2000 may further include: a second acquisition device for acquiring depth information of the scene using the camera. Based on this depth information, the training dataset can be better constructed for training the neural network.
In one or more embodiments, the camera is a ToF (time-of-flight) camera. ToF stands for Time of Flight. Time-of-flight 3D imaging continuously emits light pulses toward the target, receives the light returned from the object with a sensor, and obtains the distance to the target by measuring the round-trip time of the light pulses. The principle is broadly similar to that of a 3D laser sensor, except that a 3D laser sensor scans point by point, whereas a ToF camera obtains depth information for the entire image at once. A ToF camera resembles an ordinary machine-vision imaging setup and consists of a light source, optics, a sensor, control circuitry, processing circuitry, and other units. Compared with the superficially similar binocular stereo system, which also performs non-contact three-dimensional measurement, the ToF camera has a fundamentally different 3D imaging mechanism: binocular stereo obtains depth by matching left and right image pairs and then triangulating, whereas a ToF camera obtains the target distance directly from the emitted and reflected light.
ToF technology uses active light detection. Unlike ordinary illumination, the purpose of the ToF illumination unit is not to light the scene but to measure distance from the change between the emitted and the reflected light signals. The illumination unit therefore modulates the light at high frequency before emitting it, for example pulsed light from an LED or a laser diode, with modulation frequencies of up to 100 MHz.
Compared with a stereo camera or a triangulation system, a ToF camera is compact and well suited to applications that require a portable, small camera. A ToF camera can also compute depth information quickly in real time, at frame rates from tens of fps up to about 100 fps. In addition, because ToF depth calculation is not affected by the surface shading and texture of the object, three-dimensional measurement can be performed very accurately. Moreover, the depth accuracy of ToF does not degrade with distance and remains roughly at the centimeter level, which matters for applications with a large range of motion.
Despite the advantages described above, ToF cameras provide only single-channel grayscale images, which hampers neural network learning. Thus, the first acquisition device 210 acquires multiple grayscale frames of the same scene based on different settings of the camera. In one or more embodiments, the different settings of the camera include different exposure times and amplitude gains. For example, when imaging dark objects of low reflectivity, a longer exposure time and a higher gain should be used, whereas a shorter exposure time and normal gain are suitable for highly reflective objects. In robot grasping applications there may be many different objects to grasp, including dark objects and objects of both high and low reflectivity. Acquiring multiple frames (e.g., 3 frames) of the same scene with different combinations of exposure time and amplitude gain therefore helps improve the dynamic range compared with a single acquired frame.
In one embodiment, the combining means 220 is configured to combine the multiple grayscale frames and the depth information into a multi-channel image. In one embodiment, the multi-channel image is a four-channel image composed of three grayscale frames, corresponding to three different settings of the camera, and the depth information. That is, the three grayscale frames occupy three channels and the depth information occupies the fourth channel. The combining means 220 may be configured to assemble and store the gray values of the grayscale frames and the depth information in a four-channel image format for subsequent deep learning (instance segmentation).
Of course, those skilled in the art will appreciate that the multi-channel image is not limited to a four-channel image. In one embodiment, the multi-channel image may be a three-channel image (e.g., two grayscale frames plus the depth information).
The training means 230 is used to train the neural network for instance segmentation using the multi-channel image. In one embodiment, the training means 230 may be configured to train a Mask region convolutional neural network (Mask R-CNN) for instance segmentation using the multi-channel image. Specifically, Mask R-CNN is a two-stage framework: the first stage scans the image and generates proposals (i.e., regions that may contain an object), and the second stage classifies the proposals and generates bounding boxes and masks. Mask R-CNN extends Faster R-CNN, a popular object detection framework, into an instance segmentation framework.
Referring to fig. 3, a flow diagram of a method 3000 for performing instance segmentation on a picture using a trained neural network is shown, according to one embodiment of the invention. As shown in fig. 3, the method 3000 includes the following steps:
in step S310, multiple grayscale frames of the same scene are acquired based on different settings of the camera;
in step S320, the multiple grayscale frames are combined into a multi-channel image; and
in step S330, instance segmentation is performed on the multi-channel image using a trained neural network.
In one embodiment, although not shown in fig. 3, the method 3000 may further include: acquiring depth information of the scene using the camera. In one or more embodiments, the camera is a ToF (time-of-flight) camera. ToF stands for Time of Flight: time-of-flight 3D imaging continuously emits light pulses toward the target, receives the light returned from the object with a sensor, and obtains the distance to the target by measuring the round-trip time of the light pulses. Compared with a stereo camera or a triangulation system, a ToF camera is compact and well suited to applications that require a portable, small camera. A ToF camera can also compute depth information quickly in real time, at frame rates from tens of fps up to about 100 fps. In addition, because ToF depth calculation is not affected by the surface shading and texture of the object, three-dimensional measurement can be performed very accurately. Moreover, the depth accuracy of ToF does not degrade with distance and remains roughly at the centimeter level, which matters for applications with a large range of motion.
Despite the advantages described above, ToF cameras provide only single-channel grayscale images, which is detrimental to instance segmentation using neural networks. Therefore, in step S310, multiple grayscale frames of the same scene are acquired based on different settings of the camera. In one or more embodiments, the different settings of the camera include different exposure times and amplitude gains. For example, when imaging dark objects of low reflectivity, a longer exposure time and a higher gain should be used, whereas a shorter exposure time and normal gain are suitable for highly reflective objects. In robot grasping applications there may be many different objects to grasp, including dark objects and objects of both high and low reflectivity. Acquiring multiple frames (e.g., 3 frames) of the same scene with different combinations of exposure time and amplitude gain therefore helps improve dynamic range compared with a single acquired frame.
In one embodiment, step S320 includes combining the multiple grayscale frames and the depth information into a multi-channel image. For example, when the trained neural network is a Mask region convolutional neural network (Mask R-CNN), the multi-channel image may be a four-channel image composed of three grayscale frames, corresponding to three different settings of the camera, and the depth information. That is, the three grayscale frames occupy three channels and the depth information occupies the fourth channel. In one embodiment, the gray values of the multiple grayscale frames and the depth information may be assembled and stored in a four-channel image format so that instance segmentation can subsequently be performed on the multi-channel image with the trained neural network.
Of course, those skilled in the art will appreciate that the multi-channel image is not limited to a four-channel image. In one embodiment, the multi-channel image may be a three-channel image (e.g., two grayscale frames plus the depth information).
In step S330, instance segmentation is performed on the multi-channel image using a trained neural network. In one embodiment, the trained neural network is a Mask region convolutional neural network (Mask R-CNN). Specifically, Mask R-CNN is a two-stage framework: the first stage scans the image and generates proposals (i.e., regions that may contain an object), and the second stage classifies the proposals and generates bounding boxes and masks. Mask R-CNN extends Faster R-CNN, a popular object detection framework, into an instance segmentation framework. An illustrative inference sketch for a four-channel input is given below.
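Continuing the illustrative torchvision sketch introduced for training, inference on a four-channel image might look as follows. The model construction is repeated here so the snippet is self-contained; trained weights would normally be loaded, and the confidence and mask thresholds are assumed values:

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

# Stand-in for the trained four-channel Mask R-CNN (trained weights would
# normally be loaded here, e.g. via model.load_state_dict(...)).
model = maskrcnn_resnet50_fpn(weights=None, num_classes=2)
model.backbone.body.conv1 = torch.nn.Conv2d(4, 64, kernel_size=7, stride=2,
                                            padding=3, bias=False)
model.transform.image_mean = [0.5] * 4
model.transform.image_std = [0.25] * 4
model.eval()

four_channel = torch.rand(4, 480, 640)  # 3 grayscale channels + 1 depth channel
with torch.no_grad():
    pred = model([four_channel])[0]

keep = pred["scores"] > 0.5              # assumed confidence threshold
masks = pred["masks"][keep, 0] > 0.5     # one boolean H x W mask per instance
print(masks.shape)
```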
In addition, the neural network may be trained as described above with reference to fig. 1, and that description is not repeated here.
Turning to fig. 4, an apparatus 4000 for performing instance segmentation on an image is shown in accordance with one embodiment of the present invention. The apparatus 4000 comprises a first acquisition device 410, combining means 420 and segmentation means 430. The first acquisition device 410 is configured to acquire multiple grayscale frames of the same scene based on different settings of the camera; the combining means 420 is configured to combine the multiple grayscale frames into a multi-channel image; and the segmentation means 430 is configured to perform instance segmentation on the multi-channel image using a trained neural network.
In one embodiment, although not shown in fig. 4, the apparatus 4000 may further include: a second acquisition device for acquiring depth information of the scene using the camera. Based on this depth information, the trained neural network can be better utilized for instance segmentation. In one or more embodiments, the camera is a ToF (time-of-flight) camera.
In one embodiment, the combining means 420 is configured to combine the multiple grayscale frames and the depth information into a multi-channel image. In one embodiment, the multi-channel image is a four-channel image composed of three grayscale frames, corresponding to three different settings of the camera, and the depth information. That is, the three grayscale frames occupy three channels and the depth information occupies the fourth channel. The combining means 420 may be configured to assemble and store the gray values of the grayscale frames and the depth information in a four-channel image format for subsequent deep learning (instance segmentation).
Of course, those skilled in the art will appreciate that the multi-channel image is not limited to a four-channel image. In one embodiment, the multi-channel image may be a three-channel image (e.g., two grayscale frames plus the depth information).
The segmentation means 430 is used to perform instance segmentation on the multi-channel image using a trained neural network. In one embodiment, the segmentation means 430 may be configured to perform instance segmentation using a trained Mask region convolutional neural network (Mask R-CNN); that is, in this embodiment, the trained neural network is a Mask R-CNN. Mask R-CNN is a two-stage framework: the first stage scans the image and generates proposals (i.e., regions that may contain an object), and the second stage classifies the proposals and generates bounding boxes and masks. Mask R-CNN extends Faster R-CNN, a popular object detection framework, into an instance segmentation framework.
Referring to fig. 5, a flow diagram of a robot grasping method 5000 according to one embodiment of the invention is shown. As shown in fig. 5, the robot grasping method 5000 includes the following steps:
in step S510, instance segmentation is performed on a scene image containing an object to be grasped, captured by a three-dimensional camera mounted on the robot, so as to obtain a point cloud segmentation result of the object to be grasped; and
in step S520, a grasping task is performed based on the point cloud segmentation result.
In one or more embodiments, the instance segmentation of the scene image in step S510 may employ the method 3000 described above in connection with fig. 3, which is not repeated here. In one embodiment, a segmentation mask of the object to be grasped may be obtained by instance segmentation of the scene image captured by the three-dimensional camera. Then, by mapping between the image and the depth map, a point cloud segmentation result of the object can be obtained, as sketched in the example below.
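As a sketch of how a segmentation mask and a depth map could be mapped to an object point cloud, the following assumes a simple pinhole camera model with known intrinsics; the intrinsic parameters and synthetic data are placeholders for illustration:

```python
import numpy as np

def mask_to_point_cloud(mask, depth, fx, fy, cx, cy):
    """Back-project the depth pixels selected by a boolean instance mask
    into 3D camera coordinates using a pinhole model."""
    v, u = np.nonzero(mask & (depth > 0))  # pixel rows/columns inside the mask
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.column_stack((x, y, z))      # N x 3 points in meters

# Placeholder intrinsics and synthetic data for illustration only.
depth = np.full((480, 640), 0.8, dtype=np.float32)
mask = np.zeros((480, 640), dtype=bool)
mask[200:260, 300:360] = True
points = mask_to_point_cloud(mask, depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
print(points.shape)  # (3600, 3)
```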
The term "point cloud" refers to the set of points obtained by sampling the spatial coordinates of points on the object surface. In one embodiment, the "point cloud data" may include two-dimensional coordinates (XY) or three-dimensional coordinates (XYZ), laser reflection intensity, color information (RGB), and the like.
In step S520, a grasping task is performed based on the point cloud segmentation result. For example, a control scheme for the robot's mechanical arm is obtained by further processing the point cloud segmentation result, and the arm is then controlled to perform the grasping task based on that scheme; one possible way to derive a grasp pose from the segmented point cloud is sketched below.
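The patent does not specify how the control scheme is derived; as one illustrative possibility, a simple grasp pose could be computed from the segmented point cloud by taking its centroid and principal axes, as in the sketch below (the approach itself and all values are assumptions for illustration):

```python
import numpy as np

def simple_grasp_pose(points):
    """Illustrative only: grasp at the centroid of the object point cloud, with
    the approach direction along the axis of least extent and the gripper
    closing direction along the intermediate principal axis."""
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid, full_matrices=False)
    approach_axis = vt[2]  # least-variance direction (assumed approach)
    closing_axis = vt[1]   # intermediate direction (assumed closing axis)
    return centroid, approach_axis, closing_axis

# Synthetic elongated point cloud standing in for a segmented object.
rng = np.random.default_rng(1)
cloud = rng.normal(size=(500, 3)) * np.array([0.05, 0.02, 0.01]) + [0.4, 0.0, 0.6]
center, approach, closing = simple_grasp_pose(cloud)
print(center, approach, closing)
```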
Fig. 6 provides a robot grasping system 6000. As shown in fig. 6, the robot grasping system 6000 includes: an instance segmentation device 610 and a grasping device 620, wherein the instance segmentation device 610 is configured to perform instance segmentation on a scene image containing an object to be grasped, captured by a three-dimensional camera mounted on the robot, so as to obtain a point cloud segmentation result of the object to be grasped, and the grasping device 620 is configured to perform a grasping task based on the point cloud segmentation result.
In one embodiment, the instance segmentation device 610 may obtain a segmentation mask of the object to be grasped by performing instance segmentation on the scene image captured by the three-dimensional camera. The instance segmentation device 610 may then obtain a point cloud segmentation result of the object by mapping between the image and the depth map. In the context of the present invention, the term "point cloud" refers to the set of points obtained by sampling the spatial coordinates of points on the object surface. In one embodiment, the "point cloud data" may include two-dimensional coordinates (XY) or three-dimensional coordinates (XYZ), laser reflection intensity, color information (RGB), and the like.
The grasping device 620 is configured to perform a grasping task based on the point cloud segmentation result. In one embodiment, the grasping device 620 is configured to obtain a control scheme for the robot's mechanical arm by further processing the point cloud segmentation result, and then to control the arm to perform the grasping task based on that scheme.
In addition, those skilled in the art will readily appreciate that the method 1000 of training a neural network to perform instance segmentation on a picture, the method 3000 of performing instance segmentation on a picture using a trained neural network, and the robot grasping method 5000 provided in one or more embodiments of the present invention may be implemented by a computer program. For example, the computer program may be embodied in a computer program product which, when executed by a processor, implements the method 1000, the method 3000, or the robot grasping method 5000 of one or more embodiments of the invention. As another example, when a computer storage medium (e.g., a USB disk) storing the computer program is connected to a computer, the computer program can be executed to perform the method 1000, the method 3000, or the robot grasping method 5000 of one or more embodiments of the present invention.
In summary, the instance segmentation method and apparatus according to embodiments of the present invention acquire multiple grayscale frames of the same scene under different camera settings (e.g., exposure time and amplitude gain) and combine them into a multi-channel image on which instance segmentation is performed, thereby enlarging the dynamic range of the captured data and greatly improving the accuracy of instance segmentation. This in turn helps ensure a high success rate when a robot grasps objects using a grasping solution based on this instance segmentation method or apparatus.
For example, acquiring multiple frames (e.g., 3 frames) of the same scene with different settings (combinations of exposure time and amplitude gain) helps improve dynamic range compared with a single acquired frame: when imaging dark objects of low reflectivity, a longer exposure time and a higher gain should be used, whereas a shorter exposure time and normal gain are suitable for highly reflective objects.
While the above description describes only some of the embodiments of the present invention, those of ordinary skill in the art will appreciate that the present invention can be embodied in many other forms without departing from the spirit or scope thereof. Accordingly, the present examples and embodiments are to be considered as illustrative and not restrictive, and the invention is intended to cover various modifications and substitutions without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (24)

1. A method of training a neural network to perform instance segmentation on a picture, the method comprising:
acquiring multiple grayscale frames of the same scene based on different settings of a camera;
combining the multiple grayscale frames into a multi-channel image; and
training the neural network for instance segmentation using the multi-channel image.
2. The method of claim 1, further comprising:
acquiring depth information of the scene using the camera.
3. The method of claim 2, wherein combining the multiple grayscale frames into a multi-channel image comprises:
combining the multiple grayscale frames and the depth information into a multi-channel image.
4. The method of claim 3, wherein the neural network is a Mask region convolutional neural network (Mask R-CNN), and the multi-channel image is a four-channel image composed of three grayscale frames, corresponding to three different settings of the camera, and the depth information.
5. The method of claim 1, wherein the different settings of the camera comprise different exposure times and amplitude gains, wherein when imaging a dark object of reflectivity f1, an exposure time t1 and an amplitude gain g1 are used, and when imaging an object of reflectivity f2, an exposure time t2 and an amplitude gain g2 are used, wherein f1 < f2, t1 > t2, and g1 > g2.
6. An apparatus for training a neural network to perform instance segmentation on a picture, the apparatus comprising:
a first acquisition device for acquiring multiple grayscale frames of the same scene based on different settings of a camera;
combining means for combining the multiple grayscale frames into a multi-channel image; and
training means for training the neural network for instance segmentation using the multi-channel image.
7. The apparatus of claim 6, further comprising:
a second acquisition device for acquiring depth information of the scene using the camera.
8. The apparatus of claim 7, wherein the combining means is configured to combine the multiple grayscale frames and the depth information into a multi-channel image.
9. The apparatus of claim 8, wherein the neural network is a Mask region convolutional neural network (Mask R-CNN), and the multi-channel image is a four-channel image composed of three grayscale frames, corresponding to three different settings of the camera, and the depth information.
10. The apparatus of claim 6, wherein the different settings of the camera comprise different exposure times and amplitude gains, wherein when imaging a dark object of reflectivity f1, an exposure time t1 and an amplitude gain g1 are used, and when imaging an object of reflectivity f2, an exposure time t2 and an amplitude gain g2 are used, wherein f1 < f2, t1 > t2, and g1 > g2.
11. A method of performing instance segmentation on an image, the method comprising:
acquiring multiple grayscale frames of the same scene based on different settings of a camera;
combining the multiple grayscale frames into a multi-channel image; and
performing instance segmentation on the multi-channel image using a trained neural network.
12. The method of claim 11, further comprising:
acquiring depth information of the scene using the camera.
13. The method of claim 12, wherein combining the multiple grayscale frames into a multi-channel image comprises:
combining the multiple grayscale frames and the depth information into a multi-channel image.
14. The method of claim 13, wherein the trained neural network is a Mask region convolutional neural network (Mask R-CNN), and the multi-channel image is a four-channel image composed of three grayscale frames, corresponding to three different settings of the camera, and the depth information.
15. The method of claim 11, wherein the different settings of the camera comprise different exposure times and amplitude gains, wherein when imaging a dark object of reflectivity f1, an exposure time t1 and an amplitude gain g1 are used, and when imaging an object of reflectivity f2, an exposure time t2 and an amplitude gain g2 are used, wherein f1 < f2, t1 > t2, and g1 > g2.
16. A robot grasping method, the method comprising:
performing the method of any one of claims 11 to 15 on a scene image containing an object to be grasped, captured by a three-dimensional camera mounted on the robot, so as to obtain a point cloud segmentation result of the object to be grasped; and
performing a grasping task based on the point cloud segmentation result.
17. An apparatus for performing instance segmentation on an image, the apparatus comprising:
a first acquisition device for acquiring multiple grayscale frames of the same scene based on different settings of a camera;
combining means for combining the multiple grayscale frames into a multi-channel image; and
segmentation means for performing instance segmentation on the multi-channel image using a trained neural network.
18. The apparatus of claim 17, further comprising:
a second acquisition device for acquiring depth information of the scene using the camera.
19. The apparatus of claim 18, wherein the combining means is configured to combine the multiple grayscale frames from the first acquisition device and the depth information from the second acquisition device into a multi-channel image.
20. The apparatus of claim 19, wherein the segmentation means performs instance segmentation using a Mask region convolutional neural network (Mask R-CNN), and the multi-channel image is a four-channel image composed of three grayscale frames, corresponding to three different settings of the camera, and the depth information.
21. The apparatus of claim 17, wherein the different settings of the camera comprise different exposure times and amplitude gains, wherein when imaging a dark object of reflectivity f1, an exposure time t1 and an amplitude gain g1 are used, and when imaging an object of reflectivity f2, an exposure time t2 and an amplitude gain g2 are used, wherein f1 < f2, t1 > t2, and g1 > g2.
22. A robot grasping system, the system comprising:
the apparatus of any one of claims 17 to 21, configured to perform instance segmentation on a scene image containing an object to be grasped, using a three-dimensional camera mounted on the robot, so as to obtain a point cloud segmentation result of the object to be grasped; and
a grasping device for performing a grasping task based on the point cloud segmentation result.
23. A computer storage medium comprising instructions which, when executed, perform the method of any one of claims 1 to 5, 11 to 16.
24. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1 to 5, 11 to 16.
CN202111305773.7A 2021-11-05 2021-11-05 Method and equipment for carrying out instance segmentation on picture Pending CN116091571A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111305773.7A CN116091571A (en) 2021-11-05 2021-11-05 Method and equipment for carrying out instance segmentation on picture
PCT/EP2022/079193 WO2023078686A1 (en) 2021-11-05 2022-10-20 Method and device for performing instance segmentation of picture

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111305773.7A CN116091571A (en) 2021-11-05 2021-11-05 Method and equipment for carrying out instance segmentation on picture

Publications (1)

Publication Number Publication Date
CN116091571A true CN116091571A (en) 2023-05-09

Family

ID=84358280

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111305773.7A Pending CN116091571A (en) 2021-11-05 2021-11-05 Method and equipment for carrying out instance segmentation on picture

Country Status (2)

Country Link
CN (1) CN116091571A (en)
WO (1) WO2023078686A1 (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113506305B (en) * 2021-06-09 2023-10-24 西交利物浦大学 Image enhancement method, semantic segmentation method and device for three-dimensional point cloud data

Also Published As

Publication number Publication date
WO2023078686A1 (en) 2023-05-11

Similar Documents

Publication Publication Date Title
CN108370438B (en) Range gated depth camera assembly
US20200334843A1 (en) Information processing apparatus, control method for same, non-transitory computer-readable storage medium, and vehicle driving support system
JP6858650B2 (en) Image registration method and system
JP7151805B2 (en) LEARNING DATA GENERATION DEVICE, LEARNING DATA GENERATION METHOD, AND PROGRAM
US9772405B2 (en) Backfilling clouds of 3D coordinates
EP4145338A1 (en) Target detection method and apparatus
US20180089501A1 (en) Computer implemented method of detecting the distance of an object from an image sensor
EP3709266A1 (en) Human-tracking methods, apparatuses, systems, and storage media
US10936900B2 (en) Color identification using infrared imaging
US10679369B2 (en) System and method for object recognition using depth mapping
US11143879B2 (en) Semi-dense depth estimation from a dynamic vision sensor (DVS) stereo pair and a pulsed speckle pattern projector
CN112189147A (en) Reduced power operation of time-of-flight cameras
CN108475429B (en) System and method for segmentation of three-dimensional microscope images
GB2562037A (en) Three-dimensional scene reconstruction
WO2021114776A1 (en) Object detection method, object detection device, terminal device, and medium
KR101696086B1 (en) Method and apparatus for extracting object region from sonar image
CN116091571A (en) Method and equipment for carrying out instance segmentation on picture
JP2010237976A (en) Light source information obtaining device, shading detection device, shading removal device, and those methods and programs
Pang et al. Generation of high speed CMOS multiplier-accumulators
US20230141945A1 (en) Quantifying biotic damage on plants, by separating plant-images and subsequently operating a convolutional neural network
CN108527366B (en) Robot following method and device based on depth of field distance
JP2021149691A (en) Image processing system and control program
CA3148404A1 (en) Information processing device, data generation method, and non-transitory computer-readable medium storing program
WO2022213288A1 (en) Depth image processing method and apparatus, and storage medium
US20230237735A1 (en) Method For Generating Point Cloud Data And Data Generating Apparatus

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication