WO2023124734A1 - Object grabbing point estimation method, apparatus and system, model training method, apparatus and system, and data generation method, apparatus and system - Google Patents

Object grabbing point estimation method, apparatus and system, model training method, apparatus and system, and data generation method, apparatus and system

Info

Publication number
WO2023124734A1
Authority
WO
WIPO (PCT)
Prior art keywords
point
grasping
quality
image
model
Prior art date
Application number
PCT/CN2022/135705
Other languages
French (fr)
Chinese (zh)
Inventor
周韬
Original Assignee
广东美的白色家电技术创新中心有限公司
美的集团股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 广东美的白色家电技术创新中心有限公司 and 美的集团股份有限公司
Publication of WO2023124734A1 publication Critical patent/WO2023124734A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/50 - Depth or shape recovery
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 - Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/10 - Segmentation; Edge detection
    • G06T 7/11 - Region-based segmentation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/70 - Determining position or orientation of objects or cameras
    • G06T 7/73 - Determining position or orientation of objects or cameras using feature-based methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/70 - Determining position or orientation of objects or cameras
    • G06T 7/73 - Determining position or orientation of objects or cameras using feature-based methods
    • G06T 7/75 - Determining position or orientation of objects or cameras using feature-based methods involving models
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/762 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P 90/00 - Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P 90/30 - Computing systems specially adapted for manufacturing

Definitions

  • the present disclosure relates to but is not limited to artificial intelligence technology, and specifically relates to a method, device and system for object grasping point estimation, model training and data generation.
  • the challenge encountered by the robot vision system is to guide the robot to grab thousands of different stock keeping units (SKU for short).
  • These objects are usually unknown to the system, or the variety is so large that maintaining physical models or texture templates for all SKUs is too costly.
  • The simplest example is the depalletizing application: although the objects to be grasped are all rectangular (boxes or cartons), their texture and size change from scene to scene. The classic object localization or recognition schemes based on template matching are therefore difficult to apply in such scenarios.
  • In some e-commerce warehousing scenarios, many objects have irregular shapes, most commonly box-like and bottle-like objects. These goods are stacked together, and the robot vision guidance system needs to efficiently sort them out of the stacked state one by one, perform subsequent code scanning or identification operations, and deliver them into the appropriate target material frame.
  • Having the robot vision system estimate the most suitable grasping point (which may be, but is not limited to, a suction point) from the scene captured by the camera without prior knowledge of the object, and guiding the robot to perform the grasping action, is still a problem that needs to be solved.
  • An embodiment of the present disclosure provides a method for generating training data of an object grasping point estimation model, including:
  • a target grasping quality of pixels in the sample image is generated according to the grasping quality of the sampling points of the first object.
  • An embodiment of the present disclosure also provides a device for generating training data of an object grasping point estimation model, including a processor and a memory storing a computer program, wherein, when the processor executes the computer program, the method for generating training data of an object grasping point estimation model described in any embodiment of the present disclosure is implemented.
  • the method and device of the above-mentioned embodiments of the present disclosure realize automatic labeling of sample images, can generate training data efficiently and with high quality, and avoid problems such as heavy workload and unstable labeling quality caused by manual labeling.
  • An embodiment of the present disclosure provides a method for training an estimation model of object grasping points, including:
  • the training data includes a sample image and the target grasping quality of pixels in the sample image
  • the estimation model includes a backbone network using a semantic segmentation network architecture and a multi-branch network, and the multi-branch network adopts a multi-task learning network architecture.
  • An embodiment of the present disclosure also provides a training device for an estimation model of an object grasping point, including a processor and a memory storing a computer program, wherein, when the processor executes the computer program, the method for training the estimation model of the object grasping point described in any embodiment of the present disclosure is implemented.
  • The method and device of the above embodiments of the present disclosure learn the grasping quality of pixels in a 2D image through training, which offers better accuracy and stability than directly predicting the optimal grasping point.
  • An embodiment of the present disclosure provides a method for estimating a grasping point of an object, including:
  • An embodiment of the present disclosure also provides a device for estimating the grasping point of an object, including a processor and a memory storing a computer program, wherein, when the processor executes the computer program, the method for estimating object grasping points described in any embodiment of the present disclosure is implemented.
  • An embodiment of the present disclosure also provides a robot vision system, including:
  • a camera configured to shoot a scene image containing an object to be captured, where the scene image includes a 2D image, or includes a 2D image and a depth image;
  • a control device, including the object grasping point estimation device according to an embodiment of the present disclosure, the control device being configured to determine the position of the grasping point of the object to be grasped according to the scene image captured by the camera, and to control the grasping action performed by the robot according to the position of the grasping point;
  • a robot configured to perform said grasping action.
  • the estimation method, device and robot vision system of the above-mentioned embodiments of the present disclosure can improve the accuracy of object grasping point estimation, thereby improving the success rate of grasping.
  • An embodiment of the present disclosure also provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements the method for generating training data of an object grasping point estimation model described in any embodiment of the present disclosure, or the method for training the object grasping point estimation model described in any embodiment of the present disclosure, or the method for estimating object grasping points described in any embodiment of the present disclosure.
  • FIG. 1 is a flowchart of a method for generating training data for an object grasping point estimation model according to an embodiment of the present disclosure;
  • FIG. 2 is a flowchart of generating annotation data according to the grasping quality of sampling points in FIG. 1;
  • FIG. 3 is a schematic diagram of a device for generating training data according to an embodiment of the present disclosure;
  • FIG. 4 is a flowchart of a training method for an estimation model of an object grasping point according to an embodiment of the present disclosure;
  • FIG. 5 is a network structure diagram of an estimation model according to an embodiment of the present disclosure;
  • FIG. 6 is a flowchart of a method for estimating an object grasping point according to an embodiment of the present disclosure;
  • FIG. 7 is a schematic structural diagram of a robot vision system according to an embodiment of the present disclosure.
  • Words such as “exemplary” or “for example” are used to mean serving as an example, instance, or illustration. Any embodiment described in this disclosure as “exemplary” or “for example” should not be construed as preferred or advantageous over other embodiments.
  • “And/or” herein describes an association relationship between associated objects and indicates that three relationships are possible; for example, “A and/or B” can mean: A exists alone, A and B exist simultaneously, or B exists alone.
  • “A plurality” means two or more than two.
  • Words such as “first” and “second” are used to distinguish identical or similar items having substantially the same function and effect. Those skilled in the art will understand that such words do not limit the quantity or execution order, nor do they necessarily imply that the items are different.
  • In one related scheme, the point cloud is segmented by plane fitting or by Euclidean-distance clustering in an attempt to segment and detect the different objects in the scene; the center point of each segment is then taken as a grasp point candidate, the candidates are ranked with a series of heuristic rules, and the robot is finally guided to grasp at the optimal grasp point.
  • a feedback system is introduced to record the success or failure of each grab. If successful, the current object is used as a template to match the grab point of the next grab.
  • The problem with this scheme is that the performance of ordinary point cloud segmentation is relatively weak, so there are many erroneous grasp points, and when objects are closely packed the point cloud segmentation scheme easily fails.
  • In another related scheme, a deep learning framework is used: a limited amount of data is manually labeled with the direction and area of the grasping points to obtain training data, and a neural network model is trained on these data.
  • The vision system can then process pictures similar to the training set and estimate the grasping points of the objects.
  • The problem with this solution is that the cost of data collection and labeling is relatively high, especially at the labeling level. Labeling the direction and area of grasping points is difficult and requires strong technical ability from the labeler; at the same time, the labels involve many human factors, so labeling quality cannot be controlled systematically and no model with systematic quality assurance can be produced.
  • An embodiment of the present disclosure provides a method for generating training data of an object grasping point estimation model, as shown in FIG. 1 , including:
  • Step 110: acquire the 3D model of the sample object, sample grasping points based on the 3D model of the sample object, and evaluate the grasping quality of the sampling points;
  • The sample objects can be box-like objects, bottle-like objects, a mixture of box-like and bottle-like objects, or objects of other shapes.
  • the sample object can usually be selected from the actual items to be grabbed, but it is not required to cover all types of the actual items to be grabbed.
  • Items with typical geometric shapes among the items to be grasped can be selected as sample objects, but this disclosure does not require the sample objects to cover the shapes of all items to be grasped; the trained model can still estimate grasping points for objects of other shapes.
  • Step 120: render a simulated scene loaded with the 3D model of a first object to generate a sample image for training, the first object being selected from the sample objects;
  • the loaded first object may be randomly selected by the system from sample objects, or manually selected, or selected according to configured rules.
  • The selected first objects may include one sample object or multiple sample objects, and may be of one type or of multiple types; this embodiment is not limited in this respect.
  • Step 130: generate a target grasping quality for pixels in the sample image according to the grasping quality of the sampling points of the first object.
  • The target grasping quality of pixels in the sample image here may cover some of the pixels or all of the pixels in the sample image; it may be labeled pixel by pixel, or for a set of multiple pixels, such as a region of the sample image containing more than two pixels. Because the labeled grasping quality of pixels in the sample image serves as the target data during training, it is referred to herein as the target grasping quality of the pixels.
  • The 3D model of the sample object is obtained first, and grasping point sampling and evaluation of the grasping quality of the sampling points are performed on this model. Because the geometric shape of the 3D model itself is accurate, the grasping quality evaluation can be completed with high quality. After the 3D models of the selected first objects are loaded to generate the simulated scene, and because the position and posture of each 3D model are tracked during loading, the positional relationship between the sampling points and the pixels in the sample image can be calculated, and the grasping quality of the sampling points can be transferred to the corresponding pixels in the sample image.
  • The training data generated by the embodiments of the present disclosure include the sample images and annotation data (including, but not limited to, the target grasping quality). The embodiments of the present disclosure therefore realize automatic annotation of sample images and can generate training data efficiently and with high quality, avoiding the heavy workload and unstable labeling quality of manual labeling.
  • The acquiring of the 3D model of the sample object includes: creating or collecting the 3D model of the sample object, and normalizing it so that the center of mass of the sample object lies at the origin of the model coordinate system of the 3D model and the main axis of the sample object is aligned with the direction of one coordinate axis of the model coordinate system.
  • The so-called normalization can be embodied as a unified modeling rule: the origin of the model coordinate system is established at the center of mass of the sample object, and one coordinate axis of the model coordinate system is aligned with the main-axis direction of the object. For a 3D model that has already been created and collected, normalization can be achieved by translating and rotating the 3D model until the center of mass lies at the origin and the main axis is aligned with a coordinate axis.
  • The performing of grasping point sampling based on the 3D model of the sample object includes: performing point cloud sampling on the 3D model of the sample object, and determining and recording the first position and grasping direction of each sampling point in the 3D model; the first position is represented by the coordinates of the sampling point in the model coordinate system of the 3D model, and the grasping direction is determined from the normal vector of the sampling point in the 3D model.
  • Uniform sampling can be performed on the surface of the sample object; the specific algorithm is not limited, as long as the sampling points on the surface of the sample object have an appropriate density, so that suitable grasping points are not missed.
  • a normal vector of a plane fitted by all points within a set neighborhood range of a sampling point may be used as a normal vector of the sampling point.
  • The evaluation of the grasping quality of the sampling points includes: in a scenario where a single suction cup is used to pick up the sample object, estimating the grasping quality of each sampling point according to its sealing quality and its resisting quality. The sealing quality is determined by the airtightness between the suction cup and the object surface when the suction cup sucks the sample object at the sampling point with its axis aligned with the grasping direction of the sampling point; the resisting quality is determined by the gravitational moment of the sample object and the degree to which the moments that can be generated when the suction cup holds the object resist that gravitational moment.
  • The gravitational moment tends to rotate the sample object and make it fall (the mass of the sample object is assigned during configuration), while the suction force of the suction cup on the sample object and the friction between the end of the suction cup and the sample object can provide a counter-moment that prevents the object from falling.
  • The suction force and friction force can be given as configuration information or calculated from configuration information (such as suction cup parameters and object material). The degree of resistance therefore reflects the stability of the object during suction and can be calculated with the relevant formulas.
  • The above closure quality and resisting quality can be scored separately, and the sum of the two scores, their average, or a weighted average can then be used as the grasping quality of the sampling point.
  • Because the closure quality and resisting quality of a sampling point are determined by the local geometric characteristics of the 3D model, they fully reflect the relationship between the local geometric information of the object and the quality of the grasping point, so an accurate evaluation of the grasping quality of the sampling points can be achieved.
  • Although the embodiment of the present disclosure takes a single suction cup picking up an object as an example, the present disclosure is not limited thereto. For grasping methods that pick up an object at multiple points or clamp an object at multiple points, indices such as grasping efficiency, the stability of the held object, and the probability of success can likewise be used to evaluate the grasping quality of the sampling points.
  • In some embodiments, the simulated scene is obtained by loading the 3D models of the first objects into an initial scene; the loading process is described below.
  • Through the above loading process, the embodiment of the present disclosure can simulate various object-stacking scenes. Training data generated from such scenes make the trained model suitable for estimating object grasping points in complex stacking scenes, solving the problem that grasping points are difficult to estimate in such scenes.
  • A simulated material frame can be set in the initial scene; the 3D models of the first objects are loaded into the material frame, and the collisions between the first objects and between the first objects and the material frame are simulated, so that the finally formed stacked scene is closer to a real scene. The material frame, however, is not required.
  • a simulated scene in which the first objects are stacked in an orderly manner may also be loaded, depending on the need for simulating the actual working scene.
  • the first object may be loaded multiple times in different ways to obtain multiple simulation scenes.
  • the different manners may be, for example, different types and/or quantities of the loaded first objects, different initial positions and postures of the 3D models during loading, and the like.
  • The rendering of the simulated scene to generate sample images for training includes: rendering each simulated scene at least twice to obtain at least two sets of sample images for training. In each rendering, a simulated camera is added to the simulated scene, a light source is set, textures are added to the loaded first objects, and a 2D image and a depth image are rendered as one set of sample images; between any two of the renderings, at least one of the following parameters differs: object texture, simulated camera parameters, and lighting parameters.
  • Because the simulated environment is lit during rendering, adjusting the parameters of the simulated camera (such as intrinsics, position, and angle), the lighting parameters (such as the color and intensity of the lighting), and the object textures strengthens the degree of data randomization, enriches the content of the sample images, and increases their number, thereby improving the quality of the training data and, in turn, the performance of the trained estimation model.
  • Adding textures to the loaded first objects in each rendering includes: in each rendering, for each first object loaded into the simulation scene, randomly selecting one of a number of collected real textures and pasting it on the surface of that first object; or, in each rendering, for each type of first object loaded into the simulation scene, randomly selecting one of the collected real textures and pasting it on the surfaces of the first objects of that type.
  • This example compensates for domain differences between real and simulated data through a randomization technique.
  • The real textures can be collected from images of actual objects, from photographs of real textures, and the like. Randomly pasting the selected textures on the surfaces of the first objects stacked randomly in the simulated scene makes it possible to render multiple images with different textures.
  • In this way, the estimation model learns to use local geometric information to predict the grasping quality of a grasping point, which gives the model the ability to generalize to unknown objects.
  • In some embodiments, the sample image includes a 2D image and a depth image, and generating the target grasping quality of pixels in the sample image according to the grasping quality of the sampling points of the first object includes:
  • Step 210: obtain the point cloud of the visible first objects in the simulated scene according to the intrinsic parameters of the simulated camera used for rendering and the rendered depth image;
  • Step 220: determine the position of each target sampling point in the point cloud according to the first position of the target sampling point in the 3D model and the second position and posture of the 3D model in the simulated scene after loading, a target sampling point being a sampling point of a visible first object;
  • Step 230: determine the grasping quality of points in the point cloud according to the grasping quality of the target sampling points and their positions in the point cloud, and mark it as the target grasping quality of the corresponding pixels in the 2D image.
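  • For illustration of Step 210, the following is a minimal sketch that back-projects a rendered depth image into a camera-frame point cloud using pinhole intrinsics; the function name and the intrinsic values in the usage comment are illustrative assumptions, not values prescribed by this disclosure.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth image (H x W, in metres) into an N x 3 point cloud
    in the camera frame using pinhole intrinsics (fx, fy, cx, cy)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # drop invalid (zero-depth) pixels

# Hypothetical usage with a 640x480 simulated camera:
# cloud = depth_to_point_cloud(rendered_depth, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
```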
  • Determining the grasping quality of points in the point cloud according to the grasping quality of the target sampling points and their positions in the point cloud includes any of the following:
  • first: for each target sampling point, set the grasping quality of the points in the point cloud adjacent to the target sampling point to the grasping quality of that target sampling point;
  • second: for a point in the point cloud, obtain its grasping quality by interpolating the grasping qualities of the target sampling points adjacent to that point;
  • third: for each target sampling point, set the grasping quality of the points in the point cloud adjacent to the target sampling point to the grasping quality of that target sampling point, and, after the grasping quality of all points adjacent to target sampling points has been determined, obtain the grasping quality of the remaining points in the point cloud by interpolation.
  • Embodiments of the present disclosure provide various methods for transferring the grasping quality of target sampling points to the points of the point cloud.
  • The first method assigns the grasping quality of each target sampling point to the adjacent points in the point cloud. An adjacent point can be one or more points in the point cloud closest to the target sampling point; for example, the points can be filtered with a set distance threshold, and any point in the point cloud whose distance to the target sampling point is less than that threshold is treated as a point adjacent to the target sampling point.
  • The second method is interpolation: the grasping quality of a point in the point cloud can be interpolated from the grasping qualities of multiple nearby target sampling points.
  • For example, an interpolation method based on Gaussian filtering can be used; or the target sampling points can be given different weights according to their distances to the point, with larger distances given smaller weights, and the grasping qualities of the target sampling points weighted-averaged with these weights to obtain the grasping quality of that point. Other interpolation methods may also be used in this embodiment.
  • The points adjacent to a given point can also be filtered with a set distance threshold. If only one adjacent target sampling point is found for a point, the grasping quality of that target sampling point can be assigned to the point.
  • The third method first determines the grasping quality of the points adjacent to the target sampling points in the point cloud, and then obtains the grasping quality of the other points in the point cloud by interpolating from the grasping qualities of those points.
  • Both the second and third methods yield the grasping quality of all points in the point cloud; after mapping these values to the corresponding pixels in the 2D image, a grasping-quality heat map of the 2D image can be drawn.
  • With the first method, it is also possible to obtain the grasping quality of only some of the points in the point cloud, and therefore of only some of the pixels in the 2D image through mapping. In that case, during training only the predicted grasping quality of those pixels is compared with the target grasping quality, the loss is calculated accordingly, and the model is optimized from that loss.
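  • As a hedged sketch of the second transfer method, the grasping quality of each point in the point cloud can be obtained as a Gaussian distance-weighted average of the qualities of nearby target sampling points; the radius and Gaussian width below are illustrative values, not parameters fixed by this disclosure.

```python
import numpy as np

def transfer_quality(cloud_pts, sample_pts, sample_quality, radius=0.01, sigma=0.005):
    """For every point in `cloud_pts` (N x 3), interpolate the grasping quality of
    the target sampling points `sample_pts` (M x 3, with qualities `sample_quality`)
    that lie within `radius`; weights decay with a Gaussian of width `sigma`."""
    quality = np.zeros(len(cloud_pts))
    for i, p in enumerate(cloud_pts):
        d = np.linalg.norm(sample_pts - p, axis=1)
        near = d < radius
        if not near.any():
            continue  # no nearby sampling point: quality stays 0
        w = np.exp(-0.5 * (d[near] / sigma) ** 2)  # larger distance -> smaller weight
        quality[i] = np.average(sample_quality[near], weights=w)
    return quality
```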
  • In some embodiments, the generation method further includes: for each target sampling point, taking the grasping direction of the target sampling point as the grasping direction of the points adjacent to it in the point cloud; and, combining this with the relative positional relationship between the visible first objects, when the grasping space at a point adjacent to the target sampling point is determined to be smaller than the required grasping space, adjusting downwards the grasping quality of the points in the point cloud whose distance to that target sampling point is smaller than a set distance threshold.
  • The embodiment of the present disclosure considers that, in the stacked state, grasping points of good quality on an object may not have enough space to complete the grasping operation because of adjacent objects. Therefore, after the grasping quality of the points in the point cloud has been determined, the grasping space is checked, and the grasping quality of points affected by insufficient grasping space is adjusted downwards; specifically, it can be adjusted below a set quality threshold so that such points are not selected.
  • In some embodiments, the sample image includes a 2D image, and the generation method further includes: labeling the classification of each pixel in the 2D image, the classification including foreground and background, where the foreground is the first objects in the image.
  • The pixel classification can be used to train the estimation model to distinguish foreground from background and to accurately select the foreground points (that is, the points on the first objects) from the sample image input to the estimation model, so that the predicted grasping quality needs to be estimated only for the foreground points.
  • The classification of the pixels in the 2D image can also be derived from the classification of the points in the point cloud: by mapping the boundaries between the first objects and the background of the simulated scene onto the point cloud, each point in the point cloud can be classified as a foreground point or a background point.
  • An embodiment of the present disclosure also provides a method for generating training data of an object grasping point estimation model, including:
  • Step 1: collect 3D models of various sample objects and normalize the 3D models so that the origin of the model coordinate system is placed at the center of mass of the sample object and the first coordinate axis of the model coordinate system is aligned with the main axis of the sample object.
  • A 3D model in a format such as STereoLithography (STL) can be used. The position of the center of mass of the sample object can be obtained from the vertex and face information in the 3D model, for example by computing the centroid of all vertices; the origin of the model coordinate system is then translated to the center of mass of the sample object.
  • The principal component analysis (PCA) method can be used to determine the main-axis direction of the sample object, and the 3D model of the sample object is then rotated so that one coordinate axis of the model coordinate system points in the same direction as the main axis of the sample object.
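  • A minimal sketch of this normalization under simple assumptions (the vertex centroid is used as a stand-in for the center of mass, and PCA is computed directly on the vertices):

```python
import numpy as np

def normalize_model(vertices):
    """Translate mesh vertices so the centroid sits at the origin, then rotate so
    the principal axis (first PCA component) lies along the x-axis.
    `vertices` is an N x 3 array of the 3D model's vertex coordinates."""
    centered = vertices - vertices.mean(axis=0)
    # PCA via eigen-decomposition of the 3x3 covariance matrix
    _, eigvecs = np.linalg.eigh(np.cov(centered.T))
    axes = eigvecs[:, ::-1]            # columns sorted by decreasing variance
    if np.linalg.det(axes) < 0:        # keep a right-handed rotation
        axes[:, -1] *= -1
    return centered @ axes             # main axis now lies along x
```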
  • Step 2: sample grasping points on the 3D model of the sample object, obtaining and recording the first position and grasping direction of each sampling point.
  • The sampling process in this embodiment performs point cloud sampling on the object model and estimates a normal vector for each sampled point from a fixed neighborhood; each point together with its normal vector represents one sampling point.
  • The voxel sampling method, or other sampling methods such as farthest point sampling, can be used.
  • all points within a certain range of neighborhood where each sampling point is located are used to estimate the direction of the normal vector of the sampling point.
  • the method of estimating the normal vector can be to use the random sample consensus algorithm (RANSAC for short) to fit all points in the neighborhood of the sampling point to estimate a plane, and the normal vector of the plane is approximately the normal vector of the sampling point.
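  • A compact sketch of this step under simple assumptions: model points are downsampled on a voxel grid, and a normal is estimated for each sampled point by least-squares plane fitting over a fixed-radius neighborhood (a simpler stand-in for the RANSAC plane fit mentioned above). The voxel size and radius are illustrative.

```python
import numpy as np

def voxel_downsample(points, voxel=0.005):
    """Keep one point per occupied voxel of edge length `voxel` (metres)."""
    keys = np.floor(points / voxel).astype(np.int64)
    _, idx = np.unique(keys, axis=0, return_index=True)
    return points[np.sort(idx)]

def estimate_normal(points, query, radius=0.01):
    """Normal of the plane least-squares-fitted to the neighbours of `query`
    within `radius` (eigenvector of the smallest eigenvalue of the local covariance)."""
    nbr = points[np.linalg.norm(points - query, axis=1) < radius]
    if len(nbr) < 3:
        return np.array([0.0, 0.0, 1.0])  # degenerate neighbourhood
    cov = np.cov((nbr - nbr.mean(axis=0)).T)
    _, eigvecs = np.linalg.eigh(cov)
    return eigvecs[:, 0]

# Each sampling point is the pair (position, normal); the normal gives the
# grasping (suction) direction at that point.
```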
  • Step 3: assess the grasping quality of the sampling points.
  • The quality assessment includes calculating the closure (sealing) quality during suction and the quality of resistance to the gravitational moment during suction (the suction must be able to resist the gravitational moment to complete a stable grasp); the grasping quality at each sampling point is estimated from its closure quality and resisting quality.
  • It is necessary to evaluate whether a sampled suction point (that is, a sampling point) is a suction point at which the sample object can be stably picked up.
  • The evaluation includes two aspects; the first is the closure quality.
  • The closure quality can be measured by approximating the end of the suction cup, of a set radius, as a polygon, projecting this polygon onto the surface of the 3D model along the grasping direction of the sampling point, and comparing the total side length of the projected polygon with the original side length. If the total side length after projection increases greatly over the original side length, the sealing is poor; if the change is small, the sealing is good.
  • The degree of increase can be expressed as the ratio of the increase to the original side length, and a corresponding score can be assigned according to the interval into which the ratio falls.
  • The other aspect is to calculate the resisting quality against the gravitational moment when the suction cup sucks the sample object at the sampling point along the grasping direction (also called the suction point direction).
  • The resisting quality can be calculated with a “wrench resistance” modeling scheme.
  • A “wrench” is a six-dimensional vector whose first three dimensions are forces and whose last three dimensions are moments.
  • The space formed by these six-dimensional vectors is the “wrench space”; “wrench resistance” indicates whether the combined wrench of the forces and moments acting at a point can resist an external load. If the gravitational wrench can be contained in the wrench space provided by the suction force and the moment generated by the friction force, stable suction can be provided; otherwise it cannot.
  • Normalizing the closure quality and the resisting quality into scores between 0 and 1 and summing them gives the suction quality evaluation result for each suction point.
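  • The scoring step can be summarised in a few lines: both partial scores are clipped to [0, 1] and combined (the embodiment uses a sum; an average or weighted average is also allowed). The scoring inputs are placeholders standing in for the polygon-projection and wrench-resistance computations described above.

```python
import numpy as np

def suction_quality(seal_score, wrench_score, mode="sum", w=(0.5, 0.5)):
    """Combine the closure (seal) score and the wrench-resistance score, each
    assumed to be pre-normalised into [0, 1], into one grasping quality value."""
    s = np.clip(seal_score, 0.0, 1.0)
    r = np.clip(wrench_score, 0.0, 1.0)
    if mode == "sum":
        return s + r                    # as in this embodiment: sum of the two scores
    if mode == "mean":
        return 0.5 * (s + r)
    return w[0] * s + w[1] * r          # weighted-average variant
```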
  • Step 4: build the initial simulation data-collection scene, that is, the initial scene, load multiple first objects selected from the sample objects into it, and use a physics engine to simulate the falling dynamics and final stacked postures of the first objects.
  • The 3D models of the first objects can be loaded into the simulation environment at random positions and postures, and each 3D model can be assigned a mass.
  • By simulating the effect of gravity, the 3D models of the first objects fall randomly into the material frame, and the physics engine simultaneously computes the collision information between the different first objects, so that the first objects form a stacked state very close to a real scene. Based on such a scheme, second positions and postures of the first objects close to real random stacking are obtained in the simulated scene.
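  • A minimal sketch of this drop simulation, using PyBullet as one example physics engine (this disclosure does not prescribe a particular engine); the mesh and bin file names, masses, and pose ranges are assumptions for illustration.

```python
import random
import pybullet as p

p.connect(p.DIRECT)                           # headless physics simulation
p.setGravity(0, 0, -9.81)
p.loadURDF("bin.urdf", useFixedBase=True)     # hypothetical material-frame model

body_ids = []
for mesh_path in ["obj_a.obj", "obj_b.obj"] * 5:   # hypothetical sample-object meshes
    col = p.createCollisionShape(p.GEOM_MESH, fileName=mesh_path)
    vis = p.createVisualShape(p.GEOM_MESH, fileName=mesh_path)
    pos = [random.uniform(-0.1, 0.1), random.uniform(-0.1, 0.1), random.uniform(0.3, 0.6)]
    orn = p.getQuaternionFromEuler([random.uniform(0, 3.14) for _ in range(3)])
    body_ids.append(p.createMultiBody(baseMass=0.2, baseCollisionShapeIndex=col,
                                      baseVisualShapeIndex=vis,
                                      basePosition=pos, baseOrientation=orn))

for _ in range(1000):                          # let the objects fall and settle
    p.stepSimulation()

# Second positions and postures of the stacked objects, used to place the sampling points:
poses = [p.getBasePositionAndOrientation(b) for b in body_ids]
```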
  • Step 5: generate annotation data for the sample images rendered from the simulated scene according to the grasping quality of the sampling points.
  • This step maps the sampling points obtained by grasping point sampling on the 3D models, and the grasping quality of each sampling point obtained by evaluation, into the simulated scene of stacked objects. Since the second position and posture of the 3D model of each first object are available from the simulation, and the positions of the sampling points are expressed in the model coordinate system of the 3D model, it is easy to calculate the positions of these sampling points in the simulated scene.
  • a simulated camera is added at a set position in the simulated environment, and a ray-tracing-based rendering engine is used to efficiently render the 2D image (such as a texture image) and depth map of the first object in the stacked scene.
  • the rendered depth image can be converted into a point cloud. Based on the calculated positions of the sampling points in the simulated scene and the rendered point cloud of the first object, the position of each sampling point in the point cloud of the first object to which it belongs can be determined.
  • a Gaussian filter may be performed on the grasping qualities of these sampling points based on the positions of the sampling points in the same first object.
  • the target grasping quality of other pixels in the 2D image may be obtained by interpolation according to the target grasping quality of the corresponding pixel.
  • The grasping quality of pixels with insufficient grasping space can be adjusted down, so that when the optimal grasping point is selected, low-quality grasping points that would cause collisions are filtered out.
  • the adjustment of the capture quality here may also be performed on corresponding pixels in the 2D image.
  • the capture quality heat map of the 2D image rendered by the simulated scene can be obtained.
  • The grasping quality heat map is output as the annotation data of the sample image, but the annotation data need not take the form of a heat map, as long as it contains information about the target grasping quality of pixels in the 2D image.
  • the estimation model may be driven to learn or fit the grasping quality heat map during training.
  • An embodiment of the present disclosure also provides a device for generating training data of an object grasping point estimation model, as shown in FIG. 3 , including a processor 60 and a memory 50 storing a computer program, wherein the processor 60 executes The computer program implements the method for generating training data of an object grasping point estimation model according to any embodiment of the present disclosure.
  • the processor in the embodiments of the present disclosure and other embodiments may be an integrated circuit chip, which has a signal processing capability.
  • The processor may be a general-purpose processor, such as a central processing unit (CPU) or a network processor (NP); it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
  • a general purpose processor may be a microprocessor or any conventional processor or the like.
  • The object physical model (that is, the 3D model) and the object geometric information are used to evaluate the quality of the grasping points based on basic physical principles, ensuring the rationality of the grasping point labeling.
  • Domain randomization technology is used to generate a large amount of synthetic data for training the estimation model by means of random textures, random illumination, random camera positions, and so on. The estimation model can therefore bridge the domain gap between synthetic data and real data and learn the local geometric features of objects, so as to accurately complete the task of estimating object grasping points.
  • An embodiment of the present disclosure also provides a method for training an estimation model of an object grasping point, as shown in FIG. 4 , including:
  • Step 310: acquire training data, the training data including sample images and the target grasping quality of pixels in the sample images;
  • Step 320: use the sample images as input data and train the estimation model of the object grasping point by machine learning; during training, the loss is computed according to the difference between the predicted grasping quality of pixels in the sample image output by the estimation model and the target grasping quality.
  • The training method of the estimation model in the embodiment of the present disclosure learns the grasping quality of pixels in the 2D image and then selects the optimal grasping point according to the predicted grasping quality of those pixels; compared with directly predicting the optimal grasping point, this approach offers better accuracy and stability.
  • the machine learning in the embodiments of the present disclosure may be supervised deep learning, non-deep learning machine learning, and the like.
  • the training data is generated according to the method for generating training data of the object grasping point estimation model described in any embodiment of the present disclosure.
  • the network architecture of the estimation model is shown in Figure 5, including:
  • the backbone network (Backbone) 10, which adopts a semantic segmentation network architecture (such as DeepLab or UNet) and is configured to extract features from the input 2D image and depth image;
  • the multi-branch network 20 adopts a multi-task learning network architecture and is configured to perform prediction based on the extracted features, so as to output the predicted capture quality of pixels in the 2D image.
  • the multi-branch network (also referred to as a network head or a detection head) includes:
  • the first branch network 21 learns semantic segmentation information to distinguish the foreground and background, and is configured to output the classification confidence of each pixel in the 2D image, and the classification includes foreground and background;
  • the second branch network 23 learns the grasping quality information of pixels in the 2D image, and is configured to output the predicted grasping quality of pixels classified as foreground determined according to the classification confidence in the 2D image. For example, pixels classified as foreground with confidence greater than a set confidence threshold may be referred to as foreground pixels.
  • This example involves classification, so the training data needs to include classified data.
  • the sample image includes a 2D image and a depth image; both the backbone network 10 and the multi-branch network 20 include depth channels, and the convolutional layers therein may adopt a 3D convolutional structure.
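  • For concreteness, a minimal PyTorch sketch of the two-headed layout in FIG. 5 is given below: a small encoder over the concatenated RGB-D input followed by a classification head (foreground/background logits) and a quality head (per-pixel predicted grasping quality). A real implementation would use a full semantic segmentation backbone such as DeepLab or UNet; the layer sizes here are placeholders.

```python
import torch
import torch.nn as nn

class GraspPointNet(nn.Module):
    def __init__(self, in_ch=4, feat=64):           # 4 channels: RGB + depth
        super().__init__()
        self.backbone = nn.Sequential(               # stand-in for a DeepLab/UNet backbone
            nn.Conv2d(in_ch, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(),
        )
        self.cls_head = nn.Conv2d(feat, 2, 1)        # branch 1: foreground / background logits
        self.quality_head = nn.Conv2d(feat, 1, 1)    # branch 2: per-pixel grasping quality

    def forward(self, rgb, depth):
        x = torch.cat([rgb, depth], dim=1)           # fuse colour and depth along the channel dimension
        f = self.backbone(x)
        return self.cls_head(f), torch.sigmoid(self.quality_head(f))

# model = GraspPointNet(); logits, quality = model(rgb_batch, depth_batch)
```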
  • The loss of the first branch network 21 is calculated from the classification loss of all pixels in the 2D image; the loss of the second branch network 23 is calculated from the difference between the predicted grasping quality and the target grasping quality of some or all of the pixels classified as foreground; the loss of the backbone network 10 is calculated from the total loss of the first branch network 21 and the second branch network 23.
  • the parameters of each network can be optimized using the gradient descent algorithm until the loss is minimized and the model converges.
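  • A hedged sketch of this loss: a cross-entropy classification loss over all pixels for the first branch, a regression loss over foreground pixels only for the second branch, and their sum back-propagated through the shared backbone. The choice of L1 regression and equal loss weights are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def grasp_losses(cls_logits, pred_quality, cls_target, target_quality):
    """cls_logits: B x 2 x H x W, pred_quality: B x 1 x H x W,
    cls_target: B x H x W long tensor (0 = background, 1 = foreground),
    target_quality: B x 1 x H x W annotated target grasping quality."""
    loss_cls = F.cross_entropy(cls_logits, cls_target)           # branch 1: all pixels
    fg = (cls_target == 1).unsqueeze(1).float()                  # foreground mask
    loss_q = (F.l1_loss(pred_quality, target_quality, reduction="none") * fg).sum() / fg.sum().clamp(min=1)
    return loss_cls + loss_q                                     # total loss optimised by gradient descent
```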
  • The depth image can also be randomly masked out in blocks, for example 64*64 pixels at a time, so that the network learns to better utilize the structured information in the depth image.
  • After the above estimation model has been trained with the training data for multiple iterations, verification data are used to verify the accuracy of the trained estimation model. The verification data can be generated in the same way as the training data. Once the accuracy of the estimation model meets the requirements, the estimation model is considered well trained and can be used; if the accuracy does not meet the requirements, training continues.
  • When the trained model is used, the 2D image and the depth image containing the actual objects to be grasped are input, and the predicted grasping quality of the pixels in the 2D image is output.
  • the embodiments of the present disclosure use a multi-task learning framework based on deep learning principles to build a grasping point estimation model, which can effectively solve the problems of high error rate and inability to distinguish adjacent objects in a simple point cloud segmentation scheme.
  • An embodiment of the present disclosure also provides a training device for an estimation model of an object grasping point, referring to FIG. 3 , which includes a processor and a memory storing a computer program, wherein, when the processor executes the computer program, the following is implemented: A method for training an estimation model of an object grasping point described in any embodiment of the present disclosure.
  • The estimation model trained by this method predicts the grasping quality of pixels in the 2D image through pixel-level dense prediction: one branch performs pixel-level foreground/background classification prediction, while another branch outputs a grasping quality prediction value (the predicted grasping quality) for each pixel classified as foreground in the 2D image.
  • Both the backbone network and the branch networks of the estimation model in the embodiment of the present disclosure include depth channels. At the input end, the depth image containing the depth channel information is fed into the backbone network; the features learned from the depth channel are then fused, along the channel dimension, with the features of the color 2D image, and pixel-by-pixel multi-task prediction is performed. This helps the estimation model better handle the grasping point estimation task in scenes where the objects to be grasped are stacked.
  • An embodiment of the present disclosure also provides a method for estimating an object grasping point, as shown in FIG. 6 , including:
  • Step 410: acquire a scene image containing the objects to be grasped, where the scene image includes a 2D image, or includes a 2D image and a depth image;
  • Step 420: input the scene image into the estimation model of the object grasping point, the estimation model being a model trained by the training method described in any embodiment of the present disclosure;
  • Step 430: determine the position of the grasping point of the object to be grasped according to the predicted grasping quality of the pixels in the 2D image output by the estimation model.
  • In the embodiments of the present disclosure, the camera is driven so that scene images of the objects to be grasped, such as 2D images and depth images, can be captured by depth cameras adapted to various industrial scenes. After the color 2D image and the depth image are acquired from the depth camera, they are cropped and scaled to the input size required by the estimation model and then fed into the estimation model.
  • determining the position of the grasping point of the object to be grasped according to the predicted grasping quality of the pixels in the 2D image output by the estimation model includes:
  • the obtained candidate grasping points are sorted based on a predetermined rule, and an optimal candidate grasping point is determined as the grasping point of the object to be grasped according to the ranking.
  • When the obtained candidate grasping points are sorted based on predetermined rules, the sorting may be based on predetermined heuristic rules. The heuristic rules may be set, for example, according to the distance of the grasping point from the camera, whether the grasping point lies inside the actual material frame, and whether grasping at that point would cause a collision; these criteria are used to rank the candidate grasping points, and the best candidate grasping point is determined as the grasping point of the object to be grasped.
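  • One way to express such heuristic ranking is a simple additive score over the conditions mentioned above; the helper predicates (bin membership, collision check, distance to camera) and the weights below are hypothetical illustrations rather than a rule set prescribed by this disclosure.

```python
def rank_candidates(candidates, in_bin, collision_free, dist_to_camera):
    """Sort candidate grasping points, best first.  `in_bin[c]` and
    `collision_free[c]` are assumed boolean maps, `dist_to_camera[c]` is in metres;
    the weights are illustrative."""
    def score(c):
        s = 0.0
        s += 1.0 if in_bin[c] else -10.0          # must lie inside the material frame
        s += 1.0 if collision_free[c] else -10.0  # reject points whose grasp would collide
        s += -0.5 * dist_to_camera[c]             # prefer points closer to the camera
        return s
    return sorted(candidates, key=score, reverse=True)
```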
  • An embodiment of the present disclosure also provides a device for estimating the grasping point of an object, including a processor and a memory storing a computer program, wherein, when the processor executes the computer program, the method for estimating object grasping points described in any embodiment of the present disclosure is implemented.
  • Based on the trained estimation model, the above embodiments of the present disclosure feed the 2D image and the depth image captured by the camera into the estimation model for forward inference and output the predicted grasping quality of the pixels in the 2D image. If the number of pixels whose predicted grasping quality exceeds a set quality threshold is larger than a set number, a set number of pixels with the best predicted grasping quality (for example, the top 50 or top 100) can be selected. After the selected pixels are clustered and one or more cluster centers are computed, the pixel in the 2D image nearest to each cluster center (which may be a single pixel or a pixel in a region) can be taken as a candidate grasping point. Since the adopted estimation model achieves good accuracy, the estimation method and device of this embodiment can improve the accuracy of object grasping point estimation and thereby the success rate of grasping.
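  • The candidate selection just described can be sketched as: keep a set number of best-quality pixels above the threshold, cluster them, and take the pixel nearest each cluster center as a candidate. K-means is used here only as one possible clustering choice; the threshold, top-K size, and cluster count are illustrative values.

```python
import numpy as np
from sklearn.cluster import KMeans

def candidate_grasp_pixels(quality_map, quality_thresh=0.7, top_k=100, n_clusters=3):
    """quality_map: H x W predicted grasping quality. Returns (row, col) candidates."""
    ys, xs = np.where(quality_map > quality_thresh)
    if len(ys) == 0:
        return []
    order = np.argsort(quality_map[ys, xs])[::-1][:top_k]      # keep the best `top_k` pixels
    pts = np.stack([ys[order], xs[order]], axis=1).astype(float)
    k = min(n_clusters, len(pts))
    centers = KMeans(n_clusters=k, n_init=10).fit(pts).cluster_centers_
    # the selected pixel nearest each cluster center becomes a candidate grasping point
    return [tuple(pts[np.argmin(np.linalg.norm(pts - c, axis=1))].astype(int)) for c in centers]
```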
  • An embodiment of the present disclosure also provides a robot vision system, as shown in FIG. 7 , including:
  • the camera 1 is configured to shoot a scene image containing an object to be captured, and the scene image includes a 2D image, or includes a 2D image and a depth image;
  • the control device 2, which includes the object grasping point estimation device according to an embodiment of the present disclosure and is configured to determine the position of the grasping point of the object to be grasped according to the scene image captured by the camera, and to control the grasping action performed by the robot according to the position of the grasping point;
  • the robot 3 is configured to perform the grasping action.
  • the robot vision system of the embodiments of the present disclosure can improve the accuracy of object grasping point estimation, thereby improving the success rate of grasping.
  • An embodiment of the present disclosure also provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements the method for generating training data of the object grasping point estimation model described in any embodiment of the present disclosure, or the method for training the object grasping point estimation model described in any embodiment of the present disclosure, or the method for estimating object grasping points described in any embodiment of the present disclosure.
  • Computer-readable media may include computer-readable storage media that correspond to tangible media such as data storage media, or communication media including any medium that facilitates transfer of a computer program from one place to another, eg, according to a communication protocol.
  • a computer-readable medium may generally correspond to a non-transitory tangible computer-readable storage medium or a communication medium such as a signal or carrier wave.
  • Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code and/or data structures for implementation of the techniques described in this disclosure.
  • a computer program product may comprise a computer readable medium.
  • By way of example and not limitation, such computer-readable storage media may include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk or other magnetic storage, flash memory, or any other medium that can be used to store the desired program code in the form of instructions or data structures and that can be accessed by a computer.
  • Also, any connection may properly be termed a computer-readable medium. For example, if instructions are transmitted from a website, server, or other remote source over coaxial cable, fiber-optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave, then those media are included in the definition of medium.
  • Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
  • The techniques may be implemented by one or more processors, such as one or more digital signal processors (DSPs), general-purpose microprocessors, application-specific integrated circuits (ASICs), field-programmable logic arrays (FPGAs), or other equivalent integrated or discrete logic circuitry.
  • processors may refer to any of the foregoing structure or any other structure suitable for implementation of the techniques described herein.
  • the functionality described herein may be provided within dedicated hardware and/or software modules configured for encoding and decoding, or incorporated in a combined codec. Also, the techniques may be fully implemented in one or more circuits or logic elements.
  • the technical solutions of the embodiments of the present disclosure may be implemented in a wide variety of devices or devices, including a wireless handset, an integrated circuit (IC), or a set of ICs (eg, a chipset).
  • Various components, modules, or units are described in the disclosed embodiments to emphasize functional aspects of devices configured to perform the described techniques, but they do not necessarily require realization by different hardware units. Rather, as described above, the various units may be combined in a codec hardware unit or provided by a collection of interoperable hardware units (including one or more processors as described above) in combination with suitable software and/or firmware.


Abstract

An object grabbing point estimation method, apparatus and system, a model training method, apparatus and system, and a data generation method, apparatus and system. The data generation method comprises: sampling a grabbing point on the basis of a 3D model of a sample object, and assessing the grabbing quality of the sampled point; and rendering a simulation scene, into which a 3D model of a first object is loaded, so as to generate a sample image for training and target grabbing quality of a pixel point therein. The sample image and the target grabbing quality are taken as training data to train an object grabbing point estimation model, and the trained model is used for estimating an object grabbing point. By means of the embodiments of the present disclosure, automatic labeling of a sample image can be realized, training data can be efficiently generated at high quality, and the estimation precision of a grabbing point is improved.

Description

Object Grasping Point Estimation, Model Training and Data Generation Methods, Devices and Systems
Technical Field
The present disclosure relates to, but is not limited to, artificial intelligence technology, and in particular to a method, device and system for object grasping point estimation, model training and data generation.
Background Art
In robot vision guidance applications, a challenge faced by the robot vision system is that it must guide the robot to grasp thousands of different stock keeping units (SKUs). These objects are usually unknown to the system, or their variety is so large that maintaining physical models or texture templates for every SKU is prohibitively expensive. The simplest example is the depalletizing application: although the objects to be grasped are all cuboid objects (boxes or cartons), their texture, size and other attributes vary from scene to scene, so classic object localization or recognition schemes based on template matching are difficult to apply in such scenarios. In some e-commerce warehousing scenarios, many objects have irregular shapes, the most common being box-like and bottle-like objects; these goods are stacked together, and the robot vision guidance system must efficiently pick them out of the stack one by one for subsequent code scanning or identification and delivery into the appropriate target bin.
In this process, how the robot vision system, without prior knowledge of the objects, can estimate the most suitable grasping point for the robot (which may be, but is not limited to, a suction point) from the scene captured by the camera and guide the robot to perform the grasping action remains a problem to be solved.
Summary of the Invention
The following is an overview of the subject matter described in detail herein. This overview is not intended to limit the scope of the claims.
An embodiment of the present disclosure provides a method for generating training data for an object grasping point estimation model, including:
acquiring a 3D model of a sample object, sampling grasping points based on the 3D model of the sample object, and evaluating the grasping quality of the sampled points;
rendering a simulated scene loaded with a 3D model of a first object to generate a sample image for training, the first object being selected from the sample objects; and
generating a target grasping quality of pixels in the sample image according to the grasping quality of the sampling points of the first object.
An embodiment of the present disclosure also provides a device for generating training data for an object grasping point estimation model, including a processor and a memory storing a computer program, wherein the processor, when executing the computer program, implements the method for generating training data for an object grasping point estimation model described in any embodiment of the present disclosure.
The methods and devices of the above embodiments of the present disclosure realize automatic annotation of sample images and can generate training data efficiently and with high quality, avoiding the heavy workload and unstable annotation quality associated with manual annotation.
An embodiment of the present disclosure provides a training method for an object grasping point estimation model, including:
acquiring training data, the training data including a sample image and a target grasping quality of pixels in the sample image;
training the object grasping point estimation model by machine learning with the sample image as input data, where during training a loss is computed from the difference between the predicted grasping quality of pixels in the sample image output by the estimation model and the target grasping quality;
wherein the estimation model includes a backbone network adopting a semantic segmentation network architecture and a multi-branch network adopting a multi-task learning network architecture.
An embodiment of the present disclosure also provides a training device for an object grasping point estimation model, including a processor and a memory storing a computer program, wherein the processor, when executing the computer program, implements the training method for the object grasping point estimation model described in any embodiment of the present disclosure.
The methods and devices of the above embodiments learn, through training, the grasping quality of pixels in a 2D image, which provides better accuracy and stability than directly regressing an optimal grasping point.
An embodiment of the present disclosure provides a method for estimating an object grasping point, including:
acquiring a scene image containing an object to be grasped, where the scene image includes a 2D image, or includes a 2D image and a depth image;
inputting the scene image into an object grasping point estimation model, where the estimation model is a model trained by the training method described in any embodiment of the present disclosure; and
determining the position of the grasping point of the object to be grasped according to the predicted grasping quality of pixels in the 2D image output by the estimation model.
An embodiment of the present disclosure also provides a device for estimating an object grasping point, including a processor and a memory storing a computer program, wherein the processor, when executing the computer program, implements the method for estimating an object grasping point described in any embodiment of the present disclosure.
An embodiment of the present disclosure also provides a robot vision system, including:
a camera, configured to capture a scene image containing an object to be grasped, where the scene image includes a 2D image, or includes a 2D image and a depth image;
a control device, including the device for estimating an object grasping point according to an embodiment of the present disclosure, the control device being configured to determine the position of the grasping point of the object to be grasped according to the scene image captured by the camera, and to control the grasping action performed by the robot according to the position of the grasping point; and
a robot, configured to perform the grasping action.
The estimation methods, devices and robot vision systems of the above embodiments of the present disclosure can improve the accuracy of object grasping point estimation and thereby increase the success rate of grasping.
An embodiment of the present disclosure also provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements the method for generating training data for an object grasping point estimation model described in any embodiment of the present disclosure, or the training method for an object grasping point estimation model described in any embodiment of the present disclosure, or the method for estimating an object grasping point described in any embodiment of the present disclosure.
Other aspects will become apparent upon reading and understanding the drawings and the detailed description.
Brief Description of the Drawings
The accompanying drawings are provided for an understanding of the embodiments of the present disclosure and constitute a part of the specification; together with the embodiments of the present disclosure, they serve to explain the technical solutions of the present disclosure and do not constitute a limitation on those technical solutions.
FIG. 1 is a flowchart of a method for generating training data for an object grasping point estimation model according to an embodiment of the present disclosure;
FIG. 2 is a flowchart of generating annotation data according to the grasping quality of sampling points in FIG. 1;
FIG. 3 is a schematic diagram of a device for generating training data according to an embodiment of the present disclosure;
FIG. 4 is a flowchart of a training method for an object grasping point estimation model according to an embodiment of the present disclosure;
FIG. 5 is a network structure diagram of an estimation model according to an embodiment of the present disclosure;
FIG. 6 is a flowchart of a method for estimating an object grasping point according to an embodiment of the present disclosure;
FIG. 7 is a schematic structural diagram of a robot vision system according to an embodiment of the present disclosure.
Detailed Description
The present disclosure describes a number of embodiments, but the description is illustrative rather than restrictive, and it will be apparent to those of ordinary skill in the art that many more embodiments and implementations are possible within the scope of the embodiments described in the present disclosure.
In the description of the present disclosure, words such as "exemplary" or "for example" are used to indicate an example, illustration or explanation. Any embodiment described in the present disclosure as "exemplary" or "for example" should not be construed as preferred or advantageous over other embodiments. "And/or" herein describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. "A plurality of" means two or more. In addition, in order to clearly describe the technical solutions of the embodiments of the present disclosure, words such as "first" and "second" are used to distinguish identical or similar items having substantially the same functions and effects. Those skilled in the art will understand that words such as "first" and "second" do not limit quantity or execution order, nor do they require the items to be different.
In describing representative exemplary embodiments, the specification may have presented a method and/or process as a particular sequence of steps. However, to the extent that the method or process does not depend on the particular order of steps described herein, it should not be limited to that particular order. As those of ordinary skill in the art will appreciate, other orders of steps are also possible; therefore, the particular order of steps set forth in the specification should not be construed as a limitation on the claims. Furthermore, claims directed to the method and/or process should not be limited to performing their steps in the order written; those skilled in the art can readily appreciate that these orders may be varied while still remaining within the spirit and scope of the embodiments of the present disclosure.
With the development of deep learning technology, tasks such as detection (estimation of 2D object position and size) and segmentation (pixel-level object category prediction or instance index prediction) can already be accomplished by training visual neural network models; object grasping point estimation can likewise be realized as a data-driven method and apparatus based on a deep learning framework and appropriate training data.
In one solution, after the camera captures a color image and a depth image, the point cloud is used for plane segmentation or Euclidean-distance-based segmentation in order to segment and detect the different objects in the scene; the center of each segmented cluster of points is then taken as a grasping point candidate, a series of heuristic rules is used to rank the candidates, and the robot is finally guided to grasp the best candidate. A feedback system is also introduced to record the success or failure of each grasp; if a grasp succeeds, the current object is used as a template to match the grasping point for the next grasp. The problem with this solution is that the performance of plain point cloud segmentation is relatively weak, producing many erroneous grasping points, and the point cloud segmentation scheme easily fails when the objects are closely packed.
In another solution, a deep learning framework is used: a limited amount of data is manually annotated with grasping point directions and regions to obtain training data, and a neural network model is trained on these data. During system operation, the vision system can process images similar to the training set and estimate the object grasping points in them. The problem with this solution is the high cost of data collection and annotation; at the annotation level in particular, grasping point directions and regions are difficult to label and require annotators with strong technical skills, and the annotations involve many human factors, so annotation quality cannot be systematically controlled and a model with systematic quality assurance cannot be produced.
An embodiment of the present disclosure provides a method for generating training data for an object grasping point estimation model, as shown in FIG. 1, including:
Step 110: acquiring a 3D model of a sample object, sampling grasping points based on the 3D model of the sample object, and evaluating the grasping quality of the sampled points.
The sample objects may be various box-shaped items, bottle-shaped items, box-like objects and bottle-like objects, or objects of other shapes. Sample objects can usually be selected from the items actually to be grasped, but they are not required to cover every type of item to be grasped. Items whose geometry is typical of the items to be grasped can usually be chosen as sample objects, but the present disclosure does not require the sample objects to cover the shapes of all items to be grasped; owing to the generalization ability of the model, the trained model can still estimate grasping points for items of other shapes.
Step 120: rendering a simulated scene loaded with a 3D model of a first object to generate a sample image for training, the first object being selected from the sample objects.
The loaded first objects may be selected from the sample objects randomly by the system, manually, or according to configured rules. The selected first objects may include one kind of sample object or several kinds, and may include sample objects of one shape or of several shapes; this embodiment is not limited in this respect.
Step 130: generating a target grasping quality of pixels in the sample image according to the grasping quality of the sampling points of the first object.
The target grasping quality of pixels in the sample image here may be the target grasping quality of some of the pixels in the sample image or of all of them, and the annotation may be made pixel by pixel or for a set of pixels, such as a region of the sample image containing two or more pixels. Because the annotated grasping quality of pixels in the sample image serves as the target data during training, it is referred to herein as the target grasping quality of the pixels.
The embodiments of the present disclosure first acquire the 3D model of the sample object and, based on the 3D model, sample grasping points and evaluate the grasping quality of the sampled points; because the geometry of the 3D model itself is accurate, the grasping quality evaluation can be completed with high quality. After the 3D model of the selected first object is loaded to generate the simulated scene, the position and posture of the 3D model can be tracked during loading, so the positional relationship between the sampling points and the pixels in the sample image can be calculated and the grasping quality of the sampling points can be transferred to the corresponding pixels in the sample image. The training data generated by the embodiments of the present disclosure include sample images and annotation data (including but not limited to the target grasping quality); the embodiments thus realize automatic annotation of sample images, can generate training data efficiently and with high quality, and avoid the heavy workload and unstable annotation quality associated with manual annotation.
In an exemplary embodiment of the present disclosure, acquiring the 3D model of the sample object includes: creating or collecting a 3D model of the sample object, and normalizing it so that the centroid of the sample object is located at the origin of the model coordinate system of the 3D model and the principal axis of the sample object is aligned with one coordinate axis of the model coordinate system. When the 3D model is created in this embodiment, the normalization can take the form of a unified modeling rule: the origin of the model coordinate system is placed at the centroid of the sample object, and one coordinate axis of the model coordinate system is aligned with the direction of the object's principal axis. If an already created 3D model is collected, the normalization can be achieved by translating and rotating the 3D model so that the centroid is at the origin and the principal axis is aligned with a coordinate axis, as required above.
In an exemplary embodiment of the present disclosure, sampling grasping points based on the 3D model of the sample object includes: performing point cloud sampling on the 3D model of the sample object, and determining and recording a first position and a grasping direction of each sampling point in the 3D model, where the first position is expressed by the coordinates of the sampling point in the model coordinate system of the 3D model and the grasping direction is determined from the normal vector of the sampling point in the 3D model. When performing point cloud sampling based on the 3D model, this embodiment may sample uniformly over the surface of the sample object; the specific algorithm is not limited. By setting the number of sampling points appropriately, the sampling points on the surface of the sample object can be given an appropriate density so that suitable grasping points are not missed. In one example, the normal vector of the plane fitted to all points within a set neighborhood of a sampling point may be used as the normal vector of that sampling point.
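As one way to realize the uniform surface sampling described above, the following is a minimal farthest point sampling sketch in Python over a point set taken from the model surface; it is an illustrative assumption, not necessarily the sampler used in the embodiments, and the mesh variable in the usage comment is hypothetical.

```python
import numpy as np

def farthest_point_sampling(points: np.ndarray, n_samples: int, seed: int = 0) -> np.ndarray:
    """Select n_samples indices from an (N, 3) point set so that the chosen
    points are spread roughly uniformly over the object surface."""
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    chosen = np.zeros(n_samples, dtype=np.int64)
    chosen[0] = rng.integers(n)
    dist = np.full(n, np.inf)                     # squared distance to the chosen set
    for i in range(1, n_samples):
        diff = points - points[chosen[i - 1]]
        dist = np.minimum(dist, np.einsum("ij,ij->i", diff, diff))
        chosen[i] = int(np.argmax(dist))          # farthest point from all chosen so far
    return chosen

# Usage (hypothetical mesh object): sample 512 candidate grasping points.
# surface_points = np.asarray(mesh.vertices)
# grasp_candidates = surface_points[farthest_point_sampling(surface_points, 512)]
```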
In an exemplary embodiment of the present disclosure, evaluating the grasping quality of the sampling points includes: in a scenario where a single suction cup is used to pick up the sample object, estimating the grasping quality of each sampling point from its seal quality and its resistance quality, where the seal quality is determined by the degree of airtightness between the end of the suction cup and the surface of the sample object when the suction cup picks up the sample object at that sampling point with its axis aligned with the grasping direction of the point, and the resistance quality is determined by the degree to which the torque the suction cup can generate when picking up the object counteracts the gravitational torque of the sample object in that situation.
In the embodiments of the present disclosure, when the suction cup is used to pick up the sample object, the gravitational torque tends to rotate the sample object and make it fall (the mass of the sample object is assigned at configuration time), while the suction force of the suction cup on the sample object and the friction between the end of the suction cup and the sample object can provide a torque that counteracts the gravitational torque and prevents the sample object from falling; the suction force and friction force may be given as configuration information or calculated from configuration information (such as suction cup parameters, object material, etc.). The degree of counteraction therefore reflects how stable the object is during suction and can be calculated by the relevant formulas. The seal quality and the resistance quality may each be given a score, and the sum, average, or weighted average of the two scores is then taken as the grasping quality of the sampling point. The seal quality and resistance quality of a sampling point are determined by the local geometric properties of the 3D model and fully reflect the relationship between the local geometric information of the object and the quality of a grasping point, so an accurate evaluation of the grasping quality of the sampling points can be achieved.
Although the embodiments of the present disclosure take picking up an object with a single suction cup as an example, the present disclosure is not limited thereto; for grasping modes that pick up an object at multiple points or clamp an object at multiple points, the grasping quality of sampling points can likewise be evaluated using indicators that reflect grasping efficiency, object stability and probability of success.
In an exemplary embodiment of the present disclosure, the simulated scene is obtained by loading the 3D model of the first object into an initial scene, and the loading process includes:
selecting the type and quantity of the first objects to be loaded from the sample objects, and assigning a mass to each first object;
loading the 3D models of the first objects into the initial scene at random positions and postures;
using a physics engine to simulate the falling process of the first objects and the finally formed stacking state, to obtain the simulated scene; and
recording a second position and posture of the 3D model of each first object in the simulated scene.
Through the above loading process, the embodiments of the present disclosure can simulate various scenes of stacked objects; the training data generated from such scenes make the model trained on them suitable for estimating object grasping points in complex scenes of stacked objects, solving the problem that grasping points are difficult to estimate in such scenes. A simulated bin may be set in the initial scene, the 3D models of the first objects loaded into the bin, and the collision process between the first objects and between the first objects and the bin simulated, so that the finally formed simulated scene of stacked objects is closer to a real scene. The bin is not essential, however; in other embodiments of the present disclosure, a simulated scene in which the first objects are stacked in an orderly manner may also be loaded, depending on the need to simulate the actual working scene.
For the same initial scene, the first objects may be loaded multiple times in different ways to obtain multiple simulated scenes; the different ways may include, for example, different types and/or quantities of loaded first objects, different initial positions and postures of the 3D models at loading time, and so on.
In an exemplary embodiment of the present disclosure, rendering the simulated scene to generate sample images for training includes: rendering each simulated scene at least twice to obtain at least two sets of sample images for training, where in each rendering a simulated camera is added to the simulated scene, a light source is set, and textures are added to the loaded first objects, and the rendered 2D image and depth image form one set of sample images; any two of the multiple renderings differ in at least one of the following parameters: object texture, simulated camera parameters, lighting parameters. In this embodiment the simulated environment is lit during image rendering; by adjusting the simulated camera parameters (such as intrinsics, position, angle, etc.), the lighting parameters (such as light color and intensity) and the object textures, the degree of data randomization can be increased and the content and number of sample images enriched, thereby improving the quality of the training data and, in turn, the performance of the trained estimation model.
In one example of this embodiment, adding textures to the loaded first objects in each rendering includes: in each rendering, for each first object loaded into the simulated scene, randomly selecting one of multiple collected real textures and applying it to the surface of that first object; or, in each rendering, for each type of first object loaded into the simulated scene, randomly selecting one of multiple collected real textures and applying it to the surfaces of the first objects of that type. This example uses randomization to bridge the domain gap between real data and simulated data. The real textures may, for example, be collected from images of actual objects, images showing real textures, and so on. By randomly applying the selected textures to the surfaces of the randomly stacked first objects in the simulated scene, multiple images with different textures can be rendered. By providing the object grasping point estimation model with sample images that have different textures but relatively consistent geometric information, and generating the annotation information from the grasping quality of sampling points computed from local geometric information, the embodiments of the present disclosure make the estimation model use local geometric information to predict the grasping quality of grasping points, thereby achieving generalization of the model to unknown objects.
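As a minimal sketch of the per-rendering randomization described in the last two paragraphs (a random real texture per loaded object, perturbed simulated camera parameters, and random light color and intensity), the Python snippet below draws one parameter set; the value ranges and the render_scene call are assumptions, since the disclosure does not prescribe a particular renderer.

```python
import numpy as np

def randomize_render_params(object_ids, texture_paths, rng=None):
    """Draw one randomized set of rendering parameters for a simulated scene."""
    rng = rng or np.random.default_rng()
    return {
        # a random collected real texture for every loaded first object
        "textures": {oid: texture_paths[rng.integers(len(texture_paths))] for oid in object_ids},
        # perturbed simulated camera intrinsics and pose (illustrative ranges)
        "focal_length_px": float(rng.uniform(500.0, 700.0)),
        "camera_position": (np.array([0.0, 0.0, 1.2]) + rng.normal(0.0, 0.05, 3)).tolist(),
        "camera_tilt_deg": float(rng.uniform(-10.0, 10.0)),
        # random light color (RGB) and intensity
        "light_color": rng.uniform(0.6, 1.0, 3).tolist(),
        "light_intensity": float(rng.uniform(0.5, 2.0)),
    }

# Each simulated scene is rendered at least twice with independently drawn parameters:
# rgb_a, depth_a = render_scene(scene, randomize_render_params(ids, textures))  # hypothetical call
# rgb_b, depth_b = render_scene(scene, randomize_render_params(ids, textures))
```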
In an exemplary embodiment of the present disclosure, the sample image includes a 2D image and a depth image, and generating the target grasping quality of pixels in the sample image according to the grasping quality of the sampling points of the first object includes:
processing each simulated scene for which a 2D image and a depth image have been rendered as follows, as shown in FIG. 2:
Step 210: obtaining a point cloud of the first objects visible in the simulated scene from the intrinsic parameters of the simulated camera used at rendering time and the rendered depth image;
Step 220: determining the positions of target sampling points in the point cloud according to the first positions of the target sampling points in the 3D model, the second positions of the 3D models in the simulated scene and the posture change after loading, where the target sampling points are the sampling points of the visible first objects;
Step 230: determining the grasping quality of points in the point cloud according to the grasping quality of the target sampling points and their positions in the point cloud, and annotating it as the target grasping quality of the corresponding pixels in the 2D image.
The point cloud of the visible first objects obtained from the intrinsic parameters of the simulated camera at rendering time and the depth image (and possibly other information) is not necessarily aligned with the positions of the target sampling points computed from the above first positions, second positions and posture changes. The points of the point cloud have a pixel-level one-to-one correspondence with the pixels of the 2D image, whereas a target sampling point mapped into the 2D image does not necessarily coincide with a particular pixel and may fall between several pixels. It is therefore necessary to determine the grasping quality of the points in the point cloud from the grasping quality of the target sampling points and their positions in the point cloud.
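To make steps 210 to 230 concrete, the sketch below transforms a target sampling point from the model coordinate system into the camera frame using the recorded load pose, projects it with pinhole intrinsics, and writes its quality into the nearest pixel of a sparse label image; the simple rounding rule and the variable names are illustrative assumptions (the embodiments may instead assign quality to neighboring cloud points, as described next).

```python
import numpy as np

def sample_point_to_pixel(p_model, T_world_model, T_world_cam, K):
    """Map one sampling point (model frame) to (row, col) in the rendered image.
    T_* are 4x4 homogeneous transforms; K is the 3x3 camera intrinsic matrix."""
    p_world = T_world_model @ np.append(p_model, 1.0)   # model frame -> world (loaded pose)
    p_cam = np.linalg.inv(T_world_cam) @ p_world        # world -> camera frame
    u, v, w = K @ p_cam[:3]
    return int(round(v / w)), int(round(u / w))         # may fall between pixels; rounded here

def quality_label_image(samples, qualities, T_world_model, T_world_cam, K, shape):
    """Write each sampling point's grasp quality into a sparse (H, W) target image."""
    target = np.full(shape, np.nan)                      # NaN marks unlabeled pixels
    for p, q in zip(samples, qualities):
        r, c = sample_point_to_pixel(p, T_world_model, T_world_cam, K)
        if 0 <= r < shape[0] and 0 <= c < shape[1]:
            target[r, c] = q if np.isnan(target[r, c]) else max(q, target[r, c])
    return target
```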
In an exemplary embodiment of the present disclosure, determining the grasping quality of the points in the point cloud according to the grasping quality of the target sampling points and their positions in the point cloud includes:
first, for each target sampling point, setting the grasping quality of the points in the point cloud adjacent to that target sampling point to the grasping quality of that target sampling point; or
second, for each point in the point cloud, obtaining its grasping quality by interpolating the grasping qualities of the target sampling points adjacent to it; or
third, for each target sampling point, setting the grasping quality of the points in the point cloud adjacent to that target sampling point to the grasping quality of that target sampling point, and, after the points adjacent to all target sampling points have been handled, obtaining the grasping quality of the other points in the point cloud by interpolation.
The embodiments of the present disclosure thus provide several methods for transferring the grasping quality of the target sampling points to the points of the point cloud. In the first method, the grasping quality of a target sampling point is assigned to its neighboring points in the point cloud. In one example, the neighboring points may be the one or more points of the point cloud closest to the target sampling point; for instance, based on a set distance threshold, the points of the point cloud whose distance to the target sampling point is smaller than the threshold may be selected as the points adjacent to that target sampling point. The second method is an interpolation method: the grasping quality of a point in the point cloud is interpolated from the grasping qualities of several neighboring target sampling points. The interpolation may be based on Gaussian filtering, or different weights may be assigned to the target sampling points according to their distances to the point, with larger distances receiving smaller weights, and the grasping quality of the point obtained as the weighted average of the grasping qualities of those target sampling points; other interpolation methods may also be used. The target sampling points adjacent to a point may likewise be selected using a set distance threshold, and if only one adjacent target sampling point is found for a point, the grasping quality of that target sampling point may be assigned to it. The third method first determines the grasping quality of the points in the point cloud adjacent to the target sampling points, and then obtains the grasping quality of the other points in the point cloud by interpolation from the grasping quality of those points. Both the second and the third methods yield the grasping quality of all points in the point cloud; after the grasping quality of these points is mapped to the grasping quality of the corresponding pixels in the 2D image, a grasping quality heat map of the 2D image can be drawn. With the first method, only the grasping quality of some points in the point cloud is obtained and, through the mapping, only the grasping quality of some pixels in the 2D image, which is also acceptable; in that case, during training, only the predicted grasping quality of those pixels is compared with the target grasping quality to compute the loss, and the model is then optimized according to the loss.
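The second transfer method above can, for example, be realized with inverse-distance weighting over the target sampling points that lie within a distance threshold of each cloud point; the sketch below shows that variant (a Gaussian-filter-based interpolation would be an alternative), with the radius value chosen purely for illustration.

```python
import numpy as np

def interpolate_quality(cloud_points, sample_points, sample_quality, radius=0.01):
    """For every point of the rendered cloud, interpolate the grasp quality from the
    target sampling points closer than `radius`, weighting by inverse distance."""
    quality = np.zeros(len(cloud_points))
    for i, p in enumerate(cloud_points):
        d = np.linalg.norm(sample_points - p, axis=1)
        near = d < radius
        if not np.any(near):
            continue                                  # no nearby sampling point: stays 0
        w = 1.0 / (d[near] + 1e-9)                    # larger distance -> smaller weight
        quality[i] = float(np.sum(w * sample_quality[near]) / np.sum(w))
    return quality
```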
In an exemplary embodiment of the present disclosure, after the grasping quality of the points in the point cloud is determined according to the grasping quality of the target sampling points and their positions in the point cloud, the generation method further includes: for each target sampling point, taking the grasping direction of that target sampling point as the grasping direction of the points in the point cloud adjacent to it and, in combination with the relative positional relationship between the visible first objects in the point cloud, if it is determined that the grasping space at the points adjacent to that target sampling point is smaller than the required grasping space, lowering the grasping quality of the points in the point cloud whose distance to that target sampling point is smaller than a set distance threshold. The embodiments of the present disclosure take into account that, in the stacked state, a grasping point of good quality on an object may not have enough space for the grasping operation because of neighboring objects; therefore, after the grasping quality of the points in the point cloud is determined, a grasping space check is performed, and the grasping quality of the points affected by insufficient grasping space is adjusted downward, specifically below a set quality threshold, to prevent them from being selected.
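One possible form of the grasping space check described above is sketched below: the quality of cloud points near a sampling point is lowered whenever points belonging to other objects intrude into a cylinder of the suction cup's radius along the approach direction. The radii, clearance, and penalty value are assumptions made only for illustration.

```python
import numpy as np

def downweight_blocked_points(cloud_pts, cloud_obj_ids, quality,
                              sample_pt, sample_dir, sample_obj_id,
                              cup_radius=0.015, clearance=0.05,
                              near_radius=0.01, penalized_quality=0.05):
    """Lower the quality around `sample_pt` if another object blocks the approach."""
    n = sample_dir / np.linalg.norm(sample_dir)          # grasping (approach) direction
    rel = cloud_pts - sample_pt
    along = rel @ n                                       # height along the approach axis
    radial = np.linalg.norm(rel - np.outer(along, n), axis=1)
    intruding = ((cloud_obj_ids != sample_obj_id) & (along > 0.0)
                 & (along < clearance) & (radial < cup_radius))
    if np.any(intruding):                                 # grasping space is insufficient
        near = np.linalg.norm(rel, axis=1) < near_radius  # points within the distance threshold
        quality[near] = np.minimum(quality[near], penalized_quality)
    return quality
```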
In an exemplary embodiment of the present disclosure, the sample image includes a 2D image, and the generation method further includes: annotating the classification of each pixel in the 2D image, the classification including foreground and background, where the foreground is the first objects in the image. The pixel classification can be used to train the estimation model's ability to distinguish foreground from background, so that the foreground points (i.e., the points on the first objects) can be accurately selected from the sample image input to the estimation model and only the foreground points need a predicted grasping quality estimate. The classification of the pixels of the 2D image can also be derived from the classification of the points of the point cloud; by mapping the boundary between the first objects and the background in the simulated scene onto the point cloud, it can be determined whether each point of the point cloud is a foreground point or a background point.
An embodiment of the present disclosure also provides a method for generating training data for an object grasping point estimation model, including:
Step one: collecting 3D models of various sample objects and normalizing the 3D models so that the origin of the model coordinate system is placed at the centroid of the sample object and one coordinate axis of the model coordinate system coincides with the principal axis of the sample object.
A 3D model in a format such as stereolithography (STL) may be used. By collecting statistics of the vertex and face information in the 3D model and taking the center of all vertices, the centroid position of the sample object is obtained, and the origin of the model coordinate system is then translated to the centroid of the sample object. In addition, principal component analysis (PCA) may be used to determine the principal axis direction of the sample object, after which the 3D model of the sample object is rotated so that one coordinate axis of the model coordinate system points in the same direction as the principal axis of the sample object. The normalized 3D model thus obtained has the centroid of the sample object at the origin of its model coordinate system, with one coordinate axis of the model coordinate system aligned with the principal axis of the sample object.
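A minimal Python sketch of this normalization, using the vertex centroid and PCA of the vertices and aligning the principal axis with the z-axis of the model frame, is given below; it is an illustrative implementation of the rule in step one rather than the exact code of the embodiment.

```python
import numpy as np

def normalize_model(vertices: np.ndarray) -> np.ndarray:
    """Translate the vertex centroid to the origin and rotate the principal axis onto +z.
    `vertices` is an (N, 3) array of mesh vertex coordinates."""
    centered = vertices - vertices.mean(axis=0)          # centroid to the origin
    # PCA: the principal axis is the eigenvector of the covariance with the largest eigenvalue
    eigvals, eigvecs = np.linalg.eigh(np.cov(centered.T))
    principal = eigvecs[:, np.argmax(eigvals)]
    # rotation taking `principal` onto the z-axis (Rodrigues' formula)
    z = np.array([0.0, 0.0, 1.0])
    v = np.cross(principal, z)
    s, c = np.linalg.norm(v), float(principal @ z)
    if s < 1e-8:                                          # already parallel or anti-parallel to z
        R = np.eye(3) if c > 0 else np.diag([1.0, -1.0, -1.0])
    else:
        vx = np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])
        R = np.eye(3) + vx + vx @ vx * ((1.0 - c) / s**2)
    return centered @ R.T                                 # rotated, normalized vertices
```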
Step two: sampling grasping points on the 3D model of the sample object, and obtaining and recording the first position and grasping direction of each sampling point.
The sampling process of this embodiment is point cloud sampling of the object model; using the sampled point cloud, the normal vector is estimated from a fixed neighborhood, and each point together with its normal vector represents one sampling point. In one example, taking the scenario of picking up an object with a single suction cup, the grasping point is the suction point. Based on the existing vertices of the object, a voxel sampling method or another sampling method (such as farthest point sampling) is used to obtain a set number of sampling points. At the same time, all points within a certain neighborhood of each sampling point are used to estimate the direction of its normal vector. The normal vector may be estimated, for example, by using random sample consensus (RANSAC) to fit a plane to all points in the neighborhood of the sampling point, the normal vector of which approximates the normal vector of the sampling point.
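The neighborhood-based normal estimation can also be done with a least-squares plane fit, taking the smallest-eigenvalue direction of the local covariance as the plane normal; the sketch below uses this simpler fit in place of RANSAC and is an assumption for illustration.

```python
import numpy as np

def estimate_normal(points: np.ndarray, idx: int, radius: float = 0.01) -> np.ndarray:
    """Estimate the surface normal at points[idx] from its neighborhood.
    The normal is the eigenvector of the local covariance with the smallest
    eigenvalue, i.e. the normal of the best-fit plane through the neighbors."""
    d = np.linalg.norm(points - points[idx], axis=1)
    neighbors = points[d < radius]
    if neighbors.shape[0] < 3:                  # too few neighbors to fit a plane
        return np.array([0.0, 0.0, 1.0])
    cov = np.cov((neighbors - neighbors.mean(axis=0)).T)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    normal = eigvecs[:, 0]                      # smallest-eigenvalue direction
    return normal / np.linalg.norm(normal)
```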
Step three: evaluating the suction quality of the sampling points.
In the scenario where a single suction cup picks up an object, the quality evaluation includes computing the seal quality during suction and the resistance to the gravitational torque during suction (the suction must be able to counteract the gravitational torque to achieve a stable grasp), and the grasping quality of each sampling point is estimated from its seal quality and resistance quality. In one example, for the single-suction-cup scenario, the goal is to evaluate whether a sampled suction point (i.e., a sampling point) is one at which the sample object can be picked up stably.
The evaluation covers two aspects. The first is the seal quality, which can be measured as follows: the end of a suction cup with a set radius is approximated as a polygon, the polygon is projected onto the surface of the 3D model along the grasping direction of the sampling point, and the total edge length of the projected polygon is compared with the original edge length. If the projected total edge length is much larger than the original, the seal is poor; conversely, if it changes little, the seal is good. The degree of increase can be expressed as the ratio of the increase to the original edge length, and a corresponding score can be assigned according to the interval into which this ratio falls. The second aspect is the resistance to the gravitational torque when the suction cup picks up the sample object at the sampling point along the grasping direction (also called the suction point direction). The resistance quality can be computed with a "wrench resistance" model: a "wrench" is a six-dimensional vector whose first three components are forces and last three are torques, the space formed by such vectors is the "wrench space", and "wrench resistance" indicates whether the wrench composed of the forces and torques acting at a point can be resisted. If the gravitational torque can be contained in the wrench space provided by the suction force and the torques produced by its friction, stable suction can be achieved; otherwise it cannot. Finally, the seal quality and resistance quality results are each normalized to a score between 0 and 1 and summed, giving the suction quality evaluation result for each suction point.
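The scoring structure described above can be sketched as follows: a seal score derived from the relative increase of the projected cup-rim perimeter, and a resistance score from how much of the gravitational torque the suction contact can resist, each normalized to [0, 1] and summed. The ratio-to-score mapping and the strongly simplified resistance test are illustrative assumptions standing in for the full wrench-space analysis.

```python
import numpy as np

def seal_score(original_perimeter: float, projected_perimeter: float) -> float:
    """Score in [0, 1]: a small perimeter increase after projecting the cup rim
    onto the object surface indicates a good seal (0.2 is an assumed scale)."""
    increase_ratio = max(projected_perimeter - original_perimeter, 0.0) / original_perimeter
    return float(np.clip(1.0 - increase_ratio / 0.2, 0.0, 1.0))

def resistance_score(suction_force: float, friction_coeff: float, cup_radius: float,
                     mass: float, lever_arm: float, g: float = 9.81) -> float:
    """Score in [0, 1]: fraction of the gravitational torque that the suction contact
    can resist (a simplified stand-in for the wrench-resistance computation)."""
    gravity_torque = mass * g * lever_arm
    resist_torque = friction_coeff * suction_force * cup_radius
    return 1.0 if gravity_torque <= 1e-9 else float(np.clip(resist_torque / gravity_torque, 0.0, 1.0))

def suction_quality(orig_perim, proj_perim, suction_force, friction_coeff,
                    cup_radius, mass, lever_arm) -> float:
    """Sum of the two normalized scores, used as the per-sampling-point quality."""
    return seal_score(orig_perim, proj_perim) + resistance_score(
        suction_force, friction_coeff, cup_radius, mass, lever_arm)
```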
Step four: building an initial simulated data collection scene, i.e., the initial scene, loading several first objects selected from the sample objects into the initial simulated scene, and using a physics engine to simulate the falling dynamics and final stacking postures of the first objects.
This step is based on a physics engine and associated simulation software capable of simulating object dynamics. A simulated bin is added and kept static in the simulation environment to provide the corresponding collision basis. The 3D models of the first objects can be loaded into the simulation environment at random positions and postures, and each 3D model is assigned a certain mass. Through the simulation of the physics engine, the 3D models of the first objects then fall randomly into the bin under simulated gravity, and the physics engine also computes the collision information between the different first objects, so that the first objects form a stacking state very close to a real scene. Based on such a scheme, second positions and postures of the randomly stacked first objects that are close to reality are obtained in the simulated scene.
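Step four can, for instance, be realized with the open-source PyBullet physics engine; the sketch below drops URDF models of the first objects into a static bin and reads back their settled poses. The choice of PyBullet, the URDF file names, and the pose ranges are assumptions, since the disclosure only requires a physics engine capable of rigid-body dynamics.

```python
import numpy as np
import pybullet as p

def simulate_drop(object_urdfs, bin_urdf="bin.urdf", steps=2000, seed=0):
    """Drop the objects into a static bin and return their settled poses
    (the 'second position and posture' of each loaded 3D model)."""
    rng = np.random.default_rng(seed)
    p.connect(p.DIRECT)                                  # headless simulation
    p.setGravity(0, 0, -9.81)
    p.loadURDF(bin_urdf, useFixedBase=True)              # static bin providing the collision basis
    body_ids = []
    for urdf in object_urdfs:
        pos = [rng.uniform(-0.1, 0.1), rng.uniform(-0.1, 0.1), rng.uniform(0.3, 0.6)]
        orn = p.getQuaternionFromEuler(rng.uniform(0.0, 2.0 * np.pi, 3).tolist())
        body_ids.append(p.loadURDF(urdf, pos, orn))       # random initial position and posture
    for _ in range(steps):                                # let the pile settle under gravity
        p.stepSimulation()
    poses = [p.getBasePositionAndOrientation(b) for b in body_ids]
    p.disconnect()
    return poses
```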
Step five: generating annotation data for the sample images rendered from the simulated scene according to the grasping quality of the sampling points.
This step maps the sampling points obtained by grasping point sampling on the 3D models, and the grasping quality of each sampling point obtained by the evaluation, onto the simulated scene of stacked objects. Since the second positions and postures of the 3D models of the first objects are available from the simulation of the scene, and the positions of the sampling points are expressed in the model coordinate systems of the 3D models, the positions of these sampling points in the simulated scene are easy to compute.
To render the simulated scene, a simulated camera is added at a set position in the simulation environment, and a ray-tracing-based rendering engine is used to efficiently render the 2D images (such as texture images) and depth maps of the first objects in the stacked scene. Combined with the intrinsic parameters of the simulated camera, the rendered depth image can be converted into a point cloud. Based on the computed positions of the sampling points in the simulated scene and the rendered point cloud of the first objects, the position of each sampling point in the point cloud of the first object to which it belongs can be determined.
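The depth-to-point-cloud conversion mentioned above is the standard pinhole back-projection; a minimal numpy sketch in the camera frame (z equal to the depth value) is given below, under the usual assumption that the intrinsic matrix has the form [[fx, 0, cx], [0, fy, cy], [0, 0, 1]].

```python
import numpy as np

def depth_to_point_cloud(depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Back-project an (H, W) depth image into an (H*W, 3) point cloud in the
    camera frame, so that cloud point i corresponds to pixel (i // W, i % W)."""
    h, w = depth.shape
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    u, v = np.meshgrid(np.arange(w), np.arange(h))       # pixel coordinates
    x = (u - cx) * depth / fx                             # X = (u - cx) * Z / fx
    y = (v - cy) * depth / fy                             # Y = (v - cy) * Z / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)
```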
为了通过领域随机化技术手段弥补真实数据和仿真数据之间的领域差异。本实施例通过采集各类真实纹理(如实际物体图片,某种规则纹理图片等),并将采集的真实纹理随机地贴于模拟环境中随机堆叠的第一物体表面。在基于光线追踪的模拟相机渲染过程中,就可以渲染出带有不同纹理的2D图像。通过给估计模型提供具有不同纹理但像素点的目标抓取质量相同的2D图像,可以驱动估计模型利用物体的局部几何信息去预测像素点的抓取质量,从而可以实现模型对于不同未知物体的泛化能力。In order to make up for the domain differences between real data and simulated data through domain randomization techniques. In this embodiment, various types of real textures (such as pictures of actual objects, pictures of certain regular textures, etc.) are collected, and the collected real textures are randomly pasted on the surface of the first object stacked randomly in the simulation environment. In the ray tracing-based analog camera rendering process, 2D images with different textures can be rendered. By providing the estimation model with 2D images with different textures but the same quality of pixel capture, the estimation model can be driven to use the local geometric information of the object to predict the quality of pixel capture, so that the model can be used for different unknown objects. ability.
在将采样点的抓取质量传送给点云中的点之前,可以先基于同一第一物体中的采样点的位置对这些采样点的抓取质量做一个高斯滤波。通过求取采样点在点云中邻近的点的方式,使得点云中位于一个采样点的设定邻域范围内的点(即与该采样点邻近的点)可以获得该采样点的抓取质量。而渲染所得的点云和渲染所得的2D图像之间有像素级别的一一对应关系,因此可以将所述邻近的点的抓取质量标注为2D图像中对应像素点的目标抓取质量。在一个示例中,对于2D图像中其他像素点的目标抓取质量,可以根据所述对应像素点的目标抓取质量插值得到。结合采样点的抓取方向和第一物体的 点云的局部几何信息(如第一物体之间的相对位置、距离等),可以将抓取空间不足的像素点的抓取质量调低,以便在选择最优抓取点时过滤掉一部分由于碰撞导致的低质量抓取点。需要说明的是,此处对抓取质量的调整也可以针对2D图像中的对应像素点进行。Before the grasping qualities of the sampling points are transmitted to the points in the point cloud, a Gaussian filter may be performed on the grasping qualities of these sampling points based on the positions of the sampling points in the same first object. By finding the points adjacent to the sampling point in the point cloud, the points in the point cloud within the set neighborhood range of a sampling point (that is, the points adjacent to the sampling point) can obtain the capture of the sampling point quality. There is a pixel-level one-to-one correspondence between the rendered point cloud and the rendered 2D image, so the capture quality of the adjacent point can be marked as the target capture quality of the corresponding pixel in the 2D image. In an example, the target grasping quality of other pixels in the 2D image may be obtained by interpolation according to the target grasping quality of the corresponding pixel. Combining the grabbing direction of the sampling point and the local geometric information of the point cloud of the first object (such as the relative position and distance between the first objects, etc.), the grabbing quality of pixels with insufficient grabbing space can be adjusted down, so that When selecting the optimal grab point, some low-quality grab points caused by collisions are filtered out. It should be noted that the adjustment of the capture quality here may also be performed on corresponding pixels in the 2D image.
In this way, a grasping-quality heat map for the 2D image rendered from the simulated scene is obtained. Optionally, the grasping-quality heat map is output as the annotation data of the sample image; however, the annotation data need not take the form of a heat map, as long as it contains the target grasping quality of the pixels of the 2D image. When the grasping-quality heat map is used as the annotation data, training drives the estimation model to learn, or fit, the heat map.
An embodiment of the present disclosure further provides an apparatus for generating training data of an object grasping point estimation model. As shown in FIG. 3, it includes a processor 60 and a memory 50 storing a computer program, wherein the processor 60, when executing the computer program, implements the method for generating training data of an object grasping point estimation model according to any embodiment of the present disclosure. The processor in this and other embodiments of the present disclosure may be an integrated circuit chip with signal processing capability. The processor may be a general-purpose processor, such as a central processing unit (CPU) or a network processor (NP); it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The methods, steps, and logic block diagrams disclosed in the embodiments of the present disclosure may be implemented or executed by such a processor. A general-purpose processor may be a microprocessor or any conventional processor.
The above embodiments of the present disclosure may have the following advantages:
Replacing manual data collection and annotation with synthetic data generation and automatic annotation of the synthetic data reduces cost, increases the degree of automation, and gives stronger guarantees on data quality and on the accuracy of the grasp point annotations.
During synthetic data annotation, the physical model of the object (i.e., its 3D model) and the object's geometric information are used to evaluate grasp point quality based on basic physics, ensuring that the grasp point annotations are reasonable.
During synthetic data generation, domain randomization is applied: large amounts of synthetic data are generated with random textures, random lighting, random camera positions, and so on, to train the estimation model. This allows the estimation model to bridge the domain gap between synthetic and real data and to learn the local geometric features of objects, so that it can accurately perform the grasp point estimation task.
An embodiment of the present disclosure further provides a method for training an estimation model of object grasping points, as shown in FIG. 4, including:
Step 310: acquire training data, the training data including sample images and target grasping qualities of pixels in the sample images;
Step 320: with the sample images as input data, train the estimation model of object grasping points by machine learning, where the loss is computed during training from the difference between the predicted grasping quality of the pixels in the sample image output by the estimation model and the target grasping quality.
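One way to realise the loss of step 320 is a per-pixel regression loss between the predicted and the target quality maps. The sketch below assumes a PyTorch-style setup, which the disclosure does not specify:

```python
import torch

def grasp_quality_loss(pred_quality, target_quality):
    """Per-pixel squared difference between the predicted and the target
    grasping quality maps, averaged over all pixels (a simple MSE loss)."""
    return torch.mean((pred_quality - target_quality) ** 2)
```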
The training method of the estimation model in this embodiment of the present disclosure learns the grasping quality of the pixels in the 2D image, and the optimal grasp point is then selected from the predicted per-pixel grasping quality; compared with regressing the optimal grasp point directly, this yields better accuracy and stability.
The machine learning in the embodiments of the present disclosure may be supervised deep learning, non-deep-learning machine learning, and the like.
In an exemplary embodiment of the present disclosure, the training data is generated by the method for generating training data of an object grasping point estimation model according to any embodiment of the present disclosure.
In an exemplary embodiment of the present disclosure, the network architecture of the estimation model is shown in FIG. 5 and includes:
a backbone network (Backbone) 10, which adopts a semantic segmentation network architecture (e.g., DeepLab, UNet) and is configured to extract features from the input 2D image and depth image; and
a multi-branch network 20, which adopts a multi-task learning network architecture and is configured to make predictions based on the extracted features, so as to output the predicted grasping quality of pixels in the 2D image.
In one example, the multi-branch network (which may also be called the network head or detection head) includes:
a first branch network 21, which learns semantic segmentation information to distinguish foreground from background and is configured to output the classification confidence of each pixel in the 2D image, the classes being foreground and background; and
a second branch network 23, which learns the grasping quality information of pixels in the 2D image and is configured to output the predicted grasping quality of the pixels classified as foreground according to the classification confidence. For example, pixels whose foreground confidence exceeds a set confidence threshold may be regarded as pixels classified as foreground.
Classification is involved in this example, so the training data needs to include classification data.
In this example, the sample images include a 2D image and a depth image; both the backbone network 10 and the multi-branch network 20 contain depth channels, and their convolutional layers may adopt a 3D convolution structure.
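A minimal sketch of such a two-headed architecture is shown below, assuming a PyTorch implementation; the real backbone would be a segmentation network such as UNet or DeepLab, and the small convolutional stack here merely stands in for it:

```python
import torch
import torch.nn as nn

class GraspPointNet(nn.Module):
    """Sketch of the described architecture: a shared encoder over RGB-D
    input and two pixel-wise heads (foreground/background logits and
    per-pixel grasping quality)."""
    def __init__(self, in_channels=4, features=32):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, features, 3, padding=1), nn.ReLU(),
            nn.Conv2d(features, features, 3, padding=1), nn.ReLU(),
        )
        self.seg_head = nn.Conv2d(features, 2, 1)       # foreground / background
        self.quality_head = nn.Conv2d(features, 1, 1)   # per-pixel grasp quality

    def forward(self, rgbd):                            # rgbd: (B, 4, H, W)
        feats = self.backbone(rgbd)
        return self.seg_head(feats), self.quality_head(feats)
```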
In this example, during training, the loss of the first branch network 21 is computed from the classification loss over all pixels of the 2D image; the loss of the second branch network 23 is computed from the difference between the predicted and target grasping qualities of some or all of the pixels classified as foreground; and the loss of the backbone network 10 is computed from the total loss of the first branch network 21 and the second branch network 23. After the losses of the networks are computed, the parameters of each network can be optimized with gradient descent until the loss is minimized and the model converges. During training, random square occlusions may also be applied to the depth image, for example masking a 64x64-pixel block at a time, so that the network makes better use of the structured information in the depth map.
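The random square occlusion of the depth image mentioned above might look roughly like the following sketch, again assuming PyTorch tensors of shape (B, 1, H, W) and images larger than the block size:

```python
import torch

def occlude_depth(depth, block=64, rng=torch.Generator().manual_seed(0)):
    """Zero out one random block x block square per depth image in the batch,
    as a simple form of the occlusion augmentation described above."""
    b, _, h, w = depth.shape
    ys = torch.randint(0, h - block, (b,), generator=rng)
    xs = torch.randint(0, w - block, (b,), generator=rng)
    out = depth.clone()
    for i in range(b):
        out[i, :, ys[i]:ys[i] + block, xs[i]:xs[i] + block] = 0.0
    return out
```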
After multiple iterations of training the estimation model on the training data, the accuracy of the trained model is verified with validation data, which can be generated in the same way as the training data. Once the accuracy of the estimation model meets the requirement, the model is trained and ready for use; if the accuracy falls short, training continues. In use, a 2D image and a depth image containing the actual objects to be grasped are input, and the predicted grasping quality of the pixels in the 2D image is output.
The embodiments of the present disclosure build the grasp point estimation model with a multi-task learning framework based on deep learning principles, which effectively addresses the high error rate of simple point cloud segmentation schemes and their inability to distinguish adjacent objects.
An embodiment of the present disclosure further provides a training apparatus for an estimation model of object grasping points. Referring to FIG. 3, it includes a processor and a memory storing a computer program, wherein the processor, when executing the computer program, implements the method for training an estimation model of object grasping points according to any embodiment of the present disclosure.
The training method of the estimation model in the embodiments of the present disclosure predicts the grasping quality of pixels in the 2D image through pixel-level dense prediction. One branch performs pixel-level foreground/background classification. The other branch outputs a predicted grasping quality value for every pixel of the 2D image classified as foreground. Both the backbone network and the branch networks of the estimation model contain depth channels: at the input, the depth image carrying the depth channel information is fed into the backbone network, and the features learned by the depth channel are fused, along the channel dimension, into the features of the colour 2D image. Performing pixel-wise multi-task prediction on this basis helps the estimation model better handle grasp point estimation in scenes where the objects to be grasped are stacked.
An embodiment of the present disclosure further provides a method for estimating object grasping points, as shown in FIG. 6, including:
Step 410: acquire a scene image containing an object to be grasped, the scene image including a 2D image, or a 2D image and a depth image;
Step 420: input the scene image into an estimation model of object grasping points, where the estimation model is an estimation model trained by the training method described in any embodiment of the present disclosure;
Step 430: determine the position of the grasp point of the object to be grasped according to the predicted grasping quality of the pixels in the 2D image output by the estimation model.
The embodiments of the present disclosure are camera-driven: the scene images of the objects to be grasped, such as the 2D image and the depth image, can be captured with depth cameras adapted to various industrial scenarios. After the colour 2D image and the depth image are acquired from the depth camera, they are cropped and scaled to the image size required by the estimation model input and then fed into the estimation model.
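A simple version of this preprocessing step, assuming OpenCV and a hypothetical 512x512 model input size, could be:

```python
import cv2
import numpy as np

def preprocess(rgb, depth, out_size=(512, 512)):
    """Centre-crop the RGB and depth images to a square and resize them to
    the (assumed) input resolution expected by the estimation model."""
    h, w = rgb.shape[:2]
    s = min(h, w)
    top, left = (h - s) // 2, (w - s) // 2
    rgb = rgb[top:top + s, left:left + s]
    depth = depth[top:top + s, left:left + s]
    rgb = cv2.resize(rgb, out_size, interpolation=cv2.INTER_LINEAR)
    depth = cv2.resize(depth, out_size, interpolation=cv2.INTER_NEAREST)
    return rgb.astype(np.float32) / 255.0, depth.astype(np.float32)
```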
In an exemplary embodiment of the present disclosure, determining the position of the grasp point of the object to be grasped according to the predicted grasping quality of the pixels in the 2D image output by the estimation model includes:
selecting all or some of the pixels of the object to be grasped whose predicted grasping quality is greater than a set quality threshold;
clustering the selected pixels and computing one or more cluster centres, and taking the pixels corresponding to the cluster centres as candidate grasp points of the object to be grasped; and
ranking the obtained candidate grasp points according to a predetermined rule, and determining the best candidate grasp point as the grasp point of the object to be grasped according to the ranking.
In an exemplary embodiment of the present disclosure, the ranking of the candidate grasp points may follow predetermined heuristic rules, set for example according to the candidate grasp point's distance from the camera, whether the suction point lies inside the actual bin, whether the suction point would cause a collision, and so on; this information is used to rank the candidate grasp points, and the best candidate grasp point is determined as the grasp point of the object to be grasped.
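The sketch below illustrates the thresholding, clustering and ranking steps; it uses k-means from scikit-learn and ranks cluster centres by the mean predicted quality of their members as a stand-in for the heuristic rules, which is an assumption rather than the disclosed ranking:

```python
import numpy as np
from sklearn.cluster import KMeans

def select_grasp_point(quality_map, foreground_mask, k=3, q_thresh=0.5):
    """Pick candidate grasp pixels whose predicted quality exceeds a threshold,
    cluster them, and return cluster centres (x, y) sorted best-first by the
    mean quality of their members."""
    ys, xs = np.where((quality_map > q_thresh) & foreground_mask)
    if len(xs) == 0:
        return None
    pts = np.stack([xs, ys], axis=1)
    k = min(k, len(pts))
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(pts)
    ranked = []
    for c in range(k):
        members = pts[labels == c]
        centre = members.mean(axis=0).round().astype(int)
        score = quality_map[members[:, 1], members[:, 0]].mean()
        ranked.append((score, tuple(centre)))
    return [centre for _, centre in sorted(ranked, reverse=True)]
```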
An embodiment of the present disclosure further provides an apparatus for estimating object grasping points. Referring to FIG. 3, it includes a processor and a memory storing a computer program, wherein the processor, when executing the computer program, implements the method for estimating object grasping points according to any embodiment of the present disclosure.
Based on the trained estimation model, the above embodiments of the present disclosure feed the 2D image and the depth image captured by the camera into the estimation model for forward inference and output the predicted grasping quality of the pixels of the 2D image. If more than a set number of pixels have a predicted grasping quality above the set quality threshold, a set number of pixels with the best predicted grasping quality, such as the top 50 or top 100, may be selected. After the selected pixels are clustered and one or more cluster centres are computed, the pixel of the 2D image closest to a cluster centre (which may be a single pixel or a pixel within a region) can be taken as a candidate grasp point. Since the estimation model used can achieve good accuracy, the estimation method and apparatus of this embodiment improve the accuracy of grasp point estimation and hence the success rate of grasping.
An embodiment of the present disclosure further provides a robot vision system, as shown in FIG. 7, including:
a camera 1, configured to capture a scene image containing the object to be grasped, the scene image including a 2D image, or a 2D image and a depth image;
a control apparatus 2, including the apparatus for estimating object grasping points according to claim 20, the control apparatus being configured to determine the position of the grasp point of the object to be grasped according to the scene image captured by the camera, and to control the grasping action performed by the robot according to the position of the grasp point; and
a robot 3, configured to perform the grasping action.
The robot vision system of the embodiments of the present disclosure improves the accuracy of grasp point estimation and hence the success rate of grasping.
An embodiment of the present disclosure further provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements the method for generating training data of an object grasping point estimation model according to any embodiment of the present disclosure, or the method for training an estimation model of object grasping points according to any embodiment of the present disclosure, or the method for estimating object grasping points according to any embodiment of the present disclosure.
In any one or more of the above exemplary embodiments of the present disclosure, the described functions may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over a computer-readable medium as one or more instructions or code and executed by a hardware-based processing unit. Computer-readable media may include computer-readable storage media, which correspond to tangible media such as data storage media, or communication media, which include any medium that facilitates transfer of a computer program from one place to another, for example according to a communication protocol. In this manner, a computer-readable medium may generally correspond to a non-transitory tangible computer-readable storage medium or to a communication medium such as a signal or carrier wave. Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code and/or data structures for implementing the techniques described in this disclosure. A computer program product may include a computer-readable medium.
By way of example and not limitation, such computer-readable storage media may include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, flash memory, or any other medium that can be used to store the desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection may properly be termed a computer-readable medium: for example, if instructions are transmitted from a website, server, or other remote source using a coaxial cable, fibre-optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fibre-optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. It should be understood, however, that computer-readable storage media and data storage media do not include connections, carrier waves, signals, or other transitory media, but are instead directed to non-transitory, tangible storage media. As used herein, disk and disc include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The instructions may be executed by one or more processors, such as one or more digital signal processors (DSPs), general-purpose microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other equivalent integrated or discrete logic circuits. Accordingly, the term "processor" as used herein may refer to any of the foregoing structures or to any other structure suitable for implementing the techniques described herein. In addition, in some aspects the functionality described herein may be provided within dedicated hardware and/or software modules configured for encoding and decoding, or incorporated into a combined codec. The techniques may also be fully implemented in one or more circuits or logic elements.
The technical solutions of the embodiments of the present disclosure may be implemented in a wide variety of devices or apparatuses, including a wireless handset, an integrated circuit (IC), or a set of ICs (e.g., a chipset). Various components, modules, or units are described in the embodiments of the present disclosure to emphasize functional aspects of devices configured to perform the described techniques, but they do not necessarily need to be realized by different hardware units. Rather, as described above, the various units may be combined in a codec hardware unit or provided by a collection of interoperative hardware units, including one or more processors as described above, in conjunction with suitable software and/or firmware.

Claims (22)

  1. A method for generating training data for an object grasping point estimation model, comprising:
    acquiring a 3D model of a sample object, sampling grasp points based on the 3D model of the sample object, and evaluating the grasping quality of the sampling points;
    rendering a simulated scene loaded with a 3D model of a first object to generate sample images for training, the first object being selected from the sample objects; and
    generating a target grasping quality of pixels in the sample images according to the grasping quality of the sampling points of the first object.
  2. The generation method according to claim 1, wherein:
    acquiring the 3D model of the sample object comprises: creating or acquiring the 3D model of the sample object, and normalizing it so that the centre of mass of the sample object is located at the origin of the model coordinate system of the 3D model and the principal axis of the sample object is aligned with a coordinate axis of the model coordinate system.
  3. The generation method according to claim 1, wherein:
    sampling grasp points based on the 3D model of the sample object comprises: performing point cloud sampling on the 3D model of the sample object, and determining and recording a first position and a grasping direction of each sampling point in the 3D model; the first position is represented by the coordinates of the sampling point in the model coordinate system of the 3D model, and the grasping direction is determined according to the normal vector of the sampling point in the 3D model.
  4. The generation method according to claim 1, 2 or 3, wherein:
    evaluating the grasping quality of the sampling points comprises: in a scenario where a single suction cup is used to pick up the sample object, estimating the grasping quality of each sampling point according to its sealing quality and its resisting quality; wherein the sealing quality is determined by the degree of sealing between the end of the suction cup and the surface of the sample object when the suction cup picks up the sample object at the position of the sampling point with the axis of the suction cup aligned with the grasping direction of the sampling point, and the resisting quality is determined, in that case, according to the gravitational torque of the sample object and the degree to which the torque the suction cup can produce when picking up the object resists that gravitational torque.
  5. The generation method according to claim 1, wherein:
    the simulated scene is obtained by loading the 3D model of the first object into an initial scene, and the loading process comprises:
    selecting, from the sample objects, the type and quantity of the first object to be loaded, and assigning a value to the mass of the first object;
    loading the 3D model of the first object into the initial scene at a random position and pose;
    simulating, with a physics engine, the process of the first object falling and the finally formed stacked state, to obtain the simulated scene; and
    recording a second position and a pose of the 3D model of the first object in the simulated scene.
  6. The generation method according to claim 1, wherein:
    rendering the simulated scene to generate sample images for training comprises: rendering each simulated scene at least twice to obtain at least two sets of sample images for training; wherein, for each rendering, a simulated camera is added to the simulated scene, a light source is set, and textures are added to the loaded first object, and the rendered 2D image and depth image serve as one set of sample images; any two of the multiple renderings differ in at least one of the following parameters: object texture, simulated camera parameters, lighting parameters.
  7. The generation method according to claim 6, wherein:
    adding textures to the loaded first object for each rendering comprises:
    for each rendering, for every first object loaded into the simulated scene, randomly selecting one of a plurality of collected real textures and applying it to the surface of that first object; or
    for each rendering, for every type of first object loaded into the simulated scene, randomly selecting one of a plurality of collected real textures and applying it to the surfaces of the first objects of that type.
  8. The generation method according to claim 1, 6 or 7, wherein:
    the sample images include a 2D image and a depth image; generating the target grasping quality of pixels in the sample images according to the grasping quality of the sampling points of the first object comprises:
    processing each simulated scene from which the 2D image and the depth image are rendered as follows:
    obtaining a point cloud of the first objects visible in the simulated scene according to the intrinsic parameters of the simulated camera used for rendering and the rendered depth image;
    determining the positions of target sampling points in the point cloud according to the first positions of the target sampling points in the 3D model, the second position of the 3D model in the simulated scene, and the pose change after loading, the target sampling points being the sampling points of the visible first objects; and
    determining the grasping quality of points in the point cloud according to the grasping quality of the target sampling points and their positions in the point cloud, and annotating the grasping quality of those points as the target grasping quality of the corresponding pixels in the 2D image.
  9. The generation method according to claim 8, wherein:
    determining the grasping quality of points in the point cloud according to the grasping quality of the target sampling points and their positions in the point cloud comprises:
    for each target sampling point, setting the grasping quality of the points in the point cloud adjacent to that target sampling point to the grasping quality of that target sampling point; or
    for each point in the point cloud, obtaining its grasping quality by interpolating from the grasping qualities of the target sampling points adjacent to that point; or
    for each target sampling point, setting the grasping quality of the points in the point cloud adjacent to that target sampling point to the grasping quality of that target sampling point, and after the grasping qualities of the points adjacent to all target sampling points have been determined, obtaining the grasping quality of the other points in the point cloud by interpolation.
  10. The generation method according to claim 9, wherein:
    after determining the grasping quality of the points in the point cloud according to the grasping quality of the target sampling points and their positions in the point cloud, the generation method further comprises: for each target sampling point, taking the grasping direction of that target sampling point as the grasping direction of the points in the point cloud adjacent to it, and, in combination with the relative positional relationship between the visible first objects in the point cloud, when it is determined that the grasping space at the points adjacent to that target sampling point is smaller than the required grasping space, adjusting downward the grasping quality of the points in the point cloud whose distance to that target sampling point is smaller than a set distance threshold.
  11. The generation method according to claim 1, wherein:
    the sample images include a 2D image, and the generation method further comprises: generating classification data for each pixel in the 2D image, the classes including foreground and background.
  12. A method for training an estimation model of object grasping points, comprising:
    acquiring training data, the training data including sample images and target grasping qualities of pixels in the sample images; and
    with the sample images as input data, training the estimation model of object grasping points by machine learning, where during training a loss is computed from the difference between the predicted grasping quality of pixels in the sample image output by the estimation model and the target grasping quality.
  13. The training method according to claim 12, wherein:
    the training data is generated by the generation method according to any one of claims 1 to 11.
  14. The training method according to claim 12 or 13, wherein:
    the sample images include a 2D image and a depth image; the estimation model includes a backbone network and a multi-branch network, wherein:
    the backbone network adopts a semantic segmentation network architecture, contains depth channels, and is configured to extract features from the input 2D image and depth image; and
    the multi-branch network adopts a multi-task learning network architecture, contains depth channels, and is configured to make predictions based on the extracted features and output the predicted grasping quality of pixels in the 2D image.
  15. The training method according to claim 12, wherein:
    the training data is generated by the generation method according to claim 12;
    the multi-branch network comprises:
    a first branch network, configured to output the classification confidence of each pixel in the 2D image, the classes including foreground and background; and
    a second branch network, configured to output the predicted grasping quality of the pixels in the 2D image classified as foreground according to the classification confidence.
  16. A method for estimating object grasping points, comprising:
    acquiring a scene image containing an object to be grasped, the scene image including a 2D image, or a 2D image and a depth image;
    inputting the scene image into an estimation model of object grasping points, wherein the estimation model is an estimation model trained by the training method according to any one of claims 12 to 15; and
    determining the position of a grasp point of the object to be grasped according to the predicted grasping quality of pixels in the 2D image output by the estimation model.
  17. The estimation method according to claim 16, wherein:
    determining the position of the grasp point of the object to be grasped according to the predicted grasping quality of pixels in the 2D image output by the estimation model comprises:
    selecting all or some of the pixels of the object to be grasped whose predicted grasping quality is greater than a set quality threshold;
    clustering the selected pixels and computing one or more cluster centres, and taking the pixels corresponding to the cluster centres as candidate grasp points of the object to be grasped; and
    ranking the obtained candidate grasp points according to a predetermined rule, and determining the best candidate grasp point as the grasp point of the object to be grasped according to the ranking.
  18. An apparatus for generating training data of an object grasping point estimation model, comprising a processor and a memory storing a computer program, wherein the processor, when executing the computer program, implements the method for generating training data of an object grasping point estimation model according to claim 1 or 11.
  19. An apparatus for training an estimation model of object grasping points, comprising a processor and a memory storing a computer program, wherein the processor, when executing the computer program, implements the method for training an estimation model of object grasping points according to claim 12 or 15.
  20. An apparatus for estimating object grasping points, comprising a processor and a memory storing a computer program, wherein the processor, when executing the computer program, implements the method for estimating object grasping points according to claim 16 or 17.
  21. A robot vision system, comprising:
    a camera, configured to capture a scene image containing an object to be grasped, the scene image including a 2D image, or a 2D image and a depth image;
    a control apparatus, including the apparatus for estimating object grasping points according to claim 20, the control apparatus being configured to determine the position of a grasp point of the object to be grasped according to the scene image captured by the camera, and to control the grasping action performed by the robot according to the position of the grasp point; and
    a robot, configured to perform the grasping action.
  22. A non-transitory computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the method for generating training data of an object grasping point estimation model according to any one of claims 1 to 11, or the method for training an estimation model of object grasping points according to any one of claims 12 to 15, or the method for estimating object grasping points according to claim 16 or 17.
PCT/CN2022/135705 2021-12-29 2022-11-30 Object grabbing point estimation method, apparatus and system, model training method, apparatus and system, and data generation method, apparatus and system WO2023124734A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111643324.3 2021-12-29
CN202111643324.3A CN116416444B (en) 2021-12-29 2021-12-29 Object grabbing point estimation, model training and data generation method, device and system

Publications (1)

Publication Number Publication Date
WO2023124734A1

Family

ID=86997564

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/135705 WO2023124734A1 (en) 2021-12-29 2022-11-30 Object grabbing point estimation method, apparatus and system, model training method, apparatus and system, and data generation method, apparatus and system

Country Status (2)

Country Link
CN (1) CN116416444B (en)
WO (1) WO2023124734A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116841914A (en) * 2023-09-01 2023-10-03 星河视效科技(北京)有限公司 Method, device, equipment and storage medium for calling rendering engine
CN117656083B (en) * 2024-01-31 2024-04-30 厦门理工学院 Seven-degree-of-freedom grabbing gesture generation method, device, medium and equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108058172A (en) * 2017-11-30 2018-05-22 深圳市唯特视科技有限公司 A kind of manipulator grasping means based on autoregression model
CN109598264A (en) * 2017-09-30 2019-04-09 北京猎户星空科技有限公司 Grasping body method and device
US20200061811A1 (en) * 2018-08-24 2020-02-27 Nvidia Corporation Robotic control system
CN111553949A (en) * 2020-04-30 2020-08-18 张辉 Positioning and grabbing method for irregular workpiece based on single-frame RGB-D image deep learning

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019037863A1 (en) * 2017-08-24 2019-02-28 Toyota Motor Europe System and method for label augmentation in video data
CN111226237A (en) * 2017-09-01 2020-06-02 加利福尼亚大学董事会 Robotic system and method for robust grasping and aiming of objects
CN108818586B (en) * 2018-07-09 2021-04-06 山东大学 Object gravity center detection method suitable for automatic grabbing by manipulator
CN109159113B (en) * 2018-08-14 2020-11-10 西安交通大学 Robot operation method based on visual reasoning
CN109523629B (en) * 2018-11-27 2023-04-07 上海交通大学 Object semantic and pose data set generation method based on physical simulation
CN109658413B (en) * 2018-12-12 2022-08-09 达闼机器人股份有限公司 Method for detecting grabbing position of robot target object
CN111127548B (en) * 2019-12-25 2023-11-24 深圳市商汤科技有限公司 Grabbing position detection model training method, grabbing position detection method and grabbing position detection device
CN111161387B (en) * 2019-12-31 2023-05-30 华东理工大学 Method and system for synthesizing images in stacked scene, storage medium and terminal equipment
CN212553849U (en) * 2020-05-26 2021-02-19 腾米机器人科技(深圳)有限责任公司 Object grabbing manipulator
CN111844101B (en) * 2020-07-31 2022-09-06 中国科学技术大学 Multi-finger dexterous hand sorting planning method
CN113034526B (en) * 2021-03-29 2024-01-16 深圳市优必选科技股份有限公司 Grabbing method, grabbing device and robot
CN113297701B (en) * 2021-06-10 2022-12-20 清华大学深圳国际研究生院 Simulation data set generation method and device for multiple industrial part stacking scenes
CN113436293B (en) * 2021-07-13 2022-05-03 浙江大学 Intelligent captured image generation method based on condition generation type countermeasure network

Also Published As

Publication number Publication date
CN116416444B (en) 2024-04-16
CN116416444A (en) 2023-07-11


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22913989

Country of ref document: EP

Kind code of ref document: A1