CN116416444A - Object grabbing point estimation, model training and data generation method, device and system - Google Patents

Object grabbing point estimation, model training and data generation method, device and system

Info

Publication number
CN116416444A
Authority
CN
China
Prior art keywords
point
grabbing
quality
image
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111643324.3A
Other languages
Chinese (zh)
Other versions
CN116416444B (en)
Inventor
周韬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Midea Group Co Ltd
Guangdong Midea White Goods Technology Innovation Center Co Ltd
Original Assignee
Midea Group Co Ltd
Guangdong Midea White Goods Technology Innovation Center Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Midea Group Co Ltd, Guangdong Midea White Goods Technology Innovation Center Co Ltd filed Critical Midea Group Co Ltd
Priority to CN202111643324.3A
Priority to PCT/CN2022/135705 (WO2023124734A1)
Publication of CN116416444A
Application granted
Publication of CN116416444B
Legal status: Active

Classifications

    • G06T 7/50: Image analysis; depth or shape recovery
    • G06N 3/02: Computing arrangements based on biological models; neural networks
    • G06T 17/00: Three-dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T 7/11: Image analysis; segmentation; region-based segmentation
    • G06T 7/73: Determining position or orientation of objects or cameras using feature-based methods
    • G06T 7/75: Determining position or orientation of objects or cameras using feature-based methods involving models
    • G06V 10/762: Image or video recognition or understanding using pattern recognition or machine learning; clustering, e.g. of similar faces in social networks
    • G06V 10/774: Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • Y02P 90/30: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation; computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Geometry (AREA)
  • Computer Graphics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure relates to an object grabbing point estimation, model training and data generation method, device and system. Grabbing points are sampled on a 3D model of a sample object and the grabbing quality of the sampling points is evaluated; a simulated scene loaded with the 3D model of a first object is rendered to generate a sample image for training, together with the target grabbing quality of pixel points in the sample image; the sample image and the target grabbing quality are used as training data to train an estimation model of object grabbing points, and the trained model is used to estimate object grabbing points. According to the embodiments of the disclosure, automatic labeling of the sample image can be realized, training data can be generated efficiently and with high quality, and the accuracy of grabbing point estimation is improved.

Description

Object grabbing point estimation, model training and data generation method, device and system
Technical Field
The present disclosure relates to, but is not limited to, artificial intelligence techniques, and in particular, to methods, apparatus, and systems for object grabbing point estimation, model training, and data generation.
Background
In robot vision guidance applications, a challenge faced by a robot vision system is the need to guide the robot to grasp thousands of different stock keeping units (SKUs). These objects are often unknown to the system, or the variety is so large that maintaining a physical model or texture template for every SKU is too costly. Even in the simplest case, the unstacking application, where the objects to be grasped are rectangular (cartons or boxes), the texture, size and other properties of the objects vary from scene to scene. Classical object localization or recognition schemes based on template matching are therefore difficult to apply in such scenarios. In some e-commerce warehouse scenarios, many objects have irregular shapes, the most common being box-like objects and bottle-like objects; the objects are stacked together, and the robot vision guidance system is required to efficiently pick them one by one from the stacked state, perform subsequent code scanning or identification operations, and place them into the appropriate target material frame.
In this process, how the robot vision system, without prior knowledge of the objects, estimates the most suitable grabbing point for the robot (which may be, but is not limited to, a suction point) from the scene captured by the camera and guides the robot to execute the grabbing action remains a problem to be solved.
Disclosure of Invention
The following is a summary of the subject matter described in detail herein. This summary is not intended to limit the scope of the claims.
An embodiment of the present disclosure provides a method for generating training data of an object grabbing point estimation model, including:
acquiring a 3D model of a sample object, sampling a grabbing point based on the 3D model of the sample object, and evaluating grabbing quality of the sampling point;
rendering a simulated scene loaded with a 3D model of a first object, and generating a sample image for training, wherein the first object is selected from the sample objects;
and generating target grabbing quality of the pixel points in the sample image according to the grabbing quality of the sampling points of the first object.
An embodiment of the present disclosure further provides a device for generating training data of an object grabbing point estimation model, which includes a processor and a memory storing a computer program, where the processor implements the method for generating training data of the object grabbing point estimation model according to any embodiment of the present disclosure when executing the computer program.
The method and the device of the embodiment of the disclosure realize automatic labeling of the sample image, can generate training data with high efficiency and high quality, and avoid the problems of heavy workload, unstable labeling quality and the like caused by manual labeling.
An embodiment of the present disclosure provides a training method for an estimation model of an object grabbing point, including:
acquiring training data, wherein the training data comprises a sample image and target grabbing quality of pixel points in the sample image;
training an estimation model of an object grabbing point by taking the sample image as input data in a machine learning mode, and calculating loss according to a difference value between the predicted grabbing quality of a pixel point in the sample image output by the estimation model and the target grabbing quality during training;
the estimation model comprises a backbone network adopting a semantic segmentation network architecture and a multi-branch network adopting a multi-task learning network architecture.
An embodiment of the present disclosure further provides a training device for an estimation model of an object grabbing point, including a processor and a memory storing a computer program, where the training method for an estimation model of an object grabbing point according to any embodiment of the present disclosure is implemented when the processor executes the computer program.
According to the method and the device of the embodiments of the disclosure, the grabbing quality of pixel points in the 2D image is learned through training; compared with directly regressing an optimal grabbing point, this offers better precision and stability.
An embodiment of the present disclosure provides a method for estimating an object grabbing point, including:
acquiring a scene image containing an object to be grabbed, wherein the scene image comprises a 2D image or comprises a 2D image and a depth image;
inputting the scene image into an estimation model of an object grabbing point, wherein the estimation model adopts an estimation model trained by the training method according to any embodiment of the disclosure;
and determining the position of the grabbing point of the object to be grabbed according to the predicted grabbing quality of the pixel point in the 2D image output by the estimation model.
An embodiment of the present disclosure further provides an apparatus for estimating an object grabbing point, including a processor and a memory storing a computer program, where the processor implements the method for estimating an object grabbing point according to any embodiment of the present disclosure when executing the computer program.
An embodiment of the present disclosure also provides a robot vision system, including:
a camera arranged to capture a scene image comprising an object to be grabbed, the scene image comprising a 2D image, or comprising a 2D image and a depth image;
A control device including an estimation device of object grabbing points according to an embodiment of the present disclosure, the control device being configured to determine a position of grabbing points of the object to be grabbed according to the scene image captured by the camera; and controlling the grabbing action executed by the robot according to the position of the grabbing point;
and a robot arranged to perform said gripping action.
The estimation method, the estimation device and the robot vision system can improve the accuracy of object grabbing point estimation and further improve grabbing success rate.
An embodiment of the present disclosure further provides a non-transitory computer readable storage medium storing a computer program, which when executed by a processor, implements a method for generating training data of an object grabbing point estimation model according to any embodiment of the present disclosure, or implements a method for training an object grabbing point estimation model according to any embodiment of the present disclosure, or implements a method for estimating an object grabbing point according to any embodiment of the present disclosure.
Other aspects will become apparent upon reading and understanding the accompanying drawings and detailed description.
Drawings
The accompanying drawings are included to provide an understanding of embodiments of the disclosure, and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain, without limitation, the embodiments.
FIG. 1 is a flow chart of a method of generating training data for an object grabbing point estimation model according to an embodiment of the present disclosure;
FIG. 2 is a flow chart, within the method of FIG. 1, of generating annotation data based on the grabbing quality of the sampling points;
FIG. 3 is a schematic diagram of an apparatus for generating training data according to an embodiment of the present disclosure;
FIG. 4 is a flow chart of a method of training an estimation model of object gripping points in accordance with an embodiment of the present disclosure;
FIG. 5 is a network block diagram of an estimation model according to an embodiment of the present disclosure;
FIG. 6 is a flow chart of a method of object grasp point estimation according to an embodiment of the present disclosure;
fig. 7 is a schematic structural view of a robot vision system according to an embodiment of the present disclosure.
Detailed Description
The present disclosure describes several embodiments, but the description is illustrative and not limiting, and it will be apparent to those of ordinary skill in the art that many more embodiments and implementations are possible within the scope of the embodiments described in the present disclosure.
In the description of the present disclosure, words such as "exemplary" or "for example" are used to mean serving as an example, instance, or illustration. Any embodiment described as "exemplary" or "for example" in this disclosure should not be construed as preferred or advantageous over other embodiments. "And/or" herein describes an association relationship between associated objects, meaning that three relationships may exist; for example, A and/or B may represent: A exists alone, A and B exist together, or B exists alone. "Plurality" means two or more. In addition, for the purpose of clearly describing the technical solutions of the embodiments of the present disclosure, words such as "first" and "second" are used to distinguish identical or similar items having substantially the same function and effect. Those skilled in the art will appreciate that the words "first", "second" and the like do not limit the quantity or order of execution, and that items referred to as "first" and "second" are not necessarily different.
In describing representative exemplary embodiments, the specification may have presented the method and/or process as a particular sequence of steps. However, to the extent that the method or process does not rely on the particular order of steps set forth herein, the method or process should not be limited to the particular sequence of steps described. Other sequences of steps are possible as will be appreciated by those of ordinary skill in the art. Accordingly, the particular order of the steps set forth in the specification should not be construed as limitations on the claims. Furthermore, the claims directed to the method and/or process should not be limited to the performance of their steps in the order written, and one skilled in the art can readily appreciate that the sequences may be varied and still remain within the spirit and scope of the embodiments of the present disclosure.
With the development of deep learning, tasks such as detection (estimating the 2D position and size of objects) and segmentation (pixel-level prediction of object category or instance index) can be completed by training visual neural network models. Object grabbing point estimation can likewise be built on a deep learning framework and suitable training data, yielding a data-driven grabbing point estimation method and device.
In one scheme, after the camera captures a color image and a depth image, the point cloud is segmented by plane segmentation or Euclidean-distance-based clustering in an attempt to separate and detect the different objects in the scene; the center point of each segment is then taken as a grabbing point candidate, the candidates are ranked with a series of heuristic rules, and the robot is finally guided to grab at the best grabbing point. A feedback system is also introduced to record the success or failure of each grab; if a grab succeeds, the current object is used as a template to match the grabbing point of the next grab. The problem with this approach is that the performance of common point cloud segmentation is relatively weak, many false grabbing points occur, and point cloud segmentation is prone to failure when objects are closely packed.
In another approach, a deep learning framework is used: the directions and areas of the grabbing points are annotated manually on a limited amount of data to obtain training data, and a neural network model is trained on this data. While the system runs, the vision system processes pictures similar to the training set and estimates the object grabbing points in them. The problem with this kind of scheme is that data acquisition and annotation are costly. In particular, grabbing point directions and areas are hard to annotate and require annotators with strong skills; at the same time, the annotations carry many human factors, so the annotation quality cannot be controlled systematically and a model with systematic quality assurance cannot be produced.
An embodiment of the present disclosure provides a method for generating training data of an object grabbing point estimation model, as shown in fig. 1, including:
step 110, acquiring a 3D model of a sample object, sampling a grabbing point based on the 3D model of the sample object, and evaluating grabbing quality of the sampling point;
the sample object can be various box-shaped objects, bottle-shaped objects, box-like objects and bottle-like objects, or other objects. The sample object may generally be selected from the items actually to be grasped, but it is not required to entirely cover the kinds of the items actually to be grasped. It is generally possible to select as the sample object an object whose geometry is typical, but the present disclosure does not require that the sample object must cover the shape of all objects to be grasped, and based on the generalization ability of the model, the trained model can still make grasp point estimates for objects of other shapes.
Step 120, rendering a simulated scene of a 3D model loaded with a first object, generating a sample image for training, wherein the first object is selected from the sample objects;
the loaded first object may be selected randomly by the system from the sample objects, or manually, or according to a configured rule. The first object selected may comprise one sample object, may comprise a plurality of sample objects, may comprise one sample object of one shape, and may comprise a plurality of sample objects of a plurality of shapes. The present embodiment is not limited thereto.
Step 130, generating target grabbing quality of the pixel points in the sample image according to the grabbing quality of the sampling points of the first object.
The target grabbing quality of the pixel points in the sample image may cover part of the pixel points in the sample image or all of them; the pixel points may be labeled one by one, or a set of pixel points, such as a region comprising two or more pixel points in the sample image, may be labeled. Since the grabbing quality labeled for a pixel point in the sample image is used as the target data during training, it is referred to herein as the target grabbing quality of the pixel point.
According to the embodiments of the disclosure, the 3D model of the sample object is acquired first, grabbing points are sampled on the 3D model, and the grabbing quality of the sampling points is evaluated; because the geometry of the 3D model is accurate, the evaluation of grabbing quality can be completed with high quality. When the 3D model of the selected first object is loaded to generate the simulated scene, the position and pose of the 3D model can be tracked, so that the positional relationship between the sampling points and the pixel points in the sample image can be calculated and the grabbing quality of the sampling points transferred to the corresponding pixel points in the sample image. The training data generated by the embodiments of the disclosure comprise the sample image and the annotation data (including but not limited to the target grabbing quality); the embodiments therefore realize automatic labeling of the sample image, can generate training data efficiently and with high quality, and avoid the heavy workload and unstable labeling quality caused by manual labeling.
In an exemplary embodiment of the present disclosure, the acquiring a 3D model of a sample object includes: creating or collecting a 3D model of the sample object, and normalizing to enable the mass center of the sample object to be located at the origin of a model coordinate system of the 3D model, wherein the main axis of the sample object is consistent with the direction of a coordinate axis in the model coordinate system. When the 3D model is created in this embodiment, so-called normalization may be embodied as a unified modeling rule, that is, the origin of the model coordinate system is established at the centroid of the sample object, and one coordinate axis in the model coordinate system is made to coincide with the main axis direction of the object. If the acquired 3D model is already created, normalization can be realized by translating and rotating the 3D model, and the requirement that the centroid is located at the origin and the main axis is consistent with the direction of a coordinate axis is met.
In an exemplary embodiment of the disclosure, the sampling of grabbing points based on the 3D model of the sample object includes: performing point cloud sampling on the 3D model of the sample object, and determining and recording a first position and a grabbing direction of each sampling point in the 3D model; the first position is represented by the coordinates of the sampling point in the model coordinate system of the 3D model, and the grabbing direction is determined according to the normal vector of the sampling point in the 3D model. When point cloud sampling is performed on the 3D model, the surface of the sample object can be sampled uniformly, and the specific algorithm is not limited; by setting the number of sampling points appropriately, the sampling points on the surface of the sample object can have a suitable density, so that suitable grabbing points are not missed. In one example, the normal vector of the plane fitted to all points within a set neighborhood of a sampling point may be taken as the normal vector of that sampling point.
In an exemplary embodiment of the present disclosure, the evaluating of the grabbing quality of the sampling points includes: in a scenario where a single suction cup sucks the sample object, estimating the grabbing quality of each sampling point according to the sealing quality and the countering quality of the sampling point. The sealing quality is determined according to the degree of sealing between the end of the suction cup and the surface of the sample object when the suction cup sucks the sample object at the position of the sampling point with its axis aligned with the grabbing direction of the sampling point; the countering quality is determined according to the gravitational moment of the sample object in that situation and the degree to which the moment the suction cup can generate while sucking the object counters that gravitational moment.
In the embodiments of the disclosure, when the suction cup sucks the sample object, the gravitational moment may cause the sample object to rotate and drop (the mass of the sample object is assigned when it is configured), while the suction force of the suction cup on the sample object and the friction force between the end of the suction cup and the sample object provide a moment that opposes the gravitational moment and prevents the sample object from dropping; the suction force and the friction force may be given as configuration information or calculated from it (such as suction cup parameters and object material). The countering degree therefore represents how stable the object is during suction and can be calculated with the corresponding formulas. The sealing quality and the countering quality may be scored separately, and the sum, the average or a weighted average of the two scores used as the grabbing quality of the sampling point. The sealing quality and the countering quality of a sampling point are determined by the local geometric features of the 3D model and fully reflect the relationship between the local geometric information of the object and the quality of the grabbing point, so an accurate assessment of the grabbing quality of the sampling points can be achieved.
Although the embodiments of the disclosure take a single suction cup sucking an object as an example, the disclosure is not limited thereto; for grabbing manners in which an object is sucked at multiple points or clamped at multiple points, the grabbing quality of the sampling points can likewise be evaluated according to indices reflecting grabbing efficiency, object stability and success probability.
In an exemplary embodiment of the present disclosure, the simulated scene is obtained by loading a 3D model of the first object into an initial scene, the loading process comprising:
selecting the type and number of first objects to be loaded from the sample objects, and assigning a mass to each first object;
loading the 3D models of the first objects into the initial scene with random positions and poses;
using a physics engine to simulate the dropping process of the first objects and the finally formed stacking state, to obtain the simulated scene;
recording a second position and a pose of the 3D model of each first object in the simulated scene.
Through this loading process, the embodiments of the disclosure can simulate various object stacking scenes; based on training data generated from such scenes, the trained model is suitable for estimating object grabbing points in complex object stacking scenes, which addresses the difficulty of estimating grabbing points in complex scenes. A simulated material frame can be arranged in the initial scene, the 3D models of the first objects loaded into the material frame, and the collision process between the first objects and between the first objects and the material frame simulated, so that the finally formed simulated scene of the object stack is closer to a real scene. However, the material frame is not required. In other embodiments of the disclosure, a simulated scene in which the first objects are stacked in an orderly manner may also be loaded, depending on the actual working scene to be simulated.
For the same initial scene, the first object may be loaded multiple times in different ways to obtain multiple simulated scenes. The different modes may be, for example, different types and/or numbers of the loaded first objects, different initial positions and initial postures of the 3D model when being loaded, and the like.
In an exemplary embodiment of the present disclosure, the rendering of the simulated scene to generate training sample images includes: rendering each simulated scene at least twice to obtain at least two groups of sample images for training. At each rendering, a simulated camera is added, a light source is set, textures are added to the loaded first objects in the simulated scene, and a 2D image and a depth image are rendered as one group of sample images. Any two of the multiple renderings differ in at least one of the following parameters: the texture of the objects, the simulated camera parameters, or the lighting parameters. In this embodiment the simulated environment is lit while rendering the picture, and the degree of data randomization can be enhanced by adjusting the parameters of the simulated camera (such as intrinsics, position and angle), the lighting parameters (such as color and intensity), the textures of the objects, and so on, which enriches the content of the sample images, increases their number, improves the quality of the training data, and improves the performance of the trained estimation model.
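As an illustration of this randomized rendering, the following sketch uses pyrender and trimesh (a rasterizing renderer, used here only as a lightweight stand-in for the ray-tracing engine mentioned later in the description); the parameter ranges, the flat random base color standing in for a real texture image, and the function name render_randomized are assumptions of the sketch rather than part of the disclosed method.

```python
import numpy as np
import trimesh
import pyrender

def render_randomized(meshes_with_poses, width=640, height=480):
    """One randomized rendering pass over a simulated scene: random camera
    intrinsics/pose, random light color and intensity, and a random surface
    appearance per object (a flat random base color stands in for attaching a
    real texture image, which would need per-mesh UV coordinates)."""
    scene = pyrender.Scene(ambient_light=np.random.uniform(0.1, 0.5, 3))

    for tm, pose in meshes_with_poses:              # (trimesh.Trimesh, 4x4 pose)
        material = pyrender.MetallicRoughnessMaterial(
            baseColorFactor=np.append(np.random.uniform(0.2, 1.0, 3), 1.0))
        scene.add(pyrender.Mesh.from_trimesh(tm, material=material), pose=pose)

    # Randomized simulated-camera intrinsics and a roughly top-down pose.
    fx = fy = np.random.uniform(500, 700)
    camera = pyrender.IntrinsicsCamera(fx=fx, fy=fy, cx=width / 2, cy=height / 2)
    cam_pose = np.eye(4)
    cam_pose[:3, 3] = [np.random.uniform(-0.05, 0.05),
                       np.random.uniform(-0.05, 0.05),
                       np.random.uniform(0.8, 1.2)]
    scene.add(camera, pose=cam_pose)

    # Randomized light color and intensity, co-located with the camera.
    light = pyrender.PointLight(color=np.random.uniform(0.7, 1.0, 3),
                                intensity=np.random.uniform(5.0, 20.0))
    scene.add(light, pose=cam_pose)

    renderer = pyrender.OffscreenRenderer(width, height)
    color, depth = renderer.render(scene)           # one group: 2D image + depth
    renderer.delete()
    return color, depth, camera, cam_pose
```

One such call per rendering pass, with a real texture attached through per-mesh UV coordinates, would produce one group of sample images (2D image plus depth image) with randomized appearance.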
In one example of this embodiment, the adding of textures to the loaded first objects at each rendering includes: at each rendering, for each first object loaded into the simulated scene, randomly selecting one of the collected real textures and attaching it to the surface of that first object; or, for each type of first object loaded into the simulated scene, at each rendering randomly selecting one of the collected real textures and attaching it to the surfaces of the first objects of that type. This example compensates for the domain difference between real data and simulated data through randomization. The real textures may be acquired from images of real objects, images that use real textures, and the like. By randomly attaching the selected textures to the surfaces of the first objects randomly stacked in the simulated scene, multiple images with different textures can be rendered. Providing the object grabbing point estimation model with sample images that have different textures but largely consistent geometric information, while generating the annotation information from the grabbing quality of sampling points computed from local geometric information, drives the estimation model to predict the grabbing quality of grabbing points from local geometric information, so the model can generalize to unknown objects.
In an exemplary embodiment of the present disclosure, the sample image includes a 2D image and a depth image; the generating the target grabbing quality of the pixel point in the sample image according to the grabbing quality of the sampling point of the first object includes:
each simulated scene of the rendered 2D image and depth image is processed as follows, as shown in fig. 2:
step 210, obtaining a point cloud of a first object visible in the simulated scene according to the internal parameters of the simulated camera during rendering and the rendered depth image;
step 220, determining the position of a target sampling point in the point cloud according to the first position of the target sampling point in the 3D model, the second position of the 3D model in the simulated scene and the loaded gesture change, wherein the target sampling point refers to the sampling point of the visible first object;
step 230, determining the grabbing quality of the point in the point cloud according to the grabbing quality of the target sampling point and the position in the point cloud, and marking the grabbing quality as the target grabbing quality of the corresponding pixel point in the 2D image.
The point cloud of the visible first objects, obtained from the simulated camera intrinsics and the rendered depth image (and possibly other information), is not necessarily aligned with the positions of the target sampling points calculated from the first position, the second position and the change in pose described above. The points in the point cloud and the pixel points in the 2D image have a pixel-level one-to-one correspondence, but when a target sampling point is mapped into the 2D image it does not necessarily coincide with any pixel point and may fall between pixel points. It is therefore necessary to determine the grabbing quality of the points in the point cloud from the grabbing quality and the positions of the target sampling points.
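A minimal sketch of step 210 follows, back-projecting the rendered depth image with the simulated camera intrinsics into an organized point cloud whose entry (v, u) corresponds one-to-one with pixel (v, u) of the 2D image; an undistorted pinhole camera model is an assumption of the sketch.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a rendered depth image (H x W, metres) into an organized
    point cloud in the simulated-camera frame.

    Each pixel (u, v) with depth d maps to
        X = (u - cx) * d / fx,  Y = (v - cy) * d / fy,  Z = d,
    so points[v, u] corresponds one-to-one with pixel (v, u) of the 2D image.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1)      # (H, W, 3)
    valid = z > 0                              # zero depth = background / no hit
    return points, valid
```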
In an exemplary embodiment of the present disclosure, the determining the grabbing quality of the points in the point cloud according to the grabbing quality of the target sampling point and the position in the point cloud includes:
firstly, for each target sampling point, determining the grabbing quality of a point adjacent to the target sampling point in the point cloud as the grabbing quality of the target sampling point; or alternatively
Secondly, interpolating a point in the point cloud according to the grabbing quality of a target sampling point adjacent to the point to obtain the grabbing quality of the point; or alternatively
Thirdly, for each target sampling point, determining the grabbing quality of a point adjacent to the target sampling point in the point cloud as the grabbing quality of the target sampling point, and obtaining the grabbing quality of other points in the point cloud through interpolation after the grabbing quality of the points adjacent to all the target sampling points is determined.
The embodiments of the disclosure provide several ways of transferring the grabbing quality of the target sampling points to the points of the point cloud. The first is to assign the grabbing quality of a target sampling point to the points adjacent to it in the point cloud. In one example, the adjacent points may be the one or more points in the point cloud closest to the target sampling point; for example, according to a set distance threshold, the points in the point cloud whose distance to the target sampling point is smaller than the threshold may be selected as the points adjacent to that target sampling point. The second way is interpolation: the grabbing quality of a point in the point cloud is interpolated from the grabbing quality of several adjacent target sampling points. A Gaussian-filtering-based interpolation may be used, or different weights may be given to the target sampling points according to their distances to the point (the larger the distance, the smaller the weight), with the grabbing quality of the target sampling points weighted and averaged using these weights to obtain the grabbing quality of the point; other interpolation methods may also be used. The points adjacent to a point can also be screened with a set distance threshold, and if a point finds only one adjacent target sampling point, the grabbing quality of that target sampling point can be assigned to the point. In the third way, after the grabbing quality of the points adjacent to the target sampling points has been determined, the grabbing quality of the other points in the point cloud is obtained by interpolation from the grabbing quality of these points. The second and third ways yield the grabbing quality of all points in the point cloud; after mapping it to the grabbing quality of the corresponding pixel points in the 2D image, a grabbing quality heat map of the 2D image can be drawn. In the first way, only the grabbing quality of some points in the point cloud is obtained, and mapping then yields the grabbing quality of only some pixel points in the 2D image; in that case, during training, only the predicted grabbing quality of these pixel points is compared with the target grabbing quality to calculate the loss, and the model is optimized according to that loss.
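The following sketch illustrates the first way (nearest-neighbor assignment with a distance threshold) and the second way (distance-weighted interpolation) of transferring grabbing quality to the point cloud; the radius, the number of neighbors and the inverse-distance weighting are illustrative assumptions, and a Gaussian-filtering-based interpolation could be substituted.

```python
import numpy as np
from scipy.spatial import cKDTree

def transfer_grasp_quality(cloud_pts, sample_pts, sample_quality, radius=0.005):
    """Propagate per-sampling-point grabbing quality onto the rendered point cloud.

    cloud_pts:      (N, 3) points of the visible first objects
    sample_pts:     (M, 3) target sampling points, already in the same frame
    sample_quality: (M,)   evaluated grabbing quality of each sampling point
    radius:         distance threshold for "adjacent" points (assumed value)
    """
    tree = cKDTree(sample_pts)

    # Option 1: give every cloud point within `radius` of its nearest sampling
    # point that sampling point's quality; other points stay unlabeled (NaN).
    dist, idx = tree.query(cloud_pts, k=1)
    nn_quality = np.full(len(cloud_pts), np.nan)
    near = dist < radius
    nn_quality[near] = sample_quality[idx[near]]

    # Option 2: inverse-distance-weighted interpolation over the k nearest
    # sampling points (assumes at least k sampling points exist), so that
    # every cloud point receives a quality value.
    k = 4
    dist_k, idx_k = tree.query(cloud_pts, k=k)
    w = 1.0 / np.maximum(dist_k, 1e-6)
    w /= w.sum(axis=1, keepdims=True)
    interp_quality = (w * sample_quality[idx_k]).sum(axis=1)

    return nn_quality, interp_quality
```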
In an exemplary embodiment of the disclosure, after the grabbing quality of the points in the point cloud is determined according to the grabbing quality and the positions of the target sampling points, the generating method further includes: for each target sampling point, taking the grabbing direction of the target sampling point as the grabbing direction of the points adjacent to it in the point cloud, and, combined with the relative positional relationships between the visible first objects in the point cloud, lowering the grabbing quality of the points in the point cloud whose distance to the target sampling point is smaller than a set distance threshold when it is determined that the grabbing space at the points adjacent to the target sampling point is smaller than the required grabbing space. In the embodiments of the disclosure, a grabbing point of good quality on an object in a stacked state may not have enough space for the grabbing operation because of neighboring objects; therefore, after the grabbing quality of the points in the point cloud is determined, the grabbing space is checked, and the grabbing quality of points affected by insufficient grabbing space is lowered, for example below a set quality threshold, so that these points are not selected.
In an exemplary embodiment of the disclosure, the sample image includes a 2D image, and the generating method further includes: labeling a classification for each pixel point in the 2D image, wherein the classification comprises foreground and background, the foreground being the first objects in the image. The pixel classification can be used to train the estimation model to distinguish foreground from background and to accurately screen the foreground points (i.e., the points on the first objects) from the sample image input to the estimation model, so that grabbing quality only needs to be predicted for foreground points. The classification of pixel points in the 2D image can also be derived from the classification of points in the point cloud: by mapping the boundary points between the first objects and the background in the simulated scene onto the point cloud, the classification of each point in the point cloud, i.e., whether it is foreground or background, can be determined.
An embodiment of the present disclosure further provides a method for generating training data of an object grabbing point estimation model, including:
step one, collecting 3D models of various sample objects, and normalizing the 3D models so that an origin of a model coordinate system is arranged at the mass center of the sample object, and one coordinate axis of the model coordinate system is consistent with a main axis of the sample object.
Using a 3D model in a format such as STereoLithography (STL), the vertex and face information in the 3D model can be read, and the centroid position of the sample object obtained by computing the center of all vertices. The origin of the model coordinate system is then translated to the centroid position of the sample object. In addition, a principal component analysis (PCA) method may be used to determine the principal axis direction of the sample object, and the 3D model of the sample object is then rotated so that one coordinate axis of the model coordinate system points in the same direction as the principal axis of the sample object. This yields the normalized 3D model, in which the origin of the model coordinate system is at the centroid of the sample object and one coordinate axis of the model coordinate system is aligned with the principal axis of the sample object.
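A sketch of this normalization step follows, under the stated assumptions (a single mesh loaded with trimesh, the principal axis aligned to the x axis); it is meant only to illustrate the centroid translation and PCA rotation described above.

```python
import numpy as np
import trimesh

def normalize_model(path):
    """Normalize a 3D model: move the origin of the model coordinate system to
    the centroid of the vertices and rotate the model so that its principal
    axis (from PCA) lies along the x axis."""
    mesh = trimesh.load(path)                      # assumed to be a single mesh
    verts = np.asarray(mesh.vertices)

    # Centroid as the mean of all vertices, then translate it to the origin.
    centroid = verts.mean(axis=0)
    verts = verts - centroid

    # PCA: the eigenvector with the largest eigenvalue of the covariance
    # matrix is the principal axis of the sample object.
    cov = np.cov(verts.T)
    eigvals, eigvecs = np.linalg.eigh(cov)
    axes = eigvecs[:, np.argsort(eigvals)[::-1]]   # columns sorted by variance
    if np.linalg.det(axes) < 0:                    # keep a right-handed frame
        axes[:, -1] *= -1

    # Express the vertices in the principal-axes basis, so the principal axis
    # of the object coincides with the x axis of the model coordinate system.
    mesh.vertices = verts @ axes
    return mesh
```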
Step two, sampling the grabbing points of the 3D model of the sample object to obtain a first position and grabbing direction of each sampling point and recording the first position and grabbing direction;
the sampling process of the embodiment is to sample the point cloud of the object model, and use the sampled point cloud to estimate the normal vector with a fixed neighborhood, wherein each point and its normal vector represent a sampling point. In one example, taking a scene of a single suction cup sucking an object as an example, the grabbing point is a sucking point at this time. Based on the existing vertices of the object, a set number of sampling points are obtained using a voxel sampling method or other sampling method (e.g., furthest point sampling). And simultaneously estimating the normal vector direction of each sampling point by using all points in a certain range neighborhood where the sampling point is located. The method for estimating the normal vector may be to use a random sampling consensus algorithm (Random sample consensus, RANSAC for short) or the like to fit all points in the neighborhood of the sampling point to estimate a plane, and the normal vector of the plane is approximate to the normal vector of the sampling point.
Step three, carrying out suction quality evaluation on the sampling points;
In the case of a single suction cup sucking an object, the quality evaluation process includes calculating the sealing quality during suction and the countering quality with respect to the gravitational moment during suction (the gravitational moment must be countered to achieve a stable grasp), and estimating the grabbing quality of each sampling point from its sealing quality and countering quality. In one example, for a scene sucked with a single suction cup, it is evaluated whether the suction point of the sample (i.e., the sampling point) is one at which the sample object can be stably sucked up.
The assessment covers two aspects. The first is the sealing quality, which can be measured as follows: the end of the suction cup, with its set radius, is approximated by a polygon; the polygon is projected onto the surface of the 3D model along the grabbing direction of the sampling point; and the total side length of the projected polygon is compared with the original side length. If the total side length increases substantially after projection, the seal is poor; if it changes little, the seal is good. The degree of increase can be expressed as the ratio of the increase to the original side length, and a corresponding score given according to the interval into which this ratio falls. The second aspect is the countering quality with respect to the gravitational moment when the suction cup sucks the sample object at the sampling point along the grabbing direction (which may also be called the suction point direction). The countering quality can be computed with a "wrench resistance" modeling scheme: a "wrench" is a six-dimensional vector whose first three dimensions are force and last three dimensions are moment, the space formed by such vectors is the "wrench space", and "wrench resistance" indicates whether the wrench of forces and moments acting at a point can be resisted. If the gravitational moment is contained in the wrench space provided by the suction force and the moment created by its friction, stable suction can be provided; otherwise it cannot. Finally, the sealing quality and the countering quality are each regularized to a score between 0 and 1 and summed, giving a suction quality evaluation result for each suction point.
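A rough sketch of the sealing-quality scoring and the final score combination described above is given below, under simplifying assumptions: the cup rim is a 16-sided polygon, the projection onto the surface is approximated by a closest-point query rather than ray casting along the grabbing direction, the growth-to-score mapping is an arbitrary linear ramp, and the wrench-resistance computation is assumed to be performed elsewhere and passed in as a score.

```python
import numpy as np
import trimesh

def seal_quality(mesh, point, direction, cup_radius=0.01, n_edges=16):
    """Sealing-quality sketch: approximate the suction-cup rim by an n-sided
    polygon, project its vertices onto the object surface, and score the
    relative growth of the perimeter (small growth = good seal)."""
    direction = direction / np.linalg.norm(direction)
    # Orthonormal basis (u, v) of the plane perpendicular to the grabbing direction.
    u = np.cross(direction, [0.0, 0.0, 1.0])
    if np.linalg.norm(u) < 1e-6:
        u = np.cross(direction, [0.0, 1.0, 0.0])
    u /= np.linalg.norm(u)
    v = np.cross(direction, u)

    angles = np.linspace(0.0, 2.0 * np.pi, n_edges, endpoint=False)
    rim = point + cup_radius * (np.cos(angles)[:, None] * u +
                                np.sin(angles)[:, None] * v)

    # Closest-point projection onto the mesh surface (stand-in for ray casting).
    proj, _, _ = trimesh.proximity.closest_point(mesh, rim)

    perimeter = lambda poly: np.linalg.norm(np.roll(poly, -1, axis=0) - poly,
                                            axis=1).sum()
    growth = (perimeter(proj) - perimeter(rim)) / perimeter(rim)
    # Map the growth ratio to a score in [0, 1]; 0.2 is an arbitrary cutoff.
    return float(np.clip(1.0 - growth / 0.2, 0.0, 1.0))

def suction_quality(seal_score, wrench_score):
    """Regularize both scores to [0, 1] and sum them, as described above."""
    return float(np.clip(seal_score, 0, 1) + np.clip(wrench_score, 0, 1))
```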
Step four, constructing an initial simulated data acquisition scene, namely the initial scene; loading a plurality of first objects selected from the sample objects into the initial scene; and simulating the dropping dynamics and the final stacking poses of the first objects using a physics engine.
This step is based on a physics engine and related simulation software that can simulate object dynamics. A simulated material frame is added and kept static in the simulation environment to provide a corresponding collision base. The 3D models of the first objects are loaded into the simulation environment with random positions and poses, and each 3D model is assigned a certain mass. Through simulation by the physics engine, the 3D models of the first objects fall randomly into the material frame under simulated gravity, and the physics engine computes the collision information between the different first objects, so that the first objects form a stacking state very close to a real scene. Based on such a scheme, the second position and pose of each first object in a near-realistic random stack are obtained in the simulated scene.
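A sketch of this drop simulation using pybullet follows; the tray URDF from pybullet_data standing in for the material frame, the object count, the assigned mass and the number of simulation steps are illustrative assumptions, and concave meshes may need a convex decomposition in practice.

```python
import numpy as np
import pybullet as p
import pybullet_data

def simulate_stack(mesh_files, n_objects=10, mass=0.2, steps=2000):
    """Load first objects with random positions and orientations above a static
    collision base, let the physics engine simulate gravity and collisions,
    then record each object's final (second) position and pose."""
    p.connect(p.DIRECT)                            # headless physics server
    p.setAdditionalSearchPath(pybullet_data.getDataPath())
    p.setGravity(0, 0, -9.81)
    p.loadURDF("tray/traybox.urdf", useFixedBase=True)   # static material frame

    bodies = []
    for _ in range(n_objects):
        mesh = np.random.choice(mesh_files)
        col = p.createCollisionShape(p.GEOM_MESH, fileName=mesh)
        pos = [np.random.uniform(-0.1, 0.1),
               np.random.uniform(-0.1, 0.1),
               np.random.uniform(0.3, 0.6)]
        orn = p.getQuaternionFromEuler(np.random.uniform(0, 2 * np.pi, 3).tolist())
        body = p.createMultiBody(baseMass=mass,
                                 baseCollisionShapeIndex=col,
                                 basePosition=pos,
                                 baseOrientation=orn)
        bodies.append((mesh, body))

    for _ in range(steps):                         # let the objects fall and settle
        p.stepSimulation()

    poses = [(mesh, *p.getBasePositionAndOrientation(body))
             for mesh, body in bodies]
    p.disconnect()
    return poses                                   # [(mesh, position, quaternion), ...]
```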
Step five, generating annotation data for the sample image rendered from the simulated scene according to the grabbing quality of the sampling points.
This step maps the sampling points obtained by grabbing point sampling on the 3D model, together with the evaluated grabbing quality of each sampling point, into the simulated scene of stacked objects. Since the second position and pose of the 3D model of each first object are recorded when the simulated scene is built, and the positions of the sampling points are expressed in the model coordinate system of the 3D model, it is easy to calculate the positions of the sampling points in the simulated scene.
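The mapping of sampling points into the simulated scene is a plain rigid-body transform; the following sketch assumes the recorded second position and pose (and, optionally, the simulated camera pose) are given as 4 x 4 homogeneous matrices.

```python
import numpy as np

def sampling_points_in_scene(points_model, model_pose, camera_pose=None):
    """Transform recorded sampling points from the model coordinate system into
    the simulated scene using the recorded second position and pose, and
    optionally on into the simulated-camera frame."""
    pts_h = np.hstack([points_model, np.ones((len(points_model), 1))])
    pts_scene = (model_pose @ pts_h.T).T[:, :3]
    if camera_pose is None:
        return pts_scene
    world_to_cam = np.linalg.inv(camera_pose)
    pts_cam = (world_to_cam @ np.hstack(
        [pts_scene, np.ones((len(pts_scene), 1))]).T).T[:, :3]
    return pts_cam
```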
To render the simulated scene, a simulated camera is added at a set position in the simulation environment, and a 2D image (such as a texture image) and a depth map of the first objects in the stacked scene are rendered efficiently using a ray-tracing-based rendering engine. The rendered depth image can be converted into a point cloud by combining it with the simulated camera intrinsics. Based on the calculated positions of the sampling points in the simulated scene and the rendered point cloud of the first objects, the position of each sampling point in the point cloud can be determined.
A domain randomization technique is used to make up for the domain difference between real data and simulated data. In this embodiment, various real textures (such as pictures of real objects or pictures with regular textures) are collected and randomly attached to the surfaces of the randomly stacked first objects in the simulation environment. In the ray-tracing-based rendering process of the simulated camera, 2D images with different textures can thus be rendered. By providing the estimation model with 2D images that have different textures but the same target grabbing quality of the pixel points, the estimation model is driven to predict the grabbing quality of pixel points from the local geometric information of the objects, which gives the model generalization ability for different unknown objects.
Before the grabbing quality of the sampling points is transferred to the points in the point cloud, the quality values of the sampling points may be Gaussian filtered according to their positions within the same first object. By finding the points adjacent to each sampling point in the point cloud, the points within a set neighborhood of a sampling point (i.e., the points adjacent to it) obtain the grabbing quality of that sampling point. The rendered point cloud and the rendered 2D image have a pixel-level one-to-one correspondence, so the grabbing quality of the adjacent points can be marked as the target grabbing quality of the corresponding pixel points in the 2D image. In one example, the target grabbing quality of the other pixel points in the 2D image may be obtained by interpolation. In addition, by combining the grabbing directions of the sampling points with the local geometric information of the point cloud of the first objects (such as the relative positions and distances between the first objects), the grabbing quality of pixel points with insufficient grabbing space can be lowered, so that low-quality grabbing points that would cause collisions can be filtered out when the optimal grabbing point is selected. Note that this adjustment of grabbing quality may also be performed on the corresponding pixel points in the 2D image.
In this way, a grabbing quality heat map of the 2D image rendered from the simulated scene is obtained. Optionally, the grabbing quality heat map is output as the annotation data of the sample image; however, the annotation data need not take the form of a heat map, as long as it contains the target grabbing quality of the pixel points in the 2D image. When the grabbing quality heat map is used as annotation data, the estimation model can be driven to learn, or fit, the grabbing quality heat map during training.
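A small sketch of assembling the annotation heat map from the labeled pixel points is given below; the Gaussian smoothing and the normalization to [0, 1] are illustrative choices, not requirements of the method.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def quality_heat_map(height, width, pixel_uv, pixel_quality, sigma=3.0):
    """Rasterize the per-pixel target grabbing quality into an H x W heat map
    used as annotation data; a Gaussian blur spreads sparse labels to nearby
    pixels."""
    heat = np.zeros((height, width), dtype=np.float32)
    u = np.clip(pixel_uv[:, 0].astype(int), 0, width - 1)
    v = np.clip(pixel_uv[:, 1].astype(int), 0, height - 1)
    heat[v, u] = pixel_quality                     # labeled foreground pixels
    heat = gaussian_filter(heat, sigma=sigma)
    if heat.max() > 0:
        heat /= heat.max()                         # keep values in [0, 1]
    return heat
```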
An embodiment of the present disclosure further provides a device for generating training data of an object grabbing point estimation model, as shown in fig. 3, including a processor 60 and a memory 50 storing a computer program, where the processor 60 implements the method for generating training data of an object grabbing point estimation model according to any embodiment of the present disclosure when executing the computer program. The processor in this and other embodiments of the disclosure may be an integrated circuit chip having signal processing capabilities. The processor may be a general-purpose processor, such as a central processing unit (CPU) or a network processor (NP); it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The methods, steps and logical blocks disclosed in the embodiments of the present disclosure may be implemented or performed by such a processor. The general-purpose processor may be a microprocessor or any conventional processor.
The above embodiments of the present disclosure may have the following advantages:
Synthetic data generation and automatic labeling of the synthetic data replace manual data acquisition and labeling, which reduces cost, raises the degree of automation, and ensures the data quality and the accuracy of the grabbing point labels to a higher degree.
In the process of labeling the synthetic data, the physical model of the object (i.e., the 3D model) and the object's geometric information are used, and the quality of the grabbing points is evaluated based on basic physical principles, which ensures the reasonableness of the grabbing point labels.
In the synthetic data generation process, a domain randomization technique is used: a large amount of synthetic data is generated through texture randomization, illumination randomization, camera position randomization and similar means to train the estimation model. The estimation model can therefore cross the domain gap between synthetic data and real data and learn the local geometric features of objects, so as to accurately complete the object grabbing point estimation task.
An embodiment of the present disclosure further provides a training method of an estimation model of an object capturing point, as shown in fig. 4, including:
step 310, obtaining training data, wherein the training data comprises a sample image and target grabbing quality of pixel points in the sample image;
Step 320, training an estimation model of the object grabbing point by taking the sample image as input data in a machine learning manner, and, during training, calculating the loss according to the difference between the predicted grabbing quality of the pixel points in the sample image output by the estimation model and the target grabbing quality.
According to the training method of the estimation model of this embodiment, the grabbing quality of pixel points in the 2D image is learned, and the optimal grabbing point is then selected according to the predicted grabbing quality of the pixel points in the 2D image; compared with directly regressing the optimal grabbing point, this has better precision and stability.
The machine learning of the embodiments of the present disclosure may be supervised deep learning, as well as non-deep learning machine learning, and the like.
In an exemplary embodiment of the present disclosure, the training data is generated according to a method for generating training data of an object grabbing point estimation model according to any embodiment of the present disclosure.
In an exemplary embodiment of the disclosure, the network architecture of the estimation model is shown in fig. 5, and includes:
a backbone network (Backbone) 10, employing a semantic segmentation network architecture (such as DeepLab, UNet, etc.), configured to extract features from the input 2D image and depth image;
The multi-branch network 20, which adopts a multi-task learning network architecture, is configured to predict based on the extracted features, so as to output the predicted capture quality of the pixels in the 2D image.
In this example, the multi-branch network (which may also be referred to as a network head or a detection head) includes:
a first branch network 21, which learns semantic segmentation information to distinguish foreground from background and is configured to output a classification confidence for each pixel point in the 2D image, where the classification includes foreground and background; and
a second branch network 23, which learns the grabbing quality information of the pixels in the 2D image and is configured to output the predicted grabbing quality of the pixels in the 2D image that are classified as foreground according to the classification confidence. For example, pixels whose foreground classification confidence is greater than a set confidence threshold may be regarded as foreground pixels.
Since classification is involved in this example, the training data also needs to include classification data.
In this example, the sample image includes a 2D image and a depth image; the backbone network 10 and the multi-branch network 20 each include a depth channel, and the convolution layers may adopt a 3D convolution structure.
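As a concrete but non-authoritative illustration of this backbone-plus-two-branch layout, the following PyTorch-style sketch uses a small stand-in encoder and 1 x 1 convolution heads; the layer sizes, the RGB-D channel concatenation at the input, and all names are assumptions made for readability rather than the architecture prescribed by this embodiment.

import torch
import torch.nn as nn

class GraspQualityNet(nn.Module):
    """Illustrative sketch of the backbone + multi-branch layout (sizes are assumed)."""

    def __init__(self, in_channels: int = 4):  # 3 color channels + 1 depth channel
        super().__init__()
        # Backbone: stand-in for a semantic segmentation encoder such as UNet/DeepLab.
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Branch 1: per-pixel foreground/background classification logits.
        self.seg_head = nn.Conv2d(64, 2, kernel_size=1)
        # Branch 2: per-pixel grabbing quality regression in [0, 1].
        self.quality_head = nn.Sequential(nn.Conv2d(64, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor):
        # Fuse the depth channel with the color image along the channel dimension.
        x = torch.cat([rgb, depth], dim=1)
        feats = self.backbone(x)
        return self.seg_head(feats), self.quality_head(feats)

# Shape check with a dummy RGB-D input.
model = GraspQualityNet()
seg_logits, quality = model(torch.randn(1, 3, 128, 128), torch.randn(1, 1, 128, 128))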
In this example, during training, the loss of the first branch network 21 is calculated based on the classification loss of all pixels in the 2D image; the loss of the second branch network 23 is calculated based on the difference between the predicted grabbing quality and the target grabbing quality of some or all pixels classified as foreground; and the loss of the backbone network 10 is calculated from the total loss of the first branch network 21 and the second branch network 23. After the loss of each network is calculated, the parameters of each network can be optimized using a gradient descent algorithm until the loss is minimized and the model converges. During training, random square occlusion can be applied to the depth image, for example masking a 64 x 64 pixel block at a time, so that the network can better exploit the structural information in the depth image.
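A minimal sketch of this loss combination and of the random square occlusion is given below, assuming a cross-entropy term for the classification branch, an L1 term for the quality branch, and equal weighting; these choices and the helper names are illustrative assumptions, not the required implementation.

import torch
import torch.nn.functional as F

def total_loss(seg_logits, quality_pred, fg_bg_target, quality_target):
    """seg_logits: (N, 2, H, W); quality_pred / quality_target: (N, 1, H, W);
    fg_bg_target: (N, H, W) with 1 = foreground, 0 = background."""
    # Branch 1: classification loss over all pixels.
    cls_loss = F.cross_entropy(seg_logits, fg_bg_target.long())
    # Branch 2: quality loss over foreground pixels only.
    fg_mask = (fg_bg_target == 1).unsqueeze(1).float()
    quality_loss = (F.l1_loss(quality_pred, quality_target, reduction="none")
                    * fg_mask).sum() / fg_mask.sum().clamp(min=1.0)
    # Backbone loss: total of both branch losses (equal weights assumed here).
    return cls_loss + quality_loss

def occlude_depth(depth, block: int = 64):
    """Randomly mask one block x block square of the depth image (on a copy)."""
    depth = depth.clone()
    _, _, h, w = depth.shape
    y = torch.randint(0, max(h - block, 1), (1,)).item()
    x = torch.randint(0, max(w - block, 1), (1,)).item()
    depth[:, :, y:y + block, x:x + block] = 0.0
    return depth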
After the estimation model is iteratively trained on the training data for a number of rounds, validation data is used to verify the accuracy of the trained estimation model; the validation data can be generated by the same method as the training data. Once the accuracy of the estimation model meets the requirement, the estimation model is considered trained and ready for use; if the accuracy does not meet the requirement, training continues. In use, a 2D image and a depth image containing the actual objects to be grabbed are input, and the predicted grabbing quality of the pixel points in the 2D image is output.
The embodiments of the present disclosure construct the grabbing point estimation model using a multi-task learning framework based on deep learning principles, which can effectively solve the problems that a simple point cloud segmentation scheme has a high error rate and cannot distinguish adjacent objects.
An embodiment of the present disclosure further provides a training device for an estimation model of an object grabbing point, referring to fig. 3, including a processor and a memory storing a computer program, where the processor implements the training method for the estimation model of the object grabbing point according to any embodiment of the present disclosure when executing the computer program.
The training method of the estimation model predicts the grabbing quality of the pixel points in the 2D image through pixel-level dense prediction. One branch performs pixel-level classification of foreground and background. The other branch outputs a predicted value of the grabbing quality, i.e., the predicted grabbing quality, for each pixel in the 2D image classified as foreground. Both the backbone network and the branch networks of the estimation model in the embodiments of the present disclosure include depth channels: at the input end, a depth image containing depth channel information is fed into the backbone network, the features learned by the depth channels are then fused with the features of the color 2D image along the channel dimension, and pixel-wise multi-task prediction is performed. This helps the estimation model better handle the grabbing point estimation task in scenes where the objects to be grabbed are stacked.
An embodiment of the present disclosure further provides a method for estimating an object grabbing point, as shown in fig. 6, including:
step 410, acquiring a scene image containing an object to be grabbed, wherein the scene image comprises a 2D image or comprises a 2D image and a depth image;
step 420, inputting the scene image into an estimation model of an object grabbing point, wherein the estimation model is an estimation model trained by the training method according to any embodiment of the present disclosure;
and step 430, determining the position of the grabbing point of the object to be grabbed according to the predicted grabbing quality of the pixel point in the 2D image output by the estimation model.
The embodiments of the present disclosure implement camera drivers, and the scene images, such as the 2D image and the depth image of the objects to be grabbed, can be captured by depth cameras adapted to various industrial scenes. The color 2D image and the depth image are obtained from the depth camera, cropped and scaled to the image size required by the input of the estimation model, and then fed into the estimation model.
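A minimal preprocessing sketch under stated assumptions: OpenCV-style NumPy images, a caller-supplied crop region, and a hypothetical model input size of 512 x 512; the actual crop, scale, and normalization used with a given depth camera may differ.

import numpy as np
import cv2  # assumed available; any image library with crop/resize would do

def preprocess(rgb: np.ndarray, depth: np.ndarray, roi, size=(512, 512)):
    """Crop the RGB and depth images to the region of interest and scale them
    to the input size expected by the estimation model (size is an assumption)."""
    x, y, w, h = roi
    rgb_crop = rgb[y:y + h, x:x + w]
    depth_crop = depth[y:y + h, x:x + w]
    rgb_in = cv2.resize(rgb_crop, size, interpolation=cv2.INTER_LINEAR)
    depth_in = cv2.resize(depth_crop, size, interpolation=cv2.INTER_NEAREST)
    return rgb_in.astype(np.float32) / 255.0, depth_in.astype(np.float32)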
In an exemplary embodiment of the present disclosure, determining the position of the grabbing point of the object to be grabbed according to the predicted grabbing quality of the pixel point in the 2D image output by the estimation model includes:
Selecting all or part of pixel points of which the predicted grabbing quality is greater than a set quality threshold value in the object to be grabbed;
clustering the selected pixel points, calculating one or more class centers, and taking the pixel points corresponding to the class centers as candidate grabbing points of the object to be grabbed;
and sequencing the obtained candidate grabbing points based on a preset rule, and determining the optimal one candidate grabbing point as the grabbing point of the object to be grabbed according to the sequencing.
In an exemplary embodiment of the present disclosure, when the obtained candidate grabbing points are ordered based on a predetermined rule, they may be ordered based on predetermined heuristic rules. The heuristic rules may consider, for example, the distance between a candidate grabbing point and the camera, whether the suction point lies inside the actual material frame, and whether suction at the point would cause a collision. The candidate grabbing points are ranked using this information, and the best candidate grabbing point is determined as the grabbing point of the object to be grabbed.
An embodiment of the present disclosure further provides an apparatus for estimating an object grabbing point, referring to fig. 3, including a processor and a memory storing a computer program, where the processor implements the method for estimating an object grabbing point according to any embodiment of the present disclosure when executing the computer program.
In the embodiments of the present disclosure, based on the trained estimation model, the 2D image and the depth image captured by the camera are fed into the estimation model for forward inference, and the predicted grabbing quality of the pixel points in the 2D image is output. If the number of pixels whose predicted grabbing quality is greater than the set quality threshold exceeds a set number, a set number of pixels with the best predicted grabbing quality, such as the top 50 or top 100, may be selected. After the selected pixel points are clustered and one or more class centers are calculated, the pixel point in the 2D image closest to a class center (which may be one pixel point or one pixel point in a region) may be used as a candidate grabbing point. The adopted estimation model can reach good accuracy, so the estimation method and device can improve the accuracy of object grabbing point estimation, and further improve the grabbing success rate.
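The post-processing chain described above (quality threshold, optional top-K selection, clustering, and taking the pixel nearest each class center as a candidate grabbing point) might look roughly as follows; the use of k-means from scikit-learn, the threshold 0.8, K = 100, and the cluster count are assumptions for illustration, and the heuristic ranking step is left to the caller.

import numpy as np
from sklearn.cluster import KMeans  # assumed available; any clustering method works

def candidate_grasp_points(quality_map, quality_thresh=0.8, top_k=100, n_clusters=3):
    """quality_map: (H, W) predicted grabbing quality per pixel of the 2D image."""
    ys, xs = np.where(quality_map > quality_thresh)
    if len(xs) == 0:
        return []
    pts = np.stack([xs, ys], axis=1)
    # If too many pixels pass the threshold, keep only the top-K by quality.
    if len(pts) > top_k:
        order = np.argsort(quality_map[ys, xs])[::-1][:top_k]
        pts = pts[order]
    # Cluster the surviving pixels and take the pixel nearest each class center.
    k = min(n_clusters, len(pts))
    centers = KMeans(n_clusters=k, n_init=10).fit(pts.astype(np.float32)).cluster_centers_
    candidates = []
    for c in centers:
        nearest = pts[np.argmin(np.linalg.norm(pts - c, axis=1))]
        candidates.append(tuple(int(v) for v in nearest))
    return candidates  # rank these with heuristic rules (camera distance, collisions, ...)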
An embodiment of the present disclosure further provides a robot vision system, as shown in fig. 7, including:
a camera 1 arranged to capture a scene image comprising an object to be grabbed, the scene image comprising a 2D image, or comprising a 2D image and a depth image;
control means 2 comprising the estimation means of object gripping points according to claim 20, said control means being arranged to determine the position of the gripping points of the object to be gripped from the scene image taken by the camera; and controlling the grabbing action executed by the robot according to the position of the grabbing point;
A robot 3 arranged to perform said gripping action.
The robot vision system disclosed by the embodiment of the disclosure can improve the accuracy of object grabbing point estimation, so that the grabbing success rate is improved.
An embodiment of the present disclosure further provides a non-transitory computer readable storage medium, where the computer readable storage medium stores a computer program, where the computer program, when executed by a processor, implements a method for generating training data of an object grabbing point estimation model according to any embodiment of the present disclosure, or implements a method for training an object grabbing point estimation model according to any embodiment of the present disclosure, or implements a method for estimating an object grabbing point according to any embodiment of the present disclosure.
In any one or more of the above-described exemplary embodiments of the present disclosure, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium, and executed by a hardware-based processing unit. The computer-readable medium may comprise a computer-readable storage medium corresponding to a tangible medium, such as a data storage medium, or a communication medium that facilitates transfer of a computer program from one place to another, such as according to a communication protocol. In this manner, a computer-readable medium may generally correspond to a non-transitory tangible computer-readable storage medium or a communication medium such as a signal or carrier wave. Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code and/or data structures for implementing the techniques described in this disclosure. The computer program product may include a computer-readable medium.
By way of example, and not limitation, such computer-readable storage media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, flash memory, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. Moreover, any connection may also be termed a computer-readable medium; for example, if the instructions are transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. It should be appreciated, however, that computer-readable storage media and data storage media do not include connections, carrier waves, signals, or other transitory media, but are instead directed to non-transitory tangible storage media. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The instructions may be executed by one or more processors, such as one or more digital signal processors (DSPs), general-purpose microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other equivalent integrated or discrete logic circuitry. Thus, the term "processor" as used herein may refer to any of the foregoing structures or any other structure suitable for implementation of the techniques described herein. Additionally, in some aspects, the functionality described herein may be provided within dedicated hardware and/or software modules configured for encoding and decoding, or incorporated in a combined codec. Also, the techniques may be fully implemented in one or more circuits or logic elements.
The technical solutions of the embodiments of the present disclosure may be implemented in a wide variety of devices or apparatuses, including wireless handsets, integrated circuits (ICs), or a set of ICs (e.g., a chipset). Various components, modules, or units are described in this disclosure to emphasize functional aspects of devices configured to perform the described techniques, but they do not necessarily require realization by different hardware units. Rather, as described above, the various units may be combined in a codec hardware unit or provided by a collection of interoperable hardware units (including one or more processors as described above) in combination with suitable software and/or firmware.

Claims (22)

1. A method for generating training data of an object grabbing point estimation model comprises the following steps:
acquiring a 3D model of a sample object, sampling a grabbing point based on the 3D model of the sample object, and evaluating grabbing quality of the sampling point;
rendering a simulated scene loaded with a 3D model of a first object to generate a sample image for training, wherein the first object is selected from the sample objects;
and generating target grabbing quality of the pixel points in the sample image according to the grabbing quality of the sampling points of the first object.
2. The method of generating according to claim 1, wherein:
the acquiring a 3D model of a sample object comprises: creating or collecting a 3D model of the sample object, and normalizing to enable the mass center of the sample object to be located at the origin of a model coordinate system of the 3D model, wherein the main axis of the sample object is consistent with the direction of a coordinate axis in the model coordinate system.
3. The method of generating according to claim 1, wherein:
the capturing point sampling based on the 3D model of the sample object comprises the following steps: performing point cloud sampling on the 3D model of the sample object, determining a first position and a grabbing direction of a sampling point in the 3D model, and recording; the first position is represented by coordinates of the sampling point in a model coordinate system of the 3D model, and the grabbing direction is determined according to a normal vector of the sampling point in the 3D model.
4. A generation method according to claim 1 or 2 or 3, characterized in that:
the evaluating the grabbing quality of the sampling point comprises the following steps: in a scenario where a sample object is sucked by a single suction cup, evaluating the grabbing quality of each sampling point according to the sealing quality and the antagonism quality of the sampling point; wherein the sealing quality is determined according to the degree of sealing between the end of the suction cup and the surface of the sample object when the suction cup sucks the sample object at the position of the sampling point with the axial direction of the suction cup consistent with the grabbing direction of the sampling point, and the antagonism quality is determined according to the gravity moment of the sample object in that case and the degree to which the moment that can be generated when the suction cup sucks the object counteracts the gravity moment.
5. The method of generating according to claim 1, wherein:
the simulated scene is obtained by loading a 3D model of the first object into an initial scene, the loading process comprising:
selecting the kinds and the number of the first objects to be loaded from the sample objects, and assigning a value to the mass of the first objects;
loading the 3D model of the first object into the initial scene according to random positions and postures;
simulating, by using a physics engine, the dropping process of the first objects and the finally formed stacking state, to obtain the simulated scene;
a second position and pose of the 3D model of the first object in the simulated scene is recorded.
6. The method of generating according to claim 1, wherein:
rendering the simulated scene to generate a sample image for training, including: rendering each simulated scene at least twice to obtain at least two groups of sample images for training; when each rendering is performed, adding a simulation camera, setting a light source and adding textures for a loaded first object in the simulation scene, and rendering a 2D image and a depth image to serve as a group of sample images; any two of the plurality of renderings differ in at least one of the following parameters: texture of the object, simulated camera parameters, light parameters.
7. The method of generating according to claim 6, wherein:
adding texture to the loaded first object at each rendering time comprises the following steps:
at each rendering, for each first object loaded into the simulated scene, randomly selecting one of the collected multiple real textures and attaching it to the surface of that first object; or
for each kind of first object loaded into the simulated scene, randomly selecting one of the collected multiple real textures and attaching it to the surfaces of the first objects of that kind.
8. The generating method according to claim 1 or 6 or 7, characterized in that:
the sample image includes a 2D image and a depth image; the generating the target grabbing quality of the pixel point in the sample image according to the grabbing quality of the sampling point of the first object includes:
each simulated scene of the rendered 2D image and depth image is processed as follows:
obtaining a point cloud of a first object visible in the simulated scene according to the internal parameters of the simulated camera during rendering and the rendered depth image;
determining the position of a target sampling point in the point cloud according to the first position of the target sampling point in the 3D model, the second position of the 3D model in the simulated scene and the loaded posture change, wherein the target sampling point refers to the sampling point of the visible first object;
and determining the grabbing quality of the point in the point cloud according to the grabbing quality of the target sampling point and the position of the point cloud, and marking the grabbing quality of the point as the target grabbing quality of the corresponding pixel point in the 2D image.
9. The method of generating according to claim 8, wherein:
the determining the grabbing quality of the points in the point cloud according to the grabbing quality of the target sampling points and the positions of the target sampling points in the point cloud comprises the following steps:
for each target sampling point, determining the grabbing quality of a point adjacent to the target sampling point in the point cloud as the grabbing quality of the target sampling point; or alternatively
Interpolating a point in the point cloud according to the grabbing quality of a target sampling point adjacent to the point to obtain the grabbing quality of the point; or alternatively
And for each target sampling point, determining the grabbing quality of the point adjacent to the target sampling point in the point cloud as the grabbing quality of the target sampling point, and obtaining the grabbing quality of other points in the point cloud through interpolation after the grabbing quality of the points adjacent to all the target sampling points is determined.
10. The method of generating according to claim 9, wherein:
after determining the grabbing quality of the points in the point cloud according to the grabbing quality of the target sampling points and the positions of the target sampling points in the point cloud, the generating method further comprises: for each target sampling point, taking the grabbing direction of the target sampling point as the grabbing direction of the points adjacent to the target sampling point in the point cloud, and, in combination with the relative positional relationship between the visible first objects in the point cloud, when it is determined that the grabbing space at the points adjacent to the target sampling point is smaller than the required grabbing space, adjusting downwards the grabbing quality of the points in the point cloud whose distance from the target sampling point is smaller than a set distance threshold.
11. The method of generating according to claim 1, wherein:
the sample image comprises a 2D image, the generating method further comprising: data is generated for a classification for each pixel in the 2D image, the classification including a foreground and a background.
12. A method of training an estimation model of an object gripping point, comprising:
acquiring training data, wherein the training data comprises a sample image and target grabbing quality of pixel points in the sample image;
and training an estimation model of the object grabbing point by taking the sample image as input data in a machine learning mode, and calculating loss according to the difference value between the predicted grabbing quality of the pixel point in the sample image output by the estimation model and the target grabbing quality during training.
13. The training method of claim 12, wherein:
the training data is generated according to the generation method as claimed in any one of claims 1 to 11.
14. Training method according to claim 12 or 13, characterized in that:
the sample image includes a 2D image and a depth image; the estimation model includes a backbone network and a multi-branch network, wherein:
the backbone network adopts a semantic segmentation network architecture and comprises a depth channel, and is set to extract features from an input 2D image and a depth image;
The multi-branch network adopts a multi-task learning network architecture and comprises a depth channel, and is set to predict based on the extracted characteristics and output the predicted grabbing quality of the pixel points in the 2D image.
15. The training method of claim 12, wherein:
the training data is generated according to the generating method as claimed in claim 12;
the multi-branch network includes:
a first branch network configured to output a classification confidence of each pixel point in the 2D image, the classification including a foreground and a background; and
a second branch network configured to output the predicted grabbing quality of the pixels in the 2D image that are classified as foreground according to the classification confidence.
16. An estimation method of an object grabbing point, comprising:
acquiring a scene image containing an object to be grabbed, wherein the scene image comprises a 2D image or comprises a 2D image and a depth image;
inputting the scene image into an estimation model of object grabbing points, wherein the estimation model is a trained estimation model by the training method according to any one of claims 12 to 15;
and determining the position of the grabbing point of the object to be grabbed according to the predicted grabbing quality of the pixel point in the 2D image output by the estimation model.
17. The estimation method as claimed in claim 16, wherein:
the determining the position of the grabbing point of the object to be grabbed according to the predicted grabbing quality of the pixel point in the 2D image output by the estimation model comprises the following steps:
selecting all or part of pixel points of which the predicted grabbing quality is greater than a set quality threshold value in the object to be grabbed;
clustering the selected pixel points, calculating one or more class centers, and taking the pixel points corresponding to the class centers as candidate grabbing points of the object to be grabbed;
and sequencing the obtained candidate grabbing points based on a preset rule, and determining the optimal one candidate grabbing point as the grabbing point of the object to be grabbed according to the sequencing.
18. A device for generating training data of an object grabbing point estimation model, characterized by comprising a processor and a memory storing a computer program, wherein the processor implements the method for generating training data of an object grabbing point estimation model according to claim 1 or 11 when executing the computer program.
19. A training device for an estimated model of an object gripping point, characterized by comprising a processor and a memory storing a computer program, wherein the processor implements the training method for an estimated model of an object gripping point according to claim 12 or 15 when executing the computer program.
20. An object-grabbing-point estimation device comprising a processor and a memory storing a computer program, wherein the processor, when executing the computer program, implements the object-grabbing-point estimation method according to claim 16 or 17.
21. A robotic vision system, comprising:
a camera arranged to capture a scene image comprising an object to be grabbed, the scene image comprising a 2D image, or comprising a 2D image and a depth image;
control means comprising the object gripping point estimation means according to claim 20, said control means being arranged to determine the position of the gripping point of the object to be gripped from the scene image taken by the camera; and controlling the grabbing action executed by the robot according to the position of the grabbing point;
and a robot arranged to perform said gripping action.
22. A non-transitory computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements a method of generating training data of an object grasp point estimation model according to any one of claims 1 to 11, or implements a method of training an object grasp point estimation model according to any one of claims 12 to 15, or implements an object grasp point estimation method according to claim 16 or 17.
CN202111643324.3A 2021-12-29 2021-12-29 Object grabbing point estimation, model training and data generation method, device and system Active CN116416444B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111643324.3A CN116416444B (en) 2021-12-29 2021-12-29 Object grabbing point estimation, model training and data generation method, device and system
PCT/CN2022/135705 WO2023124734A1 (en) 2021-12-29 2022-11-30 Object grabbing point estimation method, apparatus and system, model training method, apparatus and system, and data generation method, apparatus and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111643324.3A CN116416444B (en) 2021-12-29 2021-12-29 Object grabbing point estimation, model training and data generation method, device and system

Publications (2)

Publication Number Publication Date
CN116416444A true CN116416444A (en) 2023-07-11
CN116416444B CN116416444B (en) 2024-04-16

Family

ID=86997564

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111643324.3A Active CN116416444B (en) 2021-12-29 2021-12-29 Object grabbing point estimation, model training and data generation method, device and system

Country Status (2)

Country Link
CN (1) CN116416444B (en)
WO (1) WO2023124734A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116841914A (en) * 2023-09-01 2023-10-03 星河视效科技(北京)有限公司 Method, device, equipment and storage medium for calling rendering engine
CN117656083A (en) * 2024-01-31 2024-03-08 厦门理工学院 Seven-degree-of-freedom grabbing gesture generation method, device, medium and equipment


Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109598264B (en) * 2017-09-30 2020-10-16 北京猎户星空科技有限公司 Object grabbing method and device
CN108058172A (en) * 2017-11-30 2018-05-22 深圳市唯特视科技有限公司 A kind of manipulator grasping means based on autoregression model
US11833681B2 (en) * 2018-08-24 2023-12-05 Nvidia Corporation Robotic control system
CN111553949B (en) * 2020-04-30 2023-05-19 张辉 Positioning and grabbing method for irregular workpiece based on single-frame RGB-D image deep learning

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019037863A1 (en) * 2017-08-24 2019-02-28 Toyota Motor Europe System and method for label augmentation in video data
CN111226237A (en) * 2017-09-01 2020-06-02 加利福尼亚大学董事会 Robotic system and method for robust grasping and aiming of objects
CN108818586A (en) * 2018-07-09 2018-11-16 山东大学 A kind of object center of gravity detection method automatically grabbed suitable for manipulator
CN109159113A (en) * 2018-08-14 2019-01-08 西安交通大学 A kind of robot manipulating task method of view-based access control model reasoning
CN109523629A (en) * 2018-11-27 2019-03-26 上海交通大学 A kind of object semanteme and pose data set generation method based on physical simulation
CN109658413A (en) * 2018-12-12 2019-04-19 深圳前海达闼云端智能科技有限公司 A kind of method of robot target grasping body position detection
CN111127548A (en) * 2019-12-25 2020-05-08 深圳市商汤科技有限公司 Grabbing position detection model training method, grabbing position detection method and grabbing position detection device
CN111161387A (en) * 2019-12-31 2020-05-15 华东理工大学 Method and system for synthesizing image in stacked scene, storage medium and terminal equipment
CN212553849U (en) * 2020-05-26 2021-02-19 腾米机器人科技(深圳)有限责任公司 Object grabbing manipulator
CN111844101A (en) * 2020-07-31 2020-10-30 中国科学技术大学 Multi-finger dexterous hand sorting planning method
CN113034526A (en) * 2021-03-29 2021-06-25 深圳市优必选科技股份有限公司 Grabbing method, grabbing device and robot
CN113297701A (en) * 2021-06-10 2021-08-24 清华大学深圳国际研究生院 Simulation data set generation method and device for multiple industrial part stacking scenes
CN113436293A (en) * 2021-07-13 2021-09-24 浙江大学 Intelligent captured image generation method based on condition generation type countermeasure network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XINYUE WANG ED.: "Automatic bounding-box-labeling method of occluded objects in virtual image data", ICMIP \'20: PROCEEDINGS OF THE 5TH INTERNATIONAL CONFERENCE ON MULTIMEDIA AND IMAGE PROCESSING, pages 163 - 168 *
ZANG MIAO: "Research on Key Technologies of Automatic Image Annotation (图像自动标注关键技术研究)", China Doctoral Dissertations Full-text Database, Information Science and Technology, pages 1 - 97 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116841914A (en) * 2023-09-01 2023-10-03 星河视效科技(北京)有限公司 Method, device, equipment and storage medium for calling rendering engine
CN117656083A (en) * 2024-01-31 2024-03-08 厦门理工学院 Seven-degree-of-freedom grabbing gesture generation method, device, medium and equipment
CN117656083B (en) * 2024-01-31 2024-04-30 厦门理工学院 Seven-degree-of-freedom grabbing gesture generation method, device, medium and equipment

Also Published As

Publication number Publication date
WO2023124734A1 (en) 2023-07-06
CN116416444B (en) 2024-04-16

Similar Documents

Publication Publication Date Title
Xie et al. The best of both modes: Separately leveraging rgb and depth for unseen object instance segmentation
Sock et al. Multi-view 6D object pose estimation and camera motion planning using RGBD images
CN116416444B (en) Object grabbing point estimation, model training and data generation method, device and system
CN109102547A (en) Robot based on object identification deep learning model grabs position and orientation estimation method
CN108292362A (en) Gesture identification for cursor control
CN114952809B (en) Workpiece identification and pose detection method, system and mechanical arm grabbing control method
Günther et al. Building semantic object maps from sparse and noisy 3d data
US10902264B2 (en) Automatic generation of secondary class annotations
Mitash et al. Scene-level pose estimation for multiple instances of densely packed objects
Wada et al. Instance segmentation of visible and occluded regions for finding and picking target from a pile of objects
Zelener et al. Cnn-based object segmentation in urban lidar with missing points
Dyrstad et al. Grasping virtual fish: A step towards robotic deep learning from demonstration in virtual reality
Mörwald et al. Advances in real-time object tracking: Extensions for robust object tracking with a Monte Carlo particle filter
Madessa et al. Leveraging an instance segmentation method for detection of transparent materials
Lutz et al. Probabilistic object recognition and pose estimation by fusing multiple algorithms
Turkoglu et al. Incremental learning-based adaptive object recognition for mobile robots
Pattar et al. Automatic data collection for object detection and grasp-position estimation with mobile robots and invisible markers
Rigual et al. Object detection methods for robot grasping: Experimental assessment and tuning
CN112509050A (en) Pose estimation method, anti-collision object grabbing method and device
Hasan et al. 2D geometric object shapes detection and classification
Luo et al. Transparent object recognition and retrieval for robotic bio-laboratory automation applications
Károly et al. Automatic generation and annotation of object segmentation datasets using robotic arm
Martinson Interactive training of object detection without imagenet
Wada et al. Instance Segmentation of Visible and Occluded Regions for Finding and Picking Target from a Pile of Objects
Keaveny Experimental Evaluation of Affordance Detection Applied to 6-DoF Pose Estimation for Intelligent Robotic Grasping of Household Objects

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant