WO2023124734A1 - Object grabbing point estimation method, apparatus and system, model training method, apparatus and system, and data generation method, apparatus and system - Google Patents

Object grabbing point estimation method, apparatus and system, model training method, apparatus and system, and data generation method, apparatus and system

Info

Publication number
WO2023124734A1
Authority
WO
WIPO (PCT)
Prior art keywords
point
grasping
quality
image
model
Prior art date
Application number
PCT/CN2022/135705
Other languages
French (fr)
Chinese (zh)
Inventor
周韬
Original Assignee
广东美的白色家电技术创新中心有限公司
美的集团股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 广东美的白色家电技术创新中心有限公司 and 美的集团股份有限公司
Publication of WO2023124734A1 publication Critical patent/WO2023124734A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/50 - Depth or shape recovery
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 - Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/10 - Segmentation; Edge detection
    • G06T 7/11 - Region-based segmentation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/70 - Determining position or orientation of objects or cameras
    • G06T 7/73 - Determining position or orientation of objects or cameras using feature-based methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/70 - Determining position or orientation of objects or cameras
    • G06T 7/73 - Determining position or orientation of objects or cameras using feature-based methods
    • G06T 7/75 - Determining position or orientation of objects or cameras using feature-based methods involving models
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/762 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P 90/00 - Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P 90/30 - Computing systems specially adapted for manufacturing

Definitions

  • the present disclosure relates to but is not limited to artificial intelligence technology, and specifically relates to a method, device and system for object grasping point estimation, model training and data generation.
  • the challenge encountered by the robot vision system is to guide the robot to grab thousands of different stock keeping units (SKU for short).
  • These objects are usually unknown to the system, or the variety is so large that maintaining physical models or texture templates for all SKUs is too costly.
  • The simplest example is the depalletizing application: although the objects to be grasped are all rectangular (boxes or cartons), their texture and size change from scene to scene. The classic object localization or recognition schemes based on template matching are therefore difficult to apply in such scenarios.
  • In some e-commerce warehousing scenarios, many objects have irregular shapes, most commonly box-like and bottle-like objects. These goods are stacked together, and the robot vision guidance system needs to efficiently sort them out of the stacked state one by one, perform subsequent code scanning or identification operations, and deliver them into the appropriate target material frame.
  • Having the robot vision system estimate the most suitable grasping point (which may be, but is not limited to, a suction point) from the scene captured by the camera without prior knowledge of the object, and guiding the robot to perform the grasping action, is still a problem that needs to be solved.
  • An embodiment of the present disclosure provides a method for generating training data of an object grasping point estimation model, including:
  • a target grasping quality of pixels in the sample image is generated according to the grasping quality of the sampling points of the first object.
  • An embodiment of the present disclosure also provides a device for generating training data of an object grasping point estimation model, including a processor and a memory storing a computer program, wherein, when the processor executes the computer program, the method for generating training data of an object grasping point estimation model described in any embodiment of the present disclosure is implemented.
  • the method and device of the above-mentioned embodiments of the present disclosure realize automatic labeling of sample images, can generate training data efficiently and with high quality, and avoid problems such as heavy workload and unstable labeling quality caused by manual labeling.
  • An embodiment of the present disclosure provides a method for training an estimation model of object grasping points, including:
  • the training data includes a sample image and the target grasping quality of pixels in the sample image
  • the estimation model includes a backbone network using a semantic segmentation network architecture and a multi-branch network, and the multi-branch network adopts a multi-task learning network architecture.
  • An embodiment of the present disclosure also provides a training device for an estimation model of an object grasping point, including a processor and a memory storing a computer program, wherein, when the processor executes the computer program, the method for training the estimation model of the object grasping point described in any embodiment of the present disclosure is implemented.
  • The method and device of the above embodiments of the present disclosure learn the grasping quality of pixels in a 2D image through training, which offers better accuracy and stability than directly predicting the optimal grasping point.
  • An embodiment of the present disclosure provides a method for estimating a grasping point of an object, including:
  • An embodiment of the present disclosure also provides a device for estimating the grasping point of an object, including a processor and a memory storing a computer program, wherein, when the processor executes the computer program, the method for estimating object grasping points described in any embodiment of the present disclosure is implemented.
  • An embodiment of the present disclosure also provides a robot vision system, including:
  • a camera configured to shoot a scene image containing an object to be captured, where the scene image includes a 2D image, or includes a 2D image and a depth image;
  • a control device, including the object grasping point estimation device according to an embodiment of the present disclosure, the control device being configured to determine the position of the grasping point of the object to be grasped according to the scene image captured by the camera, and to control the grasping action performed by the robot according to the position of the grasping point;
  • a robot configured to perform said grasping action.
  • the estimation method, device and robot vision system of the above-mentioned embodiments of the present disclosure can improve the accuracy of object grasping point estimation, thereby improving the success rate of grasping.
  • An embodiment of the present disclosure also provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements the method for generating training data of an object grasping point estimation model described in any embodiment of the present disclosure, or the method for training the object grasping point estimation model described in any embodiment of the present disclosure, or the method for estimating object grasping points described in any embodiment of the present disclosure.
  • FIG. 1 is a flowchart of a method for generating training data for an object grasping point estimation model according to an embodiment of the present disclosure;
  • FIG. 2 is a flowchart of generating annotation data according to the grasping quality of sampling points in FIG. 1;
  • FIG. 3 is a schematic diagram of a device for generating training data according to an embodiment of the present disclosure;
  • FIG. 4 is a flowchart of a training method for an estimation model of an object grasping point according to an embodiment of the present disclosure;
  • FIG. 5 is a network structure diagram of an estimation model according to an embodiment of the present disclosure;
  • FIG. 6 is a flowchart of a method for estimating an object grasping point according to an embodiment of the present disclosure;
  • FIG. 7 is a schematic structural diagram of a robot vision system according to an embodiment of the present disclosure.
  • Words such as “exemplary” or “for example” are used to mean serving as an example, instance, or illustration. Any embodiment described in this disclosure as “exemplary” or “for example” should not be construed as preferred or advantageous over other embodiments.
  • “And/or” herein describes an association relationship between associated objects and indicates that three relationships are possible; for example, “A and/or B” can mean: A exists alone, A and B exist simultaneously, or B exists alone.
  • “A plurality” means two or more than two.
  • Words such as “first” and “second” are used to distinguish identical or similar items having substantially the same function and effect. Those skilled in the art will understand that such words do not limit the quantity or execution order, nor do they necessarily imply that the items are different.
  • In one related scheme, the point cloud is segmented by plane fitting or by Euclidean-distance clustering in an attempt to segment and detect the different objects in the scene; the center point of each segment is then taken as a grasp point candidate, the candidates are ranked with a series of heuristic rules, and the robot is finally guided to grasp at the optimal grasp point.
  • a feedback system is introduced to record the success or failure of each grab. If successful, the current object is used as a template to match the grab point of the next grab.
  • The problem with this scheme is that the performance of ordinary point cloud segmentation is relatively weak, so there are many erroneous grasp points, and when objects are closely packed the point cloud segmentation scheme easily fails.
  • In another related scheme, a deep learning framework is used: a limited amount of data is manually labeled with the direction and area of the grasping points to obtain training data, and a neural network model is trained on these data.
  • The vision system can then process pictures similar to the training set and estimate the grasping points of the objects.
  • The problem with this solution is that the cost of data collection and labeling is relatively high, especially at the labeling level. Labeling the direction and area of grasping points is difficult and requires strong technical ability from the labeler; at the same time, the labels involve many human factors, so labeling quality cannot be controlled systematically and no model with systematic quality assurance can be produced.
  • An embodiment of the present disclosure provides a method for generating training data of an object grasping point estimation model, as shown in FIG. 1 , including:
  • Step 110: acquire the 3D model of the sample object, sample grasping points based on the 3D model of the sample object, and evaluate the grasping quality of the sampling points;
  • The sample objects can be box-like objects, bottle-like objects, a mixture of box-like and bottle-like objects, or objects of other shapes.
  • the sample object can usually be selected from the actual items to be grabbed, but it is not required to cover all types of the actual items to be grabbed.
  • Items with typical geometric shapes among the items to be grasped can be selected as sample objects, but this disclosure does not require the sample objects to cover the shapes of all items to be grasped; the trained model can still estimate grasping points for objects of other shapes.
  • Step 120: render a simulated scene loaded with the 3D model of a first object to generate a sample image for training, the first object being selected from the sample objects;
  • the loaded first object may be randomly selected by the system from sample objects, or manually selected, or selected according to configured rules.
  • The selected first objects may include one sample object or multiple sample objects, and may be of one type or of multiple types; this embodiment is not limited in this respect.
  • Step 130: generate a target grasping quality for pixels in the sample image according to the grasping quality of the sampling points of the first object.
  • The target grasping quality of pixels in the sample image here may cover some of the pixels or all of the pixels in the sample image; it may be labeled pixel by pixel, or for a set of multiple pixels, such as a region of the sample image containing more than two pixels. Because the labeled grasping quality of pixels in the sample image serves as the target data during training, it is referred to herein as the target grasping quality of the pixels.
  • The 3D model of the sample object is obtained first, and grasping point sampling and evaluation of the grasping quality of the sampling points are performed on this model. Because the geometric shape of the 3D model itself is accurate, the grasping quality evaluation can be completed with high quality. After the 3D models of the selected first objects are loaded to generate the simulated scene, and because the position and posture of each 3D model are tracked during loading, the positional relationship between the sampling points and the pixels in the sample image can be calculated, and the grasping quality of the sampling points can be transferred to the corresponding pixels in the sample image.
  • The training data generated by the embodiments of the present disclosure include the sample images and annotation data (including, but not limited to, the target grasping quality). The embodiments of the present disclosure therefore realize automatic annotation of sample images and can generate training data efficiently and with high quality, avoiding the heavy workload and unstable labeling quality of manual labeling.
  • The acquiring of the 3D model of the sample object includes: creating or collecting the 3D model of the sample object, and normalizing it so that the center of mass of the sample object lies at the origin of the model coordinate system of the 3D model and the main axis of the sample object is aligned with the direction of one coordinate axis of the model coordinate system.
  • The so-called normalization can be embodied as a unified modeling rule: the origin of the model coordinate system is established at the center of mass of the sample object, and one coordinate axis of the model coordinate system is aligned with the main-axis direction of the object. For a 3D model that has already been created and collected, normalization can be achieved by translating and rotating the 3D model until the center of mass lies at the origin and the main axis is aligned with a coordinate axis.
  • The performing of grasping point sampling based on the 3D model of the sample object includes: performing point cloud sampling on the 3D model of the sample object, and determining and recording the first position and grasping direction of each sampling point in the 3D model; the first position is represented by the coordinates of the sampling point in the model coordinate system of the 3D model, and the grasping direction is determined from the normal vector of the sampling point in the 3D model.
  • Uniform sampling can be performed on the surface of the sample object; the specific algorithm is not limited, as long as the sampling points on the surface of the sample object have an appropriate density, so that suitable grasping points are not missed.
  • a normal vector of a plane fitted by all points within a set neighborhood range of a sampling point may be used as a normal vector of the sampling point.
  • The evaluation of the grasping quality of the sampling points includes: in a scenario where a single suction cup is used to pick up the sample object, estimating the grasping quality of each sampling point according to its sealing quality and its resisting quality. The sealing quality is determined by the airtightness between the suction cup and the object surface when the suction cup sucks the sample object at the sampling point with its axis aligned with the grasping direction of the sampling point; the resisting quality is determined by the gravitational moment of the sample object and the degree to which the moments that can be generated when the suction cup holds the object resist that gravitational moment.
  • The gravitational moment tends to rotate the sample object and make it fall (the mass of the sample object is assigned during configuration), while the suction force of the suction cup on the sample object and the friction between the end of the suction cup and the sample object can provide a counter-moment that prevents the object from falling.
  • The suction force and friction force can be given as configuration information or calculated from configuration information (such as suction cup parameters and object material). The degree of resistance therefore reflects the stability of the object during suction and can be calculated with the relevant formulas.
  • The above closure quality and resisting quality can be scored separately, and the sum of the two scores, their average, or a weighted average can then be used as the grasping quality of the sampling point.
  • Because the closure quality and resisting quality of a sampling point are determined by the local geometric characteristics of the 3D model, they fully reflect the relationship between the local geometric information of the object and the quality of the grasping point, so an accurate evaluation of the grasping quality of the sampling points can be achieved.
  • Although the embodiment of the present disclosure takes a single suction cup picking up an object as an example, the present disclosure is not limited thereto. For grasping methods that pick up an object at multiple points or clamp an object at multiple points, indices such as grasping efficiency, the stability of the held object, and the probability of success can likewise be used to evaluate the grasping quality of the sampling points.
  • In some embodiments, the simulated scene is obtained by loading the 3D models of the first objects into an initial scene; the loading process is described below.
  • Through the above loading process, the embodiment of the present disclosure can simulate various object-stacking scenes. Training data generated from such scenes make the trained model suitable for estimating object grasping points in complex stacking scenes, solving the problem that grasping points are difficult to estimate in such scenes.
  • A simulated material frame can be set in the initial scene; the 3D models of the first objects are loaded into the material frame, and the collisions between the first objects and between the first objects and the material frame are simulated, so that the finally formed stacked scene is closer to a real scene. The material frame, however, is not required.
  • a simulated scene in which the first objects are stacked in an orderly manner may also be loaded, depending on the need for simulating the actual working scene.
  • the first object may be loaded multiple times in different ways to obtain multiple simulation scenes.
  • the different manners may be, for example, different types and/or quantities of the loaded first objects, different initial positions and postures of the 3D models during loading, and the like.
  • The rendering of the simulated scene to generate sample images for training includes: rendering each simulated scene at least twice to obtain at least two sets of sample images for training. In each rendering, a simulated camera is added to the simulated scene, a light source is set, textures are added to the loaded first objects, and a 2D image and a depth image are rendered as one set of sample images; between any two of the renderings, at least one of the following parameters differs: object texture, simulated camera parameters, and lighting parameters.
  • Because the simulated environment is lit during rendering, adjusting the parameters of the simulated camera (such as intrinsics, position, and angle), the lighting parameters (such as the color and intensity of the lighting), and the object textures strengthens the degree of data randomization, enriches the content of the sample images, and increases their number, thereby improving the quality of the training data and, in turn, the performance of the trained estimation model.
  • Adding textures to the loaded first objects in each rendering includes: in each rendering, for each first object loaded into the simulation scene, randomly selecting one of a number of collected real textures and pasting it on the surface of that first object; or, in each rendering, for each type of first object loaded into the simulation scene, randomly selecting one of the collected real textures and pasting it on the surfaces of the first objects of that type.
  • This example compensates for domain differences between real and simulated data through a randomization technique.
  • The real textures can be collected from images of actual objects, from photographs of real textures, and the like. Randomly pasting the selected textures on the surfaces of the first objects stacked randomly in the simulated scene makes it possible to render multiple images with different textures.
  • In this way, the estimation model learns to use local geometric information to predict the grasping quality of a grasping point, which gives the model the ability to generalize to unknown objects.
  • In some embodiments, the sample image includes a 2D image and a depth image, and generating the target grasping quality of pixels in the sample image according to the grasping quality of the sampling points of the first object includes:
  • Step 210: obtain the point cloud of the visible first objects in the simulated scene according to the intrinsic parameters of the simulated camera used for rendering and the rendered depth image;
  • Step 220: determine the position of each target sampling point in the point cloud according to the first position of the target sampling point in the 3D model and the second position and posture of the 3D model in the simulated scene after loading, a target sampling point being a sampling point of a visible first object;
  • Step 230: determine the grasping quality of points in the point cloud according to the grasping quality of the target sampling points and their positions in the point cloud, and mark it as the target grasping quality of the corresponding pixels in the 2D image.
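  • For illustration of Step 210, the following is a minimal sketch that back-projects a rendered depth image into a camera-frame point cloud using pinhole intrinsics; the function name and the intrinsic values in the usage comment are illustrative assumptions, not values prescribed by this disclosure.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth image (H x W, in metres) into an N x 3 point cloud
    in the camera frame using pinhole intrinsics (fx, fy, cx, cy)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # drop invalid (zero-depth) pixels

# Hypothetical usage with a 640x480 simulated camera:
# cloud = depth_to_point_cloud(rendered_depth, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
```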
  • Determining the grasping quality of points in the point cloud according to the grasping quality of the target sampling points and their positions in the point cloud includes any of the following:
  • first: for each target sampling point, set the grasping quality of the points in the point cloud adjacent to the target sampling point to the grasping quality of that target sampling point;
  • second: for a point in the point cloud, obtain its grasping quality by interpolating the grasping qualities of the target sampling points adjacent to that point;
  • third: for each target sampling point, set the grasping quality of the points in the point cloud adjacent to the target sampling point to the grasping quality of that target sampling point, and, after the grasping quality of all points adjacent to target sampling points has been determined, obtain the grasping quality of the remaining points in the point cloud by interpolation.
  • Embodiments of the present disclosure provide various methods for transferring the grasping quality of target sampling points to the points of the point cloud.
  • The first method assigns the grasping quality of each target sampling point to the adjacent points in the point cloud. An adjacent point can be one or more points in the point cloud closest to the target sampling point; for example, the points can be filtered with a set distance threshold, and any point in the point cloud whose distance to the target sampling point is less than that threshold is treated as a point adjacent to the target sampling point.
  • The second method is interpolation: the grasping quality of a point in the point cloud can be interpolated from the grasping qualities of multiple nearby target sampling points.
  • For example, an interpolation method based on Gaussian filtering can be used; or the target sampling points can be given different weights according to their distances to the point, with larger distances given smaller weights, and the grasping qualities of the target sampling points weighted-averaged with these weights to obtain the grasping quality of that point. Other interpolation methods may also be used in this embodiment.
  • The points adjacent to a given point can also be filtered with a set distance threshold. If only one adjacent target sampling point is found for a point, the grasping quality of that target sampling point can be assigned to the point.
  • The third method first determines the grasping quality of the points adjacent to the target sampling points in the point cloud, and then obtains the grasping quality of the other points in the point cloud by interpolating from the grasping qualities of those points.
  • Both the second and third methods yield the grasping quality of all points in the point cloud; after mapping these values to the corresponding pixels in the 2D image, a grasping-quality heat map of the 2D image can be drawn.
  • With the first method, it is also possible to obtain the grasping quality of only some of the points in the point cloud, and therefore of only some of the pixels in the 2D image through mapping. In that case, during training only the predicted grasping quality of those pixels is compared with the target grasping quality, the loss is calculated accordingly, and the model is optimized from that loss.
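  • As a hedged sketch of the second transfer method, the grasping quality of each point in the point cloud can be obtained as a Gaussian distance-weighted average of the qualities of nearby target sampling points; the radius and Gaussian width below are illustrative values, not parameters fixed by this disclosure.

```python
import numpy as np

def transfer_quality(cloud_pts, sample_pts, sample_quality, radius=0.01, sigma=0.005):
    """For every point in `cloud_pts` (N x 3), interpolate the grasping quality of
    the target sampling points `sample_pts` (M x 3, with qualities `sample_quality`)
    that lie within `radius`; weights decay with a Gaussian of width `sigma`."""
    quality = np.zeros(len(cloud_pts))
    for i, p in enumerate(cloud_pts):
        d = np.linalg.norm(sample_pts - p, axis=1)
        near = d < radius
        if not near.any():
            continue  # no nearby sampling point: quality stays 0
        w = np.exp(-0.5 * (d[near] / sigma) ** 2)  # larger distance -> smaller weight
        quality[i] = np.average(sample_quality[near], weights=w)
    return quality
```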
  • In some embodiments, the generation method further includes: for each target sampling point, taking the grasping direction of the target sampling point as the grasping direction of the points adjacent to it in the point cloud; and, combining this with the relative positional relationship between the visible first objects, when the grasping space at a point adjacent to the target sampling point is determined to be smaller than the required grasping space, adjusting downwards the grasping quality of the points in the point cloud whose distance to that target sampling point is smaller than a set distance threshold.
  • The embodiment of the present disclosure considers that, in the stacked state, grasping points of good quality on an object may not have enough space to complete the grasping operation because of adjacent objects. Therefore, after the grasping quality of the points in the point cloud has been determined, the grasping space is checked, and the grasping quality of points affected by insufficient grasping space is adjusted downwards; specifically, it can be adjusted below a set quality threshold so that such points are not selected.
  • In some embodiments, the sample image includes a 2D image, and the generation method further includes: labeling the classification of each pixel in the 2D image, the classification including foreground and background, where the foreground is the first objects in the image.
  • The pixel classification can be used to train the estimation model to distinguish foreground from background and to accurately select the foreground points (that is, the points on the first objects) from the sample image input to the estimation model, so that the predicted grasping quality needs to be estimated only for the foreground points.
  • The classification of the pixels in the 2D image can also be derived from the classification of the points in the point cloud: by mapping the boundaries between the first objects and the background of the simulated scene onto the point cloud, each point in the point cloud can be classified as a foreground point or a background point.
  • An embodiment of the present disclosure also provides a method for generating training data of an object grasping point estimation model, including:
  • Step 1: collect 3D models of various sample objects and normalize the 3D models so that the origin of the model coordinate system is placed at the center of mass of the sample object and the first coordinate axis of the model coordinate system is aligned with the main axis of the sample object.
  • A 3D model in a format such as STereoLithography (STL) can be used. The position of the center of mass of the sample object can be obtained from the vertex and face information in the 3D model, for example by computing the centroid of all vertices; the origin of the model coordinate system is then translated to the center of mass of the sample object.
  • The principal component analysis (PCA) method can be used to determine the main-axis direction of the sample object, and the 3D model of the sample object is then rotated so that one coordinate axis of the model coordinate system points in the same direction as the main axis of the sample object.
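  • A minimal sketch of this normalization under simple assumptions (the vertex centroid is used as a stand-in for the center of mass, and PCA is computed directly on the vertices):

```python
import numpy as np

def normalize_model(vertices):
    """Translate mesh vertices so the centroid sits at the origin, then rotate so
    the principal axis (first PCA component) lies along the x-axis.
    `vertices` is an N x 3 array of the 3D model's vertex coordinates."""
    centered = vertices - vertices.mean(axis=0)
    # PCA via eigen-decomposition of the 3x3 covariance matrix
    _, eigvecs = np.linalg.eigh(np.cov(centered.T))
    axes = eigvecs[:, ::-1]            # columns sorted by decreasing variance
    if np.linalg.det(axes) < 0:        # keep a right-handed rotation
        axes[:, -1] *= -1
    return centered @ axes             # main axis now lies along x
```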
  • Step 2: sample grasping points on the 3D model of the sample object, obtaining and recording the first position and grasping direction of each sampling point.
  • The sampling process in this embodiment performs point cloud sampling on the object model and estimates a normal vector for each sampled point from a fixed neighborhood; each point together with its normal vector represents one sampling point.
  • The voxel sampling method, or other sampling methods such as farthest point sampling, can be used.
  • all points within a certain range of neighborhood where each sampling point is located are used to estimate the direction of the normal vector of the sampling point.
  • the method of estimating the normal vector can be to use the random sample consensus algorithm (RANSAC for short) to fit all points in the neighborhood of the sampling point to estimate a plane, and the normal vector of the plane is approximately the normal vector of the sampling point.
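  • A compact sketch of this step under simple assumptions: model points are downsampled on a voxel grid, and a normal is estimated for each sampled point by least-squares plane fitting over a fixed-radius neighborhood (a simpler stand-in for the RANSAC plane fit mentioned above). The voxel size and radius are illustrative.

```python
import numpy as np

def voxel_downsample(points, voxel=0.005):
    """Keep one point per occupied voxel of edge length `voxel` (metres)."""
    keys = np.floor(points / voxel).astype(np.int64)
    _, idx = np.unique(keys, axis=0, return_index=True)
    return points[np.sort(idx)]

def estimate_normal(points, query, radius=0.01):
    """Normal of the plane least-squares-fitted to the neighbours of `query`
    within `radius` (eigenvector of the smallest eigenvalue of the local covariance)."""
    nbr = points[np.linalg.norm(points - query, axis=1) < radius]
    if len(nbr) < 3:
        return np.array([0.0, 0.0, 1.0])  # degenerate neighbourhood
    cov = np.cov((nbr - nbr.mean(axis=0)).T)
    _, eigvecs = np.linalg.eigh(cov)
    return eigvecs[:, 0]

# Each sampling point is the pair (position, normal); the normal gives the
# grasping (suction) direction at that point.
```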
  • Step 3: assess the grasping quality of the sampling points.
  • The quality assessment includes calculating the closure (sealing) quality during suction and the quality of resistance to the gravitational moment during suction (the suction must be able to resist the gravitational moment to complete a stable grasp); the grasping quality at each sampling point is estimated from its closure quality and resisting quality.
  • It is necessary to evaluate whether a sampled suction point (that is, a sampling point) is a suction point at which the sample object can be stably picked up.
  • The evaluation includes two aspects; the first is the closure quality.
  • The closure quality can be measured by approximating the end of the suction cup, of a set radius, as a polygon, projecting this polygon onto the surface of the 3D model along the grasping direction of the sampling point, and comparing the total side length of the projected polygon with the original side length. If the total side length after projection increases greatly over the original side length, the sealing is poor; if the change is small, the sealing is good.
  • The degree of increase can be expressed as the ratio of the increase to the original side length, and a corresponding score can be assigned according to the interval into which the ratio falls.
  • The other aspect is to calculate the resisting quality against the gravitational moment when the suction cup sucks the sample object at the sampling point along the grasping direction (also called the suction point direction).
  • The resisting quality can be calculated with a “wrench resistance” modeling scheme.
  • A “wrench” is a six-dimensional vector whose first three dimensions are forces and whose last three dimensions are moments.
  • The space formed by these six-dimensional vectors is the “wrench space”; “wrench resistance” indicates whether the combined wrench of the forces and moments acting at a point can resist an external load. If the gravitational wrench can be contained in the wrench space provided by the suction force and the moment generated by the friction force, stable suction can be provided; otherwise it cannot.
  • Normalizing the closure quality and the resisting quality into scores between 0 and 1 and summing them gives the suction quality evaluation result for each suction point.
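  • The scoring step can be summarised in a few lines: both partial scores are clipped to [0, 1] and combined (the embodiment uses a sum; an average or weighted average is also allowed). The scoring inputs are placeholders standing in for the polygon-projection and wrench-resistance computations described above.

```python
import numpy as np

def suction_quality(seal_score, wrench_score, mode="sum", w=(0.5, 0.5)):
    """Combine the closure (seal) score and the wrench-resistance score, each
    assumed to be pre-normalised into [0, 1], into one grasping quality value."""
    s = np.clip(seal_score, 0.0, 1.0)
    r = np.clip(wrench_score, 0.0, 1.0)
    if mode == "sum":
        return s + r                    # as in this embodiment: sum of the two scores
    if mode == "mean":
        return 0.5 * (s + r)
    return w[0] * s + w[1] * r          # weighted-average variant
```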
  • Step 4: build the initial simulation data-collection scene, that is, the initial scene, load multiple first objects selected from the sample objects into it, and use a physics engine to simulate the falling dynamics and final stacked postures of the first objects.
  • The 3D models of the first objects can be loaded into the simulation environment at random positions and postures, and each 3D model can be assigned a mass.
  • By simulating the effect of gravity, the 3D models of the first objects fall randomly into the material frame, and the physics engine simultaneously computes the collision information between the different first objects, so that the first objects form a stacked state very close to a real scene. Based on such a scheme, second positions and postures of the first objects close to real random stacking are obtained in the simulated scene.
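  • A minimal sketch of this drop simulation, using PyBullet as one example physics engine (this disclosure does not prescribe a particular engine); the mesh and bin file names, masses, and pose ranges are assumptions for illustration.

```python
import random
import pybullet as p

p.connect(p.DIRECT)                           # headless physics simulation
p.setGravity(0, 0, -9.81)
p.loadURDF("bin.urdf", useFixedBase=True)     # hypothetical material-frame model

body_ids = []
for mesh_path in ["obj_a.obj", "obj_b.obj"] * 5:   # hypothetical sample-object meshes
    col = p.createCollisionShape(p.GEOM_MESH, fileName=mesh_path)
    vis = p.createVisualShape(p.GEOM_MESH, fileName=mesh_path)
    pos = [random.uniform(-0.1, 0.1), random.uniform(-0.1, 0.1), random.uniform(0.3, 0.6)]
    orn = p.getQuaternionFromEuler([random.uniform(0, 3.14) for _ in range(3)])
    body_ids.append(p.createMultiBody(baseMass=0.2, baseCollisionShapeIndex=col,
                                      baseVisualShapeIndex=vis,
                                      basePosition=pos, baseOrientation=orn))

for _ in range(1000):                          # let the objects fall and settle
    p.stepSimulation()

# Second positions and postures of the stacked objects, used to place the sampling points:
poses = [p.getBasePositionAndOrientation(b) for b in body_ids]
```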
  • Step 5: generate annotation data for the sample images rendered from the simulated scene according to the grasping quality of the sampling points.
  • This step maps the sampling points obtained by grasping point sampling on the 3D models, and the grasping quality of each sampling point obtained by evaluation, into the simulated scene of stacked objects. Since the second position and posture of the 3D model of each first object are available from the simulation, and the positions of the sampling points are expressed in the model coordinate system of the 3D model, it is easy to calculate the positions of these sampling points in the simulated scene.
  • a simulated camera is added at a set position in the simulated environment, and a ray-tracing-based rendering engine is used to efficiently render the 2D image (such as a texture image) and depth map of the first object in the stacked scene.
  • the rendered depth image can be converted into a point cloud. Based on the calculated positions of the sampling points in the simulated scene and the rendered point cloud of the first object, the position of each sampling point in the point cloud of the first object to which it belongs can be determined.
  • a Gaussian filter may be performed on the grasping qualities of these sampling points based on the positions of the sampling points in the same first object.
  • the target grasping quality of other pixels in the 2D image may be obtained by interpolation according to the target grasping quality of the corresponding pixel.
  • The grasping quality of pixels with insufficient grasping space can be adjusted down, so that when the optimal grasping point is selected, low-quality grasping points that would cause collisions are filtered out.
  • the adjustment of the capture quality here may also be performed on corresponding pixels in the 2D image.
  • the capture quality heat map of the 2D image rendered by the simulated scene can be obtained.
  • The grasping quality heat map is output as the annotation data of the sample image, but the annotation data need not take the form of a heat map, as long as it contains information about the target grasping quality of pixels in the 2D image.
  • the estimation model may be driven to learn or fit the grasping quality heat map during training.
  • An embodiment of the present disclosure also provides a device for generating training data of an object grasping point estimation model, as shown in FIG. 3 , including a processor 60 and a memory 50 storing a computer program, wherein the processor 60 executes The computer program implements the method for generating training data of an object grasping point estimation model according to any embodiment of the present disclosure.
  • the processor in the embodiments of the present disclosure and other embodiments may be an integrated circuit chip, which has a signal processing capability.
  • The processor may be a general-purpose processor, such as a central processing unit (CPU) or a network processor (NP); it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
  • a general purpose processor may be a microprocessor or any conventional processor or the like.
  • The object physical model (that is, the 3D model) and the object geometric information are used to evaluate the quality of the grasping points based on basic physical principles, ensuring the rationality of the grasping point labeling.
  • Domain randomization technology is used to generate a large amount of synthetic data for training the estimation model by means of random textures, random illumination, random camera positions, and so on. The estimation model can therefore bridge the domain gap between synthetic data and real data and learn the local geometric features of objects, so as to accurately complete the task of estimating object grasping points.
  • An embodiment of the present disclosure also provides a method for training an estimation model of an object grasping point, as shown in FIG. 4 , including:
  • Step 310: acquire training data, the training data including sample images and the target grasping quality of pixels in the sample images;
  • Step 320: use the sample images as input data and train the estimation model of the object grasping point by machine learning; during training, the loss is computed according to the difference between the predicted grasping quality of pixels in the sample image output by the estimation model and the target grasping quality.
  • The training method of the estimation model in the embodiment of the present disclosure learns the grasping quality of pixels in the 2D image and then selects the optimal grasping point according to the predicted grasping quality of those pixels; compared with directly predicting the optimal grasping point, this approach offers better accuracy and stability.
  • the machine learning in the embodiments of the present disclosure may be supervised deep learning, non-deep learning machine learning, and the like.
  • the training data is generated according to the method for generating training data of the object grasping point estimation model described in any embodiment of the present disclosure.
  • the network architecture of the estimation model is shown in Figure 5, including:
  • the backbone network (Backbone) 10, which adopts a semantic segmentation network architecture (such as DeepLab or UNet) and is configured to extract features from the input 2D image and depth image;
  • the multi-branch network 20 adopts a multi-task learning network architecture and is configured to perform prediction based on the extracted features, so as to output the predicted capture quality of pixels in the 2D image.
  • the multi-branch network (also referred to as a network head or a detection head) includes:
  • the first branch network 21 learns semantic segmentation information to distinguish the foreground and background, and is configured to output the classification confidence of each pixel in the 2D image, and the classification includes foreground and background;
  • the second branch network 23 learns the grasping quality information of pixels in the 2D image, and is configured to output the predicted grasping quality of pixels classified as foreground determined according to the classification confidence in the 2D image. For example, pixels classified as foreground with confidence greater than a set confidence threshold may be referred to as foreground pixels.
  • This example involves classification, so the training data needs to include classified data.
  • the sample image includes a 2D image and a depth image; both the backbone network 10 and the multi-branch network 20 include depth channels, and the convolutional layers therein may adopt a 3D convolutional structure.
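  • For concreteness, a minimal PyTorch sketch of the two-headed layout in FIG. 5 is given below: a small encoder over the concatenated RGB-D input followed by a classification head (foreground/background logits) and a quality head (per-pixel predicted grasping quality). A real implementation would use a full semantic segmentation backbone such as DeepLab or UNet; the layer sizes here are placeholders.

```python
import torch
import torch.nn as nn

class GraspPointNet(nn.Module):
    def __init__(self, in_ch=4, feat=64):           # 4 channels: RGB + depth
        super().__init__()
        self.backbone = nn.Sequential(               # stand-in for a DeepLab/UNet backbone
            nn.Conv2d(in_ch, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(),
        )
        self.cls_head = nn.Conv2d(feat, 2, 1)        # branch 1: foreground / background logits
        self.quality_head = nn.Conv2d(feat, 1, 1)    # branch 2: per-pixel grasping quality

    def forward(self, rgb, depth):
        x = torch.cat([rgb, depth], dim=1)           # fuse colour and depth along the channel dimension
        f = self.backbone(x)
        return self.cls_head(f), torch.sigmoid(self.quality_head(f))

# model = GraspPointNet(); logits, quality = model(rgb_batch, depth_batch)
```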
  • The loss of the first branch network 21 is calculated from the classification loss of all pixels in the 2D image; the loss of the second branch network 23 is calculated from the difference between the predicted grasping quality and the target grasping quality of some or all of the pixels classified as foreground; the loss of the backbone network 10 is calculated from the total loss of the first branch network 21 and the second branch network 23.
  • the parameters of each network can be optimized using the gradient descent algorithm until the loss is minimized and the model converges.
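  • A hedged sketch of this loss: a cross-entropy classification loss over all pixels for the first branch, a regression loss over foreground pixels only for the second branch, and their sum back-propagated through the shared backbone. The choice of L1 regression and equal loss weights are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def grasp_losses(cls_logits, pred_quality, cls_target, target_quality):
    """cls_logits: B x 2 x H x W, pred_quality: B x 1 x H x W,
    cls_target: B x H x W long tensor (0 = background, 1 = foreground),
    target_quality: B x 1 x H x W annotated target grasping quality."""
    loss_cls = F.cross_entropy(cls_logits, cls_target)           # branch 1: all pixels
    fg = (cls_target == 1).unsqueeze(1).float()                  # foreground mask
    loss_q = (F.l1_loss(pred_quality, target_quality, reduction="none") * fg).sum() / fg.sum().clamp(min=1)
    return loss_cls + loss_q                                     # total loss optimised by gradient descent
```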
  • The depth image can also be randomly masked out in blocks, for example 64*64 pixels at a time, so that the network learns to better utilize the structured information in the depth image.
  • After the above estimation model has been trained with the training data for multiple iterations, verification data are used to verify the accuracy of the trained estimation model. The verification data can be generated in the same way as the training data. Once the accuracy of the estimation model meets the requirements, the estimation model is considered well trained and can be used; if the accuracy does not meet the requirements, training continues.
  • When the trained model is used, the 2D image and the depth image containing the actual objects to be grasped are input, and the predicted grasping quality of the pixels in the 2D image is output.
  • the embodiments of the present disclosure use a multi-task learning framework based on deep learning principles to build a grasping point estimation model, which can effectively solve the problems of high error rate and inability to distinguish adjacent objects in a simple point cloud segmentation scheme.
  • An embodiment of the present disclosure also provides a training device for an estimation model of an object grasping point, referring to FIG. 3 , which includes a processor and a memory storing a computer program, wherein, when the processor executes the computer program, the following is implemented: A method for training an estimation model of an object grasping point described in any embodiment of the present disclosure.
  • The estimation model trained by this method predicts the grasping quality of pixels in the 2D image through pixel-level dense prediction: one branch performs pixel-level foreground/background classification prediction, while another branch outputs a grasping quality prediction value (the predicted grasping quality) for each pixel classified as foreground in the 2D image.
  • Both the backbone network and the branch networks of the estimation model in the embodiment of the present disclosure include depth channels. At the input end, the depth image containing the depth channel information is fed into the backbone network; the features learned from the depth channel are then fused, along the channel dimension, with the features of the color 2D image, and pixel-by-pixel multi-task prediction is performed. This helps the estimation model better handle the grasping point estimation task in scenes where the objects to be grasped are stacked.
  • An embodiment of the present disclosure also provides a method for estimating an object grasping point, as shown in FIG. 6 , including:
  • Step 410: acquire a scene image containing the objects to be grasped, where the scene image includes a 2D image, or includes a 2D image and a depth image;
  • Step 420: input the scene image into the estimation model of the object grasping point, the estimation model being a model trained by the training method described in any embodiment of the present disclosure;
  • Step 430: determine the position of the grasping point of the object to be grasped according to the predicted grasping quality of the pixels in the 2D image output by the estimation model.
  • In the embodiments of the present disclosure, the camera is driven so that scene images of the objects to be grasped, such as 2D images and depth images, can be captured by depth cameras adapted to various industrial scenes. After the color 2D image and the depth image are acquired from the depth camera, they are cropped and scaled to the input size required by the estimation model and then fed into the estimation model.
  • determining the position of the grasping point of the object to be grasped according to the predicted grasping quality of the pixels in the 2D image output by the estimation model includes:
  • the obtained candidate grasping points are sorted based on a predetermined rule, and an optimal candidate grasping point is determined as the grasping point of the object to be grasped according to the ranking.
  • When the obtained candidate grasping points are sorted based on predetermined rules, the sorting may be based on predetermined heuristic rules. The heuristic rules may be set, for example, according to the distance of the grasping point from the camera, whether the grasping point lies inside the actual material frame, and whether grasping at that point would cause a collision; these criteria are used to rank the candidate grasping points, and the best candidate grasping point is determined as the grasping point of the object to be grasped.
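  • One way to express such heuristic ranking is a simple additive score over the conditions mentioned above; the helper predicates (bin membership, collision check, distance to camera) and the weights below are hypothetical illustrations rather than a rule set prescribed by this disclosure.

```python
def rank_candidates(candidates, in_bin, collision_free, dist_to_camera):
    """Sort candidate grasping points, best first.  `in_bin[c]` and
    `collision_free[c]` are assumed boolean maps, `dist_to_camera[c]` is in metres;
    the weights are illustrative."""
    def score(c):
        s = 0.0
        s += 1.0 if in_bin[c] else -10.0          # must lie inside the material frame
        s += 1.0 if collision_free[c] else -10.0  # reject points whose grasp would collide
        s += -0.5 * dist_to_camera[c]             # prefer points closer to the camera
        return s
    return sorted(candidates, key=score, reverse=True)
```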
  • An embodiment of the present disclosure also provides a device for estimating the grasping point of an object, including a processor and a memory storing a computer program, wherein, when the processor executes the computer program, the method for estimating object grasping points described in any embodiment of the present disclosure is implemented.
  • Based on the trained estimation model, the above embodiments of the present disclosure feed the 2D image and the depth image captured by the camera into the estimation model for forward inference and output the predicted grasping quality of the pixels in the 2D image. If the number of pixels whose predicted grasping quality exceeds a set quality threshold is larger than a set number, a set number of pixels with the best predicted grasping quality (for example, the top 50 or top 100) can be selected. After the selected pixels are clustered and one or more cluster centers are computed, the pixel in the 2D image nearest to each cluster center (which may be a single pixel or a pixel in a region) can be taken as a candidate grasping point. Since the adopted estimation model achieves good accuracy, the estimation method and device of this embodiment can improve the accuracy of object grasping point estimation and thereby the success rate of grasping.
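  • The candidate selection just described can be sketched as: keep a set number of best-quality pixels above the threshold, cluster them, and take the pixel nearest each cluster center as a candidate. K-means is used here only as one possible clustering choice; the threshold, top-K size, and cluster count are illustrative values.

```python
import numpy as np
from sklearn.cluster import KMeans

def candidate_grasp_pixels(quality_map, quality_thresh=0.7, top_k=100, n_clusters=3):
    """quality_map: H x W predicted grasping quality. Returns (row, col) candidates."""
    ys, xs = np.where(quality_map > quality_thresh)
    if len(ys) == 0:
        return []
    order = np.argsort(quality_map[ys, xs])[::-1][:top_k]      # keep the best `top_k` pixels
    pts = np.stack([ys[order], xs[order]], axis=1).astype(float)
    k = min(n_clusters, len(pts))
    centers = KMeans(n_clusters=k, n_init=10).fit(pts).cluster_centers_
    # the selected pixel nearest each cluster center becomes a candidate grasping point
    return [tuple(pts[np.argmin(np.linalg.norm(pts - c, axis=1))].astype(int)) for c in centers]
```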
  • An embodiment of the present disclosure also provides a robot vision system, as shown in FIG. 7 , including:
  • the camera 1 is configured to shoot a scene image containing an object to be captured, and the scene image includes a 2D image, or includes a 2D image and a depth image;
  • the control device 2, which includes the object grasping point estimation device according to an embodiment of the present disclosure and is configured to determine the position of the grasping point of the object to be grasped according to the scene image captured by the camera, and to control the grasping action performed by the robot according to the position of the grasping point;
  • the robot 3 is configured to perform the grasping action.
  • the robot vision system of the embodiments of the present disclosure can improve the accuracy of object grasping point estimation, thereby improving the success rate of grasping.
  • An embodiment of the present disclosure also provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements the method for generating training data of the object grasping point estimation model described in any embodiment of the present disclosure, or the method for training the object grasping point estimation model described in any embodiment of the present disclosure, or the method for estimating object grasping points described in any embodiment of the present disclosure.
  • Computer-readable media may include computer-readable storage media that correspond to tangible media such as data storage media, or communication media including any medium that facilitates transfer of a computer program from one place to another, eg, according to a communication protocol.
  • a computer-readable medium may generally correspond to a non-transitory tangible computer-readable storage medium or a communication medium such as a signal or carrier wave.
  • Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code and/or data structures for implementation of the techniques described in this disclosure.
  • a computer program product may comprise a computer readable medium.
  • By way of example and not limitation, such computer-readable storage media may include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk or other magnetic storage, flash memory, or any other medium that can be used to store the desired program code in the form of instructions or data structures and that can be accessed by a computer.
  • Also, any connection may properly be termed a computer-readable medium. For example, if instructions are transmitted from a website, server, or other remote source over coaxial cable, fiber-optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave, then those media are included in the definition of medium.
  • Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
  • The techniques may be implemented by one or more processors, such as one or more digital signal processors (DSPs), general-purpose microprocessors, application-specific integrated circuits (ASICs), field-programmable logic arrays (FPGAs), or other equivalent integrated or discrete logic circuitry.
  • processors may refer to any of the foregoing structure or any other structure suitable for implementation of the techniques described herein.
  • the functionality described herein may be provided within dedicated hardware and/or software modules configured for encoding and decoding, or incorporated in a combined codec. Also, the techniques may be fully implemented in one or more circuits or logic elements.
  • the technical solutions of the embodiments of the present disclosure may be implemented in a wide variety of devices or devices, including a wireless handset, an integrated circuit (IC), or a set of ICs (eg, a chipset).
  • Various components, modules, or units are described in the disclosed embodiments to emphasize functional aspects of devices configured to perform the described techniques, but they do not necessarily require realization by different hardware units. Rather, as described above, the various units may be combined in a codec hardware unit or provided by a collection of interoperable hardware units (including one or more processors as described above) in combination with suitable software and/or firmware.


Abstract

An object grabbing point estimation method, apparatus and system, a model training method, apparatus and system, and a data generation method, apparatus and system. The data generation method comprises: sampling a grabbing point on the basis of a 3D model of a sample object, and assessing the grabbing quality of the sampled point; and rendering a simulation scene, into which a 3D model of a first object is loaded, so as to generate a sample image for training and target grabbing quality of a pixel point therein. The sample image and the target grabbing quality are taken as training data to train an object grabbing point estimation model, and the trained model is used for estimating an object grabbing point. By means of the embodiments of the present disclosure, automatic labeling of a sample image can be realized, training data can be efficiently generated at high quality, and the estimation precision of a grabbing point is improved.

Description

Object Grasping Point Estimation, Model Training and Data Generation Methods, Devices and Systems
Technical Field
The present disclosure relates to, but is not limited to, artificial intelligence technology, and in particular to a method, device and system for object grasping point estimation, model training and data generation.
Background Art
In robot vision guidance applications, a challenge faced by the robot vision system is that it must guide the robot to grasp thousands of different stock keeping units (SKUs). These objects are usually unknown to the system, or their variety is so large that maintaining physical models or texture templates for every SKU is prohibitively expensive. The simplest example is the depalletizing application: although the objects to be grasped are all cuboid objects (boxes or cartons), their texture, size and other attributes vary from scene to scene, so classic object localization or recognition schemes based on template matching are difficult to apply in such scenarios. In some e-commerce warehousing scenarios, many objects have irregular shapes, the most common being box-like and bottle-like objects; these goods are stacked together, and the robot vision guidance system must efficiently pick them out of the stack one by one for subsequent code scanning or identification and delivery into the appropriate target bin.
In this process, how the robot vision system, without prior knowledge of the objects, can estimate the most suitable grasping point for the robot (which may be, but is not limited to, a suction point) from the scene captured by the camera and guide the robot to perform the grasping action remains a problem to be solved.
Summary of the Invention
The following is an overview of the subject matter described in detail herein. This overview is not intended to limit the scope of the claims.
An embodiment of the present disclosure provides a method for generating training data for an object grasping point estimation model, including:
acquiring a 3D model of a sample object, sampling grasping points based on the 3D model of the sample object, and evaluating the grasping quality of the sampled points;
rendering a simulated scene loaded with a 3D model of a first object to generate a sample image for training, the first object being selected from the sample objects; and
generating a target grasping quality of pixels in the sample image according to the grasping quality of the sampling points of the first object.
An embodiment of the present disclosure also provides a device for generating training data for an object grasping point estimation model, including a processor and a memory storing a computer program, wherein the processor, when executing the computer program, implements the method for generating training data for an object grasping point estimation model described in any embodiment of the present disclosure.
The methods and devices of the above embodiments of the present disclosure realize automatic annotation of sample images and can generate training data efficiently and with high quality, avoiding the heavy workload and unstable annotation quality associated with manual annotation.
An embodiment of the present disclosure provides a training method for an object grasping point estimation model, including:
acquiring training data, the training data including a sample image and a target grasping quality of pixels in the sample image;
training the object grasping point estimation model by machine learning with the sample image as input data, where during training a loss is computed from the difference between the predicted grasping quality of pixels in the sample image output by the estimation model and the target grasping quality;
wherein the estimation model includes a backbone network adopting a semantic segmentation network architecture and a multi-branch network adopting a multi-task learning network architecture.
An embodiment of the present disclosure also provides a training device for an object grasping point estimation model, including a processor and a memory storing a computer program, wherein the processor, when executing the computer program, implements the training method for the object grasping point estimation model described in any embodiment of the present disclosure.
The methods and devices of the above embodiments learn, through training, the grasping quality of pixels in a 2D image, which provides better accuracy and stability than directly regressing an optimal grasping point.
An embodiment of the present disclosure provides a method for estimating an object grasping point, including:
acquiring a scene image containing an object to be grasped, where the scene image includes a 2D image, or includes a 2D image and a depth image;
inputting the scene image into an object grasping point estimation model, where the estimation model is a model trained by the training method described in any embodiment of the present disclosure; and
determining the position of the grasping point of the object to be grasped according to the predicted grasping quality of pixels in the 2D image output by the estimation model.
An embodiment of the present disclosure also provides a device for estimating an object grasping point, including a processor and a memory storing a computer program, wherein the processor, when executing the computer program, implements the method for estimating an object grasping point described in any embodiment of the present disclosure.
An embodiment of the present disclosure also provides a robot vision system, including:
a camera, configured to capture a scene image containing an object to be grasped, where the scene image includes a 2D image, or includes a 2D image and a depth image;
a control device, including the device for estimating an object grasping point according to an embodiment of the present disclosure, the control device being configured to determine the position of the grasping point of the object to be grasped according to the scene image captured by the camera, and to control the grasping action performed by the robot according to the position of the grasping point; and
a robot, configured to perform the grasping action.
The estimation methods, devices and robot vision systems of the above embodiments of the present disclosure can improve the accuracy of object grasping point estimation and thereby increase the success rate of grasping.
An embodiment of the present disclosure also provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements the method for generating training data for an object grasping point estimation model described in any embodiment of the present disclosure, or the training method for an object grasping point estimation model described in any embodiment of the present disclosure, or the method for estimating an object grasping point described in any embodiment of the present disclosure.
Other aspects will become apparent upon reading and understanding the drawings and the detailed description.
Brief Description of the Drawings
The accompanying drawings are provided for an understanding of the embodiments of the present disclosure and constitute a part of the specification; together with the embodiments of the present disclosure, they serve to explain the technical solutions of the present disclosure and do not constitute a limitation on those technical solutions.
FIG. 1 is a flowchart of a method for generating training data for an object grasping point estimation model according to an embodiment of the present disclosure;
FIG. 2 is a flowchart of generating annotation data according to the grasping quality of sampling points in FIG. 1;
FIG. 3 is a schematic diagram of a device for generating training data according to an embodiment of the present disclosure;
FIG. 4 is a flowchart of a training method for an object grasping point estimation model according to an embodiment of the present disclosure;
FIG. 5 is a network structure diagram of an estimation model according to an embodiment of the present disclosure;
FIG. 6 is a flowchart of a method for estimating an object grasping point according to an embodiment of the present disclosure;
FIG. 7 is a schematic structural diagram of a robot vision system according to an embodiment of the present disclosure.
Detailed Description
The present disclosure describes a number of embodiments, but the description is illustrative rather than restrictive, and it will be apparent to those of ordinary skill in the art that many more embodiments and implementations are possible within the scope of the embodiments described in the present disclosure.
In the description of the present disclosure, words such as "exemplary" or "for example" are used to indicate an example, illustration or explanation. Any embodiment described in the present disclosure as "exemplary" or "for example" should not be construed as preferred or advantageous over other embodiments. "And/or" herein describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. "A plurality of" means two or more. In addition, in order to clearly describe the technical solutions of the embodiments of the present disclosure, words such as "first" and "second" are used to distinguish identical or similar items having substantially the same functions and effects. Those skilled in the art will understand that words such as "first" and "second" do not limit quantity or execution order, nor do they require the items to be different.
In describing representative exemplary embodiments, the specification may have presented a method and/or process as a particular sequence of steps. However, to the extent that the method or process does not depend on the particular order of steps described herein, it should not be limited to that particular order. As those of ordinary skill in the art will appreciate, other orders of steps are also possible; therefore, the particular order of steps set forth in the specification should not be construed as a limitation on the claims. Furthermore, claims directed to the method and/or process should not be limited to performing their steps in the order written; those skilled in the art can readily appreciate that these orders may be varied while still remaining within the spirit and scope of the embodiments of the present disclosure.
With the development of deep learning technology, tasks such as detection (estimation of 2D object position and size) and segmentation (pixel-level object category prediction or instance index prediction) can already be accomplished by training visual neural network models; object grasping point estimation can likewise be realized as a data-driven method and apparatus based on a deep learning framework and appropriate training data.
In one solution, after the camera captures a color image and a depth image, the point cloud is used for plane segmentation or Euclidean-distance-based segmentation in order to segment and detect the different objects in the scene; the center of each segmented cluster of points is then taken as a grasping point candidate, a series of heuristic rules is used to rank the candidates, and the robot is finally guided to grasp the best candidate. A feedback system is also introduced to record the success or failure of each grasp; if a grasp succeeds, the current object is used as a template to match the grasping point for the next grasp. The problem with this solution is that the performance of plain point cloud segmentation is relatively weak, producing many erroneous grasping points, and the point cloud segmentation scheme easily fails when the objects are closely packed.
In another solution, a deep learning framework is used: a limited amount of data is manually annotated with grasping point directions and regions to obtain training data, and a neural network model is trained on these data. During system operation, the vision system can process images similar to the training set and estimate the object grasping points in them. The problem with this solution is the high cost of data collection and annotation; at the annotation level in particular, grasping point directions and regions are difficult to label and require annotators with strong technical skills, and the annotations involve many human factors, so annotation quality cannot be systematically controlled and a model with systematic quality assurance cannot be produced.
An embodiment of the present disclosure provides a method for generating training data for an object grasping point estimation model, as shown in FIG. 1, including:
Step 110: acquiring a 3D model of a sample object, sampling grasping points based on the 3D model of the sample object, and evaluating the grasping quality of the sampled points.
The sample objects may be various box-shaped items, bottle-shaped items, box-like objects and bottle-like objects, or objects of other shapes. Sample objects can usually be selected from the items actually to be grasped, but they are not required to cover every type of item to be grasped. Items whose geometry is typical of the items to be grasped can usually be chosen as sample objects, but the present disclosure does not require the sample objects to cover the shapes of all items to be grasped; owing to the generalization ability of the model, the trained model can still estimate grasping points for items of other shapes.
Step 120: rendering a simulated scene loaded with a 3D model of a first object to generate a sample image for training, the first object being selected from the sample objects.
The loaded first objects may be selected from the sample objects randomly by the system, manually, or according to configured rules. The selected first objects may include one kind of sample object or several kinds, and may include sample objects of one shape or of several shapes; this embodiment is not limited in this respect.
Step 130: generating a target grasping quality of pixels in the sample image according to the grasping quality of the sampling points of the first object.
The target grasping quality of pixels in the sample image here may be the target grasping quality of some of the pixels in the sample image or of all of them, and the annotation may be made pixel by pixel or for a set of pixels, such as a region of the sample image containing two or more pixels. Because the annotated grasping quality of pixels in the sample image serves as the target data during training, it is referred to herein as the target grasping quality of the pixels.
The embodiments of the present disclosure first acquire the 3D model of the sample object and, based on the 3D model, sample grasping points and evaluate the grasping quality of the sampled points; because the geometry of the 3D model itself is accurate, the grasping quality evaluation can be completed with high quality. After the 3D model of the selected first object is loaded to generate the simulated scene, the position and posture of the 3D model can be tracked during loading, so the positional relationship between the sampling points and the pixels in the sample image can be calculated and the grasping quality of the sampling points can be transferred to the corresponding pixels in the sample image. The training data generated by the embodiments of the present disclosure include sample images and annotation data (including but not limited to the target grasping quality); the embodiments thus realize automatic annotation of sample images, can generate training data efficiently and with high quality, and avoid the heavy workload and unstable annotation quality associated with manual annotation.
In an exemplary embodiment of the present disclosure, acquiring the 3D model of the sample object includes: creating or collecting a 3D model of the sample object, and normalizing it so that the centroid of the sample object is located at the origin of the model coordinate system of the 3D model and the principal axis of the sample object is aligned with one coordinate axis of the model coordinate system. When the 3D model is created in this embodiment, the normalization can take the form of a unified modeling rule: the origin of the model coordinate system is placed at the centroid of the sample object, and one coordinate axis of the model coordinate system is aligned with the direction of the object's principal axis. If an already created 3D model is collected, the normalization can be achieved by translating and rotating the 3D model so that the centroid is at the origin and the principal axis is aligned with a coordinate axis, as required above.
In an exemplary embodiment of the present disclosure, sampling grasping points based on the 3D model of the sample object includes: performing point cloud sampling on the 3D model of the sample object, and determining and recording a first position and a grasping direction of each sampling point in the 3D model, where the first position is expressed by the coordinates of the sampling point in the model coordinate system of the 3D model and the grasping direction is determined from the normal vector of the sampling point in the 3D model. When performing point cloud sampling based on the 3D model, this embodiment may sample uniformly over the surface of the sample object; the specific algorithm is not limited. By setting the number of sampling points appropriately, the sampling points on the surface of the sample object can be given an appropriate density so that suitable grasping points are not missed. In one example, the normal vector of the plane fitted to all points within a set neighborhood of a sampling point may be used as the normal vector of that sampling point.
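As one way to realize the uniform surface sampling described above, the following is a minimal farthest point sampling sketch in Python over a point set taken from the model surface; it is an illustrative assumption, not necessarily the sampler used in the embodiments, and the mesh variable in the usage comment is hypothetical.

```python
import numpy as np

def farthest_point_sampling(points: np.ndarray, n_samples: int, seed: int = 0) -> np.ndarray:
    """Select n_samples indices from an (N, 3) point set so that the chosen
    points are spread roughly uniformly over the object surface."""
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    chosen = np.zeros(n_samples, dtype=np.int64)
    chosen[0] = rng.integers(n)
    dist = np.full(n, np.inf)                     # squared distance to the chosen set
    for i in range(1, n_samples):
        diff = points - points[chosen[i - 1]]
        dist = np.minimum(dist, np.einsum("ij,ij->i", diff, diff))
        chosen[i] = int(np.argmax(dist))          # farthest point from all chosen so far
    return chosen

# Usage (hypothetical mesh object): sample 512 candidate grasping points.
# surface_points = np.asarray(mesh.vertices)
# grasp_candidates = surface_points[farthest_point_sampling(surface_points, 512)]
```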
In an exemplary embodiment of the present disclosure, evaluating the grasping quality of the sampling points includes: in a scenario where a single suction cup is used to pick up the sample object, estimating the grasping quality of each sampling point from its seal quality and its resistance quality, where the seal quality is determined by the degree of airtightness between the end of the suction cup and the surface of the sample object when the suction cup picks up the sample object at that sampling point with its axis aligned with the grasping direction of the point, and the resistance quality is determined by the degree to which the torque the suction cup can generate when picking up the object counteracts the gravitational torque of the sample object in that situation.
In the embodiments of the present disclosure, when the suction cup is used to pick up the sample object, the gravitational torque tends to rotate the sample object and make it fall (the mass of the sample object is assigned at configuration time), while the suction force of the suction cup on the sample object and the friction between the end of the suction cup and the sample object can provide a torque that counteracts the gravitational torque and prevents the sample object from falling; the suction force and friction force may be given as configuration information or calculated from configuration information (such as suction cup parameters, object material, etc.). The degree of counteraction therefore reflects how stable the object is during suction and can be calculated by the relevant formulas. The seal quality and the resistance quality may each be given a score, and the sum, average, or weighted average of the two scores is then taken as the grasping quality of the sampling point. The seal quality and resistance quality of a sampling point are determined by the local geometric properties of the 3D model and fully reflect the relationship between the local geometric information of the object and the quality of a grasping point, so an accurate evaluation of the grasping quality of the sampling points can be achieved.
Although the embodiments of the present disclosure take picking up an object with a single suction cup as an example, the present disclosure is not limited thereto; for grasping modes that pick up an object at multiple points or clamp an object at multiple points, the grasping quality of sampling points can likewise be evaluated using indicators that reflect grasping efficiency, object stability and probability of success.
In an exemplary embodiment of the present disclosure, the simulated scene is obtained by loading the 3D model of the first object into an initial scene, and the loading process includes:
selecting the type and quantity of the first objects to be loaded from the sample objects, and assigning a mass to each first object;
loading the 3D models of the first objects into the initial scene at random positions and postures;
using a physics engine to simulate the falling process of the first objects and the finally formed stacking state, to obtain the simulated scene; and
recording a second position and posture of the 3D model of each first object in the simulated scene.
Through the above loading process, the embodiments of the present disclosure can simulate various scenes of stacked objects; the training data generated from such scenes make the model trained on them suitable for estimating object grasping points in complex scenes of stacked objects, solving the problem that grasping points are difficult to estimate in such scenes. A simulated bin may be set in the initial scene, the 3D models of the first objects loaded into the bin, and the collision process between the first objects and between the first objects and the bin simulated, so that the finally formed simulated scene of stacked objects is closer to a real scene. The bin is not essential, however; in other embodiments of the present disclosure, a simulated scene in which the first objects are stacked in an orderly manner may also be loaded, depending on the need to simulate the actual working scene.
For the same initial scene, the first objects may be loaded multiple times in different ways to obtain multiple simulated scenes; the different ways may include, for example, different types and/or quantities of loaded first objects, different initial positions and postures of the 3D models at loading time, and so on.
In an exemplary embodiment of the present disclosure, rendering the simulated scene to generate sample images for training includes: rendering each simulated scene at least twice to obtain at least two sets of sample images for training, where in each rendering a simulated camera is added to the simulated scene, a light source is set, and textures are added to the loaded first objects, and the rendered 2D image and depth image form one set of sample images; any two of the multiple renderings differ in at least one of the following parameters: object texture, simulated camera parameters, lighting parameters. In this embodiment the simulated environment is lit during image rendering; by adjusting the simulated camera parameters (such as intrinsics, position, angle, etc.), the lighting parameters (such as light color and intensity) and the object textures, the degree of data randomization can be increased and the content and number of sample images enriched, thereby improving the quality of the training data and, in turn, the performance of the trained estimation model.
In one example of this embodiment, adding textures to the loaded first objects in each rendering includes: in each rendering, for each first object loaded into the simulated scene, randomly selecting one of multiple collected real textures and applying it to the surface of that first object; or, in each rendering, for each type of first object loaded into the simulated scene, randomly selecting one of multiple collected real textures and applying it to the surfaces of the first objects of that type. This example uses randomization to bridge the domain gap between real data and simulated data. The real textures may, for example, be collected from images of actual objects, images showing real textures, and so on. By randomly applying the selected textures to the surfaces of the randomly stacked first objects in the simulated scene, multiple images with different textures can be rendered. By providing the object grasping point estimation model with sample images that have different textures but relatively consistent geometric information, and generating the annotation information from the grasping quality of sampling points computed from local geometric information, the embodiments of the present disclosure make the estimation model use local geometric information to predict the grasping quality of grasping points, thereby achieving generalization of the model to unknown objects.
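As a minimal sketch of the per-rendering randomization described in the last two paragraphs (a random real texture per loaded object, perturbed simulated camera parameters, and random light color and intensity), the Python snippet below draws one parameter set; the value ranges and the render_scene call are assumptions, since the disclosure does not prescribe a particular renderer.

```python
import numpy as np

def randomize_render_params(object_ids, texture_paths, rng=None):
    """Draw one randomized set of rendering parameters for a simulated scene."""
    rng = rng or np.random.default_rng()
    return {
        # a random collected real texture for every loaded first object
        "textures": {oid: texture_paths[rng.integers(len(texture_paths))] for oid in object_ids},
        # perturbed simulated camera intrinsics and pose (illustrative ranges)
        "focal_length_px": float(rng.uniform(500.0, 700.0)),
        "camera_position": (np.array([0.0, 0.0, 1.2]) + rng.normal(0.0, 0.05, 3)).tolist(),
        "camera_tilt_deg": float(rng.uniform(-10.0, 10.0)),
        # random light color (RGB) and intensity
        "light_color": rng.uniform(0.6, 1.0, 3).tolist(),
        "light_intensity": float(rng.uniform(0.5, 2.0)),
    }

# Each simulated scene is rendered at least twice with independently drawn parameters:
# rgb_a, depth_a = render_scene(scene, randomize_render_params(ids, textures))  # hypothetical call
# rgb_b, depth_b = render_scene(scene, randomize_render_params(ids, textures))
```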
In an exemplary embodiment of the present disclosure, the sample image includes a 2D image and a depth image, and generating the target grasping quality of pixels in the sample image according to the grasping quality of the sampling points of the first object includes:
processing each simulated scene for which a 2D image and a depth image have been rendered as follows, as shown in FIG. 2:
Step 210: obtaining a point cloud of the first objects visible in the simulated scene from the intrinsic parameters of the simulated camera used at rendering time and the rendered depth image;
Step 220: determining the positions of target sampling points in the point cloud according to the first positions of the target sampling points in the 3D model, the second positions of the 3D models in the simulated scene and the posture change after loading, where the target sampling points are the sampling points of the visible first objects;
Step 230: determining the grasping quality of points in the point cloud according to the grasping quality of the target sampling points and their positions in the point cloud, and annotating it as the target grasping quality of the corresponding pixels in the 2D image.
The point cloud of the visible first objects obtained from the intrinsic parameters of the simulated camera at rendering time and the depth image (and possibly other information) is not necessarily aligned with the positions of the target sampling points computed from the above first positions, second positions and posture changes. The points of the point cloud have a pixel-level one-to-one correspondence with the pixels of the 2D image, whereas a target sampling point mapped into the 2D image does not necessarily coincide with a particular pixel and may fall between several pixels. It is therefore necessary to determine the grasping quality of the points in the point cloud from the grasping quality of the target sampling points and their positions in the point cloud.
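To make steps 210 to 230 concrete, the sketch below transforms a target sampling point from the model coordinate system into the camera frame using the recorded load pose, projects it with pinhole intrinsics, and writes its quality into the nearest pixel of a sparse label image; the simple rounding rule and the variable names are illustrative assumptions (the embodiments may instead assign quality to neighboring cloud points, as described next).

```python
import numpy as np

def sample_point_to_pixel(p_model, T_world_model, T_world_cam, K):
    """Map one sampling point (model frame) to (row, col) in the rendered image.
    T_* are 4x4 homogeneous transforms; K is the 3x3 camera intrinsic matrix."""
    p_world = T_world_model @ np.append(p_model, 1.0)   # model frame -> world (loaded pose)
    p_cam = np.linalg.inv(T_world_cam) @ p_world        # world -> camera frame
    u, v, w = K @ p_cam[:3]
    return int(round(v / w)), int(round(u / w))         # may fall between pixels; rounded here

def quality_label_image(samples, qualities, T_world_model, T_world_cam, K, shape):
    """Write each sampling point's grasp quality into a sparse (H, W) target image."""
    target = np.full(shape, np.nan)                      # NaN marks unlabeled pixels
    for p, q in zip(samples, qualities):
        r, c = sample_point_to_pixel(p, T_world_model, T_world_cam, K)
        if 0 <= r < shape[0] and 0 <= c < shape[1]:
            target[r, c] = q if np.isnan(target[r, c]) else max(q, target[r, c])
    return target
```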
In an exemplary embodiment of the present disclosure, determining the grasping quality of the points in the point cloud according to the grasping quality of the target sampling points and their positions in the point cloud includes:
first, for each target sampling point, setting the grasping quality of the points in the point cloud adjacent to that target sampling point to the grasping quality of that target sampling point; or
second, for each point in the point cloud, obtaining its grasping quality by interpolating the grasping qualities of the target sampling points adjacent to it; or
third, for each target sampling point, setting the grasping quality of the points in the point cloud adjacent to that target sampling point to the grasping quality of that target sampling point, and, after the points adjacent to all target sampling points have been handled, obtaining the grasping quality of the other points in the point cloud by interpolation.
The embodiments of the present disclosure thus provide several methods for transferring the grasping quality of the target sampling points to the points of the point cloud. In the first method, the grasping quality of a target sampling point is assigned to its neighboring points in the point cloud. In one example, the neighboring points may be the one or more points of the point cloud closest to the target sampling point; for instance, based on a set distance threshold, the points of the point cloud whose distance to the target sampling point is smaller than the threshold may be selected as the points adjacent to that target sampling point. The second method is an interpolation method: the grasping quality of a point in the point cloud is interpolated from the grasping qualities of several neighboring target sampling points. The interpolation may be based on Gaussian filtering, or different weights may be assigned to the target sampling points according to their distances to the point, with larger distances receiving smaller weights, and the grasping quality of the point obtained as the weighted average of the grasping qualities of those target sampling points; other interpolation methods may also be used. The target sampling points adjacent to a point may likewise be selected using a set distance threshold, and if only one adjacent target sampling point is found for a point, the grasping quality of that target sampling point may be assigned to it. The third method first determines the grasping quality of the points in the point cloud adjacent to the target sampling points, and then obtains the grasping quality of the other points in the point cloud by interpolation from the grasping quality of those points. Both the second and the third methods yield the grasping quality of all points in the point cloud; after the grasping quality of these points is mapped to the grasping quality of the corresponding pixels in the 2D image, a grasping quality heat map of the 2D image can be drawn. With the first method, only the grasping quality of some points in the point cloud is obtained and, through the mapping, only the grasping quality of some pixels in the 2D image, which is also acceptable; in that case, during training, only the predicted grasping quality of those pixels is compared with the target grasping quality to compute the loss, and the model is then optimized according to the loss.
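The second transfer method above can, for example, be realized with inverse-distance weighting over the target sampling points that lie within a distance threshold of each cloud point; the sketch below shows that variant (a Gaussian-filter-based interpolation would be an alternative), with the radius value chosen purely for illustration.

```python
import numpy as np

def interpolate_quality(cloud_points, sample_points, sample_quality, radius=0.01):
    """For every point of the rendered cloud, interpolate the grasp quality from the
    target sampling points closer than `radius`, weighting by inverse distance."""
    quality = np.zeros(len(cloud_points))
    for i, p in enumerate(cloud_points):
        d = np.linalg.norm(sample_points - p, axis=1)
        near = d < radius
        if not np.any(near):
            continue                                  # no nearby sampling point: stays 0
        w = 1.0 / (d[near] + 1e-9)                    # larger distance -> smaller weight
        quality[i] = float(np.sum(w * sample_quality[near]) / np.sum(w))
    return quality
```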
In an exemplary embodiment of the present disclosure, after the grasping quality of the points in the point cloud is determined according to the grasping quality of the target sampling points and their positions in the point cloud, the generation method further includes: for each target sampling point, taking the grasping direction of that target sampling point as the grasping direction of the points in the point cloud adjacent to it and, in combination with the relative positional relationship between the visible first objects in the point cloud, if it is determined that the grasping space at the points adjacent to that target sampling point is smaller than the required grasping space, lowering the grasping quality of the points in the point cloud whose distance to that target sampling point is smaller than a set distance threshold. The embodiments of the present disclosure take into account that, in the stacked state, a grasping point of good quality on an object may not have enough space for the grasping operation because of neighboring objects; therefore, after the grasping quality of the points in the point cloud is determined, a grasping space check is performed, and the grasping quality of the points affected by insufficient grasping space is adjusted downward, specifically below a set quality threshold, to prevent them from being selected.
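One possible form of the grasping space check described above is sketched below: the quality of cloud points near a sampling point is lowered whenever points belonging to other objects intrude into a cylinder of the suction cup's radius along the approach direction. The radii, clearance, and penalty value are assumptions made only for illustration.

```python
import numpy as np

def downweight_blocked_points(cloud_pts, cloud_obj_ids, quality,
                              sample_pt, sample_dir, sample_obj_id,
                              cup_radius=0.015, clearance=0.05,
                              near_radius=0.01, penalized_quality=0.05):
    """Lower the quality around `sample_pt` if another object blocks the approach."""
    n = sample_dir / np.linalg.norm(sample_dir)          # grasping (approach) direction
    rel = cloud_pts - sample_pt
    along = rel @ n                                       # height along the approach axis
    radial = np.linalg.norm(rel - np.outer(along, n), axis=1)
    intruding = ((cloud_obj_ids != sample_obj_id) & (along > 0.0)
                 & (along < clearance) & (radial < cup_radius))
    if np.any(intruding):                                 # grasping space is insufficient
        near = np.linalg.norm(rel, axis=1) < near_radius  # points within the distance threshold
        quality[near] = np.minimum(quality[near], penalized_quality)
    return quality
```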
In an exemplary embodiment of the present disclosure, the sample image includes a 2D image, and the generation method further includes: annotating the classification of each pixel in the 2D image, the classification including foreground and background, where the foreground is the first objects in the image. The pixel classification can be used to train the estimation model's ability to distinguish foreground from background, so that the foreground points (i.e., the points on the first objects) can be accurately selected from the sample image input to the estimation model and only the foreground points need a predicted grasping quality estimate. The classification of the pixels of the 2D image can also be derived from the classification of the points of the point cloud; by mapping the boundary between the first objects and the background in the simulated scene onto the point cloud, it can be determined whether each point of the point cloud is a foreground point or a background point.
An embodiment of the present disclosure also provides a method for generating training data for an object grasping point estimation model, including:
Step one: collecting 3D models of various sample objects and normalizing the 3D models so that the origin of the model coordinate system is placed at the centroid of the sample object and one coordinate axis of the model coordinate system coincides with the principal axis of the sample object.
A 3D model in a format such as stereolithography (STL) may be used. By collecting statistics of the vertex and face information in the 3D model and taking the center of all vertices, the centroid position of the sample object is obtained, and the origin of the model coordinate system is then translated to the centroid of the sample object. In addition, principal component analysis (PCA) may be used to determine the principal axis direction of the sample object, after which the 3D model of the sample object is rotated so that one coordinate axis of the model coordinate system points in the same direction as the principal axis of the sample object. The normalized 3D model thus obtained has the centroid of the sample object at the origin of its model coordinate system, with one coordinate axis of the model coordinate system aligned with the principal axis of the sample object.
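A minimal Python sketch of this normalization, using the vertex centroid and PCA of the vertices and aligning the principal axis with the z-axis of the model frame, is given below; it is an illustrative implementation of the rule in step one rather than the exact code of the embodiment.

```python
import numpy as np

def normalize_model(vertices: np.ndarray) -> np.ndarray:
    """Translate the vertex centroid to the origin and rotate the principal axis onto +z.
    `vertices` is an (N, 3) array of mesh vertex coordinates."""
    centered = vertices - vertices.mean(axis=0)          # centroid to the origin
    # PCA: the principal axis is the eigenvector of the covariance with the largest eigenvalue
    eigvals, eigvecs = np.linalg.eigh(np.cov(centered.T))
    principal = eigvecs[:, np.argmax(eigvals)]
    # rotation taking `principal` onto the z-axis (Rodrigues' formula)
    z = np.array([0.0, 0.0, 1.0])
    v = np.cross(principal, z)
    s, c = np.linalg.norm(v), float(principal @ z)
    if s < 1e-8:                                          # already parallel or anti-parallel to z
        R = np.eye(3) if c > 0 else np.diag([1.0, -1.0, -1.0])
    else:
        vx = np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])
        R = np.eye(3) + vx + vx @ vx * ((1.0 - c) / s**2)
    return centered @ R.T                                 # rotated, normalized vertices
```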
Step two: sampling grasping points on the 3D model of the sample object, and obtaining and recording the first position and grasping direction of each sampling point.
The sampling process of this embodiment is point cloud sampling of the object model; using the sampled point cloud, the normal vector is estimated from a fixed neighborhood, and each point together with its normal vector represents one sampling point. In one example, taking the scenario of picking up an object with a single suction cup, the grasping point is the suction point. Based on the existing vertices of the object, a voxel sampling method or another sampling method (such as farthest point sampling) is used to obtain a set number of sampling points. At the same time, all points within a certain neighborhood of each sampling point are used to estimate the direction of its normal vector. The normal vector may be estimated, for example, by using random sample consensus (RANSAC) to fit a plane to all points in the neighborhood of the sampling point, the normal vector of which approximates the normal vector of the sampling point.
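The neighborhood-based normal estimation can also be done with a least-squares plane fit, taking the smallest-eigenvalue direction of the local covariance as the plane normal; the sketch below uses this simpler fit in place of RANSAC and is an assumption for illustration.

```python
import numpy as np

def estimate_normal(points: np.ndarray, idx: int, radius: float = 0.01) -> np.ndarray:
    """Estimate the surface normal at points[idx] from its neighborhood.
    The normal is the eigenvector of the local covariance with the smallest
    eigenvalue, i.e. the normal of the best-fit plane through the neighbors."""
    d = np.linalg.norm(points - points[idx], axis=1)
    neighbors = points[d < radius]
    if neighbors.shape[0] < 3:                  # too few neighbors to fit a plane
        return np.array([0.0, 0.0, 1.0])
    cov = np.cov((neighbors - neighbors.mean(axis=0)).T)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    normal = eigvecs[:, 0]                      # smallest-eigenvalue direction
    return normal / np.linalg.norm(normal)
```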
Step three: evaluating the suction quality of the sampling points.
In the scenario where a single suction cup picks up an object, the quality evaluation includes computing the seal quality during suction and the resistance to the gravitational torque during suction (the suction must be able to counteract the gravitational torque to achieve a stable grasp), and the grasping quality of each sampling point is estimated from its seal quality and resistance quality. In one example, for the single-suction-cup scenario, the goal is to evaluate whether a sampled suction point (i.e., a sampling point) is one at which the sample object can be picked up stably.
The evaluation covers two aspects. The first is the seal quality, which can be measured as follows: the end of a suction cup with a set radius is approximated as a polygon, the polygon is projected onto the surface of the 3D model along the grasping direction of the sampling point, and the total edge length of the projected polygon is compared with the original edge length. If the projected total edge length is much larger than the original, the seal is poor; conversely, if it changes little, the seal is good. The degree of increase can be expressed as the ratio of the increase to the original edge length, and a corresponding score can be assigned according to the interval into which this ratio falls. The second aspect is the resistance to the gravitational torque when the suction cup picks up the sample object at the sampling point along the grasping direction (also called the suction point direction). The resistance quality can be computed with a "wrench resistance" model: a "wrench" is a six-dimensional vector whose first three components are forces and last three are torques, the space formed by such vectors is the "wrench space", and "wrench resistance" indicates whether the wrench composed of the forces and torques acting at a point can be resisted. If the gravitational torque can be contained in the wrench space provided by the suction force and the torques produced by its friction, stable suction can be achieved; otherwise it cannot. Finally, the seal quality and resistance quality results are each normalized to a score between 0 and 1 and summed, giving the suction quality evaluation result for each suction point.
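The scoring structure described above can be sketched as follows: a seal score derived from the relative increase of the projected cup-rim perimeter, and a resistance score from how much of the gravitational torque the suction contact can resist, each normalized to [0, 1] and summed. The ratio-to-score mapping and the strongly simplified resistance test are illustrative assumptions standing in for the full wrench-space analysis.

```python
import numpy as np

def seal_score(original_perimeter: float, projected_perimeter: float) -> float:
    """Score in [0, 1]: a small perimeter increase after projecting the cup rim
    onto the object surface indicates a good seal (0.2 is an assumed scale)."""
    increase_ratio = max(projected_perimeter - original_perimeter, 0.0) / original_perimeter
    return float(np.clip(1.0 - increase_ratio / 0.2, 0.0, 1.0))

def resistance_score(suction_force: float, friction_coeff: float, cup_radius: float,
                     mass: float, lever_arm: float, g: float = 9.81) -> float:
    """Score in [0, 1]: fraction of the gravitational torque that the suction contact
    can resist (a simplified stand-in for the wrench-resistance computation)."""
    gravity_torque = mass * g * lever_arm
    resist_torque = friction_coeff * suction_force * cup_radius
    return 1.0 if gravity_torque <= 1e-9 else float(np.clip(resist_torque / gravity_torque, 0.0, 1.0))

def suction_quality(orig_perim, proj_perim, suction_force, friction_coeff,
                    cup_radius, mass, lever_arm) -> float:
    """Sum of the two normalized scores, used as the per-sampling-point quality."""
    return seal_score(orig_perim, proj_perim) + resistance_score(
        suction_force, friction_coeff, cup_radius, mass, lever_arm)
```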
Step four: building an initial simulated data collection scene, i.e., the initial scene, loading several first objects selected from the sample objects into the initial simulated scene, and using a physics engine to simulate the falling dynamics and final stacking postures of the first objects.
This step is based on a physics engine and associated simulation software capable of simulating object dynamics. A simulated bin is added and kept static in the simulation environment to provide the corresponding collision basis. The 3D models of the first objects can be loaded into the simulation environment at random positions and postures, and each 3D model is assigned a certain mass. Through the simulation of the physics engine, the 3D models of the first objects then fall randomly into the bin under simulated gravity, and the physics engine also computes the collision information between the different first objects, so that the first objects form a stacking state very close to a real scene. Based on such a scheme, second positions and postures of the randomly stacked first objects that are close to reality are obtained in the simulated scene.
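Step four can, for instance, be realized with the open-source PyBullet physics engine; the sketch below drops URDF models of the first objects into a static bin and reads back their settled poses. The choice of PyBullet, the URDF file names, and the pose ranges are assumptions, since the disclosure only requires a physics engine capable of rigid-body dynamics.

```python
import numpy as np
import pybullet as p

def simulate_drop(object_urdfs, bin_urdf="bin.urdf", steps=2000, seed=0):
    """Drop the objects into a static bin and return their settled poses
    (the 'second position and posture' of each loaded 3D model)."""
    rng = np.random.default_rng(seed)
    p.connect(p.DIRECT)                                  # headless simulation
    p.setGravity(0, 0, -9.81)
    p.loadURDF(bin_urdf, useFixedBase=True)              # static bin providing the collision basis
    body_ids = []
    for urdf in object_urdfs:
        pos = [rng.uniform(-0.1, 0.1), rng.uniform(-0.1, 0.1), rng.uniform(0.3, 0.6)]
        orn = p.getQuaternionFromEuler(rng.uniform(0.0, 2.0 * np.pi, 3).tolist())
        body_ids.append(p.loadURDF(urdf, pos, orn))       # random initial position and posture
    for _ in range(steps):                                # let the pile settle under gravity
        p.stepSimulation()
    poses = [p.getBasePositionAndOrientation(b) for b in body_ids]
    p.disconnect()
    return poses
```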
Step five: generating annotation data for the sample images rendered from the simulated scene according to the grasping quality of the sampling points.
This step maps the sampling points obtained by grasping point sampling on the 3D models, and the grasping quality of each sampling point obtained by the evaluation, onto the simulated scene of stacked objects. Since the second positions and postures of the 3D models of the first objects are available from the simulation of the scene, and the positions of the sampling points are expressed in the model coordinate systems of the 3D models, the positions of these sampling points in the simulated scene are easy to compute.
To render the simulated scene, a simulated camera is added at a set position in the simulation environment, and a ray-tracing-based rendering engine is used to efficiently render the 2D images (such as texture images) and depth maps of the first objects in the stacked scene. Combined with the intrinsic parameters of the simulated camera, the rendered depth image can be converted into a point cloud. Based on the computed positions of the sampling points in the simulated scene and the rendered point cloud of the first objects, the position of each sampling point in the point cloud of the first object to which it belongs can be determined.
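The depth-to-point-cloud conversion mentioned above is the standard pinhole back-projection; a minimal numpy sketch in the camera frame (z equal to the depth value) is given below, under the usual assumption that the intrinsic matrix has the form [[fx, 0, cx], [0, fy, cy], [0, 0, 1]].

```python
import numpy as np

def depth_to_point_cloud(depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Back-project an (H, W) depth image into an (H*W, 3) point cloud in the
    camera frame, so that cloud point i corresponds to pixel (i // W, i % W)."""
    h, w = depth.shape
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    u, v = np.meshgrid(np.arange(w), np.arange(h))       # pixel coordinates
    x = (u - cx) * depth / fx                             # X = (u - cx) * Z / fx
    y = (v - cy) * depth / fy                             # Y = (v - cy) * Z / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)
```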
为了通过领域随机化技术手段弥补真实数据和仿真数据之间的领域差异。本实施例通过采集各类真实纹理(如实际物体图片,某种规则纹理图片等),并将采集的真实纹理随机地贴于模拟环境中随机堆叠的第一物体表面。在基于光线追踪的模拟相机渲染过程中,就可以渲染出带有不同纹理的2D图像。通过给估计模型提供具有不同纹理但像素点的目标抓取质量相同的2D图像,可以驱动估计模型利用物体的局部几何信息去预测像素点的抓取质量,从而可以实现模型对于不同未知物体的泛化能力。In order to make up for the domain differences between real data and simulated data through domain randomization techniques. In this embodiment, various types of real textures (such as pictures of actual objects, pictures of certain regular textures, etc.) are collected, and the collected real textures are randomly pasted on the surface of the first object stacked randomly in the simulation environment. In the ray tracing-based analog camera rendering process, 2D images with different textures can be rendered. By providing the estimation model with 2D images with different textures but the same quality of pixel capture, the estimation model can be driven to use the local geometric information of the object to predict the quality of pixel capture, so that the model can be used for different unknown objects. ability.
在将采样点的抓取质量传送给点云中的点之前,可以先基于同一第一物体中的采样点的位置对这些采样点的抓取质量做一个高斯滤波。通过求取采样点在点云中邻近的点的方式,使得点云中位于一个采样点的设定邻域范围内的点(即与该采样点邻近的点)可以获得该采样点的抓取质量。而渲染所得的点云和渲染所得的2D图像之间有像素级别的一一对应关系,因此可以将所述邻近的点的抓取质量标注为2D图像中对应像素点的目标抓取质量。在一个示例中,对于2D图像中其他像素点的目标抓取质量,可以根据所述对应像素点的目标抓取质量插值得到。结合采样点的抓取方向和第一物体的 点云的局部几何信息(如第一物体之间的相对位置、距离等),可以将抓取空间不足的像素点的抓取质量调低,以便在选择最优抓取点时过滤掉一部分由于碰撞导致的低质量抓取点。需要说明的是,此处对抓取质量的调整也可以针对2D图像中的对应像素点进行。Before the grasping qualities of the sampling points are transmitted to the points in the point cloud, a Gaussian filter may be performed on the grasping qualities of these sampling points based on the positions of the sampling points in the same first object. By finding the points adjacent to the sampling point in the point cloud, the points in the point cloud within the set neighborhood range of a sampling point (that is, the points adjacent to the sampling point) can obtain the capture of the sampling point quality. There is a pixel-level one-to-one correspondence between the rendered point cloud and the rendered 2D image, so the capture quality of the adjacent point can be marked as the target capture quality of the corresponding pixel in the 2D image. In an example, the target grasping quality of other pixels in the 2D image may be obtained by interpolation according to the target grasping quality of the corresponding pixel. Combining the grabbing direction of the sampling point and the local geometric information of the point cloud of the first object (such as the relative position and distance between the first objects, etc.), the grabbing quality of pixels with insufficient grabbing space can be adjusted down, so that When selecting the optimal grab point, some low-quality grab points caused by collisions are filtered out. It should be noted that the adjustment of the capture quality here may also be performed on corresponding pixels in the 2D image.
In this way, a grasping-quality heat map for the 2D image rendered from the simulated scene is obtained. Optionally, the grasping-quality heat map is output as the annotation data of the sample image; however, the annotation data need not take the form of a heat map, as long as it contains the target grasping quality of the pixels of the 2D image. When the grasping-quality heat map is used as the annotation data, training drives the estimation model to learn, or fit, the heat map.
An embodiment of the present disclosure further provides an apparatus for generating training data of an object grasping point estimation model. As shown in FIG. 3, it includes a processor 60 and a memory 50 storing a computer program, wherein the processor 60, when executing the computer program, implements the method for generating training data of an object grasping point estimation model according to any embodiment of the present disclosure. The processor in this and other embodiments of the present disclosure may be an integrated circuit chip with signal processing capability. The processor may be a general-purpose processor, such as a central processing unit (CPU) or a network processor (NP); it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The methods, steps, and logic block diagrams disclosed in the embodiments of the present disclosure may be implemented or executed by such a processor. A general-purpose processor may be a microprocessor or any conventional processor.
The above embodiments of the present disclosure may have the following advantages:
Replacing manual data collection and annotation with synthetic data generation and automatic annotation of the synthetic data reduces cost, increases the degree of automation, and gives stronger guarantees on data quality and on the accuracy of the grasp point annotations.
During synthetic data annotation, the physical model of the object (i.e., its 3D model) and the object's geometric information are used to evaluate grasp point quality based on basic physics, ensuring that the grasp point annotations are reasonable.
During synthetic data generation, domain randomization is applied: large amounts of synthetic data are generated with random textures, random lighting, random camera positions, and so on, to train the estimation model. This allows the estimation model to bridge the domain gap between synthetic and real data and to learn the local geometric features of objects, so that it can accurately perform the grasp point estimation task.
An embodiment of the present disclosure further provides a method for training an estimation model of object grasping points, as shown in FIG. 4, including:
Step 310: acquire training data, the training data including sample images and target grasping qualities of pixels in the sample images;
Step 320: with the sample images as input data, train the estimation model of object grasping points by machine learning, where the loss is computed during training from the difference between the predicted grasping quality of the pixels in the sample image output by the estimation model and the target grasping quality.
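One way to realise the loss of step 320 is a per-pixel regression loss between the predicted and the target quality maps. The sketch below assumes a PyTorch-style setup, which the disclosure does not specify:

```python
import torch

def grasp_quality_loss(pred_quality, target_quality):
    """Per-pixel squared difference between the predicted and the target
    grasping quality maps, averaged over all pixels (a simple MSE loss)."""
    return torch.mean((pred_quality - target_quality) ** 2)
```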
The training method of the estimation model in this embodiment of the present disclosure learns the grasping quality of the pixels in the 2D image, and the optimal grasp point is then selected from the predicted per-pixel grasping quality; compared with regressing the optimal grasp point directly, this yields better accuracy and stability.
The machine learning in the embodiments of the present disclosure may be supervised deep learning, non-deep-learning machine learning, and the like.
In an exemplary embodiment of the present disclosure, the training data is generated by the method for generating training data of an object grasping point estimation model according to any embodiment of the present disclosure.
In an exemplary embodiment of the present disclosure, the network architecture of the estimation model is shown in FIG. 5 and includes:
a backbone network (Backbone) 10, which adopts a semantic segmentation network architecture (e.g., DeepLab, UNet) and is configured to extract features from the input 2D image and depth image; and
a multi-branch network 20, which adopts a multi-task learning network architecture and is configured to make predictions based on the extracted features, so as to output the predicted grasping quality of pixels in the 2D image.
In one example, the multi-branch network (which may also be called the network head or detection head) includes:
a first branch network 21, which learns semantic segmentation information to distinguish foreground from background and is configured to output the classification confidence of each pixel in the 2D image, the classes being foreground and background; and
a second branch network 23, which learns the grasping quality information of pixels in the 2D image and is configured to output the predicted grasping quality of the pixels classified as foreground according to the classification confidence. For example, pixels whose foreground confidence exceeds a set confidence threshold may be regarded as pixels classified as foreground.
Classification is involved in this example, so the training data needs to include classification data.
In this example, the sample images include a 2D image and a depth image; both the backbone network 10 and the multi-branch network 20 contain depth channels, and their convolutional layers may adopt a 3D convolution structure.
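A minimal sketch of such a two-headed architecture is shown below, assuming a PyTorch implementation; the real backbone would be a segmentation network such as UNet or DeepLab, and the small convolutional stack here merely stands in for it:

```python
import torch
import torch.nn as nn

class GraspPointNet(nn.Module):
    """Sketch of the described architecture: a shared encoder over RGB-D
    input and two pixel-wise heads (foreground/background logits and
    per-pixel grasping quality)."""
    def __init__(self, in_channels=4, features=32):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, features, 3, padding=1), nn.ReLU(),
            nn.Conv2d(features, features, 3, padding=1), nn.ReLU(),
        )
        self.seg_head = nn.Conv2d(features, 2, 1)       # foreground / background
        self.quality_head = nn.Conv2d(features, 1, 1)   # per-pixel grasp quality

    def forward(self, rgbd):                            # rgbd: (B, 4, H, W)
        feats = self.backbone(rgbd)
        return self.seg_head(feats), self.quality_head(feats)
```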
In this example, during training, the loss of the first branch network 21 is computed from the classification loss over all pixels of the 2D image; the loss of the second branch network 23 is computed from the difference between the predicted and target grasping qualities of some or all of the pixels classified as foreground; and the loss of the backbone network 10 is computed from the total loss of the first branch network 21 and the second branch network 23. After the losses of the networks are computed, the parameters of each network can be optimized with gradient descent until the loss is minimized and the model converges. During training, random square occlusions may also be applied to the depth image, for example masking a 64x64-pixel block at a time, so that the network makes better use of the structured information in the depth map.
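The random square occlusion of the depth image mentioned above might look roughly like the following sketch, again assuming PyTorch tensors of shape (B, 1, H, W) and images larger than the block size:

```python
import torch

def occlude_depth(depth, block=64, rng=torch.Generator().manual_seed(0)):
    """Zero out one random block x block square per depth image in the batch,
    as a simple form of the occlusion augmentation described above."""
    b, _, h, w = depth.shape
    ys = torch.randint(0, h - block, (b,), generator=rng)
    xs = torch.randint(0, w - block, (b,), generator=rng)
    out = depth.clone()
    for i in range(b):
        out[i, :, ys[i]:ys[i] + block, xs[i]:xs[i] + block] = 0.0
    return out
```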
After multiple iterations of training the estimation model on the training data, the accuracy of the trained model is verified with validation data, which can be generated in the same way as the training data. Once the accuracy of the estimation model meets the requirement, the model is trained and ready for use; if the accuracy falls short, training continues. In use, a 2D image and a depth image containing the actual objects to be grasped are input, and the predicted grasping quality of the pixels in the 2D image is output.
The embodiments of the present disclosure build the grasp point estimation model with a multi-task learning framework based on deep learning principles, which effectively addresses the high error rate of simple point cloud segmentation schemes and their inability to distinguish adjacent objects.
An embodiment of the present disclosure further provides a training apparatus for an estimation model of object grasping points. Referring to FIG. 3, it includes a processor and a memory storing a computer program, wherein the processor, when executing the computer program, implements the method for training an estimation model of object grasping points according to any embodiment of the present disclosure.
The training method of the estimation model in the embodiments of the present disclosure predicts the grasping quality of pixels in the 2D image through pixel-level dense prediction. One branch performs pixel-level foreground/background classification. The other branch outputs a predicted grasping quality value for every pixel of the 2D image classified as foreground. Both the backbone network and the branch networks of the estimation model contain depth channels: at the input, the depth image carrying the depth channel information is fed into the backbone network, and the features learned by the depth channel are fused, along the channel dimension, into the features of the colour 2D image. Performing pixel-wise multi-task prediction on this basis helps the estimation model better handle grasp point estimation in scenes where the objects to be grasped are stacked.
An embodiment of the present disclosure further provides a method for estimating object grasping points, as shown in FIG. 6, including:
Step 410: acquire a scene image containing an object to be grasped, the scene image including a 2D image, or a 2D image and a depth image;
Step 420: input the scene image into an estimation model of object grasping points, where the estimation model is an estimation model trained by the training method described in any embodiment of the present disclosure;
Step 430: determine the position of the grasp point of the object to be grasped according to the predicted grasping quality of the pixels in the 2D image output by the estimation model.
The embodiments of the present disclosure are camera-driven: the scene images of the objects to be grasped, such as the 2D image and the depth image, can be captured with depth cameras adapted to various industrial scenarios. After the colour 2D image and the depth image are acquired from the depth camera, they are cropped and scaled to the image size required by the estimation model input and then fed into the estimation model.
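A simple version of this preprocessing step, assuming OpenCV and a hypothetical 512x512 model input size, could be:

```python
import cv2
import numpy as np

def preprocess(rgb, depth, out_size=(512, 512)):
    """Centre-crop the RGB and depth images to a square and resize them to
    the (assumed) input resolution expected by the estimation model."""
    h, w = rgb.shape[:2]
    s = min(h, w)
    top, left = (h - s) // 2, (w - s) // 2
    rgb = rgb[top:top + s, left:left + s]
    depth = depth[top:top + s, left:left + s]
    rgb = cv2.resize(rgb, out_size, interpolation=cv2.INTER_LINEAR)
    depth = cv2.resize(depth, out_size, interpolation=cv2.INTER_NEAREST)
    return rgb.astype(np.float32) / 255.0, depth.astype(np.float32)
```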
In an exemplary embodiment of the present disclosure, determining the position of the grasp point of the object to be grasped according to the predicted grasping quality of the pixels in the 2D image output by the estimation model includes:
selecting all or some of the pixels of the object to be grasped whose predicted grasping quality is greater than a set quality threshold;
clustering the selected pixels and computing one or more cluster centres, and taking the pixels corresponding to the cluster centres as candidate grasp points of the object to be grasped; and
ranking the obtained candidate grasp points according to a predetermined rule, and determining the best candidate grasp point as the grasp point of the object to be grasped according to the ranking.
In an exemplary embodiment of the present disclosure, the ranking of the candidate grasp points may follow predetermined heuristic rules, set for example according to the candidate grasp point's distance from the camera, whether the suction point lies inside the actual bin, whether the suction point would cause a collision, and so on; this information is used to rank the candidate grasp points, and the best candidate grasp point is determined as the grasp point of the object to be grasped.
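The sketch below illustrates the thresholding, clustering and ranking steps; it uses k-means from scikit-learn and ranks cluster centres by the mean predicted quality of their members as a stand-in for the heuristic rules, which is an assumption rather than the disclosed ranking:

```python
import numpy as np
from sklearn.cluster import KMeans

def select_grasp_point(quality_map, foreground_mask, k=3, q_thresh=0.5):
    """Pick candidate grasp pixels whose predicted quality exceeds a threshold,
    cluster them, and return cluster centres (x, y) sorted best-first by the
    mean quality of their members."""
    ys, xs = np.where((quality_map > q_thresh) & foreground_mask)
    if len(xs) == 0:
        return None
    pts = np.stack([xs, ys], axis=1)
    k = min(k, len(pts))
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(pts)
    ranked = []
    for c in range(k):
        members = pts[labels == c]
        centre = members.mean(axis=0).round().astype(int)
        score = quality_map[members[:, 1], members[:, 0]].mean()
        ranked.append((score, tuple(centre)))
    return [centre for _, centre in sorted(ranked, reverse=True)]
```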
An embodiment of the present disclosure further provides an apparatus for estimating object grasping points. Referring to FIG. 3, it includes a processor and a memory storing a computer program, wherein the processor, when executing the computer program, implements the method for estimating object grasping points according to any embodiment of the present disclosure.
Based on the trained estimation model, the above embodiments of the present disclosure feed the 2D image and the depth image captured by the camera into the estimation model for forward inference and output the predicted grasping quality of the pixels of the 2D image. If more than a set number of pixels have a predicted grasping quality above the set quality threshold, a set number of pixels with the best predicted grasping quality, such as the top 50 or top 100, may be selected. After the selected pixels are clustered and one or more cluster centres are computed, the pixel of the 2D image closest to a cluster centre (which may be a single pixel or a pixel within a region) can be taken as a candidate grasp point. Since the estimation model used can achieve good accuracy, the estimation method and apparatus of this embodiment improve the accuracy of grasp point estimation and hence the success rate of grasping.
An embodiment of the present disclosure further provides a robot vision system, as shown in FIG. 7, including:
a camera 1, configured to capture a scene image containing the object to be grasped, the scene image including a 2D image, or a 2D image and a depth image;
a control apparatus 2, including the apparatus for estimating object grasping points according to claim 20, the control apparatus being configured to determine the position of the grasp point of the object to be grasped according to the scene image captured by the camera, and to control the grasping action performed by the robot according to the position of the grasp point; and
a robot 3, configured to perform the grasping action.
The robot vision system of the embodiments of the present disclosure improves the accuracy of grasp point estimation and hence the success rate of grasping.
An embodiment of the present disclosure further provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements the method for generating training data of an object grasping point estimation model according to any embodiment of the present disclosure, or the method for training an estimation model of object grasping points according to any embodiment of the present disclosure, or the method for estimating object grasping points according to any embodiment of the present disclosure.
In any one or more of the above exemplary embodiments of the present disclosure, the described functions may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over a computer-readable medium as one or more instructions or code and executed by a hardware-based processing unit. Computer-readable media may include computer-readable storage media, which correspond to tangible media such as data storage media, or communication media, which include any medium that facilitates transfer of a computer program from one place to another, for example according to a communication protocol. In this manner, a computer-readable medium may generally correspond to a non-transitory tangible computer-readable storage medium or to a communication medium such as a signal or carrier wave. Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code and/or data structures for implementing the techniques described in this disclosure. A computer program product may include a computer-readable medium.
By way of example and not limitation, such computer-readable storage media may include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, flash memory, or any other medium that can be used to store the desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection may properly be termed a computer-readable medium: for example, if instructions are transmitted from a website, server, or other remote source using a coaxial cable, fibre-optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fibre-optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. It should be understood, however, that computer-readable storage media and data storage media do not include connections, carrier waves, signals, or other transitory media, but are instead directed to non-transitory, tangible storage media. As used herein, disk and disc include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The instructions may be executed by one or more processors, such as one or more digital signal processors (DSPs), general-purpose microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other equivalent integrated or discrete logic circuits. Accordingly, the term "processor" as used herein may refer to any of the foregoing structures or to any other structure suitable for implementing the techniques described herein. In addition, in some aspects the functionality described herein may be provided within dedicated hardware and/or software modules configured for encoding and decoding, or incorporated into a combined codec. The techniques may also be fully implemented in one or more circuits or logic elements.
The technical solutions of the embodiments of the present disclosure may be implemented in a wide variety of devices or apparatuses, including a wireless handset, an integrated circuit (IC), or a set of ICs (e.g., a chipset). Various components, modules, or units are described in the embodiments of the present disclosure to emphasize functional aspects of devices configured to perform the described techniques, but they do not necessarily need to be realized by different hardware units. Rather, as described above, the various units may be combined in a codec hardware unit or provided by a collection of interoperative hardware units, including one or more processors as described above, in conjunction with suitable software and/or firmware.

Claims (22)

  1. A method for generating training data for an object grasping point estimation model, comprising:
    acquiring a 3D model of a sample object, sampling grasp points based on the 3D model of the sample object, and evaluating the grasping quality of the sampling points;
    rendering a simulated scene loaded with a 3D model of a first object to generate sample images for training, the first object being selected from the sample objects; and
    generating a target grasping quality of pixels in the sample images according to the grasping quality of the sampling points of the first object.
  2. The generation method according to claim 1, wherein:
    acquiring the 3D model of the sample object comprises: creating or acquiring the 3D model of the sample object, and normalizing it so that the centre of mass of the sample object is located at the origin of the model coordinate system of the 3D model and the principal axis of the sample object is aligned with a coordinate axis of the model coordinate system.
  3. The generation method according to claim 1, wherein:
    sampling grasp points based on the 3D model of the sample object comprises: performing point cloud sampling on the 3D model of the sample object, and determining and recording a first position and a grasping direction of each sampling point in the 3D model; the first position is represented by the coordinates of the sampling point in the model coordinate system of the 3D model, and the grasping direction is determined according to the normal vector of the sampling point in the 3D model.
  4. The generation method according to claim 1, 2 or 3, wherein:
    evaluating the grasping quality of the sampling points comprises: in a scenario where a single suction cup is used to pick up the sample object, estimating the grasping quality of each sampling point according to its sealing quality and its resisting quality; wherein the sealing quality is determined by the degree of sealing between the end of the suction cup and the surface of the sample object when the suction cup picks up the sample object at the position of the sampling point with the axis of the suction cup aligned with the grasping direction of the sampling point, and the resisting quality is determined, in that case, according to the gravitational torque of the sample object and the degree to which the torque the suction cup can produce when picking up the object resists that gravitational torque.
  5. The generation method according to claim 1, wherein:
    the simulated scene is obtained by loading the 3D model of the first object into an initial scene, and the loading process comprises:
    selecting, from the sample objects, the type and quantity of the first object to be loaded, and assigning a value to the mass of the first object;
    loading the 3D model of the first object into the initial scene at a random position and pose;
    simulating, with a physics engine, the process of the first object falling and the finally formed stacked state, to obtain the simulated scene; and
    recording a second position and a pose of the 3D model of the first object in the simulated scene.
  6. The generation method according to claim 1, wherein:
    rendering the simulated scene to generate sample images for training comprises: rendering each simulated scene at least twice to obtain at least two sets of sample images for training; wherein, for each rendering, a simulated camera is added to the simulated scene, a light source is set, and textures are added to the loaded first object, and the rendered 2D image and depth image serve as one set of sample images; any two of the multiple renderings differ in at least one of the following parameters: object texture, simulated camera parameters, lighting parameters.
  7. The generation method according to claim 6, wherein:
    adding textures to the loaded first object for each rendering comprises:
    for each rendering, for every first object loaded into the simulated scene, randomly selecting one of a plurality of collected real textures and applying it to the surface of that first object; or
    for each rendering, for every type of first object loaded into the simulated scene, randomly selecting one of a plurality of collected real textures and applying it to the surfaces of the first objects of that type.
  8. The generation method according to claim 1, 6 or 7, wherein:
    the sample images include a 2D image and a depth image; generating the target grasping quality of pixels in the sample images according to the grasping quality of the sampling points of the first object comprises:
    processing each simulated scene from which the 2D image and the depth image are rendered as follows:
    obtaining a point cloud of the first objects visible in the simulated scene according to the intrinsic parameters of the simulated camera used for rendering and the rendered depth image;
    determining the positions of target sampling points in the point cloud according to the first positions of the target sampling points in the 3D model, the second position of the 3D model in the simulated scene, and the pose change after loading, the target sampling points being the sampling points of the visible first objects; and
    determining the grasping quality of points in the point cloud according to the grasping quality of the target sampling points and their positions in the point cloud, and annotating the grasping quality of those points as the target grasping quality of the corresponding pixels in the 2D image.
  9. The generation method according to claim 8, wherein:
    determining the grasping quality of points in the point cloud according to the grasping quality of the target sampling points and their positions in the point cloud comprises:
    for each target sampling point, setting the grasping quality of the points in the point cloud adjacent to that target sampling point to the grasping quality of that target sampling point; or
    for each point in the point cloud, obtaining its grasping quality by interpolating from the grasping qualities of the target sampling points adjacent to that point; or
    for each target sampling point, setting the grasping quality of the points in the point cloud adjacent to that target sampling point to the grasping quality of that target sampling point, and after the grasping qualities of the points adjacent to all target sampling points have been determined, obtaining the grasping quality of the other points in the point cloud by interpolation.
  10. The generation method according to claim 9, wherein:
    after determining the grasping quality of the points in the point cloud according to the grasping quality of the target sampling points and their positions in the point cloud, the generation method further comprises: for each target sampling point, taking the grasping direction of that target sampling point as the grasping direction of the points in the point cloud adjacent to it, and, in combination with the relative positional relationship between the visible first objects in the point cloud, when it is determined that the grasping space at the points adjacent to that target sampling point is smaller than the required grasping space, adjusting downward the grasping quality of the points in the point cloud whose distance to that target sampling point is smaller than a set distance threshold.
  11. The generation method according to claim 1, wherein:
    the sample images include a 2D image, and the generation method further comprises: generating classification data for each pixel in the 2D image, the classes including foreground and background.
  12. A method for training an estimation model of object grasping points, comprising:
    acquiring training data, the training data including sample images and target grasping qualities of pixels in the sample images; and
    with the sample images as input data, training the estimation model of object grasping points by machine learning, where during training a loss is computed from the difference between the predicted grasping quality of pixels in the sample image output by the estimation model and the target grasping quality.
  13. The training method according to claim 12, wherein:
    the training data is generated by the generation method according to any one of claims 1 to 11.
  14. The training method according to claim 12 or 13, wherein:
    the sample images include a 2D image and a depth image; the estimation model includes a backbone network and a multi-branch network, wherein:
    the backbone network adopts a semantic segmentation network architecture, contains depth channels, and is configured to extract features from the input 2D image and depth image; and
    the multi-branch network adopts a multi-task learning network architecture, contains depth channels, and is configured to make predictions based on the extracted features and output the predicted grasping quality of pixels in the 2D image.
  15. The training method according to claim 12, wherein:
    the training data is generated by the generation method according to claim 12;
    the multi-branch network comprises:
    a first branch network, configured to output the classification confidence of each pixel in the 2D image, the classes including foreground and background; and
    a second branch network, configured to output the predicted grasping quality of the pixels in the 2D image classified as foreground according to the classification confidence.
  16. A method for estimating object grasping points, comprising:
    acquiring a scene image containing an object to be grasped, the scene image including a 2D image, or a 2D image and a depth image;
    inputting the scene image into an estimation model of object grasping points, wherein the estimation model is an estimation model trained by the training method according to any one of claims 12 to 15; and
    determining the position of a grasp point of the object to be grasped according to the predicted grasping quality of pixels in the 2D image output by the estimation model.
  17. The estimation method according to claim 16, wherein:
    determining the position of the grasp point of the object to be grasped according to the predicted grasping quality of pixels in the 2D image output by the estimation model comprises:
    selecting all or some of the pixels of the object to be grasped whose predicted grasping quality is greater than a set quality threshold;
    clustering the selected pixels and computing one or more cluster centres, and taking the pixels corresponding to the cluster centres as candidate grasp points of the object to be grasped; and
    ranking the obtained candidate grasp points according to a predetermined rule, and determining the best candidate grasp point as the grasp point of the object to be grasped according to the ranking.
  18. An apparatus for generating training data of an object grasping point estimation model, comprising a processor and a memory storing a computer program, wherein the processor, when executing the computer program, implements the method for generating training data of an object grasping point estimation model according to claim 1 or 11.
  19. An apparatus for training an estimation model of object grasping points, comprising a processor and a memory storing a computer program, wherein the processor, when executing the computer program, implements the method for training an estimation model of object grasping points according to claim 12 or 15.
  20. An apparatus for estimating object grasping points, comprising a processor and a memory storing a computer program, wherein the processor, when executing the computer program, implements the method for estimating object grasping points according to claim 16 or 17.
  21. A robot vision system, comprising:
    a camera, configured to capture a scene image containing an object to be grasped, the scene image including a 2D image, or a 2D image and a depth image;
    a control apparatus, including the apparatus for estimating object grasping points according to claim 20, the control apparatus being configured to determine the position of a grasp point of the object to be grasped according to the scene image captured by the camera, and to control the grasping action performed by the robot according to the position of the grasp point; and
    a robot, configured to perform the grasping action.
  22. A non-transitory computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the method for generating training data of an object grasping point estimation model according to any one of claims 1 to 11, or the method for training an estimation model of object grasping points according to any one of claims 12 to 15, or the method for estimating object grasping points according to claim 16 or 17.
PCT/CN2022/135705 2021-12-29 2022-11-30 Object grabbing point estimation method, apparatus and system, model training method, apparatus and system, and data generation method, apparatus and system WO2023124734A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111643324.3 2021-12-29
CN202111643324.3A CN116416444B (en) 2021-12-29 2021-12-29 Object grabbing point estimation, model training and data generation method, device and system

Publications (1)

Publication Number Publication Date
WO2023124734A1

Family

ID=86997564

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/135705 WO2023124734A1 (en) 2021-12-29 2022-11-30 Object grabbing point estimation method, apparatus and system, model training method, apparatus and system, and data generation method, apparatus and system

Country Status (2)

Country Link
CN (1) CN116416444B (en)
WO (1) WO2023124734A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116841914A (en) * 2023-09-01 2023-10-03 星河视效科技(北京)有限公司 Method, device, equipment and storage medium for calling rendering engine
CN117656083B (en) * 2024-01-31 2024-04-30 厦门理工学院 Seven-degree-of-freedom grabbing gesture generation method, device, medium and equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108058172A (en) * 2017-11-30 2018-05-22 深圳市唯特视科技有限公司 A kind of manipulator grasping means based on autoregression model
CN109598264A (en) * 2017-09-30 2019-04-09 北京猎户星空科技有限公司 Grasping body method and device
US20200061811A1 (en) * 2018-08-24 2020-02-27 Nvidia Corporation Robotic control system
CN111553949A (en) * 2020-04-30 2020-08-18 张辉 Positioning and grabbing method for irregular workpiece based on single-frame RGB-D image deep learning

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019037863A1 (en) * 2017-08-24 2019-02-28 Toyota Motor Europe System and method for label augmentation in video data
CN111226237A (en) * 2017-09-01 2020-06-02 加利福尼亚大学董事会 Robotic system and method for robust grasping and aiming of objects
CN108818586B (en) * 2018-07-09 2021-04-06 山东大学 Object gravity center detection method suitable for automatic grabbing by manipulator
CN109159113B (en) * 2018-08-14 2020-11-10 西安交通大学 Robot operation method based on visual reasoning
CN109523629B (en) * 2018-11-27 2023-04-07 上海交通大学 Object semantic and pose data set generation method based on physical simulation
CN109658413B (en) * 2018-12-12 2022-08-09 达闼机器人股份有限公司 Method for detecting grabbing position of robot target object
CN111127548B (en) * 2019-12-25 2023-11-24 深圳市商汤科技有限公司 Grabbing position detection model training method, grabbing position detection method and grabbing position detection device
CN111161387B (en) * 2019-12-31 2023-05-30 华东理工大学 Method and system for synthesizing images in stacked scene, storage medium and terminal equipment
CN212553849U (en) * 2020-05-26 2021-02-19 腾米机器人科技(深圳)有限责任公司 Object grabbing manipulator
CN111844101B (en) * 2020-07-31 2022-09-06 中国科学技术大学 Multi-finger dexterous hand sorting planning method
CN113034526B (en) * 2021-03-29 2024-01-16 深圳市优必选科技股份有限公司 Grabbing method, grabbing device and robot
CN113297701B (en) * 2021-06-10 2022-12-20 清华大学深圳国际研究生院 Simulation data set generation method and device for multiple industrial part stacking scenes
CN113436293B (en) * 2021-07-13 2022-05-03 浙江大学 Intelligent captured image generation method based on condition generation type countermeasure network

Also Published As

Publication number Publication date
CN116416444B (en) 2024-04-16
CN116416444A (en) 2023-07-11


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22913989

Country of ref document: EP

Kind code of ref document: A1