CN116197886A - Image data processing method, device, electronic equipment and storage medium - Google Patents

Image data processing method, device, electronic equipment and storage medium

Info

Publication number
CN116197886A
Authority
CN
China
Prior art keywords
grabbed
image data
data processing
point cloud
randomly
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111426987.XA
Other languages
Chinese (zh)
Inventor
崔致豪
丁有爽
邵天兰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mech Mind Robotics Technologies Co Ltd
Original Assignee
Mech Mind Robotics Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mech Mind Robotics Technologies Co Ltd filed Critical Mech Mind Robotics Technologies Co Ltd
Priority to CN202111426987.XA priority Critical patent/CN116197886A/en
Publication of CN116197886A publication Critical patent/CN116197886A/en
Pending legal-status Critical Current

Classifications

    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 Programme-controlled manipulators
    • B25J9/16 Programme controls
    • B25J9/1694 Programme controls characterised by use of sensors other than normal servo-feedback from position, speed or acceleration sensors, perception control, multi-sensor controlled systems, sensor fusion
    • B25J9/1697 Vision controlled systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/0002 Inspection of images, e.g. flaw detection
    • G06T7/0004 Industrial image inspection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/60 Analysis of geometric attributes
    • G06T7/62 Analysis of geometric attributes of area, perimeter, diameter or volume
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10028 Range image; Depth image; 3D point clouds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Quality & Reliability (AREA)
  • Geometry (AREA)
  • Manipulator (AREA)

Abstract

The application discloses an image data processing method, an image data processing device, electronic equipment and a storage medium. The image data processing method comprises the following steps: acquiring a point cloud of an object to be grabbed; randomly sampling the acquired point cloud of the object to be grabbed to collect a certain number of points from it; and predicting the position features and rotation features of the object to be grabbed based on the randomly collected points. By extracting part of the points from the object point cloud and estimating the position features and rotation features of the object from that subset, the grabbing control scheme of the application can obtain accurate pose information even when the quality of the object point cloud is poor, as is common in industry.

Description

Image data processing method, device, electronic equipment and storage medium
Technical Field
The present application relates to the field of automatic control of robot arms or grippers (program control, B25J), and more particularly to an image data processing method, apparatus, electronic device, and storage medium.
Background
At present, in shopping malls, supermarkets, logistics and similar settings, robots are gradually replacing manual labour for sorting, carrying and placing goods. Traditional robots, however, can only operate in a predetermined mode or with limited intelligence, so these scenarios impose strict requirements on the position and placement of the objects to be handled. For example, in a supermarket sorting task, the requirement is to take the articles to be sorted out of a material frame and carry them to a specified location. The robot visually recognizes the position of each article in the material frame, takes it out and places it at the designated position. To ensure that the robot can grasp every article smoothly, existing schemes require a worker to first place the articles in the material frame neatly, with each article in a specific posture; for example, canned drinks, boxed food and bagged food all need to be placed with the opening facing up. The material frame filled with neatly placed articles is then transported to the robot work area, where the robot performs the grasping work.
In any machine-vision-based object grabbing task, detecting the grabbing pose of the object is a standard part of the workflow. A key question is how to accurately calculate the pose of the object; Principal Component Analysis (PCA) is widely used in industry as a method for estimating the pose of a regional point cloud. In the prior art, using principal component analysis to estimate the pose of the object surface can achieve fairly accurate results. In practical applications, however, the diversity of objects and the quality of the point cloud often mean that principal component analysis cannot estimate the pose of the surface point cloud of more complex objects.
Disclosure of Invention
The present invention has been made in view of the above problems and aims to overcome, or at least partially solve, them. Specifically, in the grabbing control scheme of the application, part of the points in the object point cloud are extracted and the position features and rotation features of the object are estimated from that subset, so that accurate pose information can be obtained even when the quality of the object point cloud is poor, as is common in industry.
All of the solutions disclosed in the claims and the description of the present application have one or more of the innovations described above, and accordingly, one or more of the technical problems described above can be solved. Specifically, the application provides an image data processing method, an image data processing device, an electronic device and a storage medium.
The image data processing method of the embodiment of the application comprises the following steps:
acquiring a point cloud of an object to be grabbed;
randomly sampling the acquired point clouds of the object to be grabbed, and randomly acquiring a certain number of point clouds from the point clouds;
and predicting the position characteristics and the rotation characteristics of the object to be grabbed based on the randomly acquired point cloud.
In certain embodiments, the article to be grasped includes a graspable region of the article to be grasped.
In some embodiments, the randomly sampling the acquired point cloud of the object to be grabbed includes randomly sampling the acquired point cloud of the object to be grabbed at least twice.
In certain embodiments, the position features comprise translation parameters and/or the rotation features comprise euler angles and/or rotation vector quaternions.
In some embodiments, the random sampling of the point cloud and the prediction of the position features and rotation features of the item to be grabbed are performed based on a deep learning network.
In some embodiments, the deep learning network further includes a linear correction component and/or a batch normalization component.
In some embodiments, when training the deep learning network, random dithering and translation are performed on the point cloud for training, and collision detection is performed on the point cloud after random dithering and translation.
In some embodiments, the pose of the robotic end effector when performing gripping is predicted based on the positional characteristics and rotational characteristics of the item to be gripped.
In some embodiments, the error between the actual pose of the robot end effector and the pose of the end effector predicted by the deep learning network is computed, and the deep learning network is updated based on the error.
An image data processing device according to an embodiment of the present application includes:
the point cloud acquisition module is used for acquiring the point cloud of the object to be grabbed;
the random sampling module is used for randomly sampling the acquired point clouds of the articles to be grabbed, and randomly acquiring a certain number of point clouds from the point clouds;
and the pose prediction module is used for predicting the position characteristics and the rotation characteristics of the object to be grabbed based on the randomly acquired point cloud.
In certain embodiments, the article to be grasped includes a graspable region of the article to be grasped.
In some embodiments, the randomly sampling the acquired point cloud of the object to be grabbed includes randomly sampling the acquired point cloud of the object to be grabbed at least twice.
In certain embodiments, the position features comprise translation parameters and/or the rotation features comprise euler angles and/or rotation vector quaternions.
In some embodiments, the random sampling module and the pose prediction module are implemented based on a deep learning network.
The electronic device of the embodiment of the application comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the computer program to realize the image data processing method of any embodiment.
The computer-readable storage medium of the embodiments of the present application has stored thereon a computer program which, when executed by a processor, implements the image data processing method of any of the above embodiments.
Additional aspects and advantages of the application will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a flow diagram of a method of determining object pose information according to certain embodiments of the present application;
FIG. 2 is a schematic illustration of mask preprocessing according to certain embodiments of the present application;
FIG. 3 is a flow chart of a method of determining object pose information in the case of poor point clouds according to certain embodiments of the present application;
FIG. 4 is a schematic illustration of pitch, roll and yaw axes associated with a rotation matrix;
FIG. 5 is a schematic structural diagram of an apparatus for determining object pose information in case of poor point cloud according to some embodiments of the present application;
fig. 6 is a schematic structural diagram of an electronic device according to some embodiments of the present application.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
In the description of the specific embodiments, it should be understood that the terms "center," "longitudinal," "transverse," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like indicate orientations or positional relationships based on the orientation or positional relationships shown in the drawings, merely to facilitate describing the invention and simplify the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and therefore should not be construed as limiting the invention.
Furthermore, the terms "first," "second," "third," and the like are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first", "a second", etc. may explicitly or implicitly include one or more such feature. In the description of the present invention, unless otherwise indicated, the meaning of "a plurality" is two or more.
The invention can be used in industrial robot control scenarios based on visual identification. A typical vision-based industrial robot control scenario includes devices for capturing images, control devices such as production-line hardware and the PLC of the production line, robot components for performing tasks, and an operating system or software for controlling these devices. The device for capturing images may be a 2D or 3D smart/non-smart industrial camera, which, depending on function and application scenario, may be an area-scan camera, a line-scan camera, a black-and-white camera, a color camera, a CCD camera, a CMOS camera, an analog camera, a digital camera, a visible-light camera, an infrared camera, an ultraviolet camera, etc.; the production line can be a packaging line, a sorting line, a logistics line, a processing line or any other line that needs robots; the robot parts used in the industrial scene for performing tasks may be biomimetic robots, such as humanoid or dog-like robots, or conventional industrial robots, such as mechanical arms; the industrial robot may be an operated robot, a program-controlled robot, a teach-and-playback robot, a numerically controlled robot, a sensory-controlled robot, an adaptively controlled robot, a learning-controlled robot, an intelligent robot, or the like; by working principle the mechanical arm can be a ball-joint arm, a multi-joint arm, a rectangular-coordinate arm, a cylindrical-coordinate arm, a polar-coordinate arm, etc., and by function it can be a grabbing arm, a palletizing arm, a welding arm or a general industrial arm; an end effector can be mounted at the end of the mechanical arm, and depending on the task it may be a robot clamp, a robot gripper, a robot tool quick-change device, a robot collision sensor, a robot rotary connector, a robot pressure tool, a compliance device, a robot spray gun, a robot deburring tool, a robot arc-welding gun, a robot electric-welding gun, etc.; the robot clamp can be any of various universal clamps, i.e., clamps with a standardized structure and a wide range of application, such as the three-jaw and four-jaw chucks used on lathes or the machine vises and index heads used on milling machines. As another example, according to the clamping power source used, clamps can be classified into manual clamps, pneumatic clamps, hydraulic clamps, gas-liquid linkage clamps, electromagnetic clamps, vacuum clamps, etc., or other bionic devices capable of picking up an article. The device for collecting images, the control devices such as the production-line hardware and the PLC of the production line, the robot parts for executing tasks and the operating system or software controlling these devices can communicate based on TCP (Transmission Control Protocol), HTTP (Hypertext Transfer Protocol) and gRPC (Google Remote Procedure Call) protocols, so as to transmit various control instructions or commands.
The operating system or software may be disposed in any electronic device; typically such electronic devices include industrial computers, personal computers, notebook computers, tablet computers, cell phones, etc., which may communicate with other devices or systems by wired or wireless means. Further, grabbing in the present invention refers, in the broad sense, to any action that can control an article so as to change its position, and is not limited to grasping the article in the narrow "clamping" sense; in other words, picking up the article by suction, lifting, tightening or similar means also falls within the scope of grabbing in the present invention. The articles to be grabbed in the present invention may be cartons, plastic soft packs (including but not limited to snack packages, Tetra Pak pillow packs of milk, plastic milk packs, etc.), cosmeceutical bottles, cosmeceuticals, and/or irregular toys, etc., and they may be placed on a floor, a tray, a conveyor belt, and/or in a material basket.
Fig. 1 shows a flow diagram of a method of determining object pose information according to an embodiment of the invention. As shown in fig. 1, the method includes:
step S100, obtaining image data comprising at least one object to be grabbed;
step S110, processing the image data to determine a grippable region in the image data;
and step S120, performing pose estimation processing on the grippable region to acquire pose information of the grippable region, wherein the pose information can be used for controlling a clamp to perform a gripping operation on the grippable region.
With respect to step S100, this embodiment does not limit the type of image data or the acquisition method. As an example, the acquired image data may include a point cloud or an RGB color image. The point cloud information may be acquired with a 3D industrial camera, which is generally equipped with two lenses that capture the group of objects to be grabbed from different angles; after processing, a three-dimensional image of the objects can be produced. The group of objects to be grabbed is placed below the vision sensor and both lenses shoot simultaneously; based on the relative attitude parameters of the two resulting images, a general binocular stereo vision algorithm calculates the X, Y and Z coordinate values and the coordinate directions of each point of the objects to be grabbed, which are then converted into point cloud data of the group of objects to be grabbed. In a specific implementation, the point cloud can also be generated using elements such as a laser detector, a visible light detector such as an LED, an infrared detector or a radar detector, and the specific implementation of the invention is not limited.
Since the point cloud data acquired in this way is three-dimensional, the data along the dimension that has little influence on grabbing can be filtered out to reduce the amount of data to be processed, thereby increasing processing speed and efficiency; to this end, the acquired three-dimensional point cloud data of the group of objects to be grabbed can be orthographically projected onto a two-dimensional plane.
As an example, a depth map corresponding to the orthographic projection may also be generated. A two-dimensional color map corresponding to the three-dimensional object region, and a depth map corresponding to that color map, may be acquired along the direction perpendicular to the depth of the object. The two-dimensional color map corresponds to an image of a planar area perpendicular to a preset depth direction; each pixel in the depth map corresponds one-to-one to a pixel in the two-dimensional color map, and its value is the depth value of that pixel.
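By way of illustration, the following is a minimal sketch of such an orthographic projection with an associated depth map; the pixel size, the choice of the X-Y plane as projection plane, and the "keep the closest point" rule are assumptions made for this sketch, not the patent's actual implementation.

```python
import numpy as np

def orthographic_project(points, pixel_size=0.001):
    """Project an (N, 3) point cloud onto the X-Y plane.

    Returns a 2D depth map in which each cell holds the Z value of the
    nearest point that falls into it (np.nan where no point projects).
    `pixel_size` is metres per pixel (an assumed parameter).
    """
    xy = points[:, :2]
    z = points[:, 2]
    origin = xy.min(axis=0)
    cols, rows = (np.ceil((xy.max(axis=0) - origin) / pixel_size).astype(int) + 1)
    depth = np.full((rows, cols), np.nan)
    u, v = ((xy - origin) / pixel_size).astype(int).T   # u -> column, v -> row
    for ui, vi, zi in zip(u, v, z):
        # Keep the point closest to the camera for each projected pixel.
        if np.isnan(depth[vi, ui]) or zi < depth[vi, ui]:
            depth[vi, ui] = zi
    return depth
```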
For step S110, the gripper needs to perform the actual grip within the grippable area of the article, and the non-grippable area has no substantial effect on gripping, so the "article" in the present invention may also refer to the grippable area of the article. The grippable area of an article is the part of the article surface that can be gripped by the clamp. In an industrial scene, the articles to be gripped may be placed in a neat and orderly manner, in which case the grippable area of each article is basically the same and determining it is relatively simple; they may also be piled together chaotically, in which case the grippable area of each article is random and must be determined in a more involved way. The present embodiment does not limit the specific usage scenario or the specific method of determining the grippable area, as long as the grippable area can be acquired. In one embodiment, a mask of the grippable area may also be generated.
One possible way of determining the grippable area and generating the mask is as follows. First, after acquiring image data containing one or more objects to be grabbed, the image data is processed to identify each pixel in the image; for example, for a 256 x 256 image, 256 x 256 = 65536 pixels should be identified. All pixels in the image are then classified based on their features, where the features mainly refer to the RGB values of the pixels; in a practical application the RGB color image can be converted to a grayscale image so that classification can be performed on the gray values. For the classification it can be decided in advance which classes are needed; for example, if the RGB image obtained by shooting contains a large pile of beverage cans, food boxes and a material frame, and the purpose is to generate masks of the beverage cans, food boxes and material frame, the predetermined classes can be beverage can, food box and material frame. Each class can be given a label, which may be a number (for example 1 for beverage cans, 2 for food boxes, 3 for the material frame) or a color (for example red for beverage cans, blue for food boxes, green for the material frame), so that after classification the beverage cans are marked 1 or red, the food boxes 2 or blue and the material frame 3 or green in the resulting image. In this embodiment a mask of the grippable area of the object is to be generated, so only the grippable area is classified, for example as blue, and the blue area in the processed image is the mask of the grippable area of the object to be grabbed. An image output channel is then created for each class; the role of the channel is to extract, as output, all features related to that class in the input image. For example, after creating an image output channel for the grippable-area class, the acquired RGB color image is fed into the channel, and an image in which the features of the grippable area have been extracted can be obtained from the channel output. Finally, the feature image of the grippable area obtained in this way is combined with the original RGB image to generate composite image data in which the grippable-area mask is identified.
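As a minimal sketch of the labelling and mask extraction just described (the numeric label values, the use of a NumPy class map, and the overlay colour are assumptions for illustration only):

```python
import numpy as np

# Hypothetical label values following the example above:
# 1 = beverage can, 2 = food box, 3 = material frame, 4 = grippable region.
GRIPPABLE = 4

def extract_mask(class_map, label=GRIPPABLE):
    """Turn an (H, W) per-pixel class map into a binary 0/255 mask for one class."""
    return (class_map == label).astype(np.uint8) * 255

def overlay_mask(rgb, mask, color=(0, 0, 255), alpha=0.5):
    """Blend the mask onto the original RGB image, producing composite image
    data in which the grippable-region mask is identified."""
    out = rgb.astype(np.float32)
    out[mask > 0] = (1 - alpha) * out[mask > 0] + alpha * np.array(color, dtype=np.float32)
    return out.astype(np.uint8)
```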
Masks generated in this way are sometimes unsuitable; for example, some masks have a size or shape that is inconvenient for subsequent processing, and for others a mask is generated but the clamp cannot actually perform a grip at the mask location. An unsuitable mask can have a significant impact on subsequent processing, so the resulting masks need to be preprocessed before the further steps. As shown in fig. 2, the preprocessing of the mask may include the following. 1. Dilating the mask to fill in defects such as missing or irregular parts of the mask image. For example, for each pixel on the mask, a certain number of surrounding points, e.g. 8-25 points, may be set to the same color as that pixel. This amounts to filling in the surroundings of every pixel, so any defect in the object mask is filled and the mask becomes complete; the mask also becomes slightly "fatter" due to the dilation, and a proper amount of dilation helps the subsequent image processing operations. 2. Judging whether the area of the mask meets a preset condition, and removing the mask if it does not. First, very small mask areas are likely to be erroneous: because of the continuity of the image data, one grippable area will usually comprise a large number of pixels with similar features, and a mask formed by a few scattered pixels may not be a real grippable area. Second, the robot end effector, i.e. the clamp, needs a certain contact area when executing the grabbing task; if the grippable area is too small, the clamp has no room to land on it and therefore cannot grab the object, so an excessively small mask is meaningless. The preset condition can be set according to the size of the clamp and the noise level, and its value may be an absolute size, a number of pixels, or a ratio; for example, it may be set to 0.1%, i.e. when the ratio of the mask area to the whole image area is less than 0.1%, the mask is considered unusable and is removed from the image. 3. Judging whether the number of points of the point cloud within the mask is less than a preset minimum number. The number of points reflects the acquisition quality of the camera; if a grippable area contains too few points, that area was not captured accurately enough. The point cloud may be used to control the gripper to perform the grab, and too few points can affect the gripper control process. Therefore, the minimum number of points that a mask area should contain can be set, for example: when the number of points covered by a grippable area is less than 10, the mask is removed from the image data, or points are randomly added to the area until the number reaches 10.
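The three checks above can be sketched as follows; the 3x3 dilation kernel is an assumption, while the 0.1% area ratio and the minimum of 10 points come from the example values in the text. OpenCV is assumed to be available, and this is only an illustrative sketch rather than the patent's actual preprocessing code.

```python
import cv2
import numpy as np

def preprocess_mask(mask, num_points, image_area,
                    min_area_ratio=0.001, min_points=10):
    """Apply the three preprocessing checks to one binary mask.

    mask        : (H, W) uint8 binary mask of one graspable region
    num_points  : number of 3D points covered by this mask
    image_area  : total pixel count of the full image
    Returns the dilated mask, or None if the mask should be discarded.
    """
    # 1. Dilation fills small holes and ragged edges of the mask.
    kernel = np.ones((3, 3), np.uint8)
    dilated = cv2.dilate(mask, kernel, iterations=1)

    # 2. Reject masks whose area ratio is below the preset condition (0.1%).
    if cv2.countNonZero(dilated) / float(image_area) < min_area_ratio:
        return None

    # 3. Reject masks covering too few 3D points (fewer than 10 in the example).
    if num_points < min_points:
        return None
    return dilated
```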
The image, pose, rotation matrix, orientation, position, etc. of the object of the present invention may be an image, pose, rotation matrix, orientation, position, etc. of a graspable region of the object. The "article" appearing in all aspects of the present invention may be replaced with "the grippable area of the article" and "the grippable area of the article" may be replaced with "the article". Those skilled in the art will appreciate which "items" and "graspable areas of items" may be interchanged with one another as occurs in embodiments of the present invention.
For step S120, the point cloud of the object may be represented in a different coordinate system, and likewise, the pose of the object may be represented in a different coordinate system. A commonly used coordinate system is a camera coordinate system in which a camera is taken as the origin of the coordinate system. In performing a gripping task, the point cloud and pose of the item are typically represented under a robot coordinate system. The pose of the object has a corresponding relation with the pose of the robot, and after the point cloud pose of the object is acquired under a robot coordinate system, the robot/mechanical arm can calculate how to move to the position of the object and what angle and pose are used for grabbing the object based on the pose of the object. In the embodiment, the pose of the robot can be calculated through the point cloud of the grabbing area of the object to be grabbed. In one embodiment, the position of the gripping point of the gripper may be determined based on the position of the article to be gripped, and the rotation angle of each controllable joint of the gripper, or the angle of the end effector of the gripper, may be determined based on the orientation or rotation of the article. The focus of the present embodiment is on calculating the pose based on the graspable region of the article, not on a specific pose calculation method, and any pose determination method may be used in the present embodiment.
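As a minimal illustration of expressing the point cloud in the robot coordinate system, the sketch below applies a 4x4 homogeneous extrinsic matrix (assumed here to come from hand-eye calibration; the matrix name and the calibration source are assumptions, not stated in the original):

```python
import numpy as np

def camera_to_robot(points_cam, T_robot_cam):
    """Express an (N, 3) point cloud, given in the camera frame, in the robot
    base frame. `T_robot_cam` is the 4x4 homogeneous transform from camera
    frame to robot base frame."""
    homogeneous = np.hstack([points_cam, np.ones((points_cam.shape[0], 1))])
    return (T_robot_cam @ homogeneous.T).T[:, :3]
```

The same transform, applied to the estimated object pose, lets the robot plan how to move to the object and at what angle to grab it.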
The existing pose determining method for calculating the pose of the object based on the 3D point cloud of the object has high requirements on the quality of the obtained point cloud, and when the quality of the point cloud is poor and the number of the point clouds is too small, the pose cannot be determined based on the poor point cloud. In order to solve the problem, the inventor proposes a method capable of calculating pose information of an object based on point cloud data of the object in the case that the quality of the point cloud is poor, which is one of the important points of the present invention.
Fig. 3 is a flow chart of a method for accurately acquiring object pose information under a poor point cloud condition according to an embodiment of the present invention. As shown in fig. 3, the method comprises at least the following steps:
step S200, acquiring point clouds of an object to be grabbed;
step S210, randomly sampling the acquired point clouds of the object to be grabbed, and randomly acquiring a certain number of point clouds from the point clouds;
step S220, predicting the position feature and the rotation feature of the object to be grabbed based on the randomly collected point cloud.
For step S200, the point cloud information may be acquired with a 3D industrial camera, which is generally equipped with two lenses that capture the group of objects to be grabbed from different angles; after processing, a three-dimensional image of the objects can be produced. The group of objects to be grabbed is placed below the vision sensor and both lenses shoot simultaneously; based on the relative attitude parameters of the two resulting images, a general binocular stereo vision algorithm calculates the X, Y and Z coordinate values and the coordinate directions of each point of the objects to be grabbed, which are then converted into point cloud data of the group of objects to be grabbed. In a specific implementation, the point cloud can also be generated using elements such as a laser detector, a visible light detector such as an LED, an infrared detector or a radar detector, and the specific implementation of the invention is not limited.
For step S210, after the point clouds containing all the objects to be grabbed have been acquired in one pass, the point cloud of each object to be grabbed is extracted and randomly sampled. In other embodiments, all points of the entire scene may be randomly sampled instead; in this case, because the whole scene point cloud is used as input, the calculation is much faster than sampling each region separately. As for the number of points to collect, the inventors found through repeated experiments that, in a dense scene, the results are best when the number of randomly collected points is not less than 1024. In one embodiment, the point cloud data may be randomly sampled several times: for example, a first sampling randomly collects a certain number of points, e.g. 1024 points, from the acquired point cloud data, and the collected points are combined into a first sampled point cloud; a second sampling then randomly collects a certain number of points, e.g. 512 points, from those 1024 points, and the collected points are combined into a second sampled point cloud. The pose of the object to be grabbed is then estimated based on the first sampled point cloud and the second sampled point cloud together, as illustrated by the sketch below.
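A minimal sketch of this two-stage random sampling; sampling with replacement for regions that contain fewer points than requested is an assumption, since the text does not say how sparse regions are handled.

```python
import numpy as np

def random_sample(points, n):
    """Randomly pick n points from an (N, 3) array; sample with replacement
    when the region contains fewer than n points (an assumption)."""
    idx = np.random.choice(len(points), size=n, replace=len(points) < n)
    return points[idx]

def two_stage_sample(points, n1=1024, n2=512):
    """First sampling: n1 points from the region; second sampling: n2 points
    drawn from the first sample. Both sampled clouds are returned so that
    pose estimation can use them together."""
    first = random_sample(points, n1)
    second = random_sample(first, n2)
    return first, second
```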
For step S220, one option is, after the randomly collected points have been obtained, to fit the complete point cloud of each article from the small number of collected points, obtaining a complete point cloud for each object to be grabbed; the complete point clouds of the articles are then put back into the original scene to form a complete scene point cloud, and the position features and rotation features of the articles are obtained from their complete point clouds. The complete point cloud can be generated by graphics processing or by template matching. In other embodiments, the position and rotation features of the object can instead be generated by directly fitting the object point cloud pose from the sampled points. The position feature may be a translation parameter or translation vector, typically a set of coordinates (X, Y, Z) in a Cartesian coordinate system, which expresses how the current pose of the object is translated with respect to its reference pose; when the position of the reference pose is at the origin of the coordinate system, i.e. (0, 0, 0), the translation parameter also represents the position coordinates of the object.
The rotation characteristic may be a parameter of a rotation matrix of the article. When an article with a specific orientation rotates, the article is converted into another specific orientation, and the rotation matrix is used for expressing what kind of rotation is performed on the article. Essentially, the rotation matrix reflects the transformation relationship represented by coordinates in one coordinate system in another coordinate system. In one embodiment, assuming that the reference article pose has a right-side-up orientation, i.e., an orientation in which the grippable region of the article is perpendicular to the Z-axis, and the pose of the article to be gripped is obtained after rotation from the reference pose, the rotation matrix from the reference pose to the current pose of the article is
[rotation matrix formula BDA0003378970780000081, reproduced only as an image in the original publication]
There are various forms of rotation matrix in the prior art, and the invention is not limited to any particular one. Alternatively, the rotation matrix of the present invention may be a rotation matrix obtained from Euler angles. Any rotation can be expressed as a combination of three successive rotations about three rotation axes, and these three angles are known as Euler angles. As shown in fig. 4, the rotation of an article is described by 3 rotation components, which can be understood as the X-axis, Y-axis and Z-axis of a Cartesian coordinate system: the X-axis is the pitch axis, and the clockwise rotation angle about the X-axis is the pitch angle, denoted α; the Y-axis is the yaw axis, and the clockwise rotation angle about the Y-axis is the yaw angle, denoted β; the Z-axis is the roll axis, and the clockwise rotation angle about the Z-axis is the roll angle, denoted γ. Any rotation can be regarded as a combination of three such rotations; for example, rotating an article in XYZ order means rotating it clockwise by α about the X-axis, then by β about the Y-axis, and finally by γ about the Z-axis. The rotation matrix differs for different rotation orders, and there are 12 possible orders in total. Preferably, the article is rotated from the reference orientation to the current state in ZYX order, and the corresponding rotation matrix of the article to be grabbed is the matrix given in the original publication as image formulas BDA0003378970780000082 and BDA0003378970780000091.
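Because the original formula is only available as an image, the block below gives, for reference, the standard ZYX Euler-angle rotation matrix under one common (intrinsic, counter-clockwise) convention; the patent's clockwise-angle convention may differ in the signs of the sine terms.

```latex
R_{ZYX} = R_z(\gamma)\,R_y(\beta)\,R_x(\alpha) =
\begin{pmatrix}
\cos\gamma\cos\beta & \cos\gamma\sin\beta\sin\alpha-\sin\gamma\cos\alpha & \cos\gamma\sin\beta\cos\alpha+\sin\gamma\sin\alpha\\
\sin\gamma\cos\beta & \sin\gamma\sin\beta\sin\alpha+\cos\gamma\cos\alpha & \sin\gamma\sin\beta\cos\alpha-\cos\gamma\sin\alpha\\
-\sin\beta & \cos\beta\sin\alpha & \cos\beta\cos\alpha
\end{pmatrix}
```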
In another embodiment, the rotation matrix may be constructed from a rotation-vector quaternion; constructing the matrix from a quaternion avoids the problem that the rotation order must be considered when constructing the matrix from Euler angles. Therefore, the rotation feature of the invention may be Euler angles or a quaternion, and since Euler angles and quaternions can be converted into each other, this is not described further here.
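A short sketch of this interchangeability using SciPy's Rotation class (SciPy is an assumed dependency used here purely for illustration; the patent does not name any library):

```python
from scipy.spatial.transform import Rotation as R
import numpy as np

# ZYX Euler angles in radians: gamma (about Z), beta (about Y), alpha (about X).
angles_zyx = [0.3, -0.1, 1.2]

rot = R.from_euler("ZYX", angles_zyx)   # build the rotation
matrix = rot.as_matrix()                # 3x3 rotation matrix
quat = rot.as_quat()                    # quaternion in (x, y, z, w) order

# Converting back recovers the original Euler angles (up to numerical precision),
# showing that the two rotation representations are interchangeable.
assert np.allclose(R.from_quat(quat).as_euler("ZYX"), angles_zyx)
```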
It should be appreciated that steps S210-S220 described above may also be performed by a deep learning network, by feeding the data into the network. To enable a deep learning network to recognize each instance and determine whether each instance is folded, the network must first be trained, which requires acquiring a large amount of image data of industrial sites containing multiple objects to be grabbed. After a large amount of data has been collected, the data is labeled and then fed into the network for training. A deep learning network usable in the present invention should include at least three components: a sampling component for sampling and combining the input point cloud; a translation estimation component based on fully connected layers, for estimating the position features of the item from the sampled points; and a rotation estimation component based on fully connected layers, for estimating the rotation features of the item from the sampled points. In one embodiment, a linear rectification (ReLU-type) component can be added as the activation function after the convolution layers of the network to alleviate the vanishing-gradient problem and speed up training; any linear rectification variant can be used to realize the invention, such as leaky rectification, randomized leaky rectification, or noisy rectification. In further embodiments, a batch normalization component may also be added after the convolution layers to normalize the scattered data, making it easier for the network to learn the regularities in the data.
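A minimal sketch of such a network in PyTorch, with a shared point encoder and two fully connected heads; the layer widths, the max-pooling aggregation, the quaternion output and the use of PyTorch itself are assumptions for illustration, not the patent's actual architecture.

```python
import torch
import torch.nn as nn

class PoseNet(nn.Module):
    """Sketch: per-point encoder + fully connected translation and rotation heads."""
    def __init__(self, feat_dim=256):
        super().__init__()
        # Per-point feature encoder with batch normalization and ReLU.
        self.encoder = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, feat_dim, 1), nn.BatchNorm1d(feat_dim), nn.ReLU(),
        )
        # Translation head: predicts (x, y, z).
        self.trans_head = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 3))
        # Rotation head: predicts a unit quaternion (4 values).
        self.rot_head = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 4))

    def forward(self, points):                    # points: (B, N, 3)
        x = self.encoder(points.transpose(1, 2))  # (B, feat_dim, N)
        x = torch.max(x, dim=2).values            # global max pooling over points
        quat = nn.functional.normalize(self.rot_head(x), dim=1)
        return self.trans_head(x), quat
```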
In one embodiment, the deep learning network does not use any pre-trained model and is trained directly from initialization. The network takes input of shape M (the number of point cloud regions) x 1024 (the number of sampled points) x 3 (the position of each point in 3-dimensional space) for model training. In one embodiment, the point cloud may be randomly dithered and translated during training, and collision detection is performed on the dithered and translated point cloud to confirm that the newly formed object point clouds do not unreasonably intersect each other; here random dithering means randomly perturbing the coordinates of the points, and translation means shifting the point cloud in a specific direction. The deep learning pose estimation network iterates 30,000 times over the total training data, with an initial learning rate of 0.001 that decays by a factor of 10 at 20,000 and at 25,000 iterations. In one embodiment, the database point cloud is continuously fed into the deep learning network as training data during training, the recorded pose of the robot end effector is used as the reference for model iteration, the error between the pose of the robot end effector and the pose of the end effector predicted by the network is computed, and the whole deep learning network is updated based on the error.
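The training schedule and augmentation can be sketched as follows, reusing the PoseNet sketch above. The jitter and shift magnitudes, the Adam optimizer, the MSE-style pose loss and the `training_batches` loader are assumptions; the collision check is left as a placeholder because the patent does not specify the algorithm. Only the iteration count (30,000), the initial learning rate (0.001) and the decay points (20,000 and 25,000) come from the text.

```python
import torch

def jitter_and_translate(points, sigma=0.002, max_shift=0.02):
    """Random dithering (per-point noise) plus a random translation of each
    cloud; magnitudes are assumptions."""
    noise = torch.randn_like(points) * sigma
    shift = (torch.rand(points.shape[0], 1, 3, device=points.device) * 2 - 1) * max_shift
    return points + noise + shift

def passes_collision_check(points):
    # Placeholder: the patent only requires that augmented clouds do not
    # intersect each other unreasonably; the actual test is not specified.
    return True

model = PoseNet()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
# Decay the learning rate by a factor of 10 at 20,000 and 25,000 iterations.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[20_000, 25_000], gamma=0.1)

for step, (points, gt_trans, gt_quat) in enumerate(training_batches):  # assumed loader
    if step >= 30_000:
        break
    points = jitter_and_translate(points)
    if not passes_collision_check(points):
        continue
    pred_trans, pred_quat = model(points)
    # Error between the recorded end-effector pose and the predicted pose.
    # (A production loss would also handle the q / -q quaternion ambiguity.)
    loss = torch.nn.functional.mse_loss(pred_trans, gt_trans) + \
           torch.nn.functional.mse_loss(pred_quat, gt_quat)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```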
In one embodiment, the position feature and rotation feature of at least one object to be grabbed may be used to calculate a grabbing feature value for ranking the grabbing difficulty of a plurality of objects to be grabbed; that is, all objects to be grabbed are ranked based on the obtained grabbing feature values, and the clamp is controlled to grab them in that order. Depending on the actual situation, the ranking can be based on the grabbing feature value alone, or the feature value obtained from the pose can be combined with other feature values for a comprehensive ranking. If several feature values are combined, they may be normalized, a weight may be set for each feature value, and the ranking performed based on the normalized feature values and their weights, with the clamp then controlled to grab according to the ranking result.
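A small sketch of this weighted ranking; min-max normalization and the particular weights are assumptions, and the example values (5, 10, 15) are the ones used in the text below.

```python
import numpy as np

def rank_items(feature_table, weights):
    """feature_table: (n_items, n_features) raw feature values, one of which is
    the grabbing feature value derived from the predicted position and rotation;
    weights: (n_features,) importance weights.
    Returns item indices ordered from easiest to hardest to grab."""
    mins = feature_table.min(axis=0)
    spans = feature_table.max(axis=0) - mins
    spans[spans == 0] = 1.0                       # avoid division by zero
    normalized = (feature_table - mins) / spans   # min-max normalization
    scores = normalized @ np.asarray(weights)
    return np.argsort(scores)[::-1]               # highest score first

# Grabbing feature values 5, 10, 15 for three items (single feature) give the
# order: third item, second item, first item.
order = rank_items(np.array([[5.0], [10.0], [15.0]]), weights=[1.0])
```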
In one embodiment, when the clamp is controlled to grab a plurality of articles based on the order of their grabbing feature values, the articles may be grabbed sequentially in that order. For example, if the grabbing feature values of three articles obtained in one grabbing task are 5 for the first article, 10 for the second and 15 for the third, the clamp is controlled to grab the third article first, the second article next and the first article last. Alternatively, only the article with the highest grabbing feature value may be grabbed, and the feature values recalculated before the next grab. For example, if the grabbing feature values of 5 articles obtained in one grabbing task are 5, 10, 15, 11 and 18 for the first to fifth articles respectively, the clamp is controlled to grab the fifth article first because its value is highest; before the second grab, the image data is re-acquired, the grabbing feature values of the remaining 4 articles are recalculated, the article with the highest value is grabbed, and so on until grabbing is complete.
In addition, it should be noted that although each embodiment of the present invention has a specific combination of features, further combinations and cross combinations of these features between embodiments are also possible.
Fig. 5 shows an apparatus for acquiring pose information of an article in case of poor point cloud according to still another embodiment of the present invention, the apparatus comprising:
the point cloud obtaining module 300 is configured to obtain a point cloud of an object to be grabbed, i.e. to implement step S200;
the random sampling module 310 is configured to randomly sample the obtained point clouds of the object to be grabbed, and randomly collect a certain number of point clouds from the point clouds, i.e. the point clouds are used to implement step S210;
the pose prediction module 320 is configured to predict a position feature and a rotation feature of the object to be grabbed based on the randomly collected point cloud, i.e. to implement step S220.
It should be understood that in the embodiment of the apparatus shown in fig. 5, only the main functions of the modules are described; the full functionality of each module corresponds to the respective steps in the method embodiment, and the working principle of each module may refer to the description of the corresponding step. For example, the description and explanation of step S220 in the above embodiments also describes and explains the function of the pose prediction module 320. In addition, although the above embodiments define a correspondence between the functions of the functional modules and the method steps, those skilled in the art will understand that the functions of the modules are not limited to that correspondence; that is, a specific functional module may also implement other method steps or parts of them. For example, the above embodiments describe the pose prediction module 320 as implementing step S220, but depending on the actual situation the pose prediction module 320 may also be used to implement the method of step S200 or S210, or part of it.
The present application also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the method of any of the above embodiments. It should be noted that, the computer program stored in the computer readable storage medium according to the embodiment of the present application may be executed by the processor of the electronic device, and in addition, the computer readable storage medium may be a storage medium built in the electronic device or may be a storage medium capable of being plugged into the electronic device in a pluggable manner, so that the computer readable storage medium according to the embodiment of the present application has higher flexibility and reliability.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, which may be a control system/electronic system configured in an automobile, a mobile terminal (e.g., a smart mobile phone, etc.), a personal computer (PC, e.g., a desktop computer or a notebook computer, etc.), a tablet computer, a server, etc., and the specific embodiment of the present invention is not limited to the specific implementation of the electronic device.
As shown in fig. 6, the electronic device may include: a processor 1202, a communication interface (Communications Interface) 1204, a memory 1206, and a communication bus 1208.
Wherein:
the processor 1202, the communication interface 1204, and the memory 1206 communicate with each other via a communication bus 1208.
A communication interface 1204 for communicating with network elements of other devices, such as clients or other servers, etc.
The processor 1202 is configured to execute the program 1210, and may specifically perform relevant steps in the method embodiments described above.
In particular, program 1210 may include program code including computer operating instructions.
The processor 1202 may be a central processing unit CPU, or a specific integrated circuit ASIC (Application Specific Integrated Circuit), or one or more integrated circuits configured to implement embodiments of the present invention. The one or more processors included in the electronic device may be the same type of processor, such as one or more CPUs; but may also be different types of processors such as one or more CPUs and one or more ASICs.
Memory 1206 for storing program 1210. The memory 1206 may comprise high-speed RAM memory, and may further comprise non-volatile memory, such as at least one disk memory.
Program 1210 may be downloaded and installed from a network and/or from a removable medium via communications interface 1204. The program, when executed by the processor 1202, may cause the processor 1202 to perform the operations of the method embodiments described above.
In the description of the present specification, reference to the terms "one embodiment," "some embodiments," "illustrative embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and further implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the embodiments of the present application.
Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., a ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, system that includes a processing module, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). In addition, the computer readable medium may even be paper or other suitable medium on which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.
The processor may be a central processing unit (Central Processing Unit, CPU), but may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), off-the-shelf programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
It is to be understood that portions of embodiments of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, may be implemented using any one or combination of the following techniques, as is well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application specific integrated circuits having suitable combinational logic gates, programmable Gate Arrays (PGAs), field Programmable Gate Arrays (FPGAs), and the like.
Those of ordinary skill in the art will appreciate that all or a portion of the steps carried out in the method of the above-described embodiments may be implemented by a program to instruct related hardware, where the program may be stored in a computer readable storage medium, and where the program, when executed, includes one or a combination of the steps of the method embodiments.
Furthermore, each functional unit in the embodiments of the present application may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a computer readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product.
The above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, or the like.
Although the embodiments of the present application have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the application, and that variations, modifications, alternatives and variations may be made to the embodiments described above by those of ordinary skill in the art within the scope of the application.

Claims (16)

1. An image data processing method, comprising:
acquiring a point cloud of an object to be grabbed;
randomly sampling the acquired point clouds of the object to be grabbed, and randomly acquiring a certain number of point clouds from the point clouds;
and predicting the position characteristics and the rotation characteristics of the object to be grabbed based on the randomly acquired point cloud.
2. The image data processing method according to claim 1, wherein: the article to be grabbed includes a grabbable area of the article to be grabbed.
3. The image data processing method according to claim 1, wherein: the step of randomly sampling the acquired point cloud of the object to be grabbed comprises randomly sampling the acquired point cloud of the object to be grabbed at least twice.
4. The image data processing method according to claim 1, wherein the position features comprise translation parameters and/or the rotation features comprise euler angles and/or rotation vector quaternions.
5. The image data processing method according to any one of claims 1 to 4, characterized in that: the random sampling of the point cloud and the prediction of the position features and rotation features of the object to be grabbed are performed based on a deep learning network.
6. The image data processing method of claim 5, wherein the deep learning network further comprises a linear correction component and/or a batch normalization component.
7. The image data processing method according to claim 5, characterized by further comprising: when training the deep learning network, randomly jittering and translating the training point cloud, and performing collision detection on the randomly jittered and translated point cloud.
8. The image data processing method according to claim 5, characterized by further comprising: predicting the pose of a robot end effector when the robot end effector performs grabbing, based on the position features and the rotation features of the object to be grabbed.
9. The image data processing method according to claim 8, characterized by further comprising: comparing the pose of the robot end effector in the point cloud with the pose of the end effector predicted by the deep learning network to obtain an error, and updating the deep learning network based on the error.
10. An image data processing apparatus, comprising:
the point cloud acquisition module is used for acquiring the point cloud of the object to be grabbed;
the random sampling module is used for randomly sampling the acquired point cloud of the object to be grabbed, and randomly acquiring a certain number of points from the point cloud;
and the pose prediction module is used for predicting the position characteristics and the rotation characteristics of the object to be grabbed based on the randomly acquired point cloud.
11. The image data processing apparatus according to claim 10, wherein: the object to be grabbed includes a grabbed area of the object to be grabbed.
12. The image data processing apparatus according to claim 10, wherein: the step of randomly sampling the acquired point cloud of the object to be grabbed comprises randomly sampling the acquired point cloud of the object to be grabbed at least twice.
13. The image data processing apparatus according to claim 10, wherein the position features comprise translation parameters, and/or the rotation features comprise Euler angles and/or rotation vector quaternions.
14. The image data processing apparatus according to any one of claims 10 to 13, wherein: the random sampling module and the pose prediction module are implemented based on a deep learning network.
15. An electronic device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the image data processing method according to any one of claims 1 to 9 when executing the computer program.
16. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the image data processing method according to any one of claims 1 to 9.
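
For orientation only, the following is a minimal sketch of the processing flow recited in claims 1 to 9, assuming a PyTorch-style implementation in Python. The function and class names (random_sample, jitter_point_cloud, PosePredictor) and all layer sizes are hypothetical illustrations and are not taken from the patent.

import numpy as np
import torch
import torch.nn as nn


def random_sample(points, num_samples=1024):
    # Randomly draw a fixed number of points from the acquired point cloud (claim 1).
    replace = points.shape[0] < num_samples
    idx = np.random.choice(points.shape[0], num_samples, replace=replace)
    return points[idx]


def jitter_point_cloud(points, sigma=0.005, shift=0.02):
    # Training-time augmentation: random jitter plus a random translation (claim 7).
    noise = np.clip(sigma * np.random.randn(*points.shape), -3 * sigma, 3 * sigma)
    translation = np.random.uniform(-shift, shift, size=(1, 3))
    return points + noise + translation


class PosePredictor(nn.Module):
    # Small PointNet-style network mapping a sampled point cloud to position
    # features (a translation) and rotation features (a unit quaternion).
    def __init__(self):
        super().__init__()
        # Shared per-point layers with batch normalization and ReLU activations
        # (cf. the batch normalization component of claim 6).
        self.encoder = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 256, 1), nn.BatchNorm1d(256), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 7),  # 3 translation parameters + 4 quaternion components
        )

    def forward(self, points):
        # points: (batch, num_points, 3)
        features = self.encoder(points.transpose(1, 2))   # (batch, 256, num_points)
        pooled = features.max(dim=2).values               # one global feature per cloud
        out = self.head(pooled)
        translation = out[:, :3]
        quaternion = nn.functional.normalize(out[:, 3:], dim=1)
        return translation, quaternion


# Usage: sample the acquired cloud at least twice (claim 3) and predict a pose
# for each sample; during training, the predicted pose would be compared with the
# recorded end-effector pose to obtain an error and update the network (claim 9).
cloud = np.random.rand(50000, 3).astype(np.float32)   # stand-in for a camera point cloud
samples = np.stack([random_sample(cloud) for _ in range(2)])
translation, quaternion = PosePredictor()(torch.from_numpy(samples))

A complete system along the lines of claims 7 to 9 would additionally run collision detection on the jittered and translated training clouds and convert the predicted translation and quaternion into an end-effector grabbing pose for the robot; those parts are omitted here.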
CN202111426987.XA 2021-11-28 2021-11-28 Image data processing method, device, electronic equipment and storage medium Pending CN116197886A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111426987.XA CN116197886A (en) 2021-11-28 2021-11-28 Image data processing method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111426987.XA CN116197886A (en) 2021-11-28 2021-11-28 Image data processing method, device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116197886A true CN116197886A (en) 2023-06-02

Family

ID=86511580

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111426987.XA Pending CN116197886A (en) 2021-11-28 2021-11-28 Image data processing method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116197886A (en)


Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102226907A (en) * 2011-05-24 2011-10-26 武汉嘉业恒科技有限公司 License plate positioning method and apparatus based on multiple characteristics
CN109816050A (en) * 2019-02-23 2019-05-28 深圳市商汤科技有限公司 Object pose estimation method and device
RU2700246C1 (en) * 2019-03-21 2019-09-20 Публичное Акционерное Общество "Сбербанк России" (Пао Сбербанк) Method and system for capturing an object using a robot device
CN110363815A (en) * 2019-05-05 2019-10-22 东南大学 The robot that Case-based Reasoning is divided under a kind of haplopia angle point cloud grabs detection method
CN110188663A (en) * 2019-05-28 2019-08-30 禾多科技(北京)有限公司 The method of detection positioning vehicle point cloud quality
CN110428464A (en) * 2019-06-24 2019-11-08 浙江大学 Multi-class out-of-order workpiece robot based on deep learning grabs position and orientation estimation method
CN110340891A (en) * 2019-07-11 2019-10-18 河海大学常州校区 Mechanical arm positioning grasping system and method based on cloud template matching technique
CN111414953A (en) * 2020-03-17 2020-07-14 集美大学 Point cloud classification method and device
CN112883881A (en) * 2021-02-25 2021-06-01 中国农业大学 Disordered sorting method and device for strip-shaped agricultural products
CN113034624A (en) * 2021-05-06 2021-06-25 湖州云电笔智能科技有限公司 Temperature early warning image identification method, system, equipment and storage medium based on temperature sensing color-changing adhesive tape

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
理查德·摩雷 (US): "Robotics: Control, Sensing Technology, Vision, and Intelligence" (《机器人学 控制、传感技术、视觉、智能》), Beijing: China Machine Press, pages: 170 - 175 *

Similar Documents

Publication Publication Date Title
US11338435B2 (en) Gripping system with machine learning
CN110580725A (en) Box sorting method and system based on RGB-D camera
CN115330819B (en) Soft package segmentation positioning method, industrial personal computer and robot grabbing system
JP2021003800A (en) Robot system control method, non-transient computer-readable recording medium, and robot system control device
CN116110039A (en) System and method for object detection
CN116175542B (en) Method, device, electronic equipment and storage medium for determining clamp grabbing sequence
CN114092428A (en) Image data processing method, image data processing device, electronic equipment and storage medium
CN114037595A (en) Image data processing method, image data processing device, electronic equipment and storage medium
Ali et al. Multiple lab ware manipulation in life science laboratories using mobile robots
WO2023092519A1 (en) Grabbing control method and apparatus, and electronic device and storage medium
CN116197886A (en) Image data processing method, device, electronic equipment and storage medium
CN115861780A (en) Mechanical arm detection and grabbing method based on YOLO-GGCNN
CN115194774A (en) Binocular vision-based control method for double-mechanical-arm gripping system
CN116197888B (en) Method and device for determining position of article, electronic equipment and storage medium
CN116175541B (en) Grabbing control method, grabbing control device, electronic equipment and storage medium
CN116175540B (en) Grabbing control method, device, equipment and medium based on position and orientation
CN116197887B (en) Image data processing method, device, electronic equipment and storage medium for generating grabbing auxiliary image
CN116197885B (en) Image data filtering method, device, equipment and medium based on press-fit detection
CN114022341A (en) Acquisition method and device for acquisition point information, electronic equipment and storage medium
CN116214494A (en) Grabbing control method, grabbing control device, electronic equipment and storage medium
CN116188559A (en) Image data processing method, device, electronic equipment and storage medium
JP7391342B2 (en) Computing systems, methods and non-transitory computer-readable media
CN116205837A (en) Image data processing method, device, electronic equipment and storage medium
JP7408107B2 (en) Systems and methods for robotic systems with object handling
US20230071488A1 (en) Robotic system with overlap processing mechanism and methods for operating the same

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination