CN113787521B - Robot grabbing method, system, medium and electronic device based on deep learning - Google Patents

Robot grabbing method, system, medium and electronic device based on deep learning

Info

Publication number
CN113787521B
CN113787521B
Authority
CN
China
Prior art keywords
grabbing
training
target object
scene
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111122883.XA
Other languages
Chinese (zh)
Other versions
CN113787521A (en)
Inventor
王卫军
王兆广
徐友法
张允�
孙海峰
杨亚
郭雨晨
陈凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Micro Motor Research Institute 21st Research Institute Of China Electronics Technology Corp
Original Assignee
Shanghai Micro Motor Research Institute 21st Research Institute Of China Electronics Technology Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Micro Motor Research Institute 21st Research Institute Of China Electronics Technology Corp filed Critical Shanghai Micro Motor Research Institute 21st Research Institute Of China Electronics Technology Corp
Priority to CN202111122883.XA priority Critical patent/CN113787521B/en
Publication of CN113787521A publication Critical patent/CN113787521A/en
Application granted granted Critical
Publication of CN113787521B publication Critical patent/CN113787521B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 Programme-controlled manipulators
    • B25J9/16 Programme controls
    • B25J9/1679 Programme controls characterised by the tasks executed
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/0002 Inspection of images, e.g. flaw detection
    • G06T7/0004 Industrial image inspection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10028 Range image; Depth image; 3D point clouds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning

Abstract

The application provides a robot grabbing method and system based on deep learning, a computer-readable storage medium and an electronic device. The method comprises the following steps: constructing three-dimensional point cloud information of a target object according to an input depth image and encapsulating it to obtain input data of the target object; obtaining, according to the input data of the target object, grabbing evaluation values of the different candidate grabbing modes of the target object in the target scene based on a pre-trained grabbing mode evaluation model; and selecting the grabbing mode with the highest grabbing evaluation value to grab the target object in the target scene. In this way, input data matching the actual use scene are provided by generating virtual target objects, a decision basis for the robot's grabbing mode is supplied, and fast analysis and effective decision-making when selecting a grabbing mode for a three-dimensional object in a real scene are achieved; this provides an important decision basis for the actions and behaviour of the robot in actual use.

Description

Robot grabbing method, system, medium and electronic device based on deep learning
Technical Field
The present disclosure relates to the field of robot sensing technologies, and in particular, to a robot capture method and system based on deep learning, a computer-readable storage medium, and an electronic device.
Background
Object grabbing refers to selecting the grabbing mode that achieves the best success rate for a given target object under reachability constraints and environmental limitations. Selecting a suitable grabbing mode for a target object is a basic task in robot applications: for a simple object-carrying task, for example, the robot must first be able to correctly identify the object, and only by adopting the best grabbing mode at the right moment can it smoothly complete the subsequent carrying actions. In actual application scenarios, uncertain factors of the target object such as shape, posture and material properties, together with ambient light, viewing angle and other conditions, often make accurate observation and execution of the grabbing behaviour very challenging for the robot. How to accurately identify the target three-dimensional object and the relevant environmental restrictions, and how to select the scheme or configuration option best suited to the current scene from a given range of grabbing modes, is one of the problems that urgently need to be solved in the robotics field.
Traditional methods usually plan the grabbing direction, force, speed and other parameters based on analytical conclusions from physics, but they implicitly assume that the information acquired by the visual perception system is perfect and error-free under ideal conditions, an assumption that is difficult to satisfy in actual application scenes. Complex, variable and unpredictable usage scenarios often lead to errors in such rule-and-analysis-based grab planning. As a result, grab planning derived from physical principles is often not feasible in actual operation, and the parameters obtained by empirical learning are likely to suffer from reduced cross-domain effectiveness. These challenges prevent the robot from effectively selecting the correct grabbing mode in practical applications, reducing the success rate of its operations.
Therefore, there is a need to provide an improved solution to the above-mentioned deficiencies of the prior art.
Disclosure of Invention
An object of the present application is to provide a robot grabbing method, system, computer-readable storage medium and electronic device based on deep learning, so as to solve or alleviate the above problems in the prior art.
In order to achieve the above purpose, the present application provides the following technical solutions:
the application provides a robot grabbing method based on deep learning, which comprises the following steps: constructing three-dimensional point cloud information of a target object according to an input depth image, and encapsulating to obtain input data of the target object; obtaining grabbing evaluation values of different grabbing modes in candidate grabbing modes of the target object under the target scene based on a pre-trained grabbing mode evaluation model according to the input data of the target object; and selecting the grabbing mode with the highest grabbing evaluation value to grab the target object in the target scene.
Preferably, the constructing three-dimensional point cloud information of a target object according to an input depth image, and encapsulating the three-dimensional point cloud information to obtain input data of the target object includes: constructing point cloud information of the target object according to the input depth image based on a depth estimation method; and encapsulating the point cloud information of the target object to obtain the input data of the target object.
Preferably, the obtaining, according to the input data of the target object, grasping evaluation values of different grasping modes in candidate grasping modes of the target object in the target scene based on a pre-trained grasping mode evaluation model specifically includes: based on the pre-trained grasping mode evaluation model, performing model forward propagation operation on input data of the target object, and acquiring grasping evaluation values of different grasping modes in candidate grasping modes of the target object in a target scene according to influence factors of characteristic attributes of the target object; wherein the characteristic attributes of the target object include: the shape, attitude, position of the target object.
Preferably, the deep learning-based robot capture method further includes: generating training scene data of the grabbing mode evaluation model by a probability sampling method according to three-dimensional point cloud data of a training object obtained in advance; and according to the training scene data and a preset loss function, iteratively updating the capture mode evaluation model based on deep learning.
Preferably, the generating of the training scene data of the grasping manner evaluation model according to the three-dimensional point cloud data of the training object obtained in advance by a probabilistic sampling method includes: selecting a training object corresponding to the target scene from a training object set according to the target scene in a preset application scene set; wherein the set of training objects comprises a plurality of different classes of the training objects; the preset application scene set comprises a plurality of different target scenes, and the different target scenes correspond to the training objects of different categories respectively; generating three-dimensional point cloud data of the training object based on a rendering engine or a three-dimensional reconstruction method; wherein the three-dimensional point cloud data of the training object is in a format that can be used for rendering; rendering the three-dimensional point cloud data, performing probability sampling on the training object according to the statistical data of the training object in the target scene, and generating training scene data of the grabbing mode evaluation model, wherein the training scene data meet preset conditional probability distribution.
Preferably, the iteratively updating the deep-learning-based grabbing mode evaluation model according to the training scene data and a preset loss function specifically includes: iteratively updating, by a stochastic gradient descent method, the weights and bias values of each layer in the grasping mode evaluation model based on the deep convolutional neural network, according to the training scene data and a preset cross-entropy loss function.
Preferably, the deep learning-based robot capture method further includes: and constructing the grasping mode evaluation model based on a TensorFlow learning framework.
The embodiment of the present application further provides a robot grasping system based on deep learning, including: the input unit is configured to construct three-dimensional point cloud information of a target object according to an input depth image, and package the three-dimensional point cloud information to obtain input data of the target object; the evaluation unit is configured to obtain grabbing evaluation values of different grabbing modes in candidate grabbing modes of the target object under a target scene based on a pre-trained grabbing mode evaluation model according to the input data of the target object; and the decision unit is configured to select the grabbing mode with the highest grabbing evaluation value and grab the target object in the target scene.
Embodiments of the present application further provide a computer-readable storage medium on which a computer program is stored, where the program, when executed by a processor, implements the deep learning-based robot grabbing method according to any of the above embodiments.
An embodiment of the present application further provides an electronic device, including: the robot grabbing system comprises a memory, a processor and a program stored in the memory and capable of running on the processor, wherein the processor executes the program to realize the deep learning based robot grabbing method according to any one of the above embodiments.
Compared with the closest prior art, the technical scheme of the embodiment of the application has the following beneficial effects:
according to the technical scheme provided by the embodiment of the application, three-dimensional point cloud information of a target object in a target scene is constructed according to an input depth image, and is encapsulated to obtain input data of the target object; then, according to input data of the target object, obtaining grabbing evaluation values of different grabbing modes in candidate grabbing modes of the target object under the target scene based on a pre-trained grabbing mode evaluation model; and finally, selecting the grabbing mode with the highest grabbing evaluation value to grab the target object in the target scene. Therefore, input data of an actual use scene is provided through a generation mode of a virtual target object, a grabbing mode evaluation model for evaluating the effectiveness of grabbing modes is used for evaluating a plurality of grabbing modes simultaneously, comprehensive scores of different candidate grabbing modes on grabbing effects under a current input picture can be evaluated quickly, and according to the grabbing effect comprehensive score vector, the grabbing mode with the best effect which is most suitable for a decision basis of the current scene is selected; the method has the advantages that decision basis of the robot grabbing mode is provided in an auxiliary mode, and quick analysis and effective decision of selecting the grabbing mode when a three-dimensional object is faced in a real scene are achieved; and an important decision basis is provided for the action and the behavior of the robot in the actual use process.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application. Wherein:
fig. 1 is a schematic flow diagram of a deep learning-based robot capture method according to some embodiments of the present application;
FIG. 2 is a technical framework diagram of the deep learning-based robot grabbing method of the embodiment shown in FIG. 1;
FIG. 3 is a schematic flow chart of a method for constructing a grasp-mode assessment model according to some embodiments of the present application;
FIG. 4 is a technical framework diagram of the construction of the grip evaluation model according to the embodiment shown in FIG. 3;
fig. 5 is a schematic structural diagram of a deep learning-based robotic grasping system according to some embodiments of the present application;
FIG. 6 is a schematic structural diagram of an electronic device provided in accordance with some embodiments of the present application;
fig. 7 is a hardware block diagram of an electronic device provided in accordance with some embodiments of the present application.
Detailed Description
The present application will be described in detail below with reference to the embodiments and the accompanying drawings. The various examples are provided by way of explanation of the application and do not limit it. In fact, it will be apparent to those skilled in the art that modifications and variations can be made in the present application without departing from its scope or spirit. For instance, features illustrated or described as part of one embodiment can be used with another embodiment to yield a still further embodiment. It is therefore intended that the present application cover such modifications and variations as come within the scope of the appended claims and their equivalents.
At present, a model trained on a large-scale data set of labelled object-grabbing modes can adapt robustly to complex application scenes, but collecting such data usually requires laborious manual labelling or tedious, repeated trials. The applicant has noticed that existing methods struggle to generate training data that effectively benefits model training. Therefore, in this application, massive virtual training data are generated and prepared by engine rendering, depth estimation and similar methods during three-dimensional model reconstruction; during data rendering, probability-distribution sampling is adopted to reduce the gap between the test scene and the application scene as far as possible, ensuring the robustness of the final model. This virtual three-dimensional-object probability-sampling grabbing-mode multi-task learning technique based on deep learning effectively addresses the shortage of high-quality training data for object grabbing. A grabbing quality evaluation network model (the grabbing mode evaluation model) is built on a deep convolutional neural network, and multi-task analysis and learning are performed according to the various grabbing modes and the current characteristics of the input objects (target objects), so that the final grabbing mode decision can be made according to different use scenes. Reasonable parameter learning with deep learning therefore gives final decisions with good evaluation quality, high accuracy, high success rate and strong robustness, and a better practical application experience.
As shown in fig. 1 and 2, the robot capture method based on deep learning includes:
Step S101, constructing three-dimensional point cloud information of a target object according to an input depth image, and encapsulating it to obtain input data of the target object;
specifically, based on a depth estimation method, point cloud information of a target object is constructed according to an input depth image; the depth image is image data having depth information. And then encapsulating the point cloud information of the target object to obtain the input data of the target object. Here, the input data of the target object can clearly represent three-dimensional data of the three-dimensional information of the target object.
In the embodiment of the application, the target objects are classified and divided according to the use scene setting of the robot, and the set of each type of target objects corresponds to one use scene of the robot. The input depth image is a set of images of a target object in an actual use scene acquired by an image signal perception module of the robot. Then, three-dimensional data (three-dimensional point cloud information) of the target object is constructed through a rendering engine or a three-dimensional reconstruction method, the three-dimensional data of the target object is rendered according to an input format of the deep learning model, the three-dimensional data is derived into a specific rendering format (such as an obj format and a fbx format), and packaging is completed to be used as input data of the capture evaluation model.
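To make the depth-image-to-point-cloud step above concrete, the following is a minimal Python sketch, assuming a pinhole camera model with known intrinsics (fx, fy, cx, cy) and a fixed-size input expected by the evaluation model; the function names, the 2048-point size and the random down-sampling are illustrative choices, not details taken from the patent.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth image (H x W, in metres) into an N x 3 point cloud,
    assuming a pinhole camera with intrinsics (fx, fy, cx, cy)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel column/row indices
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]                  # drop pixels with no depth reading

def encapsulate_input(points, num_points=2048):
    """Package the point cloud into a fixed-size float32 array usable as model input."""
    idx = np.random.choice(len(points), num_points, replace=len(points) < num_points)
    return points[idx].astype(np.float32)
```

A call such as encapsulate_input(depth_to_point_cloud(depth, 600.0, 600.0, 320.0, 240.0)) would then yield the encapsulated input data for one depth image.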
Step S102, obtaining grabbing evaluation values of different grabbing modes in candidate grabbing modes of the target object under the target scene based on a pre-trained grabbing mode evaluation model according to input data of the target object;
in some optional embodiments, based on a pre-trained grasping mode evaluation model, performing model forward propagation operation on input data of a target object, and acquiring grasping evaluation values of different grasping modes in candidate grasping modes of the target object in a target scene according to influence factors of characteristic attributes of the target object; wherein the characteristic attributes of the target object include: the shape, pose, and position of the target object.
In the embodiment of the application, the same target object needs to be matched with different grabbing modes under different postures and positions. The objective of the grabbing mode evaluation model is, given the relevant information about the target object and the candidate grabbing modes, to calculate a multi-task evaluation value for each candidate grabbing mode, using the influence factors of the various factors encountered in practical application as the decision basis. That is, given the input data of the target object and its candidate grabbing modes in the target scene, the model outputs a score estimating the success rate of grabbing the target object in that scene.
In the embodiment of the present application, the training data of the grabbing mode evaluation model are generated by constructing virtual objects as in step S101.
In the embodiment of the application, a grabbing mode μ ∈ Φ is considered, where Φ denotes the set of grabbing modes and comprises a plurality of different candidate grabbing modes. The grabbing evaluation value of each candidate grabbing mode is calculated from its grabbing parameters in the target scene and from the relative state between the robot's simulated vision and the target object. Both the grabbing parameters and this relative state can be set and adjusted in the grabbing simulation scene of the target scene. The grabbing parameters comprise grabbing direction, speed and force; the robot's simulated vision is the virtual counterpart, in the grabbing simulation scene, of the robot's image signal perception module. The relative state x between the simulated vision and the target object is determined from information such as the deflection angle, depression angle and focal length between them. Its expression is given by formula (1):
x = (P_o, P_c) …………………………………… (1)
where P_o denotes the pose of the target object and P_c denotes the pose of the robot's simulated vision.
In the embodiment of the application, y denotes the point cloud information of the target object acquired by the robot's image signal perception module (the three-dimensional point cloud information constructed from the depth image), and M(μ, x) ∈ {0, 1} denotes the label indicating whether grabbing mode μ succeeds under the relative state x between the simulated vision and the target object; p(M, μ, x, y) denotes the joint distribution of M, μ, x and y, and Q(μ, y) denotes the estimated probability that grabbing mode μ succeeds given the point cloud information y, i.e. the grabbing evaluation value of grabbing mode μ for the target object based on the joint distribution p(M, μ, x, y).
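As an illustration of how such an evaluation model Q(μ, y) could be realized, here is a minimal TensorFlow/Keras sketch that maps an encapsulated point cloud to one evaluation vector per candidate grabbing mode. The layer sizes, the number of candidate modes |Φ| and the number of influence factors per mode are assumptions for the sketch; the patent does not disclose a concrete architecture.

```python
import tensorflow as tf

NUM_CANDIDATES = 16   # |Phi|: number of candidate grabbing modes (assumed)
NUM_FACTORS = 4       # influence factors per mode, e.g. success, speed, collision risk (assumed)

def build_grasp_evaluation_model(num_points=2048):
    """Q_theta(mu, y): maps an encapsulated point cloud y to an evaluation
    vector for every candidate grabbing mode mu in Phi."""
    points = tf.keras.Input(shape=(num_points, 3), name="point_cloud_y")
    x = tf.keras.layers.Conv1D(64, 1, activation="relu")(points)
    x = tf.keras.layers.Conv1D(128, 1, activation="relu")(x)
    x = tf.keras.layers.GlobalMaxPooling1D()(x)          # order-invariant global feature
    x = tf.keras.layers.Dense(256, activation="relu")(x)
    # one score vector (influence factors) per candidate grabbing mode
    out = tf.keras.layers.Dense(NUM_CANDIDATES * NUM_FACTORS, activation="sigmoid")(x)
    out = tf.keras.layers.Reshape((NUM_CANDIDATES, NUM_FACTORS))(out)
    return tf.keras.Model(points, out, name="grasp_mode_evaluation")
```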
Step S103, selecting the grabbing mode with the highest grabbing evaluation value and grabbing the target object in the target scene.
In the embodiment of the application, when deciding the grabbing mode for the target object, the grabbing mode with the highest grabbing evaluation value, obtained by forward propagation of the grabbing mode evaluation model, is selected:
π_θ(y) = argmax_{μ∈Φ} Q_θ(μ, y)
where π_θ denotes the grabbing mode with the highest evaluation value, and Q_θ(μ, y) denotes the grabbing evaluation value of the target object for grabbing mode μ under the current point cloud information y, obtained by forward propagation of the grabbing mode evaluation model. Here, Q_θ(μ, y) is a multi-dimensional vector containing the various influencing factors (such as behaviour speed and probability of collision), and θ denotes the set of weights assigned to these influence factors. The grabbing decision process therefore computes a composite score using the weight information for behaviour speed, probability of collision and so on, and selects the grabbing mode with the largest score.
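A minimal sketch of this decision rule, assuming the evaluation model returns one row of factor scores per candidate grabbing mode and that θ is a vector of factor weights (the example weights below are illustrative, not values from the patent):

```python
import numpy as np

# theta: assumed example weights for the influence factors
# (e.g. success probability, behaviour speed, collision probability, stability)
theta = np.array([0.6, 0.2, 0.15, 0.05])

def select_grasp(q_values, weights=theta):
    """pi_theta(y) = argmax over mu of the weighted composite score theta . Q_theta(mu, y).

    q_values: (NUM_CANDIDATES, NUM_FACTORS) output of the evaluation model for the
    current point cloud y. Returns the index of the chosen grabbing mode."""
    composite = q_values @ weights        # weighted composite score per candidate
    return int(np.argmax(composite))
```

The returned index is then mapped back to the concrete grabbing parameters (direction, speed, force) of the selected candidate mode.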
In the implementation of the application, input data matching the actual use scene are provided by generating virtual target objects, and the grabbing mode evaluation model, which assesses the effectiveness of grabbing modes, evaluates multiple candidate grabbing modes simultaneously, so that composite scores of the grabbing effect of the different candidate modes under the current input image can be obtained quickly; from this composite-score vector, the most effective grabbing mode for the current scene is selected. This assists the decision on the robot's grabbing mode, enables fast analysis and effective decisions when selecting a grabbing mode for a three-dimensional object in a real scene, and provides an important decision basis for the actions and behaviour of the robot in actual use.
As shown in fig. 3 and 4, the robot capture method based on deep learning further includes:
s301, generating training scene data of a capture mode evaluation model according to three-dimensional point cloud data of a training object obtained in advance by a probability sampling method;
in the embodiment of the application, aiming at the problem that the strategy for selecting the grabbing mode of a specific object (target object) is not robust enough, a three-dimensional data scheme is used for generating a large amount of virtual three-dimensional object data meeting the probability distribution of an actual application scene. The method comprises the steps of simulating actual scenes which may be met by the robot by constructing a virtual three-dimensional object, generating a large amount of training data closely related to the actual application scenes, and using the training data for robot grabbing mode training.
In the embodiment of the application, different grabbing modes of the same target object need to be matched under the conditions of different postures and different positions. Therefore, when the grasping mode evaluation model is trained, a use scene with the largest coverage range is generated through probability sampling according to given point cloud information of the three-dimensional object, and it is ensured that the grasping evaluation model can learn relatively robust experience parameters in a training process.
Specifically, a training object corresponding to the target scene is first selected from a training object set according to a target scene in a preset application scene set; the training object set comprises a plurality of different classes of training objects, the preset application scene set comprises a plurality of different target scenes, and the different target scenes correspond respectively to training objects of different classes. Three-dimensional point cloud data of the training object are then generated based on a rendering engine or a three-dimensional reconstruction method, in a format that can be used for rendering. Finally, the three-dimensional point cloud data are rendered, the training objects are probability-sampled according to their statistics in the target scene, and training scene data for the grabbing mode evaluation model that satisfy the preset conditional probability distribution are generated. Here, the KL (Kullback-Leibler) divergence is used to measure how closely the generated data match the preset conditional probability distribution.
In the embodiment of the application, probability sampling is added in the three-dimensional data rendering process, so that the generated training data can meet the preset condition probability distribution, and the training data and the actual use scene of the robot have less relative entropy. The preset conditional probability distribution is determined according to the set of the actual use scene of the robot and the corresponding target object. For example, in an actual use scene of the robot, the target objects include a water bottle and a cube, and in a set of target objects composed of the water bottle and the cube, the probability that the robot grabs the water bottle is 80% and the probability that the robot grabs the cube is 20%, then the probabilities of the water bottle and the cube in the generated three-dimensional training data are 80% and 20%, respectively.
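A minimal sketch of this probability sampling and of the KL-divergence check, using the water bottle / cube example above as the assumed scene statistics (the helper names and the 10 000-sample size are illustrative):

```python
import numpy as np

# Empirical grasp-frequency statistics of the target scene
# (assumed example: 80% water bottle, 20% cube, as in the text above).
scene_distribution = {"water_bottle": 0.8, "cube": 0.2}

def sample_training_objects(num_samples, distribution=scene_distribution, seed=0):
    """Draw training-object categories so the generated data follow the
    preset conditional probability distribution of the target scene."""
    rng = np.random.default_rng(seed)
    categories = list(distribution)
    probs = np.array([distribution[c] for c in categories])
    return rng.choice(categories, size=num_samples, p=probs)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between the empirical distribution of the generated data (p)
    and the preset scene distribution (q); small values mean a good match."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)))

samples = sample_training_objects(10000)
empirical = np.array([(samples == c).mean() for c in scene_distribution])
target = np.array(list(scene_distribution.values()))
print(kl_divergence(empirical, target))   # should be close to 0
```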
In the embodiment of the application, three-dimensional point cloud information of the training object is constructed from its input depth image based on a depth estimation method; the depth image of the training object is image data with depth information, obtained by photographing objects in the real scene with the robot's image signal perception module and sampling them according to the observed statistics. The point cloud information of the training object is then encapsulated to obtain the data for training the grabbing mode evaluation model.
In the embodiment of the application, before the grabbing mode evaluation model is trained, the feature information of the training object's input picture needs to be acquired. Feature extraction is performed on the three-dimensional point cloud data of the training object's depth image: the point cloud data can be fed into a convolutional neural network model, feature information is extracted by forward propagation, and rendering and packaging yield the image features (the training scene data); the feature maps of the related branches are then computed through the convolution kernels of each candidate-block branch.
Here, it should be noted that, for the other steps and flows of generating the training scene data in step S301, reference may be made to the step and the flow of generating the input data of the capturing evaluation model from the target object in step S101, and details are not repeated here.
Step S302, iteratively updating the grabbing mode evaluation model based on deep learning according to the training scene data and a preset loss function.
In the embodiment of the present application, the goal of the grabbing mode evaluation model is to make the grabbing evaluation value as close as possible to the set label value (the manually labelled grabbing success rate S). The deviation between the grabbing evaluation value and the set label value is checked through a loss function; convergence of the loss function indicates that the grabbing evaluation value is close to the set label value, and that the grabbing mode corresponding to this evaluation value meets the actual grabbing requirement for the target object.
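The loss-function formula itself appears only as an image in the published text and is not reproduced above. Consistent with the cross-entropy loss described in the following paragraphs, a standard binary cross-entropy between the grabbing evaluation value and the label would take the form below; this is a hedged reconstruction, not a verbatim copy of the patent's formula.

```latex
% Assumed binary cross-entropy form; S_i is the labelled grabbing success value and
% Q_theta(mu_i, y_i) the predicted grabbing evaluation for training sample i.
\mathcal{L}(\theta) = -\frac{1}{N}\sum_{i=1}^{N}
  \left[ S_i \log Q_\theta(\mu_i, y_i) + (1 - S_i)\,\log\big(1 - Q_\theta(\mu_i, y_i)\big) \right]
```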
In the embodiment of the application, the loss function is a cross-entropy loss function. Specifically, the weights and bias values of each layer in the grabbing mode evaluation model, which is based on a deep convolutional neural network, are iteratively updated by a stochastic gradient descent method according to the training scene data and the preset cross-entropy loss function.
In the embodiment of the application, the parameters of the grabbing mode evaluation model are updated based on the defined cross-entropy loss function; the parameter update adjusts the weights and bias values of each layer of the deep convolutional neural network so that the final loss value is minimized. The initial weights and bias values of each layer of the deep convolutional neural network are generated randomly.
Further, the grabbing mode evaluation model is built on the TensorFlow deep learning framework; that is, the definition of the model structure and the computation of the loss function are implemented with TensorFlow's built-in methods. Specifically, the grabbing mode evaluation model is optimized by stochastic gradient descent: for each batch of training objects a loss value is computed, gradient information is generated, the model parameters are optimized through back-propagation and automatic updates, and the loss value decreases over the iterative training process.
In the embodiment of the application, the grabbing mode evaluation model is iteratively updated by stochastic gradient descent. In the initial stage of training, a suitable learning rate is chosen so that the model can quickly find the best direction for parameter optimization. As training progresses, the learning rate is gradually reduced for finer learning, avoiding the situation where the loss function fails to converge because the parameters fluctuate too much. Therefore, after each iteration the grabbing mode evaluation model updates the learning rate according to the set parameter configuration and the convergence state of the current model, and updates the weights according to that learning rate.
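Putting the pieces together, the following is a minimal TensorFlow training-step sketch with stochastic gradient descent, a binary cross-entropy loss and an exponentially decaying learning rate. It reuses build_grasp_evaluation_model from the earlier sketch; all hyper-parameter values are assumptions, since the patent does not disclose concrete settings.

```python
import tensorflow as tf

# Assumed hyper-parameters; the patent does not disclose concrete values.
initial_lr, decay_steps, decay_rate = 0.01, 1000, 0.9

lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_lr, decay_steps, decay_rate)              # learning rate decays as training progresses
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule)
bce = tf.keras.losses.BinaryCrossentropy()

model = build_grasp_evaluation_model()                # from the earlier sketch

@tf.function
def train_step(point_clouds, labels):
    """One stochastic-gradient-descent step: forward pass, cross-entropy loss
    between predicted grasp evaluations and the labelled success values,
    back-propagation, and update of every layer's weights and biases."""
    with tf.GradientTape() as tape:
        predictions = model(point_clouds, training=True)
        loss = bce(labels, predictions)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```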
In the embodiment of the application, a large amount of training data meeting the requirements of the actual use scene is provided by generating virtual target objects, and the grabbing mode evaluation model, which assesses the effectiveness of grabbing modes, evaluates multiple grabbing modes at the same time, promoting the training and multi-task optimization of the model. The result is a multi-task grabbing mode decision model that can adapt to complex environments and different objects, assists the decision on the robot's grabbing mode, and enables fast analysis and effective decisions when selecting a grabbing mode for a three-dimensional object in a real scene. Compared with traditional approaches based on object analysis or empirical learning, the method copes more robustly with the situations that may arise in actual scenes, reduces the workload of manual data labelling, and achieves better results when the object-grabbing method is applied.
In the application, the virtual three-dimensional-object probability-sampling grabbing-mode multi-task learning technique based on deep learning uses three-dimensional data reconstruction to rapidly generate a large number of valid virtual three-dimensional objects that satisfy the probability distribution of the actual scene, generates a large amount of labelled data through simulated rendering, and trains the parameters of the grabbing mode evaluation model. This realizes deep-learning-based multi-task learning of three-dimensional object grabbing modes, provides grabbing mode evaluation that is more accurate and robust than traditional methods, and makes the robot's decisions about grabbing the target object better match the objective conditions of the actual scene.
Fig. 5 is a schematic structural diagram of a deep learning-based robotic grasping system according to some embodiments of the present application; as shown in fig. 5, the deep learning based robot grasping system includes: an input unit 501, an evaluation unit 502 and a decision unit 503. The input unit 501 is configured to construct three-dimensional point cloud information of a target object according to an input depth image, and encapsulate it to obtain input data of the target object; the evaluation unit 502 is configured to obtain grabbing evaluation values of the different candidate grabbing modes of the target object in the target scene, based on a pre-trained grabbing mode evaluation model, according to the input data of the target object; the decision unit 503 is configured to select the grabbing mode with the highest grabbing evaluation value to grab the target object in the target scene.
The robot grasping system based on deep learning provided by the embodiment of the application can realize the steps and the flow of any one of the embodiments of the robot grasping method based on deep learning, and achieve the same technical effects, which are not described in detail herein.
FIG. 6 is a schematic structural diagram of an electronic device provided in accordance with some embodiments of the present application; as shown in fig. 6, the electronic apparatus includes:
one or more processors 601;
a computer readable medium may be configured to store one or more programs 602, which when executed by one or more processors 601, implement the steps of: constructing three-dimensional point cloud information of a target object according to the input depth image, and encapsulating to obtain input data of the target object; obtaining grabbing evaluation values of different grabbing modes in candidate grabbing modes of the target object under the target scene based on a pre-trained grabbing mode evaluation model according to input data of the target object; and selecting the grabbing mode with the highest grabbing evaluation value to grab the target object in the target scene.
FIG. 7 is a hardware block diagram of an electronic device provided in accordance with some embodiments of the present application; as shown in fig. 7, the hardware structure of the electronic device may include: a processor 701, a communication interface 702, a computer-readable medium 703, and a communication bus 704.
The processor 701, the communication interface 702, and the computer-readable medium 703 are all configured to communicate with each other via a communication bus 704.
Alternatively, the communication interface 702 may be an interface of a communication module, such as an interface of a GSM module.
The processor 701 may be specifically configured to: constructing three-dimensional point cloud information of a target object according to the input depth image, and encapsulating to obtain input data of the target object; obtaining grabbing evaluation values of different grabbing modes in candidate grabbing modes of the target object under the target scene based on a pre-trained grabbing mode evaluation model according to input data of the target object; and selecting the grabbing mode with the highest grabbing evaluation value to grab the target object in the target scene.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), etc., and may also be a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, capable of implementing or performing the various methods, steps and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The electronic device of the embodiments of the present application exists in various forms, including but not limited to:
(1) A mobile communication device: such devices are characterized by mobile communications capabilities and are primarily targeted at providing voice, data communications. Such terminals include: smart phones (e.g., IPhone), multimedia phones, functional phones, and low-end phones, etc.
(2) Ultra mobile personal computer device: the equipment belongs to the category of personal computers, has calculation and processing functions and generally has the characteristic of mobile internet access. Such terminals include: PDA, MID, and UMPC devices, etc., such as Ipad.
(3) A portable entertainment device: such devices can display and play multimedia content. This type of device comprises: audio and video players (e.g., iPod), handheld game players, electronic books, and smart toys and portable car navigation devices.
(4) A server: the device for providing the computing service comprises a processor, a hard disk, a memory, a system bus and the like, and the server is similar to a general computer architecture, but has higher requirements on processing capacity, stability, reliability, safety, expandability, manageability and the like because of the need of providing high-reliability service.
(5) And other electronic devices with data interaction functions.
It should be noted that, according to the implementation requirement, each component/step described in the embodiment of the present application may be divided into more components/steps, or two or more components/steps or partial operations of the components/steps may be combined into a new component/step to achieve the purpose of the embodiment of the present application.
The above-described methods according to embodiments of the present application may be implemented in hardware or firmware, or as software or computer code that can be stored in a recording medium such as a CD-ROM, RAM, floppy disk, hard disk or magneto-optical disk, or as computer code originally stored in a remote recording medium or a non-transitory machine storage medium and downloaded through a network to be stored in a local recording medium, so that the methods described herein can be carried out by such software on a recording medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware such as an ASIC or FPGA. It is understood that the computer, processor, microprocessor controller or programmable hardware includes memory components (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code that, when accessed and executed by the computer, processor or hardware, implements the deep learning based robot grabbing methods described herein. Further, when a general-purpose computer accesses code for implementing the methods illustrated herein, execution of the code transforms the general-purpose computer into a special-purpose computer for performing those methods.
Those of ordinary skill in the art will appreciate that the various illustrative elements and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the particular application of the solution and the constraints involved. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the embodiments of the present application.
It should be noted that, in the present specification, the embodiments are described in a progressive manner; identical or similar parts among the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the apparatus and system embodiments are described relatively simply because they are substantially similar to the method embodiments, and reference may be made to the description of the method embodiments for relevant details. The apparatus and system embodiments described above are merely illustrative: components described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without inventive effort.
The above embodiments are only used for illustrating the embodiments of the present application, and not for limiting the embodiments of the present application, and those skilled in the relevant art can make various changes and modifications without departing from the spirit and scope of the embodiments of the present application, so that all equivalent technical solutions also belong to the scope of the embodiments of the present application, and the scope of the embodiments of the present application should be defined by the claims.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (8)

1. A robot grabbing method based on deep learning is characterized by comprising the following steps:
constructing three-dimensional point cloud information of a target object according to an input depth image, and encapsulating to obtain input data of the target object;
obtaining grabbing evaluation values of different grabbing modes in candidate grabbing modes of the target object under the target scene based on a pre-trained grabbing mode evaluation model according to the input data of the target object;
selecting a grabbing mode with the highest grabbing evaluation value to grab the target object in the target scene;
the robot grabbing method based on deep learning further comprises the following steps:
generating training scene data of the grabbing mode evaluation model by a probability sampling method according to three-dimensional point cloud data of a training object obtained in advance;
iteratively updating the capture mode evaluation model based on deep learning according to the training scene data and a preset loss function;
the method for generating the training scene data of the grabbing mode evaluation model according to the three-dimensional point cloud data of the training object obtained in advance through a probability sampling method comprises the following steps:
selecting a training object corresponding to the target scene from a training object set according to the target scene in a preset application scene set; wherein the set of training objects comprises a plurality of different classes of the training objects; the preset application scene set comprises a plurality of different target scenes, and the different target scenes correspond to the training objects of different categories respectively;
generating three-dimensional point cloud data of the training object based on a rendering engine or a three-dimensional reconstruction method; wherein the three-dimensional point cloud data of the training object is in a format that can be used for rendering;
rendering the three-dimensional point cloud data, performing probability sampling on the training object according to the statistical data of the training object in the target scene, and generating training scene data of the grabbing mode evaluation model, wherein the training scene data meet preset conditional probability distribution.
2. The robot grabbing method based on deep learning of claim 1, wherein the building of three-dimensional point cloud information of a target object according to an input depth image and the packaging of the three-dimensional point cloud information to obtain input data of the target object comprises:
constructing point cloud information of the target object according to the input depth image based on a depth estimation method;
and encapsulating the point cloud information of the target object to obtain the input data of the target object.
3. The deep learning-based robot grabbing method according to claim 1, wherein the grabbing evaluation values of different grabbing modes in candidate grabbing modes of the target object in a target scene are obtained based on a pre-trained grabbing mode evaluation model according to the input data of the target object, specifically:
based on the pre-trained grasping mode evaluation model, performing model forward propagation operation on input data of the target object, and acquiring grasping evaluation values of different grasping modes in candidate grasping modes of the target object in a target scene according to influence factors of characteristic attributes of the target object; wherein the characteristic attributes of the target object include: the shape, attitude, position of the target object.
4. The robot grabbing method based on deep learning of claim 1, wherein the grabbing mode evaluation model based on deep learning is iteratively updated according to the training scene data and a preset loss function, specifically:
according to the training scene data and a preset cross-entropy loss function, iteratively updating, by a stochastic gradient descent method, the weights and bias values of each layer in the grasping mode evaluation model based on the deep convolutional neural network.
5. The deep learning based robot grabbing method according to any one of claims 1 to 4, wherein the deep learning based robot grabbing method further comprises: constructing the grasping mode evaluation model based on a TensorFlow learning framework.
6. A robot grasping system based on deep learning, characterized by comprising:
the input unit is configured to construct three-dimensional point cloud information of a target object according to an input depth image, and package the three-dimensional point cloud information to obtain input data of the target object;
the evaluation unit is configured to obtain grabbing evaluation values of different grabbing modes in candidate grabbing modes of the target object under a target scene based on a pre-trained grabbing mode evaluation model according to the input data of the target object;
the decision unit is configured to select a grabbing mode with the highest grabbing evaluation value and grab the target object in the target scene;
the deep learning based robotic grasping system is further configured to:
generating training scene data of the grabbing mode evaluation model by a probability sampling method according to three-dimensional point cloud data of a training object obtained in advance;
iteratively updating the capture mode evaluation model based on deep learning according to the training scene data and a preset loss function;
the method for generating the training scene data of the grabbing mode evaluation model according to the three-dimensional point cloud data of the training object obtained in advance through a probability sampling method comprises the following steps:
selecting a training object corresponding to the target scene from a training object set according to the target scene in a preset application scene set; wherein the set of training objects comprises a plurality of different classes of the training objects; the preset application scene set comprises a plurality of different target scenes, and the different target scenes correspond to the training objects of different categories respectively;
generating three-dimensional point cloud data of the training object based on a rendering engine or a three-dimensional reconstruction method; wherein the three-dimensional point cloud data of the training object is in a format that can be used for rendering;
rendering the three-dimensional point cloud data, performing probability sampling on the training object according to the statistical data of the training object in the target scene, and generating training scene data of the grabbing mode evaluation model, wherein the training scene data meet preset conditional probability distribution.
7. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the deep learning based robot grabbing method according to any one of claims 1-5.
8. An electronic device, comprising: a memory, a processor, and a program stored in the memory and executable on the processor, the processor implementing the deep learning based robot grabbing method according to any one of claims 1-5 when executing the program.
CN202111122883.XA 2021-09-24 2021-09-24 Robot grabbing method, system, medium and electronic device based on deep learning Active CN113787521B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111122883.XA CN113787521B (en) 2021-09-24 2021-09-24 Robot grabbing method, system, medium and electronic device based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111122883.XA CN113787521B (en) 2021-09-24 2021-09-24 Robot grabbing method, system, medium and electronic device based on deep learning

Publications (2)

Publication Number Publication Date
CN113787521A (en) 2021-12-14
CN113787521B (en) 2023-04-18

Family

ID=78879241

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111122883.XA Active CN113787521B (en) 2021-09-24 2021-09-24 Robot grabbing method, system, medium and electronic device based on deep learning

Country Status (1)

Country Link
CN (1) CN113787521B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114373046B (en) * 2021-12-27 2023-08-18 达闼机器人股份有限公司 Method, device and storage medium for assisting robot operation
CN114683251A (en) * 2022-03-31 2022-07-01 上海节卡机器人科技有限公司 Robot grabbing method and device, electronic equipment and readable storage medium
CN117021122B (en) * 2023-10-09 2024-01-26 知行机器人科技(苏州)有限公司 Grabbing robot control method and system

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2018326171A1 (en) * 2017-09-01 2020-04-23 The Regents Of The University Of California Robotic systems and methods for robustly grasping and targeting objects
CN109598264B (en) * 2017-09-30 2020-10-16 北京猎户星空科技有限公司 Object grabbing method and device
CN113205586A (en) * 2021-04-19 2021-08-03 Oppo广东移动通信有限公司 Image processing method and device, electronic equipment and computer readable storage medium

Also Published As

Publication number Publication date
CN113787521A (en) 2021-12-14

Similar Documents

Publication Publication Date Title
CN113787521B (en) Robot grabbing method, system, medium and electronic device based on deep learning
KR102318772B1 (en) Domain Separation Neural Networks
CN108986801B (en) Man-machine interaction method and device and man-machine interaction terminal
CN111161349B (en) Object posture estimation method, device and equipment
CN110799992B (en) Use of simulation and domain adaptation for robot control
JP2019537135A (en) Understanding and generating scenes using neural networks
US10762391B2 (en) Learning device, learning method, and storage medium
CN109299732B (en) Unmanned driving behavior decision and model training method and device and electronic equipment
US20220198609A1 (en) Modifying sensor data using generative adversarial models
CN112488067B (en) Face pose estimation method and device, electronic equipment and storage medium
CN111695421A (en) Image recognition method and device and electronic equipment
CN111738403A (en) Neural network optimization method and related equipment
CN111784776A (en) Visual positioning method and device, computer readable medium and electronic equipment
CN113160231A (en) Sample generation method, sample generation device and electronic equipment
CN111402122A (en) Image mapping processing method and device, readable medium and electronic equipment
CN115616937A (en) Automatic driving simulation test method, device, equipment and computer readable medium
CN110097004B (en) Facial expression recognition method and device
CN115797517B (en) Data processing method, device, equipment and medium of virtual model
CN114170484B (en) Picture attribute prediction method and device, electronic equipment and storage medium
CN116276973A (en) Visual perception grabbing training method based on deep learning
CN114399718B (en) Image content identification method and device in video playing process
US11551379B2 (en) Learning template representation libraries
US11501167B2 (en) Learning domain randomization distributions for transfer learning
CN114185430A (en) Human-computer interaction system and method and intelligent robot
US20200334530A1 (en) Differentiable neuromodulated plasticity for reinforcement learning and supervised learning tasks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant