CN117885101A - Robot gripping planning method, apparatus, electronic device, storage medium and computer program product - Google Patents


Info

Publication number
CN117885101A
Authority
CN
China
Prior art keywords
robot
view
sample
information
grabbing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410200749.4A
Other languages
Chinese (zh)
Inventor
徐志远
黎金铭
祝毅晨
车正平
刘冬
奉飞飞
唐剑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Midea Group Co Ltd
Midea Group Shanghai Co Ltd
Original Assignee
Midea Group Co Ltd
Midea Group Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Midea Group Co Ltd, Midea Group Shanghai Co Ltd
Priority to CN202410200749.4A
Publication of CN117885101A
Legal status: Pending

Classifications

    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1656Programme controls characterised by programming, planning systems for manipulators
    • B25J9/1664Programme controls characterised by programming, planning systems for manipulators characterised by motion, path, trajectory planning

Landscapes

  • Engineering & Computer Science (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to the technical field of computers and provides a robot grabbing planning method, apparatus, electronic device, storage medium and computer program product. The robot grabbing planning method comprises the following steps: acquiring an RGB image of a target object and body perception information of a robot; inputting the RGB image and the body perception information into a robot grabbing visual representation model to acquire grabbing action information of the robot output by the model; and determining grabbing planning information of the robot based on the grabbing action information. In the pre-training stage, depth information is learned from a large-scale data set, and the resulting 3D visual representation learning is then applied to 2D representation learning for robot control, so that visual perception information can be captured better; at the same time, fusing body perception information improves the operation precision and efficiency of the robot. Grabbing is thereby realized without 3D information, and grabbing accuracy is improved.

Description

Robot gripping planning method, apparatus, electronic device, storage medium and computer program product
Technical Field
The present application relates to the field of computer technology, and in particular, to a robot gripping planning method, apparatus, electronic device, storage medium, and computer program product.
Background
Visually guided robotic-arm imitation learning has found widespread use in gripping and recognition in complex environments; however, how to utilize visual information more efficiently remains a frontier problem of current research. Pre-trained visual representation is an important direction of this research, as it provides robots with a powerful capability to better understand and apply visual information in both simulated and real environments. However, existing pre-trained visual representations depend on self-supervised learning objectives and suffer from limitations in capturing and understanding three-dimensional information, difficulty in deploying the required computing resources on edge devices, and poor adaptability of the self-supervised learning objective.
Disclosure of Invention
The present application aims to solve at least one of the technical problems existing in the prior art. Therefore, the application provides a robot grabbing planning method that learns depth information from a large-scale data set in a pre-training stage and then applies the resulting 3D visual representation learning to 2D representation learning for robot control, so that visual perception information can be captured better. At the same time, the operation precision and efficiency of the robot are improved by fusing body perception information, and grabbing is realized without 3D information, so that grabbing accuracy is improved.
The application also provides a robot gripping and planning device, electronic equipment, a storage medium and a computer program product.
According to an embodiment of the first aspect of the application, a robot gripping planning method comprises the following steps:
Acquiring RGB images of a target object and body perception information of a robot; the body perception information characterizes the state information of the robot and the perception information of surrounding environment and objects;
Inputting the RGB image and the body perception information into a robot grabbing visual representation model to acquire grabbing action information of the robot, which is output by the robot grabbing visual representation model; the robot grabbing visual characterization model is obtained by training based on sample ontology sensing information, a sample RGB image and a depth map corresponding to the sample RGB image;
And determining grabbing planning information of the robot based on the grabbing action information.
According to one embodiment of the application, the robot gripping vision characterization model is trained based on the following steps:
Determining a first loss function based on the sample RGB image and the depth map;
training an encoder based on the first loss function, and determining a feature map of the sample RGB image based on the trained encoder;
Determining a second loss function based on the feature map and the sample ontology sensing information;
and performing model training on the second loss function to obtain the robot grabbing visual representation model.
According to one embodiment of the application, the determining a first loss function based on the sample RGB image and the depth map comprises:
determining a sample judgment matrix based on the sample RGB image and the depth map; the sample judgment matrix comprises a positive sample and a negative sample;
determining a positive pixel group and a negative pixel group based on the sample judgment matrix;
determining a pixel level loss function based on the positive pixel group and the negative pixel group;
the first loss function is determined based on the pixel-level loss function and an instance-level loss function.
According to one embodiment of the present application, the determining the second loss function based on the feature map and the sample ontology sensing information includes:
carrying out average pooling treatment on the feature map to obtain feature vectors of the feature map;
determining a cross attention result of a feature vector sequence of the feature map and a matrix sequence of the sample body perception information;
and determining the second loss function based on the feature vector, the cross attention result and a set task target, wherein the set task target represents the position where the target object is stored.
According to one embodiment of the present application, the determining a sample judgment matrix based on the sample RGB image and the depth map includes:
Clipping the sample RGB image and the depth map to obtain a first view and a second view; the first view includes a first RGB view and a first depth view; the second view includes a second RGB view and a second depth view;
determining a distance between the first RGB view and the second RGB view, and a depth value between the first depth view and the second depth view;
Determining a first sample judgment matrix of the sample RGB image based on the distance and a distance threshold;
Determining a second sample judgment matrix of the depth map based on the depth value and the depth threshold value;
And determining the sample judgment matrix based on the first sample judgment matrix and the second sample judgment matrix.
According to one embodiment of the application, the determining a pixel level loss function based on the positive pixel group and the negative pixel group includes:
Determining a loss function of the first view and a loss function of the second view based on the positive pixel group and the negative pixel group;
the pixel level loss function is determined based on the loss function of the first view and the loss function of the second view.
According to one embodiment of the application, the step of determining the instance level loss function is as follows:
Encoding different view angles of the feature map of the first RGB view to obtain a third RGB view;
Encoding different view angles of the feature map of the second RGB view to obtain a fourth RGB view;
determining a feature matrix of the third RGB view and a feature matrix of the fourth RGB view based on a prediction head network;
The instance level loss function is determined based on the third RGB view and its feature matrix, the fourth RGB view and its feature matrix.
According to a second aspect of the present application, a robot gripping and planning apparatus includes:
The acquisition module is used for acquiring the RGB image of the target object and the body perception information of the robot; the body perception information characterizes the state information of the robot and the perception information of surrounding environment and objects;
The identification module is used for inputting the RGB image and the body perception information into a robot grabbing visual representation model and acquiring grabbing action information of the robot, which is output by the robot grabbing visual representation model; the robot grabbing visual characterization model is obtained by training based on sample ontology sensing information, a sample RGB image and a depth map corresponding to the sample RGB image;
and the planning module is used for determining the grabbing planning information of the robot based on the grabbing action information.
An electronic device according to an embodiment of the third aspect of the present application comprises a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the robot gripping planning method according to any of the above when executing the program.
A non-transitory computer readable storage medium according to an embodiment of the fourth aspect of the present application has stored thereon a computer program which, when executed by a processor, implements a robot gripping planning method as described in any of the above.
A computer program product according to an embodiment of the fifth aspect of the application comprises a computer program which, when executed by a processor, implements a robot gripping planning method as described in any of the above.
The above technical solutions in the embodiments of the present application have at least one of the following technical effects:
The depth information is learned by utilizing a large-scale data set in the pre-training stage, and then the 3D visual representation learning is applied to robot-controlled 2D representation learning, so that visual perception information can be better captured, and better feature representations can be extracted under the same network architecture. By means of the transfer learning mode, training time and cost can be effectively saved, and better performance can be obtained under the condition of insufficient training data.
The robot body perception information and the visual perception information are fused in a robot body perception information fusion mode, so that the effect of the mechanical arm grabbing path planning can be effectively improved. Physical limitation and environmental constraint of the robot can be better considered by fusing the body perception information of the robot, so that a grabbing path can be accurately planned. The body perception information fusion mode can play an important role in robot operation, and the operation precision and efficiency of the robot are improved.
Additional aspects and advantages of the application will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of a robot grabbing planning method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of constructing a positive/negative sample judgment matrix according to an embodiment of the present application;
FIG. 3 is a flow chart of a robotic grasping and planning method with depth-aware pre-training provided by an embodiment of the application;
Fig. 4 is a schematic block diagram of a robot gripping and planning apparatus according to an embodiment of the present application;
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described in further detail below with reference to the accompanying drawings and examples. The following examples are illustrative of the application but are not intended to limit the scope of the application.
In describing embodiments of the present application, it should be noted that the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In embodiments of the application, unless expressly specified and limited otherwise, a first feature "up" or "down" on a second feature may be that the first and second features are in direct contact, or that the first and second features are in indirect contact via an intervening medium. Moreover, a first feature being "above," "over" and "on" a second feature may be a first feature being directly above or obliquely above the second feature, or simply indicating that the first feature is level higher than the second feature. The first feature being "under", "below" and "beneath" the second feature may be the first feature being directly under or obliquely below the second feature, or simply indicating that the first feature is less level than the second feature.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the embodiments of the present application. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
Currently, pre-trained visual representations rely on self-supervised learning objectives and are trained on large-scale 2D image datasets or self-supervised video representations so as to understand the environment from large amounts of data. However, this approach has limitations: the robot needs to operate in three-dimensional space, and 2D images often struggle to capture three-dimensional information accurately. An alternative approach is to use 3D knowledge, such as depth images, point cloud data or reconstructed three-dimensional scenes, to assist the robot in performing tasks in three-dimensional space. However, processing 3D models typically requires greater computational resources and is difficult to deploy on edge devices, and acquiring high-precision depth sensors is also a challenge.
Based on the above problems, the embodiment of the application provides a robot grabbing planning method.
Fig. 1 is a flow chart of a robot gripping planning method according to an embodiment of the present application. Referring to fig. 1, an embodiment of the present application provides a robot gripping planning method, including:
Step 100, obtaining an RGB image of a target object and body perception information of a robot.
Acquiring the RGB image of the target object (the object to be grasped) and the body perception information of the robot depends on different devices. For example, an RGB image may be acquired in the following ways:
Mode 1: camera. The robot is equipped with one or more cameras to acquire image information in the environment; these may be ordinary RGB cameras that capture optical images to obtain the appearance information of objects.
Mode 2: depth camera. The robot may also acquire RGB images using a depth camera, which provides a depth value for each pixel and can thereby generate both a two-dimensional image and a three-dimensional point cloud.
The body sensing information characterizes the state information of the robot and the sensing information of the surrounding environment and objects, for example, the body sensing information can be obtained in the following manner:
a. Position and attitude information: the robot can acquire its own position and attitude information, such as coordinates, angles and velocities, through sensors such as lidar, cameras and inertial measurement units; this specifically includes joint positions, joint velocities, gripper positions and the like.
b. Environmental awareness information: the robot can acquire information about the surrounding environment, such as a map, obstacles and human bodies, through sensing devices such as lidar, cameras and infrared sensors.
c. Object identification and tracking information: the robot may identify and track surrounding objects, such as people, vehicles and furniture, through visual or other sensors.
d. Action execution feedback information: the robot can acquire feedback information after performing actions, such as applied force, velocity and acceleration, through its sensors.
By understanding its body perception information, the robot can better plan grabbing actions, avoid collisions, improve the grabbing success rate, and adapt to different scenes and objects.
Step 200, inputting the RGB image and the body perception information into a robot grabbing visual representation model, and obtaining grabbing action information of the robot, which is output by the robot grabbing visual representation model.
A robot grabbing visual representation model is trained in advance; the RGB image and the body perception information are input into this model to obtain the grabbing action information of the robot output by the model. The robot grabbing visual representation model is obtained by training based on sample ontology sensing information, sample RGB images and the depth maps corresponding to the sample RGB images.
And 300, determining grabbing planning information of the robot based on the grabbing action information.
After the grabbing action information output by the robot grabbing visual representation model is obtained, the grabbing strategy and actions that the robot should take are determined through algorithms and planning methods according to this grabbing action information. The grabbing action information includes movements of the robot arm, opening and closing of the gripper, and the like, so as to grab and manipulate the object.
For example, assume the robot needs to grasp a cup on a table. The model predicts a motion that moves the arm so that the gripper can close at an appropriate location around the cup. To realize this action, the robotic arm performs trajectory planning according to the model's prediction, computes the motion trajectory of its joints, then moves along that trajectory and brings the gripper to a suitable position. The robotic arm opens the gripper and places it around the cup according to the model's predicted action; the model then predicts how the robot should adjust the pose of the arm and gripper to ensure that the cup is grasped firmly. The robotic arm adjusts accordingly and closes the gripper to securely grasp the cup. Finally, the robotic arm may move the cup to the target location or perform other manipulation operations, such as placing the cup on another table.
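To make this control flow concrete, the following is a minimal inference sketch. All interfaces here (the model callable, camera.get_rgb_image, robot.get_proprioception, robot.plan_trajectory, robot.execute) are hypothetical names for illustration, not APIs defined by this application.

```python
import torch

def plan_and_execute_grasp(model, camera, robot, goal):
    """One control step: RGB image + body perception -> predicted grabbing action -> arm planning."""
    rgb = camera.get_rgb_image()                # H x W x 3 frame from an ordinary RGB camera
    proprio = robot.get_proprioception()        # e.g. joint positions/velocities, gripper state
    with torch.no_grad():
        action = model(rgb, proprio, goal)      # grabbing action information from the visual representation model
    trajectory = robot.plan_trajectory(action)  # joint-space trajectory realising the predicted motion
    robot.execute(trajectory)                   # move the arm and open/close the gripper
    return action
```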
According to the robot grabbing planning method provided by the embodiment of the application, the RGB image of the target object and the body perception information of the robot are obtained; the body perception information represents the state information of the robot and the perception information of surrounding environment and objects; inputting the RGB image and the body perception information into a robot grabbing visual representation model to acquire grabbing action information of the robot output by the robot grabbing visual representation model; the robot grabbing visual characterization model is obtained by training based on sample ontology sensing information, a sample RGB image and a depth map corresponding to the sample RGB image; and determining grabbing planning information of the robot based on the grabbing action information. According to the embodiment of the application, the depth information is learned by utilizing a large-scale data set in the pre-training stage, and then the 3D visual representation learning is applied to the 2D representation learning controlled by the robot, so that the visual perception information can be better captured, meanwhile, the operation precision and efficiency of the robot are improved in a body perception information fusion mode, and the grabbing is realized under the condition of no 3D information, so that the grabbing accuracy is improved.
Based on the above embodiment, the robot gripping vision characterization model is trained based on the following steps:
step 211 of determining a first loss function based on the sample RGB image and the depth map;
step 212, training an encoder based on the first loss function, and determining a feature map of the sample RGB image based on the trained encoder;
step 213, determining a second loss function based on the feature map and the sample ontology sensing information;
And step 214, performing model training on the second loss function to obtain the robot grabbing visual representation model.
A sample RGB image and its corresponding depth map are acquired, where the depth map represents the distance (depth) of the object from the camera in the form of gray values. The sample RGB image and the depth map are then cropped separately to obtain a first view and a second view; for example, the sample RGB image is cropped first to obtain a first RGB view and a second RGB view, and the depth map is then cropped in the same way to obtain a first depth view and a second depth view. Because the cropping is identical, the first RGB view corresponds to the first depth view: the two have the same content but different view attributes (RGB view versus depth view). Likewise, the second RGB view corresponds to the second depth view, with the same content but different view attributes. The first view comprises the first RGB view and the first depth view, and the second view comprises the second RGB view and the second depth view.
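As an illustration of this identical-crop step, the sketch below samples one set of crop parameters and applies it to both the RGB image and the depth map; the crop scale, output size and the use of torchvision are assumptions, not requirements of the application.

```python
import torchvision.transforms.functional as TF
from torchvision.transforms import RandomResizedCrop

def paired_random_crop(rgb, depth, out_size=224):
    # Sample the crop parameters once, then apply the same crop to both modalities
    # so each RGB view stays aligned with its depth view.
    i, j, h, w = RandomResizedCrop.get_params(rgb, scale=(0.2, 1.0), ratio=(0.75, 1.33))
    rgb_view = TF.resized_crop(rgb, i, j, h, w, [out_size, out_size])
    depth_view = TF.resized_crop(depth, i, j, h, w, [out_size, out_size])
    return rgb_view, depth_view

# Calling this twice yields (first RGB view, first depth view) and (second RGB view, second depth view).
```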
Further determining a distance between the first RGB view and the second RGB view, and a depth value between the first depth view and the second depth view; then, determining a first sample judgment matrix of the sample RGB image based on the distance and the distance threshold value; determining a second sample judgment matrix of the depth map based on the depth value and the depth threshold value; finally, determining a sample judgment matrix based on the first sample judgment matrix and the second sample judgment matrix, wherein the sample judgment matrix comprises a positive sample and a negative sample.
For example, the first sample judgment matrix is:

A_image(i1, j1) = 1 if dist(i1, j1) < T, and 0 otherwise;

where A_image(i1, j1) denotes the positive/negative sample judgment matrix of the RGB image, i.e., the first sample judgment matrix; i1 denotes the index of a vector of the first RGB view; j1 denotes the index of a vector of the second RGB view; dist denotes the normalized Euclidean distance between i1 and j1 in the feature space; and T denotes the distance threshold, with T set to {0.3, 0.5, 0.7}.

The second sample judgment matrix is:

A_depth(i2, j2) = 1 if |depth(i2) − depth(j2)| < T′, and 0 otherwise;

where A_depth(i2, j2) denotes the positive/negative sample judgment matrix of the depth map, i.e., the second sample judgment matrix; i2 denotes the index of a vector of the first depth view; j2 denotes the index of a vector of the second depth view; depth denotes the normalized depth value; and T′ denotes the threshold of the depth map, with T′ set to {0.3, 0.5, 0.7}.

The sample judgment matrix is:

A(i, j) = A_image(i1, j1) × A_depth(i2, j2);

where A(i, j) denotes the sample judgment matrix, i.e., the positive/negative sample judgment matrix; i denotes the index of a vector of the first view and j denotes the index of a vector of the second view.
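A minimal sketch of building the sample judgment matrix for one threshold pair (T, T′) is given below; it assumes the view features and per-patch depth values have been flattened to shapes [N, C] and [N], which is an illustrative convention rather than the application's exact implementation.

```python
import torch
import torch.nn.functional as F

def sample_judgment_matrix(feat1, feat2, depth1, depth2, t_dist=0.5, t_depth=0.5):
    # A_image: 1 where the normalized Euclidean distance between view features is below T.
    f1 = F.normalize(feat1, dim=-1)
    f2 = F.normalize(feat2, dim=-1)
    a_image = (torch.cdist(f1, f2) < t_dist).float()                     # [N1, N2]
    # A_depth: 1 where the normalized depth values of the two views are close (below T').
    a_depth = ((depth1[:, None] - depth2[None, :]).abs() < t_depth).float()
    # Element-wise product: a pixel pair is positive only if both criteria agree.
    return a_image * a_depth
```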
Further, a positive pixel group and a negative pixel group are determined based on the sample judgment matrix. For example, on the basis of the depth information, for a pixel i′ in the first view that is also located in the second view, the positive and negative pixel groups are defined as:

Ω_p(i′) = { j | A(i′, j) = 1 },  Ω_n(i′) = { j | A(i′, j) = 0 };

where Ω_p(i′) denotes the positive pixel group and Ω_n(i′) denotes the negative pixel group; p denotes the pixel group in the first view and n denotes the pixel group in the second view.
A pixel level loss function is determined based on the positive and negative pixel groups. Specifically, a loss function of the first view and a loss function of the second view are determined based on the positive pixel group and the negative pixel group, and the pixel level loss function is then determined from the loss functions of the two views. For example, the contrastive learning loss at the pixel level is:

L_pix(i) = −log [ Σ_{j∈Ω_p(i)} exp(x_i · x′_j / τ) / ( Σ_{j∈Ω_p(i)} exp(x_i · x′_j / τ) + Σ_{k∈Ω_n(i)} exp(x_i · x′_k / τ) ) ];

where x_i denotes the normalized representation of the original sample, x′_j denotes the normalized representation of a positive sample, x′_k denotes the normalized representation of a negative sample, and τ denotes the temperature hyperparameter. The smaller τ is, the closer the softmax function is to a hard max; the larger τ is, the closer its output is to a uniform distribution.

The embodiment of the application mainly uses pixel-level and instance-level contrastive learning for training, where the pixel-level loss function is:

L_pix = ( L_pix(x) + L_pix(j) ) / 2;

where L_pix denotes the pixel level loss function, L_pix(x) denotes the loss function of the first view and L_pix(j) denotes the loss function of the second view, both calculated with the pixel-level contrastive loss above.
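The pixel-level term might then be computed as in the sketch below, which follows the InfoNCE-style reconstruction above; treating the judgment matrix directly as the positive/negative mask and averaging the two view losses are assumptions.

```python
import torch

def pixel_contrastive_loss(x, x_prime, A, tau=0.3):
    """x, x_prime: [N, C] normalized pixel features of the two views; A: [N, N] judgment matrix."""
    exp_sim = (x @ x_prime.t() / tau).exp()   # pairwise similarities scaled by the temperature tau
    pos = (exp_sim * A).sum(dim=1)            # positive pixel group of each pixel i
    neg = (exp_sim * (1 - A)).sum(dim=1)      # negative pixel group of each pixel i
    valid = A.sum(dim=1) > 0                  # skip pixels that have no positive counterpart
    return -torch.log(pos[valid] / (pos[valid] + neg[valid])).mean()

def pixel_level_loss(x, x_prime, A, tau=0.3):
    # L_pix: average of the first-view and second-view losses.
    return 0.5 * (pixel_contrastive_loss(x, x_prime, A, tau)
                  + pixel_contrastive_loss(x_prime, x, A.t(), tau))
```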
Encoding different view angles of the feature map of the first RGB view to obtain a third RGB view, and encoding different view angles of the feature map of the second RGB view to obtain a fourth RGB view; and then determining a characteristic matrix of the third RGB view and a characteristic matrix of the fourth RGB view based on the prediction head network, and finally determining an instance-level loss function based on the third RGB view and the characteristic matrix thereof, the fourth RGB view and the characteristic matrix thereof.
For example, in a self-supervised learning task, different views of the same image need to be encoded to generate different view representations. To achieve this, separate projection heads may be used for the output of the encoder, generating two views, respectively, including a third RGB view q and a fourth RGB view q'. Specifically, in ResNet, this goal can be achieved by adding two separate fully connected layers, called projection heads, that map the output of the encoder to two 512-dimensional feature vector spaces, each generating a representation of two views, which can be seen as different perspectives of the same image, for contrast learning.
Meanwhile, a prediction head network follows the projection head, and a feature matrix k of the third RGB view and a feature matrix k' of the fourth RGB view are generated. Where the prediction head network is an additional network layer that receives as input the feature vectors derived from the projection head and generates additional predictions that can be used for different tasks such as image classification, object detection or other self-supervised learning tasks. Specifically, in ResNet, the structure of the prediction head network may include multiple convolution layers and full connection layers to map the input features to a 7x7 output matrix. For each input view, the prediction head network will extract features and capture context information in the image, and by introducing convolutional and pooling layers in the network, the network can learn higher level feature representations and map them to the final 7x7 output matrix through the fully connected layer.
By adding a prediction head network after the projection head, the feature representation learned by the encoder can be utilized to solve multiple tasks, thereby improving the expressive and generalization capabilities of the model. This method of multitasking can help the model better understand the semantic information of the image and share and pass this learned knowledge among the different tasks.
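One possible shape for these heads is sketched below: two separate projection heads produce the 512-dimensional view representations q and q′, and a prediction head maps a projected view to a 7×7 output matrix. Using plain linear layers (rather than the convolution layers mentioned as an option) and the hidden size of 256 are assumptions.

```python
import torch.nn as nn

projection_head_1 = nn.Linear(512, 512)   # produces the third RGB view q from the encoder output
projection_head_2 = nn.Linear(512, 512)   # produces the fourth RGB view q'

prediction_head = nn.Sequential(          # produces the feature matrices k and k'
    nn.Linear(512, 256),
    nn.ReLU(inplace=True),
    nn.Linear(256, 49),                   # 49 values, reshaped to a 7x7 output matrix downstream
)
```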
The instance level loss function is:

L_ins = ( qᵀ k′ + kᵀ q′ ) / 2;

where L_ins denotes the instance level loss function, qᵀ denotes the transpose of the third RGB view q, and kᵀ denotes the transpose of the feature matrix k of the third RGB view.
Finally, the first loss function is determined based on the pixel-level loss function and the instance-level loss function. For example, the final pre-training loss function (i.e., the first loss function) is:

L = L_pix + α · L_ins;

where α denotes a weight, and α is set to 1.
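A minimal sketch of combining the two terms into the first loss function is given below. It evaluates L_ins = (qᵀk′ + kᵀq′)/2 per sample on flattened view vectors, which is one possible reading of the formula above and should be treated as an assumption.

```python
import torch

def instance_level_loss(q, k, q_prime, k_prime):
    # (q^T k' + k^T q') / 2, evaluated per sample and averaged over the batch.
    return 0.5 * ((q * k_prime).sum(dim=-1) + (k * q_prime).sum(dim=-1)).mean()

def first_loss(l_pix, q, k, q_prime, k_prime, alpha=1.0):
    # L = L_pix + alpha * L_ins, with alpha = 1 as in the text.
    return l_pix + alpha * instance_level_loss(q, k, q_prime, k_prime)
```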
Further, the encoder is trained based on the first loss function, and a feature map of the sample RGB image is determined based on the trained encoder. For example, in contrast learning, the encoder is trained using a first loss function so that the encoder can efficiently convert an input image into a feature vector having distinctiveness. The loss function in contrast learning is mainly used for measuring similarity or distance between feature vectors generated by the encoder so as to train the encoder to extract distinguishing features.
A second loss function is determined based on the feature map and the sample ontology sensing information. Specifically, the feature map is average-pooled to obtain the feature vector of the feature map; the cross-attention result of the feature vector sequence of the feature map and the matrix sequence of the sample body perception information is determined; and the second loss function is determined based on the feature vector, the cross-attention result and the set task target. The set task target represents the position where the target object is to be placed, i.e., after the robot successfully grabs the object, it places the object at a specific position, which may be a preset fixed position or a position dynamically determined according to task requirements. For example, for a fixed position, the robot is programmed or configured to place objects in a fixed location, such as on a particular table or shelf; for a dynamic position, the robot determines the placement position dynamically according to task requirements and environmental conditions, for example placing the object in the idle area nearest to the current position, or selecting a suitable placement position according to the attributes of the object; for a target position, the robot needs to place the object at a predetermined target position, such as within a designated mark or area.
For example, the body perception input is represented as a set of different states S = {s1, s2, ..., sn}, where each si denotes a state such as joint velocity or joint position. A set of linear layers, denoted Linear = {l1, l2, ..., ln}, is used, where the dimension of each li is [dim(si), 256, 8] and dim(si) denotes the dimension of state si. The corresponding output o = {l1(s1), l2(s2), ..., ln(sn)} is then obtained; the mapped states are concatenated into a sequence of size n×8, and the body information is connected with the picture features extracted by ResNet using cross attention, constructed with the following formula:
Attn(z, o) = softmax( Q_z (K_o)ᵀ / √d ) V_o;

where Attn(z, o) denotes the cross-attention result, z denotes the feature vector sequence of the feature map, Q_z denotes the matrix representation of the feature vector sequence z, K_o denotes the matrix representation of the keys in the output sequence o, V_o denotes the matrix representation of the values in the output sequence o, d denotes the dimension of the keys, and (K_o)ᵀ denotes the transpose of K_o.
In the cross attention, a similarity calculation between Q_z and the keys (Key) of the o sequence yields a score matrix; the score matrix is then normalized to obtain the attention weight matrix; finally, the values (Value) in the o sequence are weighted and summed using the attention weight matrix to obtain the cross-attention result. On this basis, the information of the Q_z sequence and the o sequence can be combined through the cross-attention mechanism to obtain a more comprehensive representation. The output of cross attention can be seen as a weighted representation of the Q_z sequence, where the weights reflect the correlation between the Q_z sequence and the o sequence; that is, cross attention improves the performance of the model on various tasks by capturing the correlation information between different sequences.
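The cross-attention fusion can be sketched as below; the linear projections producing Q, K and V and the key dimension d = 64 are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    def __init__(self, dim_z=512, dim_o=8, d=64):
        super().__init__()
        self.q = nn.Linear(dim_z, d)   # queries from the image-feature sequence z
        self.k = nn.Linear(dim_o, d)   # keys from the body-perception sequence o
        self.v = nn.Linear(dim_o, d)   # values from the body-perception sequence o
        self.d = d

    def forward(self, z, o):
        Q, K, V = self.q(z), self.k(o), self.v(o)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d)   # similarity between z and o
        weights = scores.softmax(dim=-1)                       # attention weight matrix
        return weights @ V                                     # weighted sum of the values
```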
The mapped states are concatenated and input to an MLP, which makes the decision and outputs the action; finally, an imitation-learning loss function guides the training. The second loss function is:

L_BC = ‖ π(z, Attn(z, o), goal) − a ‖²;

where L_BC denotes the second loss function, goal denotes the set task target, a denotes an action in the expert data, and π denotes the trained policy. The policy π is parameterized as a three-layer multi-layer perceptron (MLP); z and Attn(z, o) are processed by the ResNet pooling layer, and both are concatenated together with the target goal as the input.
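A sketch of the three-layer MLP policy π(z, Attn(z, o), goal) and the behavior-cloning loss L_BC is shown below; the hidden width of 256 and the way the three inputs are concatenated are assumptions.

```python
import torch
import torch.nn as nn

class Policy(nn.Module):
    def __init__(self, in_dim, action_dim, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, z_pooled, attn_pooled, goal):
        # z and Attn(z, o) are the pooled image-side features; goal encodes the set task target.
        return self.mlp(torch.cat([z_pooled, attn_pooled, goal], dim=-1))

def bc_loss(policy, z_pooled, attn_pooled, goal, expert_action):
    # L_BC = || pi(z, Attn(z, o), goal) - a ||^2, averaged over the batch.
    pred = policy(z_pooled, attn_pooled, goal)
    return ((pred - expert_action) ** 2).sum(dim=-1).mean()
```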
And finally, performing model training on the second loss function to obtain a robot grabbing visual representation model.
According to the embodiment of the application, the depth information is learned by utilizing a large-scale data set in the pre-training stage, and then the 3D visual representation learning is applied to the robot-controlled 2D representation learning, so that the visual perception information can be better captured, and better characteristic representation can be extracted under the same network architecture. By means of the transfer learning mode, training time and cost can be effectively saved, and better performance can be obtained under the condition of insufficient training data. On the other hand, the body perception information of the robot is fused with the visual perception information by adopting a body perception information fusion mode of the robot, so that the effect of the mechanical arm grabbing path planning can be effectively improved. Physical limitation and environmental constraint of the robot can be better considered by fusing the body perception information of the robot, so that a grabbing path can be accurately planned. The body perception information fusion mode can play an important role in robot operation, and the operation precision and efficiency of the robot are improved.
For further explanation of the robot gripping planning method according to the present application, refer to fig. 2 to 3 and the following embodiments.
The embodiment of the application specifically provides a robot grabbing planning method with depth-aware pre-training, which realizes grabbing with a visual model in the absence of 3D information. By constructing a depth-aware pre-training framework, visual robotic tasks can be facilitated without depth information in both the policy training and inference stages. Meanwhile, a body-perception (proprioception) injection method is provided, which can extract useful representations from the robot's state and promote information fusion in the deep neural network. The method specifically comprises the following steps:
(1) Depth perception self-supervision contrast learning.
Contrastive learning at the pixel level: given an input RGB image and its corresponding depth map, the RGB image is first cropped into two views to obtain a first RGB view and a second RGB view, and the cropped RGB views are enhanced; the resolutions of the two RGB views are then adjusted to be consistent, the RGB views are encoded by an encoder and a momentum encoder, and a projector projects the feature maps produced by the encoding so that the distance between the two RGB views can be determined. The positive and negative samples at the image level are:

A_image(i1, j1) = 1 if dist(i1, j1) < T, and 0 otherwise;

where A_image(i1, j1) denotes the positive/negative sample judgment matrix of the RGB image, i1 denotes the index of a vector of the first RGB view, j1 denotes the index of a vector of the second RGB view, dist denotes the normalized Euclidean distance between i1 and j1 in the feature space, and T denotes the distance threshold, with T set to {0.3, 0.5, 0.7}.
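The two-branch encoding path (encoder, momentum encoder and projector) might look as sketched below; the ResNet-18 backbone, the 512-dimensional projector and the EMA momentum value are assumptions for illustration.

```python
import copy
import torch
import torch.nn as nn
import torchvision

encoder = torchvision.models.resnet18(num_classes=512)   # online backbone producing 512-d features
momentum_encoder = copy.deepcopy(encoder)                 # momentum branch, not updated by gradients
for p in momentum_encoder.parameters():
    p.requires_grad_(False)

projector = nn.Sequential(nn.Linear(512, 512), nn.ReLU(inplace=True), nn.Linear(512, 512))

@torch.no_grad()
def momentum_update(m=0.99):
    # Exponential moving average of the online encoder's weights into the momentum encoder.
    for p_online, p_momentum in zip(encoder.parameters(), momentum_encoder.parameters()):
        p_momentum.mul_(m).add_(p_online, alpha=1.0 - m)
```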
For the depth map, it is cropped at the same positions as the RGB image to obtain a first depth view and a second depth view, and the cropped depth views are enhanced with the same enhancement strategy. The cropped depth views are then mapped to the size of the 7×7 feature maps, on the basis of which the inter-pixel differences between the adjusted depth views can be calculated. The positive and negative samples at the depth level are:

A_depth(i2, j2) = 1 if |depth(i2) − depth(j2)| < T′, and 0 otherwise;

where A_depth(i2, j2) denotes the positive/negative sample judgment matrix of the depth map, i2 denotes the index of a vector of the first depth view, j2 denotes the index of a vector of the second depth view, depth denotes the normalized depth value, and T′ denotes the threshold of the depth map, with T′ set to {0.3, 0.5, 0.7}.
A_image(i1, j1) is calculated for the three distance thresholds T, and A_depth(i2, j2) for the three depth-map thresholds T′; the sample judgment matrix is then obtained by element-wise multiplication of A_image(i1, j1) and A_depth(i2, j2), that is:

A(i, j) = A_image(i1, j1) × A_depth(i2, j2);

where A(i, j) denotes the sample judgment matrix containing positive and negative samples, i.e., the positive/negative sample judgment matrix; i denotes the index of a vector of the first view and j denotes the index of a vector of the second view, where the first view comprises the first RGB view and the first depth view, and the second view comprises the second RGB view and the second depth view.
On the basis of the depth information, for a pixel i′ in the first view that is also located in the second view, the positive pixel group and the negative pixel group are defined as:

Ω_p(i′) = { j | A(i′, j) = 1 },  Ω_n(i′) = { j | A(i′, j) = 0 };

where Ω_p(i′) denotes the positive pixel group and Ω_n(i′) denotes the negative pixel group; p denotes the pixel group in the first view and n denotes the pixel group in the second view. The contrastive learning loss at the pixel level is:

L_pix(i) = −log [ Σ_{j∈Ω_p(i)} exp(x_i · x′_j / τ) / ( Σ_{j∈Ω_p(i)} exp(x_i · x′_j / τ) + Σ_{k∈Ω_n(i)} exp(x_i · x′_k / τ) ) ];

where x_i denotes the normalized representation of the original sample, x′_j denotes the normalized representation of a positive sample, x′_k denotes the normalized representation of a negative sample, and τ denotes the temperature hyperparameter. The smaller τ is, the closer the softmax function is to a hard max; the larger τ is, the closer its output is to a uniform distribution.
The embodiment of the application mainly uses pixel-level and instance-level contrastive learning for training, where the pixel-level loss function is:

L_pix = ( L_pix(x) + L_pix(j) ) / 2;

where L_pix denotes the pixel level loss function, L_pix(x) denotes the loss function of the first view and L_pix(j) denotes the loss function of the second view.
Because the method provided by the embodiment of the application does not depend entirely on pixel-level contrastive learning, robotic manipulation is less sensitive to pixel-level features. The pixel-level training process therefore uses the same data loader and backbone encoder in conjunction with an instance-level pretext task. Specifically, separate projection heads are applied to the output of the encoder, generating two views, namely a third RGB view q and a fourth RGB view q′. Meanwhile, a prediction head network follows the projection heads and generates the feature matrix k of the third RGB view and the feature matrix k′ of the fourth RGB view. The instance level loss function is:
L_ins = ( qᵀ k′ + kᵀ q′ ) / 2;

where L_ins denotes the instance level loss function, qᵀ denotes the transpose of the third RGB view q, and kᵀ denotes the transpose of the feature matrix k of the third RGB view.
The final pre-training loss function (i.e., the first loss function) is:

L = L_pix + α · L_ins;

where α denotes a weight, and α is set to 1.
(2) Downstream robotic arm perception task.
The present embodiment uses the designed depth-aware pre-trained ResNet network. Specifically, the body perception input is represented as a set of different states S = {s1, s2, ..., sn}, where each si denotes a state such as joint velocity or joint position. A set of linear layers, denoted Linear = {l1, l2, ..., ln}, is used, where the dimension of each li is [dim(si), 256, 8] and dim(si) denotes the dimension of state si. The corresponding output o = {l1(s1), l2(s2), ..., ln(sn)} is then obtained; the mapped states are concatenated into a sequence of size n×8, and the body information is connected with the picture features extracted by ResNet using cross attention, constructed with the following formula:
Attn(z, o) = softmax( Q_z (K_o)ᵀ / √d ) V_o;

where Attn(z, o) denotes the cross-attention result, z denotes the feature vector sequence of the feature map, Q_z denotes the matrix representation of the feature vector sequence z, K_o denotes the matrix representation of the keys in the output sequence o, V_o denotes the matrix representation of the values in the output sequence o, d denotes the dimension of the keys, and (K_o)ᵀ denotes the transpose of K_o.
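The body-perception encoding can be sketched as below: each state si gets its own linear stack of shape [dim(si), 256, 8], and the outputs are stacked into the n×8 sequence o used by the cross attention. The state names and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

state_dims = {"joint_position": 7, "joint_velocity": 7, "gripper_position": 1}  # example states s_i

proprio_encoders = nn.ModuleDict({
    name: nn.Sequential(nn.Linear(dim, 256), nn.ReLU(inplace=True), nn.Linear(256, 8))
    for name, dim in state_dims.items()
})

def encode_body_perception(states):
    """states: dict of name -> tensor of shape [batch, dim(s_i)]; returns the sequence o of shape [batch, n, 8]."""
    outputs = [proprio_encoders[name](states[name]) for name in state_dims]
    return torch.stack(outputs, dim=1)
```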
The mapped states are concatenated and input to an MLP, which makes the decision and outputs the action; finally, an imitation-learning loss function guides the training. The second loss function is:

L_BC = ‖ π(z, Attn(z, o), goal) − a ‖²;

where L_BC denotes the second loss function, goal denotes the set task target, a denotes an action in the expert data, and π denotes the trained policy. The policy π is parameterized as a three-layer multi-layer perceptron (MLP); z and Attn(z, o) are processed by the ResNet pooling layer, and both are concatenated together with the target goal as the input.
In one example, referring to fig. 3, a robotic grasping planning method with depth-aware pre-training includes the steps of:
Step one: acquire the RGB image of the target object (i.e., the object to be grabbed) and the body perception information of the robot, such as joint positions, joint velocities, gripper position, the set task target and the distance from the current position.
Step two: after the RGB image and the body perception information are input into the robot grabbing visual representation model, the RGB image is encoded through an encoder, and a feature map of the RGB image is obtained.
Step three: and carrying out characterization processing on the body perception information to obtain a feature sequence, and simultaneously determining matrix representation K of keys in the feature sequence and matrix representation V of a median value in the feature sequence.
Step four: and performing cross attention operation and cross attention result on the feature graphs, K and V.
Step five: and carrying out pooling treatment on the feature map by adopting a global average pooling layer avg_pool to obtain a feature vector. In ResNet, avg_pool is used to convert the feature map output by the encoder into a fixed length feature vector. Specifically, the size of the feature map output by the last convolution layer of ResNet a 18 is 7×7×512 (width is 7, height is 7, channel number is 512), and the avg_pool operation will average pool the feature map, and average pool the feature map of each channel to obtain a feature vector of 1×1×512. It will be appreciated that the avg_pool operation averages the features of each channel in the feature map, averages the features of each channel, and then obtains a 512-dimensional one-dimensional matrix, where the 512-dimensional feature vector may be regarded as a global feature of the entire image, and is used to represent the content of the image. Through avg_pool operation, the feature of the image can be reduced to a feature vector with a fixed length, so that the subsequent classification or the processing of other tasks is convenient, the parameter number and the calculation complexity are reduced, and meanwhile, important global feature information is reserved.
Step six: and connecting the 512-dimensional one-dimensional matrix output by the avg_pool, the cross attention result and the set task target to form a vector or matrix which is used as the input of the neural Network model Policy Network. Where the Policy Network refers to a neural Network model used for decision making in reinforcement learning, the Policy Network may employ different architectures, such as a feed forward neural Network or a recurrent neural Network, with the training goal of learning an optimized strategy by maximizing jackpot so that the optimal action is selected in a given state.
For example, the inputs may simply be concatenated, or combined through specific operations (such as splicing or superposition); the connected input can be fed directly into the Policy Network, or further feature extraction and conversion can first be performed through a preprocessing layer (such as a fully connected layer or a convolution layer) before it is input into the Policy Network. The connected input provides more comprehensive information, so that the Policy Network can better understand the structure and characteristics of the input data and thus make more accurate decisions.
Step seven: the robot grabs the visual representation model to output the predicted action, and then planning is carried out through the mechanical arm.
According to the embodiment of the application, the depth information is learned by utilizing a large-scale data set in the pre-training stage, and then the 3D visual representation learning is applied to the robot-controlled 2D representation learning, so that the visual perception information can be better captured, and better characteristic representation can be extracted under the same network architecture. By means of the transfer learning mode, training time and cost can be effectively saved, and better performance can be obtained under the condition of insufficient training data. On the other hand, the body perception information of the robot is fused with the visual perception information by adopting a body perception information fusion mode of the robot, so that the effect of the mechanical arm grabbing path planning can be effectively improved. Physical limitation and environmental constraint of the robot can be better considered by fusing the body perception information of the robot, so that a grabbing path can be accurately planned. The body perception information fusion mode can play an important role in robot operation, and the operation precision and efficiency of the robot are improved.
Based on the above embodiments, an embodiment of the present application provides a robot gripping and planning device, including:
An acquisition module 401, configured to acquire an RGB image of a target object and body sensing information of a robot; the body perception information characterizes the state information of the robot and the perception information of surrounding environment and objects;
The recognition module 402 is configured to input the RGB image and the ontology sensing information to a robot capture vision characterization model, and obtain capture motion information of the robot output by the robot capture vision characterization model; the robot grabbing visual characterization model is obtained by training based on sample ontology sensing information, a sample RGB image and a depth map corresponding to the sample RGB image;
The planning module 403 is configured to determine grasping planning information of the robot based on the grasping action information.
According to the robot grabbing planning device provided by the embodiment of the application, the RGB image of the target object and the body perception information of the robot are obtained; the body perception information represents the state information of the robot and the perception information of surrounding environment and objects; inputting the RGB image and the body perception information into a robot grabbing visual representation model to acquire grabbing action information of the robot output by the robot grabbing visual representation model; the robot grabbing visual characterization model is obtained by training based on sample ontology sensing information, a sample RGB image and a depth map corresponding to the sample RGB image; and determining grabbing planning information of the robot based on the grabbing action information. According to the embodiment of the application, the depth information is learned by utilizing a large-scale data set in the pre-training stage, and then the 3D visual representation learning is applied to the 2D representation learning controlled by the robot, so that the visual perception information can be better captured, meanwhile, the operation precision and efficiency of the robot are improved in a body perception information fusion mode, and the grabbing is realized under the condition of no 3D information, so that the grabbing accuracy is improved.
In one embodiment, the robotic grasping and planning apparatus further comprises a model training module for:
Determining a first loss function based on the sample RGB image and the depth map;
training an encoder based on the first loss function, and determining a feature map of the sample RGB image based on the trained encoder;
Determining a second loss function based on the feature map and the sample ontology sensing information;
and performing model training on the second loss function to obtain the robot grabbing visual representation model.
In one embodiment, the model training module is further configured to:
determining a sample judgment matrix based on the sample RGB image and the depth map; the sample judgment matrix comprises a positive sample and a negative sample;
determining a positive pixel group and a negative pixel group based on the sample judgment matrix;
determining a pixel level loss function based on the positive pixel group and the negative pixel group;
the first loss function is determined based on the pixel-level loss function and an instance-level loss function.
In one embodiment, the model training module is further configured to:
carrying out average pooling treatment on the feature map to obtain feature vectors of the feature map;
determining a cross attention result of a feature vector sequence of the feature map and a matrix sequence of the sample body perception information;
and determining the second loss function based on the feature vector, the cross attention result and a set task target, wherein the set task target represents the position where the target object is stored.
In one embodiment, the model training module is further configured to:
Clipping the sample RGB image and the depth map to obtain a first view and a second view; the first view includes a first RGB view and a first depth view; the second view includes a second RGB view and a second depth view;
determining a distance between the first RGB view and the second RGB view, and a depth value between the first depth view and the second depth view;
Determining a first sample judgment matrix of the sample RGB image based on the distance and a distance threshold;
Determining a second sample judgment matrix of the depth map based on the depth value and the depth threshold value;
And determining the sample judgment matrix based on the first sample judgment matrix and the second sample judgment matrix.
In one embodiment, the model training module is further configured to:
Determining a loss function of the first view and a loss function of the second view based on the positive pixel group and the negative pixel group;
the pixel level loss function is determined based on the loss function of the first view and the loss function of the second view.
In one embodiment, the model training module is further configured to:
Encoding different view angles of the feature map of the first RGB view to obtain a third RGB view;
Encoding different view angles of the feature map of the second RGB view to obtain a fourth RGB view;
determining a feature matrix of the third RGB view and a feature matrix of the fourth RGB view based on a prediction head network;
The instance level loss function is determined based on the third RGB view and its feature matrix, the fourth RGB view and its feature matrix.
Fig. 5 illustrates a physical schematic diagram of an electronic device, as shown in fig. 5, which may include: processor 510, communication interface (Communications Interface) 520, memory 530, and communication bus 540, wherein processor 510, communication interface 520, memory 530 complete communication with each other through communication bus 540. Processor 510 may invoke logic instructions in memory 530 to perform the following method:
acquiring an RGB image of a target object and body perception information of a robot, wherein the body perception information characterizes state information of the robot and perception information of the surrounding environment and objects;
inputting the RGB image and the body perception information into a robot grabbing visual representation model to acquire grabbing action information of the robot output by the robot grabbing visual representation model, wherein the robot grabbing visual representation model is obtained by training based on sample body perception information, a sample RGB image and a depth map corresponding to the sample RGB image;
and determining grabbing planning information of the robot based on the grabbing action information.
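Purely as a usage illustration, reusing the hypothetical encoder and fusion head from the training sketch above and an assumed 7-dimensional action layout, the processor-side flow could look as follows.

```python
# Hypothetical inference flow executed by the processor; the action layout
# (6-DoF end-effector pose + gripper command) is an assumption.
import torch

@torch.no_grad()
def plan_grasp(encoder, head, rgb_image, body_perception):
    # rgb_image: (1, 3, H, W); body_perception: (1, proprio_dim)
    feature_map = encoder(rgb_image)                    # visual representation
    grasp_action = head(feature_map, body_perception)   # grabbing action information
    # Grabbing planning information handed to the downstream motion planner.
    return {"end_effector_pose": grasp_action[..., :6],
            "gripper_command": grasp_action[..., 6:]}
```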
Further, the logic instructions in the memory 530 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present application, in essence, or the part thereof contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, which includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
In still another aspect, an embodiment of the present application further provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the robot gripping planning method provided in the above embodiments, the method including, for example:
acquiring an RGB image of a target object and body perception information of a robot, wherein the body perception information characterizes state information of the robot and perception information of the surrounding environment and objects;
inputting the RGB image and the body perception information into a robot grabbing visual representation model to acquire grabbing action information of the robot output by the robot grabbing visual representation model, wherein the robot grabbing visual representation model is obtained by training based on sample body perception information, a sample RGB image and a depth map corresponding to the sample RGB image;
and determining grabbing planning information of the robot based on the grabbing action information.
In yet another aspect, an embodiment of the present application further provides a computer program product, the computer program product including a computer program, where the computer program may be stored on a non-transitory computer-readable storage medium, and when the computer program is executed by a processor, the steps of the robot gripping planning method provided in the foregoing embodiments can be executed, the method including, for example:
acquiring an RGB image of a target object and body perception information of a robot, wherein the body perception information characterizes state information of the robot and perception information of the surrounding environment and objects;
inputting the RGB image and the body perception information into a robot grabbing visual representation model to acquire grabbing action information of the robot output by the robot grabbing visual representation model, wherein the robot grabbing visual representation model is obtained by training based on sample body perception information, a sample RGB image and a depth map corresponding to the sample RGB image;
and determining grabbing planning information of the robot based on the grabbing action information.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the solution without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or, of course, by means of hardware. Based on this understanding, the foregoing technical solution, in essence or in the part contributing to the prior art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and which includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the respective embodiments or in some parts of the embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present application, not to limit it. Although the application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.
The above embodiments are only intended to illustrate the present application and are not limiting. Although the application has been described in detail with reference to the embodiments, those skilled in the art will appreciate that various combinations, modifications, or equivalent substitutions may be made to the technical solution of the present application without departing from its spirit and scope, and all such changes are intended to fall within the scope of the claims of the present application.

Claims (11)

1. A robot gripping planning method, comprising:
acquiring an RGB image of a target object and body perception information of a robot, wherein the body perception information characterizes state information of the robot and perception information of the surrounding environment and objects;
inputting the RGB image and the body perception information into a robot grabbing visual representation model to acquire grabbing action information of the robot output by the robot grabbing visual representation model, wherein the robot grabbing visual representation model is obtained by training based on sample body perception information, a sample RGB image and a depth map corresponding to the sample RGB image;
and determining grabbing planning information of the robot based on the grabbing action information.
2. The robot gripping planning method of claim 1, wherein the robot grabbing visual representation model is trained based on the following steps:
determining a first loss function based on the sample RGB image and the depth map;
training an encoder based on the first loss function, and determining a feature map of the sample RGB image based on the trained encoder;
determining a second loss function based on the feature map and the sample body perception information;
and performing model training based on the second loss function to obtain the robot grabbing visual representation model.
3. The robot gripping planning method of claim 2, wherein the determining a first loss function based on the sample RGB image and the depth map comprises:
determining a sample judgment matrix based on the sample RGB image and the depth map; the sample judgment matrix comprises positive samples and negative samples;
determining a positive pixel group and a negative pixel group based on the sample judgment matrix;
determining a pixel-level loss function based on the positive pixel group and the negative pixel group;
the first loss function is determined based on the pixel-level loss function and an instance-level loss function.
4. The robot gripping planning method of claim 2, wherein the determining a second loss function based on the feature map and the sample body perception information comprises:
performing average pooling on the feature map to obtain a feature vector of the feature map;
determining a cross-attention result between a feature vector sequence of the feature map and a matrix sequence of the sample body perception information;
and determining the second loss function based on the feature vector, the cross-attention result and a set task target, wherein the set task target represents the position where the target object is to be placed.
5. The robot gripping planning method according to claim 3, wherein the determining a sample judgment matrix based on the sample RGB image and the depth map comprises:
cropping the sample RGB image and the depth map to obtain a first view and a second view; the first view includes a first RGB view and a first depth view, and the second view includes a second RGB view and a second depth view;
determining a distance between the first RGB view and the second RGB view, and a depth value between the first depth view and the second depth view;
determining a first sample judgment matrix of the sample RGB image based on the distance and a distance threshold;
determining a second sample judgment matrix of the depth map based on the depth value and a depth threshold;
and determining the sample judgment matrix based on the first sample judgment matrix and the second sample judgment matrix.
6. The robot gripping planning method of claim 5, wherein the determining a pixel-level loss function based on the positive pixel group and the negative pixel group comprises:
determining a loss function of the first view and a loss function of the second view based on the positive pixel group and the negative pixel group;
the pixel-level loss function is determined based on the loss function of the first view and the loss function of the second view.
7. The robot gripping planning method of claim 5, wherein the step of determining the instance-level loss function is as follows:
encoding the feature map of the first RGB view at different view angles to obtain a third RGB view;
encoding the feature map of the second RGB view at different view angles to obtain a fourth RGB view;
determining a feature matrix of the third RGB view and a feature matrix of the fourth RGB view based on a prediction head network;
the instance-level loss function is determined based on the third RGB view and its feature matrix, and on the fourth RGB view and its feature matrix.
8. A robot gripping planning apparatus, comprising:
an acquisition module, configured to acquire an RGB image of a target object and body perception information of a robot, wherein the body perception information characterizes state information of the robot and perception information of the surrounding environment and objects;
an identification module, configured to input the RGB image and the body perception information into a robot grabbing visual representation model and acquire grabbing action information of the robot output by the robot grabbing visual representation model, wherein the robot grabbing visual representation model is obtained by training based on sample body perception information, a sample RGB image and a depth map corresponding to the sample RGB image;
and a planning module, configured to determine grabbing planning information of the robot based on the grabbing action information.
9. An electronic device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the robot gripping planning method according to any one of claims 1 to 7.
10. A non-transitory computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the robot gripping planning method according to any one of claims 1 to 7.
11. A computer program product, comprising a computer program, characterized in that the computer program, when executed by a processor, implements the robot gripping planning method according to any one of claims 1 to 7.
CN202410200749.4A 2024-02-22 2024-02-22 Robot gripping planning method, apparatus, electronic device, storage medium and computer program product Pending CN117885101A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410200749.4A CN117885101A (en) 2024-02-22 2024-02-22 Robot gripping planning method, apparatus, electronic device, storage medium and computer program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410200749.4A CN117885101A (en) 2024-02-22 2024-02-22 Robot gripping planning method, apparatus, electronic device, storage medium and computer program product

Publications (1)

Publication Number Publication Date
CN117885101A true CN117885101A (en) 2024-04-16

Family

ID=90646870

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410200749.4A Pending CN117885101A (en) 2024-02-22 2024-02-22 Robot gripping planning method, apparatus, electronic device, storage medium and computer program product

Country Status (1)

Country Link
CN (1) CN117885101A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination