CN115578460A - Robot grabbing method and system based on multi-modal feature extraction and dense prediction - Google Patents

Info

Publication number: CN115578460A (Application CN202211407718.3A)
Authority: CN (China)
Other languages: Chinese (zh)
Other versions: CN115578460B
Prior art keywords: dimensional, pixel, dense, network, prediction
Legal status: Granted; Active
Inventors: 袁小芳, 刘学兵, 朱青, 王耀南, 毛建旭, 冯明涛, 吴成中, 周显恩, 黄嘉男, 周嘉铭
Current Assignee: Hunan University
Original Assignee: Hunan University
Application filed by Hunan University
Priority to CN202211407718.3A
Publication of CN115578460A
Application granted; publication of CN115578460B

Classifications

    • G06T7/73 — Image analysis; determining position or orientation of objects or cameras using feature-based methods
    • B25J19/02 — Accessories fitted to manipulators; sensing devices
    • B25J9/161 — Programme controls; hardware, e.g. neural networks, fuzzy logic, interfaces, processor
    • B25J9/1661 — Programme controls characterised by programming, planning systems; task planning, object-oriented languages
    • G06V10/454 — Local feature extraction; integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V10/766 — Image or video recognition using pattern recognition or machine learning; using regression
    • G06V10/806 — Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
    • G06V10/82 — Image or video recognition using neural networks
    • G06T2207/10028 — Image acquisition modality: range image; depth image; 3D point clouds


Abstract

The invention discloses a robot grabbing method and system based on multi-modal feature extraction and dense prediction. A scene color image and a depth image are acquired; a scene three-dimensional point cloud and adaptive convolution receptive fields of different scales are calculated from the depth image, and a surface normal vector image is obtained from the scene three-dimensional point cloud. A multi-modal feature extraction and dense prediction network is constructed and used to process the scene color image and the surface normal vector image, yielding dense three-dimensional attitude information and three-dimensional position information predicted for each object; from these, the three-dimensional attitude and the three-dimensional position of the corresponding object are calculated and together form its three-dimensional pose, which is sent to a robot grabbing system to complete the grabbing task for the corresponding object in the scene. The disclosed method fuses multi-modal color and depth data, retains two-dimensional plane features and depth information during feature extraction, has a simple structure and high prediction accuracy, and is suitable for robot grabbing tasks in complex scenes.

Description

Robot grabbing method and system based on multi-mode feature extraction and dense prediction
Technical Field
The invention relates to the field of robot three-dimensional vision, object three-dimensional pose estimation and grasping applications, and in particular to a robot grabbing method and system based on multi-modal feature extraction and dense prediction.
Background
Robot grabbing is an important task in the field of industrial automation, replacing manual work in complex, repetitive operations of product manufacturing such as part loading, assembly, sorting and handling. To complete a grabbing task accurately, the robot must identify the target object in the working scene with its vision system, accurately estimate the object's three-dimensional pose, and then perform the grabbing operation with its motion control system. In a typical industrial scene, however, parts come in many types and shapes, have poor surface texture and are placed randomly under uneven illumination, which poses great challenges to the robot's visual recognition and object three-dimensional pose estimation.
In recent years, with the development of sensor technology, small, low-cost three-dimensional cameras have been widely adopted. Compared with a two-dimensional camera, they provide additional scene depth and object surface geometric texture information, enriching the scene image information and improving the target recognition and pose estimation accuracy of visual algorithms. At present, two three-dimensional image processing modes are mainly used: first, the scene depth image is treated as an extra channel and combined with the three color channels to form a four-channel image before feature extraction and further processing; second, the color image and the depth image are converted into a scene three-dimensional point cloud, and feature extraction and target recognition are completed with point cloud processing methods. Among related processing algorithms, the traditional approach generally adopts template matching, which searches scene data for the best match of a predefined template of the target object in order to identify the object and estimate its pose; the templates depend on manual design, are strongly affected by noise, illumination and texture characteristics, and the resulting algorithms are therefore not robust.
In recent years, thanks to the development of deep learning, image processing methods based on convolutional neural networks have been widely applied with markedly improved results. DenseFusion, a leading object three-dimensional pose estimation method, combines the two three-dimensional data processing modes: color image information is processed with a two-dimensional convolution network, the point cloud converted from the depth image is processed with a point cloud network, and the features of different dimensions are then fused, yielding a significant performance improvement. However, converting the image data from a two-dimensional image to a serialized point cloud loses the scene's two-dimensional structure information and harms feature extraction, and the physical quantities of the color and depth images differ, so simple dimensional fusion cannot produce robust features.
Therefore, how to address feature extraction and information fusion between images of different dimensions and characteristics in three-dimensional vision, and how to design a regression model for the target object's pose parameters that meets the requirement of high-precision robot grasping, has become a problem to be solved by those skilled in the art.
Disclosure of Invention
The invention aims to provide a robot grabbing method and system based on multi-modal feature extraction and dense prediction, which use robot three-dimensional vision technology to effectively meet the pose estimation requirements of weakly textured, complex and diverse parts in industrial scenes.
In order to solve the technical problems, the invention provides a robot grasping method based on multi-modal feature extraction and dense prediction, which comprises the following steps:
s1, acquiring a color image and a depth image of a robot in a multi-class object capture scene;
s2, calculating scene three-dimensional point clouds and adaptive convolution receptive fields of different scales from the depth image, and obtaining a surface normal vector image according to the scene three-dimensional point clouds;
s3, constructing a multi-mode feature extraction and dense prediction network by combining self-adaptive convolution receptive fields of different scales, inputting a preset training set into the network for training to obtain the trained multi-mode feature extraction and dense prediction network, calculating a total loss value of the network according to a preset loss function, and reversely propagating and updating network parameters of the network to obtain an updated multi-mode feature extraction and dense prediction network;
s4, processing the scene color image and the surface normal vector image through the updated multi-modal feature extraction and dense prediction network to obtain dense three-dimensional attitude information and dense three-dimensional position information predicted by each object;
and S5, calculating the three-dimensional attitude of the corresponding object from the predicted dense three-dimensional attitude information of each type of object, calculating the three-dimensional position of the corresponding object from the predicted dense three-dimensional position information of each type of object, combining the three-dimensional attitude and the three-dimensional position into the three-dimensional pose of the corresponding object, and sending the three-dimensional pose to a robot grabbing system to complete the grabbing task for the corresponding object in the scene.
Preferably, the multi-modal feature extraction and dense prediction network in S3 includes a multi-modal feature extraction network and three regression branch networks; the multi-modal feature extraction network is configured to perform feature extraction and feature fusion on the scene color image and the surface normal vector image to obtain multi-modal features, and the three regression branch networks are configured to predict, from the multi-modal features, the pixel-by-pixel multi-class semantic information, three-dimensional attitude information and three-dimensional position information of the target objects, respectively.
Preferably, the multi-modal feature extraction network comprises a first convolution network, a second convolution network and a multi-scale feature fusion module, wherein the first convolution network extracts multi-scale color convolution features from a scene color image under the guidance of adaptive convolution receptive fields of different scales, the second convolution network extracts multi-scale normal vector convolution features from a surface normal vector image under the guidance of adaptive convolution receptive fields of different scales, and the multi-scale feature fusion module fuses the multi-scale color convolution features and the multi-scale normal vector convolution features to obtain the multi-modal features.
Preferably, the first convolution network and the second convolution network respectively use ResNet-18 as a backbone network, a third layer and subsequent convolution layers of the backbone network are abandoned, adaptive deep convolution receptive fields of different scales are used for replacing an original conventional convolution receptive field of the network, the multi-scale feature fusion module comprises a first sub-module and a second sub-module, the first sub-module is used for performing multi-mode convolution feature fusion on color convolution features and normal vector convolution features of the same scale in different scales to obtain multi-mode features of different scales, and the second sub-module performs up-sampling and scale information fusion on the obtained multi-mode features of different scales by adopting a feature pyramid structure to obtain the scene pixel-by-pixel multi-mode features.
Preferably, the three regression branch networks are a pixel-by-pixel semantic prediction network, a pixel-by-pixel three-dimensional attitude prediction network and a pixel-by-pixel three-dimensional position prediction network respectively, the pixel-by-pixel semantic prediction network performs intensive pixel-by-pixel semantic information prediction on the input multi-modal features to obtain pixel-by-pixel multi-class semantic information, the pixel-by-pixel three-dimensional attitude prediction network performs intensive pixel-by-pixel three-dimensional attitude prediction on the input multi-modal features to obtain pixel-by-pixel three-dimensional attitude information, and the pixel-by-pixel three-dimensional position prediction network performs intensive pixel-by-pixel three-dimensional position prediction on the input multi-modal features to obtain pixel-by-pixel three-dimensional position information.
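To make the architecture described above concrete, the following PyTorch-style sketch shows one possible layout of the multi-modal feature extraction network and the three regression branches. It is a minimal sketch under stated assumptions: all module names, channel sizes and the single-scale fusion are illustrative, and the adaptive receptive fields and feature pyramid of the patent are omitted for brevity.

```python
# Hypothetical PyTorch sketch of the multi-modal feature extraction and dense
# prediction network; channel sizes, module names and the single-scale fusion
# are assumptions, and the adaptive receptive fields / feature pyramid are omitted.
import torch
import torch.nn as nn
import torchvision

def truncated_resnet18():
    """ResNet-18 backbone with the third layer and subsequent convolution
    layers discarded (features kept up to 1/8 scale, 128 channels)."""
    net = torchvision.models.resnet18(weights=None)
    return nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool,
                         net.layer1, net.layer2)

class MultiModalDensePredictionNet(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.color_branch = truncated_resnet18()    # first convolution network (color image)
        self.normal_branch = truncated_resnet18()   # second convolution network (normal vector image)
        self.fuse = nn.Conv2d(128 + 128, 256, kernel_size=1)   # same-scale multi-modal fusion
        self.upsample = nn.Upsample(scale_factor=8, mode="bilinear", align_corners=False)
        # three regression branches on the per-pixel multi-modal features
        self.semantic_head = nn.Conv2d(256, num_classes + 1, kernel_size=1)  # per-pixel semantics
        self.pose_head = nn.Conv2d(256, 4, kernel_size=1)                    # per-pixel quaternion
        self.position_head = nn.Conv2d(256, 3, kernel_size=1)                # per-pixel 3D offset

    def forward(self, color, normals):
        f_rgb = self.color_branch(color)
        f_n = self.normal_branch(normals)
        f = self.upsample(self.fuse(torch.cat([f_rgb, f_n], dim=1)))
        return self.semantic_head(f), self.pose_head(f), self.position_head(f)
```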
Preferably, the scene three-dimensional point cloud is calculated from the depth image in S2 according to

    x = (u - c_x) · d / f_x,   y = (v - c_y) · d / f_y,   z = d,

where (x, y, z) are the three-dimensional point cloud coordinates, f_x, f_y, c_x and c_y are the camera intrinsic parameters, (u, v) are the depth image coordinates, and d is the depth value of the depth image at (u, v);

in S2, the adaptive convolution receptive fields of different scales are calculated from the depth image according to

    R_a(p) = R(p) + ΔR(p),

where R_a(p) is the adaptive depth convolution receptive field of a given scale at pixel p, R(p) is the conventional convolution receptive field at pixel p, and ΔR(p) is the offset of the sampling positions at pixel p;

in S2, the surface normal vector image is obtained from the scene three-dimensional point cloud as

    N = { n_i | i = 1, …, n },

where N is the surface normal vector image, P = { p_i | i = 1, …, n } is the set of all three-dimensional points in the scene, n is the number of points, and n_i is the surface normal vector associated with point p_i.
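As a worked illustration of the back-projection formula above, the following NumPy sketch converts a depth image into a point cloud with the intrinsics f_x, f_y, c_x, c_y; the finite-difference normal estimator shown afterwards is only one common way to derive a surface normal image from the point cloud and is an assumption, not the patent's exact formula.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth image (H, W) into an (H, W, 3) scene point cloud."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)

def surface_normal_image(points):
    """Per-pixel surface normals from the point cloud via the cross product of
    horizontal and vertical finite differences (a common estimator, used here
    only to illustrate building the normal vector image)."""
    dx = np.gradient(points, axis=1)
    dy = np.gradient(points, axis=0)
    n = np.cross(dx, dy)
    return n / (np.linalg.norm(n, axis=-1, keepdims=True) + 1e-8)
```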
Preferably, the dense three-dimensional attitude information predicted for each type of object in S4 is specifically

    R^p = [ q_1^p, q_2^p, q_3^p, q_4^p ],

where R^p is the three-dimensional attitude of the object at pixel p, R denotes the attitude, and q_i^p is the i-th component of the quaternion form of the object's three-dimensional attitude at pixel p;

the dense three-dimensional position information predicted for each type of object in S4 is specifically

    t^p = P_p + d_obj · Δt^p,

where t^p is the three-dimensional position of the object at pixel p, t denotes the position, Δt^p is the three-dimensional position offset predicted at pixel p, i.e. the unitized three-dimensional offset from the 3D point P_p corresponding to pixel p to the object's three-dimensional position t, and d_obj is the diameter of the three-dimensional model of the object.
Preferably, in S5, the three-dimensional attitude of the corresponding object is calculated from the predicted dense three-dimensional attitude information of each type of object according to

    R_obj = ( 1 / N_obj ) · Σ_p R^p,

where R_obj is the three-dimensional attitude of the object of category obj and N_obj is the number of dense predictions corresponding to the category-obj object;

in S5, the three-dimensional position of the corresponding object is calculated from the predicted dense three-dimensional position information of each type of object according to

    t_obj = ( 1 / N_obj ) · Σ_p t^p,

where t_obj is the three-dimensional position of the object of category obj, t^p are the predicted dense three-dimensional positions, and N_obj is the number of dense predictions corresponding to the category-obj object.
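A minimal NumPy sketch of the mean-based aggregation in the two formulas above; the array layouts (per-pixel quaternions, per-pixel positions and a predicted label map) are assumptions made for illustration.

```python
import numpy as np

def aggregate_object_pose(quats, positions, semantic_labels, obj_id):
    """Average the dense per-pixel predictions belonging to class `obj_id`.

    quats:           (H, W, 4) per-pixel quaternion predictions
    positions:       (H, W, 3) per-pixel three-dimensional position predictions
    semantic_labels: (H, W)    predicted per-pixel class labels
    """
    mask = semantic_labels == obj_id          # the N_obj dense predictions of this object
    q = quats[mask].mean(axis=0)
    q /= np.linalg.norm(q)                    # re-normalise the averaged quaternion
    t = positions[mask].mean(axis=0)          # averaged three-dimensional position
    return q, t
```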
Preferably, the loss function preset in S3 is specifically

    L_total = λ_1 · L_sem + λ_2 · L_rot + λ_3 · L_pos,

where L_total is the total loss of the network; λ_1, λ_2 and λ_3 are the weight factors of the semantic prediction branch, the three-dimensional attitude prediction branch and the three-dimensional position prediction branch, respectively; L_sem is the loss function of the semantic prediction network, for which a cross-entropy loss is adopted; L_rot is the loss function of the three-dimensional attitude prediction network, computed from the predicted attitude R_pred and the true attitude R_gt over the C object classes in the scene and the N dense predictions of each class; and L_pos is the loss function of the three-dimensional position prediction network, computed analogously from the predicted position t_pred and the true position t_gt, where C is the number of object classes in the scene and N is the number of dense predictions corresponding to each class of objects.
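A hedged PyTorch sketch of the weighted total loss; the cross-entropy term follows the text, while the L1 distances used for the attitude and position branches are placeholder choices, since the patent gives the branch losses only as formula images.

```python
import torch.nn.functional as F

def total_loss(sem_logits, sem_gt, quat_pred, quat_gt, pos_pred, pos_gt,
               w_sem=1.0, w_rot=1.0, w_pos=1.0):
    """Weighted sum of the three branch losses (the weights stand in for lambda_1..3)."""
    l_sem = F.cross_entropy(sem_logits, sem_gt)   # semantic branch: cross-entropy, as in the text
    l_rot = F.l1_loss(quat_pred, quat_gt)         # attitude branch (placeholder distance)
    l_pos = F.l1_loss(pos_pred, pos_gt)           # position branch (placeholder distance)
    return w_sem * l_sem + w_rot * l_rot + w_pos * l_pos
```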
A robot grabbing system adopts a robot grabbing method based on multi-modal feature extraction and dense prediction to grab a target object in a scene, and comprises a robot pose calculation module, a communication module, a grabbing module and an image acquisition module,
the image acquisition module is used for acquiring color images and depth images under multi-class object scenes in real time and sending the color images and the depth images to the pose calculation module;
the pose calculation module calculates the pose of the target object by adopting the robot grabbing method based on multi-modal feature extraction and dense prediction and sends the pose to the grabbing module through the communication module;
the grabbing module receives the 6D pose information of the target object and grabs the target object.
According to the robot grabbing method and system based on multi-modal feature extraction and dense prediction, the scene depth image is converted into a scene three-dimensional point cloud and a surface normal vector image, and multi-modal feature extraction is performed jointly on the surface normal vector image and the scene color image.
Drawings
FIG. 1 is a flowchart of a robot grasping method based on multi-modal feature extraction and dense prediction according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a multi-modal feature extraction and dense prediction network in an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the present invention is further described in detail below with reference to the accompanying drawings.
In one embodiment, the robot grasping method based on multi-modal feature extraction and dense prediction specifically includes:
s1, acquiring a color image and a depth image of a robot in a multi-class object capture scene;
s2, calculating scene three-dimensional point clouds and adaptive convolution receptive fields of different scales from the depth image, and obtaining a surface normal vector image according to the scene three-dimensional point clouds;
s3, constructing a multi-mode feature extraction and dense prediction network by combining self-adaptive convolution receptive fields of different scales, inputting a preset training set into the network for training to obtain the trained multi-mode feature extraction and dense prediction network, calculating a total loss value of the network according to a preset loss function, and reversely propagating and updating network parameters of the network to obtain an updated multi-mode feature extraction and dense prediction network;
s4, processing the scene color image and the surface normal vector image through the updated multi-modal feature extraction and dense prediction network to obtain dense three-dimensional attitude information and dense three-dimensional position information predicted by each object;
and S5, calculating the three-dimensional attitude of the corresponding object from the predicted dense three-dimensional attitude information of each type of object, calculating the three-dimensional position of the corresponding object from the predicted dense three-dimensional position information of each type of object, combining the three-dimensional attitude and the three-dimensional position into the three-dimensional pose of the corresponding object, and sending the three-dimensional pose to a robot grabbing system to complete the grabbing task for the corresponding object in the scene.
Specifically, referring to fig. 1, fig. 1 is a flowchart of the robot grabbing method based on multi-modal feature extraction and dense prediction in an embodiment of the present invention; FIG. 2 is a schematic structural diagram of the multi-modal feature extraction and dense prediction network in an embodiment of the present invention.
In the robot grabbing method based on multi-modal feature extraction and dense prediction, color images and depth images of the robot in a multi-class object grabbing scene are first acquired. The scene three-dimensional point cloud and adaptive convolution receptive fields of different scales are then calculated from the depth image, and a surface normal vector image (the normal vector image in figure 2) is calculated from the scene three-dimensional point cloud data. Next, a multi-modal feature extraction and dense prediction network is constructed by combining the adaptive convolution receptive fields of different scales and is trained. The trained network then processes the scene color image and the surface normal vector image to obtain the predicted dense three-dimensional attitude information and dense three-dimensional position information of each type of object. The three-dimensional attitude of the corresponding object is calculated from the predicted dense attitude information by averaging; for the position, the three-dimensional position offset of each pixel is scaled by the object diameter and added to the corresponding three-dimensional point cloud to obtain the dense three-dimensional positions of the object, from which the three-dimensional position of the corresponding object is calculated by averaging. The three-dimensional attitude and the three-dimensional position together form the three-dimensional pose of the corresponding object. Finally, the three-dimensional pose is sent to the robot grabbing system to complete the grabbing task for the corresponding object in the scene.
In one embodiment, the multi-mode feature extraction and dense prediction network in S3 includes a multi-mode feature extraction network and three regression branch networks, the multi-mode feature extraction network is configured to perform feature extraction and feature fusion from the scene color image and the surface normal vector image to obtain multi-mode features, and the three regression branch networks are configured to respectively predict multi-class semantic information, three-dimensional posture information, and three-dimensional position information of the pixel-by-pixel target object from the multi-mode features.
In one embodiment, the multi-modal feature extraction network comprises a first convolution network, a second convolution network and a multi-scale feature fusion module, wherein the first convolution network extracts multi-scale color convolution features from a scene color image under the guidance of adaptive convolution receptive field with different scales, the second convolution network extracts multi-scale normal vector convolution features from a surface normal vector image under the guidance of adaptive convolution receptive field with different scales, and the multi-scale feature fusion module fuses the multi-scale color convolution features and the multi-scale normal vector convolution features to obtain the multi-modal features.
In one embodiment, the first convolution network and the second convolution network each use ResNet-18 as the backbone network, the third layer and subsequent convolution layers of the backbone are discarded, and adaptive depth convolution receptive fields of different scales replace the network's original conventional convolution receptive field; the multi-scale feature fusion module comprises a first sub-module and a second sub-module, the first sub-module performs multi-modal convolution feature fusion on the color convolution features and normal-vector convolution features of the same scale at each of the different scales to obtain multi-modal features of different scales, and the second sub-module applies a feature pyramid structure to up-sample and fuse the scale information of the obtained multi-modal features of different scales to obtain the scene pixel-by-pixel multi-modal features.
Specifically, the first convolution network and the second convolution network are two identical convolutional neural networks, both using ResNet-18 as the backbone network; the third layer and subsequent convolution layers of the backbone (i.e. the 1/16-scale layer and the convolution layers after it) are discarded, and the adaptive depth convolution receptive field R_a(p) replaces the network's original conventional convolution receptive field R(p) to guide the convolution. The two networks then extract convolution features from the scene color image and the surface normal vector image respectively, yielding their own multi-scale convolution feature layers. The multi-scale feature fusion module comprises two sub-modules: the first sub-module fuses the color and normal-vector convolution features of the same scale, at each of the different scales, to obtain multi-modal features of different scales; the second sub-module applies a feature pyramid structure to up-sample and fuse the scale information of the obtained multi-modal convolution features of different scales, giving the scene pixel-by-pixel multi-modal features.
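One way to realise convolution guided by precomputed adaptive receptive fields is a deformable convolution that consumes the per-pixel offsets ΔR(p); the use of torchvision.ops.DeformConv2d below is an implementation assumption, and the random offsets stand in for the depth-derived ones.

```python
import torch
from torchvision.ops import DeformConv2d

# 3x3 convolution whose sampling grid is shifted by per-pixel offsets.
conv = DeformConv2d(in_channels=64, out_channels=64, kernel_size=3, padding=1)

features = torch.randn(1, 64, 120, 160)       # feature map at one scale
# 2 offset values (dy, dx) per kernel position and output pixel; random here,
# whereas in the method they would be the depth-derived offsets delta R(p).
offsets = torch.randn(1, 2 * 3 * 3, 120, 160)
out = conv(features, offsets)                 # convolution with an adaptive receptive field
```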
In one embodiment, the three regression branch networks are a pixel-by-pixel semantic prediction network, a pixel-by-pixel three-dimensional posture prediction network and a pixel-by-pixel three-dimensional position prediction network respectively, the pixel-by-pixel semantic prediction network performs intensive pixel-by-pixel semantic information prediction on input multi-modal characteristics to obtain pixel-by-pixel multi-category semantic information, the pixel-by-pixel three-dimensional posture prediction network performs intensive pixel-by-pixel three-dimensional posture prediction on the input multi-modal characteristics to obtain pixel-by-pixel three-dimensional posture information, and the pixel-by-pixel three-dimensional position prediction network performs intensive pixel-by-pixel three-dimensional position prediction on the input multi-modal characteristics to obtain pixel-by-pixel three-dimensional position information.
Specifically, the three regression networks are a pixel-by-pixel semantic prediction network, a pixel-by-pixel three-dimensional attitude prediction network and a pixel-by-pixel three-dimensional position prediction network respectively, wherein the pixel-by-pixel semantic prediction network performs intensive pixel-by-pixel semantic information prediction on input pixel-by-pixel multi-modal characteristics to obtain pixel-by-pixel multi-category semantic information; the pixel-by-pixel three-dimensional attitude prediction network carries out intensive pixel-by-pixel three-dimensional attitude prediction on the input pixel-by-pixel multi-modal characteristics to obtain pixel-by-pixel three-dimensional attitude information; and carrying out intensive pixel-by-pixel three-dimensional position prediction on the input pixel-by-pixel multi-mode characteristics by the pixel-by-pixel three-dimensional position prediction network to obtain pixel-by-pixel three-dimensional position information. In addition, the predicted pixel-by-pixel multi-class semantic information is utilized to cut out the dense three-dimensional attitude information and the dense three-dimensional position information corresponding to each class of objects from the pixel-by-pixel three-dimensional attitude and position information.
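The cropping step described above, which uses the predicted per-pixel semantics to cut out each class's dense attitude and position predictions, can be sketched as follows; treating class id 0 as background and the array layouts are assumptions.

```python
import numpy as np

def crop_dense_predictions(sem_probs, quats, positions):
    """Use the per-pixel semantic prediction to cut out the dense attitude and
    position predictions of each object class (class id 0 assumed to be background)."""
    labels = sem_probs.argmax(axis=-1)                      # (H, W) predicted class per pixel
    per_class = {}
    for obj_id in np.unique(labels):
        if obj_id == 0:
            continue
        mask = labels == obj_id
        per_class[obj_id] = (quats[mask], positions[mask])  # dense predictions of this class
    return per_class
```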
In one embodiment, the scene three-dimensional point cloud is calculated from the depth image in S2 according to

    x = (u - c_x) · d / f_x,   y = (v - c_y) · d / f_y,   z = d,

where (x, y, z) are the three-dimensional point cloud coordinates, f_x, f_y, c_x and c_y are the camera intrinsic parameters, (u, v) are the depth image coordinates, and d is the depth value of the depth image;

in S2, the adaptive convolution receptive fields of different scales are calculated from the depth image according to

    R_a(p) = R(p) + ΔR(p),

where R_a(p) is the adaptive depth convolution receptive field of a given scale at pixel p, R(p) is the conventional convolution receptive field at pixel p, and ΔR(p) is the offset of the sampling positions at pixel p;

in S2, the surface normal vector image is obtained from the scene three-dimensional point cloud as

    N = { n_i | i = 1, …, n },

where N is the surface normal vector image, P = { p_i | i = 1, …, n } is the set of all three-dimensional points in the scene, n is the number of points, and n_i is the surface normal vector associated with point p_i.
Specifically, the scene three-dimensional point cloud and the surface normal vector image are calculated from the depth image. The scene three-dimensional point cloud P = { p_i | i ∈ ℕ, i ≤ n } is obtained by the back-projection

    x = (u - c_x) · d / f_x,   y = (v - c_y) · d / f_y,   z = d,

where f_x, f_y, c_x and c_y are the camera intrinsic parameters, (u, v) are the depth image coordinates, and d is the depth value corresponding to the depth image coordinates (u, v).

The surface normal vector image N = { n_i | i ∈ ℕ, i ≤ n } is obtained from P, where P is the set of all three-dimensional points in the scene, n is the number of points, n_i is the surface normal vector associated with point p_i, and ℕ is the set of natural numbers.

The adaptive depth convolution receptive fields of different scales R_a(p) are calculated from the depth image; they can be expressed as the conventional convolution receptive field R(p) plus an offset ΔR(p):

    R_a(p) = R(p) + ΔR(p),

where p is a pixel position, R(p) is the conventional convolution receptive field at pixel p, R_a(p) is the adaptive depth convolution receptive field of a given scale at pixel p, ΔR(p) is the offset of the sampling positions at pixel p, k × k is the convolution kernel size, and H × W is the size of the convolution network feature layer.
Further, the position offsets ΔR(p) are calculated as follows:

1) Obtain the 3D plane normal vector corresponding to the k × k 2D neighbourhood of pixel p. Using the camera intrinsic parameters and the depth values at the neighbourhood image coordinates, reconstruct the 3D points of the k × k neighbourhood of pixel p, then fit the 3D plane S passing through all of these 3D points; its normal vector n is the required normal vector, where P_p is the 3D point corresponding to pixel p.

2) Compute an orthogonal basis (e_1, e_2) of the 3D plane S, with e_1 fixed in the horizontal direction (zero vertical component). From the plane constraints, e_1 lies in S, is orthogonal to the normal vector n and has unit length, from which e_1 can be calculated; e_2 is then obtained as

    e_1 ⊥ n, |e_1| = 1,   e_2 = n × e_1.

3) Project the 3D plane S onto the 2D image plane. Construct a 3D mesh G in the plane S around P_p, spanned by the orthogonal basis with mesh coefficients a and b and a scale factor s, and project G onto the 2D plane according to the camera intrinsic parameters and the pinhole camera projection principle, which yields the offset ΔR(p) of pixel p. The 3D mesh in the plane S may be represented as

    G(a, b) = P_p + s · ( a · e_1 + b · e_2 ),

where G(a, b) are the 3D mesh points in the plane S, a and b are the mesh coefficients, and s is the scale factor.

4) Repeat steps 1)–3) to obtain the offsets ΔR(p) for all pixel positions.
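The NumPy sketch below illustrates steps 1)–3) for a single pixel: fit a plane to its k × k neighbourhood of 3D points, build an in-plane orthogonal basis with the first axis horizontal, span a 3D mesh and project it with the pinhole model to obtain 2D offsets. The least-squares (SVD) plane fit, the grid scale and the function signature are illustrative assumptions, not the patent's exact procedure.

```python
import numpy as np

def pixel_offsets(neigh_points, fx, fy, cx, cy, center_uv, k=3, scale=0.01):
    """Offsets for one pixel from its k x k neighbourhood of 3D points (illustrative sketch)."""
    pts = neigh_points.reshape(-1, 3)
    # 1) plane through the neighbourhood points; normal = smallest singular vector
    centroid = pts.mean(axis=0)
    _, _, vt = np.linalg.svd(pts - centroid)
    n = vt[-1]
    # 2) orthogonal basis in the plane, e1 fixed to the horizontal direction (zero y-component)
    e1 = np.array([n[2], 0.0, -n[0]])
    e1 /= np.linalg.norm(e1) + 1e-8
    e2 = np.cross(n, e1)
    # 3) 3D mesh in the plane, projected back to the image with the pinhole model
    r = (k - 1) // 2
    offs = np.zeros((k, k, 2))
    for a in range(-r, r + 1):
        for b in range(-r, r + 1):
            p3d = centroid + scale * (a * e1 + b * e2)
            u = fx * p3d[0] / p3d[2] + cx
            v = fy * p3d[1] / p3d[2] + cy
            # offset = projected sampling position minus the conventional grid position
            offs[b + r, a + r] = [v - (center_uv[1] + b), u - (center_uv[0] + a)]
    return offs
```

Repeating this over all pixels (step 4) yields the full offset field, which could then drive a deformable convolution such as the one sketched earlier.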
In one embodiment, the dense three-dimensional attitude information predicted for each type of object in S4 is specifically

    R^p = [ q_1^p, q_2^p, q_3^p, q_4^p ],

where R^p is the three-dimensional attitude of the object at pixel p, R denotes the attitude, and q_i^p is the i-th component of the quaternion form of the object's three-dimensional attitude at pixel p;

the dense three-dimensional position information predicted for each type of object in S4 is specifically

    t^p = P_p + d_obj · Δt^p,

where t^p is the three-dimensional position of the object at pixel p, t denotes the position, Δt^p is the three-dimensional position offset predicted at pixel p, representing the unitized three-dimensional offset from the 3D point P_p corresponding to pixel p to the object's three-dimensional position t, and d_obj is the diameter of the three-dimensional model of the object.
In one embodiment, in S5 the three-dimensional attitude of the corresponding object is calculated from the predicted dense three-dimensional attitude information of each type of object according to

    R_obj = ( 1 / N_obj ) · Σ_p R^p,

where R_obj is the three-dimensional attitude of the object of category obj and N_obj is the number of dense predictions corresponding to the category-obj object;

in S5, the three-dimensional position of the corresponding object is calculated from the predicted dense three-dimensional position information of each type of object according to

    t_obj = ( 1 / N_obj ) · Σ_p t^p,

where t_obj is the three-dimensional position of the object of category obj, t^p are the predicted dense three-dimensional positions, and N_obj is the number of dense predictions corresponding to the category-obj object.

Specifically, the three-dimensional attitude of the target object is obtained from the densely predicted three-dimensional attitudes by averaging.
When calculating the three-dimensional position of the target object, the predicted dense three-dimensional position information of each type of object is first computed as

    t^p = P_p + d_obj · Δt^p,

where t^p is the three-dimensional position of the object at pixel p, t denotes the position, and Δt^p is the three-dimensional position offset at pixel p, i.e. the unitized three-dimensional offset from the 3D point P_p corresponding to pixel p to the object's three-dimensional position t:

    Δt^p = ( t - P_p ) / d_obj,

where d_obj is the diameter of the three-dimensional model corresponding to the object of category obj. The predicted dense three-dimensional positions t^p of each type of object are thereby obtained, and the three-dimensional position of the object is finally calculated by averaging:

    t_obj = ( 1 / N_obj ) · Σ_p t^p.
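A small sketch of recovering the object position from the unitized offsets as in the formulas above, assuming the per-pixel offsets, the corresponding 3D points and the model diameter are available as arrays.

```python
import numpy as np

def object_position(points_3d, offsets, diameter_obj):
    """points_3d: (N, 3) 3D points of the pixels predicted as this object;
    offsets: (N, 3) predicted unitized offsets. Returns the averaged object position."""
    dense_positions = points_3d + diameter_obj * offsets   # t^p = P_p + d_obj * delta t^p
    return dense_positions.mean(axis=0)                    # mean over the N_obj dense predictions
```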
In one embodiment, the loss function preset in S3 is specifically

    L_total = λ_1 · L_sem + λ_2 · L_rot + λ_3 · L_pos,

where L_total is the total loss of the network; λ_1, λ_2 and λ_3 are the weight factors of the semantic prediction branch, the three-dimensional attitude prediction branch and the three-dimensional position prediction branch, respectively; L_sem is the loss function of the semantic prediction network, for which a cross-entropy loss is adopted; L_rot is the loss function of the three-dimensional attitude prediction network, computed from the predicted attitude R_pred and the true attitude R_gt; L_pos is the loss function of the three-dimensional position prediction network, computed from the predicted position t_pred and the true position t_gt; C is the number of object classes in the scene, and N is the number of dense predictions corresponding to each class of objects.
Specifically, the multi-modal feature extraction and dense prediction network is built from a multi-modal feature extraction network and three regression branch networks, where the multi-modal feature extraction network consists of two identical convolution networks and a multi-scale feature fusion module. The built network is trained on a training data set: the provided scene color and depth images, semantic masks and three-dimensional pose ground truths of all target objects supervise the network learning to obtain the optimal weight parameters. A loss function is preset for each regression branch, the total loss value of the network is calculated according to the preset loss functions, and the network parameters are updated by back-propagation, giving the updated multi-modal feature extraction and dense prediction network.
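A hedged sketch of one training pass matching the description above; the dataset field layout, the optimiser handling and the helper total_loss from the earlier loss sketch are assumptions.

```python
def train_one_epoch(model, loader, optimizer, device="cuda"):
    """One training pass: forward, total loss, back-propagation, parameter update."""
    model.train()
    for color, normals, sem_gt, quat_gt, pos_gt in loader:   # assumed dataset layout
        color, normals = color.to(device), normals.to(device)
        sem_gt, quat_gt, pos_gt = sem_gt.to(device), quat_gt.to(device), pos_gt.to(device)
        sem_logits, quat_pred, pos_pred = model(color, normals)
        loss = total_loss(sem_logits, sem_gt, quat_pred, quat_gt, pos_pred, pos_gt)
        optimizer.zero_grad()
        loss.backward()          # back-propagate and update the network parameters
        optimizer.step()
```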
A robot grasping system adopts the robot grasping method based on multi-modal feature extraction and dense prediction to grasp a target object in a scene, and comprises a robot pose calculation module, a communication module, a grasping module and an image acquisition module,
the image acquisition module is used for acquiring color images and depth images under multi-class object scenes in real time and sending the color images and the depth images to the pose calculation module;
the pose calculation module calculates the pose of the target object by adopting the robot grabbing method based on multi-modal feature extraction and dense prediction and sends the pose to the grabbing module through the communication module;
the grabbing module receives the 6D pose information of the target object and grabs the target object.
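A minimal orchestration sketch of the four modules; every class and method name here is hypothetical, since the patent does not specify a communication protocol or a robot API.

```python
def grasp_loop(camera, pose_estimator, link, gripper):
    """camera, pose_estimator, link and gripper stand in for the image acquisition,
    pose calculation, communication and grasping modules (hypothetical interfaces)."""
    color, depth = camera.capture()                  # image acquisition module
    poses = pose_estimator.estimate(color, depth)    # pose calculation module: 6D poses per object
    for obj_id, pose in poses.items():
        link.send(obj_id, pose)                      # communication module
        gripper.grasp(pose)                          # grasping module executes the grasp
```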
The robot grabbing method and system based on multi-modal feature extraction and dense prediction first acquire scene color and depth images, then calculate the scene three-dimensional point cloud, the surface normal vector image and adaptive convolution receptive fields of different scales from the depth image; a multi-modal feature extraction and dense prediction network is then constructed, trained and updated, and the updated network processes the scene color image and the surface normal vector image to obtain the predicted dense three-dimensional attitude information and dense three-dimensional position information of each type of object. The three-dimensional attitude of the corresponding object is calculated from the predicted dense three-dimensional attitude information of each type of object, the three-dimensional position of the corresponding object is calculated from the predicted dense three-dimensional position information of each type of object, and the three-dimensional attitude and the three-dimensional position together form the three-dimensional pose of the corresponding object, which is sent to the robot grabbing system to complete the grabbing task for the corresponding object in the scene. The method fuses multi-modal color and depth data, adopts a two-dimensional convolution structure, retains two-dimensional plane features and depth information during feature extraction, has a simple structure and high prediction accuracy, and is suitable for robot grabbing tasks in complex scenes.
It should be noted that, for those skilled in the art, without departing from the principle of the present invention, it is possible to make various improvements and modifications to the present invention, and those improvements and modifications also fall within the scope of the claims of the present invention.

Claims (10)

1. The robot grasping method based on multi-modal feature extraction and dense prediction is characterized by comprising the following steps of:
s1, acquiring a color image and a depth image of a robot under a multi-class object grabbing scene;
s2, calculating scene three-dimensional point clouds and adaptive convolution receptive fields with different scales from the depth images, and obtaining a surface normal vector image according to the scene three-dimensional point clouds;
s3, constructing a multi-modal feature extraction and dense prediction network by combining self-adaptive convolution receptive fields of different scales, inputting a preset training set into the network for training to obtain the trained multi-modal feature extraction and dense prediction network, calculating a total loss value of the network according to a preset loss function, and reversely propagating and updating network parameters of the network to obtain an updated multi-modal feature extraction and dense prediction network;
s4, processing the scene color image and the surface normal vector image through the updated multi-modal feature extraction and dense prediction network to obtain dense three-dimensional attitude information and dense three-dimensional position information predicted by each type of object;
and S5, calculating the three-dimensional attitude of the corresponding object from the predicted dense three-dimensional attitude information of each type of object, and calculating the three-dimensional position of the corresponding object from the predicted dense three-dimensional position information of each type of object, the three-dimensional attitude and the three-dimensional position together forming the three-dimensional pose of the corresponding object, and sending the three-dimensional pose to a robot grabbing system to complete the grabbing task for the corresponding object in the scene.
2. The robot grasping method based on multi-modal feature extraction and dense prediction according to claim 1, wherein the multi-modal feature extraction and dense prediction network in S3 includes a multi-modal feature extraction network for feature extraction and feature fusion from the scene color image and the surface normal vector image to obtain multi-modal features, and three regression branch networks for predicting, from the multi-modal features, the pixel-by-pixel multi-class semantic information, three-dimensional attitude information and three-dimensional position information of the target objects, respectively.
3. The robot grasping method based on multi-modal feature extraction and dense prediction as claimed in claim 2, wherein the multi-modal feature extraction network includes a first convolution network, a second convolution network and a multi-scale feature fusion module, wherein the first convolution network extracts multi-scale color convolution features from the scene color image under guidance of different-scale adaptive convolution receptive fields, the second convolution network extracts multi-scale normal vector convolution features from the surface normal vector image under guidance of different-scale adaptive convolution receptive fields, and the multi-scale feature fusion module fuses the multi-scale color convolution features and the multi-scale normal vector convolution features to obtain multi-modal features.
4. The robot grasping method based on multi-modal feature extraction and dense prediction as claimed in claim 3, wherein the first convolution network and the second convolution network respectively use ResNet-18 as a backbone network, a third layer and subsequent convolution layers of the backbone network are discarded, and an original conventional convolution receptive field of the network is replaced by an adaptive deep convolution receptive field of different scales, the multi-scale feature fusion module includes a first sub-module and a second sub-module, the first sub-module is used for performing multi-modal convolution feature fusion on color convolution features and normal vector convolution features of the same scale in different scales to obtain multi-modal features of different scales, and the second sub-module performs up-sampling and scale information fusion on the obtained multi-modal features of different scales by using a feature pyramid structure to obtain scene pixel-by-pixel multi-modal features.
5. The robot grasping method based on multi-modal feature extraction and dense prediction according to claim 3, wherein the three regression branch networks are a pixel-by-pixel semantic prediction network, a pixel-by-pixel three-dimensional attitude prediction network and a pixel-by-pixel three-dimensional position prediction network: the pixel-by-pixel semantic prediction network performs dense pixel-by-pixel semantic prediction on the input multi-modal features to obtain pixel-by-pixel multi-class semantic information, the pixel-by-pixel three-dimensional attitude prediction network performs dense pixel-by-pixel three-dimensional attitude prediction on the input multi-modal features to obtain pixel-by-pixel three-dimensional attitude information, and the pixel-by-pixel three-dimensional position prediction network performs dense pixel-by-pixel three-dimensional position prediction on the input multi-modal features to obtain pixel-by-pixel three-dimensional position information.
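A hedged sketch of the three regression branches of claim 5 as per-pixel convolutional heads follows. The channel counts (class logits, a 4-channel quaternion and a 3-channel position offset) follow claims 7 and 9, but the head architecture itself is an assumption, not taken from the patent.

```python
import torch
import torch.nn as nn

class DensePredictionHeads(nn.Module):
    """Three per-pixel regression branches over the fused multi-modal features:
    semantics (one logit per class + background), attitude (quaternion, 4 ch)
    and position offset (3 ch). The head design is illustrative only."""
    def __init__(self, in_channels=128, num_classes=21):
        super().__init__()
        def head(out_channels):
            return nn.Sequential(
                nn.Conv2d(in_channels, 64, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(64, out_channels, kernel_size=1),
            )
        self.semantic_head = head(num_classes + 1)   # pixel-wise class logits
        self.attitude_head = head(4)                 # pixel-wise quaternion (w, x, y, z)
        self.position_head = head(3)                 # pixel-wise 3D position offset

    def forward(self, features):
        return (self.semantic_head(features),
                self.attitude_head(features),
                self.position_head(features))

if __name__ == "__main__":
    heads = DensePredictionHeads()
    feats = torch.randn(1, 128, 60, 80)
    sem, rot, pos = heads(feats)
    print(sem.shape, rot.shape, pos.shape)
```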
6. The robot grasping method based on multi-modal feature extraction and dense prediction according to claim 1, wherein the scene three-dimensional point cloud is calculated from the depth image in S2 by back-projection through the camera intrinsic parameters, with the specific formula:

x = (u - c_x) · d / f_x,   y = (v - c_y) · d / f_y,   z = d

where (x, y, z) are the three-dimensional point cloud coordinates, f_x, f_y, c_x and c_y are the camera intrinsic parameters, (u, v) are the depth image coordinates, and d is the depth of the depth image at (u, v);
in S2, the adaptive convolution receptive fields of different scales are calculated from the depth image, with the specific formula:

R_a(p) = R_c(p) + Δp

where R_a(p) is the adaptive depth convolution receptive field of different scales corresponding to pixel p, R_c(p) is the conventional convolution receptive field corresponding to pixel p, and Δp is the position offset of pixel p;
in S2, the surface normal vector image is obtained from the scene three-dimensional point cloud by least-squares plane fitting, with the specific formula:

N = (P^T · P)^(-1) · P^T · 1_n

where N is the surface normal vector image, P denotes the three-dimensional point clouds of the scene, n is the number of point clouds, and 1_n is an all-ones vector of length n.
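The back-projection in claim 6 is the standard pinhole-camera model; a small NumPy sketch follows, together with a least-squares plane fit as one plausible reading of the reconstructed normal-vector formula above. Function names and the neighbourhood handling are illustrative assumptions.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth image (metres) into an HxWx3 point cloud
    using the pinhole intrinsics, as in claim 6."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)

def normal_from_neighbourhood(points):
    """Estimate one unit normal for a small neighbourhood of 3D points by
    least-squares plane fitting: solve P n = 1 for n, then normalise."""
    P = points.reshape(-1, 3)                       # n x 3 neighbourhood points
    ones = np.ones(len(P))                          # all-ones vector of length n
    n, *_ = np.linalg.lstsq(P, ones, rcond=None)    # plane n·p = 1
    return n / (np.linalg.norm(n) + 1e-12)

if __name__ == "__main__":
    depth = np.full((4, 4), 0.8)                    # a flat surface 0.8 m away
    cloud = depth_to_point_cloud(depth, fx=600.0, fy=600.0, cx=2.0, cy=2.0)
    print(normal_from_neighbourhood(cloud[0:3, 0:3]))   # approximately (0, 0, 1)
```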
7. The robot grasping method based on multi-modal feature extraction and dense prediction according to claim 6, wherein the dense three-dimensional attitude information predicted for each class of object in S4 is specifically:

q_i = (w_i, x_i, y_i, z_i)

where q_i is the three-dimensional attitude of the object at pixel i, expressed as a quaternion representing the attitude, and w_i, x_i, y_i and z_i are the quaternion values of the three-dimensional attitude of the object at pixel i;

the dense three-dimensional position information predicted for each class of object in S4 is specifically:

t_i = P_i + Δ_i

where t_i is the three-dimensional position of the object at pixel i, t denotes the object position, and Δ_i is the three-dimensional position offset of the object at pixel i, i.e. the three-dimensional offset from the 3D point P_i corresponding to pixel i to the object three-dimensional position t.
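To illustrate claim 7, the sketch below builds such dense per-pixel targets from a known object attitude and position, under the assumption (made explicit in the reconstructed claim text above) that the stored offset is the full vector from the pixel's 3D point to the object position; all names are hypothetical.

```python
import numpy as np

def dense_pose_targets(point_cloud, mask, object_quaternion, object_position):
    """Build per-pixel dense targets in the spirit of claim 7 (illustrative):
    every object pixel stores the object quaternion and the 3D offset from
    that pixel's back-projected point to the object position."""
    h, w, _ = point_cloud.shape
    quat_map = np.zeros((h, w, 4))
    offset_map = np.zeros((h, w, 3))
    quat_map[mask] = object_quaternion                       # same attitude at every object pixel
    offset_map[mask] = object_position - point_cloud[mask]   # Δ_i = t − P_i
    return quat_map, offset_map

if __name__ == "__main__":
    cloud = np.random.rand(4, 4, 3)
    mask = np.zeros((4, 4), dtype=bool); mask[1:3, 1:3] = True
    q = np.array([1.0, 0.0, 0.0, 0.0])
    t = np.array([0.2, 0.0, 0.9])
    quat_map, offset_map = dense_pose_targets(cloud, mask, q, t)
    # Each per-pixel position hypothesis recovers t at the object pixels:
    print(np.allclose(cloud[mask] + offset_map[mask], t))    # True
```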
8. The robot grasping method based on multi-modal feature extraction and dense prediction according to claim 7, wherein the three-dimensional attitude of the corresponding object is calculated in S5 from the predicted dense three-dimensional attitude information of each class of object, with the specific formula:

q_obj = (1 / N_obj) · Σ_{i=1..N_obj} q_i

where q_obj is the three-dimensional attitude of the object of class obj and N_obj is the number of dense predictions corresponding to the class-obj object;

in S5, the three-dimensional position of the corresponding object is calculated from the predicted dense three-dimensional position information of each class of object, with the specific formula:

t_obj = (1 / N_obj) · Σ_{i=1..N_obj} t_i

where t_obj is the three-dimensional position of the object of class obj, t_i is the predicted dense three-dimensional position, and N_obj is the number of dense predictions corresponding to the class-obj object.
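A minimal sketch of the per-class aggregation of claim 8: the dense per-pixel predictions of one class are averaged into a single attitude and position. The re-normalisation of the averaged quaternion is an added assumption, not stated in the claim, and all names are illustrative.

```python
import numpy as np

def aggregate_object_pose(quat_map, position_map, semantic_map, obj_class):
    """Average the dense per-pixel predictions of one class (claim 8):
    q_obj = mean of q_i and t_obj = mean of t_i over the N_obj pixels of class obj."""
    pixels = semantic_map == obj_class
    n_obj = int(pixels.sum())
    if n_obj == 0:
        return None, None
    q_obj = quat_map[pixels].mean(axis=0)
    q_obj /= np.linalg.norm(q_obj) + 1e-12   # keep a valid unit quaternion (assumption)
    t_obj = position_map[pixels].mean(axis=0)
    return q_obj, t_obj

if __name__ == "__main__":
    h, w = 8, 8
    semantic_map = np.zeros((h, w), dtype=int); semantic_map[2:6, 2:6] = 3
    quat_map = np.tile(np.array([0.7, 0.0, 0.7, 0.1]), (h, w, 1))
    position_map = np.tile(np.array([0.3, -0.1, 0.5]), (h, w, 1))
    print(aggregate_object_pose(quat_map, position_map, semantic_map, obj_class=3))
```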
9. The robot grasping method based on multi-modal feature extraction and dense prediction according to claim 1, wherein the loss function preset in S3 is specifically:

L_total = λ_1 · L_sem + λ_2 · L_rot + λ_3 · L_pos

L_rot = (1 / (C · N)) · Σ_{obj=1..C} Σ_{i=1..N} || q̂_i − q_i ||

L_pos = (1 / (C · N)) · Σ_{obj=1..C} Σ_{i=1..N} || t̂_i − t_i ||

where L_total is the total loss of the network, λ_1, λ_2 and λ_3 are the weight factors of the semantic prediction branch, the three-dimensional attitude prediction branch and the three-dimensional position prediction branch respectively, L_sem is the loss function of the semantic prediction network, for which a cross-entropy loss function is adopted, L_rot is the loss function of the three-dimensional attitude prediction network, L_pos is the loss function of the three-dimensional position prediction network, q̂_i and q_i are respectively the predicted value and the true value of the three-dimensional attitude prediction network, t̂_i and t_i are respectively the predicted value and the true value of the three-dimensional position prediction network, C is the number of object classes in the scene, and N is the number of dense predictions corresponding to each class of object.
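A hedged sketch of the weighted multi-task loss of claim 9 follows: cross-entropy for the semantic branch plus two dense regression terms, written here with L1 distances because the claim does not pin down the exact regression penalty; the weights, masking scheme and names are illustrative.

```python
import torch
import torch.nn.functional as F

def total_loss(sem_logits, sem_labels,
               quat_pred, quat_gt,
               pos_pred, pos_gt,
               obj_mask,
               lambdas=(1.0, 1.0, 1.0)):
    """Weighted multi-task loss in the spirit of claim 9:
    L_total = λ1·L_sem + λ2·L_rot + λ3·L_pos, with the regression terms
    averaged over object pixels (L1 distance is an assumption)."""
    l_sem = F.cross_entropy(sem_logits, sem_labels)            # pixel-wise semantics
    mask = obj_mask.unsqueeze(1).float()                       # (B, 1, H, W) object pixels
    denom = mask.sum().clamp(min=1.0)
    l_rot = (F.l1_loss(quat_pred, quat_gt, reduction="none") * mask).sum() / denom
    l_pos = (F.l1_loss(pos_pred, pos_gt, reduction="none") * mask).sum() / denom
    l1, l2, l3 = lambdas
    return l1 * l_sem + l2 * l_rot + l3 * l_pos

if __name__ == "__main__":
    B, C, H, W = 2, 5, 16, 16
    loss = total_loss(torch.randn(B, C, H, W), torch.randint(0, C, (B, H, W)),
                      torch.randn(B, 4, H, W), torch.randn(B, 4, H, W),
                      torch.randn(B, 3, H, W), torch.randn(B, 3, H, W),
                      obj_mask=torch.randint(0, 2, (B, H, W)).bool())
    print(loss.item())
```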
10. A robot grasping system, which performs the grasping task for a target object in a scene by using the robot grasping method based on multi-modal feature extraction and dense prediction according to any one of claims 1 to 9, characterized in that the system comprises a pose calculation module, a communication module, a grasping module and an image acquisition module, wherein
the image acquisition module acquires, in real time, color images and depth images of multi-class object scenes and sends them to the pose calculation module;
the pose calculation module calculates the pose of the target object by the method according to any one of claims 1 to 9 and sends it to the grasping module through the communication module; and
the grasping module receives the 6D pose information of the target object and grasps the target object.
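To make the module decomposition of claim 10 concrete, a minimal and entirely hypothetical orchestration skeleton is sketched below; none of the interfaces, class names or data formats are taken from the patent.

```python
from dataclasses import dataclass
from typing import Tuple
import numpy as np

@dataclass
class Frame:
    color: np.ndarray   # HxWx3 color image
    depth: np.ndarray   # HxW depth image (metres)

class ImageAcquisitionModule:
    def capture(self) -> Frame:
        # Stand-in for a real RGB-D camera driver.
        return Frame(color=np.zeros((480, 640, 3), np.uint8),
                     depth=np.full((480, 640), 0.8, np.float32))

class PoseCalculationModule:
    def estimate(self, frame: Frame) -> Tuple[np.ndarray, np.ndarray]:
        # Stand-in for the multi-modal feature extraction and dense prediction
        # pipeline of claims 1-9; returns (quaternion, position).
        return np.array([1.0, 0.0, 0.0, 0.0]), np.array([0.3, 0.0, 0.8])

class CommunicationModule:
    def send(self, pose) -> None:
        print("sending 6D pose to grasping module:", pose)

class GraspingModule:
    def grasp(self, pose) -> None:
        print("executing grasp at", pose)

if __name__ == "__main__":
    camera, solver = ImageAcquisitionModule(), PoseCalculationModule()
    comm, gripper = CommunicationModule(), GraspingModule()
    pose = solver.estimate(camera.capture())
    comm.send(pose)
    gripper.grasp(pose)
```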
CN202211407718.3A 2022-11-10 2022-11-10 Robot grabbing method and system based on multi-mode feature extraction and dense prediction Active CN115578460B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211407718.3A CN115578460B (en) 2022-11-10 2022-11-10 Robot grabbing method and system based on multi-mode feature extraction and dense prediction

Publications (2)

Publication Number Publication Date
CN115578460A true CN115578460A (en) 2023-01-06
CN115578460B CN115578460B (en) 2023-04-18

Family

ID=84588865

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211407718.3A Active CN115578460B (en) 2022-11-10 2022-11-10 Robot grabbing method and system based on multi-mode feature extraction and dense prediction

Country Status (1)

Country Link
CN (1) CN115578460B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180085923A1 (en) * 2016-09-29 2018-03-29 Seiko Epson Corporation Robot control device, robot, and robot system
CN110363815A (en) * 2019-05-05 2019-10-22 东南大学 The robot that Case-based Reasoning is divided under a kind of haplopia angle point cloud grabs detection method
WO2020130085A1 (en) * 2018-12-21 2020-06-25 株式会社日立製作所 Three-dimensional position/attitude recognition device and method
US20200361083A1 (en) * 2019-05-15 2020-11-19 Nvidia Corporation Grasp generation using a variational autoencoder
CN113658254A (en) * 2021-07-28 2021-11-16 深圳市神州云海智能科技有限公司 Method and device for processing multi-modal data and robot
CN114663514A (en) * 2022-05-25 2022-06-24 浙江大学计算机创新技术研究院 Object 6D attitude estimation method based on multi-mode dense fusion network
CN114998573A (en) * 2022-04-22 2022-09-02 北京航空航天大学 Grabbing pose detection method based on RGB-D feature depth fusion
CN115082885A (en) * 2022-06-27 2022-09-20 深圳见得空间科技有限公司 Point cloud target detection method, device, equipment and storage medium
CN115147488A (en) * 2022-07-06 2022-10-04 湖南大学 Workpiece pose estimation method based on intensive prediction and grasping system
CN115256377A (en) * 2022-07-12 2022-11-01 同济大学 Robot grabbing method and device based on multi-source information fusion

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116494253A (en) * 2023-06-27 2023-07-28 北京迁移科技有限公司 Target object grabbing pose acquisition method and robot grabbing system
CN116494253B (en) * 2023-06-27 2023-09-19 北京迁移科技有限公司 Target object grabbing pose acquisition method and robot grabbing system
CN117934478A (en) * 2024-03-22 2024-04-26 腾讯科技(深圳)有限公司 Defect detection method, device, equipment and medium

Also Published As

Publication number Publication date
CN115578460B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
CN109255813B (en) Man-machine cooperation oriented hand-held object pose real-time detection method
CN112070818B (en) Robot disordered grabbing method and system based on machine vision and storage medium
CN109344882B (en) Convolutional neural network-based robot control target pose identification method
CN100407798C (en) Three-dimensional geometric mode building system and method
CN103778635B (en) For the method and apparatus processing data
CN113450408B (en) Irregular object pose estimation method and device based on depth camera
CN115578460B (en) Robot grabbing method and system based on multi-mode feature extraction and dense prediction
CN113065546B (en) Target pose estimation method and system based on attention mechanism and Hough voting
CN110176032B (en) Three-dimensional reconstruction method and device
CN110211180A (en) A kind of autonomous grasping means of mechanical arm based on deep learning
CN112836734A (en) Heterogeneous data fusion method and device and storage medium
CN111553949B (en) Positioning and grabbing method for irregular workpiece based on single-frame RGB-D image deep learning
CN110509273B (en) Robot manipulator detection and grabbing method based on visual deep learning features
CN113409384A (en) Pose estimation method and system of target object and robot
CN109325995B (en) Low-resolution multi-view hand reconstruction method based on hand parameter model
CN111998862B (en) BNN-based dense binocular SLAM method
CN111339870A (en) Human body shape and posture estimation method for object occlusion scene
CN112767467B (en) Double-image depth estimation method based on self-supervision deep learning
CN111524233A (en) Three-dimensional reconstruction method for dynamic target of static scene
CN109318227B (en) Dice-throwing method based on humanoid robot and humanoid robot
CN114882109A (en) Robot grabbing detection method and system for sheltering and disordered scenes
CN112750198A (en) Dense correspondence prediction method based on non-rigid point cloud
CN116129037B (en) Visual touch sensor, three-dimensional reconstruction method, system, equipment and storage medium thereof
CN113927597A (en) Robot connecting piece six-degree-of-freedom pose estimation system based on deep learning
CN115147488A (en) Workpiece pose estimation method based on intensive prediction and grasping system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant