CN115187781A - Six-degree-of-freedom grabbing detection algorithm based on semantic segmentation network - Google Patents


Info

Publication number
CN115187781A
Authority
CN
China
Prior art keywords
grabbing
degree
freedom
semantic
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210817429.4A
Other languages
Chinese (zh)
Other versions
CN115187781B (en)
Inventor
张向燕
张勤俭
李海源
沈勇
王勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Beijing Information Science and Technology University
Peking University School of Stomatology
Original Assignee
Beijing University of Posts and Telecommunications
Beijing Information Science and Technology University
Peking University School of Stomatology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications, Beijing Information Science and Technology University, Peking University School of Stomatology
Priority to CN202210817429.4A
Publication of CN115187781A
Application granted
Publication of CN115187781B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/20 - Image preprocessing
    • G06V 10/26 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 - Complex mathematical operations
    • G06F 17/16 - Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P 90/00 - Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P 90/30 - Computing systems specially adapted for manufacturing

Abstract

The invention discloses a six-degree-of-freedom grabbing detection algorithm based on a semantic segmentation network, which comprises the following steps: acquiring a six-degree-of-freedom grabbing detection data set and converting the grabbing parameter arrays into grabbing parameter semantic label images; scaling, normalizing, and restoring the classifications of the RGB images in the data set and the converted grabbing parameter semantic label images, and dividing them into a training set and a test set; constructing a six-degree-of-freedom grabbing detection semantic segmentation network and training it; feeding RGB image data into the trained six-degree-of-freedom grabbing detection semantic segmentation network to obtain predicted grabbing parameter semantic images; converting the predicted grabbing parameter semantic label images into grabbing parameters relative to the camera coordinate system; and driving a robot equipped with an end effector to grab according to the grabbing parameters. With the six-degree-of-freedom grabbing detection algorithm based on a semantic segmentation network, the network model is more stable and robust, and the prediction results are more accurate.

Description

Six-degree-of-freedom grabbing detection algorithm based on semantic segmentation network
Technical Field
The invention belongs to the technical field of computer vision and robot grabbing, and particularly relates to a six-degree-of-freedom grabbing detection algorithm based on a semantic segmentation network.
Background
Robot grabbing and manipulation are important modes of interaction between a robot and its environment, and grabbing is the more fundamental and important link. Although a large number of researchers currently study the grabbing problem, it remains quite complex: the complexity shows not only in the diversity of grabbing scenes but also in the precision required for grabbing pose analysis, grabbing motion planning, and related aspects.
The robot grabbing problem can be broken down into two main steps: first, grabbing pose detection, and second, motion planning. Detection of the grabbing pose is the key to successful robot grabbing. At present, grabbing pose detection mainly follows two approaches: one is mathematical analysis and the other is data-driven.
The mathematical analysis method needs to obtain the position and posture for grabbing the target object through a series of complex operations; it has low computational efficiency and a complex analysis process, and it is difficult to migrate to new scenes.
With the continuous development of artificial intelligence and machine vision, data-driven neural networks have gradually been applied to grabbing pose detection research. Typical methods include two-dimensional planar grabbing detection and six-degree-of-freedom grabbing detection. Two-dimensional planar grabbing is generally represented by an oriented rectangle; a large number of usable data sets exist, much research has been carried out, and good prediction results have been obtained. However, two-dimensional planar grabbing can only grab perpendicular to a plane and cannot adapt to more flexible and complex scenes. Six-degree-of-freedom grabbing can grab an object from any direction and is therefore more flexible; however, most current research on six-degree-of-freedom grabbing detection takes a depth image or a three-dimensional point cloud as input, which increases prediction instability and is easily affected by depth-image noise. In addition, in six-degree-of-freedom grabbing detection the grabbing attitude is a 3 x 3 matrix, and most studies predict it directly by regression, which leads to non-ideal prediction results.
Disclosure of Invention
The invention aims to provide a six-degree-of-freedom grabbing detection algorithm based on a semantic segmentation network, which can overcome the problems of the existing two-dimensional plane grabbing detection and the existing six-degree-of-freedom grabbing detection, improve the accuracy of a prediction result and adapt to more flexible and complex scenes.
In order to achieve the above purpose, the invention provides a six-degree-of-freedom grabbing detection algorithm based on a semantic segmentation network, which comprises the following steps:
step 1: acquiring a six-degree-of-freedom grabbing detection data set, and converting the grabbing parameter arrays into grabbing parameter semantic label images by the designed data conversion method;
step 2: scaling, normalizing, and restoring the classifications of the RGB images in the data set and the converted grabbing parameter semantic label images, and dividing them into a training set and a test set;
step 3: constructing a six-degree-of-freedom grabbing detection semantic segmentation network, and training the network with the processed data;
step 4: feeding RGB image data into the trained six-degree-of-freedom grabbing detection semantic segmentation network to obtain predicted grabbing parameter semantic images;
step 5: converting the predicted grabbing parameter semantic label images into grabbing parameters relative to the camera coordinate system through the proposed post-processing operations;
step 6: driving the robot equipped with an end effector to grab the object in the scene using the grabbing parameters obtained by post-processing.
Preferably, step 1 specifically comprises:
step 1.1: acquiring a GraspNet-1Billion six-degree-of-freedom grabbing detection data set, reading and converting the grabbing labels in the data set;
step 1.2: converting the specific numerical values of the grabbing depth and the opening and closing width in the processed grabbing label into classifications;
step 1.3: converting the rotation attitude matrix in the processed grabbing label into a rotation vector, decomposing the rotation vector into a unit direction vector and an angle of rotation around the unit direction vector, and obtaining the classifications of the unit direction vector and of the angle of rotation around it through the classification matching templates;
step 1.4: and converting the grabbing parameters into grabbing parameter semantic label images by constructing three-dimensional point cloud of the grabbing parameters.
Preferably, step 2 specifically comprises:
step 2.1: reducing the RGB images in the data set and the converted grabbing parameter semantic label images from the original 720 x 1280 pixel size to 224 x 224 pixels by neighboring-point sampling, so as to improve the model operation speed and adapt the images to the network structure;
step 2.2: dividing the data of the RGB image by 255 for normalization; converting the label images of the captured parameters into actual classification numerical values from the classification in the images;
step 2.3: and randomly scrambling the data in the first 100 scenes for training in the data set according to the proportion of 8:2, and dividing the data into training data and test data.
Preferably, step 3 specifically comprises:
step 3.1: constructing a six-degree-of-freedom grabbing detection semantic segmentation network;
step 3.2: setting the optimizer, the loss function, the batch size for loading data, and the number of iterations;
step 3.3: and loading the training data and the test data after the data processing by using an iterator so as to train the model.
Preferably, step 4 specifically comprises:
step 4.1: performing data preprocessing operation including picture reduction and normalization on the RGB image;
step 4.2: and sending the RGB image subjected to data preprocessing into a six-degree-of-freedom capture detection semantic segmentation network which completes training to obtain a predicted capture parameter semantic image.
Preferably, step 5 specifically comprises:
step 5.1: restoring the predicted 224 x 224 pixel grabbing parameter semantic images to the original 720 x 1280 pixel size by neighboring-point sampling;
step 5.2: filtering out the grabbing parameters that represent the background classification;
step 5.3: restoring the screened predicted grabbing parameters from classifications to the corresponding actual numerical values through a dictionary or an array;
step 5.4: indexing, from the input depth image, the depth at the pixel position of each grabbing point, and then calculating the three-dimensional coordinates of the grabbing point from the depth value;
step 5.5: calculating the rotation vector from its unit direction vector r and the angle θ of rotation around it, and converting the rotation vector into a rotation attitude matrix R to obtain the grabbing parameters G = [X, Y, Z, R, d, w] in the camera coordinate system;
where (X, Y, Z) represents three-dimensional coordinates of the grasping point in the camera coordinate system, R represents a 3 × 3 rotation posture matrix, and d and w are a grasping depth of the gripper near the object in the approaching direction and an opening width of the gripper.
Preferably, step 6 specifically includes:
step 6.1: converting the predicted grabbing parameters relative to the camera coordinate system into grabbing parameters relative to the world coordinate system through a homogeneous transformation matrix of a depth camera used in an actual environment in the world coordinate system;
step 6.2: planning the end-effector pose and the motion path of the robot from the predicted grabbing parameters through inverse kinematics calculation;
step 6.3: the robot is driven to the predicted grasp attitude, and the article is grasped by a gripper provided at the end of the robot.
Therefore, with the six-degree-of-freedom grabbing detection algorithm based on a semantic segmentation network of this structure, taking the RGBD image as input makes the network model more stable and robust, and predicting the grabbing posture by classification rather than regression after the transformation makes the prediction result more accurate.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
FIG. 1 is a schematic diagram illustrating steps of an embodiment of a six-degree-of-freedom capture detection algorithm based on a semantic segmentation network according to the present invention;
FIG. 2 is a schematic flow chart illustrating the process of acquiring the six-degree-of-freedom grabbing detection data set according to an embodiment of the present invention;
FIG. 3 is a flowchart of a data conversion method for converting a capture parameter array into a capture parameter semantic tag image according to an embodiment of the present invention;
FIG. 4 is an exemplary diagram of a captured parameter semantic tag image obtained by conversion according to an embodiment of the present invention;
FIG. 5 is a six-degree-of-freedom capture detection semantic segmentation model according to an embodiment of the present invention;
FIG. 6 is a flow chart of model training, prediction and data post-processing according to an embodiment of the present invention;
FIG. 7 is a diagram illustrating the predicted grasps displayed in the three-dimensional scene point cloud after the encoder-decoder semantic segmentation network and the post-processing, according to an embodiment of the present invention.
Detailed Description
The technical solution of the present invention is further illustrated by the accompanying drawings and examples.
Examples
As shown in fig. 1, a six-degree-of-freedom capture detection algorithm based on a semantic segmentation network includes the following steps:
step 1: and acquiring a six-degree-of-freedom grabbing detection data set, and converting the grabbing parameter array into a grabbing parameter semantic label image by a designed data conversion method.
The step 1 specifically comprises the following steps:
step 1.1: as shown in fig. 2, a GraspNet-1Billion six-degree-of-freedom grabbing detection dataset is obtained, and grabbing tags in the dataset are read and converted.
The labels of the GraspNet-1Billion six-degree-of-freedom grabbing detection data set comprise: collision tags, grabbing tags, and tags of the numbers and poses of the objects in the scene. The grabbing tag gives grabbing poses for the different objects and comprises the coordinates of the grabbing points, the approach vector, the in-plane rotation angle, the grabbing depth, the opening and closing width, and the grabbing score. The approach vector here refers to the direction vector along which the gripper approaches the object, and the in-plane rotation angle refers to the angle around which the gripper rotates. The collision tag gives a binary Boolean attribute for the different objects placed in the scene. The object number and pose tag stores the numbers of the objects placed in the different scenes and their poses in those scenes.
Six-degree-of-freedom grabbing allows the end effector to grab an object from any direction, and the six-degree-of-freedom grab is expressed as follows:
G=[X,Y,Z,R,d,w]
where (X, Y, Z) represents three-dimensional coordinates of the grasping point in the camera coordinate system, R represents a 3 × 3 rotation posture matrix, and d and w are a grasping depth of the gripper near the object in the approaching direction and an opening width of the gripper.
The model output obtained by the six-degree-of-freedom grabbing detection semantic segmentation network is as follows:
G_cls=[point,r,θ,d,w]
the output gripping parameters are all classification numbers, where point denotes whether or not it is a gripping point, r and θ denote classification of a unit direction vector of a rotation vector after converting the rotation posture matrix into the rotation vector and classification of an angle of rotation around the unit direction vector, respectively, and d and w are classification of a gripping depth and classification of an opening width of the gripper.
The specific process of acquiring and screening the grabbing parameters is as follows:
firstly, according to the scene number and the camera view angle number in the data set, the collision tag of the scene, the number of the object in the scene and the pose of the object under the camera view angle corresponding to the scene are indexed.
And secondly, reading in the grabbing label corresponding to the object according to the serial number of the object.
And thirdly, the three-dimensional coordinates of the grabbing points relative to the object coordinate system in the grabbing label are converted into three-dimensional coordinates relative to the camera coordinate system through the poses of the objects, with the calculation formula:
Point_cam = obj_pose · Point_obj
in the formula, Point_obj refers to the three-dimensional coordinates of a grabbing point relative to the object coordinate system in the grabbing label of the data set, Point_cam is the three-dimensional coordinates of the grabbing point relative to the camera coordinate system, and obj_pose is the 4 x 4 pose matrix of the object relative to the camera coordinate system (the point is expressed in homogeneous coordinates for the multiplication).
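For illustration, a minimal NumPy sketch of this frame change follows; the function and array names (points_object_to_camera, points_obj, obj_pose) are hypothetical and not part of the data set's API, and obj_pose is assumed to be the 4 x 4 homogeneous pose of the object in the camera frame.

import numpy as np

def points_object_to_camera(points_obj: np.ndarray, obj_pose: np.ndarray) -> np.ndarray:
    """Transform an N x 3 array of grasp points from the object frame to the
    camera frame using a 4 x 4 homogeneous pose matrix."""
    ones = np.ones((points_obj.shape[0], 1))
    points_h = np.hstack([points_obj, ones])      # N x 4 homogeneous coordinates
    return (obj_pose @ points_h.T).T[:, :3]       # apply the pose, drop the trailing 1s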
And fourthly, calculating to obtain a rotation attitude matrix relative to the object coordinate system according to the number of the proximity vector in the grabbing label and the number of the rotation angle in the plane.
Fifthly, converting the calculated rotation attitude matrix into a rotation attitude matrix under a camera coordinate system through the position and attitude of the object, wherein the calculation formula is as follows;
R = R_obj_pose · R_obj
where view and ang denote the approach vector and the in-plane rotation angle in the grabbing label of the data set, R_1 and R_2 are the 3 x 3 rotation matrices generated from them, R_obj_pose is the 3 x 3 rotation matrix in the pose matrix of the object, R_obj is the rotation attitude matrix relative to the object coordinate system calculated from R_1 and R_2, and R is the rotation attitude matrix relative to the camera coordinate system obtained through the rotation matrix of the object.
And sixthly, screening and retaining a series of grabbing parameters which are not collided and have the highest grabbing scores according to the collision labels and the grabbing scores.
Step 1.2: and converting the specific numerical values of the grabbing depth and the opening and closing width in the grabbing label obtained after the processing into classification.
Firstly, d and w in the grabbing parameters are converted into classification numbers with classification numbers of 4 and 16 respectively through a dictionary:
d_dict={0:0.01,1:0.02,2:0.03,3:0.04}
according to the conversion dictionary for grabbing the depth, the depth data in the d array is converted into corresponding classification from specific numerical values such as 0.01, 0.02 and the like.
Similarly, the opening/closing width w is converted into a classification by a dictionary, and before the classification is converted, a rounding operation for retaining two decimal places is also required for the opening/closing width.
w_dict={0:0, 1:0.01, 2:0.02, 3:0.03, 4:0.04, 5:0.05, 6:0.06, 7:0.07, 8:0.08, 9:0.09, 10:0.10, 11:0.11, 12:0.12, 13:0.13, 14:0.14, 15:0.15}
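A minimal Python sketch of this value-to-class conversion is given below, assuming the class whose dictionary value is nearest to the (rounded) metric value is taken; the helper name value_to_class is illustrative only and not part of the patented implementation.

import numpy as np

d_dict = {0: 0.01, 1: 0.02, 2: 0.03, 3: 0.04}                # 4 grasp-depth classes
w_dict = {i: round(i * 0.01, 2) for i in range(16)}          # 16 width classes: 0.00 ... 0.15

def value_to_class(value: float, table: dict) -> int:
    """Return the class whose table value is closest to the (two-decimal) metric value."""
    value = round(value, 2)
    keys = np.array(list(table.keys()))
    vals = np.array(list(table.values()))
    return int(keys[np.argmin(np.abs(vals - value))])

print(value_to_class(0.02, d_dict), value_to_class(0.07, w_dict))   # -> 1 7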
Step 1.3: and converting the rotation posture matrix in the captured label obtained after the processing into a rotation vector, decomposing the rotation vector into a unit direction vector and an angle rotating around the unit direction vector, and converting to obtain the classification of the unit direction vector and the angle rotating around the unit direction vector through a classification matching template.
To convert the rotation attitude matrix into a classification, the rotation attitude matrix is first converted into a rotation vector, denoted by v = θ · r, where r denotes the unit direction vector of the rotation vector and θ denotes the angle of rotation around the unit direction vector. θ can be solved from the trace of the rotation attitude matrix:
θ = arccos((tr(R) - 1) / 2)
The unit direction vector of the rotation vector is further written as r = (r_x, r_y, r_z); r can then be calculated by the following formula:
r = (1 / (2 · sin θ)) · (R_32 - R_23, R_13 - R_31, R_21 - R_12)
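A small NumPy sketch of this decomposition follows (an epsilon guard is added for the degenerate case sin θ ≈ 0, which the description does not address; the function name is illustrative).

import numpy as np

def rotation_matrix_to_axis_angle(R: np.ndarray, eps: float = 1e-8):
    """Return the unit axis r and angle theta of a 3 x 3 rotation matrix."""
    theta = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    r = np.array([R[2, 1] - R[1, 2],
                  R[0, 2] - R[2, 0],
                  R[1, 0] - R[0, 1]]) / (2.0 * np.sin(theta) + eps)
    return r, theta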
next, a unit direction vector and a classification matching template rotated by an angle around the unit direction vector are constructed, and the classification of the template is set to 255 types.
The classification matching template of the unit direction vector is obtained by uniformly sampling 255 points on the three-dimensional unit sphere; these points, together with the sphere-center origin, form 255 unit direction vectors. The calculation formula is as follows:
z_n = (2n - 1) / N - 1
x_n = sqrt(1 - z_n^2) · cos(2πnφ)
y_n = sqrt(1 - z_n^2) · sin(2πnφ)
where (x_n, y_n, z_n) are the coordinates of the n-th sampling point in three-dimensional space, N is the total number of sampling points (here 255), n denotes the n-th sample, and φ is the golden-section ratio, here taking the value (sqrt(5) - 1) / 2 ≈ 0.618.
The sampling points obtained in this way, together with the sphere-center origin, form 255 unit direction vectors that are uniformly distributed in three-dimensional space.
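A sketch of this template construction follows, assuming the standard golden-section (Fibonacci) spiral sampling of the sphere; the exact constants of the patented implementation may differ, and the variable names are illustrative.

import numpy as np

def fibonacci_sphere_directions(n_points: int = 255) -> np.ndarray:
    """Sample n_points approximately uniform unit direction vectors on the sphere."""
    phi = (np.sqrt(5.0) - 1.0) / 2.0              # golden-section ratio, ~0.618
    n = np.arange(1, n_points + 1)
    z = (2.0 * n - 1.0) / n_points - 1.0          # evenly spaced heights in (-1, 1)
    radius = np.sqrt(1.0 - z ** 2)
    angle = 2.0 * np.pi * phi * n
    return np.stack([radius * np.cos(angle), radius * np.sin(angle), z], axis=1)

axis_templates = fibonacci_sphere_directions(255)     # 255 x 3 unit direction vectors
angle_templates = np.linspace(0.0, np.pi, 255)        # class spacing of pi / 254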
The classification matching template for the angle of rotation around the unit direction vector is an array obtained by dividing the rotation angle equally over the range [0, π] with a class interval of π/254.
Finally, the unit direction vector r of the rotation vector and the angle θ of rotation around it are matched one by one against the unit-direction-vector template and the rotation-angle template, respectively. The matching of the unit direction vector against its classification matching template is decided by the cosine similarity between the unit vector and each template vector. Cosine similarity judges how similar two vectors are by computing the cosine of the angle between them: the closer the cosine value is to 1, the more similar the directions of the two vectors. The calculation formula is:
cos<A, B> = (A · B) / (|A| · |B|)
and calculating cosine similarity to obtain the vector with the highest similarity in the unit direction vector of the rotation vector and the unit vector of the unit direction vector classification matching template, and taking the index value corresponding to the vector with the highest similarity as the classification number of the unit direction vector, namely converting the direction vector into a classification integer.
Similarly, the difference between the angle of rotation around the unit direction vector and the angle in the classification matching template array is calculated and squared, the angle of rotation around the unit direction vector and the angle in the classification matching template array with the minimum difference are found, and the index corresponding to the angle is used as the classification number of the angle of rotation around the unit direction vector.
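A compact sketch of both matching steps is shown below; the template arrays are the ones built in the previous sketch, and the function name is illustrative.

import numpy as np

def match_axis_and_angle(r, theta, axis_templates, angle_templates):
    """Return the class index of the most similar template axis (highest cosine
    similarity) and of the closest template angle (smallest squared difference)."""
    sims = axis_templates @ r / (np.linalg.norm(axis_templates, axis=1) * np.linalg.norm(r))
    return int(np.argmax(sims)), int(np.argmin((angle_templates - theta) ** 2))

# Example usage with the templates from the previous sketch:
axis_cls, angle_cls = match_axis_and_angle(np.array([0.0, 0.0, 1.0]), np.pi / 2,
                                            axis_templates, angle_templates)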
Step 1.4: and converting the grabbing parameters into grabbing parameter semantic label images by constructing three-dimensional point cloud of the grabbing parameters.
Converting the grabbing parameters into grabbing parameter semantic label images turns the grabbing problem into a semantic segmentation problem and improves the training effect and prediction precision of the model; in addition, storing the grabbing parameters as images saves memory compared with storing them as arrays.
As shown in fig. 3, the grabbing parameters are converted into the grabbing parameter semantic label image by constructing a three-dimensional point cloud of the grabbing parameters and using the coordinate transformation from camera coordinates to image coordinates. The three-dimensional coordinates of the grabbing points in the grabbing parameters are taken as the position array of the three-dimensional point cloud; the remaining grabbing parameters (whether a pixel is a grabbing point, the unit direction vector, the angle of rotation around the unit direction vector, the grabbing depth, and the opening and closing width) are first converted into the range [0, 1] and then set as the color attributes of the point cloud, so that the grabbing points and the other grabbing parameters are matched correspondingly. The point cloud with color information is then converted into an image through the camera intrinsics with the Open3D library, so that the other grabbing parameters are stored in the image.
The transformation formulas among the camera coordinate system, the image coordinate system, and the pixel coordinate system are as follows:
x = f · X / Z,  y = f · Y / Z
u = s_x · x + c_x = f_x · X / Z + c_x,  v = s_y · y + c_y = f_y · Y / Z + c_y
where s_x and s_y are the transform coefficients relating a unit pixel length to the actual length, and c_x, c_y, f_x, f_y are the internal parameters of the camera, with f_x = f · s_x and f_y = f · s_y. (X, Y, Z) denotes coordinates in the camera coordinate system, (x, y) denotes coordinates in the image coordinate system, and (u, v) denotes coordinates in the pixel coordinate system. Through these formulas, the three-dimensional coordinates of a grabbing point relative to the camera coordinate system can be converted into two-dimensional coordinates in the image.
The classifications of the grabbing parameters (whether a pixel is a grabbing point, the unit direction vector, the angle of rotation around the unit direction vector, the grabbing depth, and the opening and closing width) are converted into the range [0, 1] by the following formula and used as the color array of the point cloud:
color = (current_cls + 1) / num_cls
where color denotes the [0, 1] color array of the different grabbing parameters, num_cls denotes the total number of classes of a grabbing parameter, and current_cls denotes the current class number; 1 is added to distinguish the grabbing parameter classes from the black background. The grabbing parameter semantic label image obtained by the conversion is shown in FIG. 4.
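The patent performs this conversion through the Open3D point-cloud-to-image pipeline; the sketch below performs the equivalent direct projection with NumPy only, writing the (class + 1) / num_cls encoding into a single-channel label image. The function name, image size, and intrinsics arguments are illustrative assumptions.

import numpy as np

def grasp_points_to_label_image(points_cam, classes, num_cls,
                                fx, fy, cx, cy, height=720, width=1280):
    """Project camera-frame grasp points to pixels and store their class encoding.
    points_cam is N x 3, classes is an N-vector of integer class ids.
    Background pixels remain 0 (black)."""
    label = np.zeros((height, width), dtype=np.float32)
    X, Y, Z = points_cam[:, 0], points_cam[:, 1], points_cam[:, 2]
    u = np.round(X / Z * fx + cx).astype(int)                 # pixel column
    v = np.round(Y / Z * fy + cy).astype(int)                 # pixel row
    valid = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    label[v[valid], u[valid]] = (classes[valid] + 1.0) / num_cls
    return label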
Step 2: scaling, normalizing, and restoring the classifications of the RGB images in the data set and the converted grabbing parameter semantic label images, and dividing them into a training set and a test set.
the step 2 specifically comprises the following steps:
step 2.1: the RGB images in the data set and the captured parameter semantic label images obtained by conversion are reduced to 224 x 224 pixel size from the original 720 x 1280 pixel size in a neighboring point sampling mode, so that the model operation speed is improved, and the model is suitable for a network structure.
Step 2.2: dividing the data of the RGB image by 255 for normalization so as to improve the model precision and the model convergence rate; converting the label image of the grabbing parameter from the classification in the image into an actual classification numerical value, wherein the conversion formula is as follows:
cls_point = [cls_point_img / 255]
cls_r = cls_r_img
cls_θ = cls_θ_img
cls_d = [cls_d_img · 4 / 255]
cls_w = [cls_w_img · 16 / 255]
In the formulas, img denotes the classification numerical value of a color in the image, r denotes the unit direction vector of the rotation vector, θ denotes the angle of rotation around the unit direction vector, d denotes the grabbing depth of the gripper approaching the object, and w denotes the opening and closing width of the gripper. Since the actual number of depth classes is 4 and the number of width classes is 16, the transformation equations for the depth and width classifications are multiplied by the coefficients 4 and 16, respectively; [ ] denotes rounding.
Step 2.3: the data in the first 100 training scenes of the data set are randomly shuffled and divided into training data and test data at a ratio of 8:2.
Step 3: constructing a six-degree-of-freedom grabbing detection semantic segmentation network and training the network.
the step 3 specifically comprises the following steps:
step 3.1: and (3) constructing a six-degree-of-freedom capture detection semantic segmentation network.
Semantic segmentation is an important branch of computer vision, and the essence of semantic segmentation is pixel-level classification. The input of the semantic segmentation is an RGB image, and the label is a mask image of a region of interest having the same size as the input image. Different from the classification problem of images, semantic segmentation can not only obtain the classification of different objects in a scene, but also obtain the region and the position of the objects.
A U-Net model is selected as the semantic segmentation network; it has a simple network structure, high operation efficiency, and is well suited to small training sets. U-Net adopts an Encoder-Decoder structure and has similarities with the fully convolutional network (FCN). The Encoder is mainly used to extract image features: during feature extraction the image size is gradually reduced while key features are continuously extracted. The Decoder is mainly used to restore the resolution of the image; this process includes concatenation with the feature maps obtained during encoding as well as up-sampling of the smaller feature maps. The constructed U-Net semantic segmentation network applied to six-degree-of-freedom grabbing detection is shown in FIG. 5.
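A minimal PyTorch sketch of such an encoder-decoder is given below; the channel widths, the three-stage depth, and the single classification head are illustrative assumptions and do not reproduce the exact network of FIG. 5 (in practice one such head, or one such network, per grabbing parameter with its own class count would be used, and in_ch=4 would accommodate an RGBD input).

import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """Two 3x3 conv + BatchNorm + ReLU layers, the basic U-Net building block."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

class MiniUNet(nn.Module):
    """Encoder-decoder with skip connections: the decoder upsamples and concatenates
    the matching encoder feature map before each conv block."""
    def __init__(self, in_ch=3, num_classes=255):
        super().__init__()
        self.enc1 = conv_block(in_ch, 64)
        self.enc2 = conv_block(64, 128)
        self.enc3 = conv_block(128, 256)
        self.pool = nn.MaxPool2d(2)
        self.up2 = nn.ConvTranspose2d(256, 128, 2, stride=2)
        self.dec2 = conv_block(256, 128)
        self.up1 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec1 = conv_block(128, 64)
        self.head = nn.Conv2d(64, num_classes, 1)        # per-pixel class logits

    def forward(self, x):
        e1 = self.enc1(x)                                 # 224 x 224
        e2 = self.enc2(self.pool(e1))                     # 112 x 112
        e3 = self.enc3(self.pool(e2))                     # 56 x 56
        d2 = self.dec2(torch.cat([self.up2(e3), e2], dim=1))   # back to 112 x 112
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))   # back to 224 x 224
        return self.head(d1)

logits = MiniUNet(in_ch=3, num_classes=255)(torch.randn(2, 3, 224, 224))
print(logits.shape)   # torch.Size([2, 255, 224, 224])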
Step 3.2: setting an optimizer, a loss function, a batch of loaded data, and the number of iterations. The optimizer in the training process is an Adam optimizer, the initial learning rate is set to 0.0005, and the Adam optimizer can automatically adjust according to performances on a test set. Using cross entropy as a loss function:
L = -(1/N) · Σ_i [a_i · log(y_i) + (1 - a_i) · log(1 - y_i)]
In the formula, N is the number of samples, y is the predicted value, and a is the label value. Accuracy is used as the evaluation index; the batch size for loading data is set to 6 and the number of iterations (epochs) is set to 20.
Step 3.3: and loading the training data and the test data after the data processing by using an iterator to train the model.
Step 4: the RGB image data are fed into the trained six-degree-of-freedom grabbing detection semantic segmentation network to obtain the predicted grabbing parameter semantic images.
The step 4 specifically comprises the following steps:
Step 4.1: the input is the RGB images of the verification-scene data in the data set; these RGB images undergo data preprocessing, including image downscaling and normalization.
Step 4.2: the preprocessed RGB image is fed into the trained six-degree-of-freedom grabbing detection semantic segmentation network to obtain predicted 224 x 224 pixel semantic images of the grabbing parameters: whether a pixel is a grabbing point, the unit direction vector, the angle of rotation around the unit direction vector, the grabbing depth, and the opening and closing width.
Step 5: the predicted grabbing parameter semantic label images are converted into grabbing parameters relative to the camera coordinate system through the proposed post-processing operations.
The step 5 specifically comprises the following steps:
Step 5.1: the predicted 224 x 224 pixel grabbing parameter semantic images are restored to the original 720 x 1280 pixel size by neighboring-point sampling.
Step 5.2: the grabbing parameters representing the background classification are filtered out.
Step 5.3: the remaining predicted grabbing parameters are restored from classifications to the corresponding actual numerical values through a dictionary or an array; this is the inverse of the process that converted the grabbing parameters into class numbers.
Step 5.4: using the input depth image, the depth at the pixel position of each grabbing point is indexed from the image information, and the three-dimensional coordinates of the grabbing point are then calculated from the depth value.
Step 5.5: the rotation vector is calculated from its unit direction vector r and the angle θ of rotation around it, and the rotation vector is converted into a rotation attitude matrix R, giving the grabbing parameters G = [X, Y, Z, R, d, w] in the camera coordinate system.
Step 6: the robot equipped with the end effector is driven to grab the object in the scene using the grabbing parameters obtained by post-processing.
The step 6 specifically comprises the following steps:
Step 6.1: the predicted grabbing parameters relative to the camera coordinate system are converted into grabbing parameters relative to the world coordinate system through the homogeneous transformation matrix of the depth camera used in the actual environment in the world coordinate system.
Step 6.2: and planning the tail end attitude and the running path of the robot by the predicted grabbing parameters and through inverse kinematics calculation.
Step 6.3: the robot is driven to the predicted gripping pose, and the article is gripped by a gripper disposed at the end of the robot.
Therefore, with the six-degree-of-freedom grabbing detection algorithm based on a semantic segmentation network of this structure, the grabbing parameter arrays are converted into grabbing parameter semantic label images, turning the grabbing task into a semantic segmentation problem. Compared with taking a depth map or a three-dimensional point cloud as input, taking the RGBD image as input is more stable and addresses large-scale six-degree-of-freedom grabbing pose detection in dense and cluttered scenes, making the network model more stable and robust; predicting the grabbing pose by classification rather than regression after the transformation makes the prediction result more accurate.
Finally, it should be noted that: the above embodiments are only intended to illustrate the technical solution of the present invention and not to limit the same, and although the present invention is described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that: modifications and equivalents may be made to the invention without departing from the spirit and scope of the invention.

Claims (7)

1. A six-degree-of-freedom grabbing detection algorithm based on a semantic segmentation network, characterized by comprising the following steps:
step 1: acquiring a six-degree-of-freedom grabbing detection data set, and converting the grabbing parameter arrays into grabbing parameter semantic label images by the designed data conversion method;
step 2: scaling, normalizing, and restoring the classifications of the RGB images in the data set and the converted grabbing parameter semantic label images, and dividing them into a training set and a test set;
step 3: constructing a six-degree-of-freedom grabbing detection semantic segmentation network, and training the network with the processed data;
step 4: feeding RGB image data into the trained six-degree-of-freedom grabbing detection semantic segmentation network to obtain predicted grabbing parameter semantic images;
step 5: converting the predicted grabbing parameter semantic label images into grabbing parameters relative to the camera coordinate system through the proposed post-processing operations;
step 6: driving a robot configured with an end effector to grab the object in the scene using the grabbing parameters obtained by post-processing.
2. The six-degree-of-freedom grabbing detection algorithm based on the semantic segmentation network as claimed in claim 1, wherein: the step 1 specifically comprises:
step 1.1: acquiring a GraspNet-1Billion six-degree-of-freedom grabbing detection data set, reading and converting the grabbing labels in the data set;
step 1.2: converting the specific numerical values of the grabbing depth and the opening and closing width in the processed grabbing label into classifications;
step 1.3: converting the rotation attitude matrix in the processed grabbing label into a rotation vector, decomposing the rotation vector into a unit direction vector and an angle of rotation around the unit direction vector, and obtaining the classifications of the unit direction vector and of the angle of rotation around it through the classification matching templates;
step 1.4: and converting the grabbing parameters into grabbing parameter semantic label images by constructing three-dimensional point cloud of the grabbing parameters.
3. The algorithm for six-degree-of-freedom grabbing detection based on the semantic segmentation network as claimed in claim 2, wherein: the step 2 specifically comprises the following steps:
step 2.1: the RGB images in the data set and the captured parameter semantic label images obtained through conversion are reduced to 224 x 224 pixel sizes from the original 720 x 1280 pixel sizes through a neighboring point sampling mode, so that the model operation speed is increased, and the model is adaptive to a network structure;
step 2.2: dividing the data of the RGB image by 255 for normalization; converting the label images of the captured parameters into actual classification numerical values from the classification in the images;
step 2.3: and randomly scrambling the data in the first 100 scenes for training in the data set according to the proportion of 8:2, and dividing the data into training data and test data.
4. The algorithm for six-degree-of-freedom grabbing detection based on the semantic segmentation network as claimed in claim 3, wherein: the step 3 specifically comprises the following steps:
step 3.1: constructing a six-degree-of-freedom capture detection semantic segmentation network;
step 3.2: setting an optimizer, a loss function, a batch for loading data and iteration times;
step 3.3: and loading the training data and the test data after the data processing by using an iterator to train the model.
5. The algorithm for six-degree-of-freedom grabbing detection based on the semantic segmentation network as claimed in claim 4, wherein: the step 4 specifically comprises the following steps:
step 4.1: carrying out data preprocessing operation including picture reduction and normalization on the RGB image;
step 4.2: and sending the RGB image subjected to data preprocessing into a six-degree-of-freedom capture detection semantic segmentation network which completes training to obtain a predicted capture parameter semantic image.
6. The six-degree-of-freedom grabbing detection algorithm based on the semantic segmentation network as claimed in claim 5, wherein: the step 5 specifically comprises the following steps:
step 5.1: restoring the predicted 224 x 224 pixel grabbing parameter semantic images to the original 720 x 1280 pixel size by neighboring-point sampling;
step 5.2: filtering out the grabbing parameters that represent the background classification;
step 5.3: restoring the screened predicted grabbing parameters from classifications to the corresponding actual numerical values through a dictionary or an array;
step 5.4: indexing, from the input depth image, the depth at the pixel position of each grabbing point, and then calculating the three-dimensional coordinates of the grabbing point in the camera coordinate system from the depth value;
step 5.5: calculating the rotation vector from its unit direction vector r and the angle θ of rotation around it, and converting the rotation vector into a rotation attitude matrix R to obtain all the grabbing parameters G = [X, Y, Z, R, d, w] in the camera coordinate system;
where (X, Y, Z) represents three-dimensional coordinates of the grasping point in the camera coordinate system, R represents a 3 × 3 rotation posture matrix, and d and w are a grasping depth of the gripper near the object in the approaching direction and an opening width of the gripper.
7. The six-degree-of-freedom grabbing detection algorithm based on the semantic segmentation network as claimed in claim 6, wherein: the step 6 specifically comprises the following steps:
step 6.1: converting the predicted grabbing parameters relative to the camera coordinate system into grabbing parameters relative to the world coordinate system through a homogeneous transformation matrix of a depth camera used in an actual environment in the world coordinate system;
step 6.2: planning the end-effector pose and the motion path of the robot from the predicted grabbing parameters through inverse kinematics calculation;
step 6.3: the robot is driven to the predicted gripping pose, and the article is gripped by a gripper disposed at the end of the robot.
CN202210817429.4A 2022-07-12 2022-07-12 Six-degree-of-freedom grabbing detection method based on semantic segmentation network Active CN115187781B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210817429.4A CN115187781B (en) 2022-07-12 2022-07-12 Six-degree-of-freedom grabbing detection method based on semantic segmentation network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210817429.4A CN115187781B (en) 2022-07-12 2022-07-12 Six-degree-of-freedom grabbing detection method based on semantic segmentation network

Publications (2)

Publication Number Publication Date
CN115187781A true CN115187781A (en) 2022-10-14
CN115187781B CN115187781B (en) 2023-05-30

Family

ID=83517855

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210817429.4A Active CN115187781B (en) 2022-07-12 2022-07-12 Six-degree-of-freedom grabbing detection method based on semantic segmentation network

Country Status (1)

Country Link
CN (1) CN115187781B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116664843A (en) * 2023-06-05 2023-08-29 北京信息科技大学 Residual fitting grabbing detection network based on RGBD image and semantic segmentation

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108665496A (en) * 2018-03-21 2018-10-16 浙江大学 A kind of semanteme end to end based on deep learning is instant to be positioned and builds drawing method
CN111319044A (en) * 2020-03-04 2020-06-23 达闼科技(北京)有限公司 Article grabbing method and device, readable storage medium and grabbing robot
CN111383263A (en) * 2018-12-28 2020-07-07 阿里巴巴集团控股有限公司 System, method and device for grabbing object by robot
US20210382497A1 (en) * 2019-02-26 2021-12-09 Imperial College Of Science, Technology And Medicine Scene representation using image processing
CN114029941A (en) * 2021-09-22 2022-02-11 中国科学院自动化研究所 Robot grabbing method and device, electronic equipment and computer medium
CN114140418A (en) * 2021-11-26 2022-03-04 上海交通大学宁波人工智能研究院 Seven-degree-of-freedom grabbing posture detection method based on RGB image and depth image

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108665496A (en) * 2018-03-21 2018-10-16 浙江大学 A kind of semanteme end to end based on deep learning is instant to be positioned and builds drawing method
CN111383263A (en) * 2018-12-28 2020-07-07 阿里巴巴集团控股有限公司 System, method and device for grabbing object by robot
US20210382497A1 (en) * 2019-02-26 2021-12-09 Imperial College Of Science, Technology And Medicine Scene representation using image processing
CN111319044A (en) * 2020-03-04 2020-06-23 达闼科技(北京)有限公司 Article grabbing method and device, readable storage medium and grabbing robot
CN114029941A (en) * 2021-09-22 2022-02-11 中国科学院自动化研究所 Robot grabbing method and device, electronic equipment and computer medium
CN114140418A (en) * 2021-11-26 2022-03-04 上海交通大学宁波人工智能研究院 Seven-degree-of-freedom grabbing posture detection method based on RGB image and depth image

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116664843A (en) * 2023-06-05 2023-08-29 北京信息科技大学 Residual fitting grabbing detection network based on RGBD image and semantic segmentation
CN116664843B (en) * 2023-06-05 2024-02-20 北京信息科技大学 Residual fitting grabbing detection network based on RGBD image and semantic segmentation

Also Published As

Publication number Publication date
CN115187781B (en) 2023-05-30

Similar Documents

Publication Publication Date Title
Khosla et al. Enhancing performance of deep learning models with different data augmentation techniques: A survey
Zhang et al. Loop closure detection for visual SLAM systems using convolutional neural network
Yu et al. A vision-based robotic grasping system using deep learning for 3D object recognition and pose estimation
Mobahi et al. Deep learning from temporal coherence in video
CN109815956B (en) License plate character recognition method based on self-adaptive position segmentation
CN111523486B (en) Mechanical arm grabbing detection method based on improved CenterNet
Goh et al. Mars terrain segmentation with less labels
CN115187781B (en) Six-degree-of-freedom grabbing detection method based on semantic segmentation network
CN107798329B (en) CNN-based adaptive particle filter target tracking method
Manzoor et al. Comparison of object recognition approaches using traditional machine vision and modern deep learning techniques for mobile robot
Salem et al. Semantic image inpainting using self-learning encoder-decoder and adversarial loss
CN116664843B (en) Residual fitting grabbing detection network based on RGBD image and semantic segmentation
Singh et al. Wavelet based histogram of oriented gradients feature descriptors for classification of partially occluded objects
CN116311345A (en) Transformer-based pedestrian shielding re-recognition method
CN114723010A (en) Automatic learning enhancement method and system for asynchronous event data
Tong et al. MBVCNN: joint convolutional neural networks method for image recognition
Hossain et al. A faster r-cnn approach for partially occluded robot object recognition
Cheng et al. Skeleton-based Action Recognition with Multi-scale Spatial-temporal Convolutional Neural Network
Ramos et al. A natural feature representation for unstructured environments
Deshapriya et al. Vec2Instance: Parameterization for deep instance segmentation
Wu et al. Real-Time Pixel-Wise Grasp Detection Based on RGB-D Feature Dense Fusion
Gepperth Object detection and feature base learning with sparse convolutional neural networks
Rahman et al. A Quantitative Analysis of Basic vs. Deep Learning-based Image Data Augmentation Techniques
Li et al. Affordance Action Learning with State Trajectory Representation for Robotic Manipulation
da Silva Mendes et al. Vegetation classification using DeepLabv3+ and YOLOv5

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant