CN117549307B - Robot vision grabbing method and system in unstructured environment - Google Patents

Robot vision grabbing method and system in unstructured environment

Info

Publication number
CN117549307B
CN117549307B (application CN202311740217.1A)
Authority
CN
China
Prior art keywords
grabbing
channel
convolution
information
pose
Prior art date
Legal status
Active
Application number
CN202311740217.1A
Other languages
Chinese (zh)
Other versions
CN117549307A (en)
Inventor
高赫佳
赵俊杰
胡钜奇
孙长银
Current Assignee
Anhui University
Original Assignee
Anhui University
Priority date
Filing date
Publication date
Application filed by Anhui University
Priority to CN202311740217.1A
Publication of CN117549307A
Application granted
Publication of CN117549307B
Status: Active


Classifications

    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 Programme-controlled manipulators
    • B25J9/16 Programme controls
    • B25J9/1602 Programme controls characterised by the control system, structure, architecture
    • B25J9/161 Hardware, e.g. neural networks, fuzzy logic, interfaces, processor
    • B25J9/1656 Programme controls characterised by programming, planning systems for manipulators
    • B25J9/1664 Programme controls characterised by motion, path, trajectory planning
    • B25J9/1694 Programme controls characterised by use of sensors other than normal servo-feedback from position, speed or acceleration sensors, perception control, multi-sensor controlled systems, sensor fusion
    • B25J9/1697 Vision controlled systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Automation & Control Theory (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a robot vision grabbing method and system in an unstructured environment, belonging to the technical field of intelligent robot control. The method comprises: acquiring visual information of the external environment and the target object with a depth camera and preprocessing it to generate an image to be processed; inputting the image into a pre-constructed GARDSCN network model, which outputs grabbing confidence information, grabbing angle information and grabbing width information for the target object; selecting the grabbing angle and grabbing width corresponding to the pixel with the highest grabbing confidence to form the optimal grabbing pose; converting the two-dimensional grabbing pose into a three-dimensional target grabbing pose in the robot coordinate system; and controlling the robot to reach the target position along a planned motion trajectory to execute grabbing. The GARDSCN network model of the detection and reasoning module in the grabbing system focuses on both the spatial and channel information of the target object's features, which suppresses interference from background information in the working scene, and the introduced residual structure deepens the network without causing vanishing gradients.

Description

Robot vision grabbing method and system in unstructured environment
Technical Field
The invention belongs to the technical field of intelligent control of robots, and particularly relates to a robot vision grabbing method and system in an unstructured environment.
Background
Currently, although many tasks are accomplished by highly sophisticated robots, these robots are limited to performing a single, fixed task in a particular environment. When the working scene or the working object changes, there is still considerable room for improvement in task completion and safety. Robot grabbing in a structured environment is relatively simple to study, since the motion trajectory of the gripper only has to be designed for a known model of the object to be grabbed. Although this design approach is suitable for grabbing a single fixed target, its adaptability is limited in changeable and complex practical application scenarios. When the object to be grabbed is unfamiliar, traditional machine learning methods use techniques such as support vector machines to extract, from a knowledge base, the mapping between image features of the object and the grabbing position and attitude of the gripper. Such methods can transfer previously learned grabbing experience to unfamiliar objects without an accurate model of the object being known in advance. However, they typically require specifically designed features, such as surface texture and geometry, which demands significant effort. In addition, because human knowledge of objects is limited, these methods are not robust enough. They are therefore typically used for grabbing a particular class of object, such as picking and inspecting similarly shaped products on a factory production line.
Deep learning methods avoid the laborious hand-crafting of features required by traditional machine learning: they automatically extract deep image features and map them to the grabbing pose of the gripper, and the weight sharing and sparse connectivity of convolutional networks allow deeper image features to be learned, giving stronger generalization for grabbing detection of unknown targets. However, existing convolutional neural networks have a large number of parameters, consume substantial computing resources, and seriously slow down grabbing detection, so they cannot meet real-time requirements in the real world.
Disclosure of Invention
The invention aims to provide a robot vision grabbing method and system in an unstructured environment, so as to solve the problems in the background technology.
The invention realizes the above purpose through the following technical scheme:
A robot vision grabbing method in an unstructured environment, applied to a robot that performs grabbing, with a depth camera fixed at the end of the robot gripper, the method comprising:
S1, acquiring visual information of an external environment and a target object based on a depth camera, preprocessing the visual information, and generating an image set to be processed;
s2, receiving the image set to be processed, inputting the image set to be processed into a pre-constructed convolutional neural network, outputting grabbing confidence information, grabbing angle information and grabbing width information of a target object, and determining an optimal grabbing pose according to the grabbing confidence information;
and S3, receiving the optimal grabbing pose, converting it into a target pose of the gripper end in the robot coordinate system, and, by combining an inverse kinematics solving method and a trajectory planning algorithm, controlling the robot to move to the target pose along a pre-planned motion trajectory to execute grabbing.
As a further optimization scheme of the invention, the visual information comprises an RGB image of the target object and a corresponding depth image; the preprocessing comprises: completing, denoising, clipping and normalizing the RGB image and the depth image, the generated images to be processed having a size of 224 × 224.
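As a rough illustration of the preprocessing described above (completion, denoising, cropping and normalization to 224 × 224), the following is a minimal sketch using OpenCV and NumPy; the specific filters, the hole-filling strategy and the normalization constants are assumptions, since the patent does not specify them.

```python
import cv2
import numpy as np

def preprocess(rgb, depth, out_size=224):
    """Complete, denoise, crop and normalize an RGB-D pair (illustrative choices)."""
    # Fill missing depth readings (holes typically appear as zeros on depth cameras).
    hole_mask = (depth == 0).astype(np.uint8)
    depth = cv2.inpaint(depth.astype(np.float32), hole_mask, 3, cv2.INPAINT_NS)
    # Light denoising of the color image.
    rgb = cv2.GaussianBlur(rgb, (3, 3), 0)
    # Center-crop to a square region, then resize to the network input size.
    h, w = depth.shape
    s = min(h, w)
    top, left = (h - s) // 2, (w - s) // 2
    rgb = cv2.resize(rgb[top:top + s, left:left + s], (out_size, out_size))
    depth = cv2.resize(depth[top:top + s, left:left + s], (out_size, out_size))
    # Normalize both modalities to a small, zero-centered range.
    rgb = rgb.astype(np.float32) / 255.0 - 0.5
    depth = np.clip(depth - depth.mean(), -1.0, 1.0)
    return rgb, depth
```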
As a further optimization scheme of the present invention, step S2 includes:
S201, extracting features of the image set to be processed through three convolution layers, with regularization and a ReLU function added after each convolution layer, to form a feature map;
S202, after the three convolution layers, introducing a compression-and-excitation module so that the convolutional neural network attends to the important feature channels in the feature map and the channel contributions are re-weighted: spatial feature compression is first performed on the input feature map via global average pooling; channel feature learning is then performed on the compressed features through fully connected (FC) layers to obtain channel attention weights; finally, the channel attention weights are multiplied channel by channel onto the original input feature map, outputting a feature map with channel attention;
S203, extracting refined features through a depthwise separable convolution layer to reduce the number of parameters;
S204, introducing two visual attention residual multidimensional convolution modules: a depthwise convolution first captures planar spatial feature information on each input channel, then a 1 × 1 convolution adjusts the number of channels without changing the feature map size and integrates the channel information; global max pooling and global average pooling are then applied in the channel dimension to create two 1-dimensional feature vectors, and weights are assigned to them by a 1 × 1 convolution, strengthening the feature information of the object to be grabbed in the channel domain; next, max pooling and average pooling compress the channel-domain features in the spatial dimension to generate a 2-dimensional feature map, and a 7 × 7 convolution assigns the spatial feature weights; finally, a residual structure is introduced so that the input signal can be passed directly to subsequent layers, avoiding the vanishing gradients caused by deepening the network;
S205: restoring the original image size through three deconvolution layers (a structural sketch of steps S201-S205 is given below).
In step S2, the convolutional neural network outputs a grabbing confidence, a grabbing angle and a grabbing width, where the grabbing confidence is a scalar in [0, 1], the grabbing angle is recovered from the corresponding angle output of the network, and the grabbing width is expressed in pixels; the optimal grabbing pose is expressed by the following formula:
Z_i = (G_i, Θ_i, W_i, Q_i)
wherein G_i = (p, q) denotes the coordinates of the two-dimensional grasp point in the image coordinate system; Θ_i is the rotation angle of the end jaw about the depth camera coordinate system, restricted to a fixed angular range; W_i is the pixel width of the jaw opening in the image coordinate system, taking values in [0, W_max], where W_max is the maximum opening width of the end jaw; Q_i is the grabbing quality score.
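A minimal sketch of how the optimal grabbing pose Z_i = (G_i, Θ_i, W_i, Q_i) can be read off the three output maps, assuming they are NumPy arrays of equal shape; the function name and dictionary keys are illustrative.

```python
import numpy as np

def select_best_grasp(q_map, angle_map, width_map):
    """Pick the pixel with the highest grabbing confidence and read off the
    angle and width predicted at that pixel."""
    p, q = np.unravel_index(np.argmax(q_map), q_map.shape)  # image coordinates of the best point
    return {
        "point": (p, q),                  # G_i
        "angle": float(angle_map[p, q]),  # Theta_i
        "width": float(width_map[p, q]),  # W_i, in pixels
        "quality": float(q_map[p, q]),    # Q_i
    }
```

In the pipeline of fig. 1, the selected point, angle and width are then handed to the grabbing planning module together with the depth value at that pixel.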
As a further optimization scheme of the present invention, step S3 includes:
S301, receiving the optimal grabbing pose and converting the two-dimensional feasible grabbing representation into the target pose of the gripper end in the robot base coordinate system, where the grabbing pose in the robot coordinate system is defined as:
Z_t = (x, y, z, φ_t, W_t, Q_t)
wherein (x, y, z) is the center position of the end jaw in the Cartesian coordinate system, φ_t is the rotation of the jaw about the z-axis, W_t is the actual opening width of the jaw, and Q_t is the grabbing quality score;
S302, converting the grabbing pose from the image coordinate system to the robot coordinate system by applying the depth camera intrinsic parameters T_n and the rotation matrix T_x through the following transformation:
Z_t = T_x(T_n(Z_i))
S303, according to the converted three-dimensional optimal grabbing pose, planning a safe, collision-free motion trajectory with a motion planner through the integrated ROS interface, and having the motion controller drive the robot to the target position to execute the grab-and-place task while its motion state is monitored (a coordinate-transformation sketch for step S302 is given below).
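A hedged sketch of the S302 conversion: the grasp pixel is lifted to a 3-D camera-frame point using the depth value and a pinhole intrinsic matrix (the T_n of the text), then transformed into the robot base frame with a 4 × 4 hand-eye transform standing in for T_x. The argument names and the assumption of a simple pinhole model are illustrative.

```python
import numpy as np

def pixel_to_robot(grasp, depth_image, K, T_cam_to_base):
    """Lift the 2-D grasp point to 3-D with the depth value and camera intrinsics K,
    then move it into the robot base frame with the hand-eye transform T_cam_to_base."""
    v, u = grasp["point"]                  # row, column of the grasp pixel
    z = float(depth_image[v, u])           # metric depth at that pixel
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    p_cam = np.array([(u - cx) * z / fx,   # back-project through the pinhole model
                      (v - cy) * z / fy,
                      z, 1.0])
    p_base = T_cam_to_base @ p_cam         # homogeneous 4x4 transform into the base frame
    return p_base[:3]                      # (x, y, z) component of Z_t
```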
A robotic vision gripping system in an unstructured environment for implementing a gripping method as described above, the system comprising:
the visual perception module is used for acquiring visual information of an external environment and a target object based on the depth camera, preprocessing the visual information and generating an image set to be processed;
The detection and reasoning module is used for receiving the image set to be processed, inputting the image set to be processed into a pre-constructed convolutional neural network, outputting grabbing confidence information, grabbing angle information and grabbing width information of a target object, and determining an optimal grabbing pose according to the grabbing confidence information;
The grabbing planning module is used for receiving the optimal grabbing pose, converting it into a target pose of the gripper end in the robot coordinate system, and, by combining an inverse kinematics solving method and a trajectory planning algorithm, controlling the robot to move to the target pose along a pre-planned motion trajectory to execute grabbing.
As a further optimization scheme of the invention, the convolutional neural network comprises a convolutional layer, a compression-excitation module, a depth separable convolutional layer, a visual attention residual multidimensional convolutional module and a deconvolution layer;
The convolution layers are used to extract features of the image set to be processed through three convolution layers, with regularization and a ReLU function added after each convolution layer to form a feature map;
The compression-excitation module is introduced after the three convolution layers so that the convolutional neural network attends to the important feature channels in the feature map and the channel contributions are re-weighted: spatial feature compression is first performed on the input feature map via global average pooling; channel feature learning is then performed on the compressed features through fully connected (FC) layers to obtain channel attention weights; finally, the channel attention weights are multiplied channel by channel onto the original input feature map, outputting a feature map with channel attention;
The depthwise separable convolution layer is used to extract refined features from the feature map to reduce the number of parameters;
The visual attention residual multidimensional convolution module first captures planar spatial feature information on each input channel through a depthwise convolution, then adjusts the number of channels with a 1 × 1 convolution without changing the feature map size and integrates the channel information; global max pooling and global average pooling are then applied in the channel dimension to create two 1-dimensional feature vectors, and weights are assigned to them by a 1 × 1 convolution, strengthening the feature information of the grabbed object in the channel domain; next, max pooling and average pooling compress the channel-domain features in the spatial dimension to generate a 2-dimensional feature map, and a 7 × 7 convolution assigns the spatial feature weights; finally, a residual structure is introduced so that the input signal can be passed directly to subsequent layers, avoiding the vanishing gradients caused by deepening the network;
The deconvolution layer is used to restore the original size of the image.
The invention has the beneficial effects that:
(1) The invention breaks the limitation of grabbing a single fixed target in a single scene; the designed grabbing method generalizes to objects of any type, can be applied in changeable and complex practical application scenarios, and has wide applicability.
(2) The invention addresses the difficulty of guaranteeing real-time performance in real scenes: the detection and reasoning module of the designed grabbing system generates grabbing poses quickly, meeting grabbing success-rate requirements while ensuring real-time performance.
(3) The GARDSCN network model of the detection and reasoning module in the grabbing system focuses on both the spatial and channel information of the target object's features, suppresses interference from background information in the working scene, and can effectively grab transparent objects. The introduced residual structure deepens the network without causing vanishing gradients.
(4) The grabbing planning module in the grabbing system integrates the ROS interface and can be deployed on any robot to perform real grabbing tasks. The motion planner plans the motion trajectory, and the motion controller drives the robot to complete the whole grabbing task. The system can perform grabbing tasks in both single-object scenes and complex multi-object scenes.
Drawings
FIG. 1 is a schematic diagram of an embodiment of a visual grabbing method in the present invention;
FIG. 2 is a schematic diagram of a visual intelligent grabbing system according to the present invention;
FIG. 3 is a schematic diagram of the adaptive residual depth separable convolutional neural network architecture in the present invention;
FIG. 4 is a schematic diagram of a visual attention residual multidimensional convolution module in the present invention;
FIG. 5 is a schematic diagram of a robotic grasping procedure in accordance with the present invention;
FIG. 6 is a diagram of a robot gripping scene in an embodiment of the invention;
FIG. 7 is a schematic diagram of a robot in a single object scene according to an embodiment of the present invention;
fig. 8 is a schematic diagram of grabbing multiple objects in a complex scene of a robot in an embodiment of the invention.
Detailed Description
The present application will be described in further detail with reference to the accompanying drawings. It is to be understood that the following detailed description is only intended to further illustrate the application and is not to be construed as limiting its scope, since those skilled in the art can make various insubstantial modifications and adaptations of the application in light of the foregoing disclosure.
Example 1
As shown in fig. 1, this embodiment provides a robot vision grabbing method in an unstructured environment, applied to a robot that performs grabbing, with a depth camera fixed at the end of the robot gripper, the method comprising:
S1, acquiring visual information of an external environment and a target object based on a depth camera, preprocessing the visual information, and generating an image set to be processed;
s2, receiving an image set to be processed, inputting the image set to be processed into a pre-constructed convolutional neural network, outputting grabbing confidence information, grabbing angle information and grabbing width information of a target object, and determining an optimal grabbing pose according to the grabbing confidence information;
And S3, receiving the optimal grabbing pose, converting it into a target pose of the gripper end in the robot coordinate system, and, by combining an inverse kinematics solving method and a trajectory planning algorithm, controlling the robot to move to the target pose along a pre-planned motion trajectory to execute grabbing.
As a further implementation, the visual information includes an RGB image of the target object and a corresponding depth image; the preprocessing comprises: completing, denoising, clipping and normalizing the RGB image and the depth image, the generated images to be processed having a size of 224 × 224.
As a further implementation, step S2 includes:
S201, extracting features of the image set to be processed through three convolution layers, with regularization and a ReLU function added after each convolution layer, to form a feature map;
S202, after the three convolution layers, introducing a compression-and-excitation module so that the convolutional neural network attends to the important feature channels in the feature map and the channel contributions are re-weighted: spatial feature compression is first performed on the input feature map via global average pooling; channel feature learning is then performed on the compressed features through fully connected (FC) layers to obtain channel attention weights; finally, the channel attention weights are multiplied channel by channel onto the original input feature map, outputting a feature map with channel attention;
S203, extracting refined features through a depthwise separable convolution layer to reduce the number of parameters;
S204, introducing two visual attention residual multidimensional convolution modules: a depthwise convolution first captures planar spatial feature information on each input channel, then a 1 × 1 convolution adjusts the number of channels without changing the feature map size and integrates the channel information; global max pooling and global average pooling are then applied in the channel dimension to create two 1-dimensional feature vectors, and weights are assigned to them by a 1 × 1 convolution, strengthening the feature information of the object to be grabbed in the channel domain; next, max pooling and average pooling compress the channel-domain features in the spatial dimension to generate a 2-dimensional feature map, and a 7 × 7 convolution assigns the spatial feature weights; finally, a residual structure is introduced so that the input signal can be passed directly to subsequent layers, avoiding the vanishing gradients caused by deepening the network;
S205: restoring the original image size through three deconvolution layers.
As a further implementation, in step S2, the convolutional neural network outputs a grabbing confidence, a grabbing angle and a grabbing width, where the grabbing confidence is a scalar in [0, 1], the grabbing angle is recovered from the corresponding angle output of the network, and the grabbing width is expressed in pixels; the optimal grabbing pose is expressed by the following formula:
Z_i = (G_i, Θ_i, W_i, Q_i)
wherein G_i = (p, q) denotes the coordinates of the two-dimensional grasp point in the image coordinate system; Θ_i is the rotation angle of the end jaw about the depth camera coordinate system, restricted to a fixed angular range; W_i is the pixel width of the jaw opening in the image coordinate system, taking values in [0, W_max], where W_max is the maximum opening width of the end jaw; Q_i is the grabbing quality score.
As a further implementation, step S3 includes:
S301, receiving the optimal grabbing pose and converting the two-dimensional feasible grabbing representation into the target pose of the gripper end in the robot base coordinate system, where the grabbing pose in the robot coordinate system is defined as:
Z_t = (x, y, z, φ_t, W_t, Q_t)
wherein (x, y, z) is the center position of the end jaw in the Cartesian coordinate system, φ_t is the rotation of the jaw about the z-axis, W_t is the actual opening width of the jaw, and Q_t is the grabbing quality score;
S302, converting the grabbing pose from the image coordinate system to the robot coordinate system by applying the depth camera intrinsic parameters T_n and the rotation matrix T_x through the following transformation:
Z_t = T_x(T_n(Z_i))
S303, according to the converted three-dimensional optimal grabbing pose, planning a safe, collision-free motion trajectory with a motion planner through the integrated ROS interface, and having the motion controller drive the robot to the target position to execute the grab-and-place task while its motion state is monitored (a minimal ROS/MoveIt sketch of this step is given below).
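For the S303 planning step, a generic ROS/MoveIt sketch follows; the patent only states that a motion planner is used through the integrated ROS interface, so the moveit_commander calls, the group name "manipulator" and the example pose values are assumptions rather than the patented implementation.

```python
import sys
import rospy
import moveit_commander
from geometry_msgs.msg import Pose

# Initialize the ROS node and the MoveIt commander for the arm's planning group.
moveit_commander.roscpp_initialize(sys.argv)
rospy.init_node("grasp_planner_sketch")
arm = moveit_commander.MoveGroupCommander("manipulator")  # group name is a placeholder

# Build the target pose from the converted three-dimensional grabbing pose Z_t.
target = Pose()
target.position.x, target.position.y, target.position.z = 0.45, 0.10, 0.25  # example values
target.orientation.w = 1.0  # real use: orientation derived from the rotation phi_t about z

arm.set_pose_target(target)
success = arm.go(wait=True)   # plan a collision-free trajectory and execute it
arm.stop()                    # make sure there is no residual motion
arm.clear_pose_targets()
```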
Example 2
Based on the same inventive concept, this embodiment also provides a vision grabbing system corresponding to the vision grabbing method described above; since the system solves the problem on a principle similar to that of the vision grabbing method in the embodiment of the present disclosure, its implementation may refer to the implementation of the method, and repeated details are omitted.
As shown in fig. 2, the present embodiment proposes a robot vision gripping system in an unstructured environment, for implementing the above vision gripping method, where the system includes:
the visual perception module is used for acquiring visual information of an external environment and a target object based on the depth camera, preprocessing the visual information and generating an image set to be processed;
The detection and reasoning module is used for receiving the image set to be processed, inputting the image set to be processed into a pre-constructed convolutional neural network, outputting grabbing confidence information, grabbing angle information and grabbing width information of the target object, and determining the optimal grabbing pose according to the grabbing confidence information;
The grabbing planning module is used for receiving the optimal grabbing pose, converting it into a target pose of the gripper end in the robot coordinate system, and, by combining an inverse kinematics solving method and a trajectory planning algorithm, controlling the robot to move to the target pose along a pre-planned motion trajectory to execute grabbing.
The vision perception module consists of a machine vision unit, a RealSense D435i camera, fixed at the end of the robot's fingers so that it moves with the robot and updates the camera's visual information in real time. The visual information includes an RGB image of the object to be grabbed and the corresponding depth image. After image completion, denoising, clipping, normalization and similar operations, the preprocessed RGB and depth image information is output at a size of 224 × 224, matching the network input format of the subsequent detection and reasoning module.
The detection and reasoning module obtains the RGB image and its corresponding depth image from the vision perception module and either detects the best feasible grabbing representation or outputs a no-feasible-grab signal declaring that the current object cannot be grabbed. This module contains a newly designed generative adaptive residual depth-separable convolutional neural network (GARDSCN), in which a new module, the visual attention residual multidimensional convolution (VARMC) module, is embedded; the network architecture is shown in fig. 3.
As a further implementation, the convolutional neural network includes a convolutional layer, a compression-excitation module, a depth separable convolutional layer, a visual attention residual multidimensional convolutional module, and a deconvolution layer;
The convolution layers are used to extract features of the image set to be processed through three convolution layers, with regularization and a ReLU function added after each convolution layer to form a feature map;
The compression-excitation module is introduced after the three convolution layers so that the convolutional neural network attends to the important feature channels in the feature map and the channel contributions are re-weighted: spatial feature compression is first performed on the input feature map via global average pooling; channel feature learning is then performed on the compressed features through fully connected (FC) layers to obtain channel attention weights; finally, the channel attention weights are multiplied channel by channel onto the original input feature map, outputting a feature map with channel attention;
The depthwise separable convolution layer is used to extract refined features from the feature map to reduce the number of parameters;
The visual attention residual multidimensional convolution module first captures planar spatial feature information on each input channel through a depthwise convolution, then adjusts the number of channels with a 1 × 1 convolution without changing the feature map size and integrates the channel information; global max pooling and global average pooling are then applied in the channel dimension to create two 1-dimensional feature vectors, and weights are assigned to them by a 1 × 1 convolution, strengthening the feature information of the grabbed object in the channel domain; next, max pooling and average pooling compress the channel-domain features in the spatial dimension to generate a 2-dimensional feature map, and a 7 × 7 convolution assigns the spatial feature weights; finally, a residual structure is introduced so that the input signal can be passed directly to subsequent layers, avoiding the vanishing gradients caused by deepening the network;
The deconvolution layer is used to recover the original size of the image.
Fig. 3 shows the proposed generative model architecture. The network accepts multi-channel input data (RGB, RGB-D, or depth-only). The input image passes through three convolution layers for feature extraction. Regularization and a ReLU activation function are added to each convolution layer to increase the expressiveness and generalization capability of the model, and the output size is then 56 × 56 × 128. However, extracting image features with convolution layers alone does not explicitly model the relationships between feature channels, so some channels contribute little to the grabbing task. Therefore, after the three convolutions, a compression-excitation module (SEN) is introduced to make the network focus on important feature channels; it adaptively learns the importance of each channel and weights the channel contributions in the feature map. The SEN module applies average pooling to the features extracted by the convolution layers and then uses two fully connected layers to reduce and restore the feature dimension: the first fully connected layer reduces the feature dimension to C/r and a ReLU activation adds nonlinearity; the second fully connected layer restores the original dimension. The resulting weights are then multiplied channel by channel onto the original feature channels, outputting 56 × 56 × 128 features. To reduce the amount of computation and the number of network parameters, a depthwise separable convolution is added after the SEN. The depthwise separable convolution consists of a 3 × 3 depthwise convolution and a 1 × 1 convolution. The 3 × 3 depthwise convolution is applied on each input channel, so spatial features can be extracted per channel, and the 1 × 1 convolution performs cross-channel feature interaction so that information from different channels is fused.
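A compact PyTorch sketch of the two components described in this paragraph, assuming a reduction ratio r for the SE bottleneck; the class names and the value of r are illustrative.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Compression-excitation: global average pooling, two fully connected layers
    (reduce to C/r, then restore), sigmoid weights multiplied back onto the channels."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r), nn.ReLU(),
            nn.Linear(channels // r, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))   # squeeze: global average pool -> (b, c)
        return x * w.view(b, c, 1, 1)     # excite: reweight each channel

class DepthwiseSeparable(nn.Module):
    """3x3 depthwise convolution per channel followed by a 1x1 pointwise convolution
    that fuses information across channels."""
    def __init__(self, channels):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.pw = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        return torch.relu(self.pw(self.dw(x)))
```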
However, as the number of network layers increases, vanishing gradients become unavoidable, so the application develops a new visual attention residual multidimensional convolution (VARMC) module, which enriches features in the channel dimension, reduces the computation of the whole network and speeds it up, enhances the saliency of the grabbed object in the scene, and effectively reduces interference from background information. The VARMC module is shown in fig. 4.
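A possible PyTorch reading of the VARMC block of fig. 4, following the textual description (depthwise plus 1 × 1 convolution, channel attention from global max/average pooling with 1 × 1 convolutions, spatial attention from a 7 × 7 convolution over per-pixel max/average maps, and a residual shortcut). Sharing one 1 × 1-convolution MLP between the two pooled descriptors and summing them is an assumption borrowed from CBAM-style attention, not a detail stated in the patent.

```python
import torch
import torch.nn as nn

class VARMC(nn.Module):
    """Visual attention residual multidimensional convolution block (sketch)."""
    def __init__(self, channels, r=8):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.pw = nn.Conv2d(channels, channels, 1)
        # Channel attention: shared 1x1-conv MLP applied to pooled channel descriptors.
        self.channel_mlp = nn.Sequential(
            nn.Conv2d(channels, channels // r, 1), nn.ReLU(),
            nn.Conv2d(channels // r, channels, 1),
        )
        # Spatial attention: fuse the 2-channel (max, avg) map with a 7x7 convolution.
        self.spatial = nn.Conv2d(2, 1, 7, padding=3)

    def forward(self, x):
        identity = x
        y = torch.relu(self.pw(self.dw(x)))            # depthwise + pointwise convolution
        # Channel attention from globally max- and average-pooled vectors.
        ca = torch.sigmoid(
            self.channel_mlp(torch.amax(y, dim=(2, 3), keepdim=True))
            + self.channel_mlp(torch.mean(y, dim=(2, 3), keepdim=True))
        )
        y = y * ca
        # Spatial attention from per-pixel max/average maps across channels.
        sa = torch.sigmoid(self.spatial(torch.cat(
            [y.amax(dim=1, keepdim=True), y.mean(dim=1, keepdim=True)], dim=1)))
        y = y * sa
        return torch.relu(y + identity)                # residual shortcut against vanishing gradients
```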
The above method is further described in connection with actual processing routines.
To verify the effectiveness of the proposed method, it is validated on a real robot, with grabbing experiments carried out in both single-object scenes and complex multi-object scenes. The robot carries a three-finger gripper at its end: the two symmetric fingers act synchronously (closing inward or opening outward together), while the third finger can be controlled independently. To avoid displacement caused by a robot finger touching or colliding with a target object in a complex scene, all grabbing experiments use only the two symmetric fingers for gripping.
The whole robot grabbing scene is shown in fig. 6: target objects are randomly placed on a workbench fixed to the ground, the controller of the robotic arm is integrated at its base, and the arm automatically detects and grabs the target objects in a posture perpendicular to the workbench.
First, hand-eye calibration between the depth camera and the robot's end fingers and camera calibration are performed to determine the positional relationship between the depth camera and the fingers and to solve the camera's intrinsic, extrinsic and distortion parameters.
(1) Single-object scene grabbing
We choose everyday objects frequently encountered in the real world for the single-object experiments; before each grab, the object is placed at a random position on the workbench in an arbitrary posture, as shown in fig. 7. The robot system obtains target information with a RealSense D435i depth camera, the detection and reasoning module generates the optimal grabbing pose configuration, and the grabbing planning module generates the grabbing motion trajectory and controls the robot to execute the task.
In fig. 7, row 1 shows the robot grabbing scene, row 2 the optimal grabbing pose generated by the detection and reasoning module of the grabbing system, row 3 the grabbing quality map for the pixel grabbing points, row 4 the grabbing angle map for the pixel grabbing points, and row 5 the grabbing width for the pixel grabbing points. As shown in fig. 7(c), our system can accurately locate the object and output a high-confidence grabbing box even when the grabbed object and the background share the same color. As shown in fig. 7(d), the proposed system also performs excellently when grabbing transparent objects.
(2) Multi-object complex scene grabbing
As shown in fig. 8, we created a series of multi-object scenes for real grabbing experiments; each complex multi-object scene randomly contains different numbers of common household items with different shapes, materials, colors and poses. Fig. 8(e) shows the optimal grabbing pose inferred by our robot system for the current object in a complex multi-object scene. Fig. 8(f) shows the vision perception module acquiring visual information of the object to be grabbed, fig. 8(g) the robot approaching the object, fig. 8(h) the robot converting the 2-dimensional grabbing pose into a three-dimensional grabbing pose in the real world and grabbing the object successfully, fig. 8(i) the robot lifting slowly and computing the grabbing movement path, fig. 8(j) the robot system executing the grabbing path delivered by the grabbing planning module, and fig. 8(k) the robot system moving the object above the target position. Multiple experiments in the established series of complex multi-object scenes achieved an extremely high grabbing success rate.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, the above-described embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product comprises one or more computer instructions or computer programs. When the computer instructions or computer program are loaded or executed on a computer, the processes or functions described in accordance with the embodiments of the present invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another by wire or wirelessly (e.g., infrared, radio, microwave, etc.). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that contains one or more available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium. The semiconductor medium may be a solid state disk.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In addition, each functional module in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a read-only memory (ROM), a random access memory (random access memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The above embodiments are only intended to illustrate the technical solution of the present application, not to limit it; although the application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will appreciate that the technical schemes described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (3)

1. A robot vision grabbing method in an unstructured environment, applied to a robot that performs grabbing, with a depth camera fixed at the end of the robot gripper, the method comprising:
S1, acquiring visual information of an external environment and a target object based on a depth camera, preprocessing the visual information, and generating an image set to be processed;
s2, receiving the image set to be processed, inputting the image set to be processed into a pre-constructed convolutional neural network, outputting grabbing confidence information, grabbing angle information and grabbing width information of a target object, and determining an optimal grabbing pose according to the grabbing confidence information;
S3, receiving the optimal grabbing pose, converting it into a target pose of the gripper end in the robot coordinate system, and, by combining an inverse kinematics solving method and a trajectory planning algorithm, controlling the robot to move to the target pose along a pre-planned motion trajectory to execute grabbing;
the visual information comprises an RGB image of the target object and a corresponding depth image; the preprocessing comprises: completing, denoising, clipping and normalizing the RGB image and the depth image;
The step S2 comprises the following steps:
S201, extracting features of the image set to be processed through three convolution layers, with regularization and a ReLU function added after each convolution layer, to form a feature map;
S202, after the three convolution layers, introducing a compression-and-excitation module so that the convolutional neural network attends to the important feature channels in the feature map and the channel contributions are re-weighted: spatial feature compression is first performed on the input feature map via global average pooling; channel feature learning is then performed on the compressed features through fully connected (FC) layers to obtain channel attention weights; finally, the channel attention weights are multiplied channel by channel onto the original input feature map, outputting a feature map with channel attention;
S203, extracting refined features through a depthwise separable convolution layer to reduce the number of parameters;
S204, introducing two visual attention residual multidimensional convolution modules: a depthwise convolution first captures planar spatial feature information on each input channel, then a 1 × 1 convolution adjusts the number of channels without changing the feature map size and integrates the channel information; global max pooling and global average pooling are then applied in the channel dimension to create two 1-dimensional feature vectors, and weights are assigned to them by a 1 × 1 convolution, strengthening the feature information of the object to be grabbed in the channel domain; next, max pooling and average pooling compress the channel-domain features in the spatial dimension to generate a 2-dimensional feature map, and a 7 × 7 convolution assigns the spatial feature weights; finally, a residual structure is introduced so that the input signal can be passed directly to subsequent layers, avoiding the vanishing gradients caused by deepening the network;
S205: restoring the original image size through three deconvolution layers;
in step S2, the convolutional neural network outputs a grabbing confidence, a grabbing angle and a grabbing width, where the grabbing confidence is a scalar in [0, 1], the grabbing angle is recovered from the corresponding angle output of the network, and the grabbing width is expressed in pixels; the optimal grabbing pose is expressed by the following formula:
Z_i = (G_i, Θ_i, W_i, Q_i)
wherein G_i = (p, q) denotes the coordinates of the two-dimensional grabbing point in the image coordinate system; Θ_i is the rotation angle of the end jaw about the depth camera coordinate system, restricted to a fixed angular range; W_i is the pixel width of the jaw opening in the image coordinate system, taking values in [0, W_max], where W_max is the maximum opening width of the end jaw; Q_i is the grabbing quality score;
the step S3 comprises the following steps:
S301, receiving the optimal grabbing pose and converting the two-dimensional feasible grabbing representation into the target pose of the gripper end in the robot base coordinate system, where the grabbing pose in the robot coordinate system is defined as:
Z_t = (x, y, z, φ_t, W_t, Q_t)
wherein (x, y, z) is the center position of the end jaw in the Cartesian coordinate system, φ_t is the rotation of the jaw about the z-axis, W_t is the actual opening width of the jaw, and Q_t is the grabbing quality score;
S302, converting the grabbing pose from the image coordinate system to the robot coordinate system by applying the depth camera intrinsic parameters T_n and the rotation matrix T_x through the following transformation:
Z_t = T_x(T_n(Z_i))
S303, according to the converted three-dimensional optimal grabbing pose, planning a safe, collision-free motion trajectory with a motion planner through the integrated ROS interface, and having the motion controller drive the robot to the target position to execute the grab-and-place task while its motion state is monitored.
2. A robotic vision gripping system in an unstructured environment for implementing the robotic vision gripping method of claim 1, the system comprising:
the visual perception module is used for acquiring visual information of an external environment and a target object based on the depth camera, preprocessing the visual information and generating an image set to be processed;
The detection and reasoning module is used for receiving the image set to be processed, inputting the image set to be processed into a pre-constructed convolutional neural network, outputting grabbing confidence information, grabbing angle information and grabbing width information of a target object, and determining an optimal grabbing pose according to the grabbing confidence information;
The grabbing planning module is used for receiving the optimal grabbing pose, converting it into a target pose of the gripper end in the robot coordinate system, and, by combining an inverse kinematics solving method and a trajectory planning algorithm, controlling the robot to move to the target pose along a pre-planned motion trajectory to execute grabbing.
3. The robotic vision gripping system in an unstructured environment of claim 2, wherein: the convolutional neural network comprises a convolutional layer, a compression-excitation module, a depth separable convolutional layer, a visual attention residual multidimensional convolutional module and a deconvolution layer;
The convolution layers are used to extract features of the image set to be processed through three convolution layers, with regularization and a ReLU function added after each convolution layer to form a feature map;
The compression-excitation module is introduced after the three convolution layers so that the convolutional neural network attends to the important feature channels in the feature map and the channel contributions are re-weighted: spatial feature compression is first performed on the input feature map via global average pooling; channel feature learning is then performed on the compressed features through fully connected (FC) layers to obtain channel attention weights; finally, the channel attention weights are multiplied channel by channel onto the original input feature map, outputting a feature map with channel attention;
The depthwise separable convolution layer is used to extract refined features from the feature map to reduce the number of parameters;
The visual attention residual multidimensional convolution module first captures planar spatial feature information on each input channel through a depthwise convolution, then adjusts the number of channels with a 1 × 1 convolution without changing the feature map size and integrates the channel information; global max pooling and global average pooling are then applied in the channel dimension to create two 1-dimensional feature vectors, and weights are assigned to them by a 1 × 1 convolution, strengthening the feature information of the grabbed object in the channel domain; next, max pooling and average pooling compress the channel-domain features in the spatial dimension to generate a 2-dimensional feature map, and a 7 × 7 convolution assigns the spatial feature weights; finally, a residual structure is introduced so that the input signal can be passed directly to subsequent layers, avoiding the vanishing gradients caused by deepening the network;
The deconvolution layer is used to restore the original size of the image.
CN202311740217.1A 2023-12-15 2023-12-15 Robot vision grabbing method and system in unstructured environment Active CN117549307B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311740217.1A CN117549307B (en) 2023-12-15 2023-12-15 Robot vision grabbing method and system in unstructured environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311740217.1A CN117549307B (en) 2023-12-15 2023-12-15 Robot vision grabbing method and system in unstructured environment

Publications (2)

Publication Number Publication Date
CN117549307A CN117549307A (en) 2024-02-13
CN117549307B true CN117549307B (en) 2024-04-16

Family

ID=89814685

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311740217.1A Active CN117549307B (en) 2023-12-15 2023-12-15 Robot vision grabbing method and system in unstructured environment

Country Status (1)

Country Link
CN (1) CN117549307B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112257293A (en) * 2020-11-16 2021-01-22 江苏科技大学 Non-standard object grabbing method and device based on ROS
CN114782347A (en) * 2022-04-13 2022-07-22 杭州电子科技大学 Mechanical arm grabbing parameter estimation method based on attention mechanism generation type network
CN114851201A (en) * 2022-05-18 2022-08-05 浙江工业大学 Mechanical arm six-degree-of-freedom vision closed-loop grabbing method based on TSDF three-dimensional reconstruction
CN114912287A (en) * 2022-05-26 2022-08-16 四川大学 Robot autonomous grabbing simulation system and method based on target 6D pose estimation
CN116673962A (en) * 2023-07-12 2023-09-01 安徽大学 Intelligent mechanical arm grabbing method and system based on FasterR-CNN and GRCNN

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110000785B (en) * 2019-04-11 2021-12-14 上海交通大学 Agricultural scene calibration-free robot motion vision cooperative servo control method and equipment
CN112613478B (en) * 2021-01-04 2022-08-09 大连理工大学 Data active selection method for robot grabbing

Also Published As

Publication number Publication date
CN117549307A (en) 2024-02-13

Similar Documents

Publication Publication Date Title
JP6921151B2 (en) Deep machine learning methods and equipment for robot grip
CN110450153B (en) Mechanical arm object active picking method based on deep reinforcement learning
CN110000785B (en) Agricultural scene calibration-free robot motion vision cooperative servo control method and equipment
CN112605983B (en) Mechanical arm pushing and grabbing system suitable for intensive environment
CN110298886B (en) Dexterous hand grabbing planning method based on four-stage convolutional neural network
CN113284179B (en) Robot multi-object sorting method based on deep learning
CN111360862B (en) Method for generating optimal grabbing pose based on convolutional neural network
CN110238840B (en) Mechanical arm autonomous grabbing method based on vision
CN110605711B (en) Method, device and system for controlling cooperative robot to grab object
CN114912287A (en) Robot autonomous grabbing simulation system and method based on target 6D pose estimation
KR102228525B1 (en) Grasping robot, grasping method and learning method for grasp based on neural network
CN113762159B (en) Target grabbing detection method and system based on directional arrow model
CN114851201A (en) Mechanical arm six-degree-of-freedom vision closed-loop grabbing method based on TSDF three-dimensional reconstruction
CN116664843B (en) Residual fitting grabbing detection network based on RGBD image and semantic segmentation
CN117549307B (en) Robot vision grabbing method and system in unstructured environment
Van Molle et al. Learning to grasp from a single demonstration
CN117340929A (en) Flexible clamping jaw grabbing and disposing device and method based on three-dimensional point cloud data
CN115861780B (en) Robot arm detection grabbing method based on YOLO-GGCNN
CN117021099A (en) Human-computer interaction method oriented to any object and based on deep learning and image processing
CN115631401A (en) Robot autonomous grabbing skill learning system and method based on visual perception
CN115256377A (en) Robot grabbing method and device based on multi-source information fusion
Munoz et al. Image-driven drawing system by a NAO robot
An et al. An Autonomous Grasping Control System Based on Visual Object Recognition and Tactile Perception
CN113392703B (en) Mechanical arm autonomous grabbing method based on attention mechanism and unreasonable action inhibition
Wu et al. Intelligent Object Sorting Truck System Based on Machine Vision

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant