CN117549307B - Robot vision grabbing method and system in unstructured environment - Google Patents

Robot vision grabbing method and system in unstructured environment

Info

Publication number
CN117549307B
CN117549307B (application CN202311740217.1A)
Authority
CN
China
Prior art keywords
grabbing
channel
convolution
information
pose
Prior art date
Legal status
Active
Application number
CN202311740217.1A
Other languages
Chinese (zh)
Other versions
CN117549307A (en)
Inventor
高赫佳
赵俊杰
胡钜奇
孙长银
Current Assignee
Anhui University
Original Assignee
Anhui University
Priority date
Filing date
Publication date
Application filed by Anhui University
Priority to CN202311740217.1A
Publication of CN117549307A
Application granted
Publication of CN117549307B
Status: Active


Classifications

    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 Programme-controlled manipulators
    • B25J9/16 Programme controls
    • B25J9/1602 Programme controls characterised by the control system, structure, architecture
    • B25J9/161 Hardware, e.g. neural networks, fuzzy logic, interfaces, processor
    • B25J9/1656 Programme controls characterised by programming, planning systems for manipulators
    • B25J9/1664 Programme controls characterised by motion, path, trajectory planning
    • B25J9/1694 Programme controls characterised by use of sensors other than normal servo-feedback from position, speed or acceleration sensors, perception control, multi-sensor controlled systems, sensor fusion
    • B25J9/1697 Vision controlled systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Automation & Control Theory (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a robot vision grabbing method and system in an unstructured environment, belonging to the technical field of intelligent robot control. The method comprises: acquiring visual information of the external environment and the target object with a depth camera and preprocessing it to generate an image to be processed; inputting the image into a pre-constructed GARDSCN network model, which outputs grabbing confidence information, grabbing angle information and grabbing width information for the target object; selecting the grabbing angle and grabbing width corresponding to the pixel with the highest grabbing confidence to form the optimal grabbing pose; converting the two-dimensional grabbing pose into a three-dimensional target grabbing pose in the robot coordinate system; and controlling the robot to reach the target position along a planned motion trajectory to execute grabbing. The GARDSCN network model of the detection and reasoning module in the grabbing system focuses on both the spatial and channel information of the target object's features, which suppresses interference from background information in the working scene, and the introduced residual structure deepens the network without causing vanishing gradients.

Description

Robot vision grabbing method and system in unstructured environment
Technical Field
The invention belongs to the technical field of intelligent control of robots, and particularly relates to a robot vision grabbing method and system in an unstructured environment.
Background
Currently, although many tasks are accomplished by highly sophisticated robots, these robots are limited to performing a single, fixed task in a particular environment. When the working scene or the working object changes, there is still considerable room for improvement in task completion and safety. Robot grabbing in a structured environment is relatively simple to study, since the motion trajectory of the gripper only has to be designed for a known model of the object to be grabbed. Although this design approach is suitable for grabbing a single fixed target, its adaptability is limited in changeable and complex practical application scenarios. When the object to be grabbed is unfamiliar, traditional machine learning methods use techniques such as support vector machines to extract, from a knowledge base, the mapping between image features of the object and the grabbing position and attitude of the gripper. Such methods can transfer previously learned grabbing experience to unfamiliar objects without an accurate model of the object being known in advance. However, they typically require specifically designed features, such as surface texture and geometry, which demands significant effort. In addition, because human knowledge of objects is limited, these methods are not robust enough. They are therefore typically used for grabbing a particular class of object, such as picking and inspecting similarly shaped products on a factory production line.
Deep learning methods avoid the laborious hand-crafting of features required by traditional machine learning: they automatically extract deep image features and map them to the grabbing pose of the gripper, and the weight sharing and sparse connectivity of convolutional networks allow deeper image features to be learned, giving stronger generalization for grabbing detection of unknown targets. However, existing convolutional neural networks have a large number of parameters, consume substantial computing resources, and seriously slow down grabbing detection, so they cannot meet real-time requirements in the real world.
Disclosure of Invention
The invention aims to provide a robot vision grabbing method and system in an unstructured environment, so as to solve the problems in the background technology.
The invention realizes the above purpose through the following technical scheme:
A robot vision grabbing method in an unstructured environment, applied to a robot that performs grabbing, with a depth camera fixed at the end of the robot gripper, the method comprising:
S1, acquiring visual information of an external environment and a target object based on a depth camera, preprocessing the visual information, and generating an image set to be processed;
s2, receiving the image set to be processed, inputting the image set to be processed into a pre-constructed convolutional neural network, outputting grabbing confidence information, grabbing angle information and grabbing width information of a target object, and determining an optimal grabbing pose according to the grabbing confidence information;
and S3, receiving the optimal grabbing pose, converting it into a target pose of the gripper end in the robot coordinate system, and, by combining an inverse kinematics solving method and a trajectory planning algorithm, controlling the robot to move to the target pose along a pre-planned motion trajectory to execute grabbing.
As a further optimization scheme of the invention, the visual information comprises an RGB image of the target object and a corresponding depth image; the preprocessing comprises: completing, denoising, clipping and normalizing the RGB image and the depth image, the generated images to be processed having a size of 224 × 224.
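As a rough illustration of the preprocessing described above (completion, denoising, cropping and normalization to 224 × 224), the following is a minimal sketch using OpenCV and NumPy; the specific filters, the hole-filling strategy and the normalization constants are assumptions, since the patent does not specify them.

```python
import cv2
import numpy as np

def preprocess(rgb, depth, out_size=224):
    """Complete, denoise, crop and normalize an RGB-D pair (illustrative choices)."""
    # Fill missing depth readings (holes typically appear as zeros on depth cameras).
    hole_mask = (depth == 0).astype(np.uint8)
    depth = cv2.inpaint(depth.astype(np.float32), hole_mask, 3, cv2.INPAINT_NS)
    # Light denoising of the color image.
    rgb = cv2.GaussianBlur(rgb, (3, 3), 0)
    # Center-crop to a square region, then resize to the network input size.
    h, w = depth.shape
    s = min(h, w)
    top, left = (h - s) // 2, (w - s) // 2
    rgb = cv2.resize(rgb[top:top + s, left:left + s], (out_size, out_size))
    depth = cv2.resize(depth[top:top + s, left:left + s], (out_size, out_size))
    # Normalize both modalities to a small, zero-centered range.
    rgb = rgb.astype(np.float32) / 255.0 - 0.5
    depth = np.clip(depth - depth.mean(), -1.0, 1.0)
    return rgb, depth
```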
As a further optimization scheme of the present invention, step S2 includes:
S201, extracting features of the image set to be processed through three convolution layers, with regularization and a ReLU function added after each convolution layer, to form a feature map;
S202, after the three convolution layers, introducing a compression-and-excitation module so that the convolutional neural network attends to the important feature channels in the feature map and the channel contributions are re-weighted: spatial feature compression is first performed on the input feature map via global average pooling; channel feature learning is then performed on the compressed features through fully connected (FC) layers to obtain channel attention weights; finally, the channel attention weights are multiplied channel by channel onto the original input feature map, outputting a feature map with channel attention;
S203, extracting refined features through a depthwise separable convolution layer to reduce the number of parameters;
S204, introducing two visual attention residual multidimensional convolution modules: a depthwise convolution first captures planar spatial feature information on each input channel, then a 1 × 1 convolution adjusts the number of channels without changing the feature map size and integrates the channel information; global max pooling and global average pooling are then applied in the channel dimension to create two 1-dimensional feature vectors, and weights are assigned to them by a 1 × 1 convolution, strengthening the feature information of the object to be grabbed in the channel domain; next, max pooling and average pooling compress the channel-domain features in the spatial dimension to generate a 2-dimensional feature map, and a 7 × 7 convolution assigns the spatial feature weights; finally, a residual structure is introduced so that the input signal can be passed directly to subsequent layers, avoiding the vanishing gradients caused by deepening the network;
S205: restoring the original image size through three deconvolution layers (a structural sketch of steps S201-S205 is given below).
In step S2, the convolutional neural network outputs a grabbing confidence, a grabbing angle and a grabbing width, where the grabbing confidence is a scalar in [0, 1], the grabbing angle is recovered from the corresponding angle output of the network, and the grabbing width is expressed in pixels; the optimal grabbing pose is expressed by the following formula:
Z_i = (G_i, Θ_i, W_i, Q_i)
wherein G_i = (p, q) denotes the coordinates of the two-dimensional grasp point in the image coordinate system; Θ_i is the rotation angle of the end jaw about the depth camera coordinate system, restricted to a fixed angular range; W_i is the pixel width of the jaw opening in the image coordinate system, taking values in [0, W_max], where W_max is the maximum opening width of the end jaw; Q_i is the grabbing quality score.
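A minimal sketch of how the optimal grabbing pose Z_i = (G_i, Θ_i, W_i, Q_i) can be read off the three output maps, assuming they are NumPy arrays of equal shape; the function name and dictionary keys are illustrative.

```python
import numpy as np

def select_best_grasp(q_map, angle_map, width_map):
    """Pick the pixel with the highest grabbing confidence and read off the
    angle and width predicted at that pixel."""
    p, q = np.unravel_index(np.argmax(q_map), q_map.shape)  # image coordinates of the best point
    return {
        "point": (p, q),                  # G_i
        "angle": float(angle_map[p, q]),  # Theta_i
        "width": float(width_map[p, q]),  # W_i, in pixels
        "quality": float(q_map[p, q]),    # Q_i
    }
```

In the pipeline of fig. 1, the selected point, angle and width are then handed to the grabbing planning module together with the depth value at that pixel.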
As a further optimization scheme of the present invention, step S3 includes:
S301, receiving the optimal grabbing pose and converting the two-dimensional feasible grabbing representation into the target pose of the gripper end in the robot base coordinate system, where the grabbing pose in the robot coordinate system is defined as:
Z_t = (x, y, z, φ_t, W_t, Q_t)
wherein (x, y, z) is the center position of the end jaw in the Cartesian coordinate system, φ_t is the rotation of the jaw about the z-axis, W_t is the actual opening width of the jaw, and Q_t is the grabbing quality score;
S302, converting the grabbing pose from the image coordinate system to the robot coordinate system by applying the depth camera intrinsic parameters T_n and the rotation matrix T_x through the following transformation:
Z_t = T_x(T_n(Z_i))
S303, according to the converted three-dimensional optimal grabbing pose, planning a safe, collision-free motion trajectory with a motion planner through the integrated ROS interface, and having the motion controller drive the robot to the target position to execute the grab-and-place task while its motion state is monitored (a coordinate-transformation sketch for step S302 is given below).
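A hedged sketch of the S302 conversion: the grasp pixel is lifted to a 3-D camera-frame point using the depth value and a pinhole intrinsic matrix (the T_n of the text), then transformed into the robot base frame with a 4 × 4 hand-eye transform standing in for T_x. The argument names and the assumption of a simple pinhole model are illustrative.

```python
import numpy as np

def pixel_to_robot(grasp, depth_image, K, T_cam_to_base):
    """Lift the 2-D grasp point to 3-D with the depth value and camera intrinsics K,
    then move it into the robot base frame with the hand-eye transform T_cam_to_base."""
    v, u = grasp["point"]                  # row, column of the grasp pixel
    z = float(depth_image[v, u])           # metric depth at that pixel
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    p_cam = np.array([(u - cx) * z / fx,   # back-project through the pinhole model
                      (v - cy) * z / fy,
                      z, 1.0])
    p_base = T_cam_to_base @ p_cam         # homogeneous 4x4 transform into the base frame
    return p_base[:3]                      # (x, y, z) component of Z_t
```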
A robotic vision gripping system in an unstructured environment for implementing a gripping method as described above, the system comprising:
the visual perception module is used for acquiring visual information of an external environment and a target object based on the depth camera, preprocessing the visual information and generating an image set to be processed;
The detection and reasoning module is used for receiving the image set to be processed, inputting the image set to be processed into a pre-constructed convolutional neural network, outputting grabbing confidence information, grabbing angle information and grabbing width information of a target object, and determining an optimal grabbing pose according to the grabbing confidence information;
The grabbing planning module is used for receiving the optimal grabbing pose, converting it into a target pose of the gripper end in the robot coordinate system, and, by combining an inverse kinematics solving method and a trajectory planning algorithm, controlling the robot to move to the target pose along a pre-planned motion trajectory to execute grabbing.
As a further optimization scheme of the invention, the convolutional neural network comprises a convolutional layer, a compression-excitation module, a depth separable convolutional layer, a visual attention residual multidimensional convolutional module and a deconvolution layer;
The convolution layers are used to extract features of the image set to be processed through three convolution layers, with regularization and a ReLU function added after each convolution layer to form a feature map;
The compression-excitation module is introduced after the three convolution layers so that the convolutional neural network attends to the important feature channels in the feature map and the channel contributions are re-weighted: spatial feature compression is first performed on the input feature map via global average pooling; channel feature learning is then performed on the compressed features through fully connected (FC) layers to obtain channel attention weights; finally, the channel attention weights are multiplied channel by channel onto the original input feature map, outputting a feature map with channel attention;
The depthwise separable convolution layer is used to extract refined features from the feature map to reduce the number of parameters;
The visual attention residual multidimensional convolution module first captures planar spatial feature information on each input channel through a depthwise convolution, then adjusts the number of channels with a 1 × 1 convolution without changing the feature map size and integrates the channel information; global max pooling and global average pooling are then applied in the channel dimension to create two 1-dimensional feature vectors, and weights are assigned to them by a 1 × 1 convolution, strengthening the feature information of the grabbed object in the channel domain; next, max pooling and average pooling compress the channel-domain features in the spatial dimension to generate a 2-dimensional feature map, and a 7 × 7 convolution assigns the spatial feature weights; finally, a residual structure is introduced so that the input signal can be passed directly to subsequent layers, avoiding the vanishing gradients caused by deepening the network;
The deconvolution layer is used to restore the original size of the image.
The invention has the beneficial effects that:
(1) The invention breaks the limitation of grabbing a single fixed target in a single scene; the designed grabbing method generalizes to objects of any type, can be applied in changeable and complex practical application scenarios, and has wide applicability.
(2) The invention addresses the difficulty of guaranteeing real-time performance in real scenes: the detection and reasoning module of the designed grabbing system generates grabbing poses quickly, meeting grabbing success-rate requirements while ensuring real-time performance.
(3) The GARDSCN network model of the detection and reasoning module in the grabbing system focuses on both the spatial and channel information of the target object's features, suppresses interference from background information in the working scene, and can effectively grab transparent objects. The introduced residual structure deepens the network without causing vanishing gradients.
(4) The grabbing planning module in the grabbing system integrates the ROS interface and can be deployed on any robot to perform real grabbing tasks. The motion planner plans the motion trajectory, and the motion controller drives the robot to complete the whole grabbing task. The system can perform grabbing tasks in both single-object scenes and complex multi-object scenes.
Drawings
FIG. 1 is a schematic diagram of an embodiment of a visual grabbing method in the present invention;
FIG. 2 is a schematic diagram of a visual intelligent grabbing system according to the present invention;
FIG. 3 is a schematic diagram of the adaptive residual depth separable convolutional neural network architecture in the present invention;
FIG. 4 is a schematic diagram of a visual attention residual multidimensional convolution module in the present invention;
FIG. 5 is a schematic diagram of a robotic grasping procedure in accordance with the present invention;
FIG. 6 is a diagram of a robot gripping scene in an embodiment of the invention;
FIG. 7 is a schematic diagram of a robot in a single object scene according to an embodiment of the present invention;
fig. 8 is a schematic diagram of grabbing multiple objects in a complex scene of a robot in an embodiment of the invention.
Detailed Description
The present application will be described in further detail with reference to the accompanying drawings. It is to be understood that the following detailed description is only intended to further illustrate the application and is not to be construed as limiting its scope, since those skilled in the art can make various insubstantial modifications and adaptations of the application in light of the foregoing disclosure.
Example 1
As shown in fig. 1, this embodiment provides a robot vision grabbing method in an unstructured environment, applied to a robot that performs grabbing, with a depth camera fixed at the end of the robot gripper, the method comprising:
S1, acquiring visual information of an external environment and a target object based on a depth camera, preprocessing the visual information, and generating an image set to be processed;
s2, receiving an image set to be processed, inputting the image set to be processed into a pre-constructed convolutional neural network, outputting grabbing confidence information, grabbing angle information and grabbing width information of a target object, and determining an optimal grabbing pose according to the grabbing confidence information;
And S3, receiving the optimal grabbing pose, converting it into a target pose of the gripper end in the robot coordinate system, and, by combining an inverse kinematics solving method and a trajectory planning algorithm, controlling the robot to move to the target pose along a pre-planned motion trajectory to execute grabbing.
As a further implementation, the visual information includes an RGB image of the target object and a corresponding depth image; the preprocessing comprises: completing, denoising, clipping and normalizing the RGB image and the depth image, the generated images to be processed having a size of 224 × 224.
As a further implementation, step S2 includes:
S201, extracting features of the image set to be processed through three convolution layers, with regularization and a ReLU function added after each convolution layer, to form a feature map;
S202, after the three convolution layers, introducing a compression-and-excitation module so that the convolutional neural network attends to the important feature channels in the feature map and the channel contributions are re-weighted: spatial feature compression is first performed on the input feature map via global average pooling; channel feature learning is then performed on the compressed features through fully connected (FC) layers to obtain channel attention weights; finally, the channel attention weights are multiplied channel by channel onto the original input feature map, outputting a feature map with channel attention;
S203, extracting refined features through a depthwise separable convolution layer to reduce the number of parameters;
S204, introducing two visual attention residual multidimensional convolution modules: a depthwise convolution first captures planar spatial feature information on each input channel, then a 1 × 1 convolution adjusts the number of channels without changing the feature map size and integrates the channel information; global max pooling and global average pooling are then applied in the channel dimension to create two 1-dimensional feature vectors, and weights are assigned to them by a 1 × 1 convolution, strengthening the feature information of the object to be grabbed in the channel domain; next, max pooling and average pooling compress the channel-domain features in the spatial dimension to generate a 2-dimensional feature map, and a 7 × 7 convolution assigns the spatial feature weights; finally, a residual structure is introduced so that the input signal can be passed directly to subsequent layers, avoiding the vanishing gradients caused by deepening the network;
S205: restoring the original image size through three deconvolution layers.
As a further implementation, in step S2, the convolutional neural network outputs a grabbing confidence, a grabbing angle and a grabbing width, where the grabbing confidence is a scalar in [0, 1], the grabbing angle is recovered from the corresponding angle output of the network, and the grabbing width is expressed in pixels; the optimal grabbing pose is expressed by the following formula:
Z_i = (G_i, Θ_i, W_i, Q_i)
wherein G_i = (p, q) denotes the coordinates of the two-dimensional grasp point in the image coordinate system; Θ_i is the rotation angle of the end jaw about the depth camera coordinate system, restricted to a fixed angular range; W_i is the pixel width of the jaw opening in the image coordinate system, taking values in [0, W_max], where W_max is the maximum opening width of the end jaw; Q_i is the grabbing quality score.
As a further implementation, step S3 includes:
S301, receiving the optimal grabbing pose and converting the two-dimensional feasible grabbing representation into the target pose of the gripper end in the robot base coordinate system, where the grabbing pose in the robot coordinate system is defined as:
Z_t = (x, y, z, φ_t, W_t, Q_t)
wherein (x, y, z) is the center position of the end jaw in the Cartesian coordinate system, φ_t is the rotation of the jaw about the z-axis, W_t is the actual opening width of the jaw, and Q_t is the grabbing quality score;
S302, converting the grabbing pose from the image coordinate system to the robot coordinate system by applying the depth camera intrinsic parameters T_n and the rotation matrix T_x through the following transformation:
Z_t = T_x(T_n(Z_i))
S303, according to the converted three-dimensional optimal grabbing pose, planning a safe, collision-free motion trajectory with a motion planner through the integrated ROS interface, and having the motion controller drive the robot to the target position to execute the grab-and-place task while its motion state is monitored (a minimal ROS/MoveIt sketch of this step is given below).
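For the S303 planning step, a generic ROS/MoveIt sketch follows; the patent only states that a motion planner is used through the integrated ROS interface, so the moveit_commander calls, the group name "manipulator" and the example pose values are assumptions rather than the patented implementation.

```python
import sys
import rospy
import moveit_commander
from geometry_msgs.msg import Pose

# Initialize the ROS node and the MoveIt commander for the arm's planning group.
moveit_commander.roscpp_initialize(sys.argv)
rospy.init_node("grasp_planner_sketch")
arm = moveit_commander.MoveGroupCommander("manipulator")  # group name is a placeholder

# Build the target pose from the converted three-dimensional grabbing pose Z_t.
target = Pose()
target.position.x, target.position.y, target.position.z = 0.45, 0.10, 0.25  # example values
target.orientation.w = 1.0  # real use: orientation derived from the rotation phi_t about z

arm.set_pose_target(target)
success = arm.go(wait=True)   # plan a collision-free trajectory and execute it
arm.stop()                    # make sure there is no residual motion
arm.clear_pose_targets()
```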
Example 2
Based on the same inventive concept, this embodiment also provides a vision grabbing system corresponding to the vision grabbing method described above; since the system solves the problem on a principle similar to that of the vision grabbing method in the embodiment of the present disclosure, its implementation may refer to the implementation of the method, and repeated details are omitted.
As shown in fig. 2, the present embodiment proposes a robot vision gripping system in an unstructured environment, for implementing the above vision gripping method, where the system includes:
the visual perception module is used for acquiring visual information of an external environment and a target object based on the depth camera, preprocessing the visual information and generating an image set to be processed;
The detection and reasoning module is used for receiving the image set to be processed, inputting the image set to be processed into a pre-constructed convolutional neural network, outputting grabbing confidence information, grabbing angle information and grabbing width information of the target object, and determining the optimal grabbing pose according to the grabbing confidence information;
The grabbing planning module is used for receiving the optimal grabbing pose, converting it into a target pose of the gripper end in the robot coordinate system, and, by combining an inverse kinematics solving method and a trajectory planning algorithm, controlling the robot to move to the target pose along a pre-planned motion trajectory to execute grabbing.
The vision perception module consists of a machine vision unit, a RealSense D435i camera, fixed at the end of the robot's fingers so that it moves with the robot and updates the camera's visual information in real time. The visual information includes an RGB image of the object to be grabbed and the corresponding depth image. After image completion, denoising, clipping, normalization and similar operations, the preprocessed RGB and depth image information is output at a size of 224 × 224, matching the network input format of the subsequent detection and reasoning module.
The detection and reasoning module obtains the RGB image and its corresponding depth image from the vision perception module and either detects the best feasible grabbing representation or outputs a no-feasible-grab signal declaring that the current object cannot be grabbed. This module contains a newly designed generative adaptive residual depth-separable convolutional neural network (GARDSCN), in which a new module, the visual attention residual multidimensional convolution (VARMC) module, is embedded; the network architecture is shown in fig. 3.
As a further implementation, the convolutional neural network includes a convolutional layer, a compression-excitation module, a depth separable convolutional layer, a visual attention residual multidimensional convolutional module, and a deconvolution layer;
The convolution layers are used to extract features of the image set to be processed through three convolution layers, with regularization and a ReLU function added after each convolution layer to form a feature map;
The compression-excitation module is introduced after the three convolution layers so that the convolutional neural network attends to the important feature channels in the feature map and the channel contributions are re-weighted: spatial feature compression is first performed on the input feature map via global average pooling; channel feature learning is then performed on the compressed features through fully connected (FC) layers to obtain channel attention weights; finally, the channel attention weights are multiplied channel by channel onto the original input feature map, outputting a feature map with channel attention;
The depthwise separable convolution layer is used to extract refined features from the feature map to reduce the number of parameters;
The visual attention residual multidimensional convolution module first captures planar spatial feature information on each input channel through a depthwise convolution, then adjusts the number of channels with a 1 × 1 convolution without changing the feature map size and integrates the channel information; global max pooling and global average pooling are then applied in the channel dimension to create two 1-dimensional feature vectors, and weights are assigned to them by a 1 × 1 convolution, strengthening the feature information of the grabbed object in the channel domain; next, max pooling and average pooling compress the channel-domain features in the spatial dimension to generate a 2-dimensional feature map, and a 7 × 7 convolution assigns the spatial feature weights; finally, a residual structure is introduced so that the input signal can be passed directly to subsequent layers, avoiding the vanishing gradients caused by deepening the network;
The deconvolution layer is used to recover the original size of the image.
Fig. 3 shows the proposed generative model architecture. The network accepts multi-channel input data (RGB, RGB-D, or depth-only). The input image passes through three convolution layers for feature extraction. Regularization and a ReLU activation function are added to each convolution layer to increase the expressiveness and generalization capability of the model, and the output size is then 56 × 56 × 128. However, extracting image features with convolution layers alone does not explicitly model the relationships between feature channels, so some channels contribute little to the grabbing task. Therefore, after the three convolutions, a compression-excitation module (SEN) is introduced to make the network focus on important feature channels; it adaptively learns the importance of each channel and weights the channel contributions in the feature map. The SEN module applies average pooling to the features extracted by the convolution layers and then uses two fully connected layers to reduce and restore the feature dimension: the first fully connected layer reduces the feature dimension to C/r and a ReLU activation adds nonlinearity; the second fully connected layer restores the original dimension. The resulting weights are then multiplied channel by channel onto the original feature channels, outputting 56 × 56 × 128 features. To reduce the amount of computation and the number of network parameters, a depthwise separable convolution is added after the SEN. The depthwise separable convolution consists of a 3 × 3 depthwise convolution and a 1 × 1 convolution. The 3 × 3 depthwise convolution is applied on each input channel, so spatial features can be extracted per channel, and the 1 × 1 convolution performs cross-channel feature interaction so that information from different channels is fused.
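A compact PyTorch sketch of the two components described in this paragraph, assuming a reduction ratio r for the SE bottleneck; the class names and the value of r are illustrative.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Compression-excitation: global average pooling, two fully connected layers
    (reduce to C/r, then restore), sigmoid weights multiplied back onto the channels."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r), nn.ReLU(),
            nn.Linear(channels // r, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))   # squeeze: global average pool -> (b, c)
        return x * w.view(b, c, 1, 1)     # excite: reweight each channel

class DepthwiseSeparable(nn.Module):
    """3x3 depthwise convolution per channel followed by a 1x1 pointwise convolution
    that fuses information across channels."""
    def __init__(self, channels):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.pw = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        return torch.relu(self.pw(self.dw(x)))
```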
However, as the number of network layers increases, vanishing gradients become unavoidable, so the application develops a new visual attention residual multidimensional convolution (VARMC) module, which enriches features in the channel dimension, reduces the computation of the whole network and speeds it up, enhances the saliency of the grabbed object in the scene, and effectively reduces interference from background information. The VARMC module is shown in fig. 4.
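A possible PyTorch reading of the VARMC block of fig. 4, following the textual description (depthwise plus 1 × 1 convolution, channel attention from global max/average pooling with 1 × 1 convolutions, spatial attention from a 7 × 7 convolution over per-pixel max/average maps, and a residual shortcut). Sharing one 1 × 1-convolution MLP between the two pooled descriptors and summing them is an assumption borrowed from CBAM-style attention, not a detail stated in the patent.

```python
import torch
import torch.nn as nn

class VARMC(nn.Module):
    """Visual attention residual multidimensional convolution block (sketch)."""
    def __init__(self, channels, r=8):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.pw = nn.Conv2d(channels, channels, 1)
        # Channel attention: shared 1x1-conv MLP applied to pooled channel descriptors.
        self.channel_mlp = nn.Sequential(
            nn.Conv2d(channels, channels // r, 1), nn.ReLU(),
            nn.Conv2d(channels // r, channels, 1),
        )
        # Spatial attention: fuse the 2-channel (max, avg) map with a 7x7 convolution.
        self.spatial = nn.Conv2d(2, 1, 7, padding=3)

    def forward(self, x):
        identity = x
        y = torch.relu(self.pw(self.dw(x)))            # depthwise + pointwise convolution
        # Channel attention from globally max- and average-pooled vectors.
        ca = torch.sigmoid(
            self.channel_mlp(torch.amax(y, dim=(2, 3), keepdim=True))
            + self.channel_mlp(torch.mean(y, dim=(2, 3), keepdim=True))
        )
        y = y * ca
        # Spatial attention from per-pixel max/average maps across channels.
        sa = torch.sigmoid(self.spatial(torch.cat(
            [y.amax(dim=1, keepdim=True), y.mean(dim=1, keepdim=True)], dim=1)))
        y = y * sa
        return torch.relu(y + identity)                # residual shortcut against vanishing gradients
```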
The above method is further described in connection with actual processing routines.
To verify the effectiveness of the proposed method, it is validated on a real robot, with grabbing experiments carried out in both single-object scenes and complex multi-object scenes. The robot carries a three-finger gripper at its end: the two symmetric fingers act synchronously (closing inward or opening outward together), while the third finger can be controlled independently. To avoid displacement caused by a robot finger touching or colliding with a target object in a complex scene, all grabbing experiments use only the two symmetric fingers for gripping.
The whole robot grabbing scene is shown in fig. 6: target objects are randomly placed on a workbench fixed to the ground, the controller of the robotic arm is integrated at its base, and the arm automatically detects and grabs the target objects in a posture perpendicular to the workbench.
First, hand-eye calibration between the depth camera and the robot's end fingers and camera calibration are performed to determine the positional relationship between the depth camera and the fingers and to solve the camera's intrinsic, extrinsic and distortion parameters.
(1) Single-object scene grabbing
We choose everyday objects frequently encountered in the real world for the single-object experiments; before each grab, the object is placed at a random position on the workbench in an arbitrary posture, as shown in fig. 7. The robot system obtains target information with a RealSense D435i depth camera, the detection and reasoning module generates the optimal grabbing pose configuration, and the grabbing planning module generates the grabbing motion trajectory and controls the robot to execute the task.
In fig. 7, row 1 shows the robot grabbing scene, row 2 the optimal grabbing pose generated by the detection and reasoning module of the grabbing system, row 3 the grabbing quality map for the pixel grabbing points, row 4 the grabbing angle map for the pixel grabbing points, and row 5 the grabbing width for the pixel grabbing points. As shown in fig. 7(c), our system can accurately locate the object and output a high-confidence grabbing box even when the grabbed object and the background share the same color. As shown in fig. 7(d), the proposed system also performs excellently when grabbing transparent objects.
(2) Multi-object complex scene grabbing
As shown in fig. 8, we created a series of multi-object scenes for real grabbing experiments; each complex multi-object scene randomly contains different numbers of common household items with different shapes, materials, colors and poses. Fig. 8(e) shows the optimal grabbing pose inferred by our robot system for the current object in a complex multi-object scene. Fig. 8(f) shows the vision perception module acquiring visual information of the object to be grabbed, fig. 8(g) the robot approaching the object, fig. 8(h) the robot converting the 2-dimensional grabbing pose into a three-dimensional grabbing pose in the real world and grabbing the object successfully, fig. 8(i) the robot lifting slowly and computing the grabbing movement path, fig. 8(j) the robot system executing the grabbing path delivered by the grabbing planning module, and fig. 8(k) the robot system moving the object above the target position. Multiple experiments in the established series of complex multi-object scenes achieved an extremely high grabbing success rate.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, the above-described embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product comprises one or more computer instructions or computer programs. When the computer instructions or computer program are loaded or executed on a computer, the processes or functions described in accordance with the embodiments of the present invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another by wire or wirelessly (e.g., infrared, radio, microwave, etc.). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that contains one or more available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium. The semiconductor medium may be a solid state disk.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In addition, each functional module in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a read-only memory (ROM), a random access memory (random access memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The above embodiments are only intended to illustrate the technical solution of the present application, not to limit it; although the application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will appreciate that the technical schemes described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (3)

1. A robot vision grabbing method in an unstructured environment, applied to a robot that performs grabbing, with a depth camera fixed at the end of the robot gripper, the method comprising:
S1, acquiring visual information of an external environment and a target object based on a depth camera, preprocessing the visual information, and generating an image set to be processed;
s2, receiving the image set to be processed, inputting the image set to be processed into a pre-constructed convolutional neural network, outputting grabbing confidence information, grabbing angle information and grabbing width information of a target object, and determining an optimal grabbing pose according to the grabbing confidence information;
S3, receiving the optimal grabbing pose, converting it into a target pose of the gripper end in the robot coordinate system, and, by combining an inverse kinematics solving method and a trajectory planning algorithm, controlling the robot to move to the target pose along a pre-planned motion trajectory to execute grabbing;
the visual information comprises an RGB image of the target object and a corresponding depth image; the preprocessing comprises: completing, denoising, clipping and normalizing the RGB image and the depth image;
The step S2 comprises the following steps:
S201, extracting features of the image set to be processed through three convolution layers, with regularization and a ReLU function added after each convolution layer, to form a feature map;
S202, after the three convolution layers, introducing a compression-and-excitation module so that the convolutional neural network attends to the important feature channels in the feature map and the channel contributions are re-weighted: spatial feature compression is first performed on the input feature map via global average pooling; channel feature learning is then performed on the compressed features through fully connected (FC) layers to obtain channel attention weights; finally, the channel attention weights are multiplied channel by channel onto the original input feature map, outputting a feature map with channel attention;
S203, extracting refined features through a depthwise separable convolution layer to reduce the number of parameters;
S204, introducing two visual attention residual multidimensional convolution modules: a depthwise convolution first captures planar spatial feature information on each input channel, then a 1 × 1 convolution adjusts the number of channels without changing the feature map size and integrates the channel information; global max pooling and global average pooling are then applied in the channel dimension to create two 1-dimensional feature vectors, and weights are assigned to them by a 1 × 1 convolution, strengthening the feature information of the object to be grabbed in the channel domain; next, max pooling and average pooling compress the channel-domain features in the spatial dimension to generate a 2-dimensional feature map, and a 7 × 7 convolution assigns the spatial feature weights; finally, a residual structure is introduced so that the input signal can be passed directly to subsequent layers, avoiding the vanishing gradients caused by deepening the network;
S205: restoring the original image size through three deconvolution layers;
in step S2, the convolutional neural network outputs a grabbing confidence, a grabbing angle and a grabbing width, where the grabbing confidence is a scalar in [0, 1], the grabbing angle is recovered from the corresponding angle output of the network, and the grabbing width is expressed in pixels; the optimal grabbing pose is expressed by the following formula:
Z_i = (G_i, Θ_i, W_i, Q_i)
wherein G_i = (p, q) denotes the coordinates of the two-dimensional grabbing point in the image coordinate system; Θ_i is the rotation angle of the end jaw about the depth camera coordinate system, restricted to a fixed angular range; W_i is the pixel width of the jaw opening in the image coordinate system, taking values in [0, W_max], where W_max is the maximum opening width of the end jaw; Q_i is the grabbing quality score;
the step S3 comprises the following steps:
S301, receiving the optimal grabbing pose and converting the two-dimensional feasible grabbing representation into the target pose of the gripper end in the robot base coordinate system, where the grabbing pose in the robot coordinate system is defined as:
Z_t = (x, y, z, φ_t, W_t, Q_t)
wherein (x, y, z) is the center position of the end jaw in the Cartesian coordinate system, φ_t is the rotation of the jaw about the z-axis, W_t is the actual opening width of the jaw, and Q_t is the grabbing quality score;
S302, converting the grabbing pose from the image coordinate system to the robot coordinate system by applying the depth camera intrinsic parameters T_n and the rotation matrix T_x through the following transformation:
Z_t = T_x(T_n(Z_i))
S303, according to the converted three-dimensional optimal grabbing pose, planning a safe, collision-free motion trajectory with a motion planner through the integrated ROS interface, and having the motion controller drive the robot to the target position to execute the grab-and-place task while its motion state is monitored.
2. A robotic vision gripping system in an unstructured environment for implementing the robotic vision gripping method of claim 1, the system comprising:
the visual perception module is used for acquiring visual information of an external environment and a target object based on the depth camera, preprocessing the visual information and generating an image set to be processed;
The detection and reasoning module is used for receiving the image set to be processed, inputting the image set to be processed into a pre-constructed convolutional neural network, outputting grabbing confidence information, grabbing angle information and grabbing width information of a target object, and determining an optimal grabbing pose according to the grabbing confidence information;
The grabbing planning module is used for receiving the optimal grabbing pose, converting it into a target pose of the gripper end in the robot coordinate system, and, by combining an inverse kinematics solving method and a trajectory planning algorithm, controlling the robot to move to the target pose along a pre-planned motion trajectory to execute grabbing.
3. The robotic vision gripping system in an unstructured environment of claim 2, wherein: the convolutional neural network comprises a convolutional layer, a compression-excitation module, a depth separable convolutional layer, a visual attention residual multidimensional convolutional module and a deconvolution layer;
The convolution layers are used to extract features of the image set to be processed through three convolution layers, with regularization and a ReLU function added after each convolution layer to form a feature map;
The compression-excitation module is introduced after the three convolution layers so that the convolutional neural network attends to the important feature channels in the feature map and the channel contributions are re-weighted: spatial feature compression is first performed on the input feature map via global average pooling; channel feature learning is then performed on the compressed features through fully connected (FC) layers to obtain channel attention weights; finally, the channel attention weights are multiplied channel by channel onto the original input feature map, outputting a feature map with channel attention;
The depthwise separable convolution layer is used to extract refined features from the feature map to reduce the number of parameters;
The visual attention residual multidimensional convolution module first captures planar spatial feature information on each input channel through a depthwise convolution, then adjusts the number of channels with a 1 × 1 convolution without changing the feature map size and integrates the channel information; global max pooling and global average pooling are then applied in the channel dimension to create two 1-dimensional feature vectors, and weights are assigned to them by a 1 × 1 convolution, strengthening the feature information of the grabbed object in the channel domain; next, max pooling and average pooling compress the channel-domain features in the spatial dimension to generate a 2-dimensional feature map, and a 7 × 7 convolution assigns the spatial feature weights; finally, a residual structure is introduced so that the input signal can be passed directly to subsequent layers, avoiding the vanishing gradients caused by deepening the network;
The deconvolution layer is used to restore the original size of the image.
CN202311740217.1A 2023-12-15 2023-12-15 Robot vision grabbing method and system in unstructured environment Active CN117549307B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311740217.1A CN117549307B (en) 2023-12-15 2023-12-15 Robot vision grabbing method and system in unstructured environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311740217.1A CN117549307B (en) 2023-12-15 2023-12-15 Robot vision grabbing method and system in unstructured environment

Publications (2)

Publication Number Publication Date
CN117549307A CN117549307A (en) 2024-02-13
CN117549307B true CN117549307B (en) 2024-04-16

Family

ID=89814685

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311740217.1A Active CN117549307B (en) 2023-12-15 2023-12-15 Robot vision grabbing method and system in unstructured environment

Country Status (1)

Country Link
CN (1) CN117549307B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112257293A (en) * 2020-11-16 2021-01-22 江苏科技大学 Non-standard object grabbing method and device based on ROS
CN114782347A (en) * 2022-04-13 2022-07-22 杭州电子科技大学 Mechanical arm grabbing parameter estimation method based on attention mechanism generation type network
CN114851201A (en) * 2022-05-18 2022-08-05 浙江工业大学 Mechanical arm six-degree-of-freedom vision closed-loop grabbing method based on TSDF three-dimensional reconstruction
CN114912287A (en) * 2022-05-26 2022-08-16 四川大学 Robot autonomous grabbing simulation system and method based on target 6D pose estimation
CN116673962A (en) * 2023-07-12 2023-09-01 安徽大学 Intelligent mechanical arm grabbing method and system based on FasterR-CNN and GRCNN

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110000785B (en) * 2019-04-11 2021-12-14 上海交通大学 Agricultural scene calibration-free robot motion vision cooperative servo control method and equipment
CN112613478B (en) * 2021-01-04 2022-08-09 大连理工大学 Data active selection method for robot grabbing

Also Published As

Publication number Publication date
CN117549307A (en) 2024-02-13

Similar Documents

Publication Publication Date Title
JP6921151B2 (en) Deep machine learning methods and equipment for robot grip
CN110450153B (en) Mechanical arm object active picking method based on deep reinforcement learning
CN110000785B (en) Agricultural scene calibration-free robot motion vision cooperative servo control method and equipment
CN112605983B (en) Mechanical arm pushing and grabbing system suitable for intensive environment
CN110298886B (en) Dexterous hand grabbing planning method based on four-stage convolutional neural network
CN113284179B (en) Robot multi-object sorting method based on deep learning
CN111360862B (en) Method for generating optimal grabbing pose based on convolutional neural network
CN110238840B (en) Mechanical arm autonomous grabbing method based on vision
CN110605711B (en) Method, device and system for controlling cooperative robot to grab object
CN114912287A (en) Robot autonomous grabbing simulation system and method based on target 6D pose estimation
KR102228525B1 (en) Grasping robot, grasping method and learning method for grasp based on neural network
CN113762159B (en) Target grabbing detection method and system based on directional arrow model
CN114851201A (en) Mechanical arm six-degree-of-freedom vision closed-loop grabbing method based on TSDF three-dimensional reconstruction
CN116664843B (en) Residual fitting grabbing detection network based on RGBD image and semantic segmentation
CN117549307B (en) Robot vision grabbing method and system in unstructured environment
Van Molle et al. Learning to grasp from a single demonstration
CN117340929A (en) Flexible clamping jaw grabbing and disposing device and method based on three-dimensional point cloud data
CN115861780B (en) Robot arm detection grabbing method based on YOLO-GGCNN
CN117021099A (en) Human-computer interaction method oriented to any object and based on deep learning and image processing
CN115631401A (en) Robot autonomous grabbing skill learning system and method based on visual perception
CN115256377A (en) Robot grabbing method and device based on multi-source information fusion
Munoz et al. Image-driven drawing system by a NAO robot
An et al. An Autonomous Grasping Control System Based on Visual Object Recognition and Tactile Perception
CN113392703B (en) Mechanical arm autonomous grabbing method based on attention mechanism and unreasonable action inhibition
Wu et al. Intelligent Object Sorting Truck System Based on Machine Vision

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant