CN110717532A - Real-time detection method for robot target grabbing area based on SE-RetinaGrasp model - Google Patents

Real-time detection method for robot target grabbing area based on SE-RetinaGrasp model

Info

Publication number
CN110717532A
Authority
CN
China
Prior art keywords
grabbing
feature
model
real
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910925919.4A
Other languages
Chinese (zh)
Inventor
卢智亮 (Lu Zhiliang)
曾碧 (Zeng Bi)
林伟 (Lin Wei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology
Priority to CN201910925919.4A
Publication of CN110717532A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211 - Selection of the most significant subset of features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/048 - Activation functions
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 - Geometric image transformations in the plane of the image
    • G06T3/60 - Rotation of whole images or parts thereof
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/10 - Segmentation; Edge detection
    • G06T7/11 - Region-based segmentation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a real-time detection method for a robot target grabbing area based on an SE-RetinaGrasp model, which comprises the following steps: downloading a data set through an interface and acquiring, with a visual sensor, images containing the target objects to be grabbed by the robot, to construct the training data set; preprocessing the images in the training data set; constructing a grabbing detection model by adopting a RetinaNet model and SENet modules; inputting the preprocessed training data set into the grabbing detection model, and training the grabbing detection model by adopting a transfer learning method and a stochastic gradient descent method; and acquiring, in real time with the visual sensor, the robot target grabbing image to be detected, and feeding it to the grabbing detection model to obtain a target grabbing area detection image with a grabbing frame. The method improves the prediction of the grabbing area and the detection accuracy, and effectively enhances the model's ability to capture detail information.

Description

Real-time detection method for robot target grabbing area based on SE-RetinaGrasp model
Technical Field
The invention relates to the technical field of robot grabbing, in particular to a robot target grabbing area real-time detection method based on an SE-RetinaGrasp model.
Background
In the field of intelligent robots, autonomous grabbing is a key capability of an intelligent robot. Existing methods for detecting the robot target grabbing area include grabbing area detection based on a sliding-window detection frame, global grabbing prediction, and second-order (two-stage) grabbing detection.
The sliding-window approach searches the grabbing area exhaustively, so it is slow and computationally expensive and cannot meet the real-time requirement of robot grabbing detection. Global grabbing prediction easily produces an averaging effect: the predicted grabbing frame tends toward the center of the object, so prediction is poor when the graspable part is the edge of an object such as a plate. Two-stage grabbing detection achieves higher accuracy, but at the cost of detection time, so it also cannot meet the real-time requirement of robot grabbing; moreover, its predicted grabbing frames are large and its performance on small grabbing frames is insufficient, so the accuracy of the grabbing frame still needs to be improved.
Disclosure of Invention
Aiming at overcoming the defect of low accuracy in grabbing area prediction in the prior art, the invention provides a real-time detection method for a robot target grabbing area based on an SE-RetinaGrasp model.
In order to solve the technical problems, the technical scheme of the invention is as follows:
The robot target grabbing area real-time detection method based on the SE-RetinaGrasp model comprises the following steps:
S1: downloading a data set through an interface and acquiring, with a visual sensor, images containing the target objects to be grabbed by the robot, to construct the training data set;
S2: preprocessing the images in the training data set;
S3: constructing a grabbing detection model by adopting a RetinaNet model and SENet modules;
S4: inputting the preprocessed training data set into the grabbing detection model, and training the grabbing detection model by adopting a transfer learning method and a stochastic gradient descent method;
S5: acquiring, in real time with the visual sensor, the robot target grabbing image to be detected, and feeding it to the grabbing detection model to obtain a target grabbing area detection image with a grabbing frame.
In this technical scheme, the position and angle of the grabbing frame are extracted on the basis of the one-stage detection model RetinaNet. Embedding SENet modules increases the weight of the feature channels that matter for grabbing in the image to be detected: the modules establish the interdependence between feature channels, promote features that contribute positively to the grabbing detection task and suppress useless ones, thereby improving detection accuracy. When training the grabbing detection model, the training data set is built both from a data set downloaded through an interface and from images, acquired with a visual sensor, that contain the target objects to be grabbed by the robot, which ensures the diversity of the training samples; preprocessing the training data set allows the grabbing detection model to detect the grabbing area quickly. After the grabbing detection model has been trained, the robot can acquire the target grabbing image to be detected in real time with the visual sensor and feed it to the trained model to obtain an RGB image of the target object on which a grabbing frame for grabbing the object is drawn, realizing real-time detection of the target grabbing area.
Preferably, in step S2, the specific steps of preprocessing the images in the training data set include:
S21: randomly translating the images in the training data set within n pixels along the x axis and the y axis respectively, where n is a positive integer and n ≥ 50;
S22: randomly rotating the translated images within the range of 0-360°;
S23: center-cropping the rotated images to obtain images of the same size;
S24: adjusting the image resolution of the cropped images;
S25: performing data labelling on the resolution-adjusted images: the 180° angle range, together with a background class, is divided into a plurality of label categories, and the angle value in each label is assigned to the nearest angle bin.
Preferably, in step S3, the RetinaNet model includes a ResNet50 feature-extraction network, a feature pyramid (FPN) structure and 3 FCN subnetworks, where a SENet module is embedded after each residual block in the ResNet50 feature-extraction network; the outputs of the SENet modules of layers 3, 4 and 5 of the ResNet50 feature-extraction network are respectively connected to the inputs of the feature pyramid FPN structure; and the outputs of the feature pyramid FPN structure are respectively connected to the inputs of the 3 FCN subnetworks.
Preferably, the SENet module performs the following operations on the feature map output by the residual block:
Squeeze operation: global average pooling compresses each feature map, converting the C feature maps into a 1 × 1 × C vector of real numbers;
Excitation operation: a first convolution layer reduces the feature dimension from C to C/r, a ReLU activation adds nonlinearity, a second convolution layer restores the reduced features to the original dimension, and a Sigmoid function yields the normalized weights, where r denotes the compression ratio;
Feature recalibration operation: the weights are multiplied channel by channel onto the original features, recalibrating the original features.
Preferably, the output of the layer-5 SENet module in the ResNet50 feature-extraction network is connected in sequence to a convolution layer with a 3 × 3 kernel and stride 2, a ReLU activation layer, and another convolution layer with a 3 × 3 kernel and stride 2.
Preferably, the feature pyramid FPN structure adopts a balanced feature pyramid structure: the P3 feature map output by the layer-3 SENet module is max-pooled and the P5 feature map output by the layer-5 SENet module is upsampled so that their resolutions match that of the P4 feature map output by the layer-4 SENet module; the corresponding elements of the P3, P4 and P5 feature maps are then added and averaged to obtain the balanced feature map P' for output. The expression is:

P' = (1/Y) * Σ_{l = l_min}^{l_max} P_l

where Y is the number of layers being added, l_max denotes the highest layer, l_min denotes the lowest layer, and P_l denotes the features of the l-th layer.
Preferably, in step S4, the specific steps include:
S41: downloading the Microsoft COCO data set through an interface to pre-train the ResNet50 feature-extraction network of the grabbing detection model, obtaining initial values for its parameters;
S42: initializing the parameters of the grabbing detection model using a standard Gaussian distribution;
S43: feeding the images of the preprocessed training data set to the grabbing detection model in RGB form and training the model by stochastic gradient descent, with the learning rate initialized to 0.0001, a learning-rate decay factor of 5, a batch size of 2 images, and the number of epochs set to 20.
Preferably, step S4 further includes: training the grabbing detection model with the Focal Loss function as the classification loss, the Focal Loss being computed as

FL = -(1/N) * Σ_i Σ_t [ α_t * y_{i,t} * (1 - p_{i,t})^γ * log(p_{i,t}) ]

where N is the number of samples and T is the number of classes; y_{i,t} is the label indicating whether the i-th sample belongs to the t-th class, and p_{i,t} is the predicted probability that the i-th sample belongs to the t-th class; α_t is a balance parameter that adjusts the contribution of positive and negative samples to the total classification loss; and γ is a hyperparameter.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects: the grabbing detection model is built by combining SENet modules, a feature pyramid FPN structure and FCN subnetworks, which improves the prediction of the grabbing area; the SENet structure establishes the interdependence between feature channels and strengthens the image features relevant to grabbing detection, improving detection accuracy; and the balanced feature pyramid FPN structure fuses feature information from different levels, enhancing the model's ability to capture detail information and effectively strengthening the detection of small grabbing frames.
Drawings
Fig. 1 is a flowchart of a robot target grabbing area real-time detection method based on an SE-RetinaGrasp model according to this embodiment.
Fig. 2 is a schematic structural diagram of the SE-RetinaGrasp model in this embodiment.
Fig. 3 is a schematic structural diagram of the SE-RetinaGrasp model of this embodiment.
Fig. 4 is a schematic structural diagram of the SENet module of the present embodiment.
Fig. 5 is a schematic structural diagram of the feature pyramid FPN structure of the present embodiment.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Fig. 1 is a flowchart of a real-time detection method for a target grabbing area of a robot based on an SE-RetinaGrasp model according to this embodiment.
The embodiment provides a robot target grabbing area real-time detection method based on an SE-RetinaGrasp model, which comprises the following steps:
S1: downloading a data set through an interface and acquiring, with a visual sensor, images containing the target objects to be grabbed by the robot, to construct the training data set.
In this embodiment, the Cornell Grasping Dataset is downloaded through the interface, and images containing the target objects to be grabbed by the robot are acquired with the vision sensor. The Cornell Grasping Dataset covers a wide variety of object types but contains a relatively small number of samples, so combining it with images acquired by the vision sensor enriches the types of training samples.
S2: preprocessing the images in the training dataset.
In this step, the specific steps of preprocessing the images in the training data set include:
S21: randomly translating the images in the training data set by up to 50 pixels along the x axis and the y axis respectively;
S22: randomly rotating the translated images within the range of 0-360°;
S23: center-cropping the rotated images to a size of 321 × 321;
S24: resizing the cropped images to a resolution of 227 × 227;
S25: performing data labelling on the resolution-adjusted images: the 180° angle range, together with a background class, is divided into 20 label categories, and the angle value in each label is assigned to the nearest angle bin.
In this embodiment, random translation and random rotation make the images in the training data set cover as many situations as possible, so that the robot can grab objects accurately and stably in diverse, arbitrary poses; cropping the images and adjusting their resolution allows the grabbing area to be detected quickly. Taking the symmetry of the grabbing angle into account, the 180° angle range is divided into 19 bins and a background class is added, giving 20 classes in total; the angle value in each label is assigned to the nearest bin, and the original oriented rectangle is replaced by an axis-aligned rectangle without angular inclination, so that during subsequent training the grabbing detection model fits a rectangle perpendicular to the x axis of the image and predicts its angle class.
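The angle discretisation described above can be illustrated with the following sketch, assuming 19 angle bins over [0°, 180°) plus a background class with index 0 and assignment to the nearest bin center; the function names are illustrative and not part of the embodiment.

import numpy as np

NUM_ANGLE_BINS = 19                      # angle classes; class index 0 is reserved for background
BIN_WIDTH = 180.0 / NUM_ANGLE_BINS

def angle_to_class(theta_deg: float) -> int:
    """Map a grasp-rectangle orientation to a label in {1, ..., 19}."""
    theta = theta_deg % 180.0            # 180-degree symmetry of the grabbing rectangle
    centers = (np.arange(NUM_ANGLE_BINS) + 0.5) * BIN_WIDTH
    return int(np.argmin(np.abs(centers - theta))) + 1   # +1 keeps class 0 for background

def class_to_angle(cls: int) -> float:
    """Recover the representative angle (bin center) for a predicted class."""
    return (cls - 0.5) * BIN_WIDTH

print(angle_to_class(93.0), class_to_angle(angle_to_class(93.0)))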
S3: and constructing a grabbing detection model by adopting a RetinaNet model and a SEnet module to obtain an SE-RetinaGrasp model.
In this embodiment, the RetinaNet model includes a ResNet50 feature-extraction network, a feature pyramid FPN structure and 3 FCN subnetworks, where a SENet module is embedded after each residual block of the ResNet50 feature-extraction network; the outputs of the SENet modules of layers 3, 4 and 5 in the ResNet50 feature-extraction network are respectively connected to the inputs of the feature pyramid FPN structure; and the outputs of the feature pyramid FPN structure are connected to the inputs of the 3 FCN subnetworks respectively.
Fig. 2 and Fig. 3 are schematic structural diagrams of the SE-RetinaGrasp model of this embodiment. In this embodiment, only the C3, C4 and C5 feature maps of ResNet50 are used, which avoids generating anchors on the high-resolution C2 feature map and reduces the detection time of the model.
The output of the layer-5 SENet module in the ResNet50 feature-extraction network is connected in sequence to a convolution layer with a 3 × 3 kernel and stride 2, a ReLU activation layer, and another convolution layer with a 3 × 3 kernel and stride 2. The C5 feature map output by the layer-5 SENet module passes through the first 3 × 3, stride-2 convolution to produce the P6 level; a ReLU activation is applied to P6 on a branch and the same convolution operation is applied again to obtain the P7 level. Candidate regions with larger areas are generated on P6 and P7, which strengthens the model's ability to detect large objects.
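By way of illustration only, a minimal PyTorch sketch of these extra P6/P7 levels follows; the channel widths (2048 for C5 and 256 for the pyramid levels) are assumptions borrowed from common ResNet50/FPN configurations rather than values fixed by the embodiment.

import torch
import torch.nn as nn

class ExtraLevels(nn.Module):
    def __init__(self, c5_channels: int = 2048, out_channels: int = 256):
        super().__init__()
        self.p6 = nn.Conv2d(c5_channels, out_channels, kernel_size=3, stride=2, padding=1)
        self.p7 = nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=2, padding=1)

    def forward(self, c5: torch.Tensor):
        p6 = self.p6(c5)                 # 3x3 convolution, stride 2, applied to the C5 feature map
        p7 = self.p7(torch.relu(p6))     # ReLU branch, then another 3x3 stride-2 convolution
        return p6, p7

# Example: a C5 map with 2048 channels at 7x7 resolution yields P6 (4x4) and P7 (2x2)
p6, p7 = ExtraLevels()(torch.randn(1, 2048, 7, 7))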
In this embodiment, grabbing detection detects positions on the object that can be grabbed, which differs from object detection, where the position of the object itself in the image is detected. Since images in the Cornell Grasping Dataset contain only a single target object, to better apply the RetinaNet model to the grabbing detection problem this embodiment generates grabbing candidate regions only on the three levels P3, P4 and P5, using candidate windows with base sizes of {8², 16², 32²}, combined with a set of three scales and three aspect ratios of {1:2, 1:1, 2:1}, so that grabbing candidate frames of various sizes are searched.
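The grabbing candidate windows for one pyramid level can be sketched as follows. The base sizes and aspect ratios are those stated above; the three per-size scale multipliers {2^0, 2^(1/3), 2^(2/3)} are an assumption borrowed from the standard RetinaNet recipe, since the embodiment only specifies that three scales are used.

import numpy as np

BASE_SIZES = {"P3": 8, "P4": 16, "P5": 32}
SCALES = [2 ** 0, 2 ** (1.0 / 3), 2 ** (2.0 / 3)]   # assumed scale multipliers
ASPECT_RATIOS = [0.5, 1.0, 2.0]                     # height/width ratios for 1:2, 1:1 and 2:1 windows

def anchors_for_level(level: str) -> np.ndarray:
    """Return the 9 (scale x ratio) candidate-window shapes, as (width, height) pairs, for one level."""
    base = BASE_SIZES[level]
    shapes = []
    for s in SCALES:
        area = (base * s) ** 2
        for r in ASPECT_RATIOS:
            w = np.sqrt(area / r)        # w * h = area and h / w = r
            shapes.append((w, w * r))
    return np.array(shapes)

print(anchors_for_level("P3").round(1))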
Fig. 4 is a schematic structural diagram of the SENet module of this embodiment. In this embodiment, the SENet module performs the following operations on the feature map output by the residual block:
Squeeze operation: global average pooling compresses each feature map, converting the C feature maps into a 1 × 1 × C vector of real numbers;
Excitation operation: a first convolution layer reduces the feature dimension from C to C/r, a ReLU activation adds nonlinearity, a second convolution layer restores the reduced features to the original dimension, and a Sigmoid function yields the normalized weights, where r denotes the compression ratio;
Feature recalibration operation: the weights are multiplied channel by channel onto the original features, recalibrating the original features.
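A minimal PyTorch sketch of the squeeze, excitation and recalibration operations described above follows; the default compression ratio r = 16 is an assumption taken from the original SENet design and is not fixed by the embodiment.

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)                            # global average pooling -> 1x1xC
        self.reduce = nn.Conv2d(channels, channels // r, kernel_size=1)   # C -> C/r
        self.relu = nn.ReLU(inplace=True)
        self.restore = nn.Conv2d(channels // r, channels, kernel_size=1)  # C/r -> C
        self.sigmoid = nn.Sigmoid()                                       # normalized channel weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.sigmoid(self.restore(self.relu(self.reduce(self.squeeze(x)))))
        return x * w                                                      # channel-wise recalibration

# Example: recalibrate a residual-block output with 256 channels
out = SEBlock(256)(torch.randn(1, 256, 28, 28))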
Fig. 5 is a schematic structural diagram of the feature pyramid FPN structure of this embodiment. In this embodiment, the feature pyramid FPN structure adopts a balanced feature pyramid structure: the P3 feature map output by the layer-3 SENet module is max-pooled and the P5 feature map output by the layer-5 SENet module is upsampled so that their resolutions match that of the P4 feature map output by the layer-4 SENet module; the corresponding elements of the P3, P4 and P5 feature maps are then added and averaged to obtain the balanced feature map P' for output. The expression is:

P' = (1/Y) * Σ_{l = l_min}^{l_max} P_l

where Y is the number of layers being added, l_max denotes the highest layer, l_min denotes the lowest layer, and P_l denotes the features of the l-th layer.
The balanced feature map P' is then fed into a two-dimensional convolution layer, adjusted in resolution by max pooling and upsampling, and combined element-wise with the P3, P4 and P5 feature maps to obtain the balanced feature maps P3', P4' and P5' corresponding to P3, P4 and P5 respectively.
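The balancing of the P3-P5 levels can be sketched as follows, assuming PyTorch tensors of equal channel width; the 3 × 3 refining convolution stands in for the two-dimensional convolution layer mentioned above, and adding the rescaled balanced map back to each original level to form P3', P4' and P5' is an assumption about the recombination step.

import torch
import torch.nn.functional as F
from torch import nn

def balance_pyramid(p3, p4, p5, refine):
    """Balance the P3-P5 levels and return P3', P4', P5'."""
    size = p4.shape[-2:]
    p3_down = F.adaptive_max_pool2d(p3, size)              # max-pool P3 down to the P4 resolution
    p5_up = F.interpolate(p5, size=size, mode="nearest")   # upsample P5 to the P4 resolution
    balanced = refine((p3_down + p4 + p5_up) / 3.0)        # element-wise average, then refine
    # rescale the balanced map back to each level and add it to the original features
    p3_b = p3 + F.interpolate(balanced, size=p3.shape[-2:], mode="nearest")
    p4_b = p4 + balanced
    p5_b = p5 + F.adaptive_max_pool2d(balanced, p5.shape[-2:])
    return p3_b, p4_b, p5_b

refine = nn.Conv2d(256, 256, kernel_size=3, padding=1)     # stands in for the 2-D convolution layer
p3, p4, p5 = (torch.randn(1, 256, s, s) for s in (56, 28, 14))
p3_b, p4_b, p5_b = balance_pyramid(p3, p4, p5, refine)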
S4: inputting the preprocessed training data set into the grabbing detection model, and training the grabbing detection model by adopting a transfer learning method and a stochastic gradient descent method.
In this step, the specific steps include:
S41: downloading the Microsoft COCO data set through an interface to pre-train the ResNet50 feature-extraction network of the grabbing detection model, obtaining initial values for its parameters;
S42: initializing the parameters of the grabbing detection model using a standard Gaussian distribution;
S43: feeding the images of the preprocessed training data set to the grabbing detection model in RGB form and training the model by stochastic gradient descent, with the learning rate initialized to 0.0001, a learning-rate decay factor of 5, a batch size of 2 images, and the number of epochs set to 20;
S44: training the grabbing detection model with the Focal Loss function as the classification loss, the Focal Loss being computed as

FL = -(1/N) * Σ_i Σ_t [ α_t * y_{i,t} * (1 - p_{i,t})^γ * log(p_{i,t}) ]

where N is the number of samples and T is the number of classes; y_{i,t} is the label indicating whether the i-th sample belongs to the t-th class, and p_{i,t} is the predicted probability that the i-th sample belongs to the t-th class; α_t is a balance parameter that adjusts the contribution of positive and negative samples to the total classification loss; and γ is a hyperparameter.
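A self-contained sketch of this multi-class Focal Loss follows; the values α = 0.25 and γ = 2.0 are assumptions taken from the original focal-loss formulation, since the embodiment does not fix them numerically, and a single α is used in place of the per-class α_t.

import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """logits: (N, T) class scores; targets: (N,) integer labels in [0, T)."""
    probs = logits.softmax(dim=-1)                                  # p_{i,t}
    y = F.one_hot(targets, num_classes=logits.shape[-1]).float()    # y_{i,t}
    p_t = (probs * y).sum(dim=-1)                                   # probability of the true class
    loss = -alpha * (1.0 - p_t) ** gamma * torch.log(p_t.clamp_min(1e-8))
    return loss.mean()                                              # average over the N samples

loss = focal_loss(torch.randn(4, 20), torch.tensor([0, 3, 19, 7]))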
S5: acquiring, in real time with the visual sensor, the robot target grabbing image to be detected, and feeding it to the grabbing detection model to obtain a target grabbing area detection image with a grabbing frame.
In this embodiment, the grabbing detection model is built on the basis of the RetinaNet model, combined with SENet modules, a feature pyramid FPN structure and FCN subnetworks. The SENet structure establishes the interdependence between feature channels, promoting features that contribute positively to grabbing detection and suppressing useless ones, which improves detection accuracy; the balanced feature pyramid FPN structure further fuses feature information from different levels without adding many parameters, strengthening the model's ability to capture detail information and to detect small grabbing frames. The robot target grabbing area real-time detection method based on the SE-RetinaGrasp model provided by this embodiment therefore achieves high detection accuracy, runs at a real-time detection speed, and improves the fineness of the grabbing frame to a certain extent.
The same or similar reference numerals correspond to the same or similar parts;
the terms describing positional relationships in the drawings are for illustrative purposes only and are not to be construed as limiting the patent;
it should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims (8)

1. A robot target grabbing area real-time detection method based on the SE-RetinaGrasp model, characterized by comprising the following steps:
S1: downloading a data set through an interface and acquiring, with a visual sensor, images containing the target objects to be grabbed by the robot, to construct the training data set;
S2: preprocessing the images in the training data set;
S3: constructing a grabbing detection model by adopting a RetinaNet model and SENet modules;
S4: inputting the preprocessed training data set into the grabbing detection model, and training the grabbing detection model by adopting a transfer learning method and a stochastic gradient descent method;
S5: acquiring, in real time with the visual sensor, the robot target grabbing image to be detected, and feeding it to the grabbing detection model to obtain a target grabbing area detection image with a grabbing frame.
2. The real-time detection method of the robot target grabbing area according to claim 1, characterized in that, in step S2, the specific steps of preprocessing the images in the training data set include:
S21: randomly translating the images in the training data set within n pixels along the x axis and the y axis respectively, where n is a positive integer and n ≥ 50;
S22: randomly rotating the translated images within the range of 0-360°;
S23: center-cropping the rotated images to obtain images of the same size;
S24: adjusting the image resolution of the cropped images;
S25: performing data labelling on the resolution-adjusted images: the 180° angle range, together with a background class, is divided into a plurality of label categories, and the angle value in each label is assigned to the nearest angle bin.
3. The real-time detection method of the robot target grabbing area according to claim 2, characterized in that, in step S3, the RetinaNet model includes a ResNet50 feature-extraction network, a feature pyramid FPN structure and 3 FCN subnetworks, wherein a SENet module is embedded after each residual block in the ResNet50 feature-extraction network; the outputs of the SENet modules of layers 3, 4 and 5 in the ResNet50 feature-extraction network are respectively connected to the inputs of the feature pyramid FPN structure; and the outputs of the feature pyramid FPN structure are respectively connected to the inputs of the 3 FCN subnetworks.
4. The real-time detection method of the robot target grabbing area according to claim 3, characterized in that the SENet module performs the following operations on the feature map output by the residual block:
squeeze operation: global average pooling compresses each feature map, converting the C feature maps into a 1 × 1 × C vector of real numbers;
excitation operation: a first convolution layer reduces the feature dimension from C to C/r, a ReLU activation adds nonlinearity, a second convolution layer restores the reduced features to the original dimension, and a Sigmoid function yields the normalized weights, where r is the compression ratio;
feature recalibration operation: the weights are multiplied channel by channel onto the original features, recalibrating the original features.
5. The real-time detection method of the robot target grabbing area according to claim 3, characterized in that the output of the layer-5 SENet module in the ResNet50 feature-extraction network is connected in sequence to a convolution layer with a 3 × 3 kernel and stride 2, a ReLU activation layer, and another convolution layer with a 3 × 3 kernel and stride 2.
6. The real-time detection method of the robot target grabbing area according to claim 3, characterized in that the feature pyramid FPN structure adopts a balanced feature pyramid structure: the P3 feature map output by the layer-3 SENet module is max-pooled and the P5 feature map output by the layer-5 SENet module is upsampled so that their resolutions match that of the P4 feature map output by the layer-4 SENet module; the corresponding elements of the P3, P4 and P5 feature maps are then added and averaged to obtain the balanced feature map P' for output; the expression is:

P' = (1/Y) * Σ_{l = l_min}^{l_max} P_l

where Y is the number of layers being added, l_max denotes the highest layer, l_min denotes the lowest layer, and P_l denotes the features of the l-th layer.
7. The real-time detection method of the robot target grabbing area according to claim 3, characterized in that step S4 specifically includes:
S41: downloading the Microsoft COCO data set through an interface to pre-train the ResNet50 feature-extraction network of the grabbing detection model, obtaining initial values for its parameters;
S42: initializing the parameters of the grabbing detection model using a standard Gaussian distribution;
S43: feeding the images of the preprocessed training data set to the grabbing detection model in RGB form and training the model by stochastic gradient descent, with the learning rate initialized to 0.0001, a learning-rate decay factor of 5, a batch size of 2 images, and the number of epochs set to 20.
8. The real-time detection method of the robot target grabbing area according to claim 7, characterized in that step S4 further includes: training the grabbing detection model with the Focal Loss function as the classification loss, the Focal Loss being computed as

FL = -(1/N) * Σ_i Σ_t [ α_t * y_{i,t} * (1 - p_{i,t})^γ * log(p_{i,t}) ]

where N is the number of samples and T is the number of classes; y_{i,t} is the label indicating whether the i-th sample belongs to the t-th class, and p_{i,t} is the predicted probability that the i-th sample belongs to the t-th class; α_t is a balance parameter that adjusts the contribution of positive and negative samples to the total classification loss; and γ is a hyperparameter.
CN201910925919.4A 2019-09-27 2019-09-27 Real-time detection method for robot target grabbing area based on SE-RetinaGrasp model Pending CN110717532A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910925919.4A CN110717532A (en) 2019-09-27 2019-09-27 Real-time detection method for robot target grabbing area based on SE-RetinaGrasp model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910925919.4A CN110717532A (en) 2019-09-27 2019-09-27 Real-time detection method for robot target grabbing area based on SE-RetinaGrasp model

Publications (1)

Publication Number Publication Date
CN110717532A true CN110717532A (en) 2020-01-21

Family

ID=69211051

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910925919.4A Pending CN110717532A (en) 2019-09-27 2019-09-27 Real-time detection method for robot target grabbing area based on SE-RetinaGrasp model

Country Status (1)

Country Link
CN (1) CN110717532A (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111507199A (en) * 2020-03-25 2020-08-07 杭州电子科技大学 Method and device for detecting mask wearing behavior
CN111523478A (en) * 2020-04-24 2020-08-11 中山大学 Pedestrian image detection method acting on target detection system
CN111553321A (en) * 2020-05-18 2020-08-18 城云科技(中国)有限公司 Mobile vendor target detection model, detection method and management method thereof
CN111626379A (en) * 2020-07-07 2020-09-04 中国计量大学 X-ray image detection method for pneumonia
CN111783772A (en) * 2020-06-12 2020-10-16 青岛理工大学 Grabbing detection method based on RP-ResNet network
CN112068422A (en) * 2020-08-04 2020-12-11 广州中国科学院先进技术研究所 Grabbing learning method and device of intelligent robot based on small samples
CN112330664A (en) * 2020-11-25 2021-02-05 腾讯科技(深圳)有限公司 Pavement disease detection method and device, electronic equipment and storage medium
CN112633218A (en) * 2020-12-30 2021-04-09 深圳市优必选科技股份有限公司 Face detection method and device, terminal equipment and computer readable storage medium
CN112686297A (en) * 2020-12-29 2021-04-20 中国人民解放军海军航空大学 Radar target motion state classification method and system
CN113076972A (en) * 2021-03-04 2021-07-06 山东师范大学 Two-stage Logo image detection method and system based on deep learning
CN113762159A (en) * 2021-09-08 2021-12-07 山东大学 Target grabbing detection method and system based on directional arrow model
CN115998295A (en) * 2023-03-24 2023-04-25 广东工业大学 Blood fat estimation method, system and device combining far-near infrared light

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109614985A (en) * 2018-11-06 2019-04-12 华南理工大学 A kind of object detection method based on intensive connection features pyramid network

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109614985A (en) * 2018-11-06 2019-04-12 华南理工大学 A kind of object detection method based on intensive connection features pyramid network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
卢智亮 等 (Lu Zhiliang et al.): "机器人目标抓取区域实时检测方法" (Real-time detection method for a robot target grabbing area) *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111507199A (en) * 2020-03-25 2020-08-07 杭州电子科技大学 Method and device for detecting mask wearing behavior
CN111523478A (en) * 2020-04-24 2020-08-11 中山大学 Pedestrian image detection method acting on target detection system
CN111523478B (en) * 2020-04-24 2023-04-28 中山大学 Pedestrian image detection method acting on target detection system
CN111553321A (en) * 2020-05-18 2020-08-18 城云科技(中国)有限公司 Mobile vendor target detection model, detection method and management method thereof
CN111783772A (en) * 2020-06-12 2020-10-16 青岛理工大学 Grabbing detection method based on RP-ResNet network
CN111626379A (en) * 2020-07-07 2020-09-04 中国计量大学 X-ray image detection method for pneumonia
CN111626379B (en) * 2020-07-07 2024-01-05 中国计量大学 X-ray image detection method for pneumonia
CN112068422A (en) * 2020-08-04 2020-12-11 广州中国科学院先进技术研究所 Grabbing learning method and device of intelligent robot based on small samples
CN112330664B (en) * 2020-11-25 2022-02-08 腾讯科技(深圳)有限公司 Pavement disease detection method and device, electronic equipment and storage medium
CN112330664A (en) * 2020-11-25 2021-02-05 腾讯科技(深圳)有限公司 Pavement disease detection method and device, electronic equipment and storage medium
CN112686297A (en) * 2020-12-29 2021-04-20 中国人民解放军海军航空大学 Radar target motion state classification method and system
CN112633218B (en) * 2020-12-30 2023-10-13 深圳市优必选科技股份有限公司 Face detection method, face detection device, terminal equipment and computer readable storage medium
CN112633218A (en) * 2020-12-30 2021-04-09 深圳市优必选科技股份有限公司 Face detection method and device, terminal equipment and computer readable storage medium
CN113076972A (en) * 2021-03-04 2021-07-06 山东师范大学 Two-stage Logo image detection method and system based on deep learning
CN113762159A (en) * 2021-09-08 2021-12-07 山东大学 Target grabbing detection method and system based on directional arrow model
CN113762159B (en) * 2021-09-08 2023-08-08 山东大学 Target grabbing detection method and system based on directional arrow model
CN115998295A (en) * 2023-03-24 2023-04-25 广东工业大学 Blood fat estimation method, system and device combining far-near infrared light

Similar Documents

Publication Publication Date Title
CN110717532A (en) Real-time detection method for robot target grabbing area based on SE-RetinaGrasp model
CN108665496A (en) A kind of semanteme end to end based on deep learning is instant to be positioned and builds drawing method
CN109377530A (en) A kind of binocular depth estimation method based on deep neural network
CN111144329A (en) Light-weight rapid crowd counting method based on multiple labels
CN109034184B (en) Grading ring detection and identification method based on deep learning
WO2022095253A1 (en) Method for removing cloud and haze on basis of depth channel sensing
CN112818969A (en) Knowledge distillation-based face pose estimation method and system
CN114092793B (en) End-to-end biological target detection method suitable for complex underwater environment
CN104517126A (en) Air quality assessment method based on image analysis
CN113449691A (en) Human shape recognition system and method based on non-local attention mechanism
CN109523558A (en) A kind of portrait dividing method and system
CN110399908A (en) Classification method and device based on event mode camera, storage medium, electronic device
CN110647909A (en) Remote sensing image classification method based on three-dimensional dense convolution neural network
CN112396635A (en) Multi-target detection method based on multiple devices in complex environment
CN116416244A (en) Crack detection method and system based on deep learning
CN116229226A (en) Dual-channel image fusion target detection method suitable for photoelectric pod
CN116258940A (en) Small target detection method for multi-scale features and self-adaptive weights
CN115482529A (en) Method, equipment, storage medium and device for recognizing fruit image in near scene
CN114359578A (en) Application method and system of pest and disease damage identification intelligent terminal
CN112668675B (en) Image processing method and device, computer equipment and storage medium
CN113901928A (en) Target detection method based on dynamic super-resolution, and power transmission line component detection method and system
CN110956115B (en) Scene recognition method and device
CN116563844A (en) Cherry tomato maturity detection method, device, equipment and storage medium
CN111401453A (en) Mosaic image classification and identification method and system
CN116129234A (en) Attention-based 4D millimeter wave radar and vision fusion method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200121