CN113370217B - Object gesture recognition and grabbing intelligent robot method based on deep learning - Google Patents

Object gesture recognition and grabbing intelligent robot method based on deep learning

Info

Publication number
CN113370217B
CN113370217B
Authority
CN
China
Prior art keywords
data set
camera
data
training
algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110732696.7A
Other languages
Chinese (zh)
Other versions
CN113370217A (en)
Inventor
杜广龙 (Du Guanglong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202110732696.7A priority Critical patent/CN113370217B/en
Publication of CN113370217A publication Critical patent/CN113370217A/en
Application granted granted Critical
Publication of CN113370217B publication Critical patent/CN113370217B/en
Legal status: Active

Classifications

    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 Programme-controlled manipulators
    • B25J9/16 Programme controls
    • B25J9/1628 Programme controls characterised by the control loop
    • B25J9/163 Programme controls characterised by the control loop learning, adaptive, model based, rule based expert control
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 Programme-controlled manipulators
    • B25J9/16 Programme controls
    • B25J9/1694 Programme controls characterised by use of sensors other than normal servo-feedback from position, speed or acceleration sensors, perception control, multi-sensor controlled systems, sensor fusion
    • B25J9/1697 Vision controlled systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a deep-learning-based intelligent robot method for object pose recognition and grabbing. The method comprises the following steps: building a virtual environment and a mechanical arm working platform model; based on the constructed virtual environment, randomizing the objects on the virtual model of the mechanical arm working platform and capturing camera images to obtain a dataset; constructing an object posture detector; building a neural network based on the constructed detector and training it with the acquired dataset; and migrating the trained object posture detector to the real platform. A large amount of training data is generated by simulating randomized objects in the virtual environment, a dedicated training procedure yields an object posture detector with stronger generalization ability, and the detector is then migrated to the real platform, realizing pose recognition and grabbing of basic objects.

Description

Object gesture recognition and grabbing intelligent robot method based on deep learning
Technical Field
The invention relates to the field of intelligent robot grabbing, and in particular to a deep-learning-based method by which an intelligent robot recognizes object poses and grabs objects.
Background
In the era of Industry 4.0, diverse robots are entering factories to assist production, replacing humans in dangerous or repetitive work tasks. Intelligent robots do not tire; they simply operate according to trained neural networks or rules, and well-performing intelligent robots are favored by industry and deployed in mass production.
However, with the popularization of industrial robots, concerns about training time and grabbing efficiency have also been raised. Although researchers try to make the training of a robot's neural network as fast and simple as possible, how to shorten the training time and improve the grabbing efficiency of the robot remains an open concern.
At present, the mainstream way to train an intelligent robot is still based on real scenes: training scenes are randomized in real life, and images of these scenes are collected to train the neural network. This approach has a major drawback: randomizing training scenes in the real world consumes a great deal of time, and the time needed to generate each unit of training data is long compared with what a computer simulation can achieve. For intelligent robot training, the time spent on the training process itself is relatively small, while the time spent generating the training dataset dominates; spending a larger share of time producing the training dataset than actually training the robot is hard to accept in practice.
There are also existing schemes that train the robot's neural network in simulation, for example "Using Simulation and Domain Adaptation to Improve Efficiency of Deep Robotic Grasping", IEEE International Conference on Robotics and Automation, 2018. That work presents an attractive alternative: using an off-the-shelf simulator to render synthetic virtual datasets and automatically generate ground-truth annotations for them. It argues that models trained solely on simulation data often cannot generalize to the real world, and studies how to extend randomized simulation environments and domain adaptation methods to train a grasping system that grasps new targets from raw monocular RGB images. It shows that by using synthetic virtual data and domain adaptation, the number of real-world samples required to reach a given performance level can be significantly reduced, relying mainly on randomly generated virtual datasets. However, that technique cannot be trained entirely without real-world data; real-world datasets are still needed, and the overfitting problem of the robot's neural network is not addressed.
Disclosure of Invention
Therefore, to address the shortcomings of the prior art, the invention discloses a deep-learning-based intelligent robot method for object pose recognition and grabbing. In a virtual environment, a randomization algorithm varies the training scenes across more scene and target-object factors, generating more possible training data so as to cover, as far as possible, the different working scenes encountered in industrial production. Because the virtual environment is built by computer, it has an advantage over conventional data-collection approaches in both the speed and the amount of data generated. An intelligent robot trained in this way trains faster than with the traditional approach, and when the model is migrated to a real robot, the wider coverage of the dataset and the optimization against overfitting give it better generalization ability and a stronger practical effect in a shorter time. Training of the intelligent robot's neural network is based entirely on the virtual dataset and does not rely on real-world datasets, which improves training efficiency; the neural network is also forced to attend to the pose features of the object rather than to the relation between pose and background in the training dataset, which reduces overfitting.
The object of the invention is achieved by at least one of the following technical solutions.
The deep-learning-based intelligent robot method for object pose recognition and grabbing comprises the following steps:
s1: building a virtual environment and a mechanical arm working platform model;
s2: based on the virtual environment constructed in the step S1, carrying out randomization treatment on an object on the virtual model of the mechanical arm working platform, and obtaining a camera shooting image to obtain a data set;
s3: constructing an object posture detector;
s4: constructing a neural network based on the object posture detector constructed in the step S3, and training the neural network by using the data set obtained in the step S2;
s5: and (3) migrating the object posture detector trained in the step S4 to a real platform.
Further, step S1 includes the steps of:
s1.1, acquiring the size and the shape of a mechanical arm working platform in a real environment, and constructing the mechanical arm and the mechanical arm working platform in a virtual environment in a one-to-one manner; simultaneously constructing a plurality of object models;
s1.2, splicing the object model obtained in the step S1.1 in a virtual environment, and simulating a real mechanical arm working platform and an actual basic environment.
Further, in step S2, the randomization process includes:
randomizing the appearance and drop positions of a plurality of different object models;
randomizing the color and the material of the object model;
randomizing ambient light.
Further, in step S2, after randomization, RGB images from the camera's viewpoint in the virtual environment are captured as the dataset, and the exact position of each object model in each image is recorded for subsequent verification.
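As an illustration of this data-generation step, the following is a minimal sketch of one randomization-and-capture iteration, assuming PyBullet as the simulator (the patent does not name one); the URDF paths, camera placement, and randomization ranges are illustrative assumptions, not the patent's actual configuration.

```python
# A minimal sketch of the step-S2 randomization and data capture, assuming
# PyBullet as the simulator; paths, camera pose, and ranges are illustrative.
import random
import numpy as np
import pybullet as p
import pybullet_data

def generate_sample(object_urdfs, rng=random.Random()):
    p.resetSimulation()
    p.setGravity(0, 0, -9.8)
    p.setAdditionalSearchPath(pybullet_data.getDataPath())
    p.loadURDF("plane.urdf")                                      # stand-in for the work platform
    urdf = rng.choice(object_urdfs)                               # random object model (appearance)
    pos = [rng.uniform(-0.2, 0.2), rng.uniform(-0.2, 0.2), 0.15]  # random drop position
    body = p.loadURDF(urdf, basePosition=pos)
    p.changeVisualShape(body, -1,                                 # random color
                        rgbaColor=[rng.random(), rng.random(), rng.random(), 1.0])
    for _ in range(240):                                          # let the object settle
        p.stepSimulation()
    view = p.computeViewMatrix([0.5, 0.0, 0.6], [0.0, 0.0, 0.0], [0.0, 0.0, 1.0])
    proj = p.computeProjectionMatrixFOV(fov=60, aspect=1.0, nearVal=0.01, farVal=2.0)
    light = [rng.uniform(-1, 1), rng.uniform(-1, 1), rng.uniform(0.5, 1)]  # random light direction
    _, _, rgb, _, _ = p.getCameraImage(640, 640, view, proj, lightDirection=light)
    image = np.reshape(rgb, (640, 640, 4))[..., :3]               # RGB training image
    pose = p.getBasePositionAndOrientation(body)                  # ground-truth position/orientation
    return image, pose

# p.connect(p.DIRECT) must be called once before calling generate_sample().
```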
Further, in step S3, the object posture detector is constructed using the EPnP algorithm and the RANSAC algorithm. PnP (Perspective-n-Point) is the problem of computing the camera pose (or, equivalently, the object pose) given n known correspondences between 3D space points and their 2D image points; various solutions exist, such as Direct Linear Transformation (DLT), P3P, EPnP, UPnP, and nonlinear optimization methods. RANSAC (Random Sample Consensus) estimates the parameters of a mathematical model iteratively from a set of observations containing outliers and is widely used in computer vision; it effectively improves the accuracy of the EPnP object pose estimate. The construction comprises the following steps:
S3.1, using the EPnP algorithm, randomly select n reference points in the space between the workbench and the camera in the virtual environment. Acquire the 3D coordinates of these reference points in the world coordinate system, denoted p_i^w, i = 1, …, n, and at the same time acquire the 2D coordinates of these reference points on the projection plane captured by the camera, denoted u_i, i = 1, …, n;
S3.2, from the n selected reference points, select 4 control points in the world coordinate system and in the camera projection plane respectively, using Principal Component Analysis (PCA), denoted c_j^w, j = 1, …, 4 and c_j^c, j = 1, …, 4, satisfying

p_i^w = Σ_{j=1}^{4} a_ij · c_j^w,  with  Σ_{j=1}^{4} a_ij = 1,

where the a_ij are homogeneous barycentric coordinates. This condition states that the 4 selected control points can represent, by weighting, any 3D reference point in the world coordinate system; in the projection plane, the reference points and control points satisfy the same weighting relation;
S3.3, from the coordinates of the 4 control points in the world coordinate system and in the camera coordinate system obtained in steps S3.1 and S3.2, use a 3D-3D algorithm to obtain the rotation matrix R and the translation vector t, together called the camera external parameter matrix;
S3.4, using the RANSAC algorithm, take the camera external parameter matrix obtained in step S3.3 as the initial hypothesis model and test it against the selected reference points in all the other data of the dataset: the estimated 2D screen coordinates obtained by transforming the 3D space coordinates of a reference point through the camera external parameter matrix are compared with the actual 2D screen coordinates of that reference point obtained in step S3.1, giving an estimated-versus-actual distance error denoted d_mn, where m is the index of the reference point within a single data item and n is the index of the data item in the dataset; a threshold d_0 is set according to the actual accuracy requirement, and if d_mn <= d_0 the reference point is regarded as an inlier, otherwise as an outlier;
S3.5, in the first iteration, randomly select one data item in the dataset to start the iteration, and set the camera external parameter matrix obtained from it as the optimal external parameter matrix;
S3.6, repeat the RANSAC procedure for a number of iterations. Before iterating, a threshold k is set to decide whether the number of inliers obtained in one iteration meets the accuracy requirement; k must not be set too high, in order to prevent overfitting. In each RANSAC iteration, if the ratio of the number of inliers to the total number of reference points is greater than k and the number of inliers exceeds that of the previous optimal external parameter matrix, the camera external parameter matrix of this iteration is set as the optimal external parameter matrix. Iterations continue until finished, yielding the optimal camera external parameter matrix for the dataset. The number of iterations can be set according to the actual situation: in general, more iterations give higher accuracy but at a higher time cost, so a reasonable value must be chosen for the specific case;
S3.7, from the optimal external parameter matrix, obtain the pose of the camera, and from it the pose of the object model in the camera coordinate system.
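To make steps S3.1-S3.2 concrete, the following is a small NumPy sketch of the EPnP construction of control points and homogeneous barycentric coordinates; choosing the centroid plus the three principal directions as control points is the usual EPnP convention and is assumed here, since the patent does not fix this choice explicitly.

```python
# A NumPy sketch of the control-point construction in steps S3.1-S3.2.
import numpy as np

def choose_control_points(ref_pts_w):
    """ref_pts_w: (n, 3) world coordinates of the n reference points."""
    c0 = ref_pts_w.mean(axis=0)                               # first control point: the centroid
    centered = ref_pts_w - c0
    _, s, vt = np.linalg.svd(centered, full_matrices=False)   # PCA via SVD
    scales = s / np.sqrt(len(ref_pts_w))                      # spread along each principal axis
    ctrl = [c0] + [c0 + scales[j] * vt[j] for j in range(3)]
    return np.array(ctrl)                                     # (4, 3) control points c_j^w

def barycentric_weights(ref_pts_w, ctrl_w):
    """Solve for a_ij with p_i = sum_j a_ij * c_j and sum_j a_ij = 1."""
    M = np.vstack([ctrl_w.T, np.ones((1, 4))])                # 4x4 matrix with columns [c_j; 1]
    P = np.vstack([ref_pts_w.T, np.ones((1, len(ref_pts_w)))])
    return np.linalg.solve(M, P).T                            # (n, 4) homogeneous barycentric coords

# Once the control points' camera-frame coordinates are recovered, the same
# weights a_ij reproduce the reference points in the camera frame (step S3.3).
```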
Further, step S4 includes the steps of:
s4.1, adopting Python as a programming language, simultaneously considering flexibility and program size, and adopting an open-source PyTorch deep learning framework to construct a neural network;
S4.2, the invention is intended to be applicable to many scenes and to have strong generalization ability, whereas the dataset generated in step S2 is homogeneous; therefore, to effectively prevent overfitting and to force the neural network to focus on estimating the pose features of the object rather than the relation between pose and background, the dataset generated in step S2 is augmented to obtain an augmented dataset;
S4.3, train the neural network constructed in step S4.1 with the augmented dataset obtained in step S4.2, where 20% of the data is used for training and is referred to as the training dataset, and 80% of the data is used for evaluation and is referred to as the evaluation dataset;
S4.4, set a criterion to evaluate the final effect. Through the training and evaluation in step S4.3, the estimated poses of the object models in the evaluation dataset are obtained; the actual coordinate positions of the object models were already recorded in step S2, so the two sets of data correspond one to one. Using these two sets of data, namely the actual positions from step S2 and the estimated poses on the evaluation dataset, bounding boxes of the object model are constructed with a K-DOP algorithm, called the actual bounding box and the estimated bounding box; a bounding-box collision algorithm then gives the overlap between each estimated bounding box and the corresponding actual bounding box, which is used to judge whether the accuracy criterion is met.
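As an illustration of the evaluation in step S4.4, the sketch below uses an axis-aligned bounding box (the simplest K-DOP, with k = 6) and reports the overlap volume of the estimated box relative to the actual box; the point-cloud pose representation and the 0.9 default threshold (the 90% overlap of the embodiment) are assumptions for illustration.

```python
# A sketch of the step-S4.4 check using axis-aligned bounding boxes.
import numpy as np

def aabb(points):
    """Axis-aligned bounding box of an (n, 3) point set: (min_xyz, max_xyz)."""
    return points.min(axis=0), points.max(axis=0)

def overlap_ratio(box_a, box_b):
    """Volume of the intersection divided by the volume of box_a."""
    lo = np.maximum(box_a[0], box_b[0])
    hi = np.minimum(box_a[1], box_b[1])
    inter = np.prod(np.clip(hi - lo, 0.0, None))
    vol_a = np.prod(box_a[1] - box_a[0])
    return inter / vol_a if vol_a > 0 else 0.0

def pose_is_accurate(model_pts, T_actual, T_estimated, threshold=0.9):
    """Transform the model points by both 4x4 poses and compare bounding boxes."""
    homog = np.hstack([model_pts, np.ones((len(model_pts), 1))])
    actual_box = aabb((homog @ T_actual.T)[:, :3])
    estimated_box = aabb((homog @ T_estimated.T)[:, :3])
    return overlap_ratio(actual_box, estimated_box) >= threshold
```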
Further, the step S4.2 specifically includes the following steps:
S4.2.1, using the ground-truth data provided by the dataset obtained in step S2, obtain the pose of the object model and then cut the object model out of the image;
S4.2.2, composite the cut-out object model onto other pictures, thereby replacing the background;
S4.2.3, apply image processing to the composited image, including changing saturation, changing brightness, and adding noise, to obtain the augmented dataset.
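A minimal sketch of this augmentation pipeline is given below, assuming the simulator also provides a binary mask of the rendered object; the background image paths and jitter ranges are illustrative assumptions, not taken from the patent.

```python
# A sketch of the step-S4.2 augmentation: cut-out, background replacement, jitter, noise.
import random
import numpy as np
from PIL import Image, ImageEnhance

def augment(render_rgb, object_mask, background_paths, rng=random.Random()):
    # S4.2.1-S4.2.2: cut the object out and composite it onto a new background
    obj = Image.fromarray(render_rgb)
    mask = Image.fromarray((object_mask * 255).astype(np.uint8))
    bg = Image.open(rng.choice(background_paths)).convert("RGB").resize(obj.size)
    bg.paste(obj, (0, 0), mask)
    # S4.2.3: change saturation and brightness, then add Gaussian noise
    out = ImageEnhance.Color(bg).enhance(rng.uniform(0.5, 1.5))
    out = ImageEnhance.Brightness(out).enhance(rng.uniform(0.7, 1.3))
    arr = np.asarray(out, dtype=np.float32)
    arr += np.random.normal(0.0, 8.0, arr.shape)
    return np.clip(arr, 0, 255).astype(np.uint8)
```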
Further, in step S5, the trained object posture detector is applied to a mechanical arm in the laboratory; after building and training according to steps S1 to S4, the grabbing point must be computed from the pose of the object, so that the mechanical arm or intelligent robot can recognize and grab objects on the real platform.
Further, the step of calculating the grabbing point is as follows:
s5.1, calculating a bounding box by adopting a K-DOP algorithm according to the pose of the object model.
S5.2, selecting a grabbing point on the bounding box according to the actual type of the mechanical claw of the mechanical arm.
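As an illustration of steps S5.1-S5.2, the sketch below picks two opposing contact points on the bounding box for a two-finger parallel gripper; grasping across the shorter horizontal extent of the box with a top-down approach is an illustrative heuristic, not the patent's prescribed rule.

```python
# A sketch of grasp-point selection on the object's bounding box (steps S5.1-S5.2).
import numpy as np

def grasp_points_from_box(box_min, box_max):
    """box_min, box_max: (3,) corners of the object's bounding box in the robot frame."""
    center = (box_min + box_max) / 2.0
    extent = box_max - box_min
    axis = int(np.argmin(extent[:2]))        # squeeze across the narrower of x / y
    p1, p2 = center.copy(), center.copy()
    p1[axis] = box_min[axis]                 # first finger contact point
    p2[axis] = box_max[axis]                 # second finger contact point
    approach = np.array([0.0, 0.0, -1.0])    # top-down approach direction
    return p1, p2, approach
```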
Compared with the prior art, the invention has the advantages that:
the training of the intelligent robot neural network is completely based on the virtual data set, and does not need to rely on the real world data set, so that the training efficiency of the neural network is improved, and the neural network is forced to pay attention to the pose characteristics of the object instead of the relation between the pose and the background of the training data set, so that the problem of over-fitting is reduced. The method is applicable to multiple scenes and has strong generalization.
Drawings
FIG. 1 is a flow chart of object gesture recognition and grabbing based on virtual environment training according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, a detailed description of the specific implementation of the present invention will be given below with reference to the accompanying drawings and examples.
Examples:
The deep-learning-based intelligent robot method for object pose recognition and grabbing comprises the following steps, as shown in FIG. 1:
s1: building a virtual environment and a mechanical arm working platform model, wherein the method comprises the following steps of:
s1.1, acquiring the size and the shape of a mechanical arm working platform in a real environment, and constructing the mechanical arm and the mechanical arm working platform in a virtual environment in a one-to-one manner; simultaneously constructing a plurality of object models;
s1.2, splicing the object model obtained in the step S1.1 in a virtual environment, and simulating a real mechanical arm working platform and an actual basic environment.
S2: based on the virtual environment constructed in the step S1, carrying out randomization treatment on an object on the virtual model of the mechanical arm working platform, and obtaining a camera shooting image to obtain a data set;
the randomization process includes:
randomizing the appearance and drop positions of a plurality of different object models;
randomizing the color and the material of the object model;
randomizing ambient light.
After randomization processing, RGB pictures of camera lens angles in the virtual environment are obtained as a data set, and specific positions of object models in the pictures in the data set are obtained for subsequent verification.
S3: constructing an object posture detector;
the object pose detector is constructed by adopting an EPnP algorithm and a Ranac algorithm, pnP (peselect-n-point) is a known point pair of n spatial 3D points corresponding to 2D points of an image, and a problem of calculating the pose of a camera or the pose of the object is solved, and there are various solutions, for example: direct Linear Transformation (DLT), P3P, EPnP, UPnP, and nonlinear optimization methods. Ranac (Random sample consensus, random sampling coincidence algorithm) is to estimate parameters of a mathematical model from a group of observation data sets containing 'outer points' in an iterative manner, and is widely used in computer vision, and the algorithm can effectively improve the accuracy of EPnP object attitude estimation; the method comprises the following steps:
S3.1, using the EPnP algorithm, randomly select n reference points in the space between the workbench and the camera in the virtual environment. Acquire the 3D coordinates of these reference points in the world coordinate system, denoted p_i^w, i = 1, …, n, and at the same time acquire the 2D coordinates of these reference points on the projection plane captured by the camera, denoted u_i, i = 1, …, n. In this embodiment, 10 reference points are selected, a number that can be adjusted according to the specific implementation.
S3.2, from the n selected reference points, select 4 control points in the world coordinate system and in the camera projection plane respectively, using Principal Component Analysis (PCA), denoted c_j^w, j = 1, …, 4 and c_j^c, j = 1, …, 4, satisfying

p_i^w = Σ_{j=1}^{4} a_ij · c_j^w,  with  Σ_{j=1}^{4} a_ij = 1,

where the a_ij are homogeneous barycentric coordinates. This condition states that the 4 selected control points can represent, by weighting, any 3D reference point in the world coordinate system; in the projection plane, the reference points and control points satisfy the same weighting relation;
S3.3, from the coordinates of the 4 control points in the world coordinate system and in the camera coordinate system obtained in steps S3.1 and S3.2, use a 3D-3D algorithm to obtain the rotation matrix R and the translation vector t, together called the camera external parameter matrix;
S3.4, using the RANSAC algorithm, take the camera external parameter matrix obtained in step S3.3 as the initial hypothesis model and test it against the selected reference points in all the other data of the dataset: the estimated 2D screen coordinates obtained by transforming the 3D space coordinates of a reference point through the camera external parameter matrix are compared with the actual 2D screen coordinates of that reference point obtained in step S3.1, giving an estimated-versus-actual distance error denoted d_mn, where m is the index of the reference point within a single data item and n is the index of the data item in the dataset; a threshold d_0 is set according to the actual accuracy requirement, and if d_mn <= d_0 the reference point is regarded as an inlier, otherwise as an outlier. In this embodiment, d_0 = 1 mm is selected.
S3.5, in the first iteration, randomly select one data item in the dataset to start the iteration, and set the camera external parameter matrix obtained from it as the optimal external parameter matrix;
S3.6, repeat the RANSAC procedure for a number of iterations. Before iterating, a threshold k is set to decide whether the number of inliers obtained in one iteration meets the accuracy requirement; k must not be set too high, in order to prevent overfitting. In each RANSAC iteration, if the ratio of the number of inliers to the total number of reference points is greater than k and the number of inliers exceeds that of the previous optimal external parameter matrix, the camera external parameter matrix of this iteration is set as the optimal external parameter matrix. In this embodiment, according to the accuracy requirement, the threshold k is set to 80%, i.e. an inlier-to-outlier ratio of 4:1; only a camera external parameter matrix meeting this condition can qualify as the optimal external parameter matrix. Iterations continue until finished, yielding the optimal camera external parameter matrix for the dataset. The number of iterations can be set according to the actual situation: in general, more iterations give higher accuracy but at a higher time cost, so a reasonable value must be chosen; in this embodiment, 10000 iterations are used according to the accuracy requirement.
S3.7, from the optimal external parameter matrix, obtain the pose of the camera, and from it the pose of the object model in the camera coordinate system.
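To make steps S3.4-S3.6 concrete, the following is a sketch of the RANSAC selection loop with the embodiment values (inlier threshold d_0, inlier ratio k = 0.8, 10000 iterations); the pinhole projection model, the interpretation of the 1 mm threshold as an on-screen distance of roughly one pixel, and the estimate_extrinsics() helper standing in for the per-sample EPnP solution are assumptions for illustration.

```python
# A sketch of the RANSAC selection in steps S3.4-S3.6 (not the patent's exact code).
import numpy as np

def project(K, R, t, pts_3d):
    """Pinhole projection of (n, 3) world points to (n, 2) screen coordinates."""
    cam = pts_3d @ R.T + t
    uv = cam @ K.T
    return uv[:, :2] / uv[:, 2:3]

def ransac_extrinsics(dataset, K, estimate_extrinsics, d0=1.0, k=0.8, iters=10000):
    """dataset: list of (ref_pts_3d, ref_pts_2d) pairs, one per rendered image."""
    best, best_inliers = None, -1
    for _ in range(iters):
        sample = dataset[np.random.randint(len(dataset))]   # S3.5: pick one data item
        R, t = estimate_extrinsics(sample)                   # hypothesis extrinsic matrix
        inliers, total = 0, 0
        for pts_3d, pts_2d in dataset:                       # S3.4: test all reference points
            err = np.linalg.norm(project(K, R, t, pts_3d) - pts_2d, axis=1)
            inliers += int((err <= d0).sum())
            total += len(err)
        # S3.6: accept only if the inlier ratio exceeds k and improves on the best so far
        if inliers / total > k and inliers > best_inliers:
            best, best_inliers = (R, t), inliers
    return best
```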
S4: based on the object posture detector constructed in the step S3, constructing a neural network, and training the neural network by using the data set obtained in the step S2, wherein the method comprises the following steps of:
s4.1, adopting Python as a programming language, simultaneously considering flexibility and program size, and adopting an open-source PyTorch deep learning framework to construct a neural network;
S4.2, the invention is intended to be applicable to many scenes and to have strong generalization ability, whereas the dataset generated in step S2 is homogeneous. Therefore, to effectively prevent overfitting, the neural network is forced to focus on estimating the pose features of the object rather than the relation between pose and background, and the dataset generated in step S2 is augmented to obtain an augmented dataset, specifically as follows:
S4.2.1, using the ground-truth data provided by the dataset obtained in step S2, obtain the pose of the object model and then cut the object model out of the image;
S4.2.2, composite the cut-out object model onto other pictures, thereby replacing the background;
S4.2.3, apply image processing to the composited image, including changing saturation, changing brightness, and adding noise, to obtain the augmented dataset.
S4.3, train the neural network constructed in step S4.1 with the augmented dataset obtained in step S4.2, where 20% of the data is used for training and is referred to as the training dataset, and 80% of the data is used for evaluation and is referred to as the evaluation dataset;
S4.4, set a criterion to evaluate the final effect. Through the training and evaluation in step S4.3, the estimated poses of the object models in the evaluation dataset are obtained; the actual coordinate positions of the object models were already recorded in step S2, so the two sets of data correspond one to one. Using these two sets of data, namely the actual positions from step S2 and the estimated poses on the evaluation dataset, bounding boxes of the object model are constructed with a K-DOP algorithm, called the actual bounding box and the estimated bounding box; a bounding-box collision algorithm then gives the overlap between each estimated bounding box and the corresponding actual bounding box, which is used to judge whether the accuracy criterion is met. The criterion is set according to the accuracy requirement; in this embodiment it is set to 90% overlap of the bounding boxes.
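For step S4.1, a minimal PyTorch sketch of a pose-regression network is given below; the architecture (a small CNN regressing a translation plus a unit quaternion) is an illustrative assumption, since the patent does not specify the network structure.

```python
# A minimal PyTorch sketch of a pose-regression network for step S4.1.
import torch
import torch.nn as nn

class PoseNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(128, 7)                 # 3 translation + 4 quaternion values

    def forward(self, x):
        z = self.features(x).flatten(1)
        out = self.head(z)
        t, q = out[:, :3], out[:, 3:]
        return t, q / q.norm(dim=1, keepdim=True)     # normalize the quaternion

# Training sketch (step S4.3): 20% of the augmented data for training, 80% for evaluation.
# model = PoseNet(); opt = torch.optim.Adam(model.parameters(), lr=1e-3)
# loss = nn.functional.mse_loss(pred_t, gt_t) + nn.functional.mse_loss(pred_q, gt_q)
```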
S5: migrating the object posture detector trained in the step S4 to a real platform;
The trained object posture detector is applied to a mechanical arm in the laboratory. After building and training according to steps S1 to S4, the grabbing point must be computed from the pose of the object so that the mechanical arm or intelligent robot on the real platform can recognize and grab it; the steps for calculating the grabbing point are as follows:
s5.1, calculating a bounding box by adopting a K-DOP algorithm according to the pose of the object model.
S5.2, selecting a grabbing point on the bounding box according to the actual type of the mechanical claw of the mechanical arm.

Claims (6)

1. The method for the intelligent robot for identifying and grabbing the object gesture based on the deep learning is characterized by comprising the following steps of:
s1: building a virtual environment and a mechanical arm working platform model; the method comprises the following steps:
s1.1, acquiring the size and the shape of a mechanical arm working platform in a real environment, and constructing the mechanical arm and the mechanical arm working platform in a virtual environment in a one-to-one manner; simultaneously constructing a plurality of object models;
s1.2, splicing the object model obtained in the step S1.1 in a virtual environment, and simulating a real mechanical arm working platform and an actual environment;
s2: based on the virtual environment constructed in the step S1, carrying out randomization treatment on an object on the virtual model of the mechanical arm working platform, and obtaining a camera shooting image to obtain a data set; the randomization process includes:
randomizing the appearance and drop positions of a plurality of different object models;
randomizing the color and the material of the object model;
randomizing ambient illumination;
after randomization processing, RGB pictures of camera lens angles in the virtual environment are obtained as a data set, and specific positions of object models in the pictures in the data set are obtained for subsequent verification;
S3: constructing an object posture detector; the object posture detector is constructed using an EPnP algorithm and a RANSAC algorithm, and comprises the following steps:
S3.1, using the EPnP algorithm, randomly selecting n reference points in the space between the workbench and the camera in the virtual environment; acquiring the 3D coordinates of these reference points in the world coordinate system, denoted p_i^w, i = 1, …, n, and at the same time acquiring the 2D coordinates of these reference points on the projection plane captured by the camera, denoted u_i, i = 1, …, n;
S3.2, from the n selected reference points, selecting 4 control points in the world coordinate system and in the camera projection plane respectively, using Principal Component Analysis (PCA), denoted c_j^w, j = 1, …, 4 and c_j^c, j = 1, …, 4, satisfying

p_i^w = Σ_{j=1}^{4} a_ij · c_j^w,  with  Σ_{j=1}^{4} a_ij = 1,

where the a_ij are homogeneous barycentric coordinates; this condition states that the 4 selected control points can represent, by weighting, any 3D reference point in the world coordinate system; in the projection plane, the reference points and control points satisfy the same weighting relation;
S3.3, from the coordinates of the 4 control points in the world coordinate system and in the camera coordinate system obtained in steps S3.1 and S3.2, using a 3D-3D algorithm to obtain the rotation matrix R and the translation vector t, together called the camera external parameter matrix;
S3.4, using the RANSAC algorithm, taking the camera external parameter matrix obtained in step S3.3 as the initial hypothesis model and testing it against the selected reference points in all the other data of the dataset: the estimated 2D screen coordinates obtained by transforming the 3D space coordinates of a reference point through the camera external parameter matrix are compared with the actual 2D screen coordinates of that reference point obtained in step S3.1, giving an estimated-versus-actual distance error denoted d_mn, where m is the index of the reference point within a single data item and n is the index of the data item in the dataset; a threshold d_0 is set according to the actual accuracy requirement, and if d_mn <= d_0 the reference point is regarded as an inlier, otherwise as an outlier;
s3.5, in the first iteration, randomly selecting one data in the data set to start iteration, and setting the obtained camera external parameter matrix as an optimal external parameter matrix;
S3.6, repeating the RANSAC procedure for a number of iterations to obtain the optimal camera external parameter matrix for the dataset;
s3.7, obtaining an optimal external parameter matrix to obtain the pose of the camera, and further obtaining the pose of the object model under a camera coordinate system;
s4: constructing a neural network based on the object posture detector constructed in the step S3, and training the neural network by using the data set obtained in the step S2;
s5: and (3) migrating the object posture detector trained in the step S4 to a real platform.
2. The method for intelligent robot recognition and grasping of object pose based on deep learning according to claim 1, wherein in step S3.5, a threshold k is set before iterating, for determining whether the number of inliers obtained in one iteration meets the accuracy requirement; in each RANSAC iteration, if the ratio of the number of inliers to the total number of reference points is greater than the threshold k and the number of inliers is greater than that of the previous optimal external parameter matrix, the camera external parameter matrix of this iteration is set as the optimal external parameter matrix; RANSAC iterations continue until finished, yielding the optimal camera external parameter matrix for the dataset; the number of iterations is set according to the actual situation.
3. The method of intelligent robot for deep learning based object pose recognition and gripping according to claim 2, wherein step S4 comprises the steps of:
s4.1, adopting Python as a programming language, simultaneously considering flexibility and program size, and adopting an open-source PyTorch deep learning framework to construct a neural network;
s4.2, carrying out augmentation treatment on the data set generated in the step S2 to obtain an augmented data set;
S4.3, training the neural network constructed in step S4.1 with the augmented dataset obtained in step S4.2, wherein 20% of the data is used for training and is referred to as the training dataset, and 80% of the data is used for evaluation and is referred to as the evaluation dataset;
s4.4, setting a standard to evaluate the final effect, and obtaining the estimated pose of the object model in the evaluation data set through training and evaluation in the step S4.3; constructing bounding boxes of the object model by using the two groups of data of the actual specific coordinate position of the object model and the estimated pose of the object model in the estimated dataset, which are obtained in the step S2, through a K-DOP algorithm, wherein the bounding boxes are called an actual bounding box and an estimated bounding box; and obtaining the corresponding overlapping relation between the estimated bounding box and the actual bounding box by adopting a bounding box collision algorithm, and judging whether the accuracy standard is reached.
4. A method of intelligent robot for object gesture recognition and gripping based on deep learning according to claim 3, characterized in that step S4.2 specifically comprises the following steps:
s4.2.1, obtaining the pose of the object model by using the real data provided by the data set obtained in the step S2, and then cutting the object model;
s4.2.2, synthesizing the cut object model with other pictures to achieve the purpose of replacing the background picture;
s4.2.3 image processing is performed on the combined image, including changing saturation, changing brightness, and adding noise, resulting in an augmented data set.
5. The method for intelligent robot recognition and capture of object gesture based on deep learning according to claim 4, wherein in step S5, the trained object gesture detector is applied to a robot arm in a laboratory, and after the construction and training according to steps S1 to S4, the capture point is calculated according to the gesture of the object, so as to realize recognition and capture of the object by the robot arm or intelligent robot in a real platform.
6. The method for intelligent robot for recognition and gripping of object pose based on deep learning as claimed in claim 5, wherein the step of calculating the gripping point is as follows:
s5.1, calculating a bounding box by adopting a K-DOP algorithm according to the pose of the object model;
s5.2, selecting a grabbing point on the bounding box according to the actual type of the mechanical claw of the mechanical arm.
CN202110732696.7A 2021-06-29 2021-06-29 Object gesture recognition and grabbing intelligent robot method based on deep learning Active CN113370217B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110732696.7A CN113370217B (en) 2021-06-29 2021-06-29 Object gesture recognition and grabbing intelligent robot method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110732696.7A CN113370217B (en) 2021-06-29 2021-06-29 Object gesture recognition and grabbing intelligent robot method based on deep learning

Publications (2)

Publication Number Publication Date
CN113370217A CN113370217A (en) 2021-09-10
CN113370217B 2023-06-16

Family

ID=77579948

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110732696.7A Active CN113370217B (en) 2021-06-29 2021-06-29 Object gesture recognition and grabbing intelligent robot method based on deep learning

Country Status (1)

Country Link
CN (1) CN113370217B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114310954B (en) * 2021-12-31 2024-04-16 北京理工大学 Self-adaptive lifting control method and system for nursing robot
CN114474056B (en) * 2022-01-26 2023-07-21 北京航空航天大学 Monocular vision high-precision target positioning method for grabbing operation
CN115082795A (en) * 2022-07-04 2022-09-20 梅卡曼德(北京)机器人科技有限公司 Virtual image generation method, device, equipment, medium and product
CN115070780B (en) * 2022-08-24 2022-11-18 北自所(北京)科技发展股份有限公司 Industrial robot grabbing method and device based on digital twinning and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109903332A (en) * 2019-01-08 2019-06-18 杭州电子科技大学 A kind of object's pose estimation method based on deep learning
US10427306B1 (en) * 2017-07-06 2019-10-01 X Development Llc Multimodal object identification
CN110782492A (en) * 2019-10-08 2020-02-11 三星(中国)半导体有限公司 Pose tracking method and device
CN112150551A (en) * 2020-09-25 2020-12-29 北京百度网讯科技有限公司 Object pose acquisition method and device and electronic equipment

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10427306B1 (en) * 2017-07-06 2019-10-01 X Development Llc Multimodal object identification
CN109903332A (en) * 2019-01-08 2019-06-18 杭州电子科技大学 A kind of object's pose estimation method based on deep learning
CN110782492A (en) * 2019-10-08 2020-02-11 三星(中国)半导体有限公司 Pose tracking method and device
CN112150551A (en) * 2020-09-25 2020-12-29 北京百度网讯科技有限公司 Object pose acquisition method and device and electronic equipment

Also Published As

Publication number Publication date
CN113370217A (en) 2021-09-10

Similar Documents

Publication Publication Date Title
CN113370217B (en) Object gesture recognition and grabbing intelligent robot method based on deep learning
CN109344882B (en) Convolutional neural network-based robot control target pose identification method
Johns et al. Deep learning a grasp function for grasping under gripper pose uncertainty
Sadeghi et al. Sim2real viewpoint invariant visual servoing by recurrent control
CN109483573A (en) Machine learning device, robot system and machine learning method
CN109816725A (en) A kind of monocular camera object pose estimation method and device based on deep learning
CN109986560B (en) Mechanical arm self-adaptive grabbing method for multiple target types
CN111695562B (en) Autonomous robot grabbing method based on convolutional neural network
US20210205988A1 (en) Task embedding for device control
JP2021163503A (en) Three-dimensional pose estimation by two-dimensional camera
CN108961144A (en) Image processing system
CN112149694B (en) Image processing method, system, storage medium and terminal based on convolutional neural network pooling module
Fu et al. Active learning-based grasp for accurate industrial manipulation
Thalhammer et al. Pyrapose: Feature pyramids for fast and accurate object pose estimation under domain shift
Inoue et al. Transfer learning from synthetic to real images using variational autoencoders for robotic applications
Chen et al. Towards generalization and data efficient learning of deep robotic grasping
JP2021163502A (en) Three-dimensional pose estimation by multiple two-dimensional cameras
JP2021176078A (en) Deep layer learning and feature detection through vector field estimation
Arents et al. Construction of a smart vision-guided robot system for manipulation in a dynamic environment
CN111496794B (en) Kinematics self-grabbing learning method and system based on simulation industrial robot
CN111275758B (en) Hybrid 3D visual positioning method, device, computer equipment and storage medium
Kiyokawa et al. Efficient collection and automatic annotation of real-world object images by taking advantage of post-diminished multiple visual markers
Luo et al. Robot artist performs cartoon style facial portrait painting
CN113269831B (en) Visual repositioning method, system and device based on scene coordinate regression network
CN113436293B (en) Intelligent captured image generation method based on condition generation type countermeasure network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant