CN113370217A - Method for recognizing and grabbing object posture based on deep learning for intelligent robot - Google Patents

Method for recognizing and grabbing object posture based on deep learning for intelligent robot

Info

Publication number
CN113370217A
Authority
CN
China
Prior art keywords
data set
camera
pose
obtaining
deep learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110732696.7A
Other languages
Chinese (zh)
Other versions
CN113370217B (en)
Inventor
杜广龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202110732696.7A priority Critical patent/CN113370217B/en
Publication of CN113370217A publication Critical patent/CN113370217A/en
Application granted granted Critical
Publication of CN113370217B publication Critical patent/CN113370217B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J 9/00 Programme-controlled manipulators
    • B25J 9/16 Programme controls
    • B25J 9/1628 Programme controls characterised by the control loop
    • B25J 9/163 Programme controls characterised by the control loop learning, adaptive, model based, rule based expert control
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J 9/00 Programme-controlled manipulators
    • B25J 9/16 Programme controls
    • B25J 9/1694 Programme controls characterised by use of sensors other than normal servo-feedback from position, speed or acceleration sensors, perception control, multi-sensor controlled systems, sensor fusion
    • B25J 9/1697 Vision controlled systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method, based on deep learning, for an intelligent robot to recognize object postures and grab objects. The method comprises the following steps: building a virtual environment and constructing a mechanical arm working platform model; randomizing objects on the virtual model of the working platform based on the built virtual environment, obtaining camera images, and acquiring a data set; constructing an object posture detector; constructing a neural network based on the object posture detector and training it with the acquired data set; and migrating the trained detector to a real platform. Random objects are generated by simulation in the virtual environment to obtain a large amount of data as the training set; a dedicated training procedure yields an object posture detector with strong generalization ability, which is then transferred to the real platform to recognize the postures of basic objects and grab them.

Description

Method for recognizing and grabbing object posture based on deep learning for intelligent robot
Technical Field
The invention relates to the field of intelligent robot grabbing, in particular to a method for realizing object posture recognition and grabbing by an intelligent robot based on deep learning.
Background
In the era of Industry 4.0, robots of all kinds are entering factories to take part in production, completing dangerous or repetitive tasks in place of humans. An intelligent robot does not tire and operates only according to its trained neural network or rules, so capable intelligent robots are favored by industry and are widely applied in production.
However, with the popularization of industrial robots, concerns about the training time and grabbing efficiency of intelligent robots have also been raised. Although researchers design the neural networks so that robots can be trained as quickly and easily as possible, how to shorten neural network training time and improve the robot's grabbing efficiency remains an open concern.
At present, the mainstream way of training intelligent robots is still based on real scenes: the training scene is randomized in the real world and captured for neural network training. This approach has a major drawback: randomizing training scenes in real life consumes a great deal of time, and compared with what a computer can generate, the time needed to produce each unit of training data is long. For intelligent robot training, the time spent in the training process itself is relatively small, but the time spent generating the training data set is huge; for practical use of intelligent robots, spending a far larger share of time producing the training data set than actually training the robot is not acceptable.
There are existing schemes that train an intelligent robot's neural network using simulation, for example work from the IEEE International Conference on Robotics and Automation, 2018. The attractive alternative proposed there is to use an off-the-shelf simulator to render synthetic virtual data sets and automatically generate ground-truth annotations for them. That work recognizes that models trained solely on simulated data often cannot generalize to the real world, and it studies how to extend randomized simulation environments and domain adaptation methods to train a grasping system to grasp new objects from raw monocular RGB images. It shows that by using synthetic virtual data and domain adaptation, relying mainly on randomly generated virtual data sets, the number of real-world samples required to reach a given level of performance can be reduced substantially. However, that technique cannot be trained entirely without real-environment data, a real-world data set is still required, and the overfitting problem of the robot's neural network is not addressed.
Disclosure of Invention
Therefore, aiming at the defects of the prior art, the invention discloses a method for recognizing and grabbing object postures based on deep learning. In a virtual environment, a randomization algorithm varies the training scene across more scene and target-object factors, so as to generate as many possible training samples as needed to cover, as far as possible, the different working scenes encountered in industrial production. Because the virtual environment is built with a computer, the method has advantages over traditional data-set generation in both speed and data volume. Training an intelligent robot in this way is faster than traditional training, and when the model is migrated to a real robot, the broader coverage of the data set and the treatment of the overfitting problem give the model better generalization ability, providing a stronger practical effect in a shorter time. The invention can train the intelligent robot's neural network entirely on the virtual data set without relying on a real-world data set, which improves training efficiency and forces the neural network to focus on the pose features of the object rather than on the relationship between pose and background in the training data, thereby weakening the overfitting problem.
The purpose of the invention is realized by at least one of the following technical solutions.
The method for recognizing and grabbing the object posture based on deep learning comprises the following steps:
S1: building a virtual environment and constructing a mechanical arm working platform model;
S2: randomizing an object on the virtual model of the mechanical arm working platform based on the virtual environment established in step S1, obtaining camera images, and acquiring a data set;
S3: constructing an object posture detector;
S4: constructing a neural network based on the object posture detector constructed in step S3, and training the neural network with the data set obtained in step S2;
S5: migrating the object posture detector trained in step S4 to a real platform.
Further, step S1 includes the steps of:
S1.1, acquiring the size and shape of the mechanical arm working platform in the real environment, and building one-to-one models of the mechanical arm and the working platform in the virtual environment, while also constructing a plurality of object models;
S1.2, assembling the models obtained in step S1.1 in the virtual environment to simulate the real mechanical arm working platform and its actual basic environment.
Further, in step S2, the randomization process includes:
randomizing the appearance and drop position of a plurality of different object models;
randomizing the color and material of the object model;
randomizing ambient lighting.
Further, in step S2, after the randomization process, RGB images from the camera's viewpoint in the virtual environment are obtained as the data set, and the exact position of each object model in these images is recorded for subsequent verification.
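The patent does not name a particular simulator. Purely as an illustrative sketch of the step-S2 randomization and data-capture loop, and not as the invention's implementation, the same idea can be expressed with the PyBullet simulator; the URDF models, camera placement, and sample count below are assumptions:

```python
import random
import numpy as np
import pybullet as p
import pybullet_data

p.connect(p.DIRECT)                                    # headless virtual environment
p.setAdditionalSearchPath(pybullet_data.getDataPath())
p.setGravity(0, 0, -9.8)
p.loadURDF("plane.urdf")                               # stand-in for the working platform
p.loadURDF("kuka_iiwa/model.urdf", basePosition=[-0.6, 0, 0])  # stand-in for the mechanical arm

view = p.computeViewMatrix([0.5, 0.5, 0.8], [0, 0, 0.1], [0, 0, 1])
proj = p.computeProjectionMatrixFOV(fov=60, aspect=4.0 / 3.0, nearVal=0.01, farVal=2.0)

dataset = []
for _ in range(1000):                                  # one randomized sample per loop
    # randomize the object's drop position (appearance would be varied by swapping URDFs)
    obj = p.loadURDF("cube_small.urdf",
                     basePosition=[random.uniform(-0.2, 0.2),
                                   random.uniform(-0.2, 0.2), 0.3])
    # randomize color; material/texture randomization would be done similarly
    p.changeVisualShape(obj, -1, rgbaColor=[random.random(), random.random(),
                                            random.random(), 1])
    for _ in range(240):                               # let the object fall and settle
        p.stepSimulation()
    # randomize the lighting direction used for rendering
    light = [random.uniform(-1, 1), random.uniform(-1, 1), random.uniform(0.5, 1)]
    _, _, rgb, _, _ = p.getCameraImage(640, 480, view, proj, lightDirection=light)
    pos, orn = p.getBasePositionAndOrientation(obj)    # ground-truth pose for verification
    dataset.append((np.reshape(rgb, (480, 640, 4))[:, :, :3], pos, orn))
    p.removeBody(obj)
```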
Further, in step S3, the object posture detector is constructed using the EPnP algorithm and the RANSAC algorithm. PnP (Perspective-n-Point) is the class of problems in which the camera pose or the object pose is computed from n known pairs of corresponding 3D space points and 2D image points; it has many solutions, for example Direct Linear Transformation (DLT), P3P, EPnP, UPnP and nonlinear optimization methods. The RANSAC (Random Sample Consensus) algorithm estimates the parameters of a mathematical model iteratively from a set of observations containing outliers; it is widely used in computer vision and can effectively improve the accuracy of EPnP object pose estimation. The construction comprises the following steps:
S3.1, adopting the EPnP algorithm, first randomly select n reference points in the space between the workbench and the camera in the virtual environment; obtain the 3D coordinates of the reference points in the world coordinate system, recorded as p_i^w, i = 1, …, n, and obtain the 2D coordinates of the reference points on the projection plane captured by the camera, recorded as u_i, i = 1, …, n;
S3.2, respectively selecting 4 control points in a world coordinate system and a camera projection plane by adopting a Principal Component Analysis (PCA) method through the selected n reference points, and respectively recording the control points as:
Figure BDA0003139629090000033
j-1, …,4 and
Figure BDA0003139629090000034
j is 1, …, 4. Satisfies the following conditions:
Figure BDA0003139629090000035
wherein, aijIs a homogeneous barycentric coordinate; the condition indicates that the selected 4 control points can represent the 3D reference point in any world coordinate system by weighting; in the projection plane, the reference point and the control point have the same weighting relation;
S3.3, from steps S3.1 and S3.2, obtain the coordinates of the 4 control points in the world coordinate system and in the camera coordinate system, and use a 3D-3D alignment algorithm to obtain a rotation matrix R and a translation vector t, together called the camera extrinsic matrix;
S3.4, using the RANSAC algorithm, take the camera extrinsic matrix obtained in step S3.3 as the initial hypothesis model and test the reference points selected in all the other data samples: the estimated 2D screen coordinates, obtained by transforming the 3D space coordinates of each reference point with the camera extrinsic matrix, are compared with the actual 2D screen coordinates of that reference point obtained in step S3.1, giving an estimated-to-actual coordinate distance recorded as D_mn, where m is the index of the reference point within a single data sample and n is the index of that sample in the data set; a threshold d_0 is set according to the actual precision requirement, and if D_mn <= d_0 the reference point is judged to be an inlier, otherwise an outlier;
S3.5, in the first iteration, randomly select a part of the data in the data set to start the iteration, and set the resulting camera extrinsic matrix as the optimal extrinsic matrix;
S3.6, repeat the RANSAC procedure for multiple iterations. Before iterating, a threshold k is set to decide whether the number of inliers obtained in one iteration meets the precision requirement; at the same time, k must not be set too high, in order to prevent overfitting. In each RANSAC iteration, if the proportion of inliers among all reference points is greater than the threshold k and the number of inliers exceeds that of the previous optimal extrinsic matrix, the camera extrinsic matrix of this iteration is set as the optimal extrinsic matrix. RANSAC iterations continue until finished, yielding the optimal camera extrinsic matrix for the data set. The number of iterations can be set according to the actual situation: in general, more iterations give higher accuracy but higher time cost, so a reasonable value needs to be chosen for the application;
S3.7, from the optimal extrinsic matrix, obtain the pose of the camera, and further obtain the pose of the object model in the camera coordinate system.
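Purely as an illustrative sketch of steps S3.1-S3.2 (not the patent's own code), the four control points and the homogeneous barycentric coordinates a_ij can be computed from the n world-frame reference points with NumPy; the variable names and the use of the centroid plus the three PCA axes as control points are assumptions in line with the usual EPnP construction:

```python
import numpy as np

def select_control_points(ref_world: np.ndarray) -> np.ndarray:
    """EPnP-style control points: the centroid plus the three principal axes (PCA)."""
    centroid = ref_world.mean(axis=0)
    centered = ref_world - centroid
    # eigen-decomposition of the covariance matrix = PCA of the reference points
    eigval, eigvec = np.linalg.eigh(centered.T @ centered / len(ref_world))
    ctrl = [centroid]
    for k in range(3):
        ctrl.append(centroid + np.sqrt(max(eigval[k], 0.0)) * eigvec[:, k])
    return np.array(ctrl)                        # shape (4, 3)

def barycentric_coords(ref_world: np.ndarray, ctrl: np.ndarray) -> np.ndarray:
    """Solve p_i = sum_j a_ij * c_j with sum_j a_ij = 1 for every reference point."""
    A = np.vstack([ctrl.T, np.ones((1, 4))])     # 4x4 system: 3 coordinate rows + sum-to-one row
    b = np.vstack([ref_world.T, np.ones((1, len(ref_world)))])
    return np.linalg.solve(A, b).T               # shape (n, 4); row i holds a_i1..a_i4

# usage sketch: 10 random reference points, as in the embodiment
pts = np.random.rand(10, 3)
ctrl = select_control_points(pts)
a = barycentric_coords(pts, ctrl)
assert np.allclose(a @ ctrl, pts)                # the weighting relation holds
```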
Further, step S4 includes the steps of:
S4.1, considering flexibility and program size, adopt Python as the programming language and the open-source PyTorch deep learning framework to construct the neural network;
S4.2, the method aims to be applicable to many scenes and to have strong generalization ability, but the data set generated in step S2 is relatively uniform; therefore, to effectively prevent overfitting and to force the neural network to focus on the features of the object pose being estimated rather than on the relationship between pose and background, the data set generated in step S2 is augmented to obtain an augmented data set;
S4.3, train the neural network constructed in step S4.1 with the augmented data set obtained in step S4.2, where 20% of the data is used for training and is called the training data set, and 80% of the data is used for evaluation and is called the evaluation data set;
S4.4, set a standard to evaluate the final effect. Through the training and evaluation in step S4.3, the estimated pose of each object model in the evaluation data set is obtained; the actual coordinate positions of the object models were already recorded in step S2, so the two sets of data correspond one to one. Using the actual coordinate position of the object model obtained in step S2 and the estimated pose of the object model from the evaluation data set, bounding boxes of the object model are constructed with the K-DOP algorithm, called the actual bounding box and the estimated bounding box; a bounding-box collision algorithm is then used to obtain the overlap between each estimated bounding box and the corresponding actual bounding box and to judge whether the accuracy standard is reached.
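A minimal sketch of this evaluation, assuming the simplest K-DOP (k = 6, i.e. an axis-aligned bounding box) and measuring overlap as intersection volume relative to the actual box; the 90% pass threshold is the value the embodiment gives later, and the point clouds here are placeholders:

```python
import numpy as np

def aabb(points: np.ndarray):
    """6-DOP (axis-aligned) bounding box of a point cloud: (min corner, max corner)."""
    return points.min(axis=0), points.max(axis=0)

def overlap_ratio(actual_box, estimated_box) -> float:
    """Intersection volume divided by the volume of the actual bounding box."""
    lo = np.maximum(actual_box[0], estimated_box[0])
    hi = np.minimum(actual_box[1], estimated_box[1])
    inter = np.prod(np.clip(hi - lo, 0.0, None))
    vol = np.prod(actual_box[1] - actual_box[0])
    return float(inter / vol) if vol > 0 else 0.0

# usage sketch: the same model points placed with the ground-truth and the estimated pose
model = np.random.rand(500, 3)                       # placeholder object-model point cloud
actual_box = aabb(model)                             # from the position recorded in step S2
estimated_box = aabb(model + [0.01, 0.0, 0.0])       # from the pose estimated in step S4.3
reached = overlap_ratio(actual_box, estimated_box) >= 0.90   # embodiment's 90% criterion
```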
Further, step S4.2 specifically includes the following steps:
S4.2.1, use the ground-truth data provided by the data set obtained in step S2 to obtain the pose of the object model, and then crop the object model out of the image;
S4.2.2, composite the cropped object model onto other pictures so as to replace the background;
S4.2.3, apply image processing to the composited images, including saturation changes, brightness changes and noise addition, resulting in the augmented data set.
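A minimal sketch of steps S4.2.1-S4.2.3, assuming the object has already been cropped with an alpha mask and that PIL and NumPy are used; the file names and jitter ranges are illustrative only:

```python
import random
import numpy as np
from PIL import Image, ImageEnhance

def augment(object_crop: Image.Image, background: Image.Image) -> Image.Image:
    """Paste the cropped object onto a new background (S4.2.2), then jitter
    saturation and brightness and add noise (S4.2.3)."""
    canvas = background.convert("RGB").copy()
    # assumes the background is larger than the cropped object
    x = random.randint(0, canvas.width - object_crop.width)
    y = random.randint(0, canvas.height - object_crop.height)
    canvas.paste(object_crop, (x, y), mask=object_crop)       # alpha channel as the mask
    canvas = ImageEnhance.Color(canvas).enhance(random.uniform(0.6, 1.4))       # saturation
    canvas = ImageEnhance.Brightness(canvas).enhance(random.uniform(0.7, 1.3))  # brightness
    arr = np.asarray(canvas).astype(np.float32)
    arr += np.random.normal(0.0, 8.0, arr.shape)              # additive Gaussian noise
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))

# usage sketch (file names are hypothetical)
obj = Image.open("cropped_object.png").convert("RGBA")
bg = Image.open("random_background.jpg")
augmented_sample = augment(obj, bg)
```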
Further, in step S5, the trained object posture detector is applied to a mechanical arm in the laboratory; after construction and training according to steps S1 to S4, a grabbing point is calculated from the pose of the object, so that the mechanical arm or intelligent robot can recognize and grab the object on the real platform.
Further, the step of calculating the grab point is as follows:
S5.1, calculate the bounding box with the K-DOP algorithm according to the pose of the object model;
S5.2, select a grabbing point on the bounding box according to the actual type of the mechanical claw of the mechanical arm.
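As a sketch of the grabbing-point selection in steps S5.1-S5.2, assuming a two-finger parallel gripper and an axis-aligned bounding box derived from the estimated pose (both assumptions, since the patent only says the point depends on the type of mechanical claw):

```python
import numpy as np

def grasp_points_from_box(box_min: np.ndarray, box_max: np.ndarray):
    """Two opposing contact points across the narrowest side of the box, which suits
    a two-finger parallel gripper, plus an approach point above the box."""
    center = (box_min + box_max) / 2.0
    extents = box_max - box_min
    axis = int(np.argmin(extents))          # close the gripper along the shortest extent
    left, right = center.copy(), center.copy()
    left[axis] = box_min[axis]
    right[axis] = box_max[axis]
    approach = center.copy()
    approach[2] = box_max[2]                # approach from directly above
    return left, right, approach

# usage sketch with an arbitrary estimated bounding box
lo, hi = np.array([0.00, 0.00, 0.00]), np.array([0.04, 0.10, 0.06])
contact_a, contact_b, approach = grasp_points_from_box(lo, hi)
```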
Compared with the prior art, the invention has the advantages that:
the invention can train the neural network of the intelligent robot completely based on the virtual data set without relying on the real world data set, thereby improving the training efficiency of the neural network, and forcing the neural network to pay attention to the pose characteristics of the object instead of the relationship between the pose and the background of the training data set, thereby weakening the over-fitting problem. The method is applicable to a plurality of scenes and has strong generalization.
Drawings
FIG. 1 is a flow chart of object pose recognition and capture based on virtual environment training according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Example:
the method for recognizing and grabbing the object posture based on deep learning, as shown in fig. 1, includes the following steps:
S1: building a virtual environment and constructing a mechanical arm working platform model, which comprises the following steps:
S1.1, acquiring the size and shape of the mechanical arm working platform in the real environment, and building one-to-one models of the mechanical arm and the working platform in the virtual environment, while also constructing a plurality of object models;
S1.2, assembling the models obtained in step S1.1 in the virtual environment to simulate the real mechanical arm working platform and its actual basic environment.
S2: randomizing an object on the virtual model of the mechanical arm working platform based on the virtual environment established in step S1, obtaining camera images, and acquiring a data set;
the randomization process includes:
randomizing the appearance and drop position of a plurality of different object models;
randomizing the color and material of the object model;
randomizing ambient lighting.
After randomization, RGB (red, green, blue) images from the camera's viewpoint in the virtual environment are acquired as the data set, and the exact position of each object model in these images is recorded for subsequent verification.
S3: constructing an object posture detector;
The object posture detector is constructed using the EPnP algorithm and the RANSAC algorithm. PnP (Perspective-n-Point) is the class of problems in which the camera pose or the object pose is computed from n known pairs of corresponding 3D space points and 2D image points; it has many solutions, for example Direct Linear Transformation (DLT), P3P, EPnP, UPnP and nonlinear optimization methods. The RANSAC (Random Sample Consensus) algorithm estimates the parameters of a mathematical model iteratively from a set of observations containing outliers; it is widely used in computer vision and can effectively improve the accuracy of EPnP object pose estimation. The construction comprises the following steps:
S3.1, adopting the EPnP algorithm, randomly select n reference points in the space between the workbench and the camera in the virtual environment; obtain the 3D coordinates of the reference points in the world coordinate system, recorded as p_i^w, i = 1, …, n, and obtain the 2D coordinates of the reference points on the projection plane captured by the camera, recorded as u_i, i = 1, …, n; in this embodiment, 10 reference points are selected, and the number may be adjusted according to the specific implementation.
S3.2, from the selected n reference points, use Principal Component Analysis (PCA) to select 4 control points in the world coordinate system and on the camera projection plane respectively, recorded as c_j^w, j = 1, …, 4 and c_j^c, j = 1, …, 4, satisfying:
p_i^w = Σ_{j=1}^{4} a_ij · c_j^w, with Σ_{j=1}^{4} a_ij = 1,
where the a_ij are homogeneous barycentric coordinates; this condition means that the 4 selected control points can represent any 3D reference point in the world coordinate system by weighting, and that on the projection plane the reference points and control points satisfy the same weighting relation;
S3.3, from steps S3.1 and S3.2, obtain the coordinates of the 4 control points in the world coordinate system and in the camera coordinate system, and use a 3D-3D alignment algorithm to obtain a rotation matrix R and a translation vector t, together called the camera extrinsic matrix;
S3.4, using the RANSAC algorithm, take the camera extrinsic matrix obtained in step S3.3 as the initial hypothesis model and test the reference points selected in all the other data samples: the estimated 2D screen coordinates, obtained by transforming the 3D space coordinates of each reference point with the camera extrinsic matrix, are compared with the actual 2D screen coordinates of that reference point obtained in step S3.1, giving an estimated-to-actual coordinate distance recorded as D_mn, where m is the index of the reference point within a single data sample and n is the index of that sample in the data set; a threshold d_0 is set according to the actual precision requirement, and if D_mn <= d_0 the reference point is judged to be an inlier, otherwise an outlier; in this example, d_0 = 1 mm is selected.
S3.5, in the first iteration, randomly select a part of the data in the data set to start the iteration, and set the resulting camera extrinsic matrix as the optimal extrinsic matrix;
S3.6, repeat the RANSAC procedure for multiple iterations. Before iterating, a threshold k is set to decide whether the number of inliers obtained in one iteration meets the precision requirement; at the same time, k must not be set too high, in order to prevent overfitting. In each RANSAC iteration, if the proportion of inliers among all reference points is greater than the threshold k and the number of inliers exceeds that of the previous optimal extrinsic matrix, the camera extrinsic matrix of this iteration is set as the optimal extrinsic matrix. In this embodiment, according to the precision requirement, the threshold k is set to 80%, i.e. a 4:1 ratio of inliers to outliers; a camera extrinsic matrix meeting this condition qualifies to be selected as the optimal extrinsic matrix. RANSAC iterations continue until finished, yielding the optimal camera extrinsic matrix for the data set. The number of iterations can be set according to the actual situation: in general, more iterations give higher accuracy but higher time cost, so a reasonable value needs to be chosen; in this embodiment, the number of iterations is 10000 according to the precision requirement.
S3.7, from the optimal extrinsic matrix, obtain the pose of the camera, and further obtain the pose of the object model in the camera coordinate system.
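The EPnP-plus-RANSAC pipeline of step S3 can also be sketched with OpenCV's built-in solver. This illustrates the same idea rather than the patent's own implementation; the camera intrinsics and the pixel reprojection threshold standing in for d_0 are assumptions, while the 10000 iterations follow the embodiment:

```python
import numpy as np
import cv2

# n reference points: 3D world coordinates and their 2D projections (step S3.1)
object_points = np.random.rand(10, 3).astype(np.float32)        # placeholder for p_i^w
image_points = np.random.rand(10, 2).astype(np.float32) * 480   # placeholder for u_i
camera_matrix = np.array([[600.0, 0.0, 320.0],
                          [0.0, 600.0, 240.0],
                          [0.0, 0.0, 1.0]], dtype=np.float32)   # assumed intrinsics
dist_coeffs = np.zeros(5, dtype=np.float32)

ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    object_points, image_points, camera_matrix, dist_coeffs,
    flags=cv2.SOLVEPNP_EPNP,        # EPnP as the minimal solver inside RANSAC
    reprojectionError=2.0,          # pixel analogue of the inlier threshold d_0
    iterationsCount=10000)          # the embodiment uses 10000 iterations

if ok:
    R, _ = cv2.Rodrigues(rvec)      # rotation matrix R; tvec is the translation vector t
    # the pose of the object in the camera frame follows from (R, tvec), as in step S3.7
```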
S4: constructing a neural network based on the object posture detector constructed in the step S3, and training the neural network by using the data set obtained in the step S2, wherein the method comprises the following steps:
S4.1, considering flexibility and program size, adopt Python as the programming language and the open-source PyTorch deep learning framework to construct the neural network;
S4.2, the method aims to be applicable to many scenes and to have strong generalization ability, but the data set generated in step S2 is relatively uniform; therefore, to effectively prevent overfitting and to force the neural network to focus on the features of the object pose being estimated rather than on the relationship between pose and background, the data set generated in step S2 is augmented to obtain an augmented data set, which specifically comprises the following steps:
S4.2.1, use the ground-truth data provided by the data set obtained in step S2 to obtain the pose of the object model, and then crop the object model out of the image;
S4.2.2, composite the cropped object model onto other pictures so as to replace the background;
S4.2.3, apply image processing to the composited images, including saturation changes, brightness changes and noise addition, resulting in the augmented data set.
S4.3, train the neural network constructed in step S4.1 with the augmented data set obtained in step S4.2, where 20% of the data is used for training and is called the training data set, and 80% of the data is used for evaluation and is called the evaluation data set;
S4.4, set a standard to evaluate the final effect. Through the training and evaluation in step S4.3, the estimated pose of each object model in the evaluation data set is obtained; the actual coordinate positions of the object models were already recorded in step S2, so the two sets of data correspond one to one. Using the actual coordinate position of the object model obtained in step S2 and the estimated pose of the object model from the evaluation data set, bounding boxes of the object model are constructed with the K-DOP algorithm, called the actual bounding box and the estimated bounding box; a bounding-box collision algorithm is then used to obtain the overlap between each estimated bounding box and the corresponding actual bounding box and to judge whether the accuracy standard is reached; the accuracy standard is set according to the precision requirement, and in this embodiment it is set to 90% bounding-box overlap.
S5: migrating the object posture detector trained in the step S4 to a real platform;
The trained object posture detector is applied to a mechanical arm in the laboratory; after construction and training according to steps S1-S4, a grabbing point is calculated from the pose of the object, so that the mechanical arm or intelligent robot can recognize and grab the object on the real platform. The grabbing point is calculated as follows:
S5.1, calculate the bounding box with the K-DOP algorithm according to the pose of the object model;
S5.2, select a grabbing point on the bounding box according to the actual type of the mechanical claw of the mechanical arm.

Claims (10)

1. The method for recognizing and grabbing the object posture based on deep learning is characterized by comprising the following steps of:
S1: building a virtual environment and constructing a mechanical arm working platform model;
S2: randomizing an object on the virtual model of the mechanical arm working platform based on the virtual environment established in step S1, obtaining camera images, and acquiring a data set;
S3: constructing an object posture detector;
S4: constructing a neural network based on the object posture detector constructed in step S3, and training the neural network with the data set obtained in step S2;
S5: migrating the object posture detector trained in step S4 to a real platform.
2. The method for recognizing and grabbing the object posture based on deep learning as claimed in claim 1, wherein step S1 comprises the following steps:
S1.1, acquiring the size and shape of the mechanical arm working platform in the real environment, and building one-to-one models of the mechanical arm and the working platform in the virtual environment, while also constructing a plurality of object models;
S1.2, assembling the models obtained in step S1.1 in the virtual environment to simulate the real mechanical arm working platform and its actual basic environment.
3. The method for recognizing and grabbing the object posture based on deep learning as claimed in claim 2, wherein in step S2, the randomization process comprises:
randomizing the appearance and drop position of a plurality of different object models;
randomizing the color and material of the object model;
randomizing ambient lighting.
4. The method for object pose recognition and capture based on deep learning of claim 3, wherein in step S2, after the randomization process, RGB images of camera lens angles in the virtual environment are obtained as the data set, and the specific position of the object model in the images in the data set is obtained for subsequent verification.
5. The method for recognizing and grabbing the object posture based on deep learning as claimed in claim 4, wherein in step S3, the object posture detector is constructed using the EPnP algorithm and the RANSAC algorithm, comprising the following steps:
S3.1, adopting the EPnP algorithm, randomly selecting n reference points in the space between the workbench and the camera in the virtual environment; obtaining the 3D coordinates of the reference points in the world coordinate system, recorded as p_i^w, i = 1, …, n, and obtaining the 2D coordinates of the reference points on the projection plane captured by the camera, recorded as u_i, i = 1, …, n;
S3.2, from the selected n reference points, using Principal Component Analysis (PCA) to select 4 control points in the world coordinate system and on the camera projection plane respectively, recorded as c_j^w, j = 1, …, 4 and c_j^c, j = 1, …, 4, satisfying:
p_i^w = Σ_{j=1}^{4} a_ij · c_j^w, with Σ_{j=1}^{4} a_ij = 1,
where the a_ij are homogeneous barycentric coordinates; this condition means that the 4 selected control points can represent any 3D reference point in the world coordinate system by weighting, and that on the projection plane the reference points and control points satisfy the same weighting relation;
S3.3, from steps S3.1 and S3.2, obtaining the coordinates of the 4 control points in the world coordinate system and in the camera coordinate system, and using a 3D-3D alignment algorithm to obtain a rotation matrix R and a translation vector t, together called the camera extrinsic matrix;
S3.4, using the RANSAC algorithm, taking the camera extrinsic matrix obtained in step S3.3 as the initial hypothesis model and testing the reference points selected in all the other data samples: the estimated 2D screen coordinates, obtained by transforming the 3D space coordinates of each reference point with the camera extrinsic matrix, are compared with the actual 2D screen coordinates of that reference point obtained in step S3.1, giving an estimated-to-actual coordinate distance recorded as D_mn, where m is the index of the reference point within a single data sample and n is the index of that sample in the data set; a threshold d_0 is set according to the actual precision requirement, and if D_mn <= d_0 the reference point is judged to be an inlier, otherwise an outlier;
S3.5, in the first iteration, randomly selecting a part of the data in the data set to start the iteration, and setting the resulting camera extrinsic matrix as the optimal extrinsic matrix;
S3.6, repeating the RANSAC procedure for multiple iterations to obtain the optimal camera extrinsic matrix for the data set;
S3.7, from the optimal extrinsic matrix, obtaining the pose of the camera, and further obtaining the pose of the object model in the camera coordinate system.
6. The method for recognizing and grabbing the object posture based on deep learning as claimed in claim 5, wherein in step S3.6, before performing the iterations, a threshold k is set to decide whether the number of inliers obtained in one iteration meets the precision requirement; in each RANSAC iteration, if the proportion of inliers among all reference points is greater than the threshold k and the number of inliers exceeds that of the previous optimal extrinsic matrix, the camera extrinsic matrix of this iteration is set as the optimal extrinsic matrix; RANSAC iterations continue until finished, yielding the optimal camera extrinsic matrix for the training data set; the number of iterations is set according to the actual situation.
7. The method for recognizing and grabbing the object posture based on deep learning as claimed in claim 6, wherein step S4 comprises the following steps:
S4.1, considering flexibility and program size, adopting Python as the programming language and the open-source PyTorch deep learning framework to construct the neural network;
S4.2, augmenting the data set generated in step S2 to obtain an augmented data set;
S4.3, training the neural network constructed in step S4.1 with the augmented data set obtained in step S4.2, where 20% of the data is used for training and is called the training data set, and 80% of the data is used for evaluation and is called the evaluation data set;
S4.4, setting a standard to evaluate the final effect, and obtaining the estimated pose of each object model in the evaluation data set through the training and evaluation in step S4.3; using the actual coordinate position of the object model obtained in step S2 and the estimated pose of the object model from the evaluation data set, constructing bounding boxes of the object model with the K-DOP algorithm, called the actual bounding box and the estimated bounding box; and using a bounding-box collision algorithm to obtain the overlap between each estimated bounding box and the corresponding actual bounding box and to judge whether the accuracy standard is reached.
8. The method for recognizing and grabbing the object posture based on deep learning as claimed in claim 7, wherein step S4.2 comprises the following steps:
S4.2.1, using the ground-truth data provided by the data set obtained in step S2 to obtain the pose of the object model, and then cropping the object model out of the image;
S4.2.2, compositing the cropped object model onto other pictures so as to replace the background;
S4.2.3, applying image processing to the composited images, including saturation changes, brightness changes and noise addition, resulting in the augmented data set.
9. The method for recognizing and grabbing the object posture based on deep learning as claimed in claim 8, wherein in step S5, the trained object posture detector is applied to a mechanical arm in the laboratory; after construction and training according to steps S1-S4, a grabbing point is calculated from the pose of the object, so that the mechanical arm or intelligent robot can recognize and grab the object on the real platform.
10. The method for recognizing and grabbing the object posture based on deep learning as claimed in any one of claims 1-9, wherein the grabbing point is calculated as follows:
S5.1, calculating the bounding box with the K-DOP algorithm according to the pose of the object model;
S5.2, selecting a grabbing point on the bounding box according to the actual type of the mechanical claw of the mechanical arm.
CN202110732696.7A 2021-06-29 2021-06-29 Object gesture recognition and grabbing intelligent robot method based on deep learning Active CN113370217B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110732696.7A CN113370217B (en) 2021-06-29 2021-06-29 Object gesture recognition and grabbing intelligent robot method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110732696.7A CN113370217B (en) 2021-06-29 2021-06-29 Object gesture recognition and grabbing intelligent robot method based on deep learning

Publications (2)

Publication Number Publication Date
CN113370217A true CN113370217A (en) 2021-09-10
CN113370217B CN113370217B (en) 2023-06-16

Family

ID=77579948

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110732696.7A Active CN113370217B (en) 2021-06-29 2021-06-29 Object gesture recognition and grabbing intelligent robot method based on deep learning

Country Status (1)

Country Link
CN (1) CN113370217B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114310954A (en) * 2021-12-31 2022-04-12 北京理工大学 Self-adaptive lifting control method and system for nursing robot
CN114474056A (en) * 2022-01-26 2022-05-13 北京航空航天大学 Grabbing operation-oriented monocular vision high-precision target positioning method
CN115082795A (en) * 2022-07-04 2022-09-20 梅卡曼德(北京)机器人科技有限公司 Virtual image generation method, device, equipment, medium and product
CN115070780A (en) * 2022-08-24 2022-09-20 北自所(北京)科技发展股份有限公司 Industrial robot grabbing method and device based on digital twinning and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109903332A (en) * 2019-01-08 2019-06-18 杭州电子科技大学 A kind of object's pose estimation method based on deep learning
US10427306B1 (en) * 2017-07-06 2019-10-01 X Development Llc Multimodal object identification
CN110782492A (en) * 2019-10-08 2020-02-11 三星(中国)半导体有限公司 Pose tracking method and device
CN112150551A (en) * 2020-09-25 2020-12-29 北京百度网讯科技有限公司 Object pose acquisition method and device and electronic equipment

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10427306B1 (en) * 2017-07-06 2019-10-01 X Development Llc Multimodal object identification
CN109903332A (en) * 2019-01-08 2019-06-18 杭州电子科技大学 A kind of object's pose estimation method based on deep learning
CN110782492A (en) * 2019-10-08 2020-02-11 三星(中国)半导体有限公司 Pose tracking method and device
CN112150551A (en) * 2020-09-25 2020-12-29 北京百度网讯科技有限公司 Object pose acquisition method and device and electronic equipment

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114310954A (en) * 2021-12-31 2022-04-12 北京理工大学 Self-adaptive lifting control method and system for nursing robot
CN114310954B (en) * 2021-12-31 2024-04-16 北京理工大学 Self-adaptive lifting control method and system for nursing robot
CN114474056A (en) * 2022-01-26 2022-05-13 北京航空航天大学 Grabbing operation-oriented monocular vision high-precision target positioning method
CN114474056B (en) * 2022-01-26 2023-07-21 北京航空航天大学 Monocular vision high-precision target positioning method for grabbing operation
CN115082795A (en) * 2022-07-04 2022-09-20 梅卡曼德(北京)机器人科技有限公司 Virtual image generation method, device, equipment, medium and product
CN115070780A (en) * 2022-08-24 2022-09-20 北自所(北京)科技发展股份有限公司 Industrial robot grabbing method and device based on digital twinning and storage medium
CN115070780B (en) * 2022-08-24 2022-11-18 北自所(北京)科技发展股份有限公司 Industrial robot grabbing method and device based on digital twinning and storage medium

Also Published As

Publication number Publication date
CN113370217B (en) 2023-06-16

Similar Documents

Publication Publication Date Title
CN113370217B (en) Object gesture recognition and grabbing intelligent robot method based on deep learning
CN109344882B (en) Convolutional neural network-based robot control target pose identification method
CN109816725B (en) Monocular camera object pose estimation method and device based on deep learning
CN111695562B (en) Autonomous robot grabbing method based on convolutional neural network
CN109986560B (en) Mechanical arm self-adaptive grabbing method for multiple target types
CN115097937A (en) Deep learning system for cuboid detection
KR101964282B1 (en) 2d image data generation system using of 3d model, and thereof method
US20210205988A1 (en) Task embedding for device control
CN111553949B (en) Positioning and grabbing method for irregular workpiece based on single-frame RGB-D image deep learning
CN111127548B (en) Grabbing position detection model training method, grabbing position detection method and grabbing position detection device
CN112639846A (en) Method and device for training deep learning model
Rambach et al. Learning 6dof object poses from synthetic single channel images
CN108961144A (en) Image processing system
CN110910452B (en) Low-texture industrial part pose estimation method based on deep learning
JP2021163503A (en) Three-dimensional pose estimation by two-dimensional camera
CN111832592A (en) RGBD significance detection method and related device
CN115903541A (en) Visual algorithm simulation data set generation and verification method based on twin scene
Marchand Control camera and light source positions using image gradient information
Inoue et al. Transfer learning from synthetic to real images using variational autoencoders for robotic applications
CN115147488A (en) Workpiece pose estimation method based on intensive prediction and grasping system
CN111489394A (en) Object posture estimation model training method, system, device and medium
JP2021163502A (en) Three-dimensional pose estimation by multiple two-dimensional cameras
JP2021176078A (en) Deep layer learning and feature detection through vector field estimation
Arents et al. Construction of a smart vision-guided robot system for manipulation in a dynamic environment
CN117115917A (en) Teacher behavior recognition method, device and medium based on multi-modal feature fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant