CN113409384A

CN113409384A - Pose estimation method and system of target object and robot

Info

Publication number: CN113409384A
Application number: CN202110939214.5A
Authority: CN
Inventors: 杨洋
Original assignee: Shenzhen Huahan Weiye Technology Co ltd
Current assignee: Shenzhen Huahan Weiye Technology Co ltd
Priority date: 2021-08-17
Filing date: 2021-08-17
Publication date: 2021-09-17
Anticipated expiration: 2041-08-17
Also published as: CN113409384B

Abstract

A pose estimation method and system of a target object and a robot are provided, wherein the pose estimation method comprises the following steps: acquiring an image to be processed; inputting an image to be processed into a target detection network to obtain a target detection result image; inputting the target detection result image into a trained view reconstruction model to obtain a three-dimensional reconstruction image, wherein the three-dimensional reconstruction image comprises three channels and is used for representing three-dimensional coordinates corresponding to pixels; calculating a transformation matrix according to the two-dimensional coordinates and the corresponding three-dimensional coordinates of the pixels; and calculating according to the transformation matrix to obtain an equivalent shaft angle and an equivalent rotating shaft, thereby obtaining the pose of the target object. The view reconstruction model is obtained through training and used for establishing the mapping relation between the two-dimensional coordinates and the corresponding three-dimensional coordinates of the pixels in the image, the three-dimensional reconstruction image covering the target object can be obtained through the view reconstruction model, the three-dimensional coordinates corresponding to each pixel are obtained, and the pose estimation method can be suitable for pose estimation of objects which are low in texture, have light reflecting surfaces or are partially shielded.

Description

Pose estimation method and system of target object and robot

Technical Field

The invention relates to the technical field of machine vision, in particular to a pose estimation method and system for a target object and a robot.

Background

In the robot field, the independent grabbing of the target object is a key capability of the intelligent robot, wherein the grabbing of the scattered objects is also a key for realizing the intellectualization of the robot all the time, and the capability of the robot for grabbing the scattered objects can be applied to scenes such as sorting of parts and the like, so that the working efficiency is improved. However, the current robots perform a complex new gripping task, taking weeks to reprogram, which makes the reconfiguration of modern manufacturing lines very expensive and slow. In addition, the robot is mostly applied to a specific environment, the robot performs grabbing operation on a specific known object, and for an unknown object placed in different poses in an uncertain environment, the robot autonomously determines the grabbing position of the grabbed object and the grabbing pose of the grabbing gripper, so that the prior art is immature. If the robot can autonomously grab scattered objects, the time for teaching programming of the robot can be shortened, the flexibility and the intelligence of automatic manufacturing are better realized, the production requirements of multiple types and small batches at present are met, and the requirement of quick updating of manufacturing equipment when products are updated is met. And for the posture recognition of the scattered objects, the method is an important step for controlling the robot to grab the scattered objects.

Computer vision techniques have occupied an important position in the perception of robot unstructured scenes. The visual image is an effective means for acquiring real world information, the visual image is used for extracting the characteristics of an operation object through a visual perception algorithm, such as the position, angle, posture and other information of an object, and the information can be used for enabling the robot to execute corresponding operation and finish a specified operation task. For sorting parts, scene data can be acquired by using a vision sensor, but how to identify a target object from a scene and estimate the position and the posture of the target object is a very critical problem, which is very important for the calculation of a robot grabbing position and a grabbing path. At present, there are two main types of methods for estimating the pose of an object: the method is based on the traditional point cloud or image analysis algorithm for estimation, and is based on deep learning for estimation through a learning target detection and pose iteration method. The first method mainly identifies and matches the pose according to the image or three-dimensional point cloud template information, and has the defects that a template needs to be established according to shot images or CAD data for each object, multiple templates need to be established for multiple parts, and the product model changing period is long. The pose estimation mainly carries out 6D (three-dimensional coordinate positioning and three-dimensional direction) pose estimation, local features extracted from an image are matched with features in a three-dimensional model of an object, and the 6D pose of the object can be obtained by utilizing the corresponding relation between two-dimensional coordinates and three-dimensional coordinates. However, these methods do not handle low-texture objects very well because only few local features can be extracted. Similarly, most of the existing mainstream pose estimation algorithms based on deep learning rely on information such as color and texture of the surface of an object, and most of parts in industrial production belong to low-texture objects, and are easily influenced by illumination conditions, so that the texture reflected from a two-dimensional image is not necessarily the real texture of the surface of a three-dimensional object, and when the resolution of the image changes, the calculated texture may have large deviation, and feature extraction is not easy to perform, so that the algorithm has a poor recognition effect on the low-texture and parts with reflective surfaces. In practical situations, there is often a problem that the target object is partially occluded, which also results in difficulty in obtaining information of local features or color, texture, etc. of the object surface. In order to process low-texture objects, two methods are available, wherein the first method is to estimate the three-dimensional coordinates of object pixels or key points in an input image, so that the corresponding relation between two-dimensional coordinates and the three-dimensional coordinates is established, and 6D pose estimation can be carried out; the second method is to transform the 6D pose estimation problem into a pose classification problem or a pose regression problem by discretizing the pose space. These methods can handle low-texture objects, but are difficult to achieve high-precision pose estimation, and small errors in the classification or regression stage will directly result in pose mismatch.

Disclosure of Invention

The application provides a pose estimation method and system of a target object, a robot and a computer readable storage medium, and aims to solve the problem that the pose estimation effect of an object with low texture and a reflective surface is poor due to the fact that most existing pose estimation methods depend on information such as color and texture of the surface of the object.

According to a first aspect, an embodiment provides a pose estimation method of a target object, including:

acquiring an image to be processed;

inputting the image to be processed into a target detection network to detect a target object in the image to be processed to obtain a target detection result image;

inputting the target detection result image into a pre-trained view reconstruction model to obtain a three-dimensional reconstruction image, wherein the three-dimensional reconstruction image comprises three channels and is used for representing three-dimensional coordinates corresponding to pixels;

calculating a transformation matrix according to the two-dimensional coordinates and the corresponding three-dimensional coordinates of the pixels;

and calculating according to the transformation matrix to obtain an equivalent shaft angle and an equivalent rotating shaft, thereby obtaining the pose of the target object.

In one embodiment, the view reconstruction model is trained by:

obtaining a sample image and a corresponding three-dimensional coordinate marker imageI _GT；

Inputting the sample image into the target detection network to detect a target object in the sample image to obtain a target detection result imageI _src；

The target detection result image is processedI _srcInputting the image reconstruction model to obtain a three-dimensional reconstruction imageI _3DThe three-dimensional reconstruction image comprises three channels and is used for representing predicted three-dimensional coordinates corresponding to pixels;

according to the corresponding predicted three-dimensional coordinates of each pixel

And three-dimensional coordinate mark value

Calculating the actual reconstruction error for each pixel

Constructing a first loss function using the actual reconstruction errors of all pixels, superscriptingiIs shown asiA plurality of pixels;

reconstructing the three-dimensional imageI _3DAnd the three-dimensional coordinate mark imageI _GTInputting the error values into a preset error regression discriminant network to obtain the predicted reconstruction error of each pixel

Using the predicted reconstruction errors of all pixels

And actual reconstruction error

Constructing a second loss function;

constructing a third loss function by using a result obtained by inputting the three-dimensional reconstruction image into the error regression judging network and a result obtained by inputting the three-dimensional coordinate marking image into the error regression judging network;

and constructing a total loss function by using a weighted sum of the first loss function, the second loss function and the third loss function, and training the view reconstruction model and the error regression discriminant network by using a back propagation algorithm according to the total loss function to obtain parameters of the view reconstruction model.

In one embodiment, the first loss function is

，

Wherein the content of the first and second substances,nwhich represents the number of pixels that are to be counted,λis a preset weight value, and is used as a weight value,Frepresenting a set of pixels in the image that belong to a target object;

for a symmetric object, the first loss function is

，

Wherein the content of the first and second substances,symrepresenting the set of all symmetric poses of a symmetric object,pis shown aspThe symmetrical postures of the robot are symmetrical,R _Pis shown aspA transformation matrix of the symmetric poses;

the second loss function is

，

The third loss function is

，

Wherein the content of the first and second substances,Jan identification of a network is discriminated for the error regression,Gan identification of a model is reconstructed for the view,G(I _src) Representing the three-dimensional reconstructed image(s),J(G(I _src) Represents the result of inputting the three-dimensional reconstructed image into the error regression discrimination network,J(I _GT) Representing the result obtained by inputting the three-dimensional coordinate marking image into the error regression discrimination network;

the total loss function is

，

For a symmetric object, the total loss function is

，

Wherein the content of the first and second substances,αandβis a preset weight value.

In one embodiment, the three-dimensional coordinate marker image is obtained by: and mapping points on the target object into pixels on the image plane according to the predicted transformation relation between the three-dimensional coordinates of the target object and the two-dimensional coordinates of the image plane, normalizing the three-dimensional coordinates of the target object to be used as RGB values of corresponding pixels on the image plane, and thus obtaining the three-dimensional coordinate marked image.

In one embodiment, the view reconstruction model is a self-encoder structure, and includes an encoder and a decoder connected by one or more fully-connected layers, and the outputs of several layers in the encoder and the outputs of symmetrical layers in the decoder are channel-spliced.

In one embodiment, the calculating the equivalent shaft angle and the equivalent rotation axis according to the transformation matrix includes:

the equivalent shaft angle is calculated according to the following formula:

，

the equivalent rotation axis is calculated according to the following formula:

，

wherein the content of the first and second substances,r ₁₁、r ₁₂、r ₁₃、r ₂₁、r ₂₂、r ₂₃、r ₃₁、r ₃₂、r ₃₃the elements of the transformation matrix are specifically:

。

according to a second aspect, an embodiment provides a pose estimation system of a target object, including:

the image acquisition module is used for acquiring an image to be processed;

the target detection network is connected with the image acquisition module and is used for detecting a target object in the image to be processed to obtain a target detection result image;

the view reconstruction model is connected with the target detection network and used for calculating the target detection result image to obtain a three-dimensional reconstruction image, and the three-dimensional reconstruction image comprises three channels and is used for representing three-dimensional coordinates corresponding to pixels;

the transformation matrix calculation module is used for calculating a transformation matrix according to the two-dimensional coordinates of the pixels and the corresponding three-dimensional coordinates;

and the pose calculation module is used for calculating to obtain an equivalent shaft angle and an equivalent rotating shaft according to the transformation matrix so as to obtain the pose of the target object.

In one embodiment, the pose estimation system of the target object further comprises a view model training module, and the view model training module is configured to train the view reconstruction model by:

And three-dimensional coordinate mark value

Calculating the actual reconstruction error for each pixel

Using the predicted reconstruction errors of all pixels

And actual reconstruction error

Constructing a second loss function;

According to a third aspect, there is provided in an embodiment a robot comprising:

a camera for taking an image to be processed including a target object;

the tail end of the mechanical arm is provided with a mechanical claw which is used for grabbing the target object according to the pose of the target object;

and the processor is connected with the camera and the mechanical arm, and is used for acquiring an image to be processed through the camera, obtaining the pose of the target object by executing the pose estimation method of the first aspect, and sending the pose to the mechanical arm so that the mechanical arm grabs the target object.

According to a fourth aspect, an embodiment provides a computer-readable storage medium having a program stored thereon, the program being executable by a processor to implement the pose estimation method of the first aspect described above.

According to the method and system for estimating the pose of the target object, the robot and the computer-readable storage medium of the embodiments, the problem of detecting and estimating the pose of the three-dimensional object is decomposed into the problem of detecting the target in the two-dimensional image and estimating the pose in the three-dimensional space, so that a complicated problem is simplified into two simple problems, and meanwhile, in the process of estimating the pose, the problem of estimating the pose is decomposed into two processes, one is to solve the mapping relation between the two-dimensional coordinates of the pixels in the image and the corresponding three-dimensional coordinates, the other is to estimate the pose of the target object, and the same is to simplify the complicated problem into two simple problems, so that the complexity of solving the problem of estimating the pose of the target object is reduced, and the operation efficiency is improved. Because the view reconstruction model is obtained through training and used for establishing the mapping relation between the two-dimensional coordinates of the pixels in the image and the corresponding three-dimensional coordinates, the three-dimensional reconstruction image covering the target object can be obtained through the view reconstruction model, the three-dimensional coordinates corresponding to each pixel are obtained, the pose estimation method can adapt to the pose estimation of objects which are low in texture and have light reflecting surfaces or are partially shielded, meanwhile, the pose estimation is solved based on the mapping relation between the two-dimensional coordinates and the three-dimensional coordinates of the pixel level, and the pose estimation precision is favorably improved.

Drawings

FIG. 1 is a schematic diagram of a pose estimation method of a target object according to the present application;

FIG. 2 is a flow diagram of a pose estimation method of a target object in one embodiment;

FIG. 3 is a schematic structural diagram of a view reconstruction model according to an embodiment;

FIG. 4 is a flowchart illustrating training of a view reconstruction model according to an embodiment;

FIG. 5 is a diagram illustrating an exemplary structure of an error regression discriminant network;

FIG. 6 is a schematic diagram of the calculation of the transformation relationship between the camera coordinate system and the world coordinate system;

FIG. 7 is a schematic diagram of a pose estimation system for a target object according to an embodiment;

fig. 8 is a schematic structural diagram of a robot in an embodiment.

Detailed Description

The present invention will be described in further detail with reference to the following detailed description and accompanying drawings. Wherein like elements in different embodiments are numbered with like associated elements. In the following description, numerous details are set forth in order to provide a better understanding of the present application. However, those skilled in the art will readily recognize that some of the features may be omitted or replaced with other elements, materials, methods in different instances. In some instances, certain operations related to the present application have not been shown or described in detail in order to avoid obscuring the core of the present application from excessive description, and it is not necessary for those skilled in the art to describe these operations in detail, so that they may be fully understood from the description in the specification and the general knowledge in the art.

Furthermore, the features, operations, or characteristics described in the specification may be combined in any suitable manner to form various embodiments. Also, the various steps or actions in the method descriptions may be transposed or transposed in order, as will be apparent to one of ordinary skill in the art. Thus, the various sequences in the specification and drawings are for the purpose of describing certain embodiments only and are not intended to imply a required sequence unless otherwise indicated where such sequence must be followed.

The numbering of the components as such, e.g., "first", "second", etc., is used herein only to distinguish the objects as described, and does not have any sequential or technical meaning. The term "connected" and "coupled" when used in this application, unless otherwise indicated, includes both direct and indirect connections (couplings).

The idea of the technical scheme of the application is to establish a mapping relation between a two-dimensional image coordinate and a three-dimensional coordinate, then obtain a three-dimensional coordinate corresponding to the two-dimensional coordinate of each pixel in the image according to the mapping relation, obtain a transformation relation between a world coordinate system and a camera coordinate system by using the corresponding relation between the two-dimensional coordinate and the three-dimensional coordinate, solve the transformation relation mainly by calculating a homography matrix, obtain rotation and translation quantities for pose estimation after obtaining the homography matrix, and describe in detail below.

Fig. 1 and 2 show a general flow of a pose estimation method of a target object according to the present application, and referring to fig. 2, an embodiment of the pose estimation method of a target object includes steps 110 to 150, which are described in detail below.

Step 110: and acquiring an image to be processed. The scene where the target object is located can be shot by using imaging equipment such as a camera or a video camera to obtain an image to be processed including the target object, and the image is used for carrying out pose estimation on the target object subsequently. The target object can be a product on an industrial production line, a mechanical part in an article box, a tool on an operation table and the like.

Step 120: and inputting the image to be processed into a target detection network to detect a target object in the image to be processed to obtain a target detection result image.

In order to estimate the pose of the target object, the target object needs to be identified from the image, the position of the target object needs to be obtained, and the target object needs to be processed in a targeted manner. In the method and the device, the problems of detection and pose estimation of the three-dimensional object are decomposed into the problems of target detection in the two-dimensional image and pose estimation in the three-dimensional space, so that a complex problem is simplified into two simple problems, and the complexity of solving is reduced. In this step, a target detection network is first used to perform target detection on the image to be processed to obtain a target detection result image, and to obtain the position and type of the target object in the image to be processed to implement target detection on the two-dimensional image, and the target detection network may use the existing target detection network structure, such as SSD, YOLO, fast RCNN, Mobile Net, etc.

Step 130: and inputting the target detection result image into a pre-trained view reconstruction model to obtain a three-dimensional reconstruction image, wherein the three-dimensional reconstruction image comprises three channels and is used for representing three-dimensional coordinates corresponding to pixels.

Referring to fig. 1, in the present application, a mapping relationship between two-dimensional image coordinates and three-dimensional coordinates is established by constructing and training a view reconstruction model, the view reconstruction model reconstructs a three-dimensional coordinate corresponding to each pixel from a two-dimensional image, and outputs a three-dimensional reconstructed image with three channels to represent the three-dimensional coordinates corresponding to the pixels, where the three channels correspond to the three-dimensional coordinates in a space. The mapping relation is trained by utilizing the sample image and the corresponding three-dimensional coordinate labeled image to train the view reconstruction model, wherein the three-dimensional coordinate labeled image is constructed in advance according to the actual three-dimensional coordinate corresponding to the sample image, the view reconstruction model obtains a stable mapping relation through the training, and the error between the three-dimensional reconstruction result and the actual value is minimized as much as possible, so that the mapping relation between the two-dimensional image coordinate and the three-dimensional coordinate of an unknown object can be well established.

Referring to fig. 3, in an embodiment, the view reconstruction model may be a structure of a self-encoder, including an encoder and a decoder, where the encoder mainly includes convolution, pooling and activation function operations, and the convolution kernel size may be 3x3 or 5x5, and the step size is 1; the decoder mainly comprises up-sampling and convolution operation; the encoder and the decoder are connected through one or more full-connection layers, and the outputs of a plurality of layers in the encoder and the outputs of symmetrical layers in the decoder are subjected to channel splicing, so that the splicing of the multi-scale characteristic diagram is realized, the multi-scale characteristic diagram can adapt to the receptive field of a large object and the receptive field of a small object, and the large object and the small object can be subjected to the establishment of the mapping relation and the pose estimation at the same time. It can be understood that, for the same object, the mapping relationship between the two-dimensional image coordinate and the three-dimensional coordinate is different under different viewing angles, and therefore, in order to adapt to the mapping relationship under different viewing angles, the mapping relationship cannot be simply represented by linear transformation, and needs to be expressed by a higher-order function or transformation, so in this embodiment, by using an auto-Encoder structure, in an Encoder (encoding) and Decoder (decoding) manner, the adaptive higher-order function or transformation is finally reconstructed by performing multiple convolutions, pooling, activating functions, upsampling and concatenating of multi-scale feature maps.

The three-dimensional coordinate labeled image used in training can be constructed by normalizing the actual three-dimensional coordinate and converting the normalized actual three-dimensional coordinate into three-channel color space information of the image. The position data of the object in three-dimensional space can be represented by a three-dimensional point cloud coordinate system (x，y， z) To characterize, can be given byx，y，z) Coordinate mapping into normalized space, converting into color space information: (R，G，B) If the transformation relation between the three-dimensional coordinates of the target object and the two-dimensional coordinates of the image plane is determined, the three-dimensional coordinates of the target object can be converted into the two-dimensional coordinates on the image, and the corresponding color space information is obtained (R，G，B) Stored at corresponding pixels in the image, a two-dimensional color image having three-dimensional coordinate information of the object is obtained, R, G, B three channels in the image and three-dimensional coordinatesx、y、zThe conversion process maps the normalized three-dimensional coordinates of the target object to the RGB values of the color space directly without feature matching, and solves the problem that the target object is difficult to extract features under the conditions of low texture, reflective surface or partial shielding. Therefore, according to the predicted transformation relation between the three-dimensional coordinates of the target object and the two-dimensional coordinates of the image plane, the points on the target object are mapped to pixels on the image plane, then the three-dimensional coordinates of the target object are normalized and used as RGB values of corresponding pixels on the image plane, and therefore a three-dimensional coordinate marker image is obtained and used for representing the true values of the three-dimensional coordinates of the target object. It can be understood that after training, the three-dimensional reconstructed image output by the view reconstruction model is a color image, wherein the RGB values correspond to the normalized three-dimensional coordinates of the target object, thereby establishing the correspondence between the two-dimensional coordinates and the three-dimensional coordinates of each pixel.

For use in training of the view reconstruction model, RGB-D heterologous data, i.e. of the view reconstruction modelInputting images as RGB information, three-dimensional coordinate marking images since they include in spacex，y，z) The information may be depth d (depth) information. For heterogeneous data, there are two main ideas to extract and utilize features: the first method is that color space RGB information and depth D information are respectively input into different networks for processing, color space characteristic information and depth space characteristic information are respectively extracted, and then the two kinds of characteristic information are fused; the second method is to input RGB information, acquire an approximate position of the target object, and then guide feature extraction of depth D information using the position acquired from the RGB information as mask information. This application adopts the second thinking to carry out concrete implementation.

The following describes a training process of the view reconstruction model in detail, fig. 1 and 4 show an overall process of training the view reconstruction model, please refer to fig. 4, the training process of the view reconstruction model includes steps 131 to 137, which is described in detail below.

Step 131: obtaining a sample image and a corresponding three-dimensional coordinate marker imageI _GT. The sample image can be obtained by shooting a target object in different scenes by using an imaging device such as a camera or a video camera.

Step 132: inputting the sample image into a target detection network to detect a target object in the sample image to obtain a target detection result imageI _srcAnd obtaining the position of the target object.

Step 133: image of target detection resultI _srcInputting into the view reconstruction model to obtain a three-dimensional reconstructed imageI _3DFrom the above, it can be seen that the three-dimensional reconstructed image includes three channels, here, predicted three-dimensional coordinates representing pixel correspondences.

Step 134: according to the corresponding predicted three-dimensional coordinates of each pixel

And three-dimensional coordinate mark value

Calculating the actual reconstruction error for each pixel

Constructing a first loss function using the actual reconstruction errors of all pixels, wherein the upscaling is performediIs shown asiAnd (4) a pixel. The first loss function represents the difference between the three-dimensional coordinate reconstructed by the view reconstruction model and the actual value, and the target detection result imageI _srcThe pixels of the middle foreground part (i.e. the pixels belonging to the target object) should have a larger influence on the training than the pixels of the background part, so that the pixels of the foreground part and the pixels of the background part can be set to have different weight values, and thus the first loss function can be

，

Wherein the content of the first and second substances,nwhich represents the number of pixels that are to be counted,λis a preset weight value, and is used as a weight value,Fi.e. mask information of the target object, i.e.FRepresenting a set of pixels in the image that were previously labeled as belonging to the target object.

For a symmetric object, the first loss function is

，

WhereinsymRepresenting the set of all symmetric poses of a symmetric object,pis shown aspThe symmetrical postures of the robot are symmetrical,R _Pis shown aspA transformation matrix of symmetric poses. For example, a cube-shaped object having three axes of symmetry, then has three symmetrical poses.

Step 135: reconstructing a three-dimensional imageI _3DAnd three-dimensional coordinate marker imagesI _GTInputting the error values into a preset error regression discriminant network to obtain the predicted reconstruction error of each pixel

By usingPredictive reconstruction error for all pixels

And actual reconstruction error

Constructing a second loss function, which may be

。

The error regression discrimination network is used for evaluating the quality of the view reconstruction model on a three-dimensional coordinate reconstruction result, can distinguish the difference between a three-dimensional reconstruction image and a three-dimensional coordinate marking image, is in a confrontation relation with the view reconstruction model, and can give feedback to change the view reconstruction model in the direction of reducing the error if the error of the view reconstruction model is increased in the training process, so that the quality of three-dimensional coordinate reconstruction is improved. Referring to FIG. 5, the error regression discrimination network may include several convolution-pooling layers.

Step 136: and constructing a third loss function by utilizing a result obtained by inputting the three-dimensional reconstruction image into the error regression discrimination network and a result obtained by inputting the three-dimensional coordinate marking image into the error regression discrimination network, wherein the third loss function represents the resolution capability of the error regression discrimination network. Here, the result of inputting the three-dimensional reconstructed image or the three-dimensional coordinate marker image into the error regression discrimination network is considered separately, and the end of the network may be a softmax function, so that the result may be a value between 0 and 1. The objective of the view reconstruction model is to approximate the output three-dimensional reconstructed image to the three-dimensional coordinate labeled image, and the results of the error regression discrimination network and the three-dimensional coordinate labeled image as inputs are very close to each other, so that the third loss function can be

，

Wherein the content of the first and second substances,Jthe identity of the network is discriminated for error regression,Gfor the identification of the view reconstruction model,G(I _src) A three-dimensional reconstructed image is represented,J(G(I _src) ) represents the result of inputting the three-dimensional reconstructed image into an error regression discrimination network,J(I _GT) And representing the result obtained by inputting the three-dimensional coordinate marking image into the error regression discrimination network.

Step 137: and constructing a total loss function by using the weighted sum of the first loss function, the second loss function and the third loss function, and training the view reconstruction model and the error regression discriminant network by using a back propagation algorithm according to the total loss function to finally obtain parameters of the view reconstruction model. In order to make the error regression discrimination network distinguish the difference as much as possible, the view reconstruction model can reconstruct a three-dimensional coordinate reconstruction image close to the actual value as much as possible, and the total loss function can be

，

For a symmetric object, the total loss function is

，

The following steps 140 to 150 are described.

Step 140: and calculating a transformation matrix according to the two-dimensional coordinates and the corresponding three-dimensional coordinates of the pixels.

The processing in step 130 corresponds to obtaining two-dimensional coordinate points of the image (ii) inu _i，v _i) Three-dimensional coordinate points in the corresponding world coordinate system(s) ((x _i，y _i，z _i) The transformation relation between the world coordinate system and the camera coordinate system can be obtained by using the point pairs formed by the two-dimensional points and the corresponding three-dimensional points, and the subsequent calculation of the equivalent shaft angle and the equivalent rotating shaft can be further obtainedAnd transforming the matrix. Referring to FIG. 6, two-dimensional points in an image

The three-dimensional point (3D point for short) corresponding to the (2D point for short) is

The transformation relation between the world coordinate system and the camera coordinate system can be a rotation matrixRAnd translation vectortIs expressed as

. 3D points in the world coordinate system homogeneous coordinates as

And the homogeneous coordinate of the 2D point in the image coordinate system is recorded as

The calibrated intrinsic parameters of the camera are

。

Then the 3D point-to-2D point projection may be represented as

，

Wherein

Is a scaling factor. According to the usual theory

There should be 6 degrees of freedom, but a rotation matrixRAlthough there are 9 parameters, there are only 3 degrees of freedom because the rotation matrix has orthogonal constraints. The rotation matrix can be ignored firstly in the calculationRIs orthogonally constrained in accordance with

Is provided with

When 12 unknown parameters are calculated, the above formula can be changed into

，

After expansion, an equation set is obtained

，

Elimination

Written in matrix form

，

As can be derived from the above, 1 pair of 3D-2D point pairs can provide two equations, the number of the point pairs

When, a form can be obtained

The rotation matrix can be obtained by SVD (Singular Value Decomposition)RWill rotate the matrixRUsed as a transformation matrix for subsequent calculation of the equivalent shaft angles and the equivalent rotation axes. In practical application, the transformation matrix can be solved through RANSAC algorithm, namely, the transformation matrix is selected arbitrarilyNAnd calculating a transformation matrix by taking the individual point pairs as initial point pairs, then carrying out iterative optimization, and evaluating the error of the obtained transformation matrix until the error is smaller than a set threshold value.

Because the two-dimensional coordinates and the corresponding three-dimensional coordinates of each pixel are obtained in step 130, a transformation matrix is calculated based on the mapping relationship between the two-dimensional coordinates and the three-dimensional coordinates of the pixel level in this step, which is favorable for improving the accuracy of pose estimation.

Step 150: and calculating according to the transformation matrix to obtain an equivalent shaft angle and an equivalent rotating shaft, thereby obtaining the pose of the target object.

The resulting transformation matrix can be expressed as

，

Then the equivalent shaft angle is

，

An equivalent rotation axis of

。

It can be understood that the pose of the target object is estimated after the equivalent shaft angle and the equivalent rotating shaft are obtained, and the robot can grab the target object according to the pose.

Referring to fig. 7, in an embodiment, the pose estimation system of the target object includes an image acquisition module 11, a target detection network 12, a view reconstruction model 13, a transformation matrix calculation module 14, and a pose calculation module 15, which are respectively described below.

The image acquisition module 11 is configured to acquire an image to be processed, where the image to be processed may be obtained by shooting a scene where a target object is located with an imaging device such as a camera or a video camera, and is used to perform pose estimation on the target object subsequently. The target object can be a product on an industrial production line, a mechanical part in an article box, a tool on an operation table and the like.

The target detection network 12 is connected to the image acquisition module 11, and is configured to detect a target object in the image to be processed, obtain a target detection result image, and obtain a position and a category of the target object. The target detection network may use existing target detection network structures such as SSD, YOLO, fast RCNN, Mobile Net, etc.

The view reconstruction model 13 is connected to the target detection network 12, and is configured to calculate a target detection result image to obtain a three-dimensional reconstruction image, where the three-dimensional reconstruction image includes three channels and is used to represent three-dimensional coordinates corresponding to pixels.

The view reconstruction model 13 is configured to establish a mapping relationship between two-dimensional coordinates and three-dimensional coordinates, and reconstruct three-dimensional coordinates corresponding to each pixel from the two-dimensional image. The view reconstruction model 13 needs to be trained, the view reconstruction model 13 is trained by utilizing the sample image and the corresponding three-dimensional coordinate labeled image to train the mapping relation, wherein the three-dimensional coordinate labeled image is pre-constructed according to the actual three-dimensional coordinate corresponding to the sample image, the view reconstruction model 13 obtains a stable mapping relation through the training, the error between the three-dimensional reconstruction result and the actual value is minimized as much as possible, and thus, the mapping relation between the two-dimensional image coordinate and the three-dimensional coordinate of an unknown object can be well established.

Referring to fig. 3, in an embodiment, the view reconstruction model 13 may be a self-encoder structure, which includes an encoder and a decoder, the encoder mainly includes convolution, pooling and activation function operations, the convolution kernel size may be 3x3 or 5x5, and the step size is 1; the decoder mainly comprises up-sampling and convolution operation; the encoder and the decoder are connected through one or more full-connection layers, and the outputs of a plurality of layers in the encoder and the outputs of symmetrical layers in the decoder are subjected to channel splicing, so that the splicing of the multi-scale characteristic diagram is realized, the multi-scale characteristic diagram can adapt to the receptive field of a large object and the receptive field of a small object, and the large object and the small object can be subjected to the establishment of the mapping relation and the pose estimation at the same time. It can be understood that, for the same object, the mapping relationship between the two-dimensional image coordinate and the three-dimensional coordinate is different under different viewing angles, and therefore, in order to adapt to the mapping relationship under different viewing angles, the mapping relationship cannot be simply represented by linear transformation, and needs to be expressed by a higher-order function or transformation, so in this embodiment, by using an auto-Encoder structure, in an Encoder (encoding) and Decoder (decoding) manner, the adaptive higher-order function or transformation is finally reconstructed by performing multiple convolutions, pooling, activating functions, upsampling and concatenating of multi-scale feature maps.

The three-dimensional coordinate labeled image used in training can be constructed by normalizing the actual three-dimensional coordinate and converting the normalized actual three-dimensional coordinate into three-channel color space information of the image. The position data of the object in three-dimensional space can be represented by a three-dimensional point cloud coordinate system (x，y， z) To characterize, can be given byx，y，z) Coordinate mapping into normalized space, converting into color space information: (R，G，B) If the transformation relation between the three-dimensional coordinates of the target object and the two-dimensional coordinates of the image plane is determined, the three-dimensional coordinates of the target object can be converted into the two-dimensional coordinates on the image, and the corresponding color space information is obtained (R，G，B) Stored at corresponding pixels in the image, a two-dimensional color image having three-dimensional coordinate information of the object is obtained, R, G, B three channels in the image and three-dimensional coordinatesx、y、zThe conversion process maps the normalized three-dimensional coordinates of the target object to the RGB values of the color space directly without feature matching, and solves the problem that the target object is difficult to extract features under the conditions of low texture, reflective surface or partial shielding. Therefore, according to the predicted transformation relation between the three-dimensional coordinates of the target object and the two-dimensional coordinates of the image plane, the points on the target object are mapped to pixels on the image plane, then the three-dimensional coordinates of the target object are normalized and used as RGB values of corresponding pixels on the image plane, and therefore a three-dimensional coordinate marker image is obtained and used for representing the true values of the three-dimensional coordinates of the target object. It can be understood that after training, the three-dimensional reconstructed image output by the view reconstruction model 13 is a color image, wherein the RGB values correspond to the normalized three-dimensional coordinates of the target object, so as to establish the corresponding relationship between the two-dimensional coordinates and the three-dimensional coordinates of each pixel.

The pose estimation system of the target object may further include a view model training module 16 to train the view reconstruction model 13, where the training process mainly includes: obtaining a sample imageAnd corresponding three-dimensional coordinate marker imagesI _GT(ii) a Inputting the sample image into a target detection network to detect a target object in the sample image to obtain a target detection result imageI _srcObtaining the position of the target object; image of target detection resultI _srcInputting into the view reconstruction model to obtain a three-dimensional reconstructed imageI _3DThe three-dimensional reconstruction image comprises three channels, wherein the three channels are used for representing the corresponding predicted three-dimensional coordinates of the pixels; according to the corresponding predicted three-dimensional coordinates of each pixel

And three-dimensional coordinate mark value

Calculating the actual reconstruction error for each pixel

Constructing a first loss function using the actual reconstruction errors of all pixels, the first loss function may be

，

Wherein the content of the first and second substances,nwhich represents the number of pixels that are to be counted,λis a preset weight value, and is used as a weight value,Fi.e. mask information of the target object, i.e.FRepresenting a set of pixels in the image previously labeled as belonging to the target object;

for a symmetric object, the first loss function is

；

WhereinsymRepresenting the set of all symmetric poses of a symmetric object,pis shown aspThe symmetrical postures of the robot are symmetrical,R _Pis shown aspA transformation matrix of the symmetric poses;

reconstructing a three-dimensional imageI _3DAnd three-dimensional coordinate markImage of a personI _GTInputting the error values into a preset error regression discriminant network to obtain the predicted reconstruction error of each pixel

Using the predicted reconstruction errors of all pixels

And actual reconstruction error

Constructing a second loss function, referring to FIG. 5, the error regression discriminant network may include a number of convolution-pooling layers, and the second loss function may be

；

Constructing a third loss function using a result of inputting the three-dimensional reconstructed image into the error regression discrimination network and a result of inputting the three-dimensional coordinate mark image into the error regression discrimination network, where the result of inputting the three-dimensional reconstructed image or the three-dimensional coordinate mark image into the error regression discrimination network is considered separately, respectively, and the end of the network may be a softmax function, so that the obtained result may be a value between 0 and 1, and the third loss function may be

；

Wherein the content of the first and second substances,Jthe identity of the network is discriminated for error regression,Gfor the identification of the view reconstruction model,G(I _src) A three-dimensional reconstructed image is represented,J(G(I _src) ) represents the result of inputting the three-dimensional reconstructed image into an error regression discrimination network,J(I _GT) Representing the result obtained by inputting the three-dimensional coordinate mark image into an error regression discrimination network;

constructing a total loss function by using the weighted sum of the first loss function, the second loss function and the third loss function, training a view reconstruction model and an error regression discriminant network by using a back propagation algorithm according to the total loss function, and finally obtaining parameters of the view reconstruction model, wherein the total loss function can be

，

For a symmetric object, the total loss function is

。

For a detailed description of the training procedure of the view reconstruction model 13, reference may be made to the step 130 above, which is not described herein again.

The transformation matrix calculation module 14 is configured to calculate a transformation matrix according to the two-dimensional coordinates of the pixels and the corresponding three-dimensional coordinates. Through calculation of the attempted reconstruction model 13, it corresponds to the two-dimensional coordinate points where the image has been obtained (ii) ((iii))u _i，v _i) Three-dimensional coordinate points in the corresponding world coordinate system(s) ((x _i，y _i，z _i) The transformation relation between the world coordinate system and the camera coordinate system can be obtained by using the point pairs formed by the two-dimensional points and the corresponding three-dimensional points

WhereinRIn order to be a matrix of rotations,tfor translation vectors, the rotation matrix isRAs a transformation matrix for subsequent calculations of equivalent shaft angles and equivalent rotation axes. The specific calculation method can refer to the above step 140, and the transformation matrix can be solved by RANSAC algorithm, i.e. arbitrary selectionNAnd calculating a transformation matrix by taking the individual point pairs as initial point pairs, then carrying out iterative optimization, and evaluating the error of the obtained transformation matrix until the error is smaller than a set threshold value. The transformation relationship between the world coordinate system and the camera coordinate system can be solved according to the following formula:

，

wherein

Is the homogeneous coordinate of the 3D point in a world coordinate system,

is the homogeneous coordinate of the 2D point in the image coordinate system,

the internal parameters of the calibrated camera are obtained.

Has 12 unknown parameters, and is recorded as

Then the above formula can be changed to

，

After expansion, an equation set is obtained

，

Elimination

Written in matrix form

，

When, a form can be obtained

The rotation matrix can be obtained by SVD (Singular Value Decomposition)R。

The pose calculation module 15 is configured to calculate an equivalent axis angle and an equivalent rotation axis according to the transformation matrix, so as to obtain a pose of the target object. The transformation matrix obtained by the transformation matrix calculation module 14 can be expressed as

，

Then the equivalent shaft angle is

，

An equivalent rotation axis of

。

On the basis of the above-described target object pose estimation method, the present application also provides a robot, which may include a camera 21, a processor 22, and a robot arm 23, please refer to fig. 8.

The camera 21 is used to take a to-be-processed image including a target object, which may be a product on an industrial line, a mechanical part in an article box, a tool on an operation table, or the like. For example, the camera in fig. 8 photographs the target object in the object box.

The processor 22 is connected to the camera 21 and the robot arm 23, and configured to acquire an image to be processed through the camera 21, obtain a pose parameter of the target object by performing the pose estimation method, and send the pose parameter to the robot arm 23 so that the robot arm 23 grips the target object, where the pose parameter may refer to an equivalent rotation axis and an equivalent axis angle.

The end of the mechanical arm 23 is provided with a mechanical claw 231, and when receiving the pose parameter of the target object sent by the processor 22, the mechanical arm 23 and the mechanical claw 231 move according to the pose parameter to grab the target object.

According to the target object pose estimation method and system and the robot of the embodiment, the problem of detection and pose estimation of the three-dimensional object is decomposed into the problem of target detection in the two-dimensional image and the problem of pose estimation in the three-dimensional space, so that a complex problem is simplified into two simple problems, meanwhile, in the pose estimation process, the pose estimation problem is decomposed into two processes, one process is the solution of the mapping relation between the two-dimensional coordinates of the pixels in the image and the corresponding three-dimensional coordinates, the other process is the estimation of the pose of the target object, and the same process is simplified into the two simple problems, so that the complexity of the solution of the pose estimation problem of the target object is reduced, and the operation efficiency is improved. The view reconstruction model is obtained through training, the view reconstruction model adopts a self-encoder structure and is used for establishing a mapping relation between two-dimensional coordinates and corresponding three-dimensional coordinates of pixels in an image, a three-dimensional reconstruction image covering a target object can be obtained through the view reconstruction model, the three-dimensional coordinates corresponding to each pixel are obtained, the view reconstruction model can adapt to pose estimation of objects which are low in texture and have light reflecting surfaces or are partially shielded, the change of the external environment and illumination can be adapted, the view reconstruction model adopts multi-scale feature map splicing, pose estimation can be simultaneously carried out on small objects and large objects, and the view reconstruction model has good environment adaptability and cross-domain migration capability. Meanwhile, the pose estimation is solved based on the mapping relation between the two-dimensional coordinates and the three-dimensional coordinates of the pixel level, and the pose estimation precision is improved.

Reference is made herein to various exemplary embodiments. However, those skilled in the art will recognize that changes and modifications may be made to the exemplary embodiments without departing from the scope hereof. For example, the various operational steps, as well as the components used to perform the operational steps, may be implemented in differing ways depending upon the particular application or consideration of any number of cost functions associated with operation of the system (e.g., one or more steps may be deleted, modified or incorporated into other steps).

In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. Additionally, as will be appreciated by one skilled in the art, the principles herein may be reflected in a computer program product on a computer readable storage medium, which is pre-loaded with computer readable program code. Any tangible, non-transitory computer-readable storage medium may be used, including magnetic storage devices (hard disks, floppy disks, etc.), optical storage devices (CD-to-ROM, DVD, Blu-Ray discs, etc.), flash memory, and/or the like. These computer program instructions may be loaded onto a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions which execute on the computer or other programmable data processing apparatus create means for implementing the functions specified. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including means for implementing the function specified. The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified.

While the principles herein have been illustrated in various embodiments, many modifications of structure, arrangement, proportions, elements, materials, and components particularly adapted to specific environments and operative requirements may be employed without departing from the principles and scope of the present disclosure. The above modifications and other changes or modifications are intended to be included within the scope of this document.

The foregoing detailed description has been described with reference to various embodiments. However, one skilled in the art will recognize that various modifications and changes may be made without departing from the scope of the present disclosure. Accordingly, the disclosure is to be considered in an illustrative and not a restrictive sense, and all such modifications are intended to be included within the scope thereof. Also, advantages, other advantages, and solutions to problems have been described above with regard to various embodiments. However, the benefits, advantages, solutions to problems, and any element(s) that may cause any element(s) to occur or become more pronounced are not to be construed as a critical, required, or essential feature or element of any or all the claims. As used herein, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, system, article, or apparatus. Furthermore, the term "coupled," and any other variation thereof, as used herein, refers to a physical connection, an electrical connection, a magnetic connection, an optical connection, a communicative connection, a functional connection, and/or any other connection.

Those skilled in the art will recognize that many changes may be made to the details of the above-described embodiments without departing from the underlying principles of the invention. Accordingly, the scope of the invention should be determined only by the claims.

Claims

1. A pose estimation method of a target object, characterized by comprising:

acquiring an image to be processed;

2. The pose estimation method according to claim 1, wherein the view reconstruction model is trained by:

And three-dimensional coordinate mark value

Calculating the actual reconstruction error for each pixel

Using the predicted reconstruction errors of all pixels

And actual reconstruction error

Constructing a second loss function;

3. The pose estimation method according to claim 2, wherein the first loss function is

，

for a symmetric object, the first loss function is

，

the second loss function is

，

The third loss function is

，

the total loss function is

，

For a symmetric object, the total loss function is

，

4. The pose estimation method according to any one of claims 2 to 3, wherein the three-dimensional coordinate mark image is obtained by: and mapping points on the target object into pixels on the image plane according to the predicted transformation relation between the three-dimensional coordinates of the target object and the two-dimensional coordinates of the image plane, normalizing the three-dimensional coordinates of the target object to be used as RGB values of corresponding pixels on the image plane, and thus obtaining the three-dimensional coordinate marked image.

5. The pose estimation method according to claim 1, wherein the view reconstruction model is a self-encoder structure including an encoder and a decoder connected by one or more fully-connected layers, and wherein outputs of several layers in the encoder and outputs of symmetrical layers in the decoder are channel-spliced.

6. The pose estimation method according to claim 1, wherein the calculating an equivalent axis angle and an equivalent rotation axis from the transformation matrix includes:

the equivalent shaft angle is calculated according to the following formula:

，

the equivalent rotation axis is calculated according to the following formula:

，

。

7. a pose estimation system of a target object, characterized by comprising:

the image acquisition module is used for acquiring an image to be processed;

8. The pose estimation system of claim 7, further comprising a view model training module to train the view reconstruction model by:

And three-dimensional coordinate mark value

Calculating the actual reconstruction error for each pixel

Using the predicted reconstruction errors of all pixels

And actual reconstruction error

Constructing a second loss function;

9. A robot, comprising:

a camera for taking an image to be processed including a target object;

a processor, connected to the camera and the mechanical arm, for acquiring an image to be processed by the camera, obtaining the pose of the target object by performing the pose estimation method according to any one of claims 1 to 6, and sending the pose to the mechanical arm so that the mechanical arm grasps the target object.

10. A computer-readable storage medium characterized in that the medium has stored thereon a program executable by a processor to implement the pose estimation method according to any one of claims 1 to 6.