CN113409384A - Pose estimation method and system of target object and robot - Google Patents


Info

Publication number
CN113409384A
CN113409384A (application number CN202110939214.5A)
Authority
CN
China
Prior art keywords
image
dimensional
target object
reconstruction
loss function
Prior art date
Legal status
Granted
Application number
CN202110939214.5A
Other languages
Chinese (zh)
Other versions
CN113409384B (en)
Inventor
杨洋 (Yang Yang)
Current Assignee
Shenzhen Huahan Weiye Technology Co ltd
Original Assignee
Shenzhen Huahan Weiye Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Huahan Weiye Technology Co ltd
Priority to CN202110939214.5A
Publication of CN113409384A
Application granted
Publication of CN113409384B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J19/00 Accessories fitted to manipulators, e.g. for monitoring, for viewing; Safety devices combined with or specially adapted for use in connection with manipulators
    • B25J19/02 Sensing devices
    • B25J19/04 Viewing devices
    • B25J9/00 Programme-controlled manipulators
    • B25J9/16 Programme controls
    • B25J9/1602 Programme controls characterised by the control system, structure, architecture
    • B25J9/161 Hardware, e.g. neural networks, fuzzy logic, interfaces, processor
    • B25J9/1656 Programme controls characterised by programming, planning systems for manipulators
    • B25J9/1661 Programme controls characterised by programming, planning systems for manipulators characterised by task planning, object-oriented languages
    • B25J9/1694 Programme controls characterised by use of sensors other than normal servo-feedback from position, speed or acceleration sensors, perception control, multi-sensor controlled systems, sensor fusion
    • B25J9/1697 Vision controlled systems

Abstract

A pose estimation method and system for a target object, and a robot, are provided. The pose estimation method comprises the following steps: acquiring an image to be processed; inputting the image to be processed into a target detection network to obtain a target detection result image; inputting the target detection result image into a trained view reconstruction model to obtain a three-dimensional reconstructed image, the three-dimensional reconstructed image comprising three channels that represent the three-dimensional coordinates corresponding to the pixels; calculating a transformation matrix from the two-dimensional coordinates of the pixels and the corresponding three-dimensional coordinates; and calculating an equivalent axis angle and an equivalent rotation axis from the transformation matrix, thereby obtaining the pose of the target object. Because the view reconstruction model is trained to establish the mapping between the two-dimensional coordinates of the pixels in the image and the corresponding three-dimensional coordinates, a three-dimensional reconstructed image covering the target object can be obtained through the view reconstruction model, yielding the three-dimensional coordinates corresponding to each pixel, so the pose estimation method is suitable for estimating the pose of objects that have low texture or reflective surfaces, or that are partially occluded.

Description

Pose estimation method and system of target object and robot
Technical Field
The invention relates to the technical field of machine vision, in particular to a pose estimation method and system for a target object and a robot.
Background
In the field of robotics, autonomous grabbing of target objects is a key capability of intelligent robots, and grabbing randomly scattered objects has long been key to making robots intelligent. The ability to grab scattered objects can be applied to scenarios such as part sorting, improving working efficiency. However, reprogramming current robots for a complex new grabbing task takes weeks, which makes reconfiguring modern manufacturing lines expensive and slow. In addition, robots are mostly deployed in specific environments and perform grabbing operations on specific, known objects; the prior art is still immature when a robot must autonomously determine the grabbing position of an unknown object placed in an arbitrary pose in an uncertain environment and the grabbing pose of its gripper. If a robot can autonomously grab scattered objects, the time spent on teaching and programming can be shortened, the flexibility and intelligence of automated manufacturing can be better realized, the current demand for multi-variety, small-batch production can be met, and manufacturing equipment can be updated quickly when products change. Pose recognition of scattered objects is an important step in controlling a robot to grab them.
Computer vision techniques occupy an important position in the perception of unstructured scenes by robots. Visual images are an effective means of acquiring real-world information: a visual perception algorithm extracts features of the operated object from the image, such as its position, angle and posture, and this information allows the robot to execute the corresponding operation and complete a specified task. For part sorting, scene data can be acquired with a vision sensor, but identifying the target object in the scene and estimating its position and orientation is a critical problem, which is essential for computing the robot's grabbing position and grabbing path. At present there are two main types of object pose estimation methods: estimation based on traditional point cloud or image analysis algorithms, and estimation based on deep learning through learned target detection and iterative pose refinement. The first type mainly identifies and matches the pose according to image or three-dimensional point cloud template information; its drawbacks are that a template must be built from captured images or CAD data for each object, multiple templates are needed for multiple parts, and the changeover period when the product model changes is long. Pose estimation here mainly means 6D pose estimation (three-dimensional position and three-dimensional orientation): local features extracted from the image are matched with features of a three-dimensional model of the object, and the 6D pose is obtained from the correspondence between two-dimensional and three-dimensional coordinates. However, these methods do not handle low-texture objects well, because only a few local features can be extracted. Similarly, most mainstream deep-learning pose estimation algorithms rely on information such as the color and texture of the object surface, whereas most parts in industrial production are low-texture objects and are easily affected by illumination, so the texture reflected in a two-dimensional image is not necessarily the real texture of the three-dimensional object surface; when the image resolution changes, the computed texture may deviate significantly and feature extraction becomes difficult, so such algorithms recognize low-texture parts and parts with reflective surfaces poorly. In practice the target object is also often partially occluded, which likewise makes it difficult to obtain local features or surface information such as color and texture. To handle low-texture objects there are two approaches: the first estimates the three-dimensional coordinates of object pixels or key points in the input image, establishing the correspondence between two-dimensional and three-dimensional coordinates from which the 6D pose can be estimated; the second discretizes the pose space and transforms the 6D pose estimation problem into a pose classification or pose regression problem. These methods can handle low-texture objects but have difficulty achieving high-precision pose estimation, and small errors in the classification or regression stage translate directly into pose mismatch.
Disclosure of Invention
The present application provides a pose estimation method and system for a target object, a robot, and a computer-readable storage medium, aiming to solve the problem that most existing pose estimation methods depend on information such as the color and texture of the object surface and therefore perform poorly on objects with low texture or reflective surfaces.
According to a first aspect, an embodiment provides a pose estimation method of a target object, including:
acquiring an image to be processed;
inputting the image to be processed into a target detection network to detect a target object in the image to be processed to obtain a target detection result image;
inputting the target detection result image into a pre-trained view reconstruction model to obtain a three-dimensional reconstruction image, wherein the three-dimensional reconstruction image comprises three channels and is used for representing three-dimensional coordinates corresponding to pixels;
calculating a transformation matrix according to the two-dimensional coordinates and the corresponding three-dimensional coordinates of the pixels;
and calculating an equivalent axis angle and an equivalent rotation axis from the transformation matrix, thereby obtaining the pose of the target object.
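The overall flow of the claimed method can be illustrated with a short sketch. This is not taken from the patent: the detector and view-reconstruction callables, the helper `solve_transform` (sketched further below) and all names are assumptions used only to show how the five steps chain together.

```python
# Illustrative pipeline sketch (names are hypothetical, not the patent's
# reference implementation): detect -> reconstruct per-pixel 3D coordinates ->
# solve the 2D-3D transform -> convert to equivalent axis angle and axis.
import numpy as np

def estimate_pose(image, detector, view_reconstructor, camera_matrix):
    # 1. Detect the target object and crop the detection result image.
    crop, (x0, y0) = detector(image)                 # assumed detector interface
    # 2. Reconstruct a 3-channel image of per-pixel 3D coordinates.
    coords_3d = view_reconstructor(crop)             # shape (H, W, 3)
    h, w, _ = coords_3d.shape
    # 3. Build 2D-3D correspondences (pixel coordinates in the full image).
    #    In practice a foreground mask would normally be applied here.
    us, vs = np.meshgrid(np.arange(w) + x0, np.arange(h) + y0)
    pts_2d = np.stack([us, vs], axis=-1).reshape(-1, 2).astype(np.float64)
    pts_3d = coords_3d.reshape(-1, 3).astype(np.float64)
    # 4. Solve for the transformation matrix (see the PnP/DLT sketch later on).
    R, t = solve_transform(pts_3d, pts_2d, camera_matrix)   # assumed helper
    # 5. Convert the rotation matrix to an equivalent axis angle and rotation axis.
    theta = np.arccos((np.trace(R) - 1.0) / 2.0)
    axis = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    axis = axis / (2.0 * np.sin(theta))
    return R, t, theta, axis
```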
In one embodiment, the view reconstruction model is trained by:
obtaining a sample image and a corresponding three-dimensional coordinate marker imageI GT
Inputting the sample image into the target detection network to detect a target object in the sample image to obtain a target detection result imageI src
The target detection result image is processedI src Inputting the image reconstruction model to obtain a three-dimensional reconstruction imageI 3D The three-dimensional reconstruction image comprises three channels and is used for representing predicted three-dimensional coordinates corresponding to pixels;
according to the corresponding predicted three-dimensional coordinates of each pixel
Figure 863110DEST_PATH_IMAGE001
And three-dimensional coordinate mark value
Figure 100002_DEST_PATH_IMAGE002
Calculating the actual reconstruction error for each pixel
Figure 807014DEST_PATH_IMAGE003
Constructing a first loss function using the actual reconstruction errors of all pixels, superscriptingiIs shown asiA plurality of pixels;
reconstructing the three-dimensional imageI 3D And the three-dimensional coordinate mark imageI GT Inputting the error values into a preset error regression discriminant network to obtain the predicted reconstruction error of each pixel
Figure 100002_DEST_PATH_IMAGE004
Using the predicted reconstruction errors of all pixels
Figure 508123DEST_PATH_IMAGE004
And actual reconstruction error
Figure 106594DEST_PATH_IMAGE005
Constructing a second loss function;
constructing a third loss function by using a result obtained by inputting the three-dimensional reconstruction image into the error regression judging network and a result obtained by inputting the three-dimensional coordinate marking image into the error regression judging network;
and constructing a total loss function by using a weighted sum of the first loss function, the second loss function and the third loss function, and training the view reconstruction model and the error regression discriminant network by using a back propagation algorithm according to the total loss function to obtain parameters of the view reconstruction model.
In one embodiment, the first loss function is
$$L_1 = \frac{1}{n}\Big(\lambda \sum_{i \in F} \big\lVert \hat{\mathbf{x}}^i - \bar{\mathbf{x}}^i \big\rVert + \sum_{i \notin F} \big\lVert \hat{\mathbf{x}}^i - \bar{\mathbf{x}}^i \big\rVert\Big)$$
where $n$ represents the number of pixels, $\lambda$ is a preset weight, and $F$ represents the set of pixels in the image that belong to the target object;
for a symmetric object, the first loss function is
$$L_1 = \frac{1}{n}\Big(\lambda \sum_{i \in F} \min_{p \in sym} \big\lVert \hat{\mathbf{x}}^i - R_p \bar{\mathbf{x}}^i \big\rVert + \sum_{i \notin F} \min_{p \in sym} \big\lVert \hat{\mathbf{x}}^i - R_p \bar{\mathbf{x}}^i \big\rVert\Big)$$
where $sym$ represents the set of all symmetric poses of the symmetric object, $p$ denotes the $p$-th symmetric pose, and $R_p$ denotes the transformation matrix of the $p$-th symmetric pose;
the second loss function is
$$L_2 = \frac{1}{n}\sum_{i=1}^{n} \big\lVert \hat{e}^i - e^i \big\rVert$$
the third loss function is
$$L_3 = \log J(I_{GT}) + \log\big(1 - J(G(I_{src}))\big)$$
where $J$ is the identifier of the error regression discrimination network, $G$ is the identifier of the view reconstruction model, $G(I_{src})$ represents the three-dimensional reconstructed image, $J(G(I_{src}))$ represents the result obtained by inputting the three-dimensional reconstructed image into the error regression discrimination network, and $J(I_{GT})$ represents the result obtained by inputting the three-dimensional coordinate label image into the error regression discrimination network;
the total loss function is
$$L = L_1 + \alpha L_2 + \beta L_3$$
and for a symmetric object the total loss function uses the symmetric form of $L_1$, where $\alpha$ and $\beta$ are preset weights.
In one embodiment, the three-dimensional coordinate label image is obtained by mapping points on the target object to pixels on the image plane according to the predicted transformation between the three-dimensional coordinates of the target object and the two-dimensional coordinates of the image plane, and normalizing the three-dimensional coordinates of the target object to serve as the RGB values of the corresponding pixels on the image plane, thereby obtaining the three-dimensional coordinate label image.
In one embodiment, the view reconstruction model is a self-encoder structure, and includes an encoder and a decoder connected by one or more fully-connected layers, and the outputs of several layers in the encoder and the outputs of symmetrical layers in the decoder are channel-spliced.
In one embodiment, calculating the equivalent axis angle and the equivalent rotation axis according to the transformation matrix includes:
calculating the equivalent axis angle according to
$$\theta = \arccos\Big(\frac{r_{11} + r_{22} + r_{33} - 1}{2}\Big)$$
and calculating the equivalent rotation axis according to
$$\mathbf{k} = \frac{1}{2\sin\theta}\begin{bmatrix} r_{32} - r_{23} \\ r_{13} - r_{31} \\ r_{21} - r_{12} \end{bmatrix}$$
where $r_{11}$, $r_{12}$, $r_{13}$, $r_{21}$, $r_{22}$, $r_{23}$, $r_{31}$, $r_{32}$, $r_{33}$ are the elements of the transformation matrix, specifically
$$R = \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix}$$
according to a second aspect, an embodiment provides a pose estimation system of a target object, including:
the image acquisition module is used for acquiring an image to be processed;
the target detection network is connected with the image acquisition module and is used for detecting a target object in the image to be processed to obtain a target detection result image;
the view reconstruction model is connected with the target detection network and used for calculating the target detection result image to obtain a three-dimensional reconstruction image, and the three-dimensional reconstruction image comprises three channels and is used for representing three-dimensional coordinates corresponding to pixels;
the transformation matrix calculation module is used for calculating a transformation matrix according to the two-dimensional coordinates of the pixels and the corresponding three-dimensional coordinates;
and the pose calculation module is used for calculating an equivalent axis angle and an equivalent rotation axis according to the transformation matrix, so as to obtain the pose of the target object.
In one embodiment, the pose estimation system of the target object further comprises a view model training module, and the view model training module is configured to train the view reconstruction model by:
obtaining a sample image and a corresponding three-dimensional coordinate marker imageI GT
Inputting the sample image into the target detection network to detect a target object in the sample image to obtain a target detection result imageI src
The target detection result image is processedI src Inputting the image reconstruction model to obtain a three-dimensional reconstruction imageI 3D The three-dimensional reconstruction image comprises three channels and is used for representing predicted three-dimensional coordinates corresponding to pixels;
according to the corresponding predicted three-dimensional coordinates of each pixel
Figure 507565DEST_PATH_IMAGE015
And three-dimensional coordinate mark value
Figure 100002_DEST_PATH_IMAGE016
Calculating the actual reconstruction error for each pixel
Figure DEST_PATH_IMAGE017
Constructing a first loss function using the actual reconstruction errors of all pixels, superscriptingiIs shown asiA plurality of pixels;
reconstructing the three-dimensional imageI 3D And the three-dimensional coordinate mark imageI GT Inputting the error values into a preset error regression discriminant network to obtain the predicted reconstruction error of each pixel
Figure 100002_DEST_PATH_IMAGE018
Using the predicted reconstruction errors of all pixels
Figure 599280DEST_PATH_IMAGE018
And actual reconstruction error
Figure DEST_PATH_IMAGE019
Constructing a second loss function;
constructing a third loss function by using a result obtained by inputting the three-dimensional reconstruction image into the error regression judging network and a result obtained by inputting the three-dimensional coordinate marking image into the error regression judging network;
and constructing a total loss function by using a weighted sum of the first loss function, the second loss function and the third loss function, and training the view reconstruction model and the error regression discriminant network by using a back propagation algorithm according to the total loss function to obtain parameters of the view reconstruction model.
According to a third aspect, there is provided in an embodiment a robot comprising:
a camera for taking an image to be processed including a target object;
a mechanical arm, the tail end of which is provided with a mechanical claw for grabbing the target object according to the pose of the target object;
and the processor is connected with the camera and the mechanical arm, and is used for acquiring an image to be processed through the camera, obtaining the pose of the target object by executing the pose estimation method of the first aspect, and sending the pose to the mechanical arm so that the mechanical arm grabs the target object.
According to a fourth aspect, an embodiment provides a computer-readable storage medium having a program stored thereon, the program being executable by a processor to implement the pose estimation method of the first aspect described above.
According to the pose estimation method and system for a target object, the robot and the computer-readable storage medium of the above embodiments, the problem of detecting a three-dimensional object and estimating its pose is decomposed into target detection in a two-dimensional image and pose estimation in three-dimensional space, so that one complicated problem is simplified into two simpler ones. Likewise, within pose estimation, the problem is decomposed into two processes: solving the mapping between the two-dimensional coordinates of the pixels in the image and the corresponding three-dimensional coordinates, and then estimating the pose of the target object. This again simplifies a complicated problem into two simpler ones, reduces the complexity of solving the pose estimation problem for the target object and improves operational efficiency. Because the view reconstruction model is trained to establish the mapping between the two-dimensional coordinates of the pixels in the image and the corresponding three-dimensional coordinates, a three-dimensional reconstructed image covering the target object can be obtained through the view reconstruction model, yielding the three-dimensional coordinates corresponding to each pixel, so the pose estimation method can handle objects that have low texture or reflective surfaces, or that are partially occluded. At the same time, the pose is solved from a pixel-level mapping between two-dimensional and three-dimensional coordinates, which helps improve pose estimation accuracy.
Drawings
FIG. 1 is a schematic diagram of a pose estimation method of a target object according to the present application;
FIG. 2 is a flow diagram of a pose estimation method of a target object in one embodiment;
FIG. 3 is a schematic structural diagram of a view reconstruction model according to an embodiment;
FIG. 4 is a flowchart illustrating training of a view reconstruction model according to an embodiment;
FIG. 5 is a diagram illustrating an exemplary structure of an error regression discriminant network;
FIG. 6 is a schematic diagram of the calculation of the transformation relationship between the camera coordinate system and the world coordinate system;
FIG. 7 is a schematic diagram of a pose estimation system for a target object according to an embodiment;
FIG. 8 is a schematic structural diagram of a robot in an embodiment.
Detailed Description
The present invention is described in further detail below with reference to the detailed description and the accompanying drawings, in which like elements in different embodiments are given like reference numerals. In the following description, numerous details are set forth to provide a better understanding of the present application. However, those skilled in the art will readily recognize that some of these features may be omitted, or replaced by other elements, materials or methods, in different cases. In some cases, certain operations related to the present application are not shown or described in the specification, in order to avoid the core of the present application being obscured by excessive description; a detailed description of these operations is not necessary for those skilled in the art, who can fully understand them from the description in the specification and the general technical knowledge in the field.
Furthermore, the features, operations or characteristics described in the specification may be combined in any suitable manner to form various embodiments. The steps or actions in the described methods may also be reordered or transposed in ways obvious to those skilled in the art. Therefore, the various orders in the specification and drawings are only for the purpose of clearly describing particular embodiments and do not imply a required order, unless it is otherwise stated that a certain order must be followed.
The ordinal numbers used herein for components, such as "first" and "second", are used only to distinguish the described objects and do not carry any sequential or technical meaning. The terms "connected" and "coupled", as used in this application, include both direct and indirect connection (coupling), unless otherwise specified.
The idea of the technical solution of the present application is to establish a mapping between two-dimensional image coordinates and three-dimensional coordinates, obtain the three-dimensional coordinates corresponding to the two-dimensional coordinates of each pixel in the image according to this mapping, and then use the correspondence between the two-dimensional and three-dimensional coordinates to obtain the transformation between the world coordinate system and the camera coordinate system. This transformation is solved mainly by computing a homography matrix; once it is obtained, the rotation and translation used for pose estimation are derived from it. The details are described below.
Fig. 1 and 2 show a general flow of a pose estimation method of a target object according to the present application, and referring to fig. 2, an embodiment of the pose estimation method of a target object includes steps 110 to 150, which are described in detail below.
Step 110: and acquiring an image to be processed. The scene where the target object is located can be shot by using imaging equipment such as a camera or a video camera to obtain an image to be processed including the target object, and the image is used for carrying out pose estimation on the target object subsequently. The target object can be a product on an industrial production line, a mechanical part in an article box, a tool on an operation table and the like.
Step 120: and inputting the image to be processed into a target detection network to detect a target object in the image to be processed to obtain a target detection result image.
In order to estimate the pose of the target object, the target object must first be identified in the image and its position obtained, so that it can be processed in a targeted manner. In the present application, the problem of detecting a three-dimensional object and estimating its pose is decomposed into target detection in a two-dimensional image and pose estimation in three-dimensional space, so that a complex problem is simplified into two simpler ones and the complexity of the solution is reduced. In this step, a target detection network is first used to perform target detection on the image to be processed, producing a target detection result image and yielding the position and category of the target object in the image to be processed, thereby achieving target detection on the two-dimensional image. The target detection network may use an existing target detection network structure, such as SSD, YOLO, Faster R-CNN or MobileNet.
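As one concrete illustration (not prescribed by the patent, which only requires some existing detector), this detection step could be implemented with torchvision's off-the-shelf Faster R-CNN; the score threshold and the cropping policy below are assumptions.

```python
# Possible implementation of the target detection step using an existing
# detector; the patent names SSD, YOLO, Faster R-CNN and MobileNet as options.
import torch
import torchvision

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

def detect_target(image_tensor, score_thresh=0.7):
    """image_tensor: float tensor of shape (3, H, W) with values in [0, 1]."""
    with torch.no_grad():
        pred = detector([image_tensor])[0]
    keep = pred["scores"] > score_thresh
    boxes, labels = pred["boxes"][keep], pred["labels"][keep]
    if len(boxes) == 0:
        return None
    # Crop the highest-scoring detection as the target detection result image.
    x0, y0, x1, y1 = boxes[0].round().int().tolist()
    crop = image_tensor[:, y0:y1, x0:x1]
    return crop, (x0, y0), labels[0].item()
```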
Step 130: and inputting the target detection result image into a pre-trained view reconstruction model to obtain a three-dimensional reconstruction image, wherein the three-dimensional reconstruction image comprises three channels and is used for representing three-dimensional coordinates corresponding to pixels.
Referring to fig. 1, in the present application the mapping between two-dimensional image coordinates and three-dimensional coordinates is established by constructing and training a view reconstruction model. The view reconstruction model reconstructs, from the two-dimensional image, the three-dimensional coordinates corresponding to each pixel, and outputs a three-dimensional reconstructed image with three channels representing those three-dimensional coordinates, the three channels corresponding to the three coordinate axes in space. The mapping is learned by training the view reconstruction model with sample images and the corresponding three-dimensional coordinate label images, where each three-dimensional coordinate label image is constructed in advance from the actual three-dimensional coordinates corresponding to the sample image. Through training, the view reconstruction model obtains a stable mapping that minimizes the error between the three-dimensional reconstruction result and the actual values as far as possible, so that the mapping between two-dimensional image coordinates and three-dimensional coordinates can also be established well for an unknown object.
Referring to fig. 3, in one embodiment the view reconstruction model may have a self-encoder structure comprising an encoder and a decoder. The encoder mainly consists of convolution, pooling and activation function operations, with convolution kernels of size 3x3 or 5x5 and a stride of 1; the decoder mainly consists of upsampling and convolution operations. The encoder and the decoder are connected by one or more fully-connected layers, and the outputs of several layers in the encoder are channel-concatenated with the outputs of the symmetric layers in the decoder, realizing the splicing of multi-scale feature maps. This allows the model to adapt to the receptive fields of both large and small objects, so that the mapping can be established and the pose estimated for large and small objects at the same time. It can be understood that, for the same object, the mapping between two-dimensional image coordinates and three-dimensional coordinates differs under different viewing angles; to adapt to the mappings under different viewing angles, the mapping cannot simply be represented by a linear transformation and must be expressed by a higher-order function or transformation. Therefore, in this embodiment, a self-encoder structure is used: through the Encoder-Decoder scheme, repeated convolution, pooling, activation, upsampling and concatenation of multi-scale feature maps finally reconstruct the adaptive higher-order function or transformation.
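A minimal sketch of such an encoder-decoder with channel concatenation between symmetric layers is shown below. Channel counts and depths are illustrative, and the patent's fully-connected bottleneck is replaced here by a convolutional block for brevity; none of this is the patent's reference implementation.

```python
# Sketch of the described self-encoder structure: convolutional encoder,
# upsampling + convolution decoder, and skip concatenation between symmetric
# encoder/decoder layers. The output has 3 channels for normalized (x, y, z).
import torch
import torch.nn as nn

class ViewReconstructionNet(nn.Module):
    def __init__(self):
        super().__init__()
        def conv_block(cin, cout):
            return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU())
        self.enc1, self.enc2, self.enc3 = conv_block(3, 32), conv_block(32, 64), conv_block(64, 128)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(128, 128)   # stands in for the FC bottleneck
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.dec3, self.dec2, self.dec1 = conv_block(128 + 128, 64), conv_block(64 + 64, 32), conv_block(32 + 32, 32)
        self.head = nn.Conv2d(32, 3, 1)          # 3 channels = normalized (x, y, z)

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        e3 = self.enc3(self.pool(e2))
        b = self.bottleneck(self.pool(e3))
        d3 = self.dec3(torch.cat([self.up(b), e3], dim=1))   # multi-scale skip concat
        d2 = self.dec2(torch.cat([self.up(d3), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up(d2), e1], dim=1))
        return torch.sigmoid(self.head(d1))                  # per-pixel 3D coordinates
```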
The three-dimensional coordinate label image used in training can be constructed by normalizing the actual three-dimensional coordinates and converting them into the three-channel color-space information of an image. The position data of the object in three-dimensional space can be characterized by three-dimensional point cloud coordinates $(x, y, z)$; the $(x, y, z)$ coordinates can be mapped into a normalized space and converted into color-space information $(R, G, B)$. If the transformation between the three-dimensional coordinates of the target object and the two-dimensional coordinates of the image plane is determined, the three-dimensional coordinates of the target object can be converted into two-dimensional coordinates on the image, and the corresponding color-space information $(R, G, B)$ can be stored at the corresponding pixels in the image, yielding a two-dimensional color image that carries the three-dimensional coordinate information of the object; the three channels R, G, B of the image correspond one-to-one to the three-dimensional coordinates $x$, $y$, $z$. This conversion maps the normalized three-dimensional coordinates of the target object directly to RGB values in color space without feature matching, which solves the difficulty of extracting features from target objects that have low texture, reflective surfaces or partial occlusion. Accordingly, points on the target object are mapped to pixels on the image plane according to the predicted transformation between the three-dimensional coordinates of the target object and the two-dimensional coordinates of the image plane, and the three-dimensional coordinates of the target object are then normalized and used as the RGB values of the corresponding pixels on the image plane, thereby obtaining a three-dimensional coordinate label image that represents the true values of the three-dimensional coordinates of the target object. It can be understood that, after training, the three-dimensional reconstructed image output by the view reconstruction model is a color image whose RGB values correspond to the normalized three-dimensional coordinates of the target object, thereby establishing the correspondence between the two-dimensional coordinates and the three-dimensional coordinates of each pixel.
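A sketch of how such a three-dimensional coordinate label image could be generated from a known object pose is given below; the interface (model_points, R, t, K) is an assumption, and occlusion handling (z-buffering) is omitted for simplicity.

```python
# Sketch of constructing the three-dimensional coordinate label image I_GT:
# project each object point to the image plane and store its normalized
# (x, y, z) as the (R, G, B) value of the corresponding pixel.
import numpy as np

def make_coordinate_label_image(model_points, R, t, K, height, width):
    """model_points: (N, 3) object points; R, t: object-to-camera pose; K: intrinsics."""
    # Normalize object coordinates into [0, 1] so they fit a color image.
    mins, maxs = model_points.min(axis=0), model_points.max(axis=0)
    rgb = (model_points - mins) / (maxs - mins + 1e-9)

    # Project the points into the image: s * (u, v, 1)^T = K (R X + t).
    cam = R @ model_points.T + t.reshape(3, 1)           # (3, N)
    uv = K @ cam
    uv = (uv[:2] / uv[2]).T                              # (N, 2) pixel coordinates

    label = np.zeros((height, width, 3), dtype=np.float32)
    u = np.clip(np.round(uv[:, 0]).astype(int), 0, width - 1)
    v = np.clip(np.round(uv[:, 1]).astype(int), 0, height - 1)
    label[v, u] = rgb          # normalized 3D coordinates stored as RGB (no z-buffer)
    return label
```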
The data used in training the view reconstruction model are heterogeneous RGB-D data: the input images of the view reconstruction model carry RGB information, while the three-dimensional coordinate label images, since they encode the spatial coordinates $(x, y, z)$, can be regarded as depth (D) information. For such heterogeneous data there are two main ideas for extracting and using features. The first inputs the color-space RGB information and the depth D information into different networks, extracts color-space feature information and depth-space feature information separately, and then fuses the two kinds of features. The second first inputs the RGB information to obtain the approximate position of the target object, and then uses the position obtained from the RGB information as mask information to guide the feature extraction of the depth D information. The present application adopts the second idea for its concrete implementation.
The following describes a training process of the view reconstruction model in detail, fig. 1 and 4 show an overall process of training the view reconstruction model, please refer to fig. 4, the training process of the view reconstruction model includes steps 131 to 137, which is described in detail below.
Step 131: obtaining a sample image and a corresponding three-dimensional coordinate label image $I_{GT}$. The sample image can be obtained by photographing the target object in different scenes with an imaging device such as a camera or video camera.
Step 132: inputting the sample image into the target detection network to detect the target object in the sample image, obtaining a target detection result image $I_{src}$ and the position of the target object.
Step 133: inputting the target detection result image $I_{src}$ into the view reconstruction model to obtain a three-dimensional reconstructed image $I_{3D}$. As described above, the three-dimensional reconstructed image comprises three channels, which here represent the predicted three-dimensional coordinates corresponding to the pixels.
Step 134: calculating, for each pixel, the actual reconstruction error $e^i = \lVert \hat{\mathbf{x}}^i - \bar{\mathbf{x}}^i \rVert$ from the predicted three-dimensional coordinates $\hat{\mathbf{x}}^i$ and the three-dimensional coordinate label value $\bar{\mathbf{x}}^i$ of the pixel, and constructing a first loss function from the actual reconstruction errors of all pixels, the superscript $i$ denoting the $i$-th pixel. The first loss function represents the difference between the three-dimensional coordinates reconstructed by the view reconstruction model and their actual values. The pixels of the foreground part of the target detection result image $I_{src}$ (i.e. the pixels belonging to the target object) should have a larger influence on training than the pixels of the background part, so the foreground and background pixels can be given different weights, and the first loss function can be
$$L_1 = \frac{1}{n}\Big(\lambda \sum_{i \in F} \big\lVert \hat{\mathbf{x}}^i - \bar{\mathbf{x}}^i \big\rVert + \sum_{i \notin F} \big\lVert \hat{\mathbf{x}}^i - \bar{\mathbf{x}}^i \big\rVert\Big)$$
where $n$ represents the number of pixels, $\lambda$ is a preset weight, and $F$ is the mask information of the target object, i.e. the set of pixels in the image previously labeled as belonging to the target object.
For a symmetric object, the first loss function is
$$L_1 = \frac{1}{n}\Big(\lambda \sum_{i \in F} \min_{p \in sym} \big\lVert \hat{\mathbf{x}}^i - R_p \bar{\mathbf{x}}^i \big\rVert + \sum_{i \notin F} \min_{p \in sym} \big\lVert \hat{\mathbf{x}}^i - R_p \bar{\mathbf{x}}^i \big\rVert\Big)$$
where $sym$ represents the set of all symmetric poses of the symmetric object, $p$ denotes the $p$-th symmetric pose, and $R_p$ denotes the transformation matrix of the $p$-th symmetric pose. For example, a cube-shaped object with three axes of symmetry has three symmetric poses.
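A sketch of this first loss, assuming the per-pixel norm form reconstructed above, with a foreground weight and an optional minimum over symmetric poses; tensor layouts and the weight value are assumptions, not the patent's specification.

```python
# Sketch of the first (reconstruction) loss: per-pixel coordinate error,
# weighted by lambda on the foreground mask F; for symmetric objects the error
# is the minimum over all symmetric transforms R_p applied to the label.
import torch

def first_loss(pred_xyz, gt_xyz, fg_mask, lam=2.0, sym_rotations=None):
    """pred_xyz, gt_xyz: (B, 3, H, W); fg_mask: (B, 1, H, W) float tensor with
    1 for target-object pixels and 0 otherwise; sym_rotations: optional (S, 3, 3)."""
    if sym_rotations is None:
        err = torch.norm(pred_xyz - gt_xyz, dim=1, keepdim=True)          # (B,1,H,W)
    else:
        # Apply each symmetry transform to the label coordinates and keep the
        # smallest per-pixel error.
        gt = gt_xyz.permute(0, 2, 3, 1).unsqueeze(-1)                      # (B,H,W,3,1)
        gt_sym = torch.einsum("sij,bhwjk->sbhwik", sym_rotations, gt)      # (S,B,H,W,3,1)
        diff = pred_xyz.permute(0, 2, 3, 1).unsqueeze(0).unsqueeze(-1) - gt_sym
        err = torch.norm(diff.squeeze(-1), dim=-1).min(dim=0).values.unsqueeze(1)
    weights = fg_mask * (lam - 1.0) + 1.0        # lambda on foreground, 1 elsewhere
    return (weights * err).sum() / err.numel()
```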
Step 135: inputting the three-dimensional reconstructed image $I_{3D}$ and the three-dimensional coordinate label image $I_{GT}$ into a preset error regression discrimination network to obtain the predicted reconstruction error $\hat{e}^i$ of each pixel, and constructing a second loss function from the predicted reconstruction errors $\hat{e}^i$ and the actual reconstruction errors $e^i$ of all pixels; the second loss function can be
$$L_2 = \frac{1}{n}\sum_{i=1}^{n} \big\lVert \hat{e}^i - e^i \big\rVert$$
The error regression discrimination network is used to evaluate the quality of the three-dimensional coordinate reconstruction produced by the view reconstruction model: it can distinguish the difference between the three-dimensional reconstructed image and the three-dimensional coordinate label image and stands in an adversarial relationship with the view reconstruction model. If the error of the view reconstruction model increases during training, the discrimination network provides feedback that pushes the view reconstruction model in the direction of reducing the error, thereby improving the quality of the three-dimensional coordinate reconstruction. Referring to fig. 5, the error regression discrimination network may include several convolution-pooling layers.
Step 136: constructing a third loss function from the result obtained by inputting the three-dimensional reconstructed image into the error regression discrimination network and the result obtained by inputting the three-dimensional coordinate label image into the error regression discrimination network; the third loss function represents the discrimination capability of the error regression discrimination network. The result for the three-dimensional reconstructed image and the result for the three-dimensional coordinate label image are considered separately; the last layer of the network may be a softmax function, so each result is a value between 0 and 1. The goal of the view reconstruction model is to make its output three-dimensional reconstructed image approximate the three-dimensional coordinate label image, so that the two results produced by the error regression discrimination network become very close. The third loss function can be
$$L_3 = \log J(I_{GT}) + \log\big(1 - J(G(I_{src}))\big)$$
where $J$ is the identifier of the error regression discrimination network, $G$ is the identifier of the view reconstruction model, $G(I_{src})$ represents the three-dimensional reconstructed image, $J(G(I_{src}))$ represents the result obtained by inputting the three-dimensional reconstructed image into the error regression discrimination network, and $J(I_{GT})$ represents the result obtained by inputting the three-dimensional coordinate label image into the error regression discrimination network.
Step 137: constructing a total loss function as a weighted sum of the first, second and third loss functions, and training the view reconstruction model and the error regression discrimination network with a back-propagation algorithm according to the total loss function, finally obtaining the parameters of the view reconstruction model. The error regression discrimination network is trained to distinguish the difference as well as possible, while the view reconstruction model is trained to reconstruct three-dimensional coordinates as close to the actual values as possible. The total loss function can be
$$L = L_1 + \alpha L_2 + \beta L_3$$
and for a symmetric object the total loss function uses the symmetric form of $L_1$ above, where $\alpha$ and $\beta$ are preset weights.
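A sketch of one adversarial training step with the reconstructed total loss $L = L_1 + \alpha L_2 + \beta L_3$ is given below. It assumes the error regression discrimination network J returns both a per-pixel error map and a real/fake score; that interface, the optimizers and the hyper-parameters are assumptions, not the patent's implementation.

```python
# Sketch of alternating updates: the discriminator J regresses per-pixel errors
# (second loss) and scores real vs. reconstructed images (third loss); the view
# reconstruction model G is then updated with the weighted total loss.
import torch
import torch.nn.functional as F

def train_step(G, J, opt_G, opt_J, img_src, gt_xyz, fg_mask, alpha=0.5, beta=0.1):
    # --- Error regression discrimination network update ---
    opt_J.zero_grad()
    with torch.no_grad():
        pred_xyz = G(img_src)
    e_actual = torch.norm(pred_xyz - gt_xyz, dim=1, keepdim=True)
    e_pred, score_fake = J(pred_xyz)        # assumed: J returns (error map, score)
    _, score_real = J(gt_xyz)
    l2 = F.l1_loss(e_pred, e_actual)                                         # second loss
    l3 = -(torch.log(score_real + 1e-8) + torch.log(1 - score_fake + 1e-8)).mean()  # third loss
    (l2 + l3).backward()
    opt_J.step()

    # --- View reconstruction model update ---
    opt_G.zero_grad()
    pred_xyz = G(img_src)
    l1 = first_loss(pred_xyz, gt_xyz, fg_mask)          # from the sketch above
    e_pred_g, score_fake_g = J(pred_xyz)
    l2_g = F.l1_loss(e_pred_g, torch.norm(pred_xyz - gt_xyz, dim=1, keepdim=True))
    l_adv = -torch.log(score_fake_g + 1e-8).mean()      # adversarial feedback
    (l1 + alpha * l2_g + beta * l_adv).backward()
    opt_G.step()
    return l1.item(), l2.item(), l3.item()
```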
Steps 140 to 150 are described below.
Step 140: and calculating a transformation matrix according to the two-dimensional coordinates and the corresponding three-dimensional coordinates of the pixels.
The processing in step 130 yields, for each two-dimensional coordinate point $(u_i, v_i)$ in the image, the corresponding three-dimensional coordinate point $(x_i, y_i, z_i)$ in the world coordinate system. Using the point pairs formed by the two-dimensional points and the corresponding three-dimensional points, the transformation relation between the world coordinate system and the camera coordinate system can be obtained, from which the transformation matrix used in the subsequent calculation of the equivalent axis angle and the equivalent rotation axis is derived. Referring to fig. 6, a two-dimensional point (2D point for short) $m = (u, v)$ in the image corresponds to a three-dimensional point (3D point for short) $M = (x, y, z)$, and the transformation relation between the world coordinate system and the camera coordinate system can be expressed by a rotation matrix $R$ and a translation vector $t$ as $[R \mid t]$. The homogeneous coordinates of the 3D point in the world coordinate system are written as $\tilde{M} = (x, y, z, 1)^{T}$, the homogeneous coordinates of the 2D point in the image coordinate system are written as $\tilde{m} = (u, v, 1)^{T}$, and the calibrated intrinsic parameters of the camera are
$$K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$$
Then the projection from a 3D point to a 2D point can be expressed as
$$s\,\tilde{m} = K\,[R \mid t]\,\tilde{M}$$
where $s$ is a scale factor. In principle $[R \mid t]$ has 6 degrees of freedom: although the rotation matrix $R$ has 9 parameters, it has only 3 degrees of freedom because of its orthogonality constraints. In the calculation, the orthogonality constraint of $R$ can be ignored at first; writing
$$P = K\,[R \mid t] = \begin{bmatrix} p_{11} & p_{12} & p_{13} & p_{14} \\ p_{21} & p_{22} & p_{23} & p_{24} \\ p_{31} & p_{32} & p_{33} & p_{34} \end{bmatrix}$$
with 12 unknown parameters, the projection equation becomes
$$s\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} p_{11} & p_{12} & p_{13} & p_{14} \\ p_{21} & p_{22} & p_{23} & p_{24} \\ p_{31} & p_{32} & p_{33} & p_{34} \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}$$
Expanding gives the equation system
$$\begin{cases} s\,u = p_{11}x + p_{12}y + p_{13}z + p_{14} \\ s\,v = p_{21}x + p_{22}y + p_{23}z + p_{24} \\ s = p_{31}x + p_{32}y + p_{33}z + p_{34} \end{cases}$$
Eliminating the scale factor $s$ and writing the result in matrix form yields, for each 3D-2D point pair,
$$\begin{bmatrix} x & y & z & 1 & 0 & 0 & 0 & 0 & -ux & -uy & -uz & -u \\ 0 & 0 & 0 & 0 & x & y & z & 1 & -vx & -vy & -vz & -v \end{bmatrix}\mathbf{p} = \mathbf{0}$$
where $\mathbf{p} = (p_{11}, \ldots, p_{34})^{T}$. Since one 3D-2D point pair provides two equations, when the number of point pairs $N \ge 6$ a system of the form $A\mathbf{p} = \mathbf{0}$ is obtained, which can be solved by SVD (Singular Value Decomposition) to recover the rotation matrix $R$; the rotation matrix $R$ is used as the transformation matrix for the subsequent calculation of the equivalent axis angle and the equivalent rotation axis. In practical applications, the transformation matrix can be solved with the RANSAC algorithm: $N$ point pairs are selected arbitrarily as initial point pairs to compute a transformation matrix, which is then iteratively optimized while its error is evaluated, until the error is smaller than a set threshold.
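In practice the DLT/SVD/RANSAC procedure above can also be delegated to an existing PnP solver; the sketch below uses OpenCV's RANSAC-based solver as a stand-in, with an assumed pinhole camera model and no lens distortion.

```python
# Sketch of solving the 2D-3D transformation with OpenCV's RANSAC-based PnP
# solver; threshold and iteration count are illustrative assumptions.
import cv2
import numpy as np

def solve_transform(pts_3d, pts_2d, camera_matrix):
    """pts_3d: (N, 3) world points, pts_2d: (N, 2) pixel points, N >= 6."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts_3d.astype(np.float64),
        pts_2d.astype(np.float64),
        camera_matrix.astype(np.float64),
        distCoeffs=None,
        reprojectionError=3.0,        # error threshold in pixels
        iterationsCount=100,
    )
    if not ok:
        raise RuntimeError("PnP failed")
    R, _ = cv2.Rodrigues(rvec)        # rotation vector -> rotation matrix
    return R, tvec.reshape(3)
```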
Because the two-dimensional coordinates and the corresponding three-dimensional coordinates of each pixel are obtained in step 130, the transformation matrix is calculated in this step from a pixel-level mapping between two-dimensional and three-dimensional coordinates, which helps improve the accuracy of pose estimation.
Step 150: calculating an equivalent axis angle and an equivalent rotation axis according to the transformation matrix, thereby obtaining the pose of the target object.
The obtained transformation matrix can be expressed as
$$R = \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix}$$
Then the equivalent axis angle is
$$\theta = \arccos\Big(\frac{r_{11} + r_{22} + r_{33} - 1}{2}\Big)$$
and the equivalent rotation axis is
$$\mathbf{k} = \frac{1}{2\sin\theta}\begin{bmatrix} r_{32} - r_{23} \\ r_{13} - r_{31} \\ r_{21} - r_{12} \end{bmatrix}$$
It can be understood that once the equivalent axis angle and the equivalent rotation axis are obtained, the pose of the target object has been estimated, and the robot can grab the target object according to this pose.
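A small sketch of this axis-angle conversion using the formulas above (with a guard for angles near 0 or pi, which the patent does not discuss):

```python
# Convert a rotation matrix R into the equivalent axis angle and rotation axis.
import numpy as np

def axis_angle_from_rotation(R, eps=1e-8):
    theta = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    if np.sin(theta) < eps:                 # theta near 0 or pi: axis ill-defined
        return theta, np.array([0.0, 0.0, 1.0])
    axis = np.array([R[2, 1] - R[1, 2],
                     R[0, 2] - R[2, 0],
                     R[1, 0] - R[0, 1]]) / (2.0 * np.sin(theta))
    return theta, axis
```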
Referring to fig. 7, in an embodiment, the pose estimation system of the target object includes an image acquisition module 11, a target detection network 12, a view reconstruction model 13, a transformation matrix calculation module 14, and a pose calculation module 15, which are respectively described below.
The image acquisition module 11 is configured to acquire an image to be processed, where the image to be processed may be obtained by shooting a scene where a target object is located with an imaging device such as a camera or a video camera, and is used to perform pose estimation on the target object subsequently. The target object can be a product on an industrial production line, a mechanical part in an article box, a tool on an operation table and the like.
The target detection network 12 is connected to the image acquisition module 11 and is configured to detect the target object in the image to be processed, obtain a target detection result image, and obtain the position and category of the target object. The target detection network may use an existing target detection network structure, such as SSD, YOLO, Faster R-CNN or MobileNet.
The view reconstruction model 13 is connected to the target detection network 12, and is configured to calculate a target detection result image to obtain a three-dimensional reconstruction image, where the three-dimensional reconstruction image includes three channels and is used to represent three-dimensional coordinates corresponding to pixels.
The view reconstruction model 13 is configured to establish the mapping between two-dimensional coordinates and three-dimensional coordinates and to reconstruct, from the two-dimensional image, the three-dimensional coordinates corresponding to each pixel. The view reconstruction model 13 needs to be trained: the mapping is learned by training the model with sample images and the corresponding three-dimensional coordinate label images, where each label image is constructed in advance from the actual three-dimensional coordinates corresponding to the sample image. Through training, the view reconstruction model 13 obtains a stable mapping that minimizes the error between the three-dimensional reconstruction result and the actual values as far as possible, so that the mapping between two-dimensional image coordinates and three-dimensional coordinates can also be established well for an unknown object.
Referring to fig. 3, in one embodiment the view reconstruction model 13 may have a self-encoder structure comprising an encoder and a decoder. The encoder mainly consists of convolution, pooling and activation function operations, with convolution kernels of size 3x3 or 5x5 and a stride of 1; the decoder mainly consists of upsampling and convolution operations. The encoder and the decoder are connected by one or more fully-connected layers, and the outputs of several layers in the encoder are channel-concatenated with the outputs of the symmetric layers in the decoder, realizing the splicing of multi-scale feature maps, so that the model can adapt to the receptive fields of both large and small objects and the mapping can be established and the pose estimated for large and small objects at the same time. As explained above, the mapping between two-dimensional image coordinates and three-dimensional coordinates differs under different viewing angles and cannot simply be represented by a linear transformation; with the Encoder-Decoder scheme, repeated convolution, pooling, activation, upsampling and concatenation of multi-scale feature maps finally reconstruct the adaptive higher-order function or transformation.
The three-dimensional coordinate label image used in training can be constructed by normalizing the actual three-dimensional coordinates and converting them into the three-channel color-space information of an image. The position data of the object in three-dimensional space can be characterized by three-dimensional point cloud coordinates $(x, y, z)$, which can be mapped into a normalized space and converted into color-space information $(R, G, B)$. If the transformation between the three-dimensional coordinates of the target object and the two-dimensional coordinates of the image plane is determined, the three-dimensional coordinates of the target object can be converted into two-dimensional coordinates on the image and the corresponding $(R, G, B)$ values stored at the corresponding pixels, yielding a two-dimensional color image that carries the three-dimensional coordinate information of the object, with the three channels R, G, B corresponding one-to-one to the coordinates $x$, $y$, $z$. This conversion maps the normalized three-dimensional coordinates directly to RGB values without feature matching and thus overcomes the difficulty of extracting features from objects with low texture, reflective surfaces or partial occlusion. Accordingly, points on the target object are mapped to pixels on the image plane according to the predicted transformation between the three-dimensional coordinates of the target object and the two-dimensional coordinates of the image plane, and the three-dimensional coordinates are then normalized and used as the RGB values of the corresponding pixels, thereby obtaining a three-dimensional coordinate label image that represents the true values of the three-dimensional coordinates of the target object. After training, the three-dimensional reconstructed image output by the view reconstruction model 13 is a color image whose RGB values correspond to the normalized three-dimensional coordinates of the target object, thereby establishing the correspondence between the two-dimensional and three-dimensional coordinates of each pixel.
The pose estimation system of the target object may further include a view model training module 16 for training the view reconstruction model 13. The training process mainly includes: obtaining a sample image and a corresponding three-dimensional coordinate label image $I_{GT}$; inputting the sample image into the target detection network to detect the target object in the sample image, obtaining a target detection result image $I_{src}$ and the position of the target object; inputting the target detection result image $I_{src}$ into the view reconstruction model to obtain a three-dimensional reconstructed image $I_{3D}$, the three-dimensional reconstructed image comprising three channels that represent the predicted three-dimensional coordinates corresponding to the pixels; calculating, for each pixel, the actual reconstruction error $e^i = \lVert \hat{\mathbf{x}}^i - \bar{\mathbf{x}}^i \rVert$ from the predicted three-dimensional coordinates $\hat{\mathbf{x}}^i$ and the three-dimensional coordinate label value $\bar{\mathbf{x}}^i$, and constructing a first loss function from the actual reconstruction errors of all pixels, which can be
$$L_1 = \frac{1}{n}\Big(\lambda \sum_{i \in F} \big\lVert \hat{\mathbf{x}}^i - \bar{\mathbf{x}}^i \big\rVert + \sum_{i \notin F} \big\lVert \hat{\mathbf{x}}^i - \bar{\mathbf{x}}^i \big\rVert\Big)$$
where $n$ represents the number of pixels, $\lambda$ is a preset weight, and $F$ is the mask information of the target object, i.e. the set of pixels in the image previously labeled as belonging to the target object; for a symmetric object the first loss function is
$$L_1 = \frac{1}{n}\Big(\lambda \sum_{i \in F} \min_{p \in sym} \big\lVert \hat{\mathbf{x}}^i - R_p \bar{\mathbf{x}}^i \big\rVert + \sum_{i \notin F} \min_{p \in sym} \big\lVert \hat{\mathbf{x}}^i - R_p \bar{\mathbf{x}}^i \big\rVert\Big)$$
where $sym$ represents the set of all symmetric poses of the symmetric object, $p$ denotes the $p$-th symmetric pose, and $R_p$ denotes the transformation matrix of the $p$-th symmetric pose;
inputting the three-dimensional reconstructed image $I_{3D}$ and the three-dimensional coordinate label image $I_{GT}$ into a preset error regression discrimination network to obtain the predicted reconstruction error $\hat{e}^i$ of each pixel, and constructing a second loss function from the predicted reconstruction errors $\hat{e}^i$ and the actual reconstruction errors $e^i$; referring to fig. 5, the error regression discrimination network may include several convolution-pooling layers, and the second loss function can be
$$L_2 = \frac{1}{n}\sum_{i=1}^{n} \big\lVert \hat{e}^i - e^i \big\rVert$$
constructing a third loss function from the result obtained by inputting the three-dimensional reconstructed image into the error regression discrimination network and the result obtained by inputting the three-dimensional coordinate label image into the error regression discrimination network, where the two results are considered separately and, since the last layer of the network may be a softmax function, each result is a value between 0 and 1, so the third loss function can be
$$L_3 = \log J(I_{GT}) + \log\big(1 - J(G(I_{src}))\big)$$
where $J$ is the identifier of the error regression discrimination network, $G$ is the identifier of the view reconstruction model, $G(I_{src})$ represents the three-dimensional reconstructed image, $J(G(I_{src}))$ represents the result obtained by inputting the three-dimensional reconstructed image into the error regression discrimination network, and $J(I_{GT})$ represents the result obtained by inputting the three-dimensional coordinate label image into the error regression discrimination network;
and constructing a total loss function as the weighted sum of the first, second and third loss functions and training the view reconstruction model and the error regression discrimination network with a back-propagation algorithm according to the total loss function, finally obtaining the parameters of the view reconstruction model; the total loss function can be
$$L = L_1 + \alpha L_2 + \beta L_3$$
and for a symmetric object the total loss function uses the symmetric form of $L_1$, where $\alpha$ and $\beta$ are preset weights.
For a detailed description of the training procedure of the view reconstruction model 13, reference may be made to the step 130 above, which is not described herein again.
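As an illustrative, non-limiting sketch, the combination of losses described above could be computed as follows. The array layout, the helper names such as reconstruction_loss, and the default weights lam, alpha and beta are assumptions made for this example rather than values defined by the present embodiment; a practical implementation would compute these quantities inside a deep learning framework so that gradients can propagate back to the view reconstruction model and the error regression discrimination network.

import numpy as np

def reconstruction_loss(I_3d, I_gt, mask, lam=3.0):
    """First loss: mean per-pixel error, with pixels inside the object mask F weighted by lam."""
    err = np.linalg.norm(I_3d - I_gt, axis=-1)          # e^i for every pixel, shape (H, W)
    n = err.size
    return (lam * err[mask].sum() + err[~mask].sum()) / n

def error_regression_loss(e_pred, e_actual):
    """Second loss: discrepancy between predicted and actual per-pixel reconstruction errors."""
    return np.mean((e_pred - e_actual) ** 2)

def adversarial_loss(j_real, j_fake):
    """Third loss: discriminator outputs in (0, 1) for the labelled image and the reconstruction."""
    eps = 1e-7                                          # avoid log(0)
    return np.log(j_real + eps) + np.log(1.0 - j_fake + eps)

def total_loss(l_rec, l_err, l_adv, alpha=1.0, beta=1.0):
    """Weighted sum of the three losses used to train the view reconstruction model."""
    return l_adv + alpha * l_rec + beta * l_err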
The transformation matrix calculation module 14 is configured to calculate a transformation matrix from the two-dimensional coordinates of the pixels and the corresponding three-dimensional coordinates. Through the calculation of the view reconstruction model 13, each two-dimensional coordinate point $(u_i, v_i)$ in the image has already been associated with a three-dimensional coordinate point $(x_i, y_i, z_i)$ in the world coordinate system; using the point pairs formed by these two-dimensional points and their corresponding three-dimensional points, the transformation relation between the world coordinate system and the camera coordinate system, $\left[ R \mid t \right]$, can be obtained, where $R$ is the rotation matrix and $t$ is the translation vector, and the rotation matrix $R$ serves as the transformation matrix for the subsequent calculation of the equivalent axis angle and equivalent rotation axis. For the specific calculation method, reference may be made to step 140 above; the transformation matrix can be solved with the RANSAC algorithm, i.e., $N$ point pairs are selected arbitrarily as initial point pairs to compute a transformation matrix, which is then iteratively optimized while its error is evaluated, until the error is smaller than a set threshold. The transformation relationship between the world coordinate system and the camera coordinate system can be solved according to the following formula:
$$ s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \left[ R \mid t \right] \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} $$

where $(x, y, z, 1)^{T}$ is the homogeneous coordinate of a 3D point in the world coordinate system, $(u, v, 1)^{T}$ is the homogeneous coordinate of the corresponding 2D point in the image coordinate system, $K$ is the intrinsic parameter matrix of the calibrated camera, and $s$ is a scale factor. The matrix $P = K\left[ R \mid t \right]$ has 12 unknown parameters, denoted as

$$ P = \begin{bmatrix} p_{11} & p_{12} & p_{13} & p_{14} \\ p_{21} & p_{22} & p_{23} & p_{24} \\ p_{31} & p_{32} & p_{33} & p_{34} \end{bmatrix} $$

so the above formula can be written as

$$ s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} p_{11} & p_{12} & p_{13} & p_{14} \\ p_{21} & p_{22} & p_{23} & p_{24} \\ p_{31} & p_{32} & p_{33} & p_{34} \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} $$

After expansion, the following equation set is obtained:

$$ \begin{cases} s\,u = p_{11}x + p_{12}y + p_{13}z + p_{14} \\ s\,v = p_{21}x + p_{22}y + p_{23}z + p_{24} \\ s = p_{31}x + p_{32}y + p_{33}z + p_{34} \end{cases} $$

Eliminating the scale factor $s$ and writing the result in matrix form gives, for each point pair,

$$ \begin{bmatrix} x & y & z & 1 & 0 & 0 & 0 & 0 & -ux & -uy & -uz & -u \\ 0 & 0 & 0 & 0 & x & y & z & 1 & -vx & -vy & -vz & -v \end{bmatrix} \boldsymbol{p} = \boldsymbol{0} $$

where $\boldsymbol{p} = (p_{11}, p_{12}, \ldots, p_{34})^{T}$. As can be seen from the above, one 3D-2D point pair provides two equations; when the number of point pairs $N \geq 6$, a system of the form $A\boldsymbol{p} = \boldsymbol{0}$ is obtained, which can be solved by SVD (Singular Value Decomposition) to obtain the rotation matrix $R$.
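As an illustrative sketch of the solution outlined above, the following code assembles the matrix $A$ from 3D-2D point pairs, solves $A\boldsymbol{p} = \boldsymbol{0}$ by SVD, and recovers the rotation from the calibrated intrinsics. The function names and the orthonormalization step are assumptions made for this example (the embodiment itself only specifies the SVD solution combined with RANSAC), and the RANSAC sampling and iterative refinement are omitted for brevity.

import numpy as np

def solve_projection_dlt(pts_3d, pts_2d):
    """Estimate the 3x4 projection matrix P = K[R|t] from N >= 6 3D-2D point pairs."""
    rows = []
    for (x, y, z), (u, v) in zip(pts_3d, pts_2d):
        # Each point pair contributes two rows of the homogeneous system A p = 0.
        rows.append([x, y, z, 1, 0, 0, 0, 0, -u * x, -u * y, -u * z, -u])
        rows.append([0, 0, 0, 0, x, y, z, 1, -v * x, -v * y, -v * z, -v])
    A = np.asarray(rows, dtype=float)
    _, _, vt = np.linalg.svd(A)
    return vt[-1].reshape(3, 4)          # right singular vector of the smallest singular value

def rotation_from_projection(P, K):
    """Recover the rotation matrix R (and translation t, up to scale) given the intrinsics K."""
    Rt = np.linalg.inv(K) @ P
    R, t = Rt[:, :3], Rt[:, 3]
    u, s, vt = np.linalg.svd(R)          # project onto the nearest orthonormal rotation matrix
    return u @ vt, t / s.mean()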
The pose calculation module 15 is configured to calculate an equivalent axis angle and an equivalent rotation axis according to the transformation matrix, so as to obtain a pose of the target object. The transformation matrix obtained by the transformation matrix calculation module 14 can be expressed as
$$ R = \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix} $$

Then the equivalent axis angle is

$$ \theta = \arccos\left( \frac{r_{11} + r_{22} + r_{33} - 1}{2} \right) $$

and the equivalent rotation axis is

$$ \boldsymbol{k} = \frac{1}{2\sin\theta} \begin{bmatrix} r_{32} - r_{23} \\ r_{13} - r_{31} \\ r_{21} - r_{12} \end{bmatrix} $$
It can be understood that once the equivalent axis angle and the equivalent rotation axis have been obtained, the pose of the target object has been estimated, and the robot can grasp the target object according to this pose.
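A minimal sketch of the axis-angle extraction performed by the pose calculation module 15, assuming the transformation matrix is supplied as a 3x3 NumPy array; the clipping and the degenerate-angle check are implementation details added here for robustness, not steps prescribed by the embodiment.

import numpy as np

def equivalent_axis_angle(R):
    """Return the equivalent axis angle and equivalent rotation axis of a 3x3 rotation matrix."""
    theta = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    if np.isclose(np.sin(theta), 0.0):
        # theta near 0 or pi: the axis is not uniquely defined by this formula.
        raise ValueError("degenerate rotation angle; handle separately")
    axis = np.array([R[2, 1] - R[1, 2],
                     R[0, 2] - R[2, 0],
                     R[1, 0] - R[0, 1]]) / (2.0 * np.sin(theta))
    return theta, axis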
On the basis of the above-described target object pose estimation method, the present application also provides a robot, which may include a camera 21, a processor 22 and a robot arm 23; please refer to FIG. 8.

The camera 21 is used to capture a to-be-processed image containing a target object, where the target object may be a product on an industrial production line, a mechanical part in an article box, a tool on an operation table, or the like. For example, the camera in FIG. 8 photographs the target object in the article box.
The processor 22 is connected to the camera 21 and the robot arm 23, and configured to acquire an image to be processed through the camera 21, obtain a pose parameter of the target object by performing the pose estimation method, and send the pose parameter to the robot arm 23 so that the robot arm 23 grips the target object, where the pose parameter may refer to an equivalent rotation axis and an equivalent axis angle.
The end of the mechanical arm 23 is provided with a mechanical claw 231, and when receiving the pose parameter of the target object sent by the processor 22, the mechanical arm 23 and the mechanical claw 231 move according to the pose parameter to grab the target object.
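The interaction between the camera 21, the processor 22 and the mechanical arm 23 can be pictured with the following control-loop sketch. The camera, arm and estimate_pose objects and their methods are hypothetical placeholders used only to illustrate the data flow; they are not interfaces defined by the present application.

def grasp_target(camera, arm, estimate_pose):
    """Acquire an image, estimate the target pose, and command the arm to grasp."""
    image = camera.capture()                  # to-be-processed image containing the target object
    theta, axis = estimate_pose(image)        # equivalent axis angle and equivalent rotation axis
    arm.move_to_pose(axis=axis, angle=theta)  # position the mechanical claw according to the pose
    arm.close_gripper()                       # grab the target object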
According to the target object pose estimation method and system and the robot of the above embodiments, the problem of detecting a three-dimensional object and estimating its pose is decomposed into a target detection problem in the two-dimensional image and a pose estimation problem in three-dimensional space, so that one complex problem is simplified into two simpler ones. In the pose estimation stage, the problem is further decomposed into two steps: first solving the mapping relation between the two-dimensional coordinates of the pixels in the image and the corresponding three-dimensional coordinates, and then estimating the pose of the target object; this likewise turns a complex problem into two simpler ones, reduces the complexity of solving the pose estimation problem of the target object and improves operation efficiency. The view reconstruction model is obtained through training and adopts a self-encoder structure for establishing the mapping relation between the two-dimensional coordinates of the pixels in the image and the corresponding three-dimensional coordinates; a three-dimensional reconstruction image covering the target object can be obtained through the view reconstruction model, giving the three-dimensional coordinate corresponding to each pixel. The view reconstruction model can therefore handle pose estimation for objects with low texture, reflective surfaces or partial occlusion, and can adapt to changes in the external environment and illumination; since it adopts multi-scale feature map splicing, it can estimate the poses of both small and large objects, and it has good environmental adaptability and cross-domain migration capability. Meanwhile, because the pose is solved from the pixel-level mapping relation between two-dimensional and three-dimensional coordinates, the pose estimation accuracy is improved.
Reference is made herein to various exemplary embodiments. However, those skilled in the art will recognize that changes and modifications may be made to the exemplary embodiments without departing from the scope hereof. For example, the various operational steps, as well as the components used to perform the operational steps, may be implemented in differing ways depending upon the particular application or consideration of any number of cost functions associated with operation of the system (e.g., one or more steps may be deleted, modified or incorporated into other steps).
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. Additionally, as will be appreciated by one skilled in the art, the principles herein may be reflected in a computer program product on a computer readable storage medium, which is pre-loaded with computer readable program code. Any tangible, non-transitory computer-readable storage medium may be used, including magnetic storage devices (hard disks, floppy disks, etc.), optical storage devices (CD-ROM, DVD, Blu-ray discs, etc.), flash memory, and/or the like. These computer program instructions may be loaded onto a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions which execute on the computer or other programmable data processing apparatus create means for implementing the functions specified. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including means for implementing the function specified. The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified.
While the principles herein have been illustrated in various embodiments, many modifications of structure, arrangement, proportions, elements, materials, and components particularly adapted to specific environments and operative requirements may be employed without departing from the principles and scope of the present disclosure. The above modifications and other changes or modifications are intended to be included within the scope of this document.
The foregoing detailed description has been described with reference to various embodiments. However, one skilled in the art will recognize that various modifications and changes may be made without departing from the scope of the present disclosure. Accordingly, the disclosure is to be considered in an illustrative and not a restrictive sense, and all such modifications are intended to be included within the scope thereof. Also, benefits, advantages, and solutions to problems have been described above with regard to various embodiments. However, the benefits, advantages, solutions to problems, and any element(s) that may cause any benefit, advantage, or solution to occur or become more pronounced are not to be construed as critical, required, or essential features or elements of any or all the claims. As used herein, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, system, article, or apparatus. Furthermore, the term "coupled," and any other variation thereof, as used herein, refers to a physical connection, an electrical connection, a magnetic connection, an optical connection, a communicative connection, a functional connection, and/or any other connection.
Those skilled in the art will recognize that many changes may be made to the details of the above-described embodiments without departing from the underlying principles of the invention. Accordingly, the scope of the invention should be determined only by the claims.

Claims (10)

1. A pose estimation method of a target object, characterized by comprising:
acquiring an image to be processed;
inputting the image to be processed into a target detection network to detect a target object in the image to be processed to obtain a target detection result image;
inputting the target detection result image into a pre-trained view reconstruction model to obtain a three-dimensional reconstruction image, wherein the three-dimensional reconstruction image comprises three channels and is used for representing three-dimensional coordinates corresponding to pixels;
calculating a transformation matrix according to the two-dimensional coordinates and the corresponding three-dimensional coordinates of the pixels;
and calculating an equivalent axis angle and an equivalent rotation axis according to the transformation matrix, thereby obtaining the pose of the target object.
2. The pose estimation method according to claim 1, wherein the view reconstruction model is trained by:
obtaining a sample image and a corresponding three-dimensional coordinate mark image $I_{GT}$;

inputting the sample image into the target detection network to detect a target object in the sample image, obtaining a target detection result image $I_{src}$;

inputting the target detection result image $I_{src}$ into the view reconstruction model to obtain a three-dimensional reconstruction image $I_{3D}$, the three-dimensional reconstruction image comprising three channels used for representing the predicted three-dimensional coordinates corresponding to the pixels;

calculating, according to the predicted three-dimensional coordinate $I_{3D}^{\,i}$ and the three-dimensional coordinate mark value $I_{GT}^{\,i}$ of each pixel, the actual reconstruction error $e^i$ of each pixel, and constructing a first loss function using the actual reconstruction errors of all pixels, the superscript $i$ denoting the $i$-th pixel;

inputting the three-dimensional reconstruction image $I_{3D}$ and the three-dimensional coordinate mark image $I_{GT}$ into a preset error regression discrimination network to obtain the predicted reconstruction error $\hat{e}^i$ of each pixel, and constructing a second loss function using the predicted reconstruction errors $\hat{e}^i$ and the actual reconstruction errors $e^i$ of all pixels;

constructing a third loss function by using a result obtained by inputting the three-dimensional reconstruction image into the error regression discrimination network and a result obtained by inputting the three-dimensional coordinate mark image into the error regression discrimination network;

and constructing a total loss function by using a weighted sum of the first loss function, the second loss function and the third loss function, and training the view reconstruction model and the error regression discrimination network by using a back propagation algorithm according to the total loss function to obtain parameters of the view reconstruction model.
3. The pose estimation method according to claim 2, wherein the first loss function is

$$ L_r = \frac{1}{n} \left( \lambda \sum_{i \in F} e^i + \sum_{i \notin F} e^i \right) $$

wherein $n$ represents the number of pixels, $\lambda$ is a preset weight, and $F$ represents the set of pixels in the image that belong to the target object;

for a symmetric object, the first loss function is

$$ L_r^{sym} = \min_{p \in sym} \frac{1}{n} \left( \lambda \sum_{i \in F} \left\| I_{3D}^{\,i} - R_p I_{GT}^{\,i} \right\| + \sum_{i \notin F} \left\| I_{3D}^{\,i} - R_p I_{GT}^{\,i} \right\| \right) $$

wherein $sym$ represents the set of all symmetric poses of the symmetric object, $p$ denotes the $p$-th symmetric pose, and $R_p$ denotes the transformation matrix of the $p$-th symmetric pose;

the second loss function is

$$ L_e = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{e}^i - e^i \right)^2 $$

the third loss function is

$$ L_{GAN} = \log J\!\left( I_{GT} \right) + \log\!\left( 1 - J\!\left( G\!\left( I_{src} \right) \right) \right) $$

wherein $J$ denotes the error regression discrimination network, $G$ denotes the view reconstruction model, $G(I_{src})$ represents the three-dimensional reconstruction image, $J(G(I_{src}))$ represents the result of inputting the three-dimensional reconstruction image into the error regression discrimination network, and $J(I_{GT})$ represents the result of inputting the three-dimensional coordinate mark image into the error regression discrimination network;

the total loss function is

$$ L = L_{GAN} + \alpha L_r + \beta L_e $$

for a symmetric object, the total loss function is

$$ L^{sym} = L_{GAN} + \alpha L_r^{sym} + \beta L_e $$

wherein $\alpha$ and $\beta$ are preset weights.
4. The pose estimation method according to any one of claims 2 to 3, wherein the three-dimensional coordinate mark image is obtained by: and mapping points on the target object into pixels on the image plane according to the predicted transformation relation between the three-dimensional coordinates of the target object and the two-dimensional coordinates of the image plane, normalizing the three-dimensional coordinates of the target object to be used as RGB values of corresponding pixels on the image plane, and thus obtaining the three-dimensional coordinate marked image.
5. The pose estimation method according to claim 1, wherein the view reconstruction model is a self-encoder structure including an encoder and a decoder connected by one or more fully-connected layers, and wherein outputs of several layers in the encoder and outputs of symmetrical layers in the decoder are channel-spliced.
6. The pose estimation method according to claim 1, wherein the calculating an equivalent axis angle and an equivalent rotation axis from the transformation matrix includes:
the equivalent axis angle is calculated according to the following formula:

$$ \theta = \arccos\left( \frac{r_{11} + r_{22} + r_{33} - 1}{2} \right) $$

the equivalent rotation axis is calculated according to the following formula:

$$ \boldsymbol{k} = \frac{1}{2\sin\theta} \begin{bmatrix} r_{32} - r_{23} \\ r_{13} - r_{31} \\ r_{21} - r_{12} \end{bmatrix} $$

wherein $r_{11}, r_{12}, r_{13}, r_{21}, r_{22}, r_{23}, r_{31}, r_{32}, r_{33}$ are the elements of the transformation matrix, specifically:

$$ R = \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix} $$
7. a pose estimation system of a target object, characterized by comprising:
the image acquisition module is used for acquiring an image to be processed;
the target detection network is connected with the image acquisition module and is used for detecting a target object in the image to be processed to obtain a target detection result image;
the view reconstruction model is connected with the target detection network and used for calculating the target detection result image to obtain a three-dimensional reconstruction image, and the three-dimensional reconstruction image comprises three channels and is used for representing three-dimensional coordinates corresponding to pixels;
the transformation matrix calculation module is used for calculating a transformation matrix according to the two-dimensional coordinates of the pixels and the corresponding three-dimensional coordinates;
and the pose calculation module is used for calculating to obtain an equivalent shaft angle and an equivalent rotating shaft according to the transformation matrix so as to obtain the pose of the target object.
8. The pose estimation system of claim 7, further comprising a view model training module to train the view reconstruction model by:
obtaining a sample image and a corresponding three-dimensional coordinate mark image $I_{GT}$;

inputting the sample image into the target detection network to detect a target object in the sample image, obtaining a target detection result image $I_{src}$;

inputting the target detection result image $I_{src}$ into the view reconstruction model to obtain a three-dimensional reconstruction image $I_{3D}$, the three-dimensional reconstruction image comprising three channels used for representing the predicted three-dimensional coordinates corresponding to the pixels;

calculating, according to the predicted three-dimensional coordinate $I_{3D}^{\,i}$ and the three-dimensional coordinate mark value $I_{GT}^{\,i}$ of each pixel, the actual reconstruction error $e^i$ of each pixel, and constructing a first loss function using the actual reconstruction errors of all pixels, the superscript $i$ denoting the $i$-th pixel;

inputting the three-dimensional reconstruction image $I_{3D}$ and the three-dimensional coordinate mark image $I_{GT}$ into a preset error regression discrimination network to obtain the predicted reconstruction error $\hat{e}^i$ of each pixel, and constructing a second loss function using the predicted reconstruction errors $\hat{e}^i$ and the actual reconstruction errors $e^i$ of all pixels;

constructing a third loss function by using a result obtained by inputting the three-dimensional reconstruction image into the error regression discrimination network and a result obtained by inputting the three-dimensional coordinate mark image into the error regression discrimination network;

and constructing a total loss function by using a weighted sum of the first loss function, the second loss function and the third loss function, and training the view reconstruction model and the error regression discrimination network by using a back propagation algorithm according to the total loss function to obtain parameters of the view reconstruction model.
9. A robot, comprising:
a camera for taking an image to be processed including a target object;
the tail end of the mechanical arm is provided with a mechanical claw which is used for grabbing the target object according to the pose of the target object;
a processor, connected to the camera and the mechanical arm, for acquiring an image to be processed by the camera, obtaining the pose of the target object by performing the pose estimation method according to any one of claims 1 to 6, and sending the pose to the mechanical arm so that the mechanical arm grasps the target object.
10. A computer-readable storage medium characterized in that the medium has stored thereon a program executable by a processor to implement the pose estimation method according to any one of claims 1 to 6.
CN202110939214.5A 2021-08-17 2021-08-17 Pose estimation method and system of target object and robot Active CN113409384B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110939214.5A CN113409384B (en) 2021-08-17 2021-08-17 Pose estimation method and system of target object and robot

Publications (2)

Publication Number Publication Date
CN113409384A true CN113409384A (en) 2021-09-17
CN113409384B CN113409384B (en) 2021-11-30

Family

ID=77688522

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110939214.5A Active CN113409384B (en) 2021-08-17 2021-08-17 Pose estimation method and system of target object and robot

Country Status (1)

Country Link
CN (1) CN113409384B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105928493A (en) * 2016-04-05 2016-09-07 王建立 Binocular vision three-dimensional mapping system and method based on UAV
CN108038902A (en) * 2017-12-07 2018-05-15 合肥工业大学 A kind of high-precision three-dimensional method for reconstructing and system towards depth camera
US20190197196A1 (en) * 2017-12-26 2019-06-27 Seiko Epson Corporation Object detection and tracking
CN110355755A (en) * 2018-12-15 2019-10-22 深圳铭杰医疗科技有限公司 Robot hand-eye system calibration method, apparatus, equipment and storage medium
CN110009722A (en) * 2019-04-16 2019-07-12 成都四方伟业软件股份有限公司 Three-dimensional rebuilding method and device
CN110544297A (en) * 2019-08-06 2019-12-06 北京工业大学 Three-dimensional model reconstruction method for single image
CN111161349A (en) * 2019-12-12 2020-05-15 中国科学院深圳先进技术研究院 Object attitude estimation method, device and equipment
CN111589138A (en) * 2020-05-06 2020-08-28 腾讯科技(深圳)有限公司 Action prediction method, device, equipment and storage medium
CN112767489A (en) * 2021-01-29 2021-05-07 北京达佳互联信息技术有限公司 Three-dimensional pose determination method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
徐德 (Xu De) et al.: "《机器人视觉测量与控制》" (Robot Vision Measurement and Control), 31 January 2016 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023082089A1 (en) * 2021-11-10 2023-05-19 中国科学院深圳先进技术研究院 Three-dimensional reconstruction method and apparatus, device and computer storage medium
CN114912287A (en) * 2022-05-26 2022-08-16 四川大学 Robot autonomous grabbing simulation system and method based on target 6D pose estimation
CN115661349A (en) * 2022-10-26 2023-01-31 中国农业大学 Three-dimensional reconstruction method, system, device, medium and product based on sample image
CN115661349B (en) * 2022-10-26 2023-10-27 中国农业大学 Three-dimensional reconstruction method, system, equipment, medium and product based on sample image
CN116681755A (en) * 2022-12-29 2023-09-01 广东美的白色家电技术创新中心有限公司 Pose prediction method and device
CN116681755B (en) * 2022-12-29 2024-02-09 广东美的白色家电技术创新中心有限公司 Pose prediction method and device
CN115690333A (en) * 2022-12-30 2023-02-03 思看科技(杭州)股份有限公司 Three-dimensional scanning method and system
CN115690333B (en) * 2022-12-30 2023-04-28 思看科技(杭州)股份有限公司 Three-dimensional scanning method and system
CN116494253A (en) * 2023-06-27 2023-07-28 北京迁移科技有限公司 Target object grabbing pose acquisition method and robot grabbing system
CN116494253B (en) * 2023-06-27 2023-09-19 北京迁移科技有限公司 Target object grabbing pose acquisition method and robot grabbing system
CN117351306A (en) * 2023-12-04 2024-01-05 齐鲁空天信息研究院 Training method, determining method and device for three-dimensional point cloud projection pose solver
CN117351306B (en) * 2023-12-04 2024-03-22 齐鲁空天信息研究院 Training method, determining method and device for three-dimensional point cloud projection pose solver

Also Published As

Publication number Publication date
CN113409384B (en) 2021-11-30

Similar Documents

Publication Publication Date Title
CN113409384B (en) Pose estimation method and system of target object and robot
CN109816725B (en) Monocular camera object pose estimation method and device based on deep learning
CN111738261B (en) Single-image robot unordered target grabbing method based on pose estimation and correction
CN109903313B (en) Real-time pose tracking method based on target three-dimensional model
CN111553949B (en) Positioning and grabbing method for irregular workpiece based on single-frame RGB-D image deep learning
CN114863573B (en) Category-level 6D attitude estimation method based on monocular RGB-D image
WO2015006224A1 (en) Real-time 3d computer vision processing engine for object recognition, reconstruction, and analysis
CN112950667B (en) Video labeling method, device, equipment and computer readable storage medium
CN112233181A (en) 6D pose recognition method and device and computer storage medium
JP4709668B2 (en) 3D object recognition system
JP2022519194A (en) Depth estimation
CN112907735B (en) Flexible cable identification and three-dimensional reconstruction method based on point cloud
CN113043267A (en) Robot control method, device, robot and computer readable storage medium
JP2018128897A (en) Detection method and detection program for detecting attitude and the like of object
WO2021164887A1 (en) 6d pose and shape estimation method
Billings et al. SilhoNet-fisheye: Adaptation of a ROI based object pose estimation network to monocular fisheye images
CN116249607A (en) Method and device for robotically gripping three-dimensional objects
Chen et al. Progresslabeller: Visual data stream annotation for training object-centric 3d perception
CN117351078A (en) Target size and 6D gesture estimation method based on shape priori
Pichkalev et al. Face drawing by KUKA 6 axis robot manipulator
CN115578460B (en) Robot grabbing method and system based on multi-mode feature extraction and dense prediction
JP2020527270A (en) Electronic devices, systems and methods for determining object posture
KR101673144B1 (en) Stereoscopic image registration method based on a partial linear method
Ward et al. A model-based approach to recovering the structure of a plant from images
Makihara et al. Grasp pose detection for deformable daily items by pix2stiffness estimation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant