CN113034581A - Spatial target relative pose estimation method based on deep learning - Google Patents

Spatial target relative pose estimation method based on deep learning Download PDF

Info

Publication number
CN113034581A
CN113034581A
Authority
CN
China
Prior art keywords
camera
posture
loss
layer
dimensional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110275862.5A
Other languages
Chinese (zh)
Inventor
李志�
李海超
蒙波
黄剑斌
张志民
杨兴昊
黄良伟
黄龙飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Academy of Space Technology CAST
Original Assignee
China Academy of Space Technology CAST
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Academy of Space Technology CAST filed Critical China Academy of Space Technology CAST
Priority to CN202110275862.5A priority Critical patent/CN113034581A/en
Publication of CN113034581A publication Critical patent/CN113034581A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10028 Range image; Depth image; 3D point clouds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to a method for estimating the relative pose of a space target based on deep learning, which comprises the following steps: a. constructing a labeled sample set by using two-dimensional projections of the three-dimensional model of the space target at different positions and attitudes; b. dividing the labeled sample set into a training set, a validation set and a test set, and constructing a pose estimation neural network; c. inputting the training set and the validation set into the constructed pose estimation neural network for training to obtain a pose estimation model; d. testing the test set with the trained pose estimation model to obtain the pose information of the space target for each sample in the test set. The method can estimate the position and attitude of a space target simultaneously with a regression model operating on a single image, and is suitable for target pose estimation under complex space illumination conditions.

Description

Spatial target relative pose estimation method based on deep learning
Technical Field
The invention relates to a method for estimating relative poses of space targets based on deep learning.
Background
Accurate and robust position and attitude estimation is required for on-orbit servicing, rendezvous and docking, and close-approach operations such as debris removal. However, the space environment is often very complex: illumination may be very strong or very weak, and reflections occur. In addition, at some viewing angles the spacecraft appears against a complex textured Earth background, which interferes with pose estimation.
Traditional 6D target pose estimation can use a PnP (Perspective-n-Point) method based on local feature matching between a three-dimensional model and the image, but this is unsuitable for objects with insufficient texture. Template matching and dense feature learning methods can cope with insufficient texture, but they are sensitive to illumination and shadowing, and dense feature learning requires a long time for feature extraction and pose measurement.
On-orbit spacecraft pose estimation can only rely on sensors such as monocular or infrared cameras, stereo cameras and LiDAR. Pose estimation based on a monocular camera has certain advantages on board a spacecraft because of the simplicity of the sensor. Monocular solutions for spacecraft tracking and attitude estimation include:
【B. Naasz, J. V. Eepoel, S. Queen, C. M. Southward, and J. Hannah, "Flight results from the HST SM4 relative navigation sensor system," 2010】;
【J. M. Kelsey, J. Byrne, M. Cosgrove, S. Seereeram, and R. K. Mehra, "Vision-based relative pose estimation for autonomous rendezvous and docking," IEEE Aerospace Conference, 2006】;
【C. Liu and W. Hu, "Relative pose estimation for cylinder-shaped spacecrafts using single image," IEEE Transactions on Aerospace and Electronic Systems, vol. 50, no. 4, 2014】;
These rely on model-based approaches that align a wire-frame model of the spacecraft (or of a component) with an edge image of the real spacecraft (component) using heuristics.
Attitude estimation based on deep learning has made new progress in ground applications. This type of algorithm moves beyond traditional image-processing pipelines and instead attempts to learn the non-linear mapping between the two-dimensional input image and the 6D output pose in an end-to-end manner. For example:
【M. Rad and V. Lepetit, "BB8: A scalable, accurate, robust to partial occlusion method for predicting the 3D poses of challenging objects without using depth," ICCV, 2017】 predicts the 2D projections of the 8 corner points of the object's 3D bounding box, and once these 2D coordinates are obtained, the 3D rotation vector and translation vector are recovered directly with a PnP algorithm.
【S. Sharma, C. Beierle, and S. D'Amico, "Pose estimation for non-cooperative spacecraft rendezvous using convolutional neural networks," IEEE Aerospace Conference, pp. 1-12, 2018】 first proposed using a convolutional neural network for spacecraft pose estimation, combining position estimation based on bounding-box detection with attitude estimation based on soft classification. However, this positioning approach fails when part of the target lies outside the field of view.
Disclosure of Invention
The invention aims to provide a method for estimating the relative pose of a spatial target based on deep learning.
In order to achieve the above object, the present invention provides a method for estimating a relative pose of a spatial target based on deep learning, comprising the following steps:
a. constructing a labeled sample set by using two-dimensional projections of the three-dimensional model of the space target at different positions and attitudes;
b. dividing the labeled sample set into a training set, a validation set and a test set, and constructing a pose estimation neural network;
c. inputting the training set and the validation set into the constructed pose estimation neural network for training to obtain a pose estimation model;
d. testing the test set with the trained pose estimation model to obtain the pose information of the space target for each sample in the test set.
According to an aspect of the present invention, in step (a), each sample in the labeled sample set includes a position (x, y, z), an attitude, and the image corresponding to that position and attitude, the attitude being expressed as a quaternion.
According to one aspect of the invention, in step (a), the three-dimensional model of the space target is input into 3ds Max software, a camera is added in 3ds Max, the position, attitude, field of view and image width and height of the camera are set, the position and attitude of the three-dimensional model relative to the camera are adjusted with the camera position and attitude as the reference, the camera performs two-dimensional imaging of the three-dimensional model, the position and attitude of the three-dimensional model relative to the camera and the corresponding image at that position and attitude are saved, and the labeled sample set is generated;
or the three-dimensional model of the space target is input into imaging simulation software, a camera is added in the imaging simulation software, the position, attitude, field of view and image width and height of the camera are set, the position and attitude of the three-dimensional model relative to the camera are adjusted with the camera position and attitude as the reference, the camera performs two-dimensional imaging of the three-dimensional model, the position and attitude of the three-dimensional model relative to the camera and the corresponding image at that position and attitude are saved, and the labeled sample set is generated;
the imaging simulation software is developed based on OSG; given the camera parameters and the parameters of the three-dimensional model of the space target, the camera images the three-dimensional model of the space target to generate a two-dimensional image; wherein,
the camera parameters include the camera position (x_cam, y_cam, z_cam), the camera three-axis attitude angles (pitch_cam, yaw_cam, roll_cam), the camera field of view, and the width and height of the image;
the parameters of the three-dimensional model of the space target include the model position (x_obj, y_obj, z_obj) and the model three-axis attitude angles (pitch_obj, yaw_obj, roll_obj).
According to one aspect of the present invention, in step (b), the labeled sample set is randomly divided into a training set, a validation set and a test set in a ratio of 7:2:1.
According to one aspect of the invention, in step (b), the pose estimation neural network is constructed based on a deep convolutional residual network (ResNet);
when the network is constructed, the sample image is first input into a backbone network built from the residual convolutional neural network, the output two-dimensional feature map is then input into a bottleneck layer of variable dimensionality for dimension-reducing convolution, and finally the feature map is flattened into a one-dimensional array and the position and attitude information are output through two separate branches.
According to one aspect of the invention, the backbone network is divided into five parts: stage1, stage2, stage3, stage4 and stage5;
stage1 consists of 1 convolutional layer and 1 max-pooling layer;
the convolution kernel size of the convolutional layer in stage1 is (7, 7), the convolution stride is (2, 2) and the number of channels is 64; the max-pooling layer has a down-sampling factor of (3, 3) and a stride of (2, 2);
stage2 consists of 1 conv_block and 2 identity_blocks; the filter numbers of the conv_block and identity_blocks in stage2 are 64, 64 and 256;
stage3 consists of 1 conv_block and 4 identity_blocks; the filter numbers of the conv_block and identity_blocks in stage3 are 128, 128 and 512;
stage4 consists of 1 conv_block and 8 identity_blocks; the filter numbers of the conv_block and identity_blocks in stage4 are 256, 256 and 1024;
stage5 consists of 1 conv_block and 2 identity_blocks; the filter numbers of the conv_block and identity_blocks in stage5 are 512, 512 and 2048.
According to one aspect of the invention, the output features of stage5 are input to the bottleneck layer, which performs a two-dimensional convolution with a 3 × 3 kernel, an adjustable number of channels, and a stride of (2, 2).
According to one aspect of the invention, the output features of the bottleneck layer are input into the position estimation structure and the attitude estimation structure, respectively.
According to one aspect of the invention, the output features of the bottleneck layer are input into the position estimation structure, which uses a two-layer fully connected structure to output the three-dimensional position by regression;
the first fully connected layer reduces the dimensionality of the flattened feature-map information, compressing it to 1024 dimensions; its output is passed through a ReLU activation function and fed to the next fully connected layer, which finally outputs the three-dimensional coordinate information (x, y, z);
the output features of the bottleneck layer are also input into the attitude estimation structure, which uses a two-layer fully connected structure to output the estimated quaternion by regression;
the first fully connected layer reduces the dimensionality of the flattened feature-map information, compressing it to 1024 dimensions; its output is passed through a ReLU activation function and fed to the next fully connected layer, which finally outputs the attitude information expressed as a quaternion.
According to one aspect of the present invention, in the network training process of step (c), the training samples and validation samples in the labeled sample set are input into the pose estimation neural network for training, and the model with the minimum loss function is selected as the trained model;
wherein the loss function is obtained by adding a position loss term and an attitude loss term:
Loss = Loss_position + Loss_attitude
where Loss is the total loss function, taken in relative-error form, Loss_position is the position loss term and Loss_attitude is the attitude loss term;
the position loss function Loss_position is:
Loss_position = (1/m) Σ_{i=1..m} ‖p_est^(i) − p_true^(i)‖ / ‖p_true^(i)‖
the attitude loss function Loss_attitude is:
Loss_attitude = (1/m) Σ_{i=1..m} ‖q_est^(i) − q_true^(i)‖ / ‖q_true^(i)‖
where m is the number of training samples and i = 1, 2, …, m; p_est^(i) and p_true^(i) are the position estimate and the labeled position of the i-th sample, and q_est^(i) and q_true^(i) are the quaternion attitude estimate and the labeled attitude of the i-th sample.
According to the concept of the invention, samples of the space target are labeled using its three-dimensional model: two-dimensional projections of the three-dimensional model at different positions and attitudes are generated, and the corresponding position and attitude information is recorded, yielding a labeled sample set containing position and attitude information. A pose estimation neural network is then constructed, the labeled sample set is input into it, and training selects the model with the minimum loss function. Finally, an image of the space target is input into the trained model to obtain its position and attitude information. The invention can therefore estimate the position and attitude of a space target simultaneously from a single image with a simple regression model, and is suitable for pose estimation of space targets.
Drawings
FIG. 1 is a flow chart of a method for estimating relative poses of spatial objects based on deep learning according to an embodiment of the present invention;
FIG. 2 is a block diagram schematically illustrating a pose estimation neural network according to one embodiment of the present invention;
FIG. 3 is a diagram schematically illustrating a statistical result of attitude estimation accuracy of a test sample according to an embodiment of the present invention;
FIG. 4 is a diagram schematically illustrating statistics of position estimation accuracy for a test sample according to one embodiment of the present invention;
fig. 5 and 6 schematically show pose truth and estimate maps for two examples of the invention, respectively.
Detailed Description
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments will be briefly described below. It is obvious that the drawings in the following description are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
The present invention is described in detail below with reference to the drawings and specific embodiments; the embodiments of the present invention are not limited to the following embodiments.
Referring to fig. 1, in the method for estimating the relative pose of a space target based on deep learning of the present invention, samples of the space target are first labeled using its three-dimensional model: specifically, a labeled sample set containing position and attitude information is constructed from two-dimensional projections of the three-dimensional model of the space target at different positions and attitudes. The labeled sample set is then divided into a training set, a validation set and a test set, and a pose estimation neural network is constructed. The training set and validation set are input into the constructed pose estimation neural network for training to obtain a pose estimation model. Finally, the test set is evaluated with the trained pose estimation model to obtain the pose information of the space target in each test sample.
In the invention, each sample in the labeled sample set comprises a position (x, y, z) represented by three translation amounts, an attitude, and the image corresponding to that position and attitude, where the attitude is represented by a quaternion. The invention does not place special restrictions on the chosen representation of the position and attitude of the space target, as long as the pose of the space target can be determined from it; for example, x, y and z can form a three-dimensional rectangular coordinate system. To generate a sample, the three-dimensional model of the space target can be input into imaging simulation software: a camera is added in the software, the position, attitude, field of view and image width and height of the camera are set, the position and attitude of the three-dimensional model relative to the camera are adjusted with the camera position and attitude as the reference, the camera performs two-dimensional imaging of the three-dimensional model, and the position and attitude of the three-dimensional model relative to the camera and the corresponding image at that position and attitude are saved to generate the labeled sample set. In the invention, the imaging simulation software is developed based on OSG and requires the camera parameters and the parameters of the three-dimensional model of the space target: the camera parameters include the camera position (x_cam, y_cam, z_cam), the camera three-axis attitude angles (pitch_cam, yaw_cam, roll_cam), the camera field of view, and the width and height of the image; the parameters of the three-dimensional model of the space target include the model position (x_obj, y_obj, z_obj) and the model three-axis attitude angles (pitch_obj, yaw_obj, roll_obj). The camera images the three-dimensional model of the space target to generate a two-dimensional image. Alternatively, the three-dimensional model of the space target can be input into 3ds Max software: a camera is added in 3ds Max, the position, attitude, field of view and image width and height of the camera are set, the position and attitude of the three-dimensional model relative to the camera are adjusted with the camera position and attitude as the reference, the camera performs two-dimensional imaging of the three-dimensional model, and the position and attitude of the three-dimensional model relative to the camera and the corresponding image at that position and attitude are saved to generate the labeled sample set.
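As an illustration of how one labeled sample record can be assembled from the simulator's camera and model parameters, the following Python sketch (not part of the patent; the Euler-angle order, the world-frame convention, the scipy quaternion ordering (x, y, z, w) and the function name are assumptions added for illustration) computes the relative position and quaternion attitude of the target with respect to the camera:

# Illustrative sketch: build one labeled sample from camera and model parameters.
# The Euler-angle order and frame conventions are assumptions; the image path is a placeholder.
import numpy as np
from scipy.spatial.transform import Rotation as R

def make_label(cam_pos, cam_euler_deg, obj_pos, obj_euler_deg):
    """Return the position and quaternion of the target relative to the camera."""
    R_cam = R.from_euler("xyz", cam_euler_deg, degrees=True)   # assumed axis order
    R_obj = R.from_euler("xyz", obj_euler_deg, degrees=True)
    # Position of the target expressed in the camera frame.
    rel_pos = R_cam.inv().apply(np.asarray(obj_pos) - np.asarray(cam_pos))
    # Attitude of the target relative to the camera, as a quaternion (x, y, z, w).
    rel_quat = (R_cam.inv() * R_obj).as_quat()
    return rel_pos, rel_quat

# Example: one sample with the camera at the origin of the world frame.
pos, quat = make_label(cam_pos=[0, 0, 0], cam_euler_deg=[0, 0, 0],
                       obj_pos=[-1.0, 0.3, 65.0], obj_euler_deg=[10.0, -5.0, 30.0])
sample = {"position": pos.tolist(), "quaternion": quat.tolist(), "image": "render_0001.png"}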
With either method, two-dimensional projection images of the space target can be generated and used to construct the labeled sample set. The constructed labeled sample set is then divided into a training set, a validation set and a test set. In this embodiment, the labeled sample set is randomly divided into a training set, a validation set and a test set in a ratio of 7:2:1, i.e., 70% training samples, 20% validation samples and 10% test samples.
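A minimal sketch of this random 7:2:1 split, assuming the labeled samples are held in a Python list of records such as the sample dictionary above; the function name and the seed are illustrative:

# Minimal sketch of the random 7:2:1 split; `samples` is a list of labeled records.
import random

def split_dataset(samples, seed=0):
    samples = list(samples)
    random.Random(seed).shuffle(samples)          # reproducible random shuffle
    n = len(samples)
    n_train, n_val = int(0.7 * n), int(0.2 * n)
    train = samples[:n_train]
    val = samples[n_train:n_train + n_val]
    test = samples[n_train + n_val:]              # remaining ~10%
    return train, val, test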
After the sample set has been divided, the pose estimation neural network can be constructed and trained with the labeled sample set. In the invention, the pose estimation neural network is built on a network model modified from the deep convolutional residual network ResNet. The residual block, the basic unit of a residual network, has good properties and avoids the degradation phenomenon in which the loss grows as the depth increases, so this network is adopted as the basic model of the design. In general, when the network is constructed, the sample image is first input into a backbone network built from the residual convolutional neural network, the output two-dimensional feature map is then input into a bottleneck layer of variable dimensionality for dimension-reducing convolution, and finally the feature map is flattened into a one-dimensional array and the position and attitude information are output through two separate branches.
As shown in fig. 2, the backbone network constructed from the residual convolutional neural network can be divided into five parts: stage1, stage2, stage3, stage4 and stage5. The composition and corresponding parameters of each part are as follows (a sketch of the conv_block and identity_block units is given after the list):
stage1 consists of 1 convolutional layer and 1 max-pooling layer;
the convolution kernel size of the convolutional layer in stage1 is (7, 7), the convolution stride is (2, 2) and the number of channels is 64; the max-pooling layer has a down-sampling factor of (3, 3) and a stride of (2, 2);
stage2 consists of 1 conv_block and 2 identity_blocks; the filter numbers of the conv_block and identity_blocks in stage2 are 64, 64 and 256;
stage3 consists of 1 conv_block and 4 identity_blocks; the filter numbers of the conv_block and identity_blocks in stage3 are 128, 128 and 512;
stage4 consists of 1 conv_block and 8 identity_blocks; the filter numbers of the conv_block and identity_blocks in stage4 are 256, 256 and 1024;
stage5 consists of 1 conv_block and 2 identity_blocks; the filter numbers of the conv_block and identity_blocks in stage5 are 512, 512 and 2048.
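The patent does not spell out the internals of the conv_block and identity_block units, so the following PyTorch sketch assumes the standard ResNet bottleneck design (1×1, 3×3, 1×1 convolutions with a shortcut connection); the filter triples correspond to the stage table above, and the stride choices are assumptions:

# Sketch (assumed standard ResNet bottleneck design) of the identity_block and
# conv_block units named in the stage table above.
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    def __init__(self, in_ch, filters, stride=1):
        super().__init__()
        f1, f2, f3 = filters                      # e.g. (64, 64, 256) for stage2
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, f1, 1, stride=stride, bias=False), nn.BatchNorm2d(f1), nn.ReLU(inplace=True),
            nn.Conv2d(f1, f2, 3, padding=1, bias=False),        nn.BatchNorm2d(f2), nn.ReLU(inplace=True),
            nn.Conv2d(f2, f3, 1, bias=False),                   nn.BatchNorm2d(f3),
        )
        # conv_block: projection shortcut that changes resolution/width;
        # identity_block: plain identity shortcut (in_ch equals f3, stride 1).
        self.shortcut = (nn.Identity() if in_ch == f3 and stride == 1 else
                         nn.Sequential(nn.Conv2d(in_ch, f3, 1, stride=stride, bias=False),
                                       nn.BatchNorm2d(f3)))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.shortcut(x))

def make_stage(in_ch, filters, n_identity, stride=2):
    """One conv_block followed by n_identity identity_blocks."""
    blocks = [Bottleneck(in_ch, filters, stride=stride)]
    blocks += [Bottleneck(filters[2], filters) for _ in range(n_identity)]
    return nn.Sequential(*blocks)

# Example: stage2 of the table above, 1 conv_block + 2 identity_blocks, filters (64, 64, 256).
stage2 = make_stage(64, (64, 64, 256), n_identity=2, stride=1)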
Following these construction steps, the output features of stage5 are input into the bottleneck layer, which performs a two-dimensional convolution with a 3 × 3 kernel, an adjustable number of channels, and a stride of (2, 2). This reduces the output dimensionality and greatly reduces the number of parameters for the subsequent fully connected layers, shortening the training time. The output features of the bottleneck layer are then input into the two branches of the position estimation structure and the attitude estimation structure, respectively.
In the step of inputting the output features of the bottleneck layer into the position estimation structure, the invention uses a two-layer fully connected structure to output the three-dimensional position directly by regression. The first fully connected layer reduces the dimensionality of the flattened feature-map information, compressing it to 1024 dimensions; its output is passed through a ReLU activation function and fed to the next fully connected layer, which finally outputs the three-dimensional coordinate information (x, y, z). In the step of inputting the output features of the bottleneck layer into the attitude estimation structure, the invention uses a two-layer fully connected structure to output the estimated quaternion directly by regression. The first fully connected layer likewise compresses the flattened feature-map information to 1024 dimensions; its output is passed through a ReLU activation function and fed to the next fully connected layer, which finally outputs the attitude information expressed as a quaternion (q1, q2, q3, q4).
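The overall structure of fig. 2 (backbone, bottleneck convolution, flattening, and the two two-layer fully connected branches) can be sketched in PyTorch as follows. This is an approximation rather than the patented network itself: torchvision's resnet50 is used as a stand-in backbone (its stage depths 3-4-6-3 differ from the 3-5-9-3 layout described above), and the bottleneck width of 128 channels and the 256 × 256 input resolution are illustrative assumptions:

# Sketch of the pose-estimation network: stand-in backbone, bottleneck conv,
# flatten, and two 1024-dim fully connected branches for position and attitude.
import torch
import torch.nn as nn
from torchvision.models import resnet50

class PoseNet(nn.Module):
    def __init__(self, bottleneck_ch=128):
        super().__init__()
        backbone = resnet50()
        # Keep everything up to the last residual stage (output: N x 2048 x H/32 x W/32).
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        # Bottleneck layer: 3x3 convolution, stride (2, 2), adjustable channel count.
        self.bottleneck = nn.Sequential(
            nn.Conv2d(2048, bottleneck_ch, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        feat_dim = bottleneck_ch * 4 * 4          # assumes a 256x256 input image
        # Two-layer fully connected branches with a 1024-dim hidden layer and ReLU.
        self.pos_head = nn.Sequential(nn.Linear(feat_dim, 1024), nn.ReLU(inplace=True), nn.Linear(1024, 3))
        self.att_head = nn.Sequential(nn.Linear(feat_dim, 1024), nn.ReLU(inplace=True), nn.Linear(1024, 4))

    def forward(self, x):
        f = self.bottleneck(self.backbone(x))     # N x C x 4 x 4 for a 256x256 input
        f = torch.flatten(f, 1)                   # flatten to a one-dimensional array per sample
        return self.pos_head(f), self.att_head(f)

# Quick shape check with a dummy batch.
model = PoseNet()
pos, quat = model(torch.randn(2, 3, 256, 256))
print(pos.shape, quat.shape)                      # torch.Size([2, 3]) torch.Size([2, 4])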
Through the above steps the basic pose estimation neural network can be constructed; the network then needs to be trained before it can serve as a pose estimation model. During training, the training samples and validation samples in the labeled sample set are input into the pose estimation neural network, the network is trained, and the model with the minimum loss function is selected as the final trained model. The test samples (i.e., the test set) are then input into the pose estimation neural network and evaluated with the trained model, yielding the pose test result of the space target, i.e., the pose information of each sample. In the invention, the loss function is the sum of a position loss term and an attitude loss term. Because the position and attitude outputs are independent of each other and belong to different categories, two branch structures are designed (the two branches indicated by the two arrows in the lower part of fig. 2), a corresponding loss function is designed for each branch, and the two are summed, giving the loss function:
Loss = Loss_position + Loss_attitude
where Loss is the total loss function, taken in relative-error form, Loss_position is the position loss term and Loss_attitude is the attitude loss term.
The position loss function Loss_position is:
Loss_position = (1/m) Σ_{i=1..m} ‖p_est^(i) − p_true^(i)‖ / ‖p_true^(i)‖
The attitude loss function Loss_attitude is:
Loss_attitude = (1/m) Σ_{i=1..m} ‖q_est^(i) − q_true^(i)‖ / ‖q_true^(i)‖
In these expressions m is the number of training samples and i = 1, 2, …, m is the loop index; p_est^(i) and p_true^(i) are the position estimate and the labeled position of the i-th sample, and q_est^(i) and q_true^(i) are the quaternion attitude estimate and the labeled attitude of the i-th sample.
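Under the reconstruction of the loss given above (the mean over the batch of the ratio of the error norm to the label norm), the combined loss can be sketched in PyTorch as follows; the function and variable names are illustrative:

# Sketch of the combined relative-error loss; pos_pred/quat_pred are the two
# network outputs and pos_gt/quat_gt the labels, with shapes N x 3 and N x 4.
import torch

def pose_loss(pos_pred, quat_pred, pos_gt, quat_gt, eps=1e-8):
    loss_pos = (torch.norm(pos_pred - pos_gt, dim=1) /
                (torch.norm(pos_gt, dim=1) + eps)).mean()
    loss_att = (torch.norm(quat_pred - quat_gt, dim=1) /
                (torch.norm(quat_gt, dim=1) + eps)).mean()
    return loss_pos + loss_att                     # Loss = Loss_position + Loss_attitude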
After the final trained model has been obtained, the position and attitude of an image containing the space target can be estimated with it, yielding the position and attitude information of the space target.
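A brief usage sketch with the PoseNet defined earlier; the image file name, the preprocessing values and the final quaternion renormalization are assumptions added for illustration and are not stated in the patent:

# Usage sketch: estimate the pose of a single image with the trained model.
import torch
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
])

model.eval()
with torch.no_grad():
    img = preprocess(Image.open("target_view.png").convert("RGB")).unsqueeze(0)
    pos, quat = model(img)
    quat = quat / quat.norm(dim=1, keepdim=True)   # renormalize the regressed quaternion
print("position:", pos.squeeze().tolist(), "attitude quaternion:", quat.squeeze().tolist())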
Fig. 3 shows the statistics of the attitude estimation accuracy over 100 test samples, with a mean of 3.71°, and fig. 4 shows the statistics of the position estimation accuracy over 100 test samples, with a mean of 0.389 m. Fig. 5 and 6 compare the labeled target pose values with the estimated values. The left image of fig. 5 is the labeled ground truth, with true position (-0.9979, 0.3473, 65.5559) and true attitude quaternion (-0.6179, 0.1311, 0.0299, 0.7747); the right image of fig. 5 is the pose estimate, with position estimate (-0.9779, 0.288, 65.1297) and attitude quaternion estimate (-0.6248, 0.146, 0.022, 0.7666); the resulting position error is 0.44 m and the attitude error is 2.28°. The left image of fig. 6 is the labeled ground truth, with true position (-3.9179, -16.5673, 65.974) and true attitude quaternion (-0.8, 0.15, -0.059, 0.5778); the right image of fig. 6 is the pose estimate, with position estimate (-3.8477, -16.3197, 65.7765) and attitude quaternion estimate (-0.7933, 0.1419, -0.0296, 0.591); the resulting position error is 0.32 m and the attitude error is 3.93°.
In general, the positions and attitudes estimated by this relative pose estimation method differ little from the true values and the precision is high, so the method is well suited to the technical field of space target pose estimation. The position and attitude of a space target can be estimated simultaneously with a regression model operating on a single image, and the method is suitable for target pose estimation under complex space illumination conditions. It therefore addresses the technical problem that the pose of a space target is difficult to estimate.
The above description is only one embodiment of the present invention, and is not intended to limit the present invention, and it is apparent to those skilled in the art that various modifications and variations can be made in the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method for estimating the relative pose of a space target based on deep learning, comprising the following steps:
a. constructing a labeled sample set by using two-dimensional projections of the three-dimensional model of the space target at different positions and attitudes;
b. dividing the labeled sample set into a training set, a validation set and a test set, and constructing a pose estimation neural network;
c. inputting the training set and the validation set into the constructed pose estimation neural network for training to obtain a pose estimation model;
d. testing the test set with the trained pose estimation model to obtain the pose information of the space target for each sample in the test set.
2. The method according to claim 1, wherein in step (a), each sample in the labeled sample set comprises a position (x, y, z), an attitude, and the image corresponding to that position and attitude, the attitude being represented by a quaternion.
3. The method according to claim 1, wherein in step (a), the three-dimensional model of the space target is input into 3ds Max software, a camera is added in 3ds Max, the position, attitude, field of view and image width and height of the camera are set, the position and attitude of the three-dimensional model relative to the camera are adjusted with the camera position and attitude as the reference, the camera performs two-dimensional imaging of the three-dimensional model, the position and attitude of the three-dimensional model relative to the camera and the corresponding image at that position and attitude are saved, and the labeled sample set is generated;
or the three-dimensional model of the space target is input into imaging simulation software, a camera is added in the imaging simulation software, the position, attitude, field of view and image width and height of the camera are set, the position and attitude of the three-dimensional model relative to the camera are adjusted with the camera position and attitude as the reference, the camera performs two-dimensional imaging of the three-dimensional model, the position and attitude of the three-dimensional model relative to the camera and the corresponding image at that position and attitude are saved, and the labeled sample set is generated;
the imaging simulation software is developed based on OSG; given the camera parameters and the parameters of the three-dimensional model of the space target, the camera images the three-dimensional model of the space target to generate a two-dimensional image; wherein,
the camera parameters include the camera position (x_cam, y_cam, z_cam), the camera three-axis attitude angles (pitch_cam, yaw_cam, roll_cam), the camera field of view, and the width and height of the image;
the parameters of the three-dimensional model of the space target include the model position (x_obj, y_obj, z_obj) and the model three-axis attitude angles (pitch_obj, yaw_obj, roll_obj).
4. The method according to claim 1, wherein in step (b), the labeled sample set is randomly divided into a training set, a validation set and a test set in a ratio of 7:2:1.
5. The method according to claim 1, wherein in step (b), the pose estimation neural network is constructed based on a deep convolutional residual network (ResNet);
when the network is constructed, the sample image is first input into a backbone network built from the residual convolutional neural network, the output two-dimensional feature map is then input into a bottleneck layer of variable dimensionality for dimension-reducing convolution, and finally the feature map is flattened into a one-dimensional array and the position and attitude information are output through two separate branches.
6. The method according to claim 5, wherein the backbone network is divided into five parts: stage1, stage2, stage3, stage4 and stage5;
stage1 consists of 1 convolutional layer and 1 max-pooling layer;
the convolution kernel size of the convolutional layer in stage1 is (7, 7), the convolution stride is (2, 2) and the number of channels is 64; the max-pooling layer has a down-sampling factor of (3, 3) and a stride of (2, 2);
stage2 consists of 1 conv_block and 2 identity_blocks; the filter numbers of the conv_block and identity_blocks in stage2 are 64, 64 and 256;
stage3 consists of 1 conv_block and 4 identity_blocks; the filter numbers of the conv_block and identity_blocks in stage3 are 128, 128 and 512;
stage4 consists of 1 conv_block and 8 identity_blocks; the filter numbers of the conv_block and identity_blocks in stage4 are 256, 256 and 1024;
stage5 consists of 1 conv_block and 2 identity_blocks; the filter numbers of the conv_block and identity_blocks in stage5 are 512, 512 and 2048.
7. The method according to claim 6, wherein the output features of stage5 are input to a bottleneck layer, which performs a two-dimensional convolution with a 3 × 3 kernel, an adjustable number of channels, and a stride of (2, 2).
8. The method according to claim 7, wherein the output features of the bottleneck layer are input into the two branches of the position estimation structure and the attitude estimation structure, respectively.
9. The method according to claim 8, wherein the output features of the bottleneck layer are input into the position estimation structure, which uses a two-layer fully connected structure to output the three-dimensional position by regression;
the first fully connected layer reduces the dimensionality of the flattened feature-map information, compressing it to 1024 dimensions; its output is passed through a ReLU activation function and fed to the next fully connected layer, which finally outputs the three-dimensional coordinate information (x, y, z);
the output features of the bottleneck layer are also input into the attitude estimation structure, which uses a two-layer fully connected structure to output the estimated quaternion by regression;
the first fully connected layer reduces the dimensionality of the flattened feature-map information, compressing it to 1024 dimensions; its output is passed through a ReLU activation function and fed to the next fully connected layer, which finally outputs the attitude information expressed as a quaternion.
10. The method according to claim 1, wherein in the network training process of step (c), the training samples and validation samples in the labeled sample set are input into the pose estimation neural network for training, and the model with the minimum loss function is selected as the trained model;
wherein the loss function is the sum of a position loss term and an attitude loss term:
Loss = Loss_position + Loss_attitude
where Loss is the total loss function, taken in relative-error form, Loss_position is the position loss term and Loss_attitude is the attitude loss term;
the position loss function Loss_position is:
Loss_position = (1/m) Σ_{i=1..m} ‖p_est^(i) − p_true^(i)‖ / ‖p_true^(i)‖
the attitude loss function Loss_attitude is:
Loss_attitude = (1/m) Σ_{i=1..m} ‖q_est^(i) − q_true^(i)‖ / ‖q_true^(i)‖
where m is the number of training samples and i = 1, 2, …, m; p_est^(i) and p_true^(i) are the position estimate and the labeled position of the i-th sample, and q_est^(i) and q_true^(i) are the quaternion attitude estimate and the labeled attitude of the i-th sample.
CN202110275862.5A 2021-03-15 2021-03-15 Spatial target relative pose estimation method based on deep learning Pending CN113034581A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110275862.5A CN113034581A (en) 2021-03-15 2021-03-15 Spatial target relative pose estimation method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110275862.5A CN113034581A (en) 2021-03-15 2021-03-15 Spatial target relative pose estimation method based on deep learning

Publications (1)

Publication Number Publication Date
CN113034581A true CN113034581A (en) 2021-06-25

Family

ID=76468750

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110275862.5A Pending CN113034581A (en) 2021-03-15 2021-03-15 Spatial target relative pose estimation method based on deep learning

Country Status (1)

Country Link
CN (1) CN113034581A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113763572A (en) * 2021-09-17 2021-12-07 北京京航计算通讯研究所 3D entity labeling method based on AI intelligent recognition and storage medium
CN114187360A (en) * 2021-12-14 2022-03-15 西安交通大学 Head pose estimation method based on deep learning and quaternion
CN114266824A (en) * 2021-12-10 2022-04-01 北京理工大学 Non-cooperative target relative pose measurement method and system based on deep learning

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109816725A (en) * 2019-01-17 2019-05-28 哈工大机器人(合肥)国际创新研究院 A kind of monocular camera object pose estimation method and device based on deep learning
CN110349215A (en) * 2019-07-10 2019-10-18 北京悉见科技有限公司 A kind of camera position and orientation estimation method and device
WO2020161118A1 (en) * 2019-02-05 2020-08-13 Siemens Aktiengesellschaft Adversarial joint image and pose distribution learning for camera pose regression and refinement

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109816725A (en) * 2019-01-17 2019-05-28 哈工大机器人(合肥)国际创新研究院 A kind of monocular camera object pose estimation method and device based on deep learning
WO2020161118A1 (en) * 2019-02-05 2020-08-13 Siemens Aktiengesellschaft Adversarial joint image and pose distribution learning for camera pose regression and refinement
CN110349215A (en) * 2019-07-10 2019-10-18 北京悉见科技有限公司 A kind of camera position and orientation estimation method and device

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113763572A (en) * 2021-09-17 2021-12-07 北京京航计算通讯研究所 3D entity labeling method based on AI intelligent recognition and storage medium
CN114266824A (en) * 2021-12-10 2022-04-01 北京理工大学 Non-cooperative target relative pose measurement method and system based on deep learning
CN114187360A (en) * 2021-12-14 2022-03-15 西安交通大学 Head pose estimation method based on deep learning and quaternion
CN114187360B (en) * 2021-12-14 2024-02-06 西安交通大学 Head pose estimation method based on deep learning and quaternion

Similar Documents

Publication Publication Date Title
CN110458939B (en) Indoor scene modeling method based on visual angle generation
CN109859296B (en) Training method of SMPL parameter prediction model, server and storage medium
JP6681729B2 (en) Method for determining 3D pose of object and 3D location of landmark point of object, and system for determining 3D pose of object and 3D location of landmark of object
CN110084304B (en) Target detection method based on synthetic data set
CN113034581A (en) Spatial target relative pose estimation method based on deep learning
AU2011362799B2 (en) 3D streets
CN112270249A (en) Target pose estimation method fusing RGB-D visual features
EP3182371B1 (en) Threshold determination in for example a type ransac algorithm
CN109242855B (en) Multi-resolution three-dimensional statistical information-based roof segmentation method, system and equipment
CN106503671A (en) The method and apparatus for determining human face posture
CN109816769A (en) Scene based on depth camera ground drawing generating method, device and equipment
CN109035327B (en) Panoramic camera attitude estimation method based on deep learning
CN112639846A (en) Method and device for training deep learning model
EP3420532B1 (en) Systems and methods for estimating pose of textureless objects
EP3185212B1 (en) Dynamic particle filter parameterization
CN110243390A (en) The determination method, apparatus and odometer of pose
CN114581571A (en) Monocular human body reconstruction method and device based on IMU and forward deformation field
CN114897136A (en) Multi-scale attention mechanism method and module and image processing method and device
CN111127556B (en) Target object identification and pose estimation method and device based on 3D vision
CN114972646A (en) Method and system for extracting and modifying independent ground objects of live-action three-dimensional model
CN111198563A (en) Terrain recognition method and system for dynamic motion of foot type robot
CN116612513A (en) Head posture estimation method and system
CN113379899B (en) Automatic extraction method for building engineering working face area image
Dong et al. Learning stratified 3D reconstruction
Zhang et al. A multiple camera system with real-time volume reconstruction for articulated skeleton pose tracking

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination