CN111260680B - RGBD camera-based unsupervised pose estimation network construction method - Google Patents
RGBD camera-based unsupervised pose estimation network construction method
- Publication number
- CN111260680B CN111260680B CN202010034081.2A CN202010034081A CN111260680B CN 111260680 B CN111260680 B CN 111260680B CN 202010034081 A CN202010034081 A CN 202010034081A CN 111260680 B CN111260680 B CN 111260680B
- Authority
- CN
- China
- Prior art keywords
- network
- sequence
- pose
- image
- camera
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/50—Depth or shape recovery
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Biophysics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Biomedical Technology (AREA)
- Multimedia (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses an RGBD camera-based unsupervised pose estimation network construction method. Estimating camera motion from images is a major research topic for vision-based mobile robots. Traditional methods are prone to failure in environments with low texture, complex geometry, difficult illumination, or occlusion, while most deep-learning-based methods require additional supervision data, which complicates the work and increases cost. The disclosed convolutional-neural-network method remedies the shortcomings of traditional approaches: it exploits the distance information of the depth image, combines it with classical geometric knowledge, and adds constraints through forward- and reverse-order inputs, so that the network can accurately estimate the camera pose.
Description
Technical Field
The invention belongs to the field of computer vision, and particularly relates to an RGBD camera-based unsupervised pose estimation network construction method.
Background
Simultaneous Localization and Mapping (SLAM) is an important research direction of machine vision; through thirty years of development, SLAM and its related technologies have become research hotspots in robotics, image processing, deep learning, structure from motion, augmented reality, and other fields. Recovering structure from consecutive frames, i.e. Structure from Motion (SFM), is a major focus. Although conventional SFM methods are effective in many cases, they rely on accurate image correspondences and are prone to failure in environments with low texture, complex geometry, difficult lighting, or occlusion.
To address this problem, and with the development of deep learning in recent years, methods based on deep learning have been proposed and applied to the various stages of the conventional SFM pipeline. Because these methods train on large data sets and use accurate external supervision during training, inter-frame estimation in a fixed scene becomes more accurate, compensating for the shortcomings of traditional methods. However, external supervision data, particularly the ground-truth relative motion between frames, is not easily obtained and requires additional sensors such as an IMU or GPS, which complicates the task and increases cost.
Depth cameras have been widely used in SLAM research in recent years, since accurate color maps and the corresponding depth maps can be acquired easily. The color map carries rich feature information: a neural network can learn feature representations from it and perceive feature correlations between different frames, while the depth map provides distance information about objects. Fusing the depth map's distance information with deep learning, combined with geometric knowledge, offers a new idea for unsupervised networks.
Disclosure of Invention
The invention aims to overcome the shortcomings of traditional methods and the limitations of supervised deep learning by providing an unsupervised pose estimation network construction method based on an RGBD camera. The method not only uses deep learning to learn the inter-frame transformation relationship, but also combines it with classical geometric knowledge: the inter-frame transformation produced by the network and the distance information of the depth map guide the network toward more accurate results, achieving an unsupervised effect, and constraints added through forward- and reverse-order networks make the network's estimates more accurate. The method comprises the following specific steps:
step (1): obtaining color images and depth images of the same scene with an RGB-D camera
Capture continuous color images I_{t−1}, I_t, I_{t+1} with the RGB-D camera, together with the corresponding continuous depth images D_{t−1}, D_t, D_{t+1}, each of resolution H × W, where H and W are the height and width of the image respectively; select the color image I_t at time t and its adjacent images I_{t−1} and I_{t+1}; each color image has three RGB channels, and the three color images I_{t−1}, I_t, I_{t+1} are spliced along the channel dimension into a 9-channel input feature_0, wherein feature refers to a feature map obtained after a convolution layer and 0 refers to the 0th convolution operation, i.e. the raw input;
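As an illustration of this splice, the following sketch (PyTorch is assumed here; the patent does not prescribe a framework, and the function name is ours) concatenates the three RGB frames along the channel axis:

```python
import torch

def make_feature_0(I_prev: torch.Tensor, I_t: torch.Tensor, I_next: torch.Tensor) -> torch.Tensor:
    # Each frame has shape (B, 3, H, W); concatenating along the channel
    # dimension yields the 9-channel input the patent denotes feature_0.
    return torch.cat([I_prev, I_t, I_next], dim=1)  # (B, 9, H, W)
```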
step (2): learning the inter-frame structural relationship with the pose network
The pose network is composed of a convolutional neural network, with a ReLU activation layer after each convolution layer. The 9-channel sequence feature_0 first passes through a convolution layer with a 7 × 7 kernel, then a convolution layer with a 5 × 5 kernel, and then five convolution layers with 3 × 3 kernels, yielding a 256-channel feature map; a convolution layer with a 1 × 1 kernel then reduces the dimensionality to 12 channels. Finally, averaging over the H and W dimensions collapses each channel to a single number, giving a group of 12 numbers. This group is split into two groups of 6 numbers, denoted T_{t→t−1} and T_{t→t+1} respectively. For T_{t→t−1}, the first three numbers represent the displacement from the coordinate system of I_t to that of I_{t−1}, and the last three represent the corresponding rotation expressed in Euler angles; T_{t→t+1} is interpreted in the same way;
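A hedged sketch of this layer sequence follows (PyTorch assumed; the patent does not state the convolution strides or intermediate channel widths, so the stride-2, channel-doubling schedule below is an assumption borrowed from SfM-Learner-style encoders):

```python
import torch
import torch.nn as nn

class PoseNet(nn.Module):
    """Pose network sketch: 7x7, 5x5, then five 3x3 conv layers (ReLU after
    each), a 1x1 reduction to 12 channels, and a global average over H, W."""
    def __init__(self):
        super().__init__()
        chans = [9, 16, 32, 64, 128, 256, 256, 256]  # widths are assumptions
        kernels = [7, 5, 3, 3, 3, 3, 3]
        layers = []
        for c_in, c_out, k in zip(chans[:-1], chans[1:], kernels):
            layers += [nn.Conv2d(c_in, c_out, k, stride=2, padding=k // 2),
                       nn.ReLU(inplace=True)]
        self.encoder = nn.Sequential(*layers)
        self.head = nn.Conv2d(256, 12, kernel_size=1)  # 1x1 dimension reduction

    def forward(self, x):                  # x: (B, 9, H, W), i.e. feature_0
        f = self.head(self.encoder(x))     # (B, 12, H', W')
        pose = f.mean(dim=[2, 3])          # average over H and W -> (B, 12)
        # Split into T_{t->t-1} and T_{t->t+1}: 3 displacement values
        # followed by 3 Euler angles each.
        return pose[:, :6], pose[:, 6:]
```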
and (3): completing self-supervision by using the inter-frame camera pose relation combined with the distance information of the depth map and geometric knowledge:
for imagesThe corresponding depth map isImage corresponding to time t +1Has a conversion relation of T t→t+1 (ii) a For imagesA certain point pixel onCorrespond toIs asThe camera projection model and the inter-frame triangular relation can be obtainedCorrespond toIs formed by a plurality of pixelsThere is a relationship:
p_{t+1} ~ K · T_{t→t+1} · D_t(p_t) · K^{−1} · p_t ①
wherein K is the camera intrinsic matrix; according to equation ①, each pixel p_t of I_t is mapped to the corresponding location p_{t+1} in the space of I_{t+1}; from the pixel values at these projected locations and the positions of the initial pixels, the synthesized view Î_t corresponding to I_t is then obtained by differentiable bilinear sampling interpolation. Each pixel value of the synthesized view is not a simple lookup: differentiable bilinear sampling obtains it by weighting the four pixels around the projected location;
Î_t(p_t) = Σ_{i ∈ {top, bottom}, j ∈ {left, right}} w^{ij} · I_{t+1}(p_{t+1}^{ij}) ②
where i ∈ {top, bottom} and j ∈ {left, right}, so that p_{t+1}^{ij} denotes the four pixels surrounding the projected location p_{t+1}, and w^{ij} denotes the weight of each of the four pixels, with Σ w^{ij} = 1. Once the synthesized view Î_t is obtained, it forms self-supervision with the original view I_t, giving the loss function between the two frames:
L_photo = Σ_p | I_t(p) − Î_t(p) | ③
Thus, by synthesizing a new view from the depth map and constructing a photometric error, self-supervision is achieved without any external supervision;
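A sketch of equations ① to ③ follows (PyTorch assumed; the 6-dimensional pose is taken to have been converted beforehand to a 4 × 4 homogeneous matrix, and out-of-view handling is omitted; both are our simplifications, not details from the patent):

```python
import torch
import torch.nn.functional as F

def inverse_warp(I_next, D_t, T, K):
    """Synthesize the view at time t from I_{t+1} via equation 1,
    sampling with differentiable bilinear interpolation (equation 2).
    I_next: (B, 3, H, W); D_t: (B, 1, H, W); T: (B, 4, 4); K: (B, 3, 3)."""
    B, _, H, W = I_next.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=0).view(1, 3, -1).expand(B, -1, -1)
    # Back-project: X = D_t(p_t) * K^-1 * p_t, then re-project:
    # p_{t+1} ~ K * T_{t->t+1} * X                      (equation 1)
    cam = torch.linalg.inv(K) @ pix * D_t.view(B, 1, -1)
    cam_h = torch.cat([cam, torch.ones_like(cam[:, :1])], dim=1)
    proj = K @ (T @ cam_h)[:, :3, :]
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)
    # Normalize to [-1, 1]; grid_sample performs the 4-neighbour weighting.
    grid = torch.stack([uv[:, 0] / (W - 1) * 2 - 1,
                        uv[:, 1] / (H - 1) * 2 - 1], dim=-1).view(B, H, W, 2)
    return F.grid_sample(I_next, grid, mode="bilinear", align_corners=True)

def photometric_loss(I_t, I_synth):
    # Equation 3: photometric error between original and synthesized views.
    return (I_t - I_synth).abs().mean()
```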
and (4): preventing corruption of the training gradients with a mask network
When the traditional geometric knowledge of the previous step is used, preconditions such as the absence of dynamic objects and occluding objects in the image must be met; a mask network is provided to prevent violations of these preconditions from inhibiting network training. The mask network shares the first five convolution layers with the pose network and is trained together with it; upsampling through four 4 × 4 convolution layers and one 3 × 3 convolution layer yields the mask I_t^mask corresponding to a sequence. With the mask value P_t^mask(p) for each pixel p, the loss function for two frames becomes, from equation ③:
L_mask = Σ_p P_t^mask(p) · | I_t(p) − Î_t(p) | ④
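A minimal sketch of this masked loss (the mask is assumed to come from a sigmoid so its values lie in [0, 1]; note that in SfM-Learner-style training such a mask additionally needs a regularization term pulling it toward 1, otherwise the trivial all-zero mask minimizes the loss):

```python
def masked_photometric_loss(I_t, I_synth, mask):
    # Equation 4: weight the per-pixel photometric error by the predicted
    # mask P_t^mask, down-weighting pixels that violate the static-scene
    # assumption (dynamic objects, occlusions). mask: (B, 1, H, W) in [0, 1].
    return (mask * (I_t - I_synth).abs()).mean()
```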
And (5): adding constraints through a reverse-order network so that the network estimates the relative pose between frames more accurately
For forward-order input, the input sequence is (I_{t−1}, I_t, I_{t+1}); the image input of the reverse-order network is (I_{t+1}, I_t, I_{t−1}). A good pose estimation network can estimate the inter-frame pose relationship not only for a forward-order sequence but also when the image sequence is input in reverse order, and this increases the constraints on the network. For a sequence of three images, the poses obtained by the network for the forward order are T_{t→t−1} and T_{t→t+1}, and the poses obtained for the reverse order are T′_{t→t+1} and T′_{t→t−1}; ideally the forward-order and reverse-order estimates of the same transformation are identical. However, a network estimate always carries an error, and this error provides an additional constraint through the following loss function:
L_pose = Σ_{s ∈ {t−1, t+1}} ( ‖ t_{t→s} − t′_{t→s} ‖_1 + ω · ‖ r_{t→s} − r′_{t→s} ‖_1 ) ⑤
Here t_{t→s} denotes the displacement estimated by the network for the forward-order input and t′_{t→s} the displacement estimated for the reverse-order input; r_{t→s} denotes the rotation estimated for the forward-order input and r′_{t→s} the rotation estimated for the reverse-order input; ω denotes the weight balancing the rotation term;
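A minimal sketch of this loss (the residual form and the placement of the weight ω follow equation ⑤; the tensor and function names are ours):

```python
def forward_reverse_loss(t_fwd, r_fwd, t_rev, r_rev, omega=1.0):
    # The reverse-order pass re-estimates the same inter-frame transforms,
    # so forward and reverse estimates should agree; penalize their residual.
    return (t_fwd - t_rev).abs().mean() + omega * (r_fwd - r_rev).abs().mean()
```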
therefore, the pose network is trained by adding constraints, so that the network has the capability of accurately estimating the relative motion between frames.
The invention has the beneficial effects that: the deep learning method finds the association between adjacent frames from the feature information of the color images; by using the distance information provided by the depth image and combining it with traditional geometric methods, the network avoids cumbersome external supervision and achieves unsupervised learning; and the constraints added through the forward- and reverse-order networks make the estimate of the camera's motion more accurate.
Drawings
FIG. 1 is a single-sequence flow chart of the present invention;
FIG. 2 is an image reconstruction process;
FIG. 3 is the combined forward- and reverse-order network.
Detailed Description
The invention is further described below with reference to the accompanying drawings; the method comprises the following steps:
step (1): obtaining color images and depth images of the same scene with an RGB-D camera
Use the RGB-D camera to obtain continuous color images I_{t−1}, I_t, I_{t+1} and the corresponding continuous depth images D_{t−1}, D_t, D_{t+1}, each of resolution H × W, where H and W are the height and width of the image respectively. Select the color image I_t at time t and its adjacent images I_{t−1} and I_{t+1}. Each color image has three RGB channels; the three color images I_{t−1}, I_t, I_{t+1} are spliced along the channel dimension into a 9-channel input feature_0 (where feature refers to a feature map obtained after a convolution layer, and 0 refers to the 0th convolution operation, i.e. the raw input).
Step (2): learning of inter-frame structure relationship based on convolutional neural network
The pose network is mainly composed of a convolutional neural network, with a ReLU activation layer after each convolution layer. The 9-channel sequence feature_0 first passes through a convolution layer with a 7 × 7 kernel, then a convolution layer with a 5 × 5 kernel, and then five convolution layers with 3 × 3 kernels, yielding a 256-channel feature map; a convolution layer with a 1 × 1 kernel then reduces the dimensionality to 12 channels. Finally, averaging over the H and W dimensions collapses each channel to a single number, giving a group of 12 numbers. This group is split into two groups of 6 numbers, denoted T_{t→t−1} and T_{t→t+1} respectively. For T_{t→t−1}, the first three numbers represent the displacement from the coordinate system of I_t to that of I_{t−1}, and the last three represent the corresponding rotation expressed in Euler angles; T_{t→t+1} is interpreted in the same way.
And (3): complete self-supervision by using the inter-frame camera pose relation combined with the distance information of the depth map and geometric knowledge. For image I_t the corresponding depth map is D_t, and the transformation to the image I_{t+1} at time t+1 is T_{t→t+1}. For a pixel p_t on image I_t, the corresponding depth value is D_t(p_t). From the camera projection model and the inter-frame triangular relation, the pixel p_{t+1} on I_{t+1} corresponding to p_t satisfies:
p_{t+1} ~ K · T_{t→t+1} · D_t(p_t) · K^{−1} · p_t ①
wherein K is the camera intrinsic matrix. According to equation ①, each pixel p_t of I_t is easily mapped to the corresponding location p_{t+1} in the space of I_{t+1}; from the pixel values at these projected locations and the positions of the initial pixels, the synthesized view Î_t corresponding to I_t is obtained by differentiable bilinear sampling interpolation. Each pixel value of the synthesized view is not a simple lookup: differentiable bilinear sampling obtains it by weighting the four pixels around the projected location, as shown in FIG. 2 of the accompanying drawings:
Î_t(p_t) = Σ_{i ∈ {top, bottom}, j ∈ {left, right}} w^{ij} · I_{t+1}(p_{t+1}^{ij}) ②
Where i ∈ {top, bottom} and j ∈ {left, right}, so that p_{t+1}^{ij} denotes the four pixels surrounding the projected location p_{t+1}, and w^{ij} denotes the weight of each of the four pixels, with Σ w^{ij} = 1. Once the synthesized view Î_t is obtained, it forms self-supervision with the original view I_t, giving the loss function between the two frames:
L_photo = Σ_p | I_t(p) − Î_t(p) | ③
Thus, by synthesizing a new view from the depth map and constructing a photometric error, self-supervision without external supervision is achieved.
And (4): use the mask network to prevent the training gradients from being corrupted. Because the conventional geometric knowledge used in the previous step presupposes that the image contains no dynamic objects, no occluding objects, and the like, a mask network is provided to prevent violations of these preconditions from inhibiting network training. The mask network is trained together with the pose network; upsampling through four 4 × 4 convolution layers and one 3 × 3 convolution layer yields the mask I_t^mask corresponding to a sequence. With the mask value P_t^mask(p) for each pixel p, the loss function for two frames becomes, from equation ③:
L_mask = Σ_p P_t^mask(p) · | I_t(p) − Î_t(p) | ④
And (5): use a reverse-order network, whose structure is shown in FIG. 3, to add constraints so that the network estimates the relative pose between frames more accurately. For forward-order input, the input sequence is (I_{t−1}, I_t, I_{t+1}); the image input of the reverse-order network is (I_{t+1}, I_t, I_{t−1}). A good pose estimation network can estimate the inter-frame pose relationship not only for a forward-order sequence but also when the image sequence is input in reverse order, and this increases the constraints on the network. For a sequence of three images, the poses obtained by the network for the forward order are T_{t→t−1} and T_{t→t+1}, and the poses obtained for the reverse order are T′_{t→t+1} and T′_{t→t−1}; ideally the forward-order and reverse-order estimates of the same transformation are identical. However, a network estimate always carries an error, and this error provides an additional constraint through the following loss function:
L_pose = Σ_{s ∈ {t−1, t+1}} ( ‖ t_{t→s} − t′_{t→s} ‖_1 + ω · ‖ r_{t→s} − r′_{t→s} ‖_1 ) ⑤
Here t_{t→s} denotes the displacement estimated by the network for the forward-order input and t′_{t→s} the displacement estimated for the reverse-order input; r_{t→s} denotes the rotation estimated for the forward-order input and r′_{t→s} the rotation estimated for the reverse-order input; ω denotes the weight balancing the rotation term.
Therefore, the pose network is trained with the added constraints, so that the network has the capability of accurately estimating the relative motion between frames.
Claims (1)
1. An RGBD camera-based unsupervised pose estimation network construction method is characterized by comprising the following specific steps:
step (1): obtaining color images and depth images of the same scene using an RGB-D camera
continuous color images I_{t−1}, I_t, I_{t+1} are captured with the RGB-D camera, together with the corresponding continuous depth images D_{t−1}, D_t, D_{t+1}, each of resolution H × W, where H and W are the height and width of the image respectively; a color image I_t at time t and the images I_{t−1}, I_{t+1} adjacent to it are selected; each color image has three RGB channels, and the three color images I_{t−1}, I_t, I_{t+1} are spliced along the channel dimension into a 9-channel sequence feature_0, wherein feature refers to a feature map obtained after convolution, and 0 refers to the 0th convolution operation, i.e. the raw input;
step (2): pose network based learning of inter-frame structural relationships
the pose network is formed by a convolutional neural network, with a ReLU activation layer after each convolution layer; the 9-channel sequence feature_0 first passes through a convolution layer with a 7 × 7 kernel, then a convolution layer with a 5 × 5 kernel, and then five convolution layers with 3 × 3 kernels, yielding a 256-channel feature map; a convolution layer with a 1 × 1 kernel then reduces the dimensionality to 12 channels; finally, averaging over the H and W dimensions collapses each channel to a single number, giving a group of 12 numbers; this group is split into two groups of 6 numbers, denoted T_{t→t−1} and T_{t→t+1} respectively; for T_{t→t−1}, the first three numbers represent the displacement from the coordinate system of I_t to that of I_{t−1}, and the last three represent the corresponding rotation expressed in Euler angles; T_{t→t+1} is interpreted in the same way;
and (3): completing self-supervision by using the inter-frame camera pose relation combined with the distance information of the depth map and geometric knowledge:
for imagesThe corresponding depth map isImage corresponding to time t +1Has a conversion relation of T t→t+1 (ii) a For imagesA certain point pixel onIs correspondingly provided withIs asThe camera projection model and the inter-frame triangular relation can be used for obtainingCorrespond toIs formed by a plurality of pixelsThere is a relationship:
p_{t+1} ~ K · T_{t→t+1} · D_t(p_t) · K^{−1} · p_t ①
wherein K is the camera intrinsic matrix; according to equation ①, each pixel p_t of I_t is mapped to the corresponding location p_{t+1} in the space of I_{t+1}; from the pixel values at these projected locations and the positions of the initial pixels, the synthesized view Î_t corresponding to I_t is obtained by differentiable bilinear sampling interpolation; each pixel value of the synthesized view is not a simple lookup: differentiable bilinear sampling obtains it by weighting the four pixels around the projected location;
Î_t(p_t) = Σ_{i ∈ {top, bottom}, j ∈ {left, right}} w^{ij} · I_{t+1}(p_{t+1}^{ij}) ②
wherein i ∈ {top, bottom} and j ∈ {left, right}, so that p_{t+1}^{ij} denotes the four pixels surrounding the projected location p_{t+1}, and w^{ij} denotes the weight of each of the four pixels, with Σ w^{ij} = 1; once the synthesized view Î_t is obtained, it forms self-supervision with the original view I_t, giving the loss function between the two frames:
L_photo = Σ_p | I_t(p) − Î_t(p) | ③
thereby, synthesizing a new view from the depth map and constructing a photometric error achieves self-supervision without any external supervision;
and (4): preventing corruption of the training gradients with a mask network
the mask network and the pose network share the first five convolution layers and are trained together; upsampling through four 4 × 4 convolution layers and one 3 × 3 convolution layer yields the mask I_t^mask corresponding to a sequence; with the mask value P_t^mask(p) for each pixel p, the loss function for two frames becomes, from equation ③:
L_mask = Σ_p P_t^mask(p) · | I_t(p) − Î_t(p) | ④
And (5): adding constraints through a reverse-order network so that the network estimates the relative pose between frames more accurately
when the forward-order image input is used, the input sequence is (I_{t−1}, I_t, I_{t+1}), and the image input of the reverse-order network is (I_{t+1}, I_t, I_{t−1}); a good pose estimation network can estimate the inter-frame pose relationship not only for a forward-order sequence but also when the image sequence is input in reverse order, which increases the constraints; for a sequence of three images, the poses obtained by the network for the forward order are T_{t→t−1} and T_{t→t+1}, and the poses obtained for the reverse order are T′_{t→t+1} and T′_{t→t−1}; ideally the forward-order and reverse-order estimates of the same transformation are identical; however, a network estimate always carries an error, and this error provides an additional constraint through the following loss function:
L_pose = Σ_{s ∈ {t−1, t+1}} ( ‖ t_{t→s} − t′_{t→s} ‖_1 + ω · ‖ r_{t→s} − r′_{t→s} ‖_1 ) ⑤
wherein t_{t→s} denotes the displacement estimated by the network for the forward-order input, t′_{t→s} the displacement estimated for the reverse-order input, r_{t→s} the rotation estimated for the forward-order input, r′_{t→s} the rotation estimated for the reverse-order input, and ω the weight;
therefore, the pose network is trained with the added constraints, so that the network has the capability of accurately estimating the relative motion between frames.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010034081.2A CN111260680B (en) | 2020-01-13 | 2020-01-13 | RGBD camera-based unsupervised pose estimation network construction method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010034081.2A CN111260680B (en) | 2020-01-13 | 2020-01-13 | RGBD camera-based unsupervised pose estimation network construction method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111260680A CN111260680A (en) | 2020-06-09 |
CN111260680B true CN111260680B (en) | 2023-01-03 |
Family
ID=70954018
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010034081.2A Active CN111260680B (en) | 2020-01-13 | 2020-01-13 | RGBD camera-based unsupervised pose estimation network construction method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111260680B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111739078B (en) * | 2020-06-15 | 2022-11-18 | 大连理工大学 | Monocular unsupervised depth estimation method based on context attention mechanism |
CN112489128A (en) * | 2020-12-14 | 2021-03-12 | 南通大学 | RGB-D indoor unmanned aerial vehicle positioning implementation method based on unsupervised deep learning |
CN113888629A (en) * | 2021-10-28 | 2022-01-04 | 浙江大学 | RGBD camera-based rapid object three-dimensional pose estimation method |
CN114998411B (en) * | 2022-04-29 | 2024-01-09 | 中国科学院上海微系统与信息技术研究所 | Self-supervision monocular depth estimation method and device combining space-time enhancement luminosity loss |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106658023A (en) * | 2016-12-21 | 2017-05-10 | 山东大学 | End-to-end visual odometer and method based on deep learning |
CN110490928A (en) * | 2019-07-05 | 2019-11-22 | 天津大学 | A kind of camera Attitude estimation method based on deep neural network |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111915663B (en) * | 2016-09-15 | 2024-04-30 | Google LLC | Image depth prediction neural network |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106658023A (en) * | 2016-12-21 | 2017-05-10 | 山东大学 | End-to-end visual odometer and method based on deep learning |
CN110490928A (en) * | 2019-07-05 | 2019-11-22 | 天津大学 | A kind of camera Attitude estimation method based on deep neural network |
Non-Patent Citations (5)
Title |
---|
A Positioning System Based on Monocular Vision for Industrial Robots; Mingyu Gao et al.; IEEE Xplore; 2016-11-03; full text *
A Target Detection System for Mobile Robot Based On Single Shot Multibox Detector Neural Network; Yujie Du; IEEE Xplore; 2019-05-30; full text *
Circular Trajectory Planning with Pose Control for Six-DOF Manipulator; Jincan Li et al.; IEEE Xplore; 2019-06-21; full text *
Real-Time Multi-Person Pose Estimation and Tracking with Deep Learning; Xu Zhongxiong et al.; Journal of China Academy of Electronics and Information Technology; 2018-08-20 (No. 04); full text *
Unsupervised Monocular Depth Estimation Fusing Dilated Convolutional Networks and SLAM; Dai Renyue et al.; Laser & Optoelectronics Progress; 2019-09-02 (No. 06); full text *
Also Published As
Publication number | Publication date |
---|---|
CN111260680A (en) | 2020-06-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111260680B (en) | RGBD camera-based unsupervised pose estimation network construction method | |
CN110782490B (en) | Video depth map estimation method and device with space-time consistency | |
CN110490928B (en) | Camera attitude estimation method based on deep neural network | |
CN109377530B (en) | Binocular depth estimation method based on depth neural network | |
CN108765479A (en) | Using deep learning to monocular view estimation of Depth optimization method in video sequence | |
CN111105432B (en) | Unsupervised end-to-end driving environment perception method based on deep learning | |
CN111354030B (en) | Method for generating unsupervised monocular image depth map embedded into SENet unit | |
CN108986166A (en) | A kind of monocular vision mileage prediction technique and odometer based on semi-supervised learning | |
CN115187638B (en) | Unsupervised monocular depth estimation method based on optical flow mask | |
CN113554032B (en) | Remote sensing image segmentation method based on multi-path parallel network of high perception | |
CN109272493A (en) | A kind of monocular vision odometer method based on recursive convolution neural network | |
CN112767467B (en) | Double-image depth estimation method based on self-supervision deep learning | |
CN110992414B (en) | Indoor monocular scene depth estimation method based on convolutional neural network | |
CN113077505A (en) | Optimization method of monocular depth estimation network based on contrast learning | |
CN113283525A (en) | Image matching method based on deep learning | |
CN115883764A (en) | Underwater high-speed video frame interpolation method and system based on data cooperation | |
CN115035171A (en) | Self-supervision monocular depth estimation method based on self-attention-guidance feature fusion | |
CN112241959A (en) | Attention mechanism generation semantic segmentation method based on superpixels | |
CN113610912B (en) | System and method for estimating monocular depth of low-resolution image in three-dimensional scene reconstruction | |
CN112419411B (en) | Realization method of vision odometer based on convolutional neural network and optical flow characteristics | |
Berenguel-Baeta et al. | Fredsnet: Joint monocular depth and semantic segmentation with fast fourier convolutions from single panoramas | |
Liu et al. | Towards better data exploitation in self-supervised monocular depth estimation | |
CN112132880A (en) | Real-time dense depth estimation method based on sparse measurement and monocular RGB (red, green and blue) image | |
CN112164078B (en) | RGB-D multi-scale semantic segmentation method based on encoder-decoder | |
CN117197229B (en) | Multi-stage estimation monocular vision odometer method based on brightness alignment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||