CN114663496A - Monocular visual odometry method based on a Kalman pose estimation network
- Publication number: CN114663496A
- Application number: CN202210290482.3A
- Authority: CN (China)
- Prior art keywords: pose, estimation network, loss function, depth
- Legal status: Granted
Classifications
- G06T7/70: Determining position or orientation of objects or cameras
- G06F18/25: Fusion techniques
- G06N3/045: Combinations of networks
- G06N3/048: Activation functions
- G06N3/08: Learning methods
- G06T7/277: Analysis of motion involving stochastic approaches, e.g. using Kalman filters
- G06T7/55: Depth or shape recovery from multiple images
- G06T2207/10016: Video; image sequence
- G06T2207/10028: Range image; depth image; 3D point clouds
- G06T2207/20076: Probabilistic image processing
- G06T2207/20081: Training; learning
- G06T2207/20084: Artificial neural networks [ANN]
Abstract
The invention provides a monocular visual odometry method based on a Kalman pose estimation network, belonging to the technical field of computer vision. The method comprises the following steps: constructing a depth estimation network and a pose estimation network based on Kalman filtering; calculating a motion-weighted photometric error loss function over a video image sequence from the pose transformation between each pair of adjacent frames output by the pose estimation network and the depth image of each input frame output by the depth estimation network; introducing a variational autoencoder structure into the constructed pose estimation network and depth estimation network, and calculating a variational autoencoder loss function; training the pose estimation network and the depth estimation network with the obtained photometric error loss function and variational autoencoder loss function, using a training strategy designed for the missing-frame case; and estimating the camera pose corresponding to each frame with the trained pose estimation network. The method improves the accuracy of camera pose estimation and adapts to missing frames.
Description
Technical Field
The invention relates to the technical field of computer vision, and in particular to a monocular visual odometry method based on a Kalman pose estimation network.
Background
Visual odometry, a component of simultaneous localization and mapping (SLAM), is widely applied in robot navigation, autonomous driving, augmented reality, wearable computing, and related fields. Visual odometry estimates the current position and orientation of a camera from input video frames. According to the type and number of sensors, it can be classified into monocular visual odometry, stereo visual odometry, visual-inertial odometry, and so on. Monocular visual odometry has the advantages of requiring only a single camera, low hardware requirements, and no need for rectification.
Traditional visual odometry first extracts and matches image features, then estimates the relative pose between adjacent frames from their geometric relationship. This approach has achieved good results in practice and remains the mainstream, but it struggles to balance computational cost against robustness.
Monocular visual odometry based on deep learning can be divided into supervised and self-supervised methods. Self-supervised methods require only input video frames: no ground-truth poses need to be collected and no additional equipment is needed, so they are more widely applicable than supervised methods.
Many existing self-supervised methods do not consider the association between frames, leaving inter-frame information underused; as a result, the trained network struggles to estimate accurate poses and cannot adapt to missing frames. In addition, moving objects in the scene are inconsistent with the Euclidean transformation induced by camera motion and violate the static-scene assumption, so the scene motion is hard to describe with a single Euclidean transformation and the network's estimates are biased.
Disclosure of Invention
Embodiments of the invention provide a monocular visual odometry method based on a Kalman pose estimation network, which improves the accuracy of camera pose estimation and adapts to missing frames. The technical scheme is as follows:
An embodiment of the invention provides a monocular visual odometry method based on a Kalman pose estimation network, comprising the following steps:
constructing a depth estimation network and a pose estimation network based on Kalman filtering, wherein the pose estimation network outputs the pose transformation between each pair of adjacent input frames and the depth estimation network outputs a depth image for each input frame;
calculating a motion-weighted photometric error loss function over a video image sequence from the output pose transformation between each pair of adjacent frames and the depth image of each input frame;
introducing a variational autoencoder structure into the constructed pose estimation network and depth estimation network, and calculating a variational autoencoder loss function;
training the pose estimation network and the depth estimation network with the obtained photometric error loss function and variational autoencoder loss function, using a training strategy for the missing-frame case;
and estimating the camera pose corresponding to each frame of the video image sequence whose poses are to be estimated with the trained pose estimation network.
Further, the pose estimation network includes a pose measurement network, a pose weighted fusion network, a pose update network, and a pose prediction network; wherein,
the input adjacent frames $I_{t-1}$ and $I_t$ are encoded by the pose measurement network to obtain the pose measurement vector $C_{measure,t}$ at time t:
$$C_{measure,t} = \mathrm{Measure}(I_{t-1}, I_t)$$
where $I_{t-1}$ and $I_t$ are the images at times t-1 and t, and Measure() is the pose measurement network;
the pose measurement vector $C_{measure,t}$ and the pose prediction vector $C_{pred,t}$ are input to the pose weighted fusion network to obtain the pose weighted fusion vector $C_{fuse,t}$ at time t:
$$C_{fuse,t} = (1 - W_t) \cdot C_{measure,t} + W_t \cdot C_{pred,t}$$
where $W_t$ is the weight in $[0, 1]$ output by the last fully connected layer of the pose weighted fusion network; $C_{pred,t}$ is the pose prediction vector at time t output by the pose prediction network when the adjacent frames $I_{t-2}$, $I_{t-1}$ were input to the pose estimation network, $C_{pred,t} = \mathrm{Predict}(C_{fuse,t-1})$, with $C_{fuse,t-1}$ the pose weighted fusion vector at time t-1 and Predict() the pose prediction network;
the pose weighted fusion vector $C_{fuse,t}$ is input to the pose update network to estimate the pose transformation $T_{t \to t-1}$:
$$T_{t \to t-1} = \mathrm{Update}(C_{fuse,t})$$
where Update() is the pose update network and $T_{t \to t-1}$ is the 6-degree-of-freedom relative pose vector between $I_{t-1}$ and $I_t$, comprising a relative rotation and a relative displacement.
Furthermore, both the pose estimation network and the depth estimation network adopt encoder-decoder structures.
Further, calculating the motion-weighted photometric error loss function over a video image sequence from the output pose transformation between each pair of adjacent frames and the depth image of each input frame comprises:
chaining (cumulatively multiplying) the pose transformations between each pair of adjacent frames output by the pose estimation network to obtain pose transformations over longer periods, and, based on these, calculating motion-weighted photometric errors between images;
and calculating the motion-weighted photometric error loss function of the video image sequence from the computed photometric errors.
Further, chaining the pose transformations between adjacent frames to obtain pose transformations over longer periods, and calculating motion-weighted photometric errors between images based on them, comprises:
for a video image sequence of length N with corresponding times $t_0, t_1, \ldots, t_{N-1}$, cumulatively multiplying the poses between each pair of adjacent frames output by the pose estimation network to obtain the pose transformation over a longer period,
$$T_{t_j \to t_i} = \prod_{k=i}^{j-1} T_{t_{k+1} \to t_k}, \qquad 0 \le i < j \le N-1,$$
where $T_{t_j \to t_i}$ is the pose transformation from time $t_j$ to time $t_i$ and N is the length of each batch of video image sequences input to the pose estimation network and the depth estimation network;
for a point $p_{t_i}$ in image $I_{t_i}$, restoring its three-dimensional coordinates from the depth image $D_{t_i}$; its corresponding projected point $p_{t_j}$ on image $I_{t_j}$ is expressed as
$$p_{t_j} \sim K \, T_{t_j \to t_i}^{-1} \, D_{t_i}(p_{t_i}) \, K^{-1} \, p_{t_i},$$
where K is the camera intrinsic matrix;
using the resulting motion weighting term $W_{mw}$, calculating the motion-weighted photometric error between $I_{t_i}$ and its reconstruction $\hat{I}_{t_i}$ sampled from $I_{t_j}$:
$$L_p'(t_i, t_j) = W_{mw} \odot \left( \alpha_0 \, \frac{1 - \mathrm{SSIM}(I_{t_i}, \hat{I}_{t_i})}{2} + \alpha_1 \, \| I_{t_i} - \hat{I}_{t_i} \|_1 + \alpha_2 \, \| I_{t_i} - \hat{I}_{t_i} \|_2 \right)$$
where $L_p'(t_i, t_j)$ is the motion-weighted photometric error between $I_{t_i}$ and $\hat{I}_{t_i}$, $\mathrm{SSIM}(\cdot,\cdot)$ is the structural similarity between the original image $I_{t_i}$ and the reconstructed image $\hat{I}_{t_i}$, $\alpha_0$, $\alpha_1$, $\alpha_2$ are hyper-parameters controlling the proportion of each part, $\odot$ denotes the pixel-wise product, $\|\cdot\|_1$ is the 1-norm, and $\|\cdot\|_2$ is the 2-norm.
Further, before using the motion weighting term $W_{mw}$ to calculate the motion-weighted photometric error between $I_{t_i}$ and $\hat{I}_{t_i}$, the method further comprises:
determining the pixels involved in the photometric error calculation and marking them with a mask:
$$\text{mask} = \left[ \, \| I_{t_i} - \hat{I}_{t_i} \|_* < \| I_{t_i} - I_{t_j} \|_* \, \right]$$
where $[\cdot]$ is the Iverson bracket, $I_{t_i}$ is the original image at time $t_i$, $I_{t_j}$ is the original image at time $t_j$, $\hat{I}_{t_i}$ is the reconstruction of $I_{t_i}$ obtained by sampling from $I_{t_j}$, and $\|\cdot\|_*$ denotes the photometric error, i.e. a 1-norm or 2-norm;
when calculating the motion-weighted photometric error between $I_{t_i}$ and $\hat{I}_{t_i}$, only the mask-marked pixels are used.
Further, the photometric error loss function averages the motion-weighted photometric errors over all frame pairs of the sequence:
$$L_p = \frac{1}{|\mathcal{S}|} \sum_{(t_i, t_j) \in \mathcal{S}} L_p'(t_i, t_j)$$
where $L_p$ is the photometric error loss function and $\mathcal{S}$ is the set of frame pairs used for reconstruction.
Further, the variational autoencoder loss function is expressed as:
$$L_{VAE} = \lambda_1 \, \mathrm{KL}\!\left( q_d(c_d \mid x_d) \,\|\, p_\eta(c) \right) + \lambda_2 \, \mathrm{KL}\!\left( q_p(c_p \mid x_p) \,\|\, p_\eta(c) \right) - \mathbb{E}_{c_d \sim q_d(c_d \mid x_d),\, c_p \sim q_p(c_p \mid x_p)}\!\left[ \log p(\hat{x} \mid c_d, c_p) \right]$$
where $L_{VAE}$ is the variational autoencoder loss function; $x_d$, $x_p$ are input images; $\lambda_1$, $\lambda_2$ are hyper-parameters; $p_\eta(c)$ is the prior distribution with argument c; $q_d(c_d \mid x_d)$ is the sampling distribution of the depth estimation network's code $c_d$; $q_p(c_p \mid x_p)$ is the sampling distribution of the pose estimation network's code $c_p$; $\mathrm{KL}(\cdot \| \cdot)$ is the KL divergence, so $\mathrm{KL}(q_d(c_d \mid x_d) \| p_\eta(c))$ is the divergence of $q_d(c_d \mid x_d)$ from $p_\eta(c)$ and $\mathrm{KL}(q_p(c_p \mid x_p) \| p_\eta(c))$ the divergence of $q_p(c_p \mid x_p)$ from $p_\eta(c)$; $p(\hat{x} \mid c_d, c_p)$ is the probability distribution of the reconstructed image $\hat{x}$ generated from the outputs obtained by feeding $c_d$ and $c_p$ into the decoders of the depth estimation network and the pose estimation network, respectively; and $\mathbb{E}[\cdot]$ is the mathematical expectation, taken under $c_d \sim q_d(c_d \mid x_d)$ and $c_p \sim q_p(c_p \mid x_p)$.
Further, training the pose estimation network and the depth estimation network with a training strategy for the missing-frame case, based on the obtained photometric error loss function and variational autoencoder loss function, comprises:
for the output of the depth estimation network, computing a depth smoothing loss function:
$$L_s = \left| \partial_x d_t \right| e^{-\left| \partial_x I_t \right|} + \left| \partial_y d_t \right| e^{-\left| \partial_y I_t \right|}$$
where $d_t$ is the disparity, inversely proportional to the depth image $D_t$; $\partial_x$, $\partial_y$ denote partial derivatives in the x and y directions; and $I_t$ is the image at time t;
determining the final loss function L from the obtained depth smoothing loss function, photometric error loss function, and variational autoencoder loss function:
$$L = L_p + \lambda L_s + L_{VAE}$$
where $\lambda$ is a hyper-parameter controlling the weight of the depth smoothing loss function, $L_p$ is the photometric error loss function, and $L_{VAE}$ is the variational autoencoder loss function;
and training the pose estimation network and the depth estimation network with the obtained final loss function, using the training strategy for the missing-frame case.
Further, training the pose estimation network and the depth estimation network with the training strategy for the missing-frame case comprises:
inputting all images of a batch of video image sequences into the pose estimation network and the depth estimation network, and training both networks;
and inputting all images of a batch of video image sequences into the depth estimation network while setting one or more frames of the batch to zero before inputting them into the pose estimation network, and training both networks.
The monocular visual odometry method based on the Kalman pose estimation network provided by the embodiments of the invention has at least the following advantages:
(1) Many existing self-supervised methods do not consider the association between frames and underuse inter-frame information, so the trained network struggles to estimate accurate poses and cannot adapt to missing frames. This embodiment constructs a pose estimation network based on Kalman filtering and designs, on top of it, a training strategy for the missing-frame case, so the pose estimation network can exploit inter-frame information when estimating the current pose and is better suited to missing frames;
(2) Moving objects that may exist in the scene are inconsistent with the Euclidean transformation of the scene, violating the static-scene assumption, so the scene motion is hard to describe with a single Euclidean transformation and the pose estimation network's results are biased. This embodiment designs a motion-weighted photometric error that down-weights such pixels, reducing the interference of moving objects during training.
Drawings
To illustrate the technical solutions in the embodiments of the invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described here are only some embodiments of the invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flow chart of the monocular visual odometry method based on a Kalman pose estimation network according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of the pose estimation network according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the workflow of the monocular visual odometry method based on a Kalman pose estimation network according to an embodiment of the present invention;
fig. 4 is a schematic diagram of the trajectories estimated on sequences 09 and 10 of the KITTI odometry dataset by the method provided by an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
As shown in fig. 1, an embodiment of the invention provides a monocular visual odometry method based on a Kalman pose estimation network, including:
S101, constructing a depth estimation network (DepthNet) and a pose estimation network (KF-PoseNet) based on Kalman filtering; the pose estimation network outputs the pose transformation between each pair of adjacent input frames, and the depth estimation network outputs a depth image for each input frame;
As shown in fig. 2, the pose estimation network comprises a pose measurement network, a pose weighted fusion network, a pose update network, and a pose prediction network, configured as shown in table 1.
The pose measurement network comprises a ResNet50 backbone, three convolutional layers, and a global average pooling layer. The first two of the three convolutional layers use the ReLU (Rectified Linear Unit) activation function; the last is a plain convolutional layer without activation. The input of the pose measurement network passes through ResNet50, then sequentially through the three convolutional layers, and is finally output through the global average pooling layer; the pose measurement network uses the ResNet50 structure as its encoder.
The pose weighted fusion network comprises 4 fully connected layers and a weighted fusion layer. The first three fully connected layers use ReLU as the activation function; the last uses a Sigmoid. $C_{measure,t}$ and $C_{pred,t}$ are fed into the first fully connected layer and then pass through the remaining three, yielding a weight coefficient in the range 0 to 1; this weight coefficient, together with $C_{measure,t}$ and $C_{pred,t}$, is sent into the weighted fusion layer.
The pose update network comprises 4 fully connected layers connected in sequence, the first three of which use ReLU as the activation function.
Like the pose update network, the pose prediction network also comprises 4 fully connected layers connected in sequence.
TABLE 1 KF-PoseNet network architecture
In this embodiment, the working process of the pose estimation network is as follows:
The input adjacent frames $I_{t-1}$ and $I_t$ are encoded by the pose measurement network to obtain the pose measurement vector $C_{measure,t}$ at time t:
$$C_{measure,t} = \mathrm{Measure}(I_{t-1}, I_t)$$
where $I_{t-1}$ and $I_t$ are the images at times t-1 and t, and Measure() is the pose measurement network. Note that $C_{measure,t}$ is not a 6-degree-of-freedom pose vector, but a coded vector of the pose information of the image pair $(I_{t-1}, I_t)$.
The pose measurement vector $C_{measure,t}$ and the pose prediction vector $C_{pred,t}$ are input to the pose weighted fusion network to obtain the pose weighted fusion vector $C_{fuse,t}$ at time t:
$$C_{fuse,t} = (1 - W_t) \cdot C_{measure,t} + W_t \cdot C_{pred,t}$$
where $W_t = \mathrm{Weight}(C_{measure,t}, C_{pred,t})$ is the weight in $[0, 1]$ output by the last fully connected layer of the pose weighted fusion network, with Weight() denoting its 4 fully connected layers; $C_{pred,t}$ is the pose prediction vector at time t output by the pose prediction network when the adjacent frames $I_{t-2}$, $I_{t-1}$ were input to the pose estimation network, $C_{pred,t} = \mathrm{Predict}(C_{fuse,t-1})$, with $C_{fuse,t-1}$ the pose weighted fusion vector at time t-1 and Predict() the pose prediction network.
The pose weighted fusion vector $C_{fuse,t}$ is input to the pose update network to estimate the final pose transformation $T_{t \to t-1}$:
$$T_{t \to t-1} = \mathrm{Update}(C_{fuse,t})$$
where Update() is the pose update network and $T_{t \to t-1}$ is the 6-degree-of-freedom relative pose vector between $I_{t-1}$ and $I_t$.
As shown in fig. 3, the input of KF-PoseNet is two adjacent frames and the output is a 6-DoF relative pose vector, whose first three elements represent the 3-DoF relative rotation R and whose last three elements represent the 3-DoF relative displacement t.
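A minimal PyTorch sketch may help make this measure-fuse-update-predict data flow concrete. The module names mirror Measure(), Weight(), Update(), and Predict() above, but the stand-in convolutional encoder and all layer widths are illustrative assumptions, not the configuration of table 1:

```python
# Sketch of KF-PoseNet's Kalman-style step (assumed layer sizes, not Table 1).
import torch
import torch.nn as nn

class KFPoseNet(nn.Module):
    def __init__(self, code_dim=1024):
        super().__init__()
        # Measure(): encodes a concatenated image pair into a pose code
        # (the patent uses ResNet50 + three conv layers + global average pooling).
        self.measure = nn.Sequential(
            nn.Conv2d(6, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(64, code_dim, 3, stride=2, padding=1),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # Weight(): 4 fully connected layers, Sigmoid last -> W_t in [0, 1].
        self.weight = nn.Sequential(
            nn.Linear(2 * code_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 1), nn.Sigmoid())
        # Update(): maps the fused code to the 6-DoF relative pose (r, t).
        self.update = nn.Sequential(
            nn.Linear(code_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 6))
        # Predict(): predicts the next step's pose code from the fused code.
        self.predict = nn.Sequential(
            nn.Linear(code_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, code_dim))

    def step(self, img_prev, img_cur, c_pred):
        c_meas = self.measure(torch.cat([img_prev, img_cur], dim=1))
        w = self.weight(torch.cat([c_meas, c_pred], dim=1))   # W_t
        c_fuse = (1 - w) * c_meas + w * c_pred                # C_fuse,t
        pose = self.update(c_fuse)                            # T_{t -> t-1}
        c_pred_next = self.predict(c_fuse)                    # C_pred,t+1
        return pose, c_pred_next
```

Over a sequence, step() would be called frame by frame, feeding c_pred_next back in as c_pred; how the first prediction code is seeded (e.g. zeros) is likewise an assumption here.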
In this embodiment, both the pose estimation network and the depth estimation network adopt encoder-decoder structures. The encoder of the pose estimation network is the ResNet50 structure inside the pose measurement network; its decoder is the remainder: the part of the pose measurement network other than ResNet50, plus the pose weighted fusion network, the pose prediction network, and the pose update network.
In this embodiment, the depth estimation network (DepthNet) also uses the ResNet50 structure as its encoder, uses a multilayer deconvolution structure similar to the DispNet decoder as its decoder connected to the encoder through skip links, and applies a Sigmoid activation at the output layer. The input of DepthNet is a single frame and the output is a normalized disparity d. To obtain the depth D, the disparity is converted by $D = 1/(a \cdot d + b)$, where a and b are parameters limiting the output range so that the depth lies between 0.1 and 100.
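As a minimal sketch, one way to realize this conversion is below, assuming d = 1 maps to the minimum depth and d = 0 to the maximum; the patent states only the output range, so this endpoint mapping is an assumption:

```python
def disp_to_depth(d, min_depth=0.1, max_depth=100.0):
    """Convert normalized disparity d in [0, 1] to depth D = 1/(a*d + b)."""
    b = 1.0 / max_depth                      # d = 0 -> depth = max_depth
    a = 1.0 / min_depth - 1.0 / max_depth    # d = 1 -> depth = min_depth
    return 1.0 / (a * d + b)
```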
In this embodiment, in order to control the memory usage and keep the details as much as possible, the input RGB images of the pose estimation network and the depth estimation network are scaled to 832 × 256.
In this embodiment, a pair of adjacent frames consists of the image $I_t$ at the current time t and the image $I_{t-1}$ at the previous time t-1. The adjacent frames $I_t$ and $I_{t-1}$ are input to the pose estimation network and the depth estimation network to obtain the pose transformation $T_{t \to t-1}$ between them and the depth image $D_t$ of each input frame.
S102, calculating a motion-weighted photometric error loss function over the video image sequence from the output pose transformation between each pair of adjacent frames and the depth image of each input frame; specifically:
A1, chaining the pose transformations between adjacent frames output by the pose estimation network to obtain pose transformations over longer periods, and calculating motion-weighted photometric errors between images based on them;
In this embodiment, there may be fast-moving objects in the scene. Such objects are inconsistent with the Euclidean transformation induced by the camera motion, so treating their pixels the same as all others when training the network is clearly unreasonable. Since the motion amplitude in the dataset is modest and illumination changes are not pronounced, the brightness of a pixel at the same position in two adjacent frames does not change much. Based on this, to reduce the influence of fast-moving objects, the invention designs a motion-weighted photometric error. To make the network consider the consistency of pose transformations over longer horizons, this embodiment uses consecutive multi-frame images to compute photometric errors constrained by long-horizon poses, specifically:
For a video image sequence of length N with corresponding times $t_0, t_1, \ldots, t_{N-1}$, the poses between each pair of adjacent frames output by the pose estimation network are cumulatively multiplied to obtain the pose transformation over a longer period:
$$T_{t_j \to t_i} = \prod_{k=i}^{j-1} T_{t_{k+1} \to t_k}, \qquad 0 \le i < j \le N-1,$$
where $T_{t_j \to t_i}$ is the pose transformation from time $t_j$ to time $t_i$, and N is the length of each batch of video image sequences input to the pose estimation network and the depth estimation network.
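A short sketch of this accumulation with 4x4 homogeneous transforms, assuming adjacent_T[k] stores $T_{t_{k+1} \to t_k}$ as a 4x4 matrix (the list layout is an assumption):

```python
import torch

def accumulate_pose(adjacent_T, i, j):
    """Compose T_{t_j -> t_i} from adjacent-frame transforms, for i < j."""
    T = torch.eye(4)
    for k in range(i, j):
        # rightmost factor T_{t_j -> t_{j-1}} is applied first to a point in frame t_j
        T = T @ adjacent_T[k]
    return T
```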
Then, for a point $p_{t_i}$ in image $I_{t_i}$, its three-dimensional coordinates can be restored from the depth image $D_{t_i}$; its corresponding projected point $p_{t_j}$ on image $I_{t_j}$ can be calculated as
$$p_{t_j} \sim K \, T_{t_j \to t_i}^{-1} \, D_{t_i}(p_{t_i}) \, K^{-1} \, p_{t_i},$$
where K is the camera intrinsic matrix; the formula omits part of the homogeneous-coordinate bookkeeping.
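This projection is the basis of view reconstruction: every pixel of $I_{t_i}$ is back-projected with its depth, moved into the other camera, re-projected, and used to sample $I_{t_j}$. A minimal PyTorch sketch of that inverse warp follows; the (B, C, H, W) tensor layout, the single shared intrinsic matrix K, and the border padding are assumptions:

```python
import torch
import torch.nn.functional as F

def reconstruct(img_j, depth_i, T_i_to_j, K):
    """Reconstruct I_{t_i} by sampling I_{t_j}; T_i_to_j is (B, 4, 4), K is (3, 3)."""
    B, _, H, W = depth_i.shape
    dev = depth_i.device
    ys, xs = torch.meshgrid(torch.arange(H, device=dev),
                            torch.arange(W, device=dev), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)]).float().view(3, -1)  # (3, HW)
    cam = (K.inverse() @ pix) * depth_i.view(B, 1, -1)       # back-project (B, 3, HW)
    cam = torch.cat([cam, torch.ones_like(cam[:, :1])], 1)   # homogeneous (B, 4, HW)
    proj = K @ (T_i_to_j @ cam)[:, :3]                       # into camera j (B, 3, HW)
    uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)           # perspective divide
    grid = torch.stack([2 * uv[:, 0] / (W - 1) - 1,          # normalize to [-1, 1]
                        2 * uv[:, 1] / (H - 1) - 1], dim=-1).view(B, H, W, 2)
    return F.grid_sample(img_j, grid, padding_mode="border", align_corners=True)
```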
Finally, using the obtained motion weighting term $W_{mw}$, the motion-weighted photometric error between $I_{t_i}$ and its reconstruction $\hat{I}_{t_i}$ is computed:
$$L_p'(t_i, t_j) = W_{mw} \odot \left( \alpha_0 \, \frac{1 - \mathrm{SSIM}(I_{t_i}, \hat{I}_{t_i})}{2} + \alpha_1 \, \| I_{t_i} - \hat{I}_{t_i} \|_1 + \alpha_2 \, \| I_{t_i} - \hat{I}_{t_i} \|_2 \right)$$
where $\mathrm{SSIM}(\cdot,\cdot)$ is the structural similarity between the original image $I_{t_i}$ and the reconstructed image $\hat{I}_{t_i}$, $\alpha_0$, $\alpha_1$, $\alpha_2$ are hyper-parameters controlling the proportion of each part, $\odot$ denotes the pixel-wise product, $\|\cdot\|_1$ is the 1-norm, and $\|\cdot\|_2$ is the 2-norm.
In this embodiment, the motion weighting term $W_{mw}$ weights the computed photometric error pixel by pixel, yielding the motion-weighted photometric error.
Further, when an object that is stationary relative to the camera appears in the field of view, the accuracy of depth estimation can suffer, with the estimated depth tending toward infinity. This embodiment therefore also automatically marks such static pixels and removes them from training: pixels whose reconstruction error is not smaller than the photometric error between the current image and the reference image are regarded as stationary relative to the camera, and the depth network is trained only on pixels whose reconstruction error is smaller than that error (i.e., the pixels involved in the photometric error calculation).
In this embodiment, the pixels involved in the photometric error calculation are determined and marked with a mask:
$$\text{mask} = \left[ \, \| I_{t_i} - \hat{I}_{t_i} \|_* < \| I_{t_i} - I_{t_j} \|_* \, \right]$$
where $[\cdot]$ is the Iverson bracket, $I_{t_i}$ is the original image at time $t_i$, $I_{t_j}$ is the original image at time $t_j$, $\hat{I}_{t_i}$ is the reconstruction of $I_{t_i}$ obtained by sampling from $I_{t_j}$, and $\|\cdot\|_*$ denotes the photometric error, i.e. a 1-norm or 2-norm.
When calculating the motion-weighted photometric error between $I_{t_i}$ and $\hat{I}_{t_i}$, only the mask-marked pixels are used, and the same pixels are then used for network training.
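A sketch of this automatic mask (a channel-averaged 1-norm variant; averaging over color channels is an assumption):

```python
def stationary_mask(img_i, img_j, img_i_recon):
    """1 where the reconstruction beats simply copying I_{t_j}; 0 for static pixels."""
    err_recon = (img_i - img_i_recon).abs().mean(dim=1, keepdim=True)
    err_ident = (img_i - img_j).abs().mean(dim=1, keepdim=True)
    return (err_recon < err_ident).float()
```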
A2, calculating the motion-weighted photometric error loss function $L_p$ of the video image sequence from the computed photometric errors:
$$L_p = \frac{1}{|\mathcal{S}|} \sum_{(t_i, t_j) \in \mathcal{S}} L_p'(t_i, t_j)$$
where $L_p'$ denotes the motion-weighted photometric error and $\mathcal{S}$ is the set of frame pairs used for reconstruction.
S103, introducing a variational autoencoder structure into the constructed pose estimation network and depth estimation network, and calculating the variational autoencoder loss function.
In this embodiment, KF-PoseNet and DepthNet both use encoder-decoder structures. To make the decoder output robust to noise in its input coding and to improve the generalization ability of the networks, a variational autoencoder (VAE) structure is introduced into both KF-PoseNet and DepthNet.
Take the depth estimation network as an example.
The encoder of the depth estimation network maps the input image $x_d = I_t$ into the coding space, yielding the mean vector $E_d(x_d)$.
Let $q_d(c_d \mid x_d)$ be the distribution of the code $c_d$ to be input to the decoder, set to a Gaussian with mean $E_d(x_d)$ and covariance $\Sigma_d$ of the input image: $q_d(c_d \mid x_d) = \mathcal{N}(E_d(x_d), \Sigma_d)$. The code $c_d$ is obtained by random sampling from this distribution, written $c_d \sim q_d(c_d \mid x_d)$.
The code $c_d$ is then input to the decoder to obtain the depth image of the input image.
To satisfy the back-propagation requirements of a deep network, this embodiment makes the random sampling in coding space differentiable through the following reparameterization: let $\eta$ be a random vector drawn from a zero-mean, unit-covariance Gaussian, $\eta \sim \mathcal{N}(0, I)$, with I the identity matrix; then sampling $c_d \sim q_d(c_d \mid x_d)$ can be realized as $c_d = E_d(x_d) + \Sigma_d \, \eta$, where $\Sigma_d$ is the covariance of the input image.
The pose estimation network is handled in the same way.
Further, the VAE loss function $L_{VAE}$ is calculated as:
$$L_{VAE} = \lambda_1 \, \mathrm{KL}\!\left( q_d(c_d \mid x_d) \,\|\, p_\eta(c) \right) + \lambda_2 \, \mathrm{KL}\!\left( q_p(c_p \mid x_p) \,\|\, p_\eta(c) \right) - \mathbb{E}_{c_d \sim q_d(c_d \mid x_d),\, c_p \sim q_p(c_p \mid x_p)}\!\left[ \log p(\hat{x} \mid c_d, c_p) \right]$$
where $x_d$, $x_p$ are input images; the hyper-parameters $\lambda_1$, $\lambda_2$ weight the target terms; $p_\eta(c)$ is the prior distribution with argument c; $q_d(c_d \mid x_d)$ is the sampling distribution of the depth estimation network's code $c_d$; $q_p(c_p \mid x_p)$ is the sampling distribution of the pose estimation network's code $c_p$; $\mathrm{KL}(\cdot \| \cdot)$ is the KL divergence; $p(\hat{x} \mid c_d, c_p)$ is the probability distribution of the reconstructed image generated from the outputs obtained by feeding $c_d$ and $c_p$ into the decoders of the depth estimation network and the pose estimation network, respectively; and $\mathbb{E}[\cdot]$ is the mathematical expectation under $c_d \sim q_d(c_d \mid x_d)$ and $c_p \sim q_p(c_p \mid x_p)$. The first two terms are KL divergences that penalize the latent code distributions for deviating from the prior; minimizing the last term, a non-negative log-likelihood, is equivalent to minimizing the photometric error loss function. In practice, therefore, the VAE loss function consists of only the first two terms.
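A sketch of the reparameterized sampling and of the two KL terms that remain in $L_{VAE}$, assuming diagonal Gaussian posteriors (parameterized by mean and log-variance) against a standard normal prior; the diagonal parameterization is an assumption on top of the description above:

```python
import torch

def sample_code(mean, logvar):
    eta = torch.randn_like(mean)                  # eta ~ N(0, I)
    return mean + torch.exp(0.5 * logvar) * eta   # c = E(x) + sigma * eta

def kl_to_standard_normal(mean, logvar):
    # KL( N(mean, diag(exp(logvar))) || N(0, I) ), summed over code dimensions
    return 0.5 * torch.sum(logvar.exp() + mean.pow(2) - 1.0 - logvar, dim=-1)

# L_VAE keeps only the two KL terms; the reconstruction term is covered by L_p:
# loss_vae = lam1 * kl_to_standard_normal(mu_d, lv_d).mean() \
#          + lam2 * kl_to_standard_normal(mu_p, lv_p).mean()
```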
S104, training the pose estimation network and the depth estimation network with a training strategy for the missing-frame case, based on the obtained photometric error loss function and variational autoencoder loss function; specifically:
First, for a texture-stable plane in three-dimensional space, its depth in the depth image tends not to vary drastically. Therefore, for the output of the depth estimation network, this embodiment also computes a depth smoothing loss function $L_s$:
$$L_s = \left| \partial_x d_t \right| e^{-\left| \partial_x I_t \right|} + \left| \partial_y d_t \right| e^{-\left| \partial_y I_t \right|}$$
where $d_t$ is the disparity, inversely proportional to the depth image $D_t$; $\partial_x$, $\partial_y$ denote partial derivatives in the x and y directions; and $I_t$ is the image at time t.
in this embodiment, the depth smoothing loss function is calculated for each frame of image in each batch;
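A sketch of this edge-aware smoothness term; mean-normalizing the disparity before differentiating is an assumption borrowed from common practice rather than stated in the patent:

```python
import torch

def smooth_loss(disp, img):
    """Penalize disparity gradients except where the image itself has edges."""
    d = disp / (disp.mean(dim=(2, 3), keepdim=True) + 1e-7)
    dx = (d[:, :, :, :-1] - d[:, :, :, 1:]).abs()
    dy = (d[:, :, :-1, :] - d[:, :, 1:, :]).abs()
    ix = (img[:, :, :, :-1] - img[:, :, :, 1:]).abs().mean(1, keepdim=True)
    iy = (img[:, :, :-1, :] - img[:, :, 1:, :]).abs().mean(1, keepdim=True)
    return (dx * torch.exp(-ix)).mean() + (dy * torch.exp(-iy)).mean()
```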
Then, the final loss function L is determined from the obtained depth smoothing loss function, photometric error loss function, and variational autoencoder loss function:
$$L = L_p + \lambda L_s + L_{VAE}$$
where $\lambda$ is a hyper-parameter controlling the weight of the depth smoothing loss function, $L_p$ is the photometric error loss function, and $L_{VAE}$ is the variational autoencoder loss function.
Finally, the pose estimation network and the depth estimation network are trained with the obtained final loss function, using the training strategy for the missing-frame case.
S105, estimating the camera pose corresponding to each frame of the video image sequence whose poses are to be estimated with the trained pose estimation network.
In this embodiment, the design of the Kalman-filtering-based pose estimation network (KF-PoseNet) borrows the idea of Kalman filtering: successive estimates are associated in temporal order, so KF-PoseNet adapts better to missing frames.
During training, all images of a batch of video image sequences are input to the pose estimation network and the depth estimation network, and both networks are trained. Further, to handle the missing frames that may occur in visual odometry, all images of a batch are input to the depth estimation network while one or more frames of the batch are set to zero before being input to the pose estimation network, and both networks are trained. For example, when N = 5, a batch feeds 5 consecutive frames to the depth estimation network and each pair of adjacent frames to the pose estimation network; to simulate missing frames, two frames are randomly chosen from the last 3 of the 5 consecutive input frames and set to zero before being input to the pose estimation network for training, while the depth estimation network still receives the complete images.
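A sketch of this frame-dropout step for the pose-network input; the candidate index range mirrors the N = 5 example above (frames are zeroed only among the last three), and the helper name is illustrative:

```python
import random
import torch

def drop_frames(frames, n_drop=2):
    """frames: list of N image tensors; zero n_drop random frames among the last N-2."""
    out = list(frames)
    candidates = range(2, len(frames))                  # e.g. last 3 frames of N = 5
    for idx in random.sample(candidates, k=min(n_drop, len(candidates))):
        out[idx] = torch.zeros_like(frames[idx])
    return out                                          # depth net still gets `frames`
```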
After training is finished, the trained pose estimation network estimates the camera pose corresponding to each frame of the video image sequence whose poses are to be estimated.
The monocular visual odometry method based on the Kalman pose estimation network can effectively estimate the camera pose corresponding to each frame from the input image sequence and adapt to missing frames. The invention is suited to self-supervised monocular visual odometry.
The monocular visual odometry method based on the Kalman pose estimation network provided by the embodiments of the invention has at least the following advantages:
(1) Many existing self-supervised methods do not consider the association between frames and underuse inter-frame information, so the trained network struggles to estimate accurate poses and cannot adapt to missing frames. This embodiment constructs a pose estimation network based on Kalman filtering and designs, on top of it, a training strategy for the missing-frame case, so the pose estimation network can exploit inter-frame information when estimating the current pose and is better suited to missing frames;
(2) Moving objects that may exist in the scene are inconsistent with the Euclidean transformation of the scene, violating the static-scene assumption, so the scene motion is hard to describe with a single Euclidean transformation and the pose estimation network's results are biased. This embodiment designs a motion-weighted photometric error that down-weights such pixels, reducing the interference of moving objects during training.
To verify the effectiveness of the monocular visual odometry method based on the Kalman pose estimation network provided by the embodiment of the invention, its performance is tested with the evaluation indices provided with the KITTI odometry dataset:
(1) Relative translation RMSE (rel. trans.): the average translation RMSE (Root Mean Square Error) over all subsequences of length 100, 200, ..., 800 meters in a sequence, measured in % (i.e., meters of deviation per 100 meters traveled); smaller is better.
(2) Relative rotation RMSE (rel. rot.): the average rotation RMSE over all subsequences of length 100, 200, ..., 800 meters in a sequence, measured in deg/m; smaller is better.
In this embodiment, the eight sequences 00-07 of the KITTI odometry dataset are used as the training and validation sets to train the pose estimation network and the depth estimation network, and the two sequences 09-10 are used to test the performance of the Kalman-filtering-based pose estimation network for self-supervised monocular visual odometry.
The KITTI odometry dataset contains stereo images, radar points, and ground-truth trajectories of urban road environments, acquired with vehicle-mounted cameras and other equipment.
In the implementation, a depth estimation network and a Kalman-filtering-based pose estimation network are constructed, where the pose estimation network outputs the pose transformation between each pair of adjacent input frames and the depth estimation network outputs a depth image for each input frame; a motion-weighted photometric error loss function over the video image sequence is computed from the output pose transformations and depth images; a variational autoencoder structure is introduced into both networks and the variational autoencoder loss function is computed; both networks are trained with the obtained photometric error loss function and variational autoencoder loss function, using the training strategy for the missing-frame case; and the trained pose estimation network estimates the camera pose corresponding to each frame of the video image sequence whose poses are to be estimated.
In this embodiment, the hyper-parameters of the photometric error loss function are $\alpha_0 = 0.85$, $\alpha_1 = 0.1$, $\alpha_2 = 0.05$; the depth smoothing loss parameter is $\lambda = 10^{-3}$; and the VAE loss parameters are $\lambda_1 = \lambda_2 = 0.01$. During training, the initial learning rate is $10^{-4}$ and decays gradually, multiplied by 0.97 after each epoch; 45 epochs are run with the Adam optimizer, with a batch size of 2 and 3 consecutive frames per batch.
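A sketch of this schedule; pose_net, depth_net, loader, and compute_loss are placeholder names for the two networks, the data pipeline, and the combined loss $L = L_p + \lambda L_s + L_{VAE}$ described above:

```python
import itertools
import torch

params = itertools.chain(pose_net.parameters(), depth_net.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)                      # initial lr 1e-4
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.97)

for epoch in range(45):                 # 45 passes over the training set
    for batch in loader:                # batch size 2, 3 consecutive frames each
        loss = compute_loss(batch)      # L = L_p + lambda * L_s + L_VAE
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()                    # lr *= 0.97 after each epoch
```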
To verify the performance of the method, recent self-supervised deep-learning monocular visual odometry methods were selected for comparison; the experimental results are shown in table 2. The trajectories generated by this embodiment are shown in fig. 4, where the dashed curves are the ground-truth trajectories and the solid curves are the trajectories estimated by this embodiment.
As table 2 shows, the method of this embodiment outperforms the other methods, owing to its better use of information extracted from past instants, its motion weighting of pixels, and its VAE structure.
TABLE 2 comparison of the method of this embodiment with other methods
To verify the significance of each part of the method, ablation experiments were also performed; the results are shown in table 3. "Without Kalman structure" in the second row means the Kalman structure is removed from the network: the decoder of the pose estimation network becomes four convolutional layers, the first three with ReLU activations, with the fourth layer's output passed through global average pooling to obtain the 6-degree-of-freedom pose vector. The third through fifth rows correspond to removing the motion weighting, the VAE structure, and the long-horizon consistency constraint, respectively. "# fc = 6" and "# fc = 2" in the sixth and seventh rows report the pose estimation decoder with different numbers of fully connected layers. The first row, "basic", is the result without any of the three structures. The last row is the complete method.
The results show that the Kalman-like structure lets the network draw on earlier data when estimating the current adjacent-frame pose, making the current estimate more accurate. With motion weighting, training focuses more on pixels of static objects in the environment and weakens the interference of objects inconsistent with the camera's Euclidean transformation. With the VAE structure, the decoder becomes more robust to noise in the encoder output, improving generalization and further improving the results. The complete method achieves the best experimental results, and performance improves as each part is added, demonstrating the significance of every component.
TABLE 3 ablation test results
TABLE 4 experimental results for the missing-frame case
This embodiment also performs an ablation experiment on the training strategy designed for the missing-frame case. At test time, one frame is set to zero at frames 50, 150, and so on, and two frames are set to zero at frames 100, 200, and so on, to evaluate the method under missing frames. The results are shown in table 4. The first row, "without frame training", is the result of training without the missing-frame training strategy; the second row, "without Kalman structure", is the result of the model without the Kalman structure, also trained without the missing-frame training strategy; the third row is the result trained with the missing-frame training strategy. As table 4 shows, the proposed method adapts well to missing frames.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
Claims (10)
1. A monocular visual odometry method based on a Kalman pose estimation network, characterized by comprising the following steps:
constructing a depth estimation network and a pose estimation network based on Kalman filtering, wherein the pose estimation network outputs the pose transformation between each pair of adjacent input frames and the depth estimation network outputs a depth image for each input frame;
calculating a motion-weighted photometric error loss function over a video image sequence from the output pose transformation between each pair of adjacent frames and the depth image of each input frame;
introducing a variational autoencoder structure into the constructed pose estimation network and depth estimation network, and calculating a variational autoencoder loss function;
training the pose estimation network and the depth estimation network with the obtained photometric error loss function and variational autoencoder loss function, using a training strategy for the missing-frame case;
and estimating the camera pose corresponding to each frame of the video image sequence whose poses are to be estimated with the trained pose estimation network.
2. The monocular visual odometry method based on a Kalman pose estimation network of claim 1, wherein the pose estimation network comprises a pose measurement network, a pose weighted fusion network, a pose update network, and a pose prediction network; wherein,
the input adjacent frames $I_{t-1}$ and $I_t$ are encoded by the pose measurement network to obtain the pose measurement vector $C_{measure,t}$ at time t:
$$C_{measure,t} = \mathrm{Measure}(I_{t-1}, I_t)$$
where $I_{t-1}$ and $I_t$ are the images at times t-1 and t, and Measure() is the pose measurement network;
the pose measurement vector $C_{measure,t}$ and the pose prediction vector $C_{pred,t}$ are input to the pose weighted fusion network to obtain the pose weighted fusion vector $C_{fuse,t}$ at time t:
$$C_{fuse,t} = (1 - W_t) \cdot C_{measure,t} + W_t \cdot C_{pred,t}$$
where $W_t$ is the weight in $[0, 1]$ output by the last fully connected layer of the pose weighted fusion network; $C_{pred,t}$ is the pose prediction vector at time t output by the pose prediction network when the adjacent frames $I_{t-2}$, $I_{t-1}$ were input to the pose estimation network, $C_{pred,t} = \mathrm{Predict}(C_{fuse,t-1})$, with $C_{fuse,t-1}$ the pose weighted fusion vector at time t-1 and Predict() the pose prediction network;
the pose weighted fusion vector $C_{fuse,t}$ is input to the pose update network to estimate the pose transformation $T_{t \to t-1}$:
$$T_{t \to t-1} = \mathrm{Update}(C_{fuse,t})$$
where Update() is the pose update network and $T_{t \to t-1}$ is the 6-degree-of-freedom relative pose vector between $I_{t-1}$ and $I_t$, comprising a relative rotation and a relative displacement.
3. The monocular visual odometry method based on a Kalman pose estimation network of claim 2, wherein the pose estimation network and the depth estimation network both employ encoder-decoder structures.
4. The monocular visual odometry method based on a Kalman pose estimation network of claim 1, wherein calculating the motion-weighted photometric error loss function over a video image sequence from the pose transformation between each pair of output adjacent frames and the depth image of each input frame comprises:
chaining the pose transformations between each pair of adjacent frames output by the pose estimation network to obtain pose transformations over longer periods, and, based on these, calculating motion-weighted photometric errors between images;
and calculating the motion-weighted photometric error loss function of the video image sequence from the computed photometric errors.
5. The monocular visual odometry method based on a Kalman pose estimation network of claim 4, wherein chaining the pose transformations between adjacent frames to obtain pose transformations over longer periods, and calculating motion-weighted photometric errors between images based on them, comprises:
for a video image sequence of length N with corresponding times $t_0, t_1, \ldots, t_{N-1}$, cumulatively multiplying the poses between each pair of adjacent frames output by the pose estimation network to obtain the pose transformation over a longer period,
$$T_{t_j \to t_i} = \prod_{k=i}^{j-1} T_{t_{k+1} \to t_k}, \qquad 0 \le i < j \le N-1,$$
where $T_{t_j \to t_i}$ is the pose transformation from time $t_j$ to time $t_i$ and N is the length of each batch of video image sequences input to the pose estimation network and the depth estimation network;
for a point $p_{t_i}$ in image $I_{t_i}$, restoring its three-dimensional coordinates from the depth image $D_{t_i}$; its corresponding projected point $p_{t_j}$ on image $I_{t_j}$ is expressed as
$$p_{t_j} \sim K \, T_{t_j \to t_i}^{-1} \, D_{t_i}(p_{t_i}) \, K^{-1} \, p_{t_i},$$
where K is the camera intrinsic matrix;
using the resulting motion weighting term $W_{mw}$, calculating the motion-weighted photometric error between $I_{t_i}$ and its reconstruction $\hat{I}_{t_i}$:
$$L_p'(t_i, t_j) = W_{mw} \odot \left( \alpha_0 \, \frac{1 - \mathrm{SSIM}(I_{t_i}, \hat{I}_{t_i})}{2} + \alpha_1 \, \| I_{t_i} - \hat{I}_{t_i} \|_1 + \alpha_2 \, \| I_{t_i} - \hat{I}_{t_i} \|_2 \right)$$
where $L_p'(t_i, t_j)$ is the motion-weighted photometric error between $I_{t_i}$ and $\hat{I}_{t_i}$, $\mathrm{SSIM}(\cdot,\cdot)$ is the structural similarity between the original image $I_{t_i}$ and the reconstructed image $\hat{I}_{t_i}$, $\alpha_0$, $\alpha_1$, $\alpha_2$ are hyper-parameters controlling the proportion of each part, $\odot$ denotes the pixel-wise product, $\|\cdot\|_1$ is the 1-norm, and $\|\cdot\|_2$ is the 2-norm.
6. The monocular visual odometry method based on a Kalman pose estimation network of claim 5, wherein before using the obtained motion weighting term $W_{mw}$ to calculate the motion-weighted photometric error between $I_{t_i}$ and $\hat{I}_{t_i}$, the method further comprises:
determining the pixels involved in the photometric error calculation and marking them with a mask:
$$\text{mask} = \left[ \, \| I_{t_i} - \hat{I}_{t_i} \|_* < \| I_{t_i} - I_{t_j} \|_* \, \right]$$
where $[\cdot]$ is the Iverson bracket, $I_{t_i}$ is the original image at time $t_i$, $I_{t_j}$ is the original image at time $t_j$, $\hat{I}_{t_i}$ is the reconstruction of $I_{t_i}$ obtained by sampling from $I_{t_j}$, and $\|\cdot\|_*$ denotes the photometric error, i.e. a 1-norm or 2-norm;
when calculating the motion-weighted photometric error between $I_{t_i}$ and $\hat{I}_{t_i}$, only the mask-marked pixels are used.
7. The monocular visual odometry method based on a Kalman pose estimation network of claim 6, wherein the photometric error loss function is expressed as:
$$L_p = \frac{1}{|\mathcal{S}|} \sum_{(t_i, t_j) \in \mathcal{S}} L_p'(t_i, t_j)$$
where $L_p$ is the photometric error loss function and $\mathcal{S}$ is the set of frame pairs used for reconstruction.
8. The monocular visual odometry method based on a Kalman pose estimation network of claim 1, wherein the variational autoencoder loss function is expressed as:
$$L_{VAE} = \lambda_1 \, \mathrm{KL}\!\left( q_d(c_d \mid x_d) \,\|\, p_\eta(c) \right) + \lambda_2 \, \mathrm{KL}\!\left( q_p(c_p \mid x_p) \,\|\, p_\eta(c) \right) - \mathbb{E}_{c_d \sim q_d(c_d \mid x_d),\, c_p \sim q_p(c_p \mid x_p)}\!\left[ \log p(\hat{x} \mid c_d, c_p) \right]$$
where $L_{VAE}$ is the variational autoencoder loss function; $x_d$, $x_p$ are input images; $\lambda_1$, $\lambda_2$ are hyper-parameters; $p_\eta(c)$ is the prior distribution with argument c; $q_d(c_d \mid x_d)$ is the sampling distribution of the depth estimation network's code $c_d$; $q_p(c_p \mid x_p)$ is the sampling distribution of the pose estimation network's code $c_p$; $\mathrm{KL}(\cdot \| \cdot)$ is the KL divergence; $p(\hat{x} \mid c_d, c_p)$ is the probability distribution of the reconstructed image generated from the outputs obtained by feeding $c_d$ and $c_p$ into the decoders of the depth estimation network and the pose estimation network, respectively; and $\mathbb{E}[\cdot]$ is the mathematical expectation under $c_d \sim q_d(c_d \mid x_d)$ and $c_p \sim q_p(c_p \mid x_p)$.
9. The monocular visual odometry method based on a Kalman pose estimation network of claim 1, wherein training the pose estimation network and the depth estimation network with a training strategy for the missing-frame case, based on the obtained photometric error loss function and variational autoencoder loss function, comprises:
for the output of the depth estimation network, computing a depth smoothing loss function:
$$L_s = \left| \partial_x d_t \right| e^{-\left| \partial_x I_t \right|} + \left| \partial_y d_t \right| e^{-\left| \partial_y I_t \right|}$$
where $d_t$ is the disparity, inversely proportional to the depth image $D_t$; $\partial_x$, $\partial_y$ denote partial derivatives in the x and y directions; and $I_t$ is the image at time t;
determining the final loss function L from the obtained depth smoothing loss function, photometric error loss function, and variational autoencoder loss function:
$$L = L_p + \lambda L_s + L_{VAE}$$
where $\lambda$ is a hyper-parameter controlling the weight of the depth smoothing loss function, $L_p$ is the photometric error loss function, and $L_{VAE}$ is the variational autoencoder loss function;
and training the pose estimation network and the depth estimation network with the obtained final loss function, using the training strategy for the missing-frame case.
10. The Kalman pose estimation network based monocular visual odometry method of claim 1, wherein the training the pose estimation network and the depth estimation network with a training strategy for frame loss comprises:
inputting all images of a batch of video image sequences into the pose estimation network and the depth estimation network, and training the pose estimation network and the depth estimation network;
inputting all images of the batch of video image sequences into the depth estimation network, zeroing one or more frames of the batch and inputting the result into the pose estimation network, and training the pose estimation network and the depth estimation network, as sketched below.
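Claim 10's two-pass strategy can be sketched as follows. The claimed idea is that the depth network always sees the full sequence while the pose network sometimes sees sequences with one or more frames zeroed out; the drop probability and random sampling scheme below are assumptions:

```python
import torch

def zero_random_frames(seq, p_drop=0.2):
    """Simulate missing frames: zero out one or more images in a
    (B, T, C, H, W) video batch before it reaches the pose network."""
    b, t = seq.shape[:2]
    drop = torch.rand(b, t, device=seq.device) < p_drop
    drop[:, 0] = False                      # keep at least the reference frame
    return seq * (~drop)[:, :, None, None, None].float()

# Two training passes per claim 10 (pose_net / depth_net are assumed modules):
# 1) the full sequences go into both the pose and depth networks;
# 2) the full sequences go into the depth network, while frame-zeroed
#    sequences go into the pose network; both passes optimize the same
#    final loss L from claim 9.
```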
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210290482.3A CN114663496B (en) | 2022-03-23 | 2022-03-23 | Monocular vision odometer method based on Kalman pose estimation network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114663496A (en) | 2022-06-24
CN114663496B (en) | 2022-10-18
Family
ID=82031748
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210290482.3A Active CN114663496B (en) | 2022-03-23 | 2022-03-23 | Monocular vision odometer method based on Kalman pose estimation network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114663496B (en) |
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150124882A1 (en) * | 2013-11-05 | 2015-05-07 | Arris Enterprises, Inc. | Bit depth variable for high precision data in weighted prediction syntax and semantics |
CN108665496A (en) * | 2018-03-21 | 2018-10-16 | 浙江大学 | A kind of semanteme end to end based on deep learning is instant to be positioned and builds drawing method |
US20200041276A1 (en) * | 2018-08-03 | 2020-02-06 | Ford Global Technologies, Llc | End-To-End Deep Generative Model For Simultaneous Localization And Mapping |
CN110490928A (en) * | 2019-07-05 | 2019-11-22 | 天津大学 | A kind of camera Attitude estimation method based on deep neural network |
CN110910447A (en) * | 2019-10-31 | 2020-03-24 | 北京工业大学 | Visual odometer method based on dynamic and static scene separation |
US20220036577A1 (en) * | 2020-07-30 | 2022-02-03 | Apical Limited | Estimating camera pose |
CN112102399A (en) * | 2020-09-11 | 2020-12-18 | 成都理工大学 | Visual mileage calculation method based on generative antagonistic network |
CN113108771A (en) * | 2021-03-05 | 2021-07-13 | 华南理工大学 | Movement pose estimation method based on closed-loop direct sparse visual odometer |
CN113483762A (en) * | 2021-07-05 | 2021-10-08 | 河南理工大学 | Pose optimization method and device |
CN114022527A (en) * | 2021-10-20 | 2022-02-08 | 华中科技大学 | Monocular endoscope depth and pose estimation method and device based on unsupervised learning |
Non-Patent Citations (6)
Title |
---|
CHUNHUI ZHAO ET AL.: "Pose estimation for multi-camera systems", 2017 IEEE International Conference on Unmanned Systems (ICUS) * |
UGUR KAYASAL: "Magnetometer Aided Inertial Navigation System: Modeling and Simulation of a Navigation System Based on an IMU and a Magnetometer" (Chinese edition), 28 February 2017 * |
YAN WANG ET AL.: "Unsupervised Learning of Accurate Camera Pose and Depth From Video Sequences With Kalman Filter", IEEE Access * |
ZHOU KAI ET AL.: "Dense visual odometry algorithm fusing edge information in dynamic environments", Journal of Harbin Institute of Technology * |
MENG QINGXIN ET AL.: "Fundamentals of Robotics", 30 September 2006 * |
ZHANG WEIQI: "Research on learning-based monocular simultaneous localization and mapping methods", China Doctoral Dissertations Full-text Database (Information Science and Technology) * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115131404A (en) * | 2022-07-01 | 2022-09-30 | 上海人工智能创新中心 | Monocular 3D detection method based on motion estimation depth |
CN115131404B (en) * | 2022-07-01 | 2024-06-14 | 上海人工智能创新中心 | Monocular 3D detection method based on motion estimation depth |
CN115841151A (en) * | 2023-02-22 | 2023-03-24 | 禾多科技(北京)有限公司 | Model training method and device, electronic equipment and computer readable medium |
CN116612182A (en) * | 2023-07-19 | 2023-08-18 | 煤炭科学研究总院有限公司 | Monocular pose estimation method and monocular pose estimation device |
CN116612182B (en) * | 2023-07-19 | 2023-09-29 | 煤炭科学研究总院有限公司 | Monocular pose estimation method and monocular pose estimation device |
CN117214860A (en) * | 2023-08-14 | 2023-12-12 | 北京科技大学顺德创新学院 | Laser radar odometer method based on twin feature pyramid and ground segmentation |
CN117214860B (en) * | 2023-08-14 | 2024-04-19 | 北京科技大学顺德创新学院 | Laser radar odometer method based on twin feature pyramid and ground segmentation |
CN117197229A (en) * | 2023-09-22 | 2023-12-08 | 北京科技大学顺德创新学院 | Multi-stage estimation monocular vision odometer method based on brightness alignment |
CN117197229B (en) * | 2023-09-22 | 2024-04-19 | 北京科技大学顺德创新学院 | Multi-stage estimation monocular vision odometer method based on brightness alignment |
CN117974721A (en) * | 2024-04-01 | 2024-05-03 | 合肥工业大学 | Vehicle motion estimation method and system based on monocular continuous frame images |
Also Published As
Publication number | Publication date |
---|---|
CN114663496B (en) | 2022-10-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114663496B (en) | Monocular vision odometer method based on Kalman pose estimation network | |
CN114782691B (en) | Robot target identification and motion detection method based on deep learning, storage medium and equipment | |
CN109271933B (en) | Method for estimating three-dimensional human body posture based on video stream | |
CN107424177B (en) | Positioning correction long-range tracking method based on continuous correlation filter | |
Varma et al. | Transformers in self-supervised monocular depth estimation with unknown camera intrinsics | |
CN110490928A (en) | A kind of camera Attitude estimation method based on deep neural network | |
CN112233179B (en) | Visual odometer measuring method | |
CN114663509B (en) | Self-supervision monocular vision odometer method guided by key point thermodynamic diagram | |
CN110610486B (en) | Monocular image depth estimation method and device | |
CN113256698B (en) | Monocular 3D reconstruction method with depth prediction | |
CN111325784A (en) | Unsupervised pose and depth calculation method and system | |
CN110942484B (en) | Camera self-motion estimation method based on occlusion perception and feature pyramid matching | |
CN110428461B (en) | Monocular SLAM method and device combined with deep learning | |
CN114612545A (en) | Image analysis method and training method, device, equipment and medium of related model | |
CN111275751B (en) | Unsupervised absolute scale calculation method and system | |
CN115482252A (en) | Motion constraint-based SLAM closed loop detection and pose graph optimization method | |
Li et al. | Unsupervised joint learning of depth, optical flow, ego-motion from video | |
Fan et al. | Random epipolar constraint loss functions for supervised optical flow estimation | |
Liu et al. | Joint estimation of pose, depth, and optical flow with a competition–cooperation transformer network | |
CN114485417B (en) | Structural vibration displacement identification method and system | |
CN115830707A (en) | Multi-view human behavior identification method based on hypergraph learning | |
KR20200095251A (en) | Apparatus and method for estimating optical flow and disparity via cycle consistency | |
Jiang et al. | EV-MGDispNet: Motion-Guided Event-Based Stereo Disparity Estimation Network with Left-Right Consistency | |
CN117197229B (en) | Multi-stage estimation monocular vision odometer method based on brightness alignment | |
KR20060065417A (en) | Marker-free motion capture apparatus and method for correcting tracking error |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||