CN117197229A - Multi-stage estimation monocular vision odometer method based on brightness alignment - Google Patents

Multi-stage estimation monocular vision odometer method based on brightness alignment

Info

Publication number
CN117197229A
Authority
CN
China
Prior art keywords
pose
image
network
brightness
estimation network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311238629.5A
Other languages
Chinese (zh)
Other versions
CN117197229B (en)
Inventor
曾慧
梁溢友
江左
杨清港
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Aerospace Science And Industry Group Intelligent Technology Research Institute Co ltd
Shunde Innovation School of University of Science and Technology Beijing
Original Assignee
Aerospace Science And Industry Group Intelligent Technology Research Institute Co ltd
Shunde Innovation School of University of Science and Technology Beijing
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Aerospace Science And Industry Group Intelligent Technology Research Institute Co ltd, Shunde Innovation School of University of Science and Technology Beijing filed Critical Aerospace Science And Industry Group Intelligent Technology Research Institute Co ltd
Priority to CN202311238629.5A priority Critical patent/CN117197229B/en
Publication of CN117197229A publication Critical patent/CN117197229A/en
Application granted granted Critical
Publication of CN117197229B publication Critical patent/CN117197229B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides a multi-stage estimation monocular vision odometer method based on brightness alignment, and belongs to the technical field of computer vision. The method comprises the following steps: constructing a depth estimation network, a pose estimation network based on brightness alignment and a pose optimization network based on a two-way long-short-term memory network; calculating a brightness-alignment-based luminosity loss function between adjacent frame images; calculating a motion constraint loss function for the image sequence input into the pose estimation network; optimizing, through the pose optimization network, the relative camera pose output by the pose estimation network at the current moment and calculating a pose optimization loss function; training the pose estimation network, the depth estimation network and the pose optimization network; and estimating, with the trained pose estimation network and pose optimization network, the relative camera pose corresponding to each frame of image in the image sequence whose poses are to be estimated. By adopting the method and the device, the accuracy and the robustness of the relative camera pose estimation result can be effectively improved.

Description

Multi-stage estimation monocular vision odometer method based on brightness alignment
Technical Field
The invention relates to the technical field of computer vision, in particular to a multi-stage estimation monocular vision odometer method based on brightness alignment.
Background
The visual odometer is an important component of visual simultaneous localization and mapping technology, is mainly used for camera self-motion estimation, and is widely applied in fields such as robot navigation, autonomous driving and augmented reality. Depending on the type and number of sensors used, visual odometers can be classified into monocular visual odometers, binocular visual odometers and visual-inertial odometry methods. The monocular visual odometer method requires only a monocular camera, is low in cost and easy to deploy, and has attracted wide attention from researchers in recent years.
Conventional visual odometry methods are classified into feature-based methods and direct methods. Feature-based methods extract and match features between two consecutive input frames and estimate the camera's self-motion from the geometric relationship between the matched features. Direct methods rely on the gray-level invariance assumption to directly track pixel motion between frames and estimate the camera's self-motion by minimizing the photometric error.
With the development of deep learning technology, deep neural networks have been widely used in the field of computer vision. Researchers have adopted deep learning to implement visual odometry, including supervised-learning-based visual odometry methods and self-supervised-learning-based visual odometry methods (referred to simply as self-supervised visual odometry methods). The self-supervised visual odometry method does not require the true relative camera pose as a supervision signal and therefore has a wider range of application scenarios.
However, many existing self-supervised visual odometry methods still estimate the relative camera pose with relatively poor accuracy and have limited robustness, particularly in scenes where the ambient light changes. Furthermore, many existing methods do not fully exploit the time sequence information implicit in the input image sequence during camera self-motion estimation.
Disclosure of Invention
The embodiment of the invention provides a multi-stage estimation monocular vision odometer method based on brightness alignment, which can effectively improve the accuracy and the robustness of a relative camera pose estimation result. The technical scheme is as follows:
in one aspect, a multi-stage estimated monocular vision odometer method based on brightness alignment is provided, the method being applied to an electronic device, the method comprising:
constructing a depth estimation network, a pose estimation network based on brightness alignment and a pose optimization network based on a two-way long-short-term memory network; the pose optimization network is used for aggregating implicit time sequence information in the input image sequence and optimizing the relative camera pose output by the pose estimation network;
calculating luminosity loss functions based on brightness alignment between adjacent frame images according to relative camera pose and brightness alignment parameters between each pair of adjacent frame images output by a pose estimation network and the depth image of an input frame output by a depth estimation network;
Calculating a motion constraint loss function of an image sequence input into the pose estimation network;
optimizing, through the pose optimization network, the relative camera pose output by the pose estimation network at the current moment according to the relative camera poses estimated by the pose estimation network at past moments, and calculating a pose optimization loss function using the relative camera pose output by the pose optimization network;
inputting all images in the image sequence into a pose estimation network, a depth estimation network and a pose optimization network, and training the constructed pose estimation network, depth estimation network and pose optimization network based on the obtained luminosity loss function, motion constraint loss function and pose optimization loss function based on brightness alignment;
and estimating the relative camera pose corresponding to each frame of image in the image sequence of the pose to be estimated by using the trained pose estimation network and the pose optimization network.
Further, the pose estimation network adopts an encoder-decoder structure;
the pose optimization network is formed by connecting a full-connection layer after adopting a two-way long-short-period memory network; wherein,
the two-way long-short-term memory network is used for aggregating the time sequence information implicit in the image sequence;
The full connection layer is used for regressing the aggregated time sequence information into a relative camera pose with 6 degrees of freedom.
Further, the calculating the luminosity loss function based on the brightness alignment between the adjacent frame images according to the relative camera pose between each pair of adjacent frame images output by the pose estimation network, the brightness alignment parameter and the depth image of the input frame output by the depth estimation network comprises:
for a pair of images having a relative motion relationship I t And I t+1 Estimating corresponding relative camera pose and brightness alignment parameters by using a pose estimation network, estimating a depth image of an input frame by using a depth estimation network, aligning brightness information of two images by using the brightness alignment parameters, and using the estimated relative camera pose and the corresponding depth image to align the brightness of a first frame image I after brightness alignment t Re-castingShadow to second frame image I t+1 And calculating luminosity loss between the re-projection image and the real image based on brightness alignment, and obtaining a luminosity loss function value based on brightness alignment.
Further, for a pair of images having a relative motion relationship, the estimating the corresponding relative camera pose and brightness alignment parameters using the pose estimation network, estimating the depth image of the input frame using the depth estimation network, aligning the brightness information of the two images using the brightness alignment parameters, re-projecting the brightness-aligned first frame image to the second frame using the estimated relative camera pose and the corresponding depth image, and calculating the brightness-alignment-based luminosity loss between the re-projected image and the real image to obtain the brightness-alignment-based luminosity loss function value comprises:
At the current moment t, the image pair I_t and I_{t+1} having a relative motion relationship is input into the pose estimation network, and the pose estimation network outputs the 6-degree-of-freedom relative camera pose P_{t→t+1} of the adjacent-frame image pair I_t and I_{t+1} together with the brightness alignment parameters a and b; wherein P_{t→t+1} is expressed as:

P_{t→t+1}, a, b = Pose_BA(I_t, I_{t+1})

wherein I_t and I_{t+1} respectively represent the images at time t and time t+1, and Pose_BA() is the pose estimation network based on brightness alignment;
the depth estimation network estimates the depth image D_{t+1} corresponding to the second frame image I_{t+1};
P_{t→t+1} is converted into a 4×4 pose transformation matrix T_{t→t+1} using the Euler transform formula:

T_{t→t+1} = Euler(P_{t→t+1})

wherein Euler() represents the Euler transform formula;
an affine brightness transformation is applied to the input image I_t using the brightness alignment parameters a and b to obtain the image Ĩ_t:

Ĩ_t = a·I_t + b

for a point p_{t+1} on the image I_{t+1}, its three-dimensional coordinates are recovered from its depth D_{t+1}(p_{t+1}); its corresponding projection point p̃_t on the image Ĩ_t is expressed as:

p̃_t ~ K·T_{t→t+1}·D_{t+1}(p_{t+1})·K^(-1)·p_{t+1}

wherein K is the camera intrinsic parameter matrix and D_{t+1} is the depth image at time t+1;
the image Ĩ_t is sampled at these projection points to obtain the re-projected image I'_{t+1} of the image I_{t+1} at time t+1;
the brightness-alignment-based luminosity loss L_BA between the re-projected image I'_{t+1} and the real image I_{t+1} is determined and expressed as:

L_BA = μ_0·(1 − SSIM(I_{t+1}, I'_{t+1}))/2 + μ_1·‖I_{t+1} − I'_{t+1}‖_1

wherein SSIM(I_{t+1}, I'_{t+1}) represents the structural similarity between the real image I_{t+1} and the re-projected image I'_{t+1}, μ_0 and μ_1 are hyperparameters controlling the proportion of the corresponding terms, and ‖·‖_1 represents the L1 norm.
Further, the motion constraint loss function includes: a non-adjacent frame luminosity loss function and a pose continuity loss function;
the calculating the motion constraint loss function of the image sequence input to the pose estimation network comprises:
for a non-adjacent frame image pair in an image sequence with the length of N before the current t moment image, estimating the relative camera pose and brightness alignment parameters by using a pose estimation network, calculating a corresponding brightness alignment-based luminosity loss function, and distributing different weights for the brightness alignment-based luminosity loss functions of all the non-adjacent frame image pairs in the sequence to add to obtain the non-adjacent frame luminosity loss function of the image sequence;
converting the relative camera pose of each group of adjacent image pairs in the image sequence estimated by the pose estimation network into a pose transformation matrix, multiplying all the pose transformation matrices to obtain a relative pose transformation matrix between the head and tail frame images of the image sequence, estimating the relative camera pose between the head and tail frame images by using the pose estimation network and converting it into a relative pose transformation matrix, and calculating the L1 norm between the relative pose transformation matrices between the head and tail frame images of the image sequence obtained in these two ways to obtain a pose continuity loss function of the image sequence.
Further, for the non-adjacent frame image pairs in the image sequence of length N before the image at the current moment t, the estimating the relative camera pose and brightness alignment parameters by using the pose estimation network, calculating the corresponding brightness-alignment-based luminosity loss functions, and assigning different weights to the brightness-alignment-based luminosity loss functions of all the non-adjacent frame image pairs in the sequence and adding them to obtain the non-adjacent frame luminosity loss function of the image sequence comprises the following steps:
for a non-adjacent frame image pair in an image sequence with the length of N before an image at the current t moment, estimating relative camera pose and brightness alignment parameters between the image pairs with the image time interval of 2 to N in the image sequence by using a pose estimation network, and calculating a corresponding brightness alignment-based luminosity loss function value;
assigning different weights to the brightness-alignment-based luminosity losses between the non-adjacent frame image pairs according to the interval between the image pairs to obtain the non-adjacent frame luminosity loss L_na of the whole image sequence:

L_na = Σ_{j−i≥2} μ·L_BA(I_j, I'_{i→j})

wherein I'_{i→j} represents the re-projected image from the image at the i-th moment to the j-th moment, L_BA(I_j, I'_{i→j}) is the brightness-alignment-based luminosity loss between the re-projected image I'_{i→j} and the real image I_j, and μ is a manually set hyperparameter whose value is inversely related to the interval between the two input frames:

μ = 10^(i−j)
Further, the converting the relative camera pose of each group of adjacent image pairs in the image sequence estimated by the pose estimation network into a pose transformation matrix, multiplying all the pose transformation matrices to obtain a relative pose transformation matrix between the head and tail frame images of the image sequence, estimating the relative camera pose between the head and tail frame images by using the pose estimation network and converting it into a relative pose transformation matrix, and calculating the L1 norm between the relative pose transformation matrices between the head and tail frame images of the image sequence obtained in these two ways to obtain a pose continuity loss function of the image sequence comprises:
In a sequence of images of length M, between each adjacent image pair I_i and I_{i+1} there is a pose transformation matrix T_{i→i+1}, wherein i = 0, 1, …, M−2, and T_{i→i+1} is obtained by converting the relative camera pose of the adjacent image pair I_i and I_{i+1};
all the pose transformation matrices are successively left-multiplied in the camera coordinate system to obtain the relative pose transformation matrix between the head and tail frame images of the image sequence:

T_{0→M−1} = T_{M−2→M−1}·…·T_{1→2}·T_{0→1}

simultaneously, the relative camera pose generated by the camera motion between the first frame image and the last frame image is estimated directly using the pose estimation network and converted into a relative pose transformation matrix T'_{0→M−1} using the Euler transformation;
according to the L1 norm between the obtained T_{0→M−1} and T'_{0→M−1}, the pose continuity loss L_pc,M of the image sequence of length M is obtained:

L_pc,M = ‖T_{0→M−1} − T'_{0→M−1}‖_1

for the image sequence of length N before the image at the current moment t, the pose continuity losses of the sub-sequences with sampling intervals 2, 3, …, N−2 in the image sequence are added to obtain the total pose continuity loss:

L_pc = Σ_M L_pc,M
Further, the optimizing, through the pose optimization network, the relative camera pose output by the pose estimation network at the current moment according to the relative camera poses estimated by the pose estimation network at past moments, and calculating the pose optimization loss function using the relative camera pose output by the pose optimization network comprises:
The relative camera poses P_{t−n+1}, …, P_{t−1} estimated by the pose estimation network at the n−1 moments before the current moment t and the relative camera pose P_t estimated by the pose estimation network at the current moment t are input into the pose optimization network, and the pose optimization network outputs a 6-degree-of-freedom relative camera pose P'_t with higher precision:

P'_t = Pose_Optimization(P_{t−n+1}, …, P_{t−1}, P_t)

wherein t is larger than n−1, Pose_Optimization() is the pose optimization network, P_i denotes the 6-degree-of-freedom relative camera pose estimated by the pose estimation network at moment i, and P'_t comprises a 3-degree-of-freedom relative rotation and a 3-degree-of-freedom relative displacement;
an image re-projection operation is performed using this relative camera pose, and the corresponding brightness-alignment-based luminosity loss function is calculated and used as the loss function L_PO for training the pose optimization network.
In one aspect, an electronic device is provided that includes a processor and a memory having at least one instruction stored therein that is loaded and executed by the processor to implement the multi-stage estimated monocular vision odometry method described above based on brightness alignment.
In one aspect, a computer-readable storage medium having stored therein at least one instruction loaded and executed by a processor to implement the multi-stage estimated monocular vision odometer method described above based on brightness alignment is provided.
The technical scheme provided by the embodiment of the invention has the beneficial effects that at least:
in the embodiment of the invention, a depth estimation network, a pose estimation network based on brightness alignment and a pose optimization network based on a two-way long-short-term memory network are constructed; the pose optimization network is used for aggregating implicit time sequence information in the input image sequence and optimizing the relative camera pose output by the pose estimation network; calculating luminosity loss functions based on brightness alignment between adjacent frame images according to relative camera pose and brightness alignment parameters between each pair of adjacent frame images output by a pose estimation network and the depth image of an input frame output by a depth estimation network; calculating a motion constraint loss function of an image sequence input into the pose estimation network; through the pose optimization network, the pose of the relative camera is optimized according to the pose of the relative camera estimated by the pose estimation network at the past moment, the pose of the relative camera is output at the current moment of the pose estimation network, and a pose optimization loss function is calculated by utilizing the pose of the relative camera output by the pose optimization network; inputting all images in the image sequence into a pose estimation network, a depth estimation network and a pose optimization network, and training the constructed pose estimation network, depth estimation network and pose optimization network based on the obtained luminosity loss function, motion constraint loss function and pose optimization loss function based on brightness alignment; and estimating the relative camera pose corresponding to each frame of image in the image sequence of the pose to be estimated by using the trained pose estimation network and the pose optimization network. Therefore, the brightness information of the input image can be aligned, and the time sequence information implicit in the image sequence can be effectively utilized by adopting a multi-stage pose estimation strategy, so that the precision and the robustness of the relative camera pose estimation result are effectively improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a multi-stage estimated monocular vision odometer method based on brightness alignment according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an overall flow framework provided in an embodiment of the present invention;
fig. 3 is a schematic diagram of a pose estimation network structure according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a pose optimization network structure according to an embodiment of the present invention;
fig. 5 is a schematic diagram of an initial pose estimation flow provided in an embodiment of the present invention;
fig. 6 is a schematic diagram of a pose optimization flow provided by an embodiment of the present invention;
FIG. 7 (a) is a schematic diagram of a trajectory estimated by a method according to an embodiment of the present invention on sequence 09 in a KITTI odometer dataset;
FIG. 7 (b) is a schematic diagram of a trajectory estimated by a method according to an embodiment of the present invention over a sequence 10 in a KITTI odometer dataset;
FIG. 8 is a schematic diagram of the brightness alignment experiment result provided in the embodiment of the present invention;
FIG. 9 (a) is a schematic diagram of a trajectory estimated by a method according to an embodiment of the present invention over a sequence 11 in a KITTI odometer dataset;
FIG. 9 (b) is a schematic diagram of a trajectory estimated by a method according to an embodiment of the present invention over a sequence 15 in a KITTI odometer dataset;
FIG. 9 (c) is a schematic diagram of a trajectory estimated by a method according to an embodiment of the present invention over a sequence 16 in a KITTI odometer dataset;
FIG. 9 (d) is a schematic diagram of a trajectory estimated by a method according to an embodiment of the present invention over sequence 17 in a KITTI odometer dataset;
fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the embodiments of the present invention will be described in further detail with reference to the accompanying drawings.
As shown in fig. 1-2, an embodiment of the present invention provides a multi-stage estimation monocular vision odometer method based on brightness alignment, which may be implemented by an electronic device, which may be a terminal or a server, the method including:
s101, constructing a Depth estimation network (Depth Net), a Pose estimation network (Pose-BA Net) based on brightness alignment and a Pose optimization network (Pose-OPT Net) based on a two-way long-short-term memory network;
In this embodiment, the depth estimation network is configured to output the depth image of the input frame. The depth estimation network uses the ResNet50 structure as the encoder, uses a multi-layer deconvolution structure similar to the DispNet decoder as the decoder, connects the decoder to the encoder through skip connections, and uses Sigmoid as the activation function of the output layer.
In this embodiment, the pose estimation network is configured to output a relative camera pose and a brightness alignment parameter between each input image pair; as shown in fig. 3, the pose estimation network adopts an encoder-decoder structure, and the network structure is shown in table 1:
Table 1. Pose-BA Net network architecture
As shown in fig. 3, the pose estimation network is composed of a ResNet18 encoder (ResBlock is the basic module constituting ResNet18) and three parallel three-layer convolution structures. In each three-layer convolution structure, the first two convolution layers use a ReLU (Rectified Linear Unit) as the activation function, and the last convolution layer is a pure convolution layer without an activation function. After the input of the pose estimation network passes through the ResNet18 encoder, the three parallel three-layer convolution structures output a 6-degree-of-freedom relative camera pose (comprising a 3-degree-of-freedom relative rotation and a 3-degree-of-freedom relative displacement) and two brightness alignment parameters.
In this embodiment, the inputs of the pose estimation network and the depth estimation network are 832×256 RGB images.
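As a concrete illustration of the structure summarized in Table 1 and fig. 3, the following is a minimal PyTorch-style sketch of the pose estimation network (Pose-BA Net): a ResNet18 encoder over a concatenated adjacent image pair followed by three parallel three-layer convolution heads for the 6-degree-of-freedom relative camera pose and the two brightness alignment parameters. The channel widths, the global average pooling of the head outputs and the use of a torchvision backbone are illustrative assumptions, not values taken from Table 1.

```python
import torch
import torch.nn as nn
import torchvision

class PoseBANet(nn.Module):
    """Pose estimation network with brightness alignment (sketch).

    Takes a concatenated adjacent image pair and outputs a 6-DoF relative
    camera pose plus the two brightness alignment parameters a and b.
    """

    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet18()   # randomly initialized backbone
        # Accept a concatenated image pair (2 x 3 = 6 channels).
        resnet.conv1 = nn.Conv2d(6, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.encoder = nn.Sequential(*list(resnet.children())[:-2])   # drop avgpool / fc

        def head(out_channels):
            # Three-layer convolution head: two ReLU-activated layers, then a plain conv.
            return nn.Sequential(
                nn.Conv2d(512, 256, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(256, out_channels, 1))

        self.pose_head = head(6)   # 3-DoF relative rotation + 3-DoF relative translation
        self.a_head = head(1)      # brightness gain a
        self.b_head = head(1)      # brightness bias b

    def forward(self, img_t, img_t1):
        feat = self.encoder(torch.cat([img_t, img_t1], dim=1))
        pose = self.pose_head(feat).mean(dim=[2, 3])   # spatial average, shape (B, 6)
        a = self.a_head(feat).mean(dim=[2, 3])         # shape (B, 1)
        b = self.b_head(feat).mean(dim=[2, 3])         # shape (B, 1)
        return pose, a, b
```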
In this embodiment, the relative camera pose is estimated using a multi-stage estimation strategy: the relative camera pose is first estimated by the pose estimation network, and the output of the pose estimation network is then optimized by the pose optimization network. Therefore, the pose optimization network not only aggregates the implicit time sequence information in the input image sequence, but also optimizes the relative camera pose output by the pose estimation network. The pose optimization network is formed by connecting a full-connection layer after a two-way long-short-period memory network, and its basic structure is shown by the dashed box in fig. 4; wherein,
the two-way long-short-term memory network is used for aggregating time sequence information implicit in an image sequence (namely, a camera motion process);
the full connection layer is used for regressing the aggregated time sequence information into a relative camera pose with 6 degrees of freedom.
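A corresponding sketch of the pose optimization network (Pose-OPT Net) in fig. 4 is given below: a two-way (bidirectional) long-short-term memory network over a window of 6-degree-of-freedom relative poses, followed by a fully connected layer that regresses the refined pose. The hidden size, the number of LSTM layers and the choice of reading out the last time step are assumptions made only for illustration.

```python
import torch.nn as nn

class PoseOptNet(nn.Module):
    """Pose optimization network (sketch): a two-way (bidirectional) LSTM over a
    window of 6-DoF relative poses followed by a fully connected layer."""

    def __init__(self, hidden_size=128):
        super().__init__()
        self.bilstm = nn.LSTM(input_size=6, hidden_size=hidden_size,
                              num_layers=2, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_size, 6)

    def forward(self, pose_window):
        # pose_window: (B, n, 6), the n-1 past relative poses plus the current estimate.
        features, _ = self.bilstm(pose_window)
        # Regress the refined 6-DoF relative camera pose from the current (last) step.
        return self.fc(features[:, -1, :])
```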
S102, calculating luminosity loss functions based on brightness alignment between adjacent frame images according to relative camera pose and brightness alignment parameters between each pair of adjacent frame images output by a pose estimation network and the depth image of an input frame output by a depth estimation network;
In the present embodiment, for a pair of images I_t and I_{t+1} having a relative motion relationship, the corresponding relative camera pose and brightness alignment parameters are estimated using the pose estimation network, the depth image of the input frame is estimated using the depth estimation network, the brightness information of the two images is aligned using the brightness alignment parameters, the brightness-aligned first frame image I_t is re-projected to the second frame image I_{t+1} using the estimated relative camera pose and the corresponding depth image, and the brightness-alignment-based luminosity loss between the re-projected image and the real image is calculated to obtain the brightness-alignment-based luminosity loss function value. As shown in fig. 5, this may specifically include the following steps:
A1, at the current moment t, the image pair I_t and I_{t+1} having a relative motion relationship is input into the pose estimation network, and the pose estimation network outputs the 6-degree-of-freedom relative camera pose P_{t→t+1} of the adjacent-frame image pair I_t and I_{t+1} together with the brightness alignment parameters a and b; wherein P_{t→t+1} is expressed as:

P_{t→t+1}, a, b = Pose_BA(I_t, I_{t+1})

wherein I_t and I_{t+1} respectively represent the images at time t and time t+1, and Pose_BA() is the pose estimation network based on brightness alignment;
It is worth noting that the relative camera pose P_{t→t+1} obtained here, although not as precise as the relative camera pose output by the pose optimization network, is a reliable estimate of the relative camera pose at the current moment.
A2, the depth estimation network estimates the depth image D_{t+1} corresponding to the second frame image I_{t+1};
A3, P_{t→t+1} is converted into a 4×4 pose transformation matrix T_{t→t+1} using the Euler transform formula:

T_{t→t+1} = Euler(P_{t→t+1})

wherein Euler() represents the Euler transform formula;
A4, an affine brightness transformation is applied to the input image I_t using the brightness alignment parameters a and b to obtain the image Ĩ_t:

Ĩ_t = a·I_t + b

A5, for a point p_{t+1} on the image I_{t+1}, its three-dimensional coordinates are recovered from its depth D_{t+1}(p_{t+1}); its corresponding projection point p̃_t on the image Ĩ_t is expressed as:

p̃_t ~ K·T_{t→t+1}·D_{t+1}(p_{t+1})·K^(-1)·p_{t+1}

wherein K is the camera intrinsic parameter matrix and D_{t+1} is the depth image at time t+1;
A6, the image Ĩ_t is sampled at these projection points to obtain the re-projected image I'_{t+1} of the image I_{t+1} at time t+1;
A7, the brightness-alignment-based luminosity loss L_BA between the re-projected image I'_{t+1} and the real image I_{t+1} is determined and expressed as:

L_BA = μ_0·(1 − SSIM(I_{t+1}, I'_{t+1}))/2 + μ_1·‖I_{t+1} − I'_{t+1}‖_1

wherein SSIM(I_{t+1}, I'_{t+1}) represents the structural similarity between the real image I_{t+1} and the re-projected image I'_{t+1}, μ_0 and μ_1 are hyperparameters controlling the proportion of the corresponding terms, and ‖·‖_1 represents the L1 norm.
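The following sketch mirrors steps A1-A7 in code form, assuming PyTorch tensors: the brightness of I_t is affinely aligned with the parameters a and b, the aligned image is inversely warped into frame t+1 using D_{t+1}, the pose transformation matrix and the camera intrinsics K, and the SSIM/L1 combination above is evaluated. The grid construction details and the ssim() helper (assumed to return a per-pixel SSIM map) are not taken from the patent text.

```python
import torch
import torch.nn.functional as F

def reproject(src_img, depth_tgt, T, K, K_inv):
    """Inversely warp src_img (frame t) into the target frame t+1 using the
    target depth D_{t+1}, the 4x4 pose transformation matrix T (assumed to map
    points from frame t+1 to frame t) and the camera intrinsics K (steps A5-A6)."""
    B, _, H, W = src_img.shape
    dev = src_img.device
    ys, xs = torch.meshgrid(torch.arange(H, device=dev),
                            torch.arange(W, device=dev), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float().view(1, 3, -1)
    cam = (K_inv @ pix) * depth_tgt.view(B, 1, -1)          # back-project with D_{t+1}
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W, device=dev)], dim=1)  # homogeneous
    proj = K @ (T @ cam_h)[:, :3, :]                        # transform to frame t and project
    uv = proj[:, :2, :] / proj[:, 2:3, :].clamp(min=1e-6)
    u = uv[:, 0, :] / (W - 1) * 2 - 1                       # normalize to [-1, 1]
    v = uv[:, 1, :] / (H - 1) * 2 - 1
    grid = torch.stack([u, v], dim=-1).view(B, H, W, 2)
    return F.grid_sample(src_img, grid, align_corners=True)

def photometric_loss_ba(img_t, img_t1, depth_t1, T, K, K_inv, a, b,
                        mu0=0.85, mu1=0.15):
    """Brightness-alignment-based luminosity loss L_BA (sketch)."""
    aligned = a.view(-1, 1, 1, 1) * img_t + b.view(-1, 1, 1, 1)   # step A4
    rec_t1 = reproject(aligned, depth_t1, T, K, K_inv)            # I'_{t+1}
    ssim_term = (1.0 - ssim(img_t1, rec_t1)).mean() / 2.0         # ssim(): assumed helper
    l1_term = (img_t1 - rec_t1).abs().mean()
    return mu0 * ssim_term + mu1 * l1_term
```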
S103, calculating a motion constraint loss function of an image sequence input into the pose estimation network;
in this embodiment, the motion constraint loss function includes: a non-adjacent frame luminosity loss function and a pose continuity loss function;
The calculating the motion constraint loss function of the image sequence of the input pose estimation network specifically comprises the following steps:
b1, for a non-adjacent frame image pair in an image sequence with the length of N before an image at the current time t, estimating the relative camera pose and brightness alignment parameters by using a pose estimation network, calculating a corresponding brightness alignment-based luminosity loss function, distributing different weights for the brightness alignment-based luminosity loss functions of all the non-adjacent frame image pairs in the sequence, and adding to obtain the non-adjacent frame luminosity loss function of the image sequence; specifically, the method comprises the following steps:
b11, for a non-adjacent frame image pair in an image sequence with the length of N before the current t moment image, estimating relative camera pose and brightness alignment parameters between the image pairs with the image time interval of 2 to N in the image sequence by using a pose estimation network, and calculating a corresponding brightness alignment-based luminosity loss function value;
B12, assigning different weights to the brightness-alignment-based luminosity losses between the non-adjacent frame image pairs according to the interval between the image pairs to obtain the non-adjacent frame luminosity loss L_na of the whole image sequence:

L_na = Σ_{j−i≥2} μ·L_BA(I_j, I'_{i→j})

wherein I'_{i→j} represents the re-projected image from the image at the i-th moment to the j-th moment, L_BA(I_j, I'_{i→j}) is the brightness-alignment-based luminosity loss between the re-projected image I'_{i→j} and the real image I_j, and μ is a manually set hyperparameter whose value is inversely related to the interval between the two input frames:

μ = 10^(i−j)
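A sketch of how L_na could be assembled under the weighting μ = 10^(i−j) is shown below; the pairing loop is illustrative, and pose_ba_net, euler_to_matrix and photometric_loss_ba are assumed to behave like the components sketched earlier.

```python
def non_adjacent_photometric_loss(frames, depths, intrinsics, pose_ba_net,
                                  euler_to_matrix):
    """Non-adjacent frame luminosity loss L_na over a window of frames (sketch).
    `euler_to_matrix` converts a 6-DoF pose into a 4x4 matrix (assumed helper)."""
    K, K_inv = intrinsics
    loss = 0.0
    n = len(frames)
    for i in range(n):
        for j in range(i + 2, n):                  # time interval of at least 2 frames
            pose, a, b = pose_ba_net(frames[i], frames[j])
            T = euler_to_matrix(pose)
            mu = 10.0 ** (i - j)                   # weight shrinks with the interval
            loss = loss + mu * photometric_loss_ba(
                frames[i], frames[j], depths[j], T, K, K_inv, a, b)
    return loss
```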
B2, converting the relative camera pose of each group of adjacent image pairs in the image sequence estimated by the pose estimation network into a pose transformation matrix, multiplying all the pose transformation matrices to obtain a relative pose transformation matrix between the head and tail frame images of the image sequence, estimating the relative camera pose between the head and tail frame images by using the pose estimation network and converting it into a relative pose transformation matrix, and calculating the L1 norm between the relative pose transformation matrices between the head and tail frame images of the image sequence obtained in these two ways to obtain a pose continuity loss function of the image sequence, which specifically comprises the following steps:
B21, there is a correlation between several consecutive relative camera poses during the motion of the camera. In a sequence of images of length M, between each adjacent image pair I_i and I_{i+1} there is a pose transformation matrix T_{i→i+1}, wherein i = 0, 1, …, M−2, and T_{i→i+1} is obtained by converting the relative camera pose of the adjacent image pair I_i and I_{i+1};
B22, all the pose transformation matrices are successively left-multiplied in the camera coordinate system to obtain the relative pose transformation matrix between the head and tail frame images of the image sequence:

T_{0→M−1} = T_{M−2→M−1}·…·T_{1→2}·T_{0→1}

simultaneously, the relative camera pose generated by the camera motion between the first frame image and the last frame image is estimated directly using the pose estimation network and converted into a relative pose transformation matrix T'_{0→M−1} using the Euler transformation;
B23, according to the L1 norm between the obtained T_{0→M−1} and T'_{0→M−1}, the pose continuity loss L_pc,M of the image sequence of length M is obtained:

L_pc,M = ‖T_{0→M−1} − T'_{0→M−1}‖_1

B24, for the image sequence of length N before the image at the current moment t, the pose continuity losses of the sub-sequences with sampling intervals 2, 3, …, N−2 in the image sequence are added to obtain the total pose continuity loss:

L_pc = Σ_M L_pc,M
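The pose continuity term for one sub-sequence of length M can be sketched as follows, chaining the adjacent pose transformation matrices by successive left multiplication and comparing the result against the directly estimated head-to-tail transform; euler_to_matrix is the same assumed helper as above.

```python
import torch

def pose_continuity_loss(adjacent_poses, head_tail_pose, euler_to_matrix):
    """Pose continuity loss L_pc,M for a sub-sequence of length M (sketch).
    `adjacent_poses` holds the M-1 relative poses of the adjacent image pairs,
    `head_tail_pose` is the pose estimated directly between the first and last
    frames of the sub-sequence."""
    chained = torch.eye(4, device=head_tail_pose.device)
    for pose in adjacent_poses:
        chained = euler_to_matrix(pose) @ chained     # successive left multiplication
    direct = euler_to_matrix(head_tail_pose)
    return (chained - direct).abs().sum()             # L1 norm between the two matrices
```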
S104, optimizing, through the pose optimization network, the relative camera pose output by the pose estimation network at the current moment according to the relative camera poses estimated by the pose estimation network at past moments, and calculating a pose optimization loss function using the relative camera pose output by the pose optimization network;
In this embodiment, as shown in fig. 6, the relative camera poses P_{t−n+1}, …, P_{t−1} estimated by the pose estimation network at the n−1 moments before the current moment t, together with the relative camera pose P_t estimated by the pose estimation network at the current moment t, are input into the pose optimization network, and the pose optimization network outputs a 6-degree-of-freedom relative camera pose P'_t with higher precision:

P'_t = Pose_Optimization(P_{t−n+1}, …, P_{t−1}, P_t)

wherein t is larger than n−1, Pose_Optimization() is the pose optimization network, P_i denotes the 6-degree-of-freedom relative camera pose estimated by the pose estimation network at moment i, and P'_t comprises a 3-degree-of-freedom relative rotation and a 3-degree-of-freedom relative displacement;
this relative camera pose is then used for the image re-projection operation, namely: using the refined relative camera pose P'_t and D_{t+1}, I_t is re-projected to I'_{t+1}, and the corresponding brightness-alignment-based luminosity loss function is calculated and used as the loss function L_PO for training the pose optimization network.
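The second estimation stage can be sketched as follows, assuming pose_opt_net is the two-way LSTM network sketched earlier and photometric_loss_ba and euler_to_matrix are the helpers from the previous sketches; the stacking of the pose window is an implementation assumption.

```python
import torch

def refine_pose_and_loss(pose_opt_net, past_poses, current_pose,
                         img_t, img_t1, depth_t1, a, b, K, K_inv,
                         euler_to_matrix):
    """Refine the current relative pose with the pose optimization network and
    reuse the brightness-aligned reprojection loss as L_PO (sketch)."""
    window = torch.stack(list(past_poses) + [current_pose], dim=1)   # (B, n, 6)
    refined = pose_opt_net(window)                                   # higher-precision 6-DoF pose
    T = euler_to_matrix(refined)
    l_po = photometric_loss_ba(img_t, img_t1, depth_t1, T, K, K_inv, a, b)
    return refined, l_po
```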
S105, inputting all images in the image sequence into a pose estimation network, a depth estimation network and a pose optimization network, and training the constructed pose estimation network, depth estimation network and pose optimization network based on the obtained luminosity loss function, motion constraint loss function and pose optimization loss function based on brightness alignment;
In this embodiment, in addition to the loss functions mentioned above, a depth smoothing loss function L_s and a geometric consistency constraint loss function L_GC are used to train the depth estimation network. In addition, in order to reduce or eliminate the influence of dynamic scenes on the luminosity loss function during training, the luminosity loss function is weighted pixel by pixel using the self-discovery mask M. The total loss function L used in this embodiment is therefore:

L = α·L_BA + β_1·L_na + β_2·L_pc + γ·L_PO + δ·L_s + ε·L_GC

wherein α is the hyperparameter controlling the proportion of the brightness-alignment-based luminosity loss function, β_1 is the hyperparameter controlling the proportion of the non-adjacent frame luminosity loss function, β_2 is the hyperparameter controlling the proportion of the pose continuity loss function, γ is the hyperparameter controlling the proportion of the pose optimization loss function, δ is the hyperparameter controlling the proportion of the depth smoothing loss function, and ε is the hyperparameter controlling the proportion of the geometric consistency loss function.
In this embodiment, the pose estimation network, the depth estimation network, and the pose optimization network are trained using the obtained overall loss function.
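Putting the terms together, the total loss can be sketched as the weighted sum below; the default weights follow the hyperparameter values reported in the experiments section, and the pixel-wise weighting by the self-discovery mask M is assumed to have been applied inside the luminosity terms.

```python
def total_loss(l_ba, l_na, l_pc, l_po, l_s, l_gc,
               alpha=1.0, beta1=0.25, beta2=0.25, gamma=0.2,
               delta=0.1, epsilon=0.5):
    """Weighted total training loss L (sketch)."""
    return (alpha * l_ba + beta1 * l_na + beta2 * l_pc
            + gamma * l_po + delta * l_s + epsilon * l_gc)
```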
It should be noted that the depth smoothing loss function, the geometric consistency constraint loss function and the self-discovery mask M are adopted from existing literature; the related information is as follows:
in training a depth estimation network, luminance-alignment-based photometric loss functions have limitations in low-texture scenes and depth non-uniform regions, inadequate constraints on the network, andthis introduces depth smoothness loss as an additional constraint in the self-supervising visual odometry. The depth smoothness loss may ensure that the estimated depth map shows similar texture and edge features as the corresponding input image. Depth smoothness loss L s Can be expressed as:
wherein ∇ represents the first derivative in the spatial direction, and D_t is the depth image corresponding to the input image I_t at time t output by the depth estimation network.
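The text above only states that L_s is built from the first spatial derivatives of the depth map and the input image; the edge-aware form below is the formulation commonly used in self-supervised depth estimation and is given here as an assumed sketch rather than the exact loss of this embodiment.

```python
import torch

def depth_smoothness_loss(depth, img):
    """Edge-aware depth smoothness loss L_s (assumed common formulation):
    depth gradients are down-weighted where the image has strong gradients."""
    d_dx = (depth[:, :, :, 1:] - depth[:, :, :, :-1]).abs()
    d_dy = (depth[:, :, 1:, :] - depth[:, :, :-1, :]).abs()
    i_dx = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean(dim=1, keepdim=True)
    i_dy = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean(dim=1, keepdim=True)
    return (d_dx * torch.exp(-i_dx)).mean() + (d_dy * torch.exp(-i_dy)).mean()
```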
In the self-supervised visual odometer method, the depth estimates produced by the depth estimation network for different frames of the same sequence have scale differences due to the lack of a scale constraint. This variability can lead to inaccurate camera self-motion estimation. To solve this problem, a geometric consistency constraint loss function L_GC is introduced in this embodiment to impose a consistency constraint on the depth estimates between different frames:
wherein V represents the set of pixel points that are effectively projected during image re-projection, D̂_{t+1} is the depth image at time t+1 synthesized from D_t using the view synthesis method, and D'_{t+1} is the depth map interpolated from D_{t+1}.
Dynamic scenes may violate the geometric assumptions of the image re-projection process in the self-supervised algorithm; in this embodiment, a self-discovery mask M is used to handle moving objects and occlusions in the scene:
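Since the geometric consistency constraint and the self-discovery mask are adopted from existing literature (the SC-SfMLearner line of work used as the baseline here), the sketch below follows that formulation using the symbols defined above; the exact normalization is an assumption.

```python
def geometric_consistency_and_mask(d_synth, d_interp, eps=1e-7):
    """Geometric consistency loss L_GC over validly projected pixels and the
    per-pixel self-discovery mask M (sketch). `d_synth` is the depth at t+1
    synthesized from D_t by view synthesis, `d_interp` is interpolated from
    D_{t+1} at the projected points; both are restricted to the valid set V."""
    diff = (d_synth - d_interp).abs() / (d_synth + d_interp + eps)
    l_gc = diff.mean()          # averaged over the valid pixel set V
    mask = 1.0 - diff           # self-discovery mask M in [0, 1]
    return l_gc, mask
```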
s106, estimating the relative camera pose corresponding to each frame of image in the image sequence of the pose to be estimated by using the trained pose estimation network and the pose optimization network.
In this embodiment, during training, all images in a video image sequence of one batch are input into the pose estimation network, the depth estimation network and the pose optimization network, and the three networks are trained simultaneously. After training is completed, the camera pose corresponding to each frame of image in the image sequence of the pose to be estimated is estimated by using the trained pose estimation network and the pose optimization network, and the depth estimation network does not participate in the process.
The multi-stage estimation monocular vision odometer method based on brightness alignment provided by the embodiment of the invention has at least the following advantages:
(1) The self-supervised visual odometry method is based on the gray-level invariance assumption; however, in practical applications, the brightness of adjacent frames may be inconsistent due to changes in ambient illumination or camera exposure adjustment during motion, which limits the performance of the algorithm or even makes it fail. In this embodiment, the pose estimation network estimates the relative camera pose and at the same time adjusts the brightness of the two input images so that they satisfy the gray-level invariance assumption, which improves both the robustness of the algorithm in scenes with obvious brightness changes and its accuracy;
(2) In order to make use of implicit time information in an image sequence captured during camera motion to a greater extent, a method for estimating relative camera pose based on a multi-stage estimation strategy is proposed in this embodiment. In the method, the pose estimation network outputs approximate estimation of the relative camera pose at the current moment, and then the pose optimization network performs fine adjustment through integrating time information to obtain a higher-precision relative camera pose estimation result;
(3) Two new loss functions are introduced for the intrinsic characteristics of the camera motion process, aiming at improving the consistency and continuity of motion estimation. The first, the non-adjacent frame luminosity loss, is centered on computing the luminosity error between non-adjacent frames and promotes scale-consistent estimation of the relative camera pose for image pairs with different sampling intervals. The second, named the pose continuity loss, establishes a correlation between successive relative camera poses to reduce the error accumulated during algorithm operation.
In order to verify the effectiveness of the multi-stage estimated monocular vision odometer method based on brightness alignment provided by the embodiments of the present invention, the performance was tested using the evaluation index provided in the KITTI odometer dataset:
(1) Relative translation root mean square error (rel. trans.): the average translational RMSE (Root Mean Square Error) over all sub-sequences of length 100, 200, …, 800 meters in a sequence, measured in %, i.e. meters of drift per 100 meters travelled; the smaller the value, the better.
(2) Relative rotation root mean square error (rel. rot.): the average rotational RMSE over all sub-sequences of length 100, 200, …, 800 meters in a sequence, measured in deg/m; the smaller the value, the better.
In this embodiment, the eight sequences 00-07 in the KITTI odometer dataset are applied as training sets and verification sets to train the pose estimation network, the depth estimation network and the pose optimization network, and the two sequences 09-10 are used to test the performance of the method.
The KITTI odometer dataset contains binocular images of urban and highway environments, lidar point clouds and ground-truth trajectories acquired by on-board devices such as vehicle-mounted cameras and laser scanners.
In practice, the operations of S101-S106 are performed, wherein the hyperparameters of the loss function are α=1, β_1=0.25, β_2=0.25, γ=0.2, δ=0.1 and ε=0.5, and in the luminosity loss function μ_0=0.85 and μ_1=0.15. During network training, the initial learning rate is 3×10^-4 and is gradually reduced as training proceeds; 150 iterations are carried out using the Adam optimizer with a batch size of 8 per iteration, N=4 in the motion constraint loss function, and the number of input poses of the pose optimization network is n=5.
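For reference, the optimizer setup described above could look like the following PyTorch sketch; the networks, data loader and compute_total_loss wrapper are assumed to come from the earlier sketches and the user's data pipeline, and the learning-rate schedule is only an illustrative assumption since the exact decay is not specified.

```python
import torch

params = (list(pose_ba_net.parameters())
          + list(depth_net.parameters())
          + list(pose_opt_net.parameters()))
optimizer = torch.optim.Adam(params, lr=3e-4)            # initial learning rate 3x10^-4
# Gradually reduce the learning rate as training proceeds (schedule assumed).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.5)

for epoch in range(150):                                 # 150 training iterations
    for batch in loader:                                 # batch size 8
        loss = compute_total_loss(batch)                 # assumed wrapper around the losses above
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```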
In order to verify the performance of the method described in this embodiment, recent deep-learning-based self-supervised monocular visual odometry methods were selected for comparison, and the experimental results are shown in Table 2. The trajectories generated by this embodiment are shown in fig. 7 (a) and fig. 7 (b), where Groundtruth is the real trajectory, Ours is the trajectory estimated by this embodiment, and SC-SfMLearner is the trajectory estimated by the baseline method.
Table 2 comparison of the method of this example with other methods
In order to verify the significance of each part of the method described in this embodiment, an ablation experiment was also performed. First, the contribution of brightness alignment to improving the estimation accuracy of the relative camera pose was verified; the experimental results are shown in FIG. 8. Then, the performance improvements obtained by adding the various proposed modules to the basic method were verified. Table 3 shows the significance of each module proposed in this embodiment for improving the performance of the method; wherein "basic" means the basic method without any improvement, "+BA" means adding the brightness-alignment-based pose estimation network to the algorithm, "+MC" means adding the motion constraint loss function to the algorithm, and "+PO" means adding the pose optimization network to the algorithm. It can be seen that introducing each of the proposed modules into the basic method improves the estimation accuracy of the relative camera pose, and the performance of the algorithm rises steadily as each part is added, which proves the effectiveness of each part of the method of this embodiment.
Table 3 ablation experimental results
In addition, additional test experiments were performed on some other sequences of the KITTI dataset (sequences 11-21, which do not provide the true camera pose); since these sequences do not provide the true camera pose, the results of ORB-SLAM2 with loop closure were used as the reference. Among these sequences, four sequences (11, 15, 16 and 17) were selected for the additional experiments. The additional experimental results are shown in fig. 9 (a) - (d). The dashed and solid lines in fig. 9 (a) - (d) represent the trajectories obtained by ORB-SLAM2 and the method proposed in this embodiment, respectively.
Fig. 10 is a schematic structural diagram of an electronic device 600 according to an embodiment of the present invention, where the electronic device 600 may have a relatively large difference due to different configurations or performances, and may include one or more processors (central processing units, CPU) 601 and one or more memories 602, where at least one instruction is stored in the memories 602, and the at least one instruction is loaded and executed by the processors 601 to implement the above-mentioned multi-stage estimation monocular vision odometry method based on brightness alignment.
In an exemplary embodiment, a computer readable storage medium, such as a memory comprising instructions executable by a processor in a terminal to perform the above multi-stage estimated monocular vision odometer method based on brightness alignment, is also provided. For example, the computer readable storage medium may be ROM, random Access Memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.

Claims (8)

1. A multi-stage estimated monocular vision odometer method based on brightness alignment, comprising:
constructing a depth estimation network, a pose estimation network based on brightness alignment and a pose optimization network based on a two-way long-short-term memory network; the pose optimization network is used for aggregating implicit time sequence information in the input image sequence and optimizing the relative camera pose output by the pose estimation network;
calculating luminosity loss functions based on brightness alignment between adjacent frame images according to relative camera pose and brightness alignment parameters between each pair of adjacent frame images output by a pose estimation network and the depth image of an input frame output by a depth estimation network;
calculating a motion constraint loss function of an image sequence input into the pose estimation network;
optimizing, through the pose optimization network, the relative camera pose output by the pose estimation network at the current moment according to the relative camera poses estimated by the pose estimation network at past moments, and calculating a pose optimization loss function using the relative camera pose output by the pose optimization network;
Inputting all images in the image sequence into a pose estimation network, a depth estimation network and a pose optimization network, and training the constructed pose estimation network, depth estimation network and pose optimization network based on the obtained luminosity loss function, motion constraint loss function and pose optimization loss function based on brightness alignment;
and estimating the relative camera pose corresponding to each frame of image in the image sequence of the pose to be estimated by using the trained pose estimation network and the pose optimization network.
2. The multi-stage estimation monocular vision odometer method based on brightness alignment of claim 1, wherein the pose estimation network adopts an encoder-decoder structure;
the pose optimization network is formed by connecting a full-connection layer after adopting a two-way long-short-period memory network; wherein,
the two-way long-short-term memory network is used for aggregating the time sequence information implicit in the image sequence;
the full connection layer is used for regressing the aggregated time sequence information into a relative camera pose with 6 degrees of freedom.
3. The multi-stage estimation monocular vision odometer method based on brightness alignment of claim 1, wherein the calculating the brightness-alignment-based luminosity loss function between adjacent frame images according to the relative camera pose between each pair of adjacent frame images output by the pose estimation network, the brightness alignment parameters, and the depth image of the input frame output by the depth estimation network comprises:
For a pair of images I_t and I_{t+1} having a relative motion relationship, estimating the corresponding relative camera pose and brightness alignment parameters by using the pose estimation network, estimating the depth image of the input frame by using the depth estimation network, aligning the brightness information of the two images by using the brightness alignment parameters, re-projecting the brightness-aligned first frame image I_t to the second frame image I_{t+1} by using the estimated relative camera pose and the corresponding depth image, and calculating the brightness-alignment-based luminosity loss between the re-projected image and the real image to obtain the brightness-alignment-based luminosity loss function value.
4. A multi-stage estimation monocular vision odometer method based on brightness alignment according to claim 3, wherein the estimating brightness loss based on brightness alignment between the re-projected image and the real image by estimating the corresponding relative camera pose and brightness alignment parameters using the pose estimation network, estimating the depth image of the input frame using the depth estimation network, aligning the brightness information of the two images using the brightness alignment parameters, re-projecting the first frame image after brightness alignment to the second frame using the estimated relative camera pose and the corresponding depth image, calculating the brightness loss based on brightness alignment between the re-projected image and the real image comprises:
At the current moment t, the image pair I_t and I_{t+1} having a relative motion relationship is input into the pose estimation network, and the pose estimation network outputs the 6-degree-of-freedom relative camera pose P_{t→t+1} of the adjacent-frame image pair I_t and I_{t+1} together with the brightness alignment parameters a and b; wherein P_{t→t+1} is expressed as:

P_{t→t+1}, a, b = Pose_BA(I_t, I_{t+1})

wherein I_t and I_{t+1} respectively represent the images at time t and time t+1, and Pose_BA() is the pose estimation network based on brightness alignment;
the depth estimation network estimates the depth image D_{t+1} corresponding to the second frame image I_{t+1};
P_{t→t+1} is converted into a 4×4 pose transformation matrix T_{t→t+1} using the Euler transform formula:

T_{t→t+1} = Euler(P_{t→t+1})

wherein Euler() represents the Euler transform formula;
an affine brightness transformation is applied to the input image I_t using the brightness alignment parameters a and b to obtain the image Ĩ_t:

Ĩ_t = a·I_t + b

for a point p_{t+1} on the image I_{t+1}, its three-dimensional coordinates are recovered from its depth D_{t+1}(p_{t+1}); its corresponding projection point p̃_t on the image Ĩ_t is expressed as:

p̃_t ~ K·T_{t→t+1}·D_{t+1}(p_{t+1})·K^(-1)·p_{t+1}

wherein K is the camera intrinsic parameter matrix and D_{t+1} is the depth image at time t+1;
the image Ĩ_t is sampled at these projection points to obtain the re-projected image I'_{t+1} of the image I_{t+1} at time t+1;
the brightness-alignment-based luminosity loss L_BA between the re-projected image I'_{t+1} and the real image I_{t+1} is determined and expressed as:

L_BA = μ_0·(1 − SSIM(I_{t+1}, I'_{t+1}))/2 + μ_1·‖I_{t+1} − I'_{t+1}‖_1

wherein SSIM(I_{t+1}, I'_{t+1}) represents the structural similarity between the real image I_{t+1} and the re-projected image I'_{t+1}, μ_0 and μ_1 are hyperparameters controlling the proportion of the corresponding terms, and ‖·‖_1 represents the L1 norm.
5. The multi-stage estimation monocular vision odometer method based on brightness alignment of claim 1, wherein the motion constraint loss function comprises: a non-adjacent frame luminosity loss function and a pose continuity loss function;
the calculating the motion constraint loss function of the image sequence input to the pose estimation network comprises:
for a non-adjacent frame image pair in an image sequence with the length of N before the current t moment image, estimating the relative camera pose and brightness alignment parameters by using a pose estimation network, calculating a corresponding brightness alignment-based luminosity loss function, and distributing different weights for the brightness alignment-based luminosity loss functions of all the non-adjacent frame image pairs in the sequence to add to obtain the non-adjacent frame luminosity loss function of the image sequence;
converting the relative camera pose of each group of adjacent image pairs in the image sequence estimated by the pose estimation network into a pose transformation matrix, multiplying all the pose transformation matrices to obtain a relative pose transformation matrix between the head and tail frame images of the image sequence, estimating the relative camera pose between the head and tail frame images by using the pose estimation network and converting it into a relative pose transformation matrix, and calculating the L1 norm between the relative pose transformation matrices between the head and tail frame images of the image sequence obtained in these two ways to obtain a pose continuity loss function of the image sequence.
6. The multi-stage estimation monocular vision odometer method based on brightness alignment of claim 5, wherein the estimating the relative camera pose and brightness alignment parameters using the pose estimation network for the non-adjacent frame image pairs in the image sequence of length N preceding the image at the current moment t, calculating the corresponding brightness-alignment-based luminosity loss functions, and adding the brightness-alignment-based luminosity loss functions of all the non-adjacent frame image pairs in the sequence with different weights to obtain the non-adjacent frame luminosity loss function of the image sequence comprises:
for the non-adjacent frame image pairs in the image sequence of length N preceding the image at the current time t, estimating the relative camera pose and brightness alignment parameters between the image pairs whose time interval ranges from 2 to N using the pose estimation network, and calculating the corresponding luminosity loss value based on brightness alignment;
assigning different weights to the brightness-alignment-based luminosity losses between the non-adjacent frame image pairs according to the interval between the images in each pair, and summing them to obtain the non-adjacent frame luminosity loss L_na of the whole image sequence:
L_na = Σ_{2 ≤ j-i ≤ N} μ · L_{i→j}
wherein I_{i→j} denotes the re-projected image obtained by reprojecting the image at time i onto time j, L_{i→j} is the luminosity loss based on brightness alignment between the re-projected image I_{i→j} and the real image I_j, and μ is a manually set hyperparameter whose value decreases as the interval between the two input frames grows:
μ = 10^{i-j}
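A short sketch of how such a weighted non-adjacent frame loss could be accumulated; pair_loss_fn is a hypothetical helper that estimates the pose and brightness parameters for a frame pair and evaluates the brightness-aligned photometric loss on its reprojection:

```python
def non_adjacent_photometric_loss(num_frames, pair_loss_fn, N):
    """Sum the brightness-aligned photometric losses over all frame pairs
    with a time interval of 2..N, weighting each pair by mu = 10**(i - j)
    so that pairs further apart contribute less (i < j, hence mu < 1)."""
    total = 0.0
    for j in range(num_frames):
        for interval in range(2, N + 1):
            i = j - interval
            if i < 0:
                continue
            mu = 10.0 ** (i - j)          # decays with the frame interval
            total = total + mu * pair_loss_fn(i, j)
    return total
```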
7. The multi-stage estimation monocular vision odometer method based on brightness alignment of claim 5, wherein the converting the relative camera pose of each adjacent image pair in the image sequence estimated by the pose estimation network into a pose transformation matrix, multiplying all the pose transformation matrices to obtain the relative pose transformation matrix between the head and tail frame images of the image sequence, estimating the relative camera pose between the head and tail frame images directly with the pose estimation network and converting it into a relative pose transformation matrix, and calculating the L1 norm between the relative pose transformation matrices obtained in these two ways to obtain the pose continuity loss function of the image sequence comprises:
in an image sequence of length M, each adjacent image pair I_i and I_{i+1} has a pose transformation matrix T_{i→i+1} between them, where i = 0, 1, ..., M-2 and T_{i→i+1} is obtained by converting the relative camera pose estimated by the pose estimation network for the adjacent image pair I_i and I_{i+1};
all the pose transformation matrices are successively left-multiplied in the camera coordinate system to obtain the relative pose transformation matrix between the head and tail frame images of the image sequence:
T_{0→M-1} = T_{M-2→M-1} · ... · T_{1→2} · T_{0→1}
at the same time, the relative camera pose produced by the camera motion between the first frame image and the last frame image is estimated directly with the pose estimation network and converted into a relative pose transformation matrix T'_{0→M-1} using the Euler transform;
the pose continuity loss L_{pc,M} of the image sequence of length M is obtained from the L1 norm between the two matrices:
L_{pc,M} = ||T_{0→M-1} - T'_{0→M-1}||_1
for the image sequence of length N preceding the image at the current time t, the pose continuity losses of the sub-sequences with sampling intervals 2, 3, ..., N-2 are added to obtain the total pose continuity loss:
L_pc = Σ L_{pc,M} (summed over all such sub-sequences)
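A minimal sketch of the pose continuity loss for one sub-sequence; the adjacent 4 x 4 matrices are chained by successive left-multiplication and compared, via an elementwise L1 norm, against the matrix estimated directly between the first and last frames (both inputs are assumed to be NumPy arrays produced by an Euler conversion such as the one sketched earlier):

```python
import numpy as np

def pose_continuity_loss(adjacent_T, direct_T):
    """L_{pc,M}: left-multiply the adjacent pose transformation matrices
    T_{0->1}, T_{1->2}, ... to get the head-to-tail transform, then take
    the elementwise L1 norm against the directly estimated transform."""
    T_chain = np.eye(4)
    for T in adjacent_T:
        T_chain = T @ T_chain        # successive left-multiplication
    return np.abs(T_chain - direct_T).sum()
```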
8. The multi-stage estimation monocular vision odometer method based on brightness alignment of claim 1, wherein the optimizing, by the pose optimization network, the relative camera pose output by the pose estimation network at the current time using the relative camera poses estimated by the pose estimation network at past times, and calculating the pose optimization loss function using the relative camera pose output by the pose optimization network, comprises:
inputting the relative camera poses P_{t-n+1}, ..., P_{t-1} estimated by the pose estimation network at the n-1 times preceding the current time t, together with the relative camera pose P_t estimated by the pose estimation network at the current time t, into the pose optimization network, which outputs a higher-precision 6-degree-of-freedom relative camera pose P̂_t:
P̂_t = Pose_Optimization(P_{t-n+1}, ..., P_{t-1}, P_t)
wherein t is larger than n-1, Pose_Optimization() is the pose optimization network, and P̂_t comprises a 3-degree-of-freedom relative rotation and a 3-degree-of-freedom relative displacement;
an image re-projection operation is performed using the relative camera pose output by the pose optimization network, and the corresponding luminosity loss function based on brightness alignment is calculated and used as the loss function L_PO for training the pose optimization network.
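A minimal sketch of a pose optimization module operating on the sequence of recent 6-degree-of-freedom poses; the bidirectional LSTM, hidden size, and output head are illustrative assumptions rather than the exact architecture:

```python
import torch
import torch.nn as nn

class PoseOptimizationNet(nn.Module):
    """Refines the current 6-DoF relative pose from the n most recent
    pose estimates (oldest first), shaped (batch, n, 6)."""
    def __init__(self, hidden_size=128):
        super().__init__()
        self.lstm = nn.LSTM(input_size=6, hidden_size=hidden_size,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden_size, 6)

    def forward(self, poses):
        features, _ = self.lstm(poses)      # (batch, n, 2 * hidden_size)
        return self.head(features[:, -1])   # refined pose for the current time

# usage sketch: refined = PoseOptimizationNet()(torch.randn(1, 5, 6))
```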
CN202311238629.5A 2023-09-22 2023-09-22 Multi-stage estimation monocular vision odometer method based on brightness alignment Active CN117197229B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311238629.5A CN117197229B (en) 2023-09-22 2023-09-22 Multi-stage estimation monocular vision odometer method based on brightness alignment

Publications (2)

Publication Number Publication Date
CN117197229A true CN117197229A (en) 2023-12-08
CN117197229B CN117197229B (en) 2024-04-19

Family

ID=89001517

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311238629.5A Active CN117197229B (en) 2023-09-22 2023-09-22 Multi-stage estimation monocular vision odometer method based on brightness alignment

Country Status (1)

Country Link
CN (1) CN117197229B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108010081A (en) * 2017-12-01 2018-05-08 中山大学 A kind of RGB-D visual odometry methods based on Census conversion and Local map optimization
CN110490928A (en) * 2019-07-05 2019-11-22 天津大学 A kind of camera Attitude estimation method based on deep neural network
KR102098687B1 (en) * 2018-11-02 2020-04-09 서울대학교산학협력단 Edge-based Visual Odometry method and device
CN111275764A (en) * 2020-02-12 2020-06-12 南开大学 Depth camera visual mileage measurement method based on line segment shadow
CN111369608A (en) * 2020-05-29 2020-07-03 南京晓庄学院 Visual odometer method based on image depth estimation
US20210042937A1 (en) * 2019-08-08 2021-02-11 Nec Laboratories America, Inc. Self-supervised visual odometry framework using long-term modeling and incremental learning
US20210390723A1 (en) * 2020-06-15 2021-12-16 Dalian University Of Technology Monocular unsupervised depth estimation method based on contextual attention mechanism
CN114663509A (en) * 2022-03-23 2022-06-24 北京科技大学 Self-supervision monocular vision odometer method guided by key point thermodynamic diagram
CN114663496A (en) * 2022-03-23 2022-06-24 北京科技大学 Monocular vision odometer method based on Kalman pose estimation network
WO2022241874A1 (en) * 2021-05-18 2022-11-24 烟台艾睿光电科技有限公司 Infrared thermal imaging monocular vision ranging method and related assembly
CN115661224A (en) * 2022-05-26 2023-01-31 山东科技大学 Unsupervised multi-frame endoscope scene depth estimation method and unsupervised multi-frame endoscope scene depth estimation equipment
WO2023165093A1 (en) * 2022-03-01 2023-09-07 上海商汤智能科技有限公司 Training method for visual inertial odometer model, posture estimation method and apparatuses, electronic device, computer-readable storage medium, and program product

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CEN Shijie; HE Yuanlie; CHEN Xiaocong: "Monocular depth estimation combining attention and unsupervised deep learning", Journal of Guangdong University of Technology, no. 04, 14 July 2020 (2020-07-14) *
LI Yuan; PENG Xiaodong; ZHOU Wugen; LI Yun; XIE Wenming: "Pose estimation method based on recurrent convolutional network in asteroid scenes", Transducer and Microsystem Technologies, no. 08, 23 July 2020 (2020-07-23) *
SU Jianpeng; HUANG Yingping; ZHAO Baigan; HU Xing: "Research on visual odometry based on deep convolutional neural networks", Optical Instruments, no. 04, 15 August 2020 (2020-08-15) *

Also Published As

Publication number Publication date
CN117197229B (en) 2024-04-19

Similar Documents

Publication Publication Date Title
CN109387204B (en) Mobile robot synchronous positioning and composition method facing indoor dynamic environment
CN110108258B (en) Monocular vision odometer positioning method
CN114663496B (en) Monocular vision odometer method based on Kalman pose estimation network
Zhang et al. Deepptz: Deep self-calibration for ptz cameras
CN110610486B (en) Monocular image depth estimation method and device
CN108986166A (en) A kind of monocular vision mileage prediction technique and odometer based on semi-supervised learning
CN114663509B (en) Self-supervision monocular vision odometer method guided by key point thermodynamic diagram
CN113256698B (en) Monocular 3D reconstruction method with depth prediction
CN112233179B (en) Visual odometer measuring method
CN111899280A (en) Monocular vision odometer method adopting deep learning and mixed pose estimation
CN113077505A (en) Optimization method of monocular depth estimation network based on contrast learning
CN115115859A (en) Long linear engineering construction progress intelligent identification and analysis method based on unmanned aerial vehicle aerial photography
CN110942484A (en) Camera self-motion estimation method based on occlusion perception and feature pyramid matching
CN111882602A (en) Visual odometer implementation method based on ORB feature points and GMS matching filter
CN110428461B (en) Monocular SLAM method and device combined with deep learning
Fang et al. Self-supervised camera self-calibration from video
Kallwies et al. Triple-SGM: stereo processing using semi-global matching with cost fusion
CN116468769A (en) Depth information estimation method based on image
WO2022120996A1 (en) Visual position recognition method and apparatus, and computer device and readable storage medium
CN117197229B (en) Multi-stage estimation monocular vision odometer method based on brightness alignment
CN110555880B (en) Focal length unknown P6P camera pose estimation method
CN112419411A (en) Method for realizing visual odometer based on convolutional neural network and optical flow characteristics
Zhou et al. Stn-homography: estimate homography parameters directly
CN112116640B (en) Binocular stereo matching method based on OpenCL
CN115100237A (en) Visual odometer method for inspection robot

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant