CN113989318A - Monocular vision odometer pose optimization and error correction method based on deep learning - Google Patents

Monocular vision odometer pose optimization and error correction method based on deep learning

Info

Publication number: CN113989318A
Application number: CN202111221271.6A
Authority: CN (China)
Prior art keywords: motion, time, data, pose, similarity
Prior art date: 2021-10-20
Legal status: Granted; Active
Other languages: Chinese (zh)
Other versions: CN113989318B (en)
Inventors: 肖卓凌 (Xiao Zhuoling), 宋濡君 (Song Rujun), 朱然 (Zhu Ran)
Current Assignee: University of Electronic Science and Technology of China
Original Assignee: University of Electronic Science and Technology of China
Priority date / Filing date: 2021-10-20
Application filed by University of Electronic Science and Technology of China
Publication of CN113989318A: 2022-01-28
Application granted; publication of CN113989318B: 2023-04-07

Classifications

    • G06T 7/215: Physics; Computing; Image data processing or generation; Image analysis; Analysis of motion; Motion-based segmentation
    • G01C 25/00: Physics; Measuring; Testing; Measuring distances, levels or bearings; Surveying; Navigation; Manufacturing, calibrating, cleaning, or repairing instruments or devices referred to in the other groups of this subclass
    • G06N 3/044: Computing arrangements based on specific computational models; Computing arrangements based on biological models; Neural networks; Architecture, e.g. interconnection topology; Recurrent networks, e.g. Hopfield networks
    • G06N 3/08: Computing arrangements based on biological models; Neural networks; Learning methods
    • G06T 7/248: Image analysis; Analysis of motion using feature-based methods, e.g. the tracking of corners or segments, involving reference images or patches
    • G06T 2207/10016: Indexing scheme for image analysis or image enhancement; Image acquisition modality; Video; Image sequence
    • G06T 2207/10024: Image acquisition modality; Color image
    • G06T 2207/20081: Special algorithmic details; Training; Learning
    • G06T 2207/20084: Special algorithmic details; Artificial neural networks [ANN]
    • Y02T 10/40: Climate change mitigation technologies related to transportation; Road transport of goods or passengers; Internal combustion engine [ICE] based vehicles; Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Manufacturing & Machinery (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a monocular vision odometer pose optimization and error correction method based on deep learning. The method acquires image data and computes the corresponding optical flow image sequence; segments the optical flow image sequence with a fixed-step sliding window to obtain a plurality of input sequences, and extracts the high-dimensional motion features of each input sequence with an encoder; feeds the high-dimensional motion features into an artificial neural network; models motion similarity over the temporal relation of the motion and its local context information with a pose transformation similarity calculation module, and guides the optimization of the pose features with an attention mechanism to obtain motion features refined by motion similarity; and finally feeds the refined motion features into a pose correction prediction network to realize pose optimization and error correction. The invention fully mines and models the temporal relation and the similarity of continuous motion in the image motion data, improving robustness.

Description

Monocular vision odometer pose optimization and error correction method based on deep learning
Technical Field
The invention relates to the field of computer vision, in particular to a monocular vision odometer pose optimization and error correction method based on deep learning.
Background
In recent years, the rapid development of Internet of Things applications has driven rising demand for location-based services (LBS), making the need for high-precision real-time positioning schemes increasingly urgent. A stable, accurate, real-time positioning system is an essential foundation for Internet of Things applications such as robot control, autonomous driving, virtual reality (VR), and retail.
Although positioning through Global Navigation Satellite Systems (GNSS) such as the Global Positioning System (GPS), the BeiDou Navigation Satellite System (BDS), Galileo, and GLONASS is now ubiquitous, satellite positioning can be inaccurate in heavily occluded outdoor environments (such as tunnels and forests) or in indoor scenes where building structures block and interfere with satellite signals. Visual odometry (VO), which uses a visual sensor, is an effective way to address these problems: its visual input is information-rich, it applies to a wide range of scenes, and it is low-cost, making it a common means of implementing positioning applications.
However, a monocular visual odometer mainly predicts the inter-frame pose transformation of the camera carrier from images at adjacent acquisition times and then accumulates these predictions into the overall motion trajectory, so accumulated error grows with motion distance and causes the trajectory estimate to diverge. Effectively eliminating the accumulated error of visual odometry pose prediction is therefore the key to a high-precision monocular positioning system. At present, common methods for mitigating the accumulated error of a monocular visual odometer and improving pose prediction accuracy include: 1) building a pose graph of the monocular camera carrier's motion with loop closure detection, and applying back-end optimization to the predicted poses; for example, the ORB-SLAM positioning system optimizes its predicted trajectory locally and globally based on the covisibility of landmarks; 2) correcting the visual odometry positioning system with other kinds of information through data fusion; for example, the visual-inertial odometer (VIO) is a high-precision positioning system that eliminates drift of the visual measurement unit by incorporating inertial navigation information; 3) modeling the motion correlation of the image sequence in the time dimension to optimize pose prediction; for example, the deep-learning monocular visual odometry SRNN model guides and optimizes the predicted pose by modeling the correlation of pose transformations at adjacent times. The first approach has clear limitations: it depends heavily on the environment and generalizes poorly. In a previously unvisited scene, environmental landmarks cannot be constructed in advance and the motion trajectory may never close, so the pose graph optimization and loop closure detection modules are likely to fail. The second approach is also limited: when the measurement quality of the other sensors is poor, the prediction accuracy of the original monocular visual odometer degrades noticeably, and the data fusion algorithm itself strongly influences the final prediction.
Disclosure of Invention
To address the above deficiencies of the prior art, the monocular vision odometer pose optimization and error correction method based on deep learning provided herein solves the poor accuracy and robustness of pose optimization and error correction in the prior art.
In order to achieve the purpose of the invention, the invention adopts the technical scheme that:
the method for optimizing the pose and correcting the error of the monocular vision odometer based on the deep learning comprises the following steps:
s1, acquiring image data and calculating a corresponding optical flow image sequence;
s2, segmenting the optical flow picture sequence by adopting a fixed step length sliding window to obtain a plurality of segmented input sequence data, and obtaining high-dimensional motion characteristics of each input sequence data by utilizing an encoder;
s3, inputting the high-dimensional motion characteristics into an artificial neural network to obtain the time sequence relation of motion and local context information of the motion;
s4, inputting the result of the step S3 into a pose transformation similarity calculation module for motion similarity modeling to obtain motion correlation characteristics of motion time sequence relation and motion correlation characteristics of motion local context information; optimizing the pose characteristics by utilizing an attention mechanism based on the motion correlation characteristics to obtain motion characteristics after motion similarity purification;
and S5, inputting the motion characteristics purified through the motion similarity into a pose correction prediction network for pose optimization and error correction.
Further, the specific method of step S1 is:
S1-1, setting the sampling frequency of the monocular vision sensor and sampling to obtain a three-channel color RGB image sequence;
S1-2, calculating the optical flow image sequence of the three-channel RGB image sequence according to the formula Flo_t = F(I_{t-1}, I_t); where Flo_t is the optical flow image at time t, F(·) is the optical flow computation, I_{t-1} is the three-channel RGB image at time t-1, and I_t is the three-channel RGB image at time t.
Further: the sampling frequency of the monocular vision sensor is set to 20 Hz; the data dimension of each three-channel RGB image is (1226, 370, 3) and that of each optical flow image is (1226, 370, 2); every two consecutive RGB image frames yield one corresponding optical flow image frame.
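As a concrete illustration of step S1, the sketch below computes the optical flow sequence with OpenCV. The Farneback method and its parameter values are assumptions for illustration only, since the patent does not name a particular optical flow algorithm; the (1226, 370) image size above is written here in NumPy's (height, width) order.

```python
# A minimal sketch of step S1, assuming OpenCV's Farneback dense optical flow
# as the optical flow computation F(.); the algorithm choice and parameters
# are illustrative, not taken from the patent.
import cv2
import numpy as np

def optical_flow_sequence(frames):
    """frames: list of RGB images of shape (370, 1226, 3), sampled at 20 Hz.
    Returns the optical flow images Flo_t, each of shape (370, 1226, 2)."""
    flows = []
    for t in range(1, len(frames)):
        prev = cv2.cvtColor(frames[t - 1], cv2.COLOR_RGB2GRAY)
        curr = cv2.cvtColor(frames[t], cv2.COLOR_RGB2GRAY)
        # Flo_t = F(I_{t-1}, I_t): a two-channel (dx, dy) displacement field
        flow = cv2.calcOpticalFlowFarneback(
            prev, curr, None, pyr_scale=0.5, levels=3, winsize=15,
            iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
        flows.append(flow.astype(np.float32))
    return flows
```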
Further, the specific method for obtaining the plurality of segmented input sequence data in step S2 is as follows:
segmenting the optical flow image sequence with a sliding window of length 9 and step 9 to obtain input sequences of length 9; wherein each input sequence is a four-dimensional tensor of dimension (9, 1226, 370, 2), stacking the three-dimensional optical flow frames along the window length.
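A minimal sketch of this fixed-step segmentation follows; the NumPy layout and function name are illustrative.

```python
# Fixed-step sliding window of step S2: length 9 and step 9 give
# non-overlapping input sequences of optical flow frames.
import numpy as np

def segment_flow_sequence(flow_seq, window=9, step=9):
    """flow_seq: array of shape (N, 370, 1226, 2) holding N optical flow frames.
    Returns an array of shape (num_windows, 9, 370, 1226, 2)."""
    starts = range(0, len(flow_seq) - window + 1, step)
    return np.stack([flow_seq[s:s + window] for s in starts])
```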
Further, the specific method in step S3 is:
inputting the high-dimensional motion features into an artificial neural network consisting of two stacked long short-term memory (LSTM) layers, computed according to the formulas:
i_t = σ(ω_ix x_t + ω_ih h_{t-1} + b_i)
g_t = tanh(ω_gx x_t + ω_gh h_{t-1} + b_g)
f_t = σ(ω_fx x_t + ω_fh h_{t-1} + b_f)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t
o_t = σ(ω_ox x_t + ω_oh h_{t-1} + b_o)
h_t = o_t ⊙ tanh(c_t)
obtaining the local context information of the motion h_t, i.e. the hidden state at time t, and the temporal relation of the motion o_t, i.e. the LSTM output at time t; where i_t is the input gate state of the LSTM at time t, σ(·) is the sigmoid activation function, ω_ix is the weight of the input data, x_t is the input at time t, ω_ih is the corresponding weight of the hidden state, h_{t-1} is the hidden state at time t-1, b_i is the corresponding bias of the input data, g_t is the candidate information of the input data at time t, tanh(·) is the hyperbolic tangent activation function, ω_gx and ω_gh are the candidate-information weights for the input data and the hidden state, b_g is the corresponding bias of the candidate information, f_t is the forget gate state at time t, ω_fx and ω_fh are the forget-gate weights for the input data and the hidden state, b_f is the corresponding bias of the forget gate, c_t and c_{t-1} are the cell states at times t and t-1, o_t is the output gate state at time t, ω_ox and ω_oh are the output-gate weights for the input data and the hidden state, b_o is the corresponding bias of the output gate, and ⊙ denotes the Hadamard (element-wise) product.
The output of the last LSTM layer is the temporal relation of the motion, with dimension (1, 1024); the hidden states of the two LSTM layers store the local context information of the motion, with dimension (2, 1024).
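The following PyTorch sketch mirrors this two-layer stacked LSTM; the 1024-dimensional encoder feature size is taken from the dimensions stated above, and the module name is illustrative.

```python
# A minimal sketch of step S3: two stacked LSTM layers whose last output is
# the temporal relation of the motion and whose hidden states hold the
# local context information of the motion.
import torch
import torch.nn as nn

class MotionLSTM(nn.Module):
    def __init__(self, feat_dim=1024, hidden=1024):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)

    def forward(self, x):
        # x: (batch, 9, feat_dim) encoder features for one sliding window
        out, (h_n, _) = self.lstm(x)
        o_t = out[:, -1]  # temporal relation of the motion, (batch, 1024)
        # h_n: hidden states of both layers, (2, batch, 1024): the local
        # context information of the motion
        return o_t, h_n
```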
Further, the specific method for obtaining the motion features refined by motion similarity in step S4 is:
according to the formula:
[The first two formulas of this step appear only as images in the source; they define the refined motion feature X′_t and the refined local motion context H′_t from exp(·)-weighted cosine similarities of adjacent motion features.]
X″_t = f_{1×1}([X′_t, H′_t])
obtaining the optimized pose feature X″_t output at time t, i.e. the motion feature refined by motion similarity; where X′_t is the refined motion feature at time t obtained under the guidance of the motion-similarity attention mechanism, exp(·) is the exponential function, S(·) is the cosine similarity function, X_{t-1} is the motion feature extracted by the artificial neural network at time t-1, X_t is the motion feature extracted by the artificial neural network at time t, i.e. the motion correlation feature of the local context information of the motion, W is the vector dimension of the motion feature, H′_t is the refined local motion context at time t obtained under the guidance of the motion-similarity attention mechanism, H_n is the local motion context stored in the hidden state of the last LSTM layer of the artificial neural network, i.e. the motion correlation feature of the motion temporal relation, f_{1×1}(·) is a convolutional layer with kernel size 1×1, and [X′_t, H′_t] denotes the concatenation of the refined motion feature and the refined local motion context.
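A hedged sketch of this similarity-guided refinement follows. The patent's two attention formulas survive only as images in the source, so the exp(cosine similarity) gating below is an assumption pieced together from the surrounding definitions of S(·), exp(·) and W; only the fusion X″_t = f_{1×1}([X′_t, H′_t]) is stated explicitly in the text.

```python
# An assumed sketch of the step-S4 pose transformation similarity module:
# cosine similarity of adjacent motion features, exp(.) weighting, and a
# 1x1 convolution fusing the refined feature with the refined context.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimilarityRefine(nn.Module):
    def __init__(self, dim=1024):
        super().__init__()
        # f_1x1: convolution with kernel size 1, as stated in the text
        self.fuse = nn.Conv1d(2 * dim, dim, kernel_size=1)

    def forward(self, x_prev, x_t, h_n):
        # x_prev, x_t: (batch, dim) motion features at times t-1 and t
        # h_n: (num_layers, batch, dim) LSTM hidden states (local context)
        sim = F.cosine_similarity(x_prev, x_t, dim=-1)   # S(X_{t-1}, X_t)
        w = torch.exp(sim).unsqueeze(-1)                 # exp(.) attention weight
        x_ref = w * x_t                                  # X'_t (assumed gating)
        h_ref = w * h_n[-1]                              # H'_t (assumed gating)
        cat = torch.cat([x_ref, h_ref], dim=-1)          # [X'_t, H'_t]
        return self.fuse(cat.unsqueeze(-1)).squeeze(-1)  # X''_t
```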
Further, the pose correction prediction network in step S5 comprises a first long short-term memory network, a second long short-term memory network, a first fully-connected layer, and a second fully-connected layer connected in sequence; the output dimension of both LSTM networks is 1024; the first fully-connected layer has 128 neurons and an activation function; the second fully-connected layer has 6 neurons and no activation function.
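A minimal PyTorch sketch of this pose correction prediction network follows; interpreting the six outputs as three translation plus three rotation components is an assumption, as the text only fixes the layer sizes.

```python
# Step S5 head per the description: two LSTMs with 1024-dimensional outputs,
# a 128-neuron fully-connected layer with activation (ReLU, per the formula
# in the description), and a 6-neuron output layer with no activation.
import torch
import torch.nn as nn

class PoseCorrectionNet(nn.Module):
    def __init__(self, dim=1024):
        super().__init__()
        self.lstm1 = nn.LSTM(dim, 1024, batch_first=True)
        self.lstm2 = nn.LSTM(1024, 1024, batch_first=True)
        self.fc1 = nn.Linear(1024, 128)
        self.fc2 = nn.Linear(128, 6)  # assumed: 3 translation + 3 rotation

    def forward(self, x):
        # x: (batch, seq_len, dim) motion features refined in step S4
        y, _ = self.lstm1(x)
        y, _ = self.lstm2(y)
        y = torch.relu(self.fc1(y))
        return self.fc2(y)  # per-step 6-DoF pose correction
```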
The invention has the beneficial effects that:
1. Starting from high-dimensional motion features computed by an encoder from the optical flow data of the image sequence, the deep-learning pose optimization and error correction method mines the temporal relation of continuous motion and its local context information, models the similarity of continuous motion in the sensor data, guides the optimization of the high-dimensional motion features through an attention mechanism, and finally fits the pose change of the camera carrier between adjacent camera sampling points, achieving pose optimization and error correction and ensuring the accuracy of the system.
2. Without requiring the camera parameters of the visual sensor, scene point landmarks, or scene depth information, the method extracts the temporal relation of continuous motion through an artificial neural network from the encoder's high-dimensional motion features, optimizes the feature data under attention guidance using motion similarity features, and finally obtains accurate and robust pose optimization and error correction; that is, it ensures the robustness of the system and autonomously recovers the absolute scale of the motion trajectory from monocular image data alone.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The following description of the embodiments of the present invention is provided to facilitate understanding by those skilled in the art, but it should be understood that the invention is not limited to the scope of the embodiments. For those of ordinary skill in the art, as long as the variations fall within the spirit and scope of the invention as defined by the appended claims, such variations are obvious, and all inventions and creations making use of the inventive concept are protected.
As shown in FIG. 1, the monocular vision odometer pose optimization and error correction method based on deep learning comprises the following steps:
S1, acquiring image data and calculating a corresponding optical flow image sequence;
S2, segmenting the optical flow image sequence by adopting a fixed-step sliding window to obtain a plurality of segmented input sequences, and obtaining the high-dimensional motion features of each input sequence by utilizing an encoder;
S3, inputting the high-dimensional motion features into an artificial neural network to obtain the temporal relation of the motion and the local context information of the motion;
S4, inputting the result of step S3 into a pose transformation similarity calculation module for motion similarity modeling to obtain the motion correlation features of the motion temporal relation and of the local motion context, and optimizing the pose features by utilizing an attention mechanism based on the motion correlation features to obtain motion features refined by motion similarity;
and S5, inputting the motion features refined by motion similarity into a pose correction prediction network to realize pose optimization and error correction.
The specific method of step S1 is:
S1-1, setting the sampling frequency of the monocular vision sensor and sampling to obtain a three-channel color RGB image sequence;
S1-2, calculating the optical flow image sequence of the three-channel RGB image sequence according to the formula Flo_t = F(I_{t-1}, I_t); where Flo_t is the optical flow image at time t, F(·) is the optical flow computation, I_{t-1} is the three-channel RGB image at time t-1, and I_t is the three-channel RGB image at time t.
The sampling frequency of the monocular vision sensor is set to 20 Hz; the data dimension of each three-channel RGB image is (1226, 370, 3) and that of each optical flow image is (1226, 370, 2); every two consecutive RGB image frames yield one corresponding optical flow image frame.
The specific method for obtaining the plurality of segmented input sequence data in step S2 is as follows:
segmenting the optical flow image sequence with a sliding window of length 9 and step 9 to obtain input sequences of length 9; wherein each input sequence is a four-dimensional tensor of dimension (9, 1226, 370, 2), stacking the three-dimensional optical flow frames along the window length.
The specific method in step S3 is:
inputting the high-dimensional motion features into an artificial neural network consisting of two stacked long short-term memory (LSTM) layers, computed according to the formulas:
i_t = σ(ω_ix x_t + ω_ih h_{t-1} + b_i)
g_t = tanh(ω_gx x_t + ω_gh h_{t-1} + b_g)
f_t = σ(ω_fx x_t + ω_fh h_{t-1} + b_f)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t
o_t = σ(ω_ox x_t + ω_oh h_{t-1} + b_o)
h_t = o_t ⊙ tanh(c_t)
obtaining the local context information of the motion h_t, i.e. the hidden state at time t, and the temporal relation of the motion o_t, i.e. the LSTM output at time t; where i_t is the input gate state of the LSTM at time t, σ(·) is the sigmoid activation function, ω_ix is the weight of the input data, x_t is the input at time t, ω_ih is the corresponding weight of the hidden state, h_{t-1} is the hidden state at time t-1, b_i is the corresponding bias of the input data, g_t is the candidate information of the input data at time t, tanh(·) is the hyperbolic tangent activation function, ω_gx and ω_gh are the candidate-information weights for the input data and the hidden state, b_g is the corresponding bias of the candidate information, f_t is the forget gate state at time t, ω_fx and ω_fh are the forget-gate weights for the input data and the hidden state, b_f is the corresponding bias of the forget gate, c_t and c_{t-1} are the cell states at times t and t-1, o_t is the output gate state at time t, ω_ox and ω_oh are the output-gate weights for the input data and the hidden state, b_o is the corresponding bias of the output gate, and ⊙ denotes the Hadamard (element-wise) product.
The output of the last LSTM layer is the temporal relation of the motion, with dimension (1, 1024); the hidden states of the two LSTM layers store the local context information of the motion, with dimension (2, 1024).
The specific method for obtaining the motion features refined by motion similarity in step S4 is:
according to the formula:
[The first two formulas of this step appear only as images in the source; they define the refined motion feature X′_t and the refined local motion context H′_t from exp(·)-weighted cosine similarities of adjacent motion features.]
X″_t = f_{1×1}([X′_t, H′_t])
obtaining the optimized pose feature X″_t output at time t, i.e. the motion feature refined by motion similarity; where X′_t is the refined motion feature at time t obtained under the guidance of the motion-similarity attention mechanism, exp(·) is the exponential function, S(·) is the cosine similarity function, X_{t-1} is the motion feature extracted by the artificial neural network at time t-1, X_t is the motion feature extracted by the artificial neural network at time t, i.e. the motion correlation feature of the local context information of the motion, W is the vector dimension of the motion feature, H′_t is the refined local motion context at time t obtained under the guidance of the motion-similarity attention mechanism, H_n is the local motion context stored in the hidden state of the last LSTM layer of the artificial neural network, i.e. the motion correlation feature of the motion temporal relation, f_{1×1}(·) is a convolutional layer with kernel size 1×1, and [X′_t, H′_t] denotes the concatenation of the refined motion feature and the refined local motion context.
The pose correction prediction network in step S5 comprises a first long short-term memory network, a second long short-term memory network, a first fully-connected layer, and a second fully-connected layer connected in sequence; the output dimension of both LSTM networks is 1024; the first fully-connected layer has 128 neurons and an activation function; the second fully-connected layer has 6 neurons and no activation function.
The output of the first fully-connected layer of the pose correction prediction network is
F_1 = Relu(x_{1×i} ω_{j×i}^T + b_{1×j})
where Relu(·) is the activation function providing the non-linear mapping, x_{1×i} is the 1×i input data matrix, ω_{j×i} is the trainable weight matrix of the fully-connected layer with dimension j×i, b_{1×j} is the 1×j bias matrix of the fully-connected layer, and T denotes matrix transposition.
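A small NumPy sketch instantiating this formula with illustrative dimensions i = 1024 and j = 128 (the sizes of the first fully-connected layer described above):

```python
# F1 = Relu(x w^T + b): the first fully-connected layer of the pose
# correction prediction network, with assumed dimensions i=1024, j=128.
import numpy as np

i, j = 1024, 128
x = np.random.randn(1, i)          # x_{1 x i}: input row vector
w = np.random.randn(j, i) * 0.01   # w_{j x i}: trainable weight matrix
b = np.zeros((1, j))               # b_{1 x j}: bias row vector
F1 = np.maximum(0.0, x @ w.T + b)  # Relu(x w^T + b), shape (1, j)
```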
Starting from high-dimensional motion features computed by an encoder from the optical flow data of the image sequence, the method mines the temporal relation of continuous motion and its local context information, models the similarity of continuous motion in the sensor data, guides the optimization of the high-dimensional motion features through an attention mechanism, and finally fits the pose change of the camera carrier between adjacent camera sampling points, achieving pose optimization and error correction and ensuring the accuracy of the system.
Without requiring the camera parameters of the visual sensor, scene point landmarks, or scene depth information, the method extracts the temporal relation of continuous motion through an artificial neural network from the encoder's high-dimensional motion features, optimizes the feature data under attention guidance using motion similarity features, and finally obtains accurate and robust pose optimization and error correction; that is, it ensures the robustness of the system and autonomously recovers the absolute scale of the motion trajectory from monocular image data alone.

Claims (7)

1. A monocular vision odometer pose optimization and error correction method based on deep learning, characterized by comprising the following steps:
S1, acquiring image data and calculating a corresponding optical flow image sequence;
S2, segmenting the optical flow image sequence by adopting a fixed-step sliding window to obtain a plurality of segmented input sequences, and obtaining the high-dimensional motion features of each input sequence by utilizing an encoder;
S3, inputting the high-dimensional motion features into an artificial neural network to obtain the temporal relation of the motion and the local context information of the motion;
S4, inputting the result of step S3 into a pose transformation similarity calculation module for motion similarity modeling to obtain the motion correlation features of the motion temporal relation and of the local motion context, and optimizing the pose features by utilizing an attention mechanism based on the motion correlation features to obtain motion features refined by motion similarity;
and S5, inputting the motion features refined by motion similarity into a pose correction prediction network for pose optimization and error correction.
2. The deep learning-based monocular vision odometer pose optimization and error correction method according to claim 1, wherein the specific method of step S1 is:
S1-1, setting the sampling frequency of the monocular vision sensor and sampling to obtain a three-channel color RGB image sequence;
S1-2, calculating the optical flow image sequence of the three-channel RGB image sequence according to the formula Flo_t = F(I_{t-1}, I_t); where Flo_t is the optical flow image at time t, F(·) is the optical flow computation, I_{t-1} is the three-channel RGB image at time t-1, and I_t is the three-channel RGB image at time t.
3. The deep learning-based monocular vision odometer pose optimization and error correction method of claim 2, characterized in that: the sampling frequency of the monocular vision sensor is set to 20 Hz; the data dimension of each three-channel RGB image is (1226, 370, 3) and that of each optical flow image is (1226, 370, 2); every two consecutive RGB image frames yield one corresponding optical flow image frame.
4. The deep learning-based monocular vision odometer pose optimization and error correction method according to claim 1, wherein the specific method for obtaining the plurality of segmented input sequence data in step S2 is as follows:
segmenting the optical flow image sequence with a sliding window of length 9 and step 9 to obtain input sequences of length 9; wherein each input sequence is a four-dimensional tensor of dimension (9, 1226, 370, 2), stacking the three-dimensional optical flow frames along the window length.
5. The deep learning-based monocular vision odometer pose optimization and error correction method according to claim 1, wherein the specific method in step S3 is:
inputting the high-dimensional motion features into an artificial neural network consisting of two stacked long short-term memory (LSTM) layers, computed according to the formulas:
i_t = σ(ω_ix x_t + ω_ih h_{t-1} + b_i)
g_t = tanh(ω_gx x_t + ω_gh h_{t-1} + b_g)
f_t = σ(ω_fx x_t + ω_fh h_{t-1} + b_f)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t
o_t = σ(ω_ox x_t + ω_oh h_{t-1} + b_o)
h_t = o_t ⊙ tanh(c_t)
obtaining the local context information of the motion h_t, i.e. the hidden state at time t, and the temporal relation of the motion o_t, i.e. the LSTM output at time t; where i_t is the input gate state of the LSTM at time t, σ(·) is the sigmoid activation function, ω_ix is the weight of the input data, x_t is the input at time t, ω_ih is the corresponding weight of the hidden state, h_{t-1} is the hidden state at time t-1, b_i is the corresponding bias of the input data, g_t is the candidate information of the input data at time t, tanh(·) is the hyperbolic tangent activation function, ω_gx and ω_gh are the candidate-information weights for the input data and the hidden state, b_g is the corresponding bias of the candidate information, f_t is the forget gate state at time t, ω_fx and ω_fh are the forget-gate weights for the input data and the hidden state, b_f is the corresponding bias of the forget gate, c_t and c_{t-1} are the cell states at times t and t-1, o_t is the output gate state at time t, ω_ox and ω_oh are the output-gate weights for the input data and the hidden state, b_o is the corresponding bias of the output gate, and ⊙ denotes the Hadamard (element-wise) product.
The output of the last LSTM layer is the temporal relation of the motion, with dimension (1, 1024); the hidden states of the two LSTM layers store the local context information of the motion, with dimension (2, 1024).
6. The deep learning-based monocular vision odometer pose optimization and error correction method of claim 1, wherein the specific method for obtaining the motion features refined by motion similarity in step S4 is:
according to the formula:
[The first two formulas of this step appear only as images in the source; they define the refined motion feature X′_t and the refined local motion context H′_t from exp(·)-weighted cosine similarities of adjacent motion features.]
X″_t = f_{1×1}([X′_t, H′_t])
obtaining the optimized pose feature X″_t output at time t, i.e. the motion feature refined by motion similarity; where X′_t is the refined motion feature at time t obtained under the guidance of the motion-similarity attention mechanism, exp(·) is the exponential function, S(·) is the cosine similarity function, X_{t-1} is the motion feature extracted by the artificial neural network at time t-1, X_t is the motion feature extracted by the artificial neural network at time t, i.e. the motion correlation feature of the local context information of the motion, W is the vector dimension of the motion feature, H′_t is the refined local motion context at time t obtained under the guidance of the motion-similarity attention mechanism, H_n is the local motion context stored in the hidden state of the last LSTM layer of the artificial neural network, i.e. the motion correlation feature of the motion temporal relation, f_{1×1}(·) is a convolutional layer with kernel size 1×1, and [X′_t, H′_t] denotes the concatenation of the refined motion feature and the refined local motion context.
7. The deep learning-based monocular vision odometer pose optimization and error correction method of claim 1, wherein the pose correction prediction network in step S5 comprises a first long short-term memory network, a second long short-term memory network, a first fully-connected layer, and a second fully-connected layer connected in sequence; the output dimension of both LSTM networks is 1024; the first fully-connected layer has 128 neurons and an activation function; the second fully-connected layer has 6 neurons and no activation function.
CN202111221271.6A (priority date 2021-10-20; filing date 2021-10-20): Monocular vision odometer pose optimization and error correction method based on deep learning; Active; granted as CN113989318B (en)

Priority Applications (1)

Application Number: CN202111221271.6A (granted as CN113989318B)
Priority date / Filing date: 2021-10-20 / 2021-10-20
Title: Monocular vision odometer pose optimization and error correction method based on deep learning


Publications (2)

Publication Number: Publication Date
CN113989318A (en): 2022-01-28
CN113989318B (en): 2023-04-07

Family

ID=79739627

Family Applications (1)

Application Number: CN202111221271.6A; Priority date / Filing date: 2021-10-20; Status: Active; Title: Monocular vision odometer pose optimization and error correction method based on deep learning

Country Status (1)

Country: CN; CN113989318B (en)


Patent Citations (15)

* Cited by examiner, † Cited by third party
CN106485750A * (priority 2016-09-13, published 2017-03-08): A human pose estimation method based on a supervised local subspace
CN107403430A * (priority 2017-06-15, published 2017-11-28): An RGB-D image semantic segmentation method
CN108830220A * (priority 2018-06-15, published 2018-11-16): Construction of a visual semantic library and a global localization method based on deep learning
US20210042937A1 * (priority 2019-08-08, published 2021-02-11): Self-supervised visual odometry framework using long-term modeling and incremental learning
CN111080699A * (priority 2019-12-11, published 2020-04-28): Monocular visual odometry method and system based on deep learning
CN111127557A * (priority 2019-12-13, published 2020-05-08): Visual SLAM front-end pose estimation method based on deep learning
US10911775B1 * (priority 2020-03-11, published 2021-02-02): System and method for vision-based joint action and pose motion forecasting
CN111623797A * (priority 2020-06-10, published 2020-09-04): Step counting method based on deep learning
CN112115786A * (priority 2020-08-13, published 2020-12-22): Monocular visual odometry method based on an attention U-Net
CN112233179A * (priority 2020-10-20, published 2021-01-15): Visual odometry measurement method
CN112634438A * (priority 2020-12-24, published 2021-04-09): Single-frame depth image three-dimensional model reconstruction method and device based on an adversarial network
CN113065546A * (priority 2021-02-25, published 2021-07-02): Target pose estimation method and system based on an attention mechanism and Hough voting
CN112991447A * (priority 2021-03-16, published 2021-06-18): Visual positioning and static map construction method and system in dynamic environments
CN113159043A * (priority 2021-04-01, published 2021-07-23): Feature point matching method and system based on semantic information
CN113221647A * (priority 2021-04-08, published 2021-08-06): 6D pose estimation method fusing local point cloud features

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
RAN ZHU et al., "DeepAVO: Efficient pose refining with feature distilling for deep Visual Odometry" *
XIANGYU LI et al., "Transformer guided geometry model for flow-based unsupervised visual odometry" *
YULIANG ZOU et al., "Learning monocular visual odometry via self-supervised long-term modeling" *
ZIBIN GUO et al., "LightVO: Lightweight Inertial-Assisted Monocular Visual Odometry with Dense Neural Networks" *
KONG Delei et al., "A survey of event-based vision sensors and their applications" *
LIANG Shuibo et al., "Image local feature detection and description based on attention and multiple feature fusion" *

Also Published As

CN113989318B (en): published 2023-04-07

Similar Documents

Publication: Title
CN112639502B (en) Robot pose estimation
CN110595466B (en) Lightweight inertial-assisted visual odometer implementation method based on deep learning
CN110660083A (en) Multi-target tracking method combined with video scene feature perception
CN112113566B (en) Inertial navigation data correction method based on neural network
CN115285143B (en) Automatic driving vehicle navigation method based on scene classification
US20230243658A1 (en) Systems, Methods and Devices for Map-Based Object's Localization Deep Learning and Object's Motion Trajectories on Geospatial Maps Using Neural Network
CN114719848B (en) Unmanned aerial vehicle height estimation method based on vision and inertial navigation information fusion neural network
CN114612556A (en) Training method of visual inertial odometer model, pose estimation method and pose estimation device
CN113739795A (en) Underwater synchronous positioning and mapping method based on polarized light/inertia/vision combined navigation
Xian et al. A bionic autonomous navigation system by using polarization navigation sensor and stereo camera
CN111739066B (en) Visual positioning method, system and storage medium based on Gaussian process
Azam et al. N 2 C: neural network controller design using behavioral cloning
CN116416277A (en) Multi-target tracking method and device based on motion equation track prediction
CN114067142A (en) Method for realizing scene structure prediction, target detection and lane level positioning
CN117685953A (en) UWB and vision fusion positioning method and system for multi-unmanned aerial vehicle co-positioning
CN114047766B (en) Mobile robot data acquisition system and method for long-term application of indoor and outdoor scenes
CN115147576A (en) Underwater robot docking monocular vision guiding method based on key characteristics
Xu et al. Vision-aided intelligent and adaptive vehicle pose estimation during GNSS outages
CN112945233B (en) Global drift-free autonomous robot simultaneous positioning and map construction method
Guo et al. Model-based deep learning for low-cost IMU dead reckoning of wheeled mobile robot
CN113989318B (en) Monocular vision odometer pose optimization and error correction method based on deep learning
Jo et al. Mixture density-PoseNet and its application to monocular camera-based global localization
CN117830879B (en) Indoor-oriented distributed unmanned aerial vehicle cluster positioning and mapping method
He et al. NINT: Neural Inertial Navigation Based on Time Interval Information in Underwater Environments
CN114894191B (en) Unmanned aerial vehicle navigation method suitable for dynamic complex environment

Legal Events

Code: Description
PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant