CN116681759A - Camera pose estimation method based on self-supervision visual inertial odometer - Google Patents
Camera pose estimation method based on self-supervision visual inertial odometer
- Publication number
- CN116681759A CN116681759A CN202310419746.5A CN202310419746A CN116681759A CN 116681759 A CN116681759 A CN 116681759A CN 202310419746 A CN202310419746 A CN 202310419746A CN 116681759 A CN116681759 A CN 116681759A
- Authority
- CN
- China
- Prior art keywords
- network
- odometer
- self
- visual
- imu
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
- G06T7/75—Determining position or orientation of objects or cameras using feature-based methods involving models
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/50—Depth or shape recovery
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30244—Camera pose
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Navigation (AREA)
Abstract
The application relates to a camera pose estimation method based on a self-supervision visual inertial odometer, which comprises the following steps: acquiring IMU data between every two frames of images; inputting the multi-frame images and IMU data into a network model to obtain pose transformation information and depth information; the network model is constructed based on a visual inertial fusion odometer network, and a self-attention mechanism-based scale recovery module is added in front of an IMU network module of the visual inertial fusion odometer network; the self-attention mechanism scale recovery module is used for estimating scale information. The application can improve the accuracy of the odometer.
Description
Technical Field
The application relates to the technical field of computer vision, in particular to a camera pose estimation method based on a self-supervision visual inertial odometer.
Background
Estimating the position and attitude of the camera at each frame from a sequence of camera images and IMU inertial data is known as visual odometry, a fundamental yet very important task in the fields of computer vision and robotics. For today's intelligent mobile devices, high-precision position and attitude information, also called pose information, is a prerequisite for working safely in different scenes and has enormous application value. Any intelligent device that needs to move, such as a mobile robot, a mobile phone running augmented reality, or an autonomous car, must use an odometry method during operation to obtain its accurate pose in the world coordinate system in order to perform tasks such as navigation.
Deep-learning-based visual-inertial odometry methods fall into supervised and self-supervised categories. The supervised methods combine the strengths of convolutional neural networks and sequence networks: a convolutional neural network extracts visual and inertial features for fusion, a sequence network derives the current pose from previous poses, and training under ground-truth supervision gradually gives the network the ability to estimate pose. However, because supervised methods generally rely on GPS (outdoors), motion capture systems (indoors), and similar equipment to acquire ground truth, they are costly and limited in applicability. Self-supervised methods, which obtain odometry results without ground truth, therefore hold great potential.
A self-supervised visual-inertial odometer mainly consists of a pose estimation network and a depth estimation network, which respectively predict the pose transformation between a target frame and a source frame and the depth of the target frame. Given the estimated depth and pose, the source frame can be warped into the coordinate system of the target frame to obtain a reconstructed image, and the photometric difference between the target frame and the reconstructed image, namely the photometric loss, can supervise the simultaneous training of both networks. As the photometric loss decreases, the depths and poses estimated by the networks become increasingly accurate. However, existing self-supervised visual odometers suffer from two problems: scale ambiguity and the neglect of differences between inertial data modalities.
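The photometric supervision described above can be sketched in a few lines. The snippet below is a minimal illustration only: the image-warping step is omitted and all array shapes are chosen arbitrarily; it shows just how the loss compares the target frame with the reconstructed image.

```python
import numpy as np

def photometric_loss(target, reconstructed):
    # Mean absolute photometric difference between the target frame and
    # the image reconstructed by warping the source frame with the
    # estimated depth and pose (the warping itself is omitted here).
    return float(np.mean(np.abs(target - reconstructed)))

# A perfect reconstruction gives zero loss; reconstruction errors raise it.
target = np.ones((4, 4, 3))
print(photometric_loss(target, target))        # 0.0
print(photometric_loss(target, 0.5 * target))  # 0.5
```

Minimising this quantity jointly trains the depth and pose networks without any ground-truth labels.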
Disclosure of Invention
The application provides a camera pose estimation method based on a self-supervised visual-inertial odometer, which aims to solve the problems of scale ambiguity and the neglect of differences between inertial data modalities.
The technical scheme adopted for solving the technical problems is as follows: the camera pose estimation method based on the self-supervision visual inertial odometer comprises the following steps:
acquiring IMU data between every two frames of images;
inputting the multi-frame images and IMU data into a network model to obtain pose transformation information and depth information;
the network model is constructed based on a visual inertial fusion odometer network, and a self-attention mechanism-based scale recovery module is added in front of an IMU network module of the visual inertial fusion odometer network; the self-attention mechanism scale recovery module is used for estimating scale information.
The self-attention mechanism scale recovery module comprises: an IMU network layer based on a self-attention mechanism, which refines the input IMU data to obtain improved IMU data; an integrator, which integrates the improved IMU data to obtain a pseudo pose supervision signal from time i to time j; and a pose consistency constraint established between the pseudo pose supervision signal and the pose estimated by the pose estimation network layer of the visual inertial fusion odometer network.
The expression of the integrator is:

$$p_j^w = p_i^w + v_i^w\,\Delta t + \frac{1}{2}\left(R_i^w a_i + g^w\right)\Delta t^2,\qquad q_j^w = q_i^w \otimes \begin{bmatrix} 1 \\ \frac{1}{2}\,\omega_i\,\Delta t \end{bmatrix}$$

where $p_i^w$ and $p_j^w$ represent the position of the body relative to the world coordinate system at times i and j respectively, $v_i^w$ the velocity in the world coordinate system at time i, $\Delta t$ the time interval, $R_i^w$ the rotation matrix from the body coordinate system to the world coordinate system at time i, $a_i$ the acceleration in the body coordinate system at time i, $g^w$ the gravity vector in the world coordinate system, $q_i^w$ and $q_j^w$ the quaternions rotating the body coordinate system to the world coordinate system at times i and j respectively, and $\omega_i$ the angular velocity of the body coordinate system at time i.
The pose estimation network layer of the visual inertial fusion odometer network is a decoupling pose estimation network layer, and the decoupling pose estimation network layer is used for separately processing rotation and translation.
The decoupling pose estimation network layer comprises: a translation feature extraction section for extracting translation features of the IMU from the improved IMU data; a rotational feature extraction section for extracting rotational features of the IMU from the angular velocity components of the improved IMU data; a first fusion module for fusing the translation features of the IMU with the visual features extracted by the visual network layer of the visual inertial fusion odometer network to obtain first fusion features; a second fusion module for fusing the rotational features of the IMU with the visual features extracted by the visual network layer of the visual inertial fusion odometer network to obtain second fusion features; and a fully connected layer for processing the first fusion features and the second fusion features respectively to obtain a translation amount and a rotation amount.
Advantageous effects
Due to the adoption of the technical scheme, compared with the prior art, the application has the following advantages and positive effects: according to the application, absolute scale information is recovered based on the self-attention mechanism scale recovery module, and the characteristics of the inertial data are fully utilized through the decoupling pose estimation network to realize the respective processing of the inertial data of different modes, so that the accuracy of the odometer is improved.
Drawings
FIG. 1 is a flow chart of a method for estimating camera pose based on a self-supervising visual odometer according to an embodiment of the application;
FIG. 2 is a schematic diagram of a UnVIO network in an embodiment of the present application;
FIG. 3 is a schematic diagram of a network with the addition of a self-attention mechanism based scale restoration module in an embodiment of the present application;
FIG. 4 is a schematic diagram of an IMU network layer based on a self-attention mechanism in an embodiment of the present application;
FIG. 5 is a schematic diagram of a decoupling pose estimation network layer in an embodiment of the present application;
fig. 6 is a schematic diagram of a visual inertial fusion odometer with a decoupling pose estimation network layer added in an embodiment of the application.
Detailed Description
The application will be further illustrated with reference to specific examples. It is to be understood that these examples are illustrative of the present application and are not intended to limit the scope of the present application. Furthermore, it should be understood that various changes and modifications can be made by one skilled in the art after reading the teachings of the present application, and such equivalents are intended to fall within the scope of the application as defined in the appended claims.
The embodiment of the application relates to a camera pose estimation method based on a self-supervision visual inertial odometer, which is shown in fig. 1 and comprises the following steps of: acquiring IMU data between every two frames of images; and inputting the multi-frame images and the IMU data into a network model to obtain pose transformation information and depth information.
The network model in the present embodiment is constructed based on a visual inertial fusion odometer (UnVIO) network. Since the present embodiment only improves the structure of the UnVIO network and does not involve the loss functions arising from the multi-frame input and sliding window proposed by the UnVIO network, only the structural part of the UnVIO network is described here.
As shown in fig. 2, the UnVIO network includes:
a depth estimation network layer for estimating depth information of the image;
a visual network layer for extracting visual features of the images, which takes adjacent image frames $I_{t-1}$ and $I_t$ of the image sequence superimposed together as input and obtains a 512-dimensional visual feature $F_t^V$ through a FlowNet built from seven convolution layers followed by one global average pooling layer, expressed as $F_t^V = \mathrm{FlowNet}(I_{t-1} \oplus I_t)$, where $\oplus$ denotes superposition along the channel dimension;
an IMU network layer for extracting inertial motion features of the IMU data, which takes a whole IMU segment as input and obtains a 512-dimensional inertial motion feature $F_i^I$ through a feature extraction part consisting of a two-layer LSTM, expressed as $F_i^I, H_i = \mathrm{LSTM}(H_{i-1}, [a_i; \omega_i])$, where $\mathrm{LSTM}$ denotes the recurrent function, $H_i$ the hidden state, and $a$ and $\omega$ the linear acceleration and angular velocity respectively;
a fusion network layer for fusing the visual feature $F_t^V$ extracted by the visual network layer with the inertial motion feature $F_i^I$ extracted by the IMU network layer: the two features are superimposed along the channel dimension to give $F$, a learnable linear layer produces $F' = g \cdot F + b$, and the weights are obtained as $W = \sigma(F_F(F'))$, where $F_F$ is the decoding function of the fusion module and $\sigma$ the sigmoid function; the final fused feature is then $F_{fused} = W \odot F$, where $\odot$ denotes the Hadamard product;
and the pose estimation network layer is used for estimating the motion characteristics according to the fusion characteristics so as to obtain pose information.
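The fusion step can be illustrated with a small NumPy sketch. Randomly initialised matrices stand in for the learnable linear layer and the decoding function $F_F$, and the 512-dimensional feature sizes follow the description above; this is an illustrative stand-in, not the trained UnVIO module.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse(f_visual, f_inertial, W_lin, b_lin, W_dec, b_dec):
    # Concatenate along the channel dimension, pass through a learnable
    # linear layer, decode sigmoid weights W, and re-weight the
    # concatenated feature with a Hadamard product.
    F = np.concatenate([f_visual, f_inertial])  # channel superposition
    F_prime = W_lin @ F + b_lin                 # learnable linear layer F'
    W = sigmoid(W_dec @ F_prime + b_dec)        # per-channel weights in (0, 1)
    return W * F                                # Hadamard product

rng = np.random.default_rng(1)
fv, fi = rng.normal(size=512), rng.normal(size=512)
W_lin, b_lin = rng.normal(size=(1024, 1024)) * 0.01, np.zeros(1024)
W_dec, b_dec = rng.normal(size=(1024, 1024)) * 0.01, np.zeros(1024)
fused = fuse(fv, fi, W_lin, b_lin, W_dec, b_dec)
print(fused.shape)  # (1024,)
```

Because the sigmoid keeps every weight in (0, 1), the fused feature acts as a soft per-channel selection over the concatenated visual and inertial features.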
On the basis of the UnVIO network, the present embodiment addresses the two problems that the self-supervised visual odometer suffers from scale ambiguity and ignores the differences between inertial data modalities.
From the principle of the photometric loss, the estimated pose and depth are not required to carry absolute scale information; if scale is to be recovered, it must be obtained afterwards by alignment with ground truth. For odometry, however, obtaining reliable results online is critical, so scale recovery is important. The UnVIO network introduces IMU data, which carries scale information, but uses it in an unreasonable way, so the method generalizes poorly. To solve this problem, the present embodiment proposes a scale recovery module based on the self-attention mechanism.
When the self-attention-based scale recovery module is applied to the UnVIO network, as shown in fig. 3, it operates on the linear acceleration $\hat a$ measured by the accelerometer of the IMU and the angular velocity $\hat\omega$ measured by the gyroscope. Both measurements contain random-walk biases $\{b_a, b_g\}$ and random noise $\{n_a, n_g\}$, i.e. $\hat a = a + b_a + n_a$ and $\hat\omega = \omega + b_g + n_g$, which is why the raw IMU signal cannot be directly integrated to obtain the pose transformation result.
The self-attention mechanism-based scale recovery module of the present embodiment includes:
an IMU network layer (see fig. 4) based on self-attention mechanisms, for re-modeling incoming IMU data,improved IMU data IMU new 。
An integrator, which integrates the improved IMU data IMU_new to obtain a pseudo pose supervision signal from time i to time j. The expression of the integrator is:

$$p_j^w = p_i^w + v_i^w\,\Delta t + \frac{1}{2}\left(R_i^w a_i + g^w\right)\Delta t^2,\qquad q_j^w = q_i^w \otimes \begin{bmatrix} 1 \\ \frac{1}{2}\,\omega_i\,\Delta t \end{bmatrix}$$

where $p_i^w$ and $p_j^w$ represent the position of the body relative to the world coordinate system at times i and j respectively, $v_i^w$ the velocity in the world coordinate system at time i, $\Delta t$ the time interval, $R_i^w$ the rotation matrix from the body coordinate system to the world coordinate system at time i, $a_i$ the acceleration in the body coordinate system at time i, $g^w$ the gravity vector in the world coordinate system, $q_i^w$ and $q_j^w$ the quaternions rotating the body coordinate system to the world coordinate system at times i and j respectively, and $\omega_i$ the angular velocity of the body coordinate system at time i.
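The integrator can be written directly in code. The sketch below implements one integration step per IMU sample under the stated convention (body-frame acceleration rotated to world, plus world-frame gravity, first-order quaternion update); the step size and gravity value are illustrative.

```python
import numpy as np

def quat_mul(q, r):
    # Hamilton product of two quaternions [w, x, y, z].
    w1, x1, y1, z1 = q
    w2, x2, y2, z2 = r
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2])

def quat_to_rot(q):
    # Rotation matrix of a unit quaternion [w, x, y, z].
    w, x, y, z = q
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)]])

def integrate_step(p, v, q, a_body, w_body, g_w, dt):
    # One pseudo-pose integration step: rotate the body-frame acceleration
    # into the world frame, add gravity, update position and velocity, and
    # apply a first-order quaternion update for the rotation.
    acc_w = quat_to_rot(q) @ a_body + g_w
    p_new = p + v * dt + 0.5 * acc_w * dt * dt
    v_new = v + acc_w * dt
    dq = np.concatenate([[1.0], 0.5 * w_body * dt])
    q_new = quat_mul(q, dq)
    return p_new, v_new, q_new / np.linalg.norm(q_new)

# A stationary IMU measures the specific force -g, so the integrated
# position should stay at the origin.
g_w = np.array([0.0, 0.0, -9.81])
p, v, q = np.zeros(3), np.zeros(3), np.array([1.0, 0.0, 0.0, 0.0])
for _ in range(100):
    p, v, q = integrate_step(p, v, q, -g_w, np.zeros(3), g_w, 0.01)
print(np.allclose(p, 0.0))  # True
```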
In order to constrain the IMU branch to produce more accurate results, a pose consistency constraint is established between the pseudo pose supervision signal and the pose estimated by the pose estimation network layer of the UnVIO network; finally, the pseudo supervision signal and the network estimate are used together to compute the absolute scale of the odometry result and refine the odometry output.
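The pose consistency constraint can be sketched as a simple distance between the two pose estimates. The L1 form and the 6-D [translation; rotation] parameterisation below are assumptions for illustration; the patent does not fix the exact loss.

```python
import numpy as np

def pose_consistency_loss(pose_net, pose_imu):
    # L1 distance between the network-estimated relative pose and the
    # pseudo pose obtained by integrating the refined IMU data; minimising
    # it transfers the IMU's absolute (metric) scale to the network.
    return float(np.mean(np.abs(np.asarray(pose_net) - np.asarray(pose_imu))))

net = np.array([0.10, 0.0, 0.02, 0.01, 0.0, 0.0])  # hypothetical network pose
imu = np.array([0.12, 0.0, 0.02, 0.01, 0.0, 0.0])  # hypothetical IMU pseudo pose
print(round(pose_consistency_loss(net, imu), 6))   # 0.003333
```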
The embodiment also improves the pose estimation network layer of the UnVIO network by adopting a decoupling pose estimation network layer, which is designed to reduce the inaccurate rotation estimation caused by introducing acceleration information into the visual-inertial odometer.
Visual odometry typically regresses a 3-dimensional Euler angle and a 3-dimensional translation vector after matching corresponding points between two adjacent frames. Because the visual odometer usually outputs the result as a single 6-dimensional vector, the two outputs are constrained to have some correlation. For IMU-based pose estimation, however, there is a modal difference between the measured linear acceleration and angular velocity: rotation estimation is essentially unrelated to linear acceleration, and feeding it linear acceleration only degrades the result. To address this problem, the present embodiment designs a decoupling pose estimation network layer that handles rotation and translation separately.
As shown in fig. 5, the decoupling pose estimation network layer comprises: a translation feature extraction section for extracting translation features of the IMU from the improved IMU data; a rotational feature extraction section for extracting rotational features of the IMU from the angular velocity components of the improved IMU data; a first fusion module for fusing the translation features of the IMU with the visual features extracted by the visual network layer of the UnVIO network to obtain first fusion features; a second fusion module for fusing the rotational features of the IMU with the visual features extracted by the visual network layer of the UnVIO network to obtain second fusion features; and a fully connected layer for processing the first and second fusion features respectively to obtain a translation amount and a rotation amount.
As shown in fig. 6, when the decoupling pose estimation network layer is applied to the UnVIO network, the visual features of the two superimposed image frames are first extracted to obtain a 512-dimensional visual feature $F^V$. The improved IMU data IMU_new is then fed into the translation branch, where a two-layer LSTM network extracts a 512-dimensional IMU translation feature $F_t^I$. At the same time, the angular velocity component of IMU_new is fed into a rotation network, also built from LSTMs, to extract the IMU rotation feature $F_r^I$. The fusion module proposed by the UnVIO network is then used to fuse $F_t^I$ and $F_r^I$ with $F^V$ respectively, and a fully connected layer regresses each fused feature into a translation and a rotation, which resolves the problem of differences between the inertial data modalities.
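The decoupled regression can be sketched as two independent heads. Randomly initialised matrices below replace the trained LSTM branches, fusion modules and fully connected layers; the feature sizes follow the description, but everything else is illustrative.

```python
import numpy as np

def decoupled_pose_heads(f_visual, f_trans_imu, f_rot_imu, Wt, Wr):
    # Two independent regression heads: translation is computed from
    # visual + full-IMU features, rotation from visual + gyro-only
    # features, so the accelerometer never pollutes the rotation estimate.
    t = Wt @ np.concatenate([f_visual, f_trans_imu])  # 3-D translation
    r = Wr @ np.concatenate([f_visual, f_rot_imu])    # 3-D Euler rotation
    return t, r

rng = np.random.default_rng(3)
fv = rng.normal(size=512)  # visual feature F^V
ft = rng.normal(size=512)  # IMU translation feature (from accel + gyro)
fr = rng.normal(size=512)  # IMU rotation feature (from gyro only)
Wt, Wr = rng.normal(size=(3, 1024)), rng.normal(size=(3, 1024))
t, r = decoupled_pose_heads(fv, ft, fr, Wt, Wr)
print(t.shape, r.shape)  # (3,) (3,)
```

Keeping the two heads separate is what lets the rotation branch ignore linear acceleration entirely.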
It should be noted that the scale recovery module based on the self-attention mechanism in this embodiment can be used as an add-on module for any self-supervised visual odometer to estimate and recover scale at the same time. The decoupling pose estimation network layer in this embodiment is likewise a plug-and-play module: for any self-supervised visual odometry framework that is to be extended into a visual-inertial one, the pose estimation network layer can simply be replaced with the decoupling pose estimation network layer.
It is easy to see that the application recovers absolute scale information through the scale recovery module based on the self-attention mechanism, and fully exploits the characteristics of the inertial data through the decoupling pose estimation network to process inertial data of different modalities separately, thereby improving the accuracy of the odometer.
Claims (5)
1. The camera pose estimation method based on the self-supervision visual inertial odometer is characterized by comprising the following steps of:
acquiring IMU data between every two frames of images;
inputting the multi-frame images and IMU data into a network model to obtain pose transformation information and depth information;
the network model is constructed based on a visual inertial fusion odometer network, and a self-attention mechanism-based scale recovery module is added in front of an IMU network module of the visual inertial fusion odometer network; the self-attention mechanism scale recovery module is used for estimating scale information.
2. The camera pose estimation method based on the self-supervision visual inertial odometer of claim 1, wherein the self-attention mechanism scale recovery module comprises: an IMU network layer based on a self-attention mechanism, which refines the input IMU data to obtain improved IMU data; an integrator, which integrates the improved IMU data to obtain a pseudo pose supervision signal from time i to time j; and a pose consistency constraint established between the pseudo pose supervision signal and the pose estimated by the pose estimation network layer of the visual inertial fusion odometer network.
3. The camera pose estimation method based on the self-supervision visual inertial odometer of claim 2, wherein the expression of the integrator is:

$$p_j^w = p_i^w + v_i^w\,\Delta t + \frac{1}{2}\left(R_i^w a_i + g^w\right)\Delta t^2,\qquad q_j^w = q_i^w \otimes \begin{bmatrix} 1 \\ \frac{1}{2}\,\omega_i\,\Delta t \end{bmatrix}$$

wherein $p_i^w$ and $p_j^w$ represent the position of the body relative to the world coordinate system at times i and j respectively, $v_i^w$ the velocity in the world coordinate system at time i, $\Delta t$ the time interval, $R_i^w$ the rotation matrix from the body coordinate system to the world coordinate system at time i, $a_i$ the acceleration in the body coordinate system at time i, $g^w$ the gravity vector in the world coordinate system, $q_i^w$ and $q_j^w$ the quaternions rotating the body coordinate system to the world coordinate system at times i and j respectively, and $\omega_i$ the angular velocity of the body coordinate system at time i.
4. The camera pose estimation method based on the self-supervision visual inertial odometer of claim 2, wherein the pose estimation network layer of the visual inertial fusion odometer network is a decoupling pose estimation network layer, the decoupling pose estimation network layer being used for processing rotation and translation separately.
5. The camera pose estimation method based on the self-supervision visual inertial odometer of claim 4, wherein the decoupling pose estimation network layer comprises: a translation feature extraction section for extracting translation features of the IMU from the improved IMU data; a rotational feature extraction section for extracting rotational features of the IMU from the angular velocity components of the improved IMU data; a first fusion module for fusing the translation features of the IMU with the visual features extracted by the visual network layer of the visual inertial fusion odometer network to obtain first fusion features; a second fusion module for fusing the rotational features of the IMU with the visual features extracted by the visual network layer of the visual inertial fusion odometer network to obtain second fusion features; and a fully connected layer for processing the first fusion features and the second fusion features respectively to obtain a translation amount and a rotation amount.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310419746.5A CN116681759B (en) | 2023-04-19 | 2023-04-19 | Camera pose estimation method based on self-supervision visual inertial odometer |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310419746.5A CN116681759B (en) | 2023-04-19 | 2023-04-19 | Camera pose estimation method based on self-supervision visual inertial odometer |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116681759A true CN116681759A (en) | 2023-09-01 |
CN116681759B CN116681759B (en) | 2024-02-23 |
Family
ID=87782572
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310419746.5A Active CN116681759B (en) | 2023-04-19 | 2023-04-19 | Camera pose estimation method based on self-supervision visual inertial odometer |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116681759B (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111325797A (en) * | 2020-03-03 | 2020-06-23 | 华东理工大学 | Pose estimation method based on self-supervision learning |
CN112308918A (en) * | 2020-10-26 | 2021-02-02 | 杭州电子科技大学 | Unsupervised monocular vision odometer method based on pose decoupling estimation |
CN112556692A (en) * | 2020-11-27 | 2021-03-26 | 绍兴市北大信息技术科创中心 | Vision and inertia odometer method and system based on attention mechanism |
US20210118184A1 (en) * | 2019-10-17 | 2021-04-22 | Toyota Research Institute, Inc. | Systems and methods for self-supervised scale-aware training of a model for monocular depth estimation |
CN114396943A (en) * | 2022-01-12 | 2022-04-26 | 国家电网有限公司 | Fusion positioning method and terminal |
CN114526728A (en) * | 2022-01-14 | 2022-05-24 | 浙江大学 | Monocular vision inertial navigation positioning method based on self-supervision deep learning |
CN114612556A (en) * | 2022-03-01 | 2022-06-10 | 北京市商汤科技开发有限公司 | Training method of visual inertial odometer model, pose estimation method and pose estimation device |
CN114719860A (en) * | 2022-04-25 | 2022-07-08 | 天津大学 | Inertial navigation method based on multi-head attention mechanism |
CN115906641A (en) * | 2022-11-26 | 2023-04-04 | 天津(滨海)人工智能创新中心 | IMU gyroscope random error compensation method and device based on deep learning |
Non-Patent Citations (2)
Title |
---|
PENG WEI ET AL.: "Unsupervised Monocular Visual-inertial Odometry Network", 《PROCEEDINGS OF THE TWENTY-NINTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE (IJCAI-20)》, pages 2347 * |
QU, Hao; HU, Xiaoping; CHEN, Changhao; ZHANG, Lilian: "Research on visual/inertial integrated odometry algorithms based on an attention model", Navigation Positioning and Timing, no. 04, pages 48 - 55 *
Also Published As
Publication number | Publication date |
---|---|
CN116681759B (en) | 2024-02-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109307508B (en) | Panoramic inertial navigation SLAM method based on multiple key frames | |
CN110009681B (en) | IMU (inertial measurement unit) assistance-based monocular vision odometer pose processing method | |
Dai et al. | Rgb-d slam in dynamic environments using point correlations | |
CN109029433B (en) | Method for calibrating external parameters and time sequence based on vision and inertial navigation fusion SLAM on mobile platform | |
Li et al. | Real-time motion tracking on a cellphone using inertial sensing and a rolling-shutter camera | |
CN110702107A (en) | Monocular vision inertial combination positioning navigation method | |
CN112649016A (en) | Visual inertial odometer method based on point-line initialization | |
CN110726406A (en) | Improved nonlinear optimization monocular inertial navigation SLAM method | |
WO2020221307A1 (en) | Method and device for tracking moving object | |
CN107767425A (en) | A kind of mobile terminal AR methods based on monocular vio | |
CN112556719B (en) | Visual inertial odometer implementation method based on CNN-EKF | |
CN111623773B (en) | Target positioning method and device based on fisheye vision and inertial measurement | |
CN114001733B (en) | Map-based consistent efficient visual inertial positioning algorithm | |
CN110207693B (en) | Robust stereoscopic vision inertial pre-integration SLAM method | |
CN111609868A (en) | Visual inertial odometer method based on improved optical flow method | |
JP2023021994A (en) | Data processing method and device for automatic driving vehicle, electronic apparatus, storage medium, computer program, and automatic driving vehicle | |
CN114485640A (en) | Monocular vision inertia synchronous positioning and mapping method and system based on point-line characteristics | |
CN112179373A (en) | Measuring method of visual odometer and visual odometer | |
CN112945233B (en) | Global drift-free autonomous robot simultaneous positioning and map construction method | |
Tian et al. | A case study on visual-inertial odometry using supervised, semi-supervised and unsupervised learning methods | |
CN113570716A (en) | Cloud three-dimensional map construction method, system and equipment | |
CN112731503A (en) | Pose estimation method and system based on front-end tight coupling | |
CN112348854A (en) | Visual inertial mileage detection method based on deep learning | |
CN116681759B (en) | Camera pose estimation method based on self-supervision visual inertial odometer | |
CN114440877B (en) | Asynchronous multi-camera visual inertial odometer positioning method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |