CN116824433A - Visual-inertial navigation-radar fusion self-positioning method based on self-supervision neural network - Google Patents


Info

Publication number
CN116824433A
Authority
CN
China
Prior art keywords: self, features, network, radar, fusion
Prior art date
Legal status
Pending
Application number
CN202310495293.4A
Other languages
Chinese (zh)
Inventor
韩松芮
刘华巍
童官军
宋尧哲
Current Assignee
Shanghai Institute of Microsystem and Information Technology of CAS
Original Assignee
Shanghai Institute of Microsystem and Information Technology of CAS
Priority date
Filing date
Publication date
Application filed by Shanghai Institute of Microsystem and Information Technology of CAS filed Critical Shanghai Institute of Microsystem and Information Technology of CAS
Priority to CN202310495293.4A
Publication of CN116824433A

Landscapes

  • Image Analysis (AREA)

Abstract

The application relates to a vision-inertial navigation-radar fusion self-positioning method based on a self-supervision neural network, which comprises the following steps: acquiring a video frame sequence, inertial navigation data and a laser radar point cloud image; and inputting the video frame sequence, the inertial navigation data and the laser radar point cloud image into a pose estimation network model to obtain the relative pose estimate of the input video frames. The pose estimation network model comprises: a depth prediction network, which obtains a depth map from the input video frame sequence; a feature extraction network, which extracts feature information from the input video frame sequence, the inertial navigation data and the laser radar point cloud image respectively to obtain visual features, momentum features and radar features; a feature fusion network, which fuses the visual features with the radar features to obtain pre-fusion features, and fuses the pre-fusion features with the momentum features to obtain fusion features; and a pose estimation network, which predicts a pose transformation matrix from the fusion features. The application improves the positioning accuracy of the self-supervised deep positioning algorithm.

Description

Visual-inertial navigation-radar fusion self-positioning method based on self-supervision neural network
Technical Field
The application relates to the technical field of self-positioning, in particular to a vision-inertial navigation-radar fusion self-positioning method based on a self-supervision neural network.
Background
Self-positioning technology is widely used in fields such as automatic driving and SLAM. In some scenes (among city buildings, in tunnels, in dense mountain forests, in caves, or in other satellite-denied environments), it is difficult to rely on GPS, BeiDou or other radio-wave positioning technologies, and the camera carried by the automobile, unmanned aerial vehicle or robot must be used for autonomous positioning. The basic flow of a traditional self-positioning algorithm based on a vision sensor is: feature extraction, feature matching, pose solving and back-end optimization. A traditional self-positioning algorithm can naturally obtain high positioning accuracy by building and optimizing a complete global map, but its computation is complex and slow, and real-time inference is difficult to achieve. With the popularity and continuous development of deep learning, a neural-network-based visual self-positioning algorithm, i.e. deep visual odometry (VO), can train a model offline on a dataset in advance and use the trained model directly for inference; such an algorithm needs no feature matching or back-end optimization steps, so real-time inference is easy to achieve. Deep VO improves positioning performance by optimizing an objective loss function through supervised or unsupervised training. Unsupervised (or self-supervised) learning can train the model without acquiring ground-truth pose labels, so a large amount of unlabeled data can be exploited, and the training cost is lower than in the supervised setting.
Self-supervised VO was first proposed by Tinghui Zhou et al. in 2017. Self-supervised VO first feeds the input consecutive frame images into a neural network for pose estimation and a neural network for depth estimation to compute the depth maps of the input images and the pose transformation between the images; it then uses the depth maps and the poses, through the projection relation established by epipolar geometry, to compute the reprojection error between consecutive frames; finally it computes the gradients of this error with respect to the pose estimation and depth estimation networks and back-propagates them to update the parameters, thereby optimizing the estimates.
However, positioning from single-modality data alone may suffer from problems such as data loss and insufficient information. In the present big-data era, more and more sensors are being invented and more and more data types can be acquired, and multi-modal sensor fusion is a main trend in the development of future positioning technology. Compared with single-modality sensor positioning technology, multi-modal sensor fusion positioning technology can combine the advantages of all modalities for information fusion and information compensation, thereby achieving higher-accuracy positioning. The video frame signal provides the most intuitive RGB temporal information, which is important for a neural network to learn feature detection; however, dynamic objects may exist in the video frames and cause the motion speed to be misjudged, and a monocular video frame signal lacks depth information. The inertial navigation signal provides direct information on the carrier's acceleration and velocity, but it carries various parameter biases and accumulates errors, which is unfavourable for long-term operation. The lidar signal provides depth-map information but lacks colour visual features, so it can be a good information supplement to the monocular video frame signal that lacks depth information. A binocular video signal also provides depth information, but it requires a left-right matching algorithm with high computational complexity, and binocular cameras are very sensitive to illumination changes and texture details: inconsistent left-right illumination or an overly monotonous texture scene may cause matching to fail. Various neural-network-based bimodal sensor fusion positioning algorithms already exist, for example deep visual-inertial odometry (VIO), deep lidar-visual odometry (VLO) and deep lidar-inertial odometry (LIO); being based on bimodal sensor fusion, they all more or less lack the input of some important information, and research on fusion positioning with three or more sensor modalities is still at the stage of traditional algorithms.
Disclosure of Invention
The application provides a vision-inertial navigation-radar fusion self-positioning method based on a self-supervision neural network, which solves the problem of insufficient information when self-positioning relies on a single modality or on two modalities.
The technical scheme adopted for solving the technical problems is as follows: the visual-inertial navigation-radar fusion self-positioning method based on the self-supervision neural network comprises the following steps:
acquiring a video frame sequence, inertial navigation data and a laser radar point cloud image;
inputting the video frame sequence, the inertial navigation data and the laser radar point cloud image into a pose estimation network model to obtain relative pose estimation of an input video frame;
wherein, the pose estimation network model comprises:
the depth prediction network is used for obtaining a depth map according to the input video frame sequence;
the feature extraction network is used for extracting feature information from an input video frame sequence, inertial navigation data and a laser radar point cloud image respectively to obtain visual features, momentum features and radar features;
the feature fusion network is used for fusing the visual features and the radar features to obtain pre-fused corrected visual features and corrected radar features, and fusing the pre-fused corrected visual features and corrected radar features with the momentum features to obtain fusion features;
the pose estimation network is used for predicting a pose transformation matrix according to the fusion characteristics;
and the parameter optimization module is used for calculating a loss function according to the depth map, the pose transformation matrix and the video frame, and adjusting parameters of the pose estimation network model according to the loss function.
The feature extraction network includes:
a first feature extraction section that extracts visual features from the video frame sequence using a first convolutional network;
a second feature extraction part for extracting the momentum features from the inertial navigation data by adopting an LSTM network;
the third feature extraction part projects the laser radar point cloud image to a 2D plane, codes the laser radar point cloud image projected to the 2D plane in a three-channel coding mode, and extracts radar features by adopting a second convolution network;
the first convolution network and the second convolution network have the same structure and share the weights of all network layers except the BN layer.
The feature fusion network comprises:
the first fusion part is used for fusing the visual characteristics and the radar characteristics by adopting a channel exchange strategy to obtain pre-fused corrected visual characteristics and corrected radar characteristics;
and the second fusion part is used for carrying out channel splicing of the pre-fused corrected visual features and corrected radar features with the momentum features to obtain fusion features.
The channel exchange strategy is:
$$V'_{k,c}=\begin{cases}a_{v,k,c}\dfrac{V_{k,c}-\mu_{v,k,c}}{\sigma_{v,k,c}}+b_{v,k,c}, & a_{v,k,c}>\delta\\[4pt] a_{l,k,c}\dfrac{L_{k,c}-\mu_{l,k,c}}{\sigma_{l,k,c}}+b_{l,k,c}, & a_{v,k,c}\le\delta\end{cases}$$
where $V'_{k,c}$ denotes the visual feature of the c-th channel output by the k-th convolution layer after the exchange strategy, $V_{k,c}$ denotes the visual feature of the c-th channel output by the k-th convolution layer, $L_{k,c}$ denotes the radar feature of the c-th channel output by the k-th convolution layer, $a_{v,k,c}$, $b_{v,k,c}$, $\sigma_{v,k,c}$ and $\mu_{v,k,c}$ respectively denote the slope, bias, standard deviation and mean of the BN layer in the first convolution network, $a_{l,k,c}$, $b_{l,k,c}$, $\sigma_{l,k,c}$ and $\mu_{l,k,c}$ respectively denote the slope, bias, standard deviation and mean of the BN layer in the second convolution network, and $\delta$ is a threshold.
The loss function includes a reconstruction error, a depth smoothing loss and a geometric consistency loss, and is expressed as:
$$L_{all}=\sum_{l}\left(\omega_1 L_{pe}^{l}+\omega_2 L_{smooth}^{l}+\omega_3 L_{geo}^{l}\right)$$
where $L_{all}$ is the loss function, $L_{pe}$ denotes the reconstruction error, $L_{smooth}$ denotes the depth smoothing loss, $L_{geo}$ denotes the geometric consistency loss, $l$ denotes the scale index, and $\omega_1$, $\omega_2$, $\omega_3$ denote the weights of the reconstruction error, the depth smoothing loss and the geometric consistency loss respectively.
The expression of the reconstruction error is:
$$L_{pe}=\sum_{p_s\in I_s}\Big(\lambda_1\big|I_t(p'_s)-I_s(p_s)\big|+\lambda_2\,\mathrm{SSIM}\big(I_t(p'_s),I_s(p_s)\big)\Big)$$
where $I_s$ is the source image, $p_s$ is a point on the source image $I_s$, $I_t$ is the target image, $p'_s$ is the point on the target image $I_t$ corresponding to the point $p_s$ on the source image $I_s$, $\mathrm{SSIM}(\cdot)$ is the structural similarity function, and $\lambda_1$ and $\lambda_2$ are weight coefficients.
The expression of the reconstruction error is: $L_{pe}=\lambda_1\left|I'_s-I_s\right|+\lambda_2\,\mathrm{SSIM}(I'_s,I_s)$, where $I_s$ is the source image, $I'_s$ is the source image reconstructed from the target image $I_t$, $\mathrm{SSIM}(\cdot)$ is the structural similarity function, and $\lambda_1$ and $\lambda_2$ are weight coefficients.
The expression of the depth smoothing loss is:
$$L_{smooth}=\sum_{p}\left(\left|\partial_x D_t(p)\right|e^{-\left|\partial_x I_t(p)\right|}+\left|\partial_y D_t(p)\right|e^{-\left|\partial_y I_t(p)\right|}\right)$$
where $D_t$ is the depth map corresponding to the image at time $t$ in the video frame sequence $S$, $I_t$ is the target image, and $\partial_x$ and $\partial_y$ denote the partial derivatives along the x and y directions of the two-dimensional image coordinates respectively.
The expression of the geometric consistency loss is:
$$L_{geo}=\sum_{p}\frac{\left|D'_t(p)-D_t(p)\right|}{D'_t(p)+D_t(p)}$$
where $D_t$ is the depth map corresponding to the image at time $t$ in the video frame sequence $S$, and $D'_t$ denotes the depth map at the current time $t$ generated by reconstruction through the pose transformation using the depth map of the subsequent time.
Advantageous effects
Owing to the adoption of the above technical solution, compared with the prior art, the application has the following advantages and positive effects: by designing a feature extraction network for each modality and a multi-modal data fusion network, and by using a similar-modality-first fusion strategy together with a channel exchange strategy, full interaction and mutual completion of the multi-modal information are achieved, and the positioning accuracy of the self-supervised deep positioning algorithm is improved.
Drawings
FIG. 1 is a flow chart of a vision-inertial navigation-radar fusion self-positioning method based on a self-supervising neural network in accordance with an embodiment of the present application;
FIG. 2 is a framework diagram of a pose estimation network model in an embodiment of the application;
FIG. 3 is a schematic diagram of feature fusion in an embodiment of the application;
fig. 4 is a graph comparing the path trace and the true path of the test output on the KITTI data sets seq09 and seq10 according to an embodiment of the application.
Detailed Description
The application will be further illustrated with reference to specific examples. It is to be understood that these examples are illustrative of the present application and are not intended to limit the scope of the present application. Furthermore, it should be understood that various changes and modifications can be made by one skilled in the art after reading the teachings of the present application, and such equivalents are intended to fall within the scope of the application as defined in the appended claims.
The embodiment of the application relates to a vision-inertial navigation-radar fusion self-positioning method based on a self-supervision neural network, which is shown in fig. 1 and comprises the following steps:
step 1, acquiring a video frame sequence, inertial navigation data and a laser radar point cloud image;
and step 2, inputting the video frame sequence, the inertial navigation data and the laser radar point cloud image into a pose estimation network model to obtain the relative pose estimation of the input video frame.
The pose estimation network model is shown in fig. 2; it takes the video frame sequence, the inertial navigation data and the laser radar point cloud image as input and outputs the relative pose estimate at each time step of the input sequence. The input video frame sequence is handled by three parallel processing branches: in the first branch, a depth map D is obtained through the depth prediction network DepthNet; in the second branch, the video frames together with the inertial navigation data and the laser radar point cloud image are fed into the multi-modal feature extraction network, and the relative pose estimate at the corresponding time (pose transformation matrix T) is then obtained through the pose estimation network PoseNet with a feature fusion function; in the third branch, the loss function is calculated by combining the depth map D and the pose transformation T. The pose estimation network model computes a self-supervised loss function from the video frames, the depth map and the pose, and optimizes the network parameters through gradient-descent back-propagation. Training is iterated multiple times until the model converges.
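As an illustration of one training iteration, the following sketch assumes a PyTorch-style implementation; the module interfaces, tensor shapes and the 6-DoF pose parameterization are assumptions made for this example and are not fixed by the application.

```python
# Minimal sketch of one training iteration of the pose estimation network model.
# The interfaces of depth_net, feat_net, pose_net and loss_fn are illustrative.
def training_step(depth_net, feat_net, pose_net, loss_fn, optimizer,
                  frames, imu, lidar_maps):
    # frames:     (B, 2, 3, H, W) pair of consecutive video frames
    # imu:        (B, T, 6)       inertial samples between the two frames
    # lidar_maps: (B, 2, 3, H, W) pair of projected point-cloud maps
    depth = depth_net(frames)                    # branch 1: depth map D
    v_feat, i_feat, l_feat = feat_net(frames, imu, lidar_maps)
    pose = pose_net(v_feat, l_feat, i_feat)      # branch 2: pose transformation T
    loss = loss_fn(frames, depth, pose)          # branch 3: self-supervised loss
    optimizer.zero_grad()
    loss.backward()                              # gradient-descent back-propagation
    optimizer.step()
    return loss.item()
```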
The pose estimation network model in the present embodiment includes: the system comprises a depth prediction network, a feature extraction network, a feature fusion network, a pose estimation network and a parameter optimization module.
The feature extraction network is used for extracting the features of each mode.
For the video frame sequence from a monocular camera, the visual features are extracted with a seven-layer convolutional network (ConvNet), where each convolution layer is followed by a batch normalization layer (BN layer) and a nonlinear activation layer (ReLU layer). Each video frame $F_t$ of the input sequence is concatenated with its next frame $F_{t+1}$ along the channel dimension and fed into the network to extract spatial and temporal information, yielding the visual feature $V_t$:
$$V_t=\mathrm{ConvNet}(F_t,F_{t+1})$$
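A sketch of such an encoder is given below, assuming PyTorch; the channel widths, kernel sizes and strides are illustrative assumptions, and only the seven Conv-BN-ReLU stages and the 6-channel paired-frame input follow the description above.

```python
import torch
import torch.nn as nn

def make_convnet(in_ch: int = 6, widths=(16, 32, 64, 128, 256, 256, 256)) -> nn.Sequential:
    """Seven Conv -> BN -> ReLU stages; widths and strides are illustrative."""
    layers, c_in = [], in_ch
    for c_out in widths:
        layers += [nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
                   nn.BatchNorm2d(c_out),
                   nn.ReLU(inplace=True)]
        c_in = c_out
    return nn.Sequential(*layers)

frames = torch.randn(1, 2, 3, 128, 416)   # F_t and F_{t+1}
x = frames.flatten(1, 2)                  # channel concatenation -> (1, 6, H, W)
visual_feat = make_convnet()(x)           # V_t
```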
For the IMU signal $X_t$ from the inertial navigation unit, the temporal information is extracted with an LSTM network, and the value of the last hidden layer of the LSTM network is taken as the output momentum feature $I_t$:
$$I_t=\mathrm{LSTM}(X_t)$$
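A sketch of this inertial branch is shown below, assuming PyTorch; the 6-dimensional input (3-axis accelerometer plus 3-axis gyroscope), the hidden size and the number of layers are assumptions.

```python
import torch
import torch.nn as nn

imu_lstm = nn.LSTM(input_size=6, hidden_size=256, num_layers=2, batch_first=True)
x_t = torch.randn(1, 10, 6)       # IMU samples X_t between two consecutive frames
_, (h_n, _) = imu_lstm(x_t)       # h_n: (num_layers, B, hidden)
momentum_feat = h_n[-1]           # last hidden layer used as I_t, shape (1, 256)
```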
For the point cloud signal from the lidar, whose data format is (x, y, z, i), where x, y, z are the three-dimensional spatial coordinates and i is the intensity (related to the surface illuminance and reflectivity of the object), the sparse, unordered point cloud is projected onto a 2D plane using a cylindrical coordinate projection:
$$\alpha=\arctan\!\left(\frac{z}{\sqrt{x^2+y^2}}\right),\qquad \beta=\arctan\!\left(\frac{y}{x}\right)$$
where $\alpha$ and $\beta$ denote the pitch and yaw angles of the cylindrical coordinates, and serve respectively as the vertical and horizontal coordinates of the planar projection map. When several 3D points are mapped to the same coordinates, only the 3D point closest to the coordinate origin (the lidar sensor centre) is kept as the corresponding projection point. The projected 2D map is encoded with a three-channel encoding, i.e. a three-channel pixel value at each coordinate of the projection map; if no 3D point corresponds to a given coordinate, the channel values at that coordinate are set to the default value 0. The projected radar map $M_t$ is then processed, exactly as the video signal, by a ConvNet whose weights are shared with the visual network for all layers except the BN layers, to extract the radar feature $L_t$:
$$L_t=\mathrm{ConvNet}(M_t,M_{t+1})$$
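The projection step can be sketched as follows; the map resolution, the normalization of the angular ranges and the choice of the three encoded channels (here range, height and intensity) are assumptions for illustration.

```python
import numpy as np

def project_point_cloud(points: np.ndarray, H: int = 64, W: int = 1024) -> np.ndarray:
    """points: (N, 4) array of (x, y, z, i); returns an (H, W, 3) projection map."""
    x, y, z, intensity = points.T
    r = np.sqrt(x**2 + y**2 + z**2)
    alpha = np.arctan2(z, np.sqrt(x**2 + y**2))              # pitch -> row
    beta = np.arctan2(y, x)                                  # yaw   -> column
    rows = ((alpha - alpha.min()) / (np.ptp(alpha) + 1e-9) * (H - 1)).astype(int)
    cols = ((beta + np.pi) / (2 * np.pi) * (W - 1)).astype(int)
    img = np.zeros((H, W, 3), dtype=np.float32)              # empty cells default to 0
    best = np.full((H, W), np.inf)
    for u, v, d, zz, it in zip(rows, cols, r, z, intensity):
        if d < best[u, v]:                                   # keep the point closest to the sensor
            best[u, v] = d
            img[u, v] = (d, zz, it)                          # assumed channels: range, height, intensity
    return img
```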
the common multi-mode feature fusion method based on the neural network is that features of each mode data are extracted by using different feature encoders and aligned according to a certain dimension, and then the mode features are spliced according to the dimension and then sent to a fusion network for processing. This is most common in depth odometers based on bimodal fusion. However, in the tri-modal fusion, the method of directly using tri-modal data channel stitching needs to learn, for the model, when a certain modality should be fused with which modality (or fused with two other modalities simultaneously) (i.e. the fusion sequence) in addition to how the certain modality is fused with two other modalities (i.e. the fusion mode), so that training of the model becomes more difficult and slow.
The feature fusion network in this embodiment therefore follows a fixed fusion order: first fuse the simpler, visually similar modalities, then fuse the more complex, abstract, dissimilar modality. The radar point cloud data can be converted into a 2D image representation by projection, after which it can be fused with the features of the similar-modality video frame sequence; the fusion result is then further fused with the features of the IMU signal, whose data format differs much more.
When fusing the video features and the radar features, the data of the two similar modalities are first fed into feature extraction networks (ConvNet) whose weights are shared except for the BN layers, and a channel exchange strategy is adopted:
$$V'_{k,c}=\begin{cases}a_{v,k,c}\dfrac{V_{k,c}-\mu_{v,k,c}}{\sigma_{v,k,c}}+b_{v,k,c}, & a_{v,k,c}>\delta\\[4pt] a_{l,k,c}\dfrac{L_{k,c}-\mu_{l,k,c}}{\sigma_{l,k,c}}+b_{l,k,c}, & a_{v,k,c}\le\delta\end{cases}$$
The two rows above give the expressions of the visual feature $V_{k,c}$ and the radar feature $L_{k,c}$ of the c-th channel output by the k-th convolution layer after passing through the BN layer ($a_{v,k,c}$, $b_{v,k,c}$, $\sigma_{v,k,c}$ and $\mu_{v,k,c}$ respectively denote the slope, bias, standard deviation and mean of the BN layer in the first convolution network; $a_{l,k,c}$, $b_{l,k,c}$, $\sigma_{l,k,c}$ and $\mu_{l,k,c}$ respectively denote the slope, bias, standard deviation and mean of the BN layer in the second convolution network). The channel exchange strategy judges the importance and completeness of the information in each channel of the previous layer's output feature map from the slope a in the BN layer weights. An excessively small slope value (the threshold is set to $\delta$ here) means that, when the gradients are back-propagated to update the parameters, the update weight of the model at that channel is small, i.e. the information of that channel is incomplete or unimportant; therefore, at the next iteration, the component of the feature map output by the convolution layer at that channel is replaced by the channel component at the corresponding position of the feature map of the other similar modality (the case in the second row of the formula above corresponds to a visual feature channel being replaced by the radar feature channel), so as to supplement the missing information. Because the two similar modalities pass through some identical (weight-shared) network layers, the feature components at corresponding channels of their output feature maps should carry similar semantic information (illumination, depth, etc.), so the information of the two modalities can supplement and complete each other.
Similar-modality fusion thus yields the pre-fused corrected visual feature $V'_t$ and corrected radar feature $L'_t$; these are then concatenated along the channel dimension with the momentum feature $I_t$ output by the IMU feature extraction network, producing the fused feature.
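A minimal sketch of the BN-slope-based channel exchange is shown below, assuming PyTorch and that the two branches expose their BatchNorm2d layers; the threshold value, the use of the slope magnitude and the symmetric exchange in both directions are assumptions for illustration.

```python
import torch
import torch.nn as nn

def channel_exchange(feat_v: torch.Tensor, feat_l: torch.Tensor,
                     bn_v: nn.BatchNorm2d, bn_l: nn.BatchNorm2d,
                     delta: float = 1e-2):
    """feat_v, feat_l: (B, C, H, W) pre-BN features of the visual and radar branches."""
    v, l = bn_v(feat_v), bn_l(feat_l)
    # channels whose BN slope magnitude falls below delta carry little information
    swap_v = (bn_v.weight.abs() < delta).view(1, -1, 1, 1)
    swap_l = (bn_l.weight.abs() < delta).view(1, -1, 1, 1)
    v_out = torch.where(swap_v, l, v)   # weak visual channels take the radar channel
    l_out = torch.where(swap_l, v, l)   # and vice versa (assumed symmetric)
    return v_out, l_out                 # exchanged features of this layer
```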
The pose estimation network in this embodiment is a PoseNet composed of an SE module and an LSTM; it takes the fused feature as input and outputs the predicted pose transformation matrix $T_t$:
$$T_t=\mathrm{PoseNet}(V'_t,L'_t,I_t)$$
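A sketch of such a pose head is given below, assuming PyTorch; the feature dimension, the squeeze-and-excitation design and the regression of a 6-dimensional pose vector (from which the transformation matrix $T_t$ can be composed) are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation style channel re-weighting over the last dimension."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
                                nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):               # x: (..., C)
        return x * self.fc(x)

class PoseNet(nn.Module):
    def __init__(self, in_dim: int = 768, hidden: int = 256):
        super().__init__()
        self.se = SEBlock(in_dim)
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 6)          # 3 rotation + 3 translation parameters

    def forward(self, fused):                     # fused: (B, T, in_dim) feature sequence
        out, _ = self.lstm(self.se(fused))
        return self.head(out[:, -1])              # relative pose for the current step

pose = PoseNet()(torch.randn(2, 5, 768))          # -> (2, 6)
```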
When predicting, the depth prediction network in this embodiment is given a video frame image $I_t$; this frame and its next frame image $I_s$ are concatenated and input into the depth prediction network (DepthNet), which computes the depth map $D_t$ of $I_t$ by comparing the two consecutive frames:
$$D_t=\mathrm{DepthNet}(I_t,I_s)$$
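The application does not fix the internal architecture of DepthNet; the compact encoder-decoder below is therefore only an assumed sketch (PyTorch) showing the 6-channel paired-frame input and the single-channel depth output scaled to an assumed depth range.

```python
import torch
import torch.nn as nn

class TinyDepthNet(nn.Module):
    def __init__(self, min_d: float = 0.1, max_d: float = 100.0):
        super().__init__()
        self.min_d, self.max_d = min_d, max_d
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1), nn.Sigmoid())

    def forward(self, i_t, i_s):
        x = torch.cat([i_t, i_s], dim=1)                    # compare the two frames
        d = self.decoder(self.encoder(x))                   # normalised output in (0, 1)
        return self.min_d + (self.max_d - self.min_d) * d   # D_t within [min_d, max_d]

d_t = TinyDepthNet()(torch.randn(1, 3, 128, 416), torch.randn(1, 3, 128, 416))
```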
The parameter optimization module in this embodiment calculates a loss function according to the depth map, the pose transformation matrix and the video frames, and adjusts the parameters of the pose estimation network model according to the loss function. The loss function includes a reconstruction error, a depth smoothing loss and a geometric consistency loss.
When calculating the reconstruction error, given two consecutive frames $I_t$ (target image) and $I_s$ (source image), their depth maps $D_t$ and $D_s$ obtained through the depth prediction network (DepthNet), and the pose transformation $T_{t\to s}$ between them obtained through the pose estimation network (PoseNet), for any point $p_s\in I_s$ the pixel-camera coordinate transformation formula gives its position $p'_s$ on $I_t$:
$$p'_s \sim K\,T_{t\to s}\,D_s(p_s)\,K^{-1}p_s$$
where $\sim$ indicates that the point $p'_s$ on $I_t$ and the point $p_s$ on $I_s$ correspond to the same point in the real world, and $K$ is the camera intrinsic matrix. In the same way, all points of $I_s$ are projected onto $I_t$ (points falling outside the boundary after projection are ignored), giving the position on $I_t$ of every point of $I_s$; the source image $I'_s$ can then be reconstructed from the RGB values of the pixels at the corresponding positions of $I_t$, and the reconstructed $I'_s$ is compared with $I_s$ to compute the reconstruction error $L_{pe}$:
$$L_{pe}=\sum_{p_s\in I_s}\Big(\lambda_1\big|I_t(p'_s)-I_s(p_s)\big|+\lambda_2\,\mathrm{SSIM}\big(I_t(p'_s),I_s(p_s)\big)\Big)\quad\text{or}\quad L_{pe}=\lambda_1\left|I'_s-I_s\right|+\lambda_2\,\mathrm{SSIM}(I'_s,I_s)$$
where $\mathrm{SSIM}(\cdot)$ is the structural similarity function, $\lambda_1$ and $\lambda_2$ are weight coefficients, and $I'_s$ satisfies:
$$I'_s(p_s)=I_t(p'_s)$$
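The warping step above can be sketched as follows, assuming PyTorch; representing the relative pose as a 4x4 matrix mapping source-camera coordinates into the target camera frame, and the use of bilinear grid sampling, are implementation assumptions.

```python
import torch
import torch.nn.functional as F

def warp_reconstruct(I_t, D_s, K, T, eps=1e-7):
    """I_t: (B,3,H,W) target image, D_s: (B,1,H,W) source depth,
    K: (B,3,3) intrinsics, T: (B,4,4) relative pose (source camera -> target camera)."""
    B, _, H, W = I_t.shape
    dev = I_t.device
    ys, xs = torch.meshgrid(torch.arange(H, device=dev),
                            torch.arange(W, device=dev), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).float().view(1, 3, -1)  # homogeneous p_s
    cam = torch.inverse(K) @ pix * D_s.view(B, 1, -1)        # back-project with D_s(p_s)
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W, device=dev)], dim=1)
    proj = K @ (T @ cam_h)[:, :3]                            # transform and project -> p'_s
    uv = proj[:, :2] / (proj[:, 2:3] + eps)
    grid = torch.stack([2 * uv[:, 0] / (W - 1) - 1,          # normalise for grid_sample
                        2 * uv[:, 1] / (H - 1) - 1], dim=-1).view(B, H, W, 2)
    return F.grid_sample(I_t, grid, align_corners=True)      # reconstructed I'_s
```

The reconstruction error can then be evaluated between the returned $I'_s$ and the original $I_s$ with the L1 and SSIM terms given above.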
in order to make the depth map output by the depth prediction network DepthNet have smoothness and scale consistency, a depth smoothing loss L needs to be adopted smooth And geometric consistency loss L geo Additional constraints are placed on DepthNet:
wherein D is t For the depth map corresponding to the image at time t in the sequence of video frames S,and->Respectively representing partial derivatives of the two-dimensional image coordinates in the x-direction and the y-direction, D' t And representing the depth map of the current t moment which is generated by reconstructing the pose transformation by utilizing the depth map of the subsequent moment.
To reduce invalid point matches during projection, the final loss function $L_{all}$ is expressed as:
$$L_{all}=\sum_{l}\left(\omega_1 L_{pe}^{l}+\omega_2 L_{smooth}^{l}+\omega_3 L_{geo}^{l}\right)$$
where $l$ denotes the scale index and $\omega_1$, $\omega_2$, $\omega_3$ denote the weights of the reconstruction error, the depth smoothing loss and the geometric consistency loss respectively. The gradients of $L_{all}$ with respect to all network parameters are computed and back-propagated to update the network parameters, and this is iterated multiple times until the model converges.
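A sketch of the three terms and their weighted combination per scale is given below, assuming PyTorch; the finite-difference form of the derivatives, the edge-aware exponential weighting and the example weight values are assumptions.

```python
import torch

def smoothness_loss(depth, image):
    """Edge-aware smoothness: depth (B,1,H,W), image (B,3,H,W)."""
    dDx = (depth[:, :, :, 1:] - depth[:, :, :, :-1]).abs()
    dDy = (depth[:, :, 1:, :] - depth[:, :, :-1, :]).abs()
    dIx = (image[:, :, :, 1:] - image[:, :, :, :-1]).abs().mean(1, keepdim=True)
    dIy = (image[:, :, 1:, :] - image[:, :, :-1, :]).abs().mean(1, keepdim=True)
    return (dDx * torch.exp(-dIx)).sum() + (dDy * torch.exp(-dIy)).sum()

def geometric_consistency_loss(D_t, D_t_warped):
    """Normalised difference between D_t and the depth reconstructed from the next frame."""
    return ((D_t - D_t_warped).abs() / (D_t + D_t_warped)).sum()

def total_loss(per_scale_terms, w1=1.0, w2=0.1, w3=0.5):
    # per_scale_terms: list of (L_pe, L_smooth, L_geo) tuples, one tuple per scale l
    return sum(w1 * pe + w2 * sm + w3 * geo for pe, sm, geo in per_scale_terms)
```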
To verify the effectiveness of this embodiment, its performance was compared with that of other recent positioning algorithms based on unsupervised deep learning on the KITTI dataset, as shown in Tables 1 and 2.
Table 1 comparison of average translational error and average rotational error for various unsupervised neural network positioning algorithms
Table 2 comparison of absolute track errors and relative pose errors for various unsupervised neural network positioning algorithms
Table 1 gives the test results on the KITTI sequences seq09 and seq10; all algorithms were trained on seq00 to seq08 of the KITTI dataset and tested on seq09 and seq10, where $r_{rel}$ denotes the average rotational offset error and $t_{rel}$ denotes the average translational offset error, i.e. how much rotational and translational drift is produced on average per 100 m of travel (translation unit: m/100 m).
Table 2 gives the test results of this embodiment and other algorithms on the KITTI sequences seq09 and seq10, where ATE is the root mean square of the difference between the predicted camera pose and the true camera pose (absolute trajectory error), and RPE is the frame-to-frame relative pose error.
Fig. 4 compares the path trajectory output by this embodiment in the tests on the KITTI sequences seq09 and seq10 with the real path; as can be seen from fig. 4, the two curves are substantially identical, which indicates that the positioning of this embodiment is accurate.

Claims (9)

1. The vision-inertial navigation-radar fusion self-positioning method based on the self-supervision neural network is characterized by comprising the following steps of:
acquiring a video frame sequence, inertial navigation data and a laser radar point cloud image;
inputting the video frame sequence, the inertial navigation data and the laser radar point cloud image into a pose estimation network model to obtain relative pose estimation of an input video frame;
wherein, the pose estimation network model comprises:
the depth prediction network is used for obtaining a depth map according to the input video frame sequence;
the feature extraction network is used for extracting feature information from an input video frame sequence, inertial navigation data and a laser radar point cloud image respectively to obtain visual features, momentum features and radar features;
the feature fusion network is used for fusing the visual features and the radar features to obtain pre-fused corrected visual features and corrected radar features, and fusing the pre-fused corrected visual features and corrected radar features with the momentum features to obtain fusion features;
the pose estimation network is used for predicting a pose transformation matrix according to the fusion characteristics;
and the parameter optimization module is used for calculating a loss function according to the depth map, the pose transformation matrix and the video frame, and adjusting parameters of the pose estimation network model according to the loss function.
2. The self-supervised neural network based vision-inertial navigation-radar fusion self-localization method of claim 1, wherein the feature extraction network comprises:
a first feature extraction section that extracts visual features from the video frame sequence using a first convolutional network;
a second feature extraction part for extracting the momentum features from the inertial navigation data by adopting an LSTM network;
the third feature extraction part projects the laser radar point cloud image to a 2D plane, codes the laser radar point cloud image projected to the 2D plane in a three-channel coding mode, and extracts radar features by adopting a second convolution network; the first convolution network and the second convolution network have the same structure and share the weight of all network layers except the BN layer.
3. The self-supervision neural network-based vision-inertial navigation-radar fusion self-localization method according to claim 1 or 2, wherein the feature fusion network comprises:
the first fusion part is used for fusing the visual characteristics and the radar characteristics by adopting a channel exchange strategy to obtain pre-fused corrected visual characteristics and corrected radar characteristics;
and the second fusion part is used for carrying out channel splicing of the pre-fused corrected visual features and corrected radar features with the momentum features to obtain fusion features.
4. The self-supervised neural network based vision-inertial navigation-radar fusion self-localization method of claim 3, wherein the channel exchange strategy is
$$V'_{k,c}=\begin{cases}a_{v,k,c}\dfrac{V_{k,c}-\mu_{v,k,c}}{\sigma_{v,k,c}}+b_{v,k,c}, & a_{v,k,c}>\delta\\[4pt] a_{l,k,c}\dfrac{L_{k,c}-\mu_{l,k,c}}{\sigma_{l,k,c}}+b_{l,k,c}, & a_{v,k,c}\le\delta\end{cases}$$
wherein $V'_{k,c}$ denotes the visual feature of the c-th channel output by the k-th convolution layer after the exchange strategy, $V_{k,c}$ denotes the visual feature of the c-th channel output by the k-th convolution layer, $L_{k,c}$ denotes the radar feature of the c-th channel output by the k-th convolution layer, $a_{v,k,c}$, $b_{v,k,c}$, $\sigma_{v,k,c}$ and $\mu_{v,k,c}$ respectively denote the slope, bias, standard deviation and mean of the BN layer in the first convolution network, $a_{l,k,c}$, $b_{l,k,c}$, $\sigma_{l,k,c}$ and $\mu_{l,k,c}$ respectively denote the slope, bias, standard deviation and mean of the BN layer in the second convolution network, and $\delta$ is a threshold.
5. The self-supervised neural network based vision-inertial navigation-radar fusion self-localization method of claim 1, wherein the loss function comprises a reconstruction error, a depth smoothing loss and a geometric consistency loss, expressed as:
$$L_{all}=\sum_{l}\left(\omega_1 L_{pe}^{l}+\omega_2 L_{smooth}^{l}+\omega_3 L_{geo}^{l}\right)$$
wherein $L_{all}$ is the loss function, $L_{pe}$ denotes the reconstruction error, $L_{smooth}$ denotes the depth smoothing loss, $L_{geo}$ denotes the geometric consistency loss, $l$ denotes the scale index, and $\omega_1$, $\omega_2$, $\omega_3$ denote the weights of the reconstruction error, the depth smoothing loss and the geometric consistency loss respectively.
6. The self-supervised neural network based vision-inertial navigation-radar fusion self-localization method of claim 5, wherein the reconstruction error is expressed as:
$$L_{pe}=\sum_{p_s\in I_s}\Big(\lambda_1\big|I_t(p'_s)-I_s(p_s)\big|+\lambda_2\,\mathrm{SSIM}\big(I_t(p'_s),I_s(p_s)\big)\Big)$$
wherein $I_s$ is the source image, $p_s$ is a point on the source image $I_s$, $I_t$ is the target image, $p'_s$ is the point on the target image $I_t$ corresponding to the point $p_s$ on the source image $I_s$, $\mathrm{SSIM}(\cdot)$ is the structural similarity function, and $\lambda_1$ and $\lambda_2$ are weight coefficients.
7. The self-supervised neural network based vision-inertial navigation-radar fusion self-localization method of claim 5, wherein the reconstruction error is expressed as: $L_{pe}=\lambda_1\left|I'_s-I_s\right|+\lambda_2\,\mathrm{SSIM}(I'_s,I_s)$, wherein $I_s$ is the source image, $I'_s$ is the source image reconstructed from the target image $I_t$, $\mathrm{SSIM}(\cdot)$ is the structural similarity function, and $\lambda_1$ and $\lambda_2$ are weight coefficients.
8. The self-supervised neural network based vision-inertial navigation-radar fusion self-localization method of claim 5, wherein the depth smoothing loss is expressed as:
$$L_{smooth}=\sum_{p}\left(\left|\partial_x D_t(p)\right|e^{-\left|\partial_x I_t(p)\right|}+\left|\partial_y D_t(p)\right|e^{-\left|\partial_y I_t(p)\right|}\right)$$
wherein $D_t$ is the depth map corresponding to the image at time $t$ in the video frame sequence $S$, $I_t$ is the target image, and $\partial_x$ and $\partial_y$ denote the partial derivatives along the x and y directions of the two-dimensional image coordinates respectively.
9. The self-supervised neural network based vision-inertial navigation-radar fusion self-localization method of claim 5, wherein the geometric consistency loss is expressed as:
$$L_{geo}=\sum_{p}\frac{\left|D'_t(p)-D_t(p)\right|}{D'_t(p)+D_t(p)}$$
wherein $D_t$ is the depth map corresponding to the image at time $t$ in the video frame sequence $S$, and $D'_t$ denotes the depth map at the current time $t$ generated by reconstruction through the pose transformation using the depth map of the subsequent time.
CN202310495293.4A, filed 2023-05-05 (priority date 2023-05-05), Visual-inertial navigation-radar fusion self-positioning method based on self-supervision neural network, Pending, published as CN116824433A

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310495293.4A CN116824433A (en) 2023-05-05 2023-05-05 Visual-inertial navigation-radar fusion self-positioning method based on self-supervision neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310495293.4A CN116824433A (en) 2023-05-05 2023-05-05 Visual-inertial navigation-radar fusion self-positioning method based on self-supervision neural network

Publications (1)

Publication Number Publication Date
CN116824433A true CN116824433A (en) 2023-09-29

Family

ID=88111682

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310495293.4A Pending CN116824433A (en) 2023-05-05 2023-05-05 Visual-inertial navigation-radar fusion self-positioning method based on self-supervision neural network

Country Status (1)

Country Link
CN (1) CN116824433A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117058474A (en) * 2023-10-12 2023-11-14 南昌航空大学 Depth estimation method and system based on multi-sensor fusion
CN117058474B (en) * 2023-10-12 2024-01-12 南昌航空大学 Depth estimation method and system based on multi-sensor fusion


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination