CN114463420A - Visual odometry method based on attention convolutional neural network - Google Patents

Visual odometry method based on attention convolutional neural network

Info

Publication number
CN114463420A
CN114463420A
Authority
CN
China
Prior art keywords
gru
layer
attention
input
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210113074.0A
Other languages
Chinese (zh)
Inventor
高学金
牟雨曼
任明荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202210113074.0A priority Critical patent/CN114463420A/en
Publication of CN114463420A publication Critical patent/CN114463420A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a visual odometry method based on an attention convolutional neural network. To address the problems that traditional visual odometry requires images to contain abundant texture information and involves a complex solving process, and that visual odometry based on a convolutional neural network has low accuracy, a visual odometry method based on an attention convolutional neural network and a gated recurrent unit is proposed. The attention mechanism is used to improve the accuracy of feature extraction in the convolution module, thereby improving the accuracy of visual localization. Compared with conventional visual odometry methods, the proposed method maintains accuracy while abandoning the complex solving process, making it better suited to practical engineering applications.

Description

Visual odometry method based on attention convolutional neural network
Technical Field
The invention relates to the fields of deep learning and visual localization technology. Aiming at the problems that a traditional Visual Odometry (VO) method requires images to contain abundant texture information and involves a complex solving process, and that visual odometry based on a Convolutional Neural Network (CNN) has low accuracy, the invention provides a visual odometry method based on an attention convolutional neural network and a Gated Recurrent Unit (GRU). The attention mechanism is used to improve the accuracy of feature extraction in the convolution module, thereby improving the accuracy of visual localization.
Background
For the autonomous navigation of intelligent vehicles, the ability of a vehicle to localize itself during motion is very important. Early vehicle positioning systems typically employed wheel speed encoders to calculate vehicle mileage; however, this method suffers from significant cumulative errors. With the development of computer vision technology, vision sensors are increasingly used for vehicle positioning and motion estimation. A vision sensor not only provides rich perceptual information but also has the advantages of low cost and small size. The problem of obtaining the camera pose from vision is generally referred to as visual odometry. Research on visual odometry began in the 1980s. In recent years, with continued in-depth research on visual odometry by researchers in China and abroad, VO has gradually been applied in fields such as robotics, autonomous driving and pedestrian navigation.
The traditional visual odometry pipeline mainly comprises steps such as image feature extraction, feature matching and pose estimation, and finally obtains the six-degree-of-freedom camera pose, namely displacement (x, y, z) and rotation (roll, pitch, yaw). Traditional visual odometry methods fall mainly into two categories: direct methods and feature point methods. Direct methods calculate camera motion mainly from the grayscale information of pixels in the image; typical algorithms include SVO and LSD-SLAM. The feature point method is considered the mainstream approach to visual odometry: feature extraction and matching are first performed on the images, and the camera motion is then estimated from the matched feature points, using methods such as ICP (Iterative Closest Point), epipolar geometry and PnP (Perspective-n-Point) depending on the camera type. Well-known algorithms such as SIFT and ORB estimate the camera pose by extracting feature points. However, when many feature points are extracted the method is time-consuming and computationally expensive, and when too few feature points are extracted features are lost and the camera motion cannot be recovered; the feature point method therefore still faces great challenges in scenes with little texture or with moving dynamic targets. The direct method avoids feature loss and feature computation time, but it is easily affected by scenes with alternating light and dark.
With the continuous development of deep learning, it has gradually been applied to various fields such as fault diagnosis, machine translation and image classification. In recent years, the technique has also been used in visual odometry. In 2015, Kishore et al. first used a convolutional neural network to study visual odometry, designing two different convolutional networks to learn the velocity and rotation of the motion respectively. In the same year, Kendall et al. proposed the PoseNet model, which estimates the position and attitude of the camera from a single input picture, realizing end-to-end camera pose estimation with a convolutional neural network for the first time. In 2017, Wang et al. proposed the DeepVO model, adding a Recurrent Neural Network (RNN) on top of the convolutional neural network to maintain the temporal connectivity between images. The model takes sequences of adjacent pictures from the KITTI dataset as input, first extracts image features through a convolutional neural network module, then feeds them into the recurrent neural network to learn the geometric associations between images, and finally outputs the camera poses.
Existing deep-learning-based VO estimation methods still cannot match the accuracy of traditional methods. Although the latest research greatly improves pose accuracy by introducing optical flow, such methods are difficult to apply widely. At present, most existing deep-learning-based visual odometry methods rely on a single convolutional neural network or on a combination of a convolutional neural network and a recurrent neural network, and the accuracy of the estimated trajectory remains low, so there is still considerable room for research in model construction and related aspects.
Disclosure of Invention
In view of the above problems, we propose the ACGR (Attention Convolution and Gated Recurrent unit) model, as shown in Fig. 1, which incorporates an attention mechanism on top of the CNN and RNN and uses the attention mechanism to improve the accuracy of feature extraction, thereby improving the accuracy of the estimated trajectory.
The method comprises the following specific steps:
1. Feature extraction based on the Attention Convolution Model.
The model introduces a spatial and channel attention mechanism in the CNN at the same time. The structure is shown in fig. 2.
The ACGR designs a convolutional network module according to the implementation principle of the optical flow method used in traditional VO, and uses this module to extract image features, thereby computing the motion features of the image and expressing them as a vector.
2. GRU-based temporal modeling.
The GRU in the model has two layers, and each layer contains 1024 hidden units. In order to maintain the original data distribution, the activation function of the original GRU is changed into a ReLU function.
3. Fully connected layers for dimensionality reduction and pose output
Two fully connected layers are added to the model, containing 128 and 6 hidden units respectively; an overview of the tensor shapes through the whole pipeline is sketched below.
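As a purely illustrative shape walkthrough of these three steps (the sizes come from the detailed description below; the flattening of the convolutional output into one vector per frame pair is an assumption, since the patent does not state how the feature tensor is passed to the GRU):

```python
# Shape flow through the ACGR pipeline for a single pair of adjacent frames (batch size 1).
#
# stacked frame pair      : (1, 6, 384, 1280)   two resized 1280 x 384 RGB images
# attention-CNN features  : (1, 512, 3, 10)     the 10 x 3 x 512 tensor named in the text
# flatten (assumed)       : (1, 1, 15360)       one feature vector per frame pair
# 2-layer GRU, 1024 units : (1, 1, 1024)
# FC 1024 -> 128 -> 6     : (1, 1, 6)           relative pose (x, y, z, roll, pitch, yaw)
```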
Compared with the prior art, the invention has the following beneficial effects:
Firstly, a deep-learning-based visual odometry is realized; the method abandons the cumbersome steps of traditional visual odometry methods and achieves an end-to-end localization mode. Secondly, a gated recurrent unit is added to the conventional convolutional-neural-network-based visual odometry method to learn the temporal correlations in the image data. Thirdly, an attention mechanism is integrated into the feature extraction module, so that the network can learn the more important geometric features in the images more intelligently, enhancing the feature extraction capability of the convolutional neural network.
Drawings
FIG. 1 is a system framework diagram of a method in accordance with the present invention;
FIG. 2 is a view of the attention mechanism;
FIG. 3(a) is a trace plot of different algorithms in sequence 03;
FIG. 3(b) is a trace plot of the different algorithms in sequence 04;
FIG. 3(c) is a trace plot of the different algorithms in sequence 09;
FIG. 3(d) is a trace plot of different algorithms in sequence 10;
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
The invention uses an attention mechanism, a convolutional neural network and a gated recurrent unit to realize the visual odometry technique.
The method specifically comprises the following steps:
1. Feature extraction based on the Attention Convolution Model
(1) Convolutional neural network
The ACGR designs a convolutional network module according to the implementation principle of the optical flow method used in traditional VO, and uses this module to extract image features, thereby computing the motion features of the image and expressing them as a vector.
The parameters of the CNN are shown in Table 1. It contains 9 convolutional layers in total, with batch normalization added after each convolution to maintain the data distribution before and after the convolutional transformation, speed up model training, and avoid gradient explosion during backpropagation. In the CNN, the convolution kernel size is 7 × 7 in the first layer, 5 × 5 in the second and third layers, and 3 × 3 in the last six layers, and ReLU is selected as the activation function. The input of the module is a pair of adjacent frames from the KITTI dataset; to preserve the geometric features of the original data, the pictures are resized to a uniform 1280 × 384. After feature extraction by the 9 convolutional layers, the tensor size of the picture is 10 × 3 × 512 (a sketch of this module is given after Table 1).
TABLE 1 Convolutional layer parameter list
Tab. 1 Convolutional layer parameter list
(The table itself is provided only as an image in the original publication.)
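Because Table 1 is not reproduced above, the following PyTorch sketch of the convolution module is only a minimal illustration rather than the patented configuration: the kernel sizes (7 × 7, then 5 × 5, then 3 × 3), the batch normalization and ReLU after every convolution, and the 1280 × 384 input and 10 × 3 × 512 output are taken from the text, while the channel widths and the stride pattern are assumptions chosen so that the stated output shape is reproduced.

```python
import torch
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch, kernel, stride):
    """One Conv + BatchNorm + ReLU block, as described for every layer of the encoder."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel, stride=stride, padding=kernel // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class ConvEncoder(nn.Module):
    """9-layer convolutional feature extractor; channel widths and strides are assumed."""
    def __init__(self):
        super().__init__()
        # (out_channels, kernel, stride): kernels follow the text (7, 5, 5, 3...3);
        # channels and strides are guesses that yield a 512 x 3 x 10 output for a 384 x 1280 input.
        cfg = [(64, 7, 2), (128, 5, 2), (256, 5, 2), (256, 3, 1),
               (512, 3, 2), (512, 3, 2), (512, 3, 1), (512, 3, 2), (512, 3, 2)]
        layers, in_ch = [], 6  # two stacked RGB frames -> 6 input channels
        for out_ch, k, s in cfg:
            layers.append(conv_bn_relu(in_ch, out_ch, k, s))
            in_ch = out_ch
        self.encoder = nn.Sequential(*layers)

    def forward(self, frame_pair):
        # frame_pair: (batch, 6, 384, 1280), i.e. two adjacent resized KITTI frames stacked.
        return self.encoder(frame_pair)

if __name__ == "__main__":
    feats = ConvEncoder()(torch.randn(1, 6, 384, 1280))
    print(feats.shape)  # torch.Size([1, 512, 3, 10]) -> the 10 x 3 x 512 tensor in the text
```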
(2) Attention mechanism
The attention mechanism is embedded into the convolutional network, so that the network can learn the more important geometric features in the images more intelligently, enhancing the feature extraction capability of the convolutional neural network. The attention module is computed as follows:
Mc(F)=σ(MLP(AP(F))+MLP(MP(F))) (1)
Ms(F)=σ(f7×7([AP(F);MP(F)])) (2)
where MLP denotes a fully connected layer, F is the input feature map, Mc(F) is the one-dimensional channel attention feature, AP and MP denote average pooling and maximum pooling respectively, and Ms(F) is the two-dimensional spatial attention feature. σ is the sigmoid function, and f7×7 denotes a convolution operation with a filter size of 7 × 7.
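A minimal PyTorch sketch of the channel and spatial attention of equations (1) and (2) is given below. It follows the CBAM-style formulation implied by the text; the reduction ratio of the shared MLP and the order in which the two attentions are applied are assumptions not specified in the patent.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Mc(F) = sigmoid(MLP(AP(F)) + MLP(MP(F))) -- equation (1)."""
    def __init__(self, channels, reduction=16):  # reduction ratio is an assumption
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, f):
        b, c, _, _ = f.shape
        avg = self.mlp(f.mean(dim=(2, 3)))               # MLP(AP(F))
        mx = self.mlp(f.amax(dim=(2, 3)))                # MLP(MP(F))
        return torch.sigmoid(avg + mx).view(b, c, 1, 1)  # C x 1 x 1 channel weights

class SpatialAttention(nn.Module):
    """Ms(F) = sigmoid(f7x7([AP(F); MP(F)])) -- equation (2)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, f):
        avg = f.mean(dim=1, keepdim=True)   # average pooling over the channel dimension
        mx = f.amax(dim=1, keepdim=True)    # max pooling over the channel dimension
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # 1 x H x W map

class AttentionBlock(nn.Module):
    """Applies channel attention, then spatial attention, to a feature map."""
    def __init__(self, channels):
        super().__init__()
        self.ca, self.sa = ChannelAttention(channels), SpatialAttention()

    def forward(self, f):
        f = f * self.ca(f)
        return f * self.sa(f)
```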
2. GRU-based temporal modeling
The image data used for visual odometry have strong temporal correlation, so this correlation can be learned with a recurrent neural network. The GRU has a simpler structure and fewer parameters, so using the GRU effectively saves training time.
Denote the current time by t, the input by x_t, and the hidden state at the previous time by h_{t-1}. The output and state update equations of the GRU are as follows:
z_t=σ(W_z·[h_{t-1},x_t])
r_t=σ(W_r·[h_{t-1},x_t])
h̃_t=tanh(W·[r_t*h_{t-1},x_t])
h_t=(1-z_t)*h_{t-1}+z_t*h̃_t (3)
where z_t denotes the update gate in the GRU and σ is the sigmoid activation function, which keeps values between 0 and 1; r_t denotes the reset gate in the GRU, and the tanh activation function keeps values between -1 and 1; [·,·] denotes the concatenation of two vectors and * denotes the element-wise product. W, W_z and W_r denote weights, which are randomly initialized and updated continuously during training.
The GRU in the model has two layers, each containing 1024 hidden units. The input picture sequence is passed through the Attention Convolution Model for feature extraction to obtain a 10 × 3 × 512 tensor, which is input into the first GRU layer; the output of this layer is input into the second GRU layer, whose output is the output of the entire GRU module.
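As a sketch (assuming the 10 × 3 × 512 feature tensor is flattened into a single vector per frame pair before entering the recurrent part, which the patent does not state explicitly), the two-layer GRU module with 1024 hidden units could look as follows; note that the replacement of the GRU's tanh activation by ReLU mentioned earlier is not shown, since PyTorch's built-in nn.GRU does not expose that option.

```python
import torch
import torch.nn as nn

class RecurrentModule(nn.Module):
    """Two stacked GRU layers with 1024 hidden units each, run over a sequence of frame-pair features."""
    def __init__(self, feature_dim=10 * 3 * 512, hidden_size=1024):
        super().__init__()
        self.gru = nn.GRU(input_size=feature_dim, hidden_size=hidden_size,
                          num_layers=2, batch_first=True)

    def forward(self, features):
        # features: (batch, seq_len, 10*3*512), flattened CNN outputs for each adjacent frame pair
        outputs, _ = self.gru(features)
        return outputs  # (batch, seq_len, 1024), fed to the fully connected layers

# Example: a batch of 2 sequences, each with 5 frame pairs.
seq = torch.randn(2, 5, 10 * 3 * 512)
print(RecurrentModule()(seq).shape)  # torch.Size([2, 5, 1024])
```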
3. Fully connected layers
The model has two fully connected layers, containing 128 and 6 hidden units respectively. The fully connected layers reduce the dimensionality of the high-dimensional features output by the GRU module, and the final six-dimensional output is the relative pose between the picture at the current time and the picture at the previous time.
4. Loss function
The pose estimation problem in visual odometry is expressed as a conditional probability problem. As shown in equation (4), for given (n+1) pictures:
X=(X1,X2,...,Xn+1) (4)
the poses between adjacent pictures are obtained through calculation:
Y=(Y1,Y2,...,Yn) (5)
Treating the above as a conditional probability problem, the calculation formula is:
p(Y|X)=p(Y1,Y2,...,Yn∣X1,X2,...,Xn+1) (6)
The optimal network parameters w* are solved by maximizing the probability in equation (6):
w*=argmax_w p(Y|X;w) (7)
The Mean Squared Error (MSE) is used as the loss function, as shown in equation (8):
L=(1/M)Σ_{i=1}^{M}[β_1(||P̂_{1i}-P_{1i}||²+||Φ̂_{1i}-Φ_{1i}||²)+β_2(||P̂_{2i}-P_{2i}||²+||Φ̂_{2i}-Φ_{2i}||²)] (8)
where P_{1i} denotes the ground-truth displacement of the i-th forward-order input pair and Φ_{1i} its ground-truth rotation angle; P̂_{1i} denotes the predicted displacement of the i-th forward-order input pair and Φ̂_{1i} its predicted rotation angle; P_{2i} denotes the ground-truth displacement of the i-th reverse-order input pair and Φ_{2i} its ground-truth rotation angle; P̂_{2i} denotes the predicted displacement of the i-th reverse-order input pair and Φ̂_{2i} its predicted rotation angle; M denotes the number of samples, and β_1 and β_2 are scale factors for the forward-order and reverse-order input errors respectively.
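Below is a sketch of the bidirectional MSE loss of equation (8), under the assumptions that the displacement and rotation predictions are each three-dimensional and that the forward-order and reverse-order predictions are supplied as separate tensors:

```python
import torch

def acgr_loss(pred_fwd, gt_fwd, pred_rev, gt_rev, beta1=1.0, beta2=1.0):
    """Bidirectional MSE pose loss (equation 8).

    Each tensor has shape (M, 6): columns 0-2 are displacement, columns 3-5 are rotation angles.
    beta1 / beta2 weight the forward-order and reverse-order errors respectively
    (their actual values are not given in the text).
    """
    def pose_mse(pred, gt):
        disp_err = ((pred[:, :3] - gt[:, :3]) ** 2).sum(dim=1)  # ||P_hat - P||^2
        rot_err = ((pred[:, 3:] - gt[:, 3:]) ** 2).sum(dim=1)   # ||Phi_hat - Phi||^2
        return disp_err + rot_err

    return (beta1 * pose_mse(pred_fwd, gt_fwd) + beta2 * pose_mse(pred_rev, gt_rev)).mean()

# Example with random tensors for M = 4 sample pairs.
loss = acgr_loss(torch.randn(4, 6), torch.randn(4, 6), torch.randn(4, 6), torch.randn(4, 6))
print(loss.item())
```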
5. Experimental procedures and results
(1) Introduction to data set
The ACGR model is evaluated on a public dataset, the KITTI dataset. The KITTI Visual Odometry benchmark is a large-scale open-source dataset that is widely used to evaluate visual odometry models. Since only the first 11 sequences of the dataset provide ground truth, the first 11 sequences are chosen for the experiments. To meet the network's requirements on the input data while preserving the geometric features of the images, all pictures are uniformly resized to 1280 × 384. Sequences 03, 04, 09 and 10 contain the fewest pictures, and since a larger number of training pictures generally yields better training results, these four sequences are selected as the test set. Sequences 00, 01, 02, 05, 06, 07 and 08 contain a large number of pictures and are therefore used to train the model. Validation data are then randomly selected from these 7 training sequences, with the validation set containing roughly one third as many pictures as the training set. The specific division is shown in Table 2 and summarized in the sketch that follows it.
TABLE 2 Training set, validation set and test set
Tab. 2 Training set, validation set and test set
(The table itself is provided only as an image in the original publication.)
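The split can be summarized as a simple configuration (the sequence numbers come from the text; the validation subset is drawn at random from the training sequences, so only its approximate size is fixed):

```python
KITTI_SPLIT = {
    "train": ["00", "01", "02", "05", "06", "07", "08"],
    "test":  ["03", "04", "09", "10"],
    # validation frames are sampled at random from the training sequences,
    # roughly one third of the number of training frames
    "val_fraction_of_train": 1 / 3,
}
```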
(2) Model training
The model is built on the deep learning framework PyTorch, and the graphics card used is an Nvidia GeForce RTX 2080 Ti. Adam (Adaptive Moment Estimation) is selected as the optimizer, and batch gradient descent (BGD) is selected as the optimization algorithm.
During training, the weights of all parameters in the network are initialized with the Xavier method and the biases are initialized to zero. The number of iterations is set to 100 and the initial learning rate is set to 0.01. The model parameters from the 100th iteration are saved for the subsequent testing procedure.
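A sketch of this training setup is shown below; the model object, its output convention and the data loader layout are placeholders (the patent does not describe them), the batch size is unspecified, and acgr_loss refers to the loss sketch given in section 4.

```python
import torch
import torch.nn as nn

def init_weights(module):
    """Xavier initialization for weights, zeros for biases, as described above."""
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.xavier_uniform_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

def train(model, data_loader, epochs=100, lr=0.01, device="cuda"):
    model.apply(init_weights)
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # Adam, initial learning rate 0.01
    for epoch in range(epochs):                              # 100 iterations, as stated above
        for frames, poses_fwd, poses_rev in data_loader:     # hypothetical loader layout
            optimizer.zero_grad()
            pred_fwd, pred_rev = model(frames.to(device))    # hypothetical model interface
            loss = acgr_loss(pred_fwd, poses_fwd.to(device),
                             pred_rev, poses_rev.to(device))  # loss sketch from section 4
            loss.backward()
            optimizer.step()
    # keep the parameters after the 100th iteration (filename is illustrative)
    torch.save(model.state_dict(), "acgr_epoch_100.pth")
```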
(3) Experimental results and error analysis
In the experiment, the ACGR is trained on the 7 training-set sequences, the model is then tested on the 4 test-set sequences, and the performance of the ACGR is evaluated according to the test results. The trajectories estimated for test sequences 03, 04, 09 and 10 are shown in Fig. 3.
To verify the effect of the trajectories estimated by the model, four methods are compared in the experiment: ORB-SLAM [22], the SFMLearner algorithm based on a convolutional neural network, the CAGR-VO (Convolution And Gated Recurrent unit Visual Odometry) algorithm based on convolutional and recurrent neural networks, and the ACGR-VO algorithm. ORB-SLAM is a classical visual SLAM system that, in addition to the visual odometry, also contains a loop closure detection module; for the fairness of the comparison, no loop closure detection is used in the ORB-SLAM adopted in this experiment. As can be seen from Fig. 3, all four algorithms can approximate the shape of the trajectory. The feature-point-based ORB-SLAM is the most mature of the methods; because the version used here has no loop closure detection, its estimated trajectory fits the ground-truth trajectory poorly on sequence 09, while the fit on sequences 03, 04 and 10 is good and the trajectory prediction accuracy is high, whereas research on deep-learning-based visual odometry is still at the development stage and leaves considerable room for improvement. Although the other three model-based visual odometry estimates in the experiment perform slightly worse than the ORB algorithm, the fitting effect of the ACGR-VO model is better than that of the SFMLearner model based on a convolutional neural network and the CAGR-VO model based on convolutional and recurrent neural networks. Therefore, the accuracy of the ACGR-VO model is improved compared with other models based on convolutional neural networks, and adding an attention mechanism can effectively improve the accuracy of the trajectory predicted by the model.

Claims (1)

1. A visual odometry method based on an attention convolutional neural network, characterized by comprising the following steps:
1) Feature extraction based on the Attention Convolution Model
(1) Convolutional neural network
The parameters of the CNN are shown in Table 1, comprising 9 convolutional layers in total, with batch normalization added after each convolution; in the CNN, the convolution kernel size is 7 × 7 in the first layer, 5 × 5 in the second and third layers, and 3 × 3 in the last six layers, and ReLU is selected as the activation function; the input of the module is a pair of adjacent frames from the KITTI dataset, and to preserve the geometric features of the original data the pictures are resized to a uniform 1280 × 384; after feature extraction by the 9 convolutional layers, the tensor size of the picture is 10 × 3 × 512;
TABLE 1 Convolutional layer parameter list
(The table itself is provided only as an image in the original publication.)
(2) Attention mechanism
The attention mechanism is embedded into the convolutional network, and the attention module is computed as follows:
Mc(F)=σ(MLP(AP(F))+MLP(MP(F))) (1)
Ms(F)=σ(f7×7([AP(F);MP(F)])) (2)
wherein MLP denotes a fully connected layer, F is the input feature map, Mc(F) is the one-dimensional channel attention feature, AP and MP denote average pooling and maximum pooling respectively, and Ms(F) is the two-dimensional spatial attention feature; σ is the sigmoid function, and f7×7 denotes a convolution operation with a filter size of 7 × 7;
2) timing modeling based on GRU
The current time is denoted by t, the input by x_t, and the hidden state at the previous time by h_{t-1}; the output and state update equations of the GRU are as follows:
z_t=σ(W_z·[h_{t-1},x_t])
r_t=σ(W_r·[h_{t-1},x_t])
h̃_t=tanh(W·[r_t*h_{t-1},x_t])
h_t=(1-z_t)*h_{t-1}+z_t*h̃_t (3)
wherein z_t denotes the update gate in the GRU, σ is the sigmoid activation function keeping values between 0 and 1, r_t denotes the reset gate in the GRU, the tanh activation function keeps values between -1 and 1, [·,·] denotes the concatenation of two vectors, and * denotes the element-wise product; W, W_z and W_r denote weights, which are randomly initialized and updated continuously during training;
the GRU in the model has two layers, and each layer contains 1024 hidden units; the input picture sequence is subjected to feature extraction through an Attention conversion Model to obtain a tensor with the size of 10 multiplied by 3 multiplied by 512, then the tensor is input into a first layer of GRU, the output of the layer is input into a second layer of GRU, and the output of the second layer of GRU is the output of the whole GRU module;
3) full connection layer
There are two fully connected layers, containing 128 and 6 hidden units respectively; the fully connected layers reduce the dimensionality of the high-dimensional features output by the GRU module, and the finally output six-dimensional tensor is the relative pose between the picture at the current time and the picture at the previous time;
4) loss function
The pose estimation problem in visual odometry is expressed as a conditional probability problem; as shown in equation (4), for given (n+1) pictures:
X=(X1,X2,…,Xn+1) (4)
and calculating to obtain the pose between the adjacent pictures:
Y=(Y1,Y2,...,Yn) (5)
considering the above problem as a conditional probability problem, the calculation formula is:
p(Y|X)=p(Y1,Y2,...,Yn∣X1,X2,...,Xn+1) (6)
the optimal network parameters w* are solved by maximizing the probability in equation (6):
w*=argmax_w p(Y|X;w) (7)
the Mean Squared Error (MSE) is used as the loss function, as shown in equation (8):
L=(1/M)Σ_{i=1}^{M}[β_1(||P̂_{1i}-P_{1i}||²+||Φ̂_{1i}-Φ_{1i}||²)+β_2(||P̂_{2i}-P_{2i}||²+||Φ̂_{2i}-Φ_{2i}||²)] (8)
wherein P_{1i} denotes the ground-truth displacement of the i-th forward-order input pair and Φ_{1i} its ground-truth rotation angle; P̂_{1i} denotes the predicted displacement of the i-th forward-order input pair and Φ̂_{1i} its predicted rotation angle; P_{2i} denotes the ground-truth displacement of the i-th reverse-order input pair and Φ_{2i} its ground-truth rotation angle; P̂_{2i} denotes the predicted displacement of the i-th reverse-order input pair and Φ̂_{2i} its predicted rotation angle; M denotes the number of samples, and β_1 and β_2 are scale factors for the forward-order and reverse-order input errors respectively;
the weights of all parameters in the network during training adopt a xavier initialization method, and the deviation adopts a zero initialization method; the iteration times are set to 100, and the initial learning rate is set to 0.01; and save the model parameters of item 100 for subsequent testing procedures.
CN202210113074.0A 2022-01-29 2022-01-29 Visual odometry method based on attention convolutional neural network Pending CN114463420A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210113074.0A CN114463420A (en) 2022-01-29 2022-01-29 Visual mileage calculation method based on attention convolution neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210113074.0A CN114463420A (en) 2022-01-29 2022-01-29 Visual mileage calculation method based on attention convolution neural network

Publications (1)

Publication Number Publication Date
CN114463420A true CN114463420A (en) 2022-05-10

Family

ID=81410808

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210113074.0A Pending CN114463420A (en) 2022-01-29 2022-01-29 Visual mileage calculation method based on attention convolution neural network

Country Status (1)

Country Link
CN (1) CN114463420A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180231985A1 (en) * 2016-12-22 2018-08-16 TCL Research America Inc. System and method for vision-based flight self-stabilization by deep gated recurrent q-networks
CN111369608A (en) * 2020-05-29 2020-07-03 南京晓庄学院 Visual odometer method based on image depth estimation
CN111739082A (en) * 2020-06-15 2020-10-02 大连理工大学 Stereo vision unsupervised depth estimation method based on convolutional neural network
CN112419411A (en) * 2020-11-27 2021-02-26 广东电网有限责任公司肇庆供电局 Method for realizing visual odometer based on convolutional neural network and optical flow characteristics
CN112556719A (en) * 2020-11-27 2021-03-26 广东电网有限责任公司肇庆供电局 Visual inertial odometer implementation method based on CNN-EKF
US20210390723A1 (en) * 2020-06-15 2021-12-16 Dalian University Of Technology Monocular unsupervised depth estimation method based on contextual attention mechanism
CN113888638A (en) * 2021-10-08 2022-01-04 南京航空航天大学 Pedestrian trajectory prediction method based on attention mechanism and through graph neural network

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180231985A1 (en) * 2016-12-22 2018-08-16 TCL Research America Inc. System and method for vision-based flight self-stabilization by deep gated recurrent q-networks
CN111369608A (en) * 2020-05-29 2020-07-03 南京晓庄学院 Visual odometer method based on image depth estimation
CN111739082A (en) * 2020-06-15 2020-10-02 大连理工大学 Stereo vision unsupervised depth estimation method based on convolutional neural network
US20210390723A1 (en) * 2020-06-15 2021-12-16 Dalian University Of Technology Monocular unsupervised depth estimation method based on contextual attention mechanism
CN112419411A (en) * 2020-11-27 2021-02-26 广东电网有限责任公司肇庆供电局 Method for realizing visual odometer based on convolutional neural network and optical flow characteristics
CN112556719A (en) * 2020-11-27 2021-03-26 广东电网有限责任公司肇庆供电局 Visual inertial odometer implementation method based on CNN-EKF
CN113888638A (en) * 2021-10-08 2022-01-04 南京航空航天大学 Pedestrian trajectory prediction method based on attention mechanism and through graph neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG Jianan et al.: "Optical remote sensing scene classification based on vision transformer and graph convolutional network", Acta Photonica Sinica, vol. 50, no. 11, 30 November 2021 (2021-11-30), pages 314-321 *

Similar Documents

Publication Publication Date Title
CN110335337B (en) Method for generating visual odometer of antagonistic network based on end-to-end semi-supervision
CN110210551B (en) Visual target tracking method based on adaptive subject sensitivity
CN108416840B (en) Three-dimensional scene dense reconstruction method based on monocular camera
CN108491880B (en) Object classification and pose estimation method based on neural network
CN107369166B (en) Target tracking method and system based on multi-resolution neural network
CN109800689B (en) Target tracking method based on space-time feature fusion learning
Wang et al. Sne-roadseg+: Rethinking depth-normal translation and deep supervision for freespace detection
CN104200494B (en) Real-time visual target tracking method based on light streams
CN109740742A (en) A kind of method for tracking target based on LSTM neural network
Li et al. Dual-view 3d object recognition and detection via lidar point cloud and camera image
CN111462191B (en) Non-local filter unsupervised optical flow estimation method based on deep learning
CN111376273B (en) Brain-like inspired robot cognitive map construction method
CN111368759B (en) Monocular vision-based mobile robot semantic map construction system
CN110473284A (en) A kind of moving object method for reconstructing three-dimensional model based on deep learning
CN111160294B (en) Gait recognition method based on graph convolution network
CN113436227A (en) Twin network target tracking method based on inverted residual error
CN111833400B (en) Camera pose positioning method
CN115375737B (en) Target tracking method and system based on adaptive time and serialized space-time characteristics
CN112686952A (en) Image optical flow computing system, method and application
CN116563682A (en) Attention scheme and strip convolution semantic line detection method based on depth Hough network
AU2020102476A4 (en) A method of Clothing Attribute Prediction with Auto-Encoding Transformations
CN113673313B (en) Gesture recognition method based on hierarchical convolutional neural network
CN114463420A (en) Visual mileage calculation method based on attention convolution neural network
CN115830707A (en) Multi-view human behavior identification method based on hypergraph learning
CN115690170A (en) Method and system for self-adaptive optical flow estimation aiming at different-scale targets

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination