CN111127557A - Visual SLAM front-end attitude estimation method based on deep learning - Google Patents

Visual SLAM front-end attitude estimation method based on deep learning

Info

Publication number
CN111127557A
Authority
CN
China
Prior art keywords
network
training
layer
pose
feature extraction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911278664.3A
Other languages
Chinese (zh)
Other versions
CN111127557B (en)
Inventor
高嘉瑜
李斌
李阳
景鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 20 Research Institute
Original Assignee
CETC 20 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 20 Research Institute filed Critical CETC 20 Research Institute
Priority to CN201911278664.3A priority Critical patent/CN111127557B/en
Publication of CN111127557A publication Critical patent/CN111127557A/en
Application granted granted Critical
Publication of CN111127557B publication Critical patent/CN111127557B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/70 - Determining position or orientation of objects or cameras
    • G06T 7/73 - Determining position or orientation of objects or cameras using feature-based methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 - Special algorithmic details
    • G06T 2207/20081 - Training; Learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 - Special algorithmic details
    • G06T 2207/20084 - Artificial neural networks [ANN]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 - Subject of image; Context of image processing
    • G06T 2207/30241 - Trajectory

Abstract

The invention provides a deep-learning-based visual SLAM front-end pose estimation method that estimates the pose transformation between frames in real time. The original data set is first preprocessed, and a Brox network is then constructed to extract dense optical flow from the input consecutive frames. The extracted optical flow map is fed into two feature extraction branches: one branch extracts high-dimensional features from global information, while the other divides the optical flow map into 4 sub-images and samples each separately to obtain image features. Finally, the features produced by the two branches are fused and passed to a cascaded fully connected network that estimates the pose between two adjacent frames. The invention solves the problem of true scale estimation in monocular vision, extracts camera motion and scale information from both global and local information, and improves the learning ability and intelligence level of the robot.

Description

Visual SLAM front-end attitude estimation method based on deep learning
Technical Field
The invention relates to the field of visual navigation, and in particular to a visual SLAM front-end pose estimation method. Continuous image frames are input end to end and the pose transformation between frames is estimated in real time, providing unmanned aerial vehicles with a highly robust visual SLAM method based on deep learning.
Background
Simultaneous localization and mapping (SLAM) is a technology whereby an intelligent agent such as an unmanned aerial vehicle uses its onboard sensors to build a map of the surrounding environment during motion and to localize itself within the map being built. When an unmanned aerial vehicle operates in certain special environments, it is easily disturbed by the environment, so that the GPS signal is weakened or lost entirely. A complete SLAM framework consists of four parts: front-end tracking, back-end optimization, loop detection and map reconstruction. Front-end tracking, i.e. the visual odometer, is responsible for the preliminary estimation of the camera pose between frames and of the positions of map points; back-end optimization receives the pose information measured by the visual odometer at the front end and computes a maximum a posteriori estimate; loop detection judges whether the robot has returned to a previously visited position and corrects the estimation error by closing the loop; and map reconstruction builds a map suited to the task requirements from the camera poses and images.
However, since 2017 the traditional visual SLAM schemes have made no substantial progress, and their robustness remains low under adverse conditions such as poor illumination or large illumination changes.
With the development of deep learning in the field of computer vision, more and more vision problems are being solved with deep learning. Combining deep learning with SLAM relaxes the application limitations caused by hand-crafted features in modules such as visual odometry and scene recognition, and improves the learning ability and intelligence level of the robot. Feature point extraction in traditional SLAM algorithms is easily affected by scene factors, in particular illumination intensity and scene content, whereas the features extracted by a deep network generalize better.
Visual pose estimation is a basic building block of a visual SLAM system and implements the function of the system's front-end visual odometer. Current visual odometers are realized mainly by learning methods and by geometric methods. The geometric method is mainly realized by extracting features (such as ORB features, SIFT features and the like) in two consecutive images, matching them, and computing the motion between the two images.
However, both methods have certain defects. The learning method has poor generality: in particular, when the test scene differs greatly from the training scene or the motion speed changes, the performance of the algorithm is strongly affected.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides a visual SLAM front-end pose estimation method based on deep learning. The method addresses the poor generality of existing visual pose estimation realized with learning methods, as well as the poor real-time performance, difficult feature detection and poor robustness of visual odometers realized with geometric methods.
The technical scheme adopted by the invention for solving the technical problem comprises the following specific steps:
step 1): carrying out data preprocessing on the training data set;
1.1) firstly, cropping the images in the KITTI database so that they all have the same size;
1.2) then utilizing a conversion matrix between adjacent frames to expand the data set;
The expansion uses a step length N: let the number of samples in the original data set be S, and let the pose transformation matrix between time i and time j be T_{i,j}; the pose matrix between time t and time (t+N) is then T_{t,t+N} = T_{t,t+1} · T_{t+1,t+2} · T_{t+2,t+3} · ... · T_{t+N-1,t+N}; using this transformation relation with expansion step length N, the data set is expanded to N·S samples, where S is the number of training samples provided by the KITTI data set;
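By way of illustration only, the following sketch chains adjacent 4x4 homogeneous pose matrices to obtain the relative pose over a step of N frames; the list T_adj of adjacent transforms and the other names are hypothetical and not part of the claimed method.

```python
import numpy as np

def expand_poses(T_adj, N):
    """Chain adjacent relative pose matrices T_adj[t] = T_{t,t+1} (4x4 homogeneous)
    to obtain the relative poses T_{t,t+N} over a step of N frames."""
    expanded = []
    for t in range(len(T_adj) - N + 1):
        T = np.eye(4)
        for k in range(N):
            # T_{t,t+N} = T_{t,t+1} . T_{t+1,t+2} . ... . T_{t+N-1,t+N}
            T = T @ T_adj[t + k]
        expanded.append(T)
    return expanded

# With N = 2, each pair of adjacent transforms T_{t,t+1}, T_{t+1,t+2}
# yields one additional sample T_{t,t+2}, growing S samples towards N*S.
```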
1.3) data conversion;
converting the trajectory data provided by KITTI from pose-matrix form into relative pose transformation vectors between adjacent frames by using Peter Corke's Robotics Toolbox, i.e. converting the rotation matrix into Euler angles and the displacement part into a displacement vector;
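A minimal sketch of this conversion, using SciPy's Rotation class in place of Peter Corke's Robotics Toolbox (a substitution made purely for illustration); the 'zyx' Euler convention is likewise an assumption, since the text does not fix one.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def pose_matrix_to_vector(T_rel):
    """Convert a 4x4 relative pose matrix into a 6-D vector
    [roll, pitch, yaw, tx, ty, tz] between adjacent frames."""
    R = T_rel[:3, :3]                                 # rotation part
    t = T_rel[:3, 3]                                  # displacement part
    euler = Rotation.from_matrix(R).as_euler('zyx')   # rotation matrix -> Euler angles
    return np.concatenate([euler, t])
```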
step 2): constructing an offline deep neural network model;
let the 6-degree-of-freedom inter-frame pose parameter be f; the target f is obtained by mapping an input parameter variable x, where x acts as an auxiliary parameter; w is the training-sequence coefficient obtained from the training data set, and b is the residual between the true value and the calculated value, used for correction;
2.1) division of training set and validation set: since only sequences 00-10 of the data set provided by KITTI can be used for offline training, the first M of the 00-10 sequences are used as the training set and the remaining (11-M) sequences as the test set; the training set is used for network training and the test set for verifying the accuracy of network learning;
2.2) building an offline learning deep neural network model;
2.2.1) building an optical flow extraction network, and finishing the extraction of an initial optical flow field by using adjacent image frames: adopting a Brox algorithm network as an optical flow extractor, calculating optical flow between two frames of images at time t and t +1, and quantizing the calculated optical flow field by using RGB (red, green and blue) coding, so that input data is in a three-channel eight-bit depth image format;
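To illustrate the idea of quantizing a flow field into a three-channel, eight-bit image, the sketch below uses OpenCV's Farnebäck optical flow as a stand-in for the Brox network; both the stand-in flow method and the HSV-style encoding are assumptions for illustration only.

```python
import cv2
import numpy as np

def flow_to_rgb(frame_t, frame_t1):
    """Compute dense optical flow between two consecutive grayscale frames and
    quantize it into a 3-channel, 8-bit image (direction -> hue, magnitude -> value)."""
    flow = cv2.calcOpticalFlowFarneback(frame_t, frame_t1, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hsv = np.zeros((*frame_t.shape, 3), dtype=np.uint8)
    hsv[..., 0] = ang * 180 / np.pi / 2                              # direction as hue
    hsv[..., 1] = 255
    hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX)  # magnitude as value
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2RGB)  # three-channel, eight-bit input image
```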
2.2.2) building a global feature extraction network;
carrying out T1 downsampling on the whole image, then carrying out deep network training, selecting a convolutional neural network for feature extraction, carrying out training by using the global information of the optical flow diagram, and acquiring the global features of the optical flow diagram;
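A minimal Keras sketch of such a global branch follows; the filter counts, kernel sizes and the use of average pooling for the 8x downsampling are illustrative assumptions, since the patent does not fix the layer dimensions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_global_branch(input_shape=(370, 1226, 3), down_factor=8):
    """Global feature extraction branch: downsample the whole optical-flow image
    by a factor T1, then extract global features with a small CNN."""
    flow = layers.Input(shape=input_shape)
    x = layers.AveragePooling2D(pool_size=down_factor)(flow)        # T1 = 8 downsampling
    x = layers.Conv2D(32, 3, activation='relu', padding='same')(x)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Conv2D(64, 3, activation='relu', padding='same')(x)
    x = layers.Flatten()(x)                                         # global feature vector
    return tf.keras.Model(flow, x, name='global_branch')
```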
2.2.3) building a local feature extraction network;
dividing the encoded optical flow image into four sub-images, downsampling each quadrant by a factor of T2 and then training it through CNN filters; each sub-image undergoes two-stage training, CNN1 followed by CNN2, with a fully connected layer cascaded at the end;
the first part of the local feature extraction network consists of four branches, each sub-image being trained separately; each of the four quadrants of the image contains motion information used to compute the motion estimate; the output of the first CNN-pooling layer pair is then correlated with that of the second CNN-pooling layer; CNN1 and CNN2 extract different information from the optical flow images: CNN1 extracts finer details while CNN2 extracts coarser details, and the two do not overlap completely;
the four resulting features are combined into one feature containing global image information, so that the network can resolve motion blur using symmetric information; the last layer connects to a fully connected network that uses the information of all four quadrants at two resolutions, as sketched below;
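The sketch below mirrors this structure under the same illustrative assumptions (layer sizes are not specified by the patent): each quadrant passes through two CNN-pooling stages, the two stage outputs are concatenated per quadrant, and the four quadrant features are then combined.

```python
import tensorflow as tf
from tensorflow.keras import layers

def quadrant_branch(quad_shape, name):
    """One local branch: two CNN-pooling stages (CNN1 finer, CNN2 coarser),
    whose outputs are concatenated to form the quadrant feature."""
    quad = layers.Input(shape=quad_shape)
    x = layers.AveragePooling2D(4)(quad)                            # T2 = 4 downsampling
    c1 = layers.Conv2D(32, 3, activation='relu', padding='same')(x)
    p1 = layers.MaxPooling2D(2)(c1)                                 # CNN1-pooling pair
    c2 = layers.Conv2D(64, 3, activation='relu', padding='same')(p1)
    p2 = layers.MaxPooling2D(2)(c2)                                 # CNN2-pooling pair
    feat = layers.Concatenate()([layers.Flatten()(p1), layers.Flatten()(p2)])
    return tf.keras.Model(quad, feat, name=name)

def build_local_branch(quad_shape=(185, 613, 3)):
    """Split the optical-flow image into four quadrants, train each separately,
    then merge the four quadrant features into one local feature vector."""
    quads = [layers.Input(shape=quad_shape) for _ in range(4)]
    feats = [quadrant_branch(quad_shape, f'quad_{i}')(q) for i, q in enumerate(quads)]
    merged = layers.Concatenate()(feats)    # information of all four quadrants, two resolutions
    return tf.keras.Model(quads, merged, name='local_branch')
```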
2.2.4) merging the global feature extraction network and the local feature extraction network to build an optical flow graph feature extraction network;
the outputs of the CNN filters of the global feature extraction network and of the local feature extraction network are fed to the next fully connected layer, so that the global information of the former is combined with the local information of the latter to improve network performance;
2.2.5) constructing a pose estimation network, wherein the pose estimation network is responsible for integrating all the characteristics and finally finishing an estimation task of an interframe pose vector;
the pose estimation network is the regressor that finally completes the inter-frame pose vector estimation; the overall form remains a combination of convolutional network and fully connected network: all features extracted by the feature extraction network are input into the fully connected network, and the fully connected layers are responsible for the final feature integration and for regressing the nonlinear relation of the pose estimation problem in the geometric mapping; there are three fully connected layers, and the last layer finally fits the feature vector into a six-dimensional pose vector, realizing the estimation of the pose vector y;
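A sketch, again with assumed hidden-layer widths, of how the two feature branches could be fused and regressed into the six-dimensional pose vector by three fully connected layers:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_pose_regressor(global_dim, local_dim):
    """Fuse global and local features and regress the 6-D inter-frame pose vector
    [roll, pitch, yaw, tx, ty, tz] through three fully connected layers."""
    g = layers.Input(shape=(global_dim,))
    l = layers.Input(shape=(local_dim,))
    x = layers.Concatenate()([g, l])               # merge the two feature branches
    x = layers.Dense(1024, activation='relu')(x)   # final feature integration
    x = layers.Dense(128, activation='relu')(x)    # non-linear regression of the mapping
    pose = layers.Dense(6)(x)                      # six-dimensional pose vector y
    return tf.keras.Model([g, l], pose, name='pose_regressor')
```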
step 3): selecting a network loss function;
the Euclidean distance loss function is selected as the error measure: for all samples in the current training batch, the Euclidean distance between the predicted value and the true label is computed, and the mean of the squared distances, i.e. the mean square error (MSE), is taken;
assuming that the predicted values and the true values of the model follow a normal distribution, the model that best fits the measurements is the one that maximizes the product of the sample probabilities; carrying out the corresponding derivation yields the least-squares criterion; the Euclidean distance loss function is therefore
Loss(W) = (1/N) · Σ_{i=1}^{N} ‖f_W(X_i) − Y_i‖²
where W denotes the weight parameters of the network model, f_W(X_i) is the output of the network, Y_i is the ground-truth pose vector, and N is the size of the training set;
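A minimal sketch of this loss in the notation above, where y_pred plays the role of the network output f_W(X_i) and y_true the ground-truth pose vectors Y_i:

```python
import tensorflow as tf

def euclidean_pose_loss(y_true, y_pred):
    """Mean square error over the batch: the average of the squared Euclidean
    distances between predicted and ground-truth 6-D pose vectors."""
    sq_dist = tf.reduce_sum(tf.square(y_pred - y_true), axis=-1)  # ||f_W(X_i) - Y_i||^2
    return tf.reduce_mean(sq_dist)                                # average over N samples
```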
step 4): training a network;
4.1) randomly initializing network parameters by using Gaussian distribution;
4.2) pre-training layer by greedy to optimize network parameters layer by layer;
the greedy layer-by-layer pre-training method finds a good local minimum for the filter coefficients of each layer and then performs global training to fine-tune the weights; for each branch, the CNN1 filter is trained together with its connected fully connected layers, the fully connected layers are then discarded, the output of CNN1 is fed to CNN2 and only the new fully connected layers are trained; the fully connected layers are discarded again, the two CNN outputs are connected and a third estimator is trained; this process is repeated for each branch, after which the last fully connected layers are discarded and the four quadrant outputs are connected to the final fully connected network, which trains the final estimator and fine-tunes the CNN coefficients, as sketched below;
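The sketch below outlines this procedure for one quadrant branch under the same illustrative Keras assumptions; the helper fit_with_temp_head is hypothetical, flows and poses stand in for quadrant flow crops and pose labels, and for brevity the second stage also updates CNN2 rather than strictly only the new fully connected layers.

```python
import tensorflow as tf
from tensorflow.keras import layers

def fit_with_temp_head(inputs, features, flows, poses):
    """Train the layers producing `features` through a temporary FC head,
    then discard the head and keep only the trained feature layers."""
    head = layers.Dense(6)(layers.Dense(128, activation='relu')(features))
    stage = tf.keras.Model(inputs, head)
    stage.compile(optimizer='adam', loss='mse')
    stage.fit(flows, poses, epochs=5, verbose=0)
    return features

def greedy_pretrain_branch(quad_shape, flows, poses):
    quad = layers.Input(shape=quad_shape)
    pooled = layers.AveragePooling2D(4)(quad)
    cnn1 = layers.Conv2D(32, 3, activation='relu', padding='same')
    cnn2 = layers.Conv2D(64, 3, activation='relu', padding='same')
    p1 = layers.MaxPooling2D(2)(cnn1(pooled))
    p2 = layers.MaxPooling2D(2)(cnn2(p1))
    # Stage 1: train CNN1 together with a temporary fully connected head.
    fit_with_temp_head(quad, layers.Flatten()(p1), flows, poses)
    # Stage 2: discard that head, freeze CNN1, feed its output to CNN2
    # and train the new stage only.
    cnn1.trainable = False
    fit_with_temp_head(quad, layers.Flatten()(p2), flows, poses)
    # Stage 3: discard again, concatenate both CNN outputs and train a third estimator.
    both = layers.Concatenate()([layers.Flatten()(p1), layers.Flatten()(p2)])
    fit_with_temp_head(quad, both, flows, poses)
    return tf.keras.Model(quad, both)   # the final fully connected network is trained
                                        # afterwards, fine-tuning all CNN coefficients
```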
4.3) for the batch normalization layers, the use_global_stats parameter is set to false during training and to true during testing; the network uses the Euclidean distance loss function;
all samples are randomly shuffled during training, and the network parameters are adjusted and optimized in the mini-batch fashion of deep learning models; the network is optimized with the Adam algorithm.
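Regarding the use_global_stats setting of step 4.3): in Keras terms (an assumed translation of Caffe's BatchNorm parameter), it corresponds to the training flag of the BatchNormalization layer, as this small sketch shows.

```python
import tensorflow as tf
from tensorflow.keras import layers

bn = layers.BatchNormalization()
x = tf.random.normal((8, 16))

# Training pass: use_global_stats = false -> normalize with batch statistics
# and update the running (global) mean and variance.
y_train = bn(x, training=True)

# Test pass: use_global_stats = true -> normalize with the stored global statistics.
y_test = bn(x, training=False)
```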
The size of the target image in the step 1.1) is the size with the largest number of image sizes in the image sequence.
The step size N in step 1.2) is selected to be 2.
M in said step 2.1) is 8.
T1 in said step 2.2.2) is 8.
T2 in said step 2.2.3) is 4.
The invention has the beneficial effects that:
A) different from the traditional geometric-optimization algorithm, the method combines the characteristics of the deep learning algorithm, learns the fitting pose estimation function through the training process on the premise of not needing any camera external parameters, and simultaneously solves the problem of real scale estimation in monocular vision.
B) The global and local information can be used to extract camera motion and scale information while processing noise in the input, and the new features extracted using CNN are robust in images with different contrast and blur parameters.
C) The combination of deep learning and SLAM improves the application limitation caused by manual design characteristics such as visual odometry, scene recognition and the like, and improves the learning ability and the intelligent level of the robot. The feature point extraction in the traditional SLAM algorithm is easily influenced by scene factors, particularly illumination intensity and scene content, and the features extracted by the deep network have better generalization performance.
Drawings
FIG. 1 is a schematic diagram of the basic flow of the present invention.
FIG. 2 is a method for constructing a local feature extraction network according to the present invention.
FIG. 3 is an offline deep neural network model constructed by the present invention.
Detailed Description
The invention is further illustrated with reference to the following figures and examples.
After continuous image frames are input end to end, the pose transformation between frames is estimated in real time. The original data set is first preprocessed, including data-set expansion and data conversion, and a Brox network is then constructed to extract dense optical flow from the input consecutive frames. The extracted optical flow map is fed into two feature extraction branches: one branch extracts high-dimensional features from the global information, while the other divides the optical flow map into 4 sub-images and samples each separately to obtain image features. The features obtained by training the two branches are then fused, and a cascaded fully connected network performs pose estimation to obtain the pose between two adjacent frames. The network is trained to obtain suitable parameters, and the trained network is used to test accuracy and runtime on the test set. Without any explicit geometric operations or external camera parameters, the network structure automatically learns the functional relation between the input data and the pose estimate. The CNN architecture outperforms existing inter-frame estimation methods and keeps the algorithm robust under image degradations (e.g. blur, large contrast and brightness changes). The deep learning approach addresses the poor real-time performance and poor robustness of the SLAM visual odometer and the problem of rapid localization of intelligent carriers such as unmanned aerial vehicles in unknown environments.
Step 1): carrying out data preprocessing on the training data set;
1.1) firstly, cropping the images in the KITTI database so that they all have the same size;
1.2) then utilizing a conversion matrix between adjacent frames to expand the data set;
because the expanded image pairs need to share a certain overlapping area, a step length N is selected for the expansion: let the number of samples in the original data set be S, and let the pose transformation matrix between time i and time j be T_{i,j}; the pose matrix between time t and time (t+N) is then T_{t,t+N} = T_{t,t+1} · T_{t+1,t+2} · T_{t+2,t+3} · ... · T_{t+N-1,t+N}; using this transformation relation with expansion step length N, the data set is expanded to N·S samples, where S is the number of training samples provided by the KITTI data set;
1.3) data conversion;
converting the trajectory data provided by KITTI from pose-matrix form (stored row-major, flattened into a single row) into relative pose transformation vectors between adjacent frames by means of Peter Corke's Robotics Toolbox, i.e. converting the rotation matrix into Euler angles and the displacement part into a displacement vector;
step 2): and (3) constructing an offline deep neural network model, as shown in FIG. 3.
According to the traditional function mapping f = w·x + b, the target value f is obtained by adding a bias value to the product of the coefficient w and the variable x; let the 6-degree-of-freedom inter-frame pose estimate be f, obtained by mapping the input parameter variable x, where x is an auxiliary parameter; since the auxiliary parameter is defined over a discrete space, x and f can theoretically be mapped one to one; w is regarded as the training-sequence coefficient obtained from the training data set, and b is the residual between the true value and the calculated value, used for correction;
2.1) division of training set and validation set: since only sequences 00-10 of the data set provided by KITTI can be used for offline training, the first M of the 00-10 sequences are used as the training set and the remaining (11-M) sequences as the test set; the training set is used for network training and the test set for verifying the accuracy of network learning;
2.2) building an offline learning deep neural network model;
2.2.1) building an optical flow extraction network, and extracting an initial optical flow field by using adjacent image frames; adopting a Brox algorithm network as an optical flow extractor, calculating optical flow between two frames of images at time t and t +1, and quantizing the calculated optical flow field by using RGB (red, green and blue) coding, so that input data is in a three-channel eight-bit depth image format;
2.2.2) building a global feature extraction network;
carrying out T1 downsampling on the whole image, then carrying out deep network training, selecting a convolutional neural network for feature extraction, carrying out training by using the global information of the optical flow diagram, and acquiring the global features of the optical flow diagram;
2.2.3) building a local feature extraction network;
dividing the depth image into four sub-images, downsampling each quadrant for T2 times, then training through a CNN filter, performing two-stage training on each sub-image, performing CNN1 and CNN2, and finally cascading a full connection layer;
the first part of the local feature extraction network consists of four branches, and each subimage is trained respectively; each of the four quadrants of the image contains some motion information for calculating the motion estimate; then, correlating the output of the first CNN-pooling layer pair with the second CNN-pooling layer; CNN1 and CNN2 extract different information from the optical flow images; assume that CNN1 extracts finer details, while CNN2 extracts coarser details, and that the information does not overlap completely;
after this stage, the four complex features are combined into one feature containing the global image information, so the network can resolve motion blur using symmetric information. The last layer connects to a fully connected network that uses the information of all four quadrants at two resolutions.
2.2.4) merging the global feature extraction network and the local feature extraction network to build an optical flow graph feature extraction network;
the CNN filters of the global feature extraction network and the local feature extraction network are used for feeding the output of the CNN filters to a next layer of full-connection layer network, and the global information of the global feature extraction network and the local information of the local feature extraction network are combined to improve the performance of the network;
2.2.5) constructing a pose estimation network, wherein the pose estimation network is responsible for integrating all the characteristics and finally finishing an estimation task of an interframe pose vector;
the pose estimation network is a regressor for finally finishing interframe pose vector estimation, the form of combining a convolution network and a fully-connected network is still adopted on the whole, all the features extracted by the feature extraction network are input into the fully-connected network, and the fully-connected layer is responsible for final feature integration and regressively fits the nonlinear relation of the pose estimation problem in the geometric mapping; the fully-connected layer comprises three layers, and the last layer finally fits the feature vectors into six-dimensional pose vectors to realize estimation of the pose vector y;
step 3): selecting a network loss function;
because this is a general regression problem, the Euclidean distance loss function is selected as the error measure: the Euclidean distance is first computed between the predicted values and the true labels of all samples in the current training batch, and then the mean of the squared distances, i.e. the mean square error (MSE), is taken;
assuming that the predicted values and the true values of the model follow a normal distribution, the model that best fits the measurements maximizes the product of the sample probabilities; the corresponding derivation yields the least-squares criterion; the Euclidean distance loss function is therefore
Loss(W) = (1/N) · Σ_{i=1}^{N} ‖f_W(X_i) − Y_i‖²
where W denotes the weight parameters of the network model, f_W(X_i) is the output of the network, Y_i is the ground-truth pose vector, and N is the size of the training set;
step 4): training a network;
4.1) randomly initializing network parameters by using Gaussian distribution;
4.2) pre-training layer by greedy to optimize network parameters layer by layer;
the greedy layer-by-layer pre-training method finds the optimal local minimum value for the filter coefficient of each layer, and then carries out global training to carry out fine tuning on the weighted value; for each branch, train the CNN1 filter and fully-connected layers with its connected fully-connected layer, then discard the fully-connected layer, feed the output of CNN1 to CNN2 and train only this new fully-connected layer; discarding the full connection layer again, connecting the two outputs of the CNN and training the third estimator; this process is repeated for each branch, then the last fully connected layer is discarded and the four quadrant outputs are connected to the last fully connected network, which trains the final estimator and fine-tunes the CNNs coefficients;
4.3) aiming at the batch normalization layer, setting a use-global-stats parameter as false in the training process, and setting the use-global-stats parameter as true in the testing process, wherein the network adopts an Euclidean distance loss function;
because the data set contains image sequences with several different camera intrinsics, in order to improve the training speed and quality of the network and to avoid the pose regressor being biased towards one particular distribution, all samples are randomly shuffled during training, and the network parameters are adjusted and optimized in the mini-batch fashion commonly used when optimizing deep learning models. The network is optimized with the Adam algorithm; the value of the learning rate has a large influence on the training of the network and is a good parameter selected through repeated experiments.
The target image size in step 1.1) is the most frequent image size in the image sequence; adjusting all images to a uniform size facilitates subsequent image processing and simplifies training.
The step size N in step 1.2) is selected to be 2. The image sequence has a certain continuity, and pose vectors between image frames at different intervals can be obtained from consecutive adjacent frames by composing pose matrices, thereby expanding the data set. Because adjacent frame images of the expanded data set need to share a certain overlapping area, the step size must not be chosen too large, otherwise there is no overlap between adjacent frames.
M in step 2.1) is 8. A large training set is required for training, but a sufficient number of test sequences is also required for testing; choosing M = 8 guarantees both the size of the training set and the verification on the test set.
T1 in step 2.2.2) is 8: downsampling the image by a factor of 8 preserves the image characteristics while avoiding an excessive amount of computation.
T2 in step 2.2.3) is 4: downsampling each quadrant by a factor of 4 preserves the image characteristics while avoiding an excessive amount of computation.
The embodiment example is shown in figure 1: a visual SLAM front-end attitude estimation method based on deep learning comprises the following specific implementation steps:
step 1): carrying out data preprocessing on the training data set;
1.1) First, the images of the first 4 sequences (00-03) of the KITTI database are cropped to the size of the images of the remaining 7 sequences, namely 1226 × 370;
1.2) The data set is then expanded using the transformation matrices between adjacent frames: the number of samples in the original data set is S, and the pose transformation matrix between time i and time j is T_{i,j}; the pose matrix between time t and time (t+2) is T_{t,t+2} = T_{t,t+1} · T_{t+1,t+2}. Using this transformation, an expansion step size of 2 is selected to expand the data set to 2S, where S is the number of training samples provided by the KITTI data set; to the original data T_{t,t+1}, T_{t+1,t+2} a new sample T_{t,t+2} is added.
1.3) data conversion
Track data provided by KITTI is converted into relative pose transformation vectors between adjacent frames by means of a Robotics Toolbox of Peter Corke, namely a rotation matrix is converted into an Euler angle, and a displacement part is converted into a displacement vector.
Step 2): and (5) constructing an offline deep neural network model.
According to the conventional function mapping f = w·x + b, the target value f is obtained by adding a bias value to the product of the coefficient w and the variable x. Using this idea, let the 6-degree-of-freedom inter-frame pose estimate be f, obtained by mapping the input parameter variable x; x can be understood as an auxiliary parameter, and since the auxiliary parameter is defined over a discrete space, x and f can theoretically be mapped one to one; w can be regarded as the training-sequence coefficients obtained from the training data set, while b can be understood as the residual between the true value and the calculated value, used for correction.
2.1) division of training and validation sets. Only 00-10 sequences of data sets provided by KITTI can be used for off-line training. The first eight sequences (00-07) in the 00-10 sequence pairs are used as training sets, and the last three sequences (08-10) are used as test sets. And performing network training by using the training set, and verifying the accuracy of network learning by using the test set.
2.2) building an offline learning deep neural network model.
2.2.1) building an optical flow extraction network, and extracting an initial optical flow field by using adjacent image frames; a Brox algorithm network is adopted as an optical flow extractor, optical flow between two frames of images at time t and t +1 is calculated, and an RGB code is used for quantizing the calculated optical flow field, so that input data is in a three-channel and eight-bit depth image format.
2.2.2) building a global feature extraction network;
and carrying out eight times of downsampling on the whole image, then carrying out deep network training, and carrying out training by using global information. 2.2.3) constructing a local feature extraction network as shown in FIG. 2;
the depth image is divided into four sub-images. Each quadrant is downsampled 4 times and then trained through the CNN filter of the feature extraction network. The last layer is trained using the output layers of the four CNN networks to derive a global inter-frame estimate.
The first part of the local feature extraction network consists of four branches of the same complexity, trained separately, which perform the first two convolution stages (CNN1 and CNN2); note that each of the four quadrants of the image contains some motion information that can be used to compute the motion estimate. The output of the first CNN-pooling layer pair is then correlated with that of the second. CNN1 and CNN2 extract different information from the optical flow images: it is assumed that CNN1 extracts finer details while CNN2 extracts coarser details, and that the information does not overlap completely.
After this stage, the four complex features are combined into one feature containing the global image information, so the network can resolve motion blur using symmetric information. The last layer connects to a fully connected network that uses the information of all four quadrants at two resolutions.
2.2.4) merging the global feature extraction network and the local feature extraction network to build the optical flow graph feature extraction network
The CNN filters of the global feature extraction network and the local feature extraction network are used to feed their outputs to the next layer of the fully-connected layer network. And the global information of the global feature extraction network is combined with the local information of the local feature extraction network to improve the performance of the network.
2.2.5) constructing a pose estimation network, wherein the pose estimation network is responsible for integrating all the characteristics and finally finishing the estimation task of the pose vectors between frames.
The pose estimation network is a regressor for finally finishing the inter-frame pose vector estimation, and the form of combining a convolution network and a fully-connected network is still adopted on the whole. And inputting all the features extracted by the feature extraction network into a full-connection network, wherein the full-connection layer is responsible for final feature integration and regresses and fits the nonlinear relation of the pose estimation problem in the geometric mapping. The fully-connected layer comprises three layers, and the last layer finally fits the feature vectors into six-dimensional pose vectors to realize estimation of the pose vector y.
Step 3): a network loss function;
because the problem is a general regression problem, the Euclidean distance loss function is selected as an error calculation mode, the Euclidean distance is firstly calculated by using the predicted values and the real labels of all samples in the current batch of training set, and then the mean value, namely the Mean Square Error (MSE), is calculated for the squares of all the distances.
MSE = (1/N) · Σ_{i=1}^{N} ‖ŷ_i − y_i‖²
Assuming that the predicted values and the true values of the model follow a normal distribution, the model that best fits the measurements maximizes the product of the sample probabilities; the corresponding derivation yields the least-squares criterion. The Euclidean distance loss function is therefore:
Loss(W) = (1/N) · Σ_{i=1}^{N} ‖f_W(X_i) − Y_i‖²
where W denotes the weight parameters of the network model, f_W(X_i) is the output of the network, Y_i is the ground-truth pose vector, and N is the size of the training set.
Step 4): training a network;
and carrying out training by using a greedy layer-by-layer pre-training method of a deep network.
4.1) randomly initializing network parameters using a gaussian distribution,
and 4.2) pre-training layer by greedy to optimize network parameters layer by layer.
A greedy, layer-by-layer pre-training approach finds the optimal local minima for the filter coefficients of each layer, and then performs global training to fine tune them. For each branch, the CNN1 filter and the fully-connected layer are trained using its connected fully-connected layer, then the fully-connected layer is dropped, the output of CNN1 is fed to CNN2 and only this new fully-connected layer is trained. The fully-connected layer is again discarded and the two outputs of CNN are connected and the third estimator is trained. This process is repeated for each branch, then the last fully connected layer is discarded and the four quadrant outputs are connected to the last fully connected net, which trains the final estimator and fine-tunes the CNNs coefficients.
4.3) aiming at the batch normalization layer, setting the use-global-stats parameter as false in the training process, and setting the use-global-stats parameter as true in the testing process. The network employs a euclidean distance loss function.
Because the data set contains image sequences with several different camera intrinsics, in order to improve the training speed and quality of the network and to avoid the pose regressor being biased towards one particular distribution, all samples are randomly shuffled during training, and the network parameters are adjusted and optimized in the mini-batch fashion commonly used when optimizing deep learning models; 1024 samples are selected per batch, and on average about 50 iterations traverse the whole training set once. The network is optimized with the Adam algorithm, the learning rate is set to 0.0002 and the momentum parameter to 0.9; the value of the learning rate has a large influence on the training of the network and is a good parameter selected through repeated experiments, as sketched below.
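As a sketch of this training configuration, the stated hyper-parameters map onto Keras roughly as follows; the tiny placeholder model and random stand-in data are assumptions standing in for the full two-branch network and the KITTI samples, and mapping the momentum parameter to Adam's beta_1 is likewise an assumption.

```python
import numpy as np
import tensorflow as tf

# Placeholder model and data standing in for the full two-branch network above.
model = tf.keras.Sequential([tf.keras.layers.Dense(6, input_shape=(128,))])
flows = np.random.rand(2048, 128).astype('float32')    # stand-in feature inputs
poses = np.random.rand(2048, 6).astype('float32')      # stand-in 6-D pose labels

optimizer = tf.keras.optimizers.Adam(learning_rate=0.0002, beta_1=0.9)  # lr 0.0002, momentum 0.9
model.compile(optimizer=optimizer, loss='mse')          # Euclidean distance (MSE) loss

# Randomly shuffle all samples and optimize in mini-batches of 1024 per batch.
model.fit(flows, poses, batch_size=1024, shuffle=True, epochs=10, verbose=0)
```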
Step 5): testing on the data set. To evaluate the proposed method, experiments were performed using the published data set. To further explore the robustness of the architecture, the tested sequences were modified manually to increase blur and to change contrast and brightness, simulating many complex-scene environments such as low brightness and motion blur.
The most common data set in the field of visual SLAM and visual odometry is the KITTI data set, which is a common testing platform for many vision algorithms. The autonomous driving platform collects data on urban streets through left and right cameras and provides accurate trajectory data for monocular, binocular and even laser-based research; for monocular vision research only the image sequence of a single camera is used. The images are undistorted, with a resolution of 1240 × 386 at a frame rate of 10 Hz; some sequences have a slightly higher resolution, and in these cases simple cropping is performed to unify all frames.
The data set provides 22 image frame sequences in total; the first 11 sequences are openly downloadable, while results for the last 11 can only be submitted online and serve as the basis for algorithm benchmarking. Therefore only the 11 sequences 00-10 can be used when designing the pose estimation model. The first 8 are used as the training set, and the three sequences 08, 09 and 10 are finally evaluated as the test set.
The present invention is compared with ORB-SLAM. The comparison shows that the deep learning method is clearly superior to ORB-SLAM in runtime and achieves higher accuracy in the predicted trajectory. The framework design also greatly improves the stability of the algorithm: the average displacement error of the trajectory stays at about 10%, and the performance does not vary drastically between sequences. The network structure is implemented with the TensorFlow framework and trained on an NVIDIA GTX1080Ti GPU; the accumulated time of all stages of the algorithm is below 40 ms, i.e. more than ten frames per second on the KITTI data set, which meets the real-time computation requirement.
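For an evaluation of this kind, the predicted relative poses must be composed back into an absolute trajectory before a displacement error can be measured; a minimal sketch follows, in which the 'zyx' Euler convention and the definition of the percentage error as mean position error over path length are assumptions.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def compose_trajectory(rel_poses):
    """Chain predicted 6-D relative poses [euler(zyx), t] into absolute camera positions."""
    T_abs, trajectory = np.eye(4), [np.zeros(3)]
    for p in rel_poses:
        T = np.eye(4)
        T[:3, :3] = Rotation.from_euler('zyx', p[:3]).as_matrix()
        T[:3, 3] = p[3:]
        T_abs = T_abs @ T
        trajectory.append(T_abs[:3, 3])
    return np.array(trajectory)

def mean_displacement_error(pred_xyz, gt_xyz):
    """Average displacement error of the trajectory, as a fraction of path length."""
    err = np.linalg.norm(pred_xyz - gt_xyz, axis=1).mean()
    path = np.linalg.norm(np.diff(gt_xyz, axis=0), axis=1).sum()
    return err / path
```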

Claims (6)

1. A visual SLAM front-end attitude estimation method based on deep learning is characterized by comprising the following steps:
step 1): carrying out data preprocessing on the training data set;
1.1) firstly, cutting images in a KITTI database until the images have the same size;
1.2) then utilizing a conversion matrix between adjacent frames to expand the data set;
selecting a step length N for the expansion: let the number of samples in the original data set be S, and let the pose transformation matrix between time i and time j be T_{i,j}; the pose matrix between time t and time (t+N) is then T_{t,t+N} = T_{t,t+1} · T_{t+1,t+2} · T_{t+2,t+3} · ... · T_{t+N-1,t+N}; using this transformation relation with expansion step length N, the data set is expanded to N·S samples, where S is the number of training samples provided by the KITTI data set;
1.3) data conversion;
converting the track data provided by KITTI from a pose matrix form into a relative pose transformation vector between adjacent frames by using a Robotics Toolbox of Peter Corke, namely converting a rotation matrix into an Euler angle and converting a displacement part into a displacement vector;
step 2): constructing an offline deep neural network model;
setting the interframe pose estimation 6 freedom parameter as f, mapping an input parameter variable x of a target f to obtain, and then taking x as an auxiliary parameter; w is a training sequence coefficient obtained by a training data set, b is a residual error value of a true value and a calculated value and is used for correction;
2.1) division of training set and validation set: as only 00-10 sequences of a data set provided by KITTI can be used for off-line training, the first M sequences in the 00-10 sequence pairs of the data set provided by KITTI are used as a training set, the last 11-M sequences are used as a test set, the training set is used for network training, and the test set is used for verifying the accuracy of network learning;
2.2) building an offline learning deep neural network model;
2.2.1) building an optical flow extraction network, and finishing the extraction of an initial optical flow field by using adjacent image frames: adopting a Brox algorithm network as an optical flow extractor, calculating optical flow between two frames of images at time t and t +1, and quantizing the calculated optical flow field by using RGB (red, green and blue) coding, so that input data is in a three-channel eight-bit depth image format;
2.2.2) building a global feature extraction network;
carrying out T1 downsampling on the whole image, then carrying out deep network training, selecting a convolutional neural network for feature extraction, carrying out training by using the global information of the optical flow diagram, and acquiring the global features of the optical flow diagram;
2.2.3) building a local feature extraction network;
dividing the depth image into four sub-images, downsampling each quadrant for T2 times, then training through a CNN filter, performing two-stage training on each sub-image, performing CNN1 and CNN2, and finally cascading a full connection layer;
the first part of the local feature extraction network consists of four branches, and each subimage is trained respectively; each of the four quadrants of the image contains motion information for calculating a motion estimate; then, correlating the output of the first CNN-pooling layer pair with the second CNN-pooling layer; CNN1 and CNN2 extract different information from the optical flow images; CNN1 extracts finer details, while CNN2 extracts coarser details, and these information do not overlap completely;
combining four complex features together to form a feature containing global image information, so that the network can resolve motion blur with symmetric information, the last layer connecting a fully connected network using information of all four quadrants at two resolutions;
2.2.4) merging the global feature extraction network and the local feature extraction network to build an optical flow graph feature extraction network;
the CNN filters of the global feature extraction network and the local feature extraction network are used for feeding the output of the CNN filters to a next layer of full-connection layer network, and the global information of the global feature extraction network and the local information of the local feature extraction network are combined to improve the performance of the network;
2.2.5) constructing a pose estimation network, wherein the pose estimation network is responsible for integrating all the characteristics and finally finishing an estimation task of an interframe pose vector;
the pose estimation network is a regressor for finally finishing interframe pose vector estimation, all the features extracted by the feature extraction network are input into the fully-connected network in a form of combining a convolutional network and the fully-connected network, and the fully-connected layer is responsible for final feature integration and regressively fitting the nonlinear relation of the pose estimation problem in the geometric mapping; the fully-connected layer comprises three layers, and the last layer finally fits the feature vectors into six-dimensional pose vectors to realize estimation of the pose vector y;
step 3): selecting a network loss function;
selecting a Euclidean distance loss function as an error calculation mode, wherein the error firstly calculates Euclidean distances by using predicted values and real labels of all samples in a current batch of training set, and then calculates an average value, namely Mean Square Error (MSE), of squares of all the distances;
assuming that the predicted values and the true values of the model follow a normal distribution, the model that best fits the measurements maximizes the product of the sample probabilities; the corresponding derivation yields the least-squares criterion; the Euclidean distance loss function is therefore
Loss(W) = (1/N) · Σ_{i=1}^{N} ‖f_W(X_i) − Y_i‖²
wherein W denotes the weight parameters of the network model, f_W(X_i) is the output of the network, Y_i is the ground-truth pose vector, and N is the size of the training set;
step 4): training a network;
4.1) randomly initializing network parameters by using Gaussian distribution;
4.2) pre-training layer by greedy to optimize network parameters layer by layer;
the greedy layer-by-layer pre-training method finds the optimal local minimum value for the filter coefficient of each layer, and then carries out global training to carry out fine tuning on the weighted value; for each branch, train the CNN1 filter and fully-connected layers with its connected fully-connected layers, then discard the fully-connected layers, feed the output of CNN1 to CNN2 and train only the new fully-connected layers; discarding the full connection layer again, connecting the two outputs of the CNN and training the third estimator; repeating this process for each branch, then discarding the last fully connected layer and connecting the four quadrant outputs to the last fully connected network, which trains the final estimator and fine-tunes the CNNs coefficients;
4.3) aiming at the batch normalization layer, setting a use-global-stats parameter as false in the training process, and setting the use-global-stats parameter as true in the testing process, wherein the network adopts an Euclidean distance loss function;
randomly scattering all samples during training, and adjusting and optimizing network parameters in a mini-batch mode of a deep learning model; and optimizing the network by adopting an adam algorithm.
2. The visual SLAM front end pose estimation method based on deep learning of claim 1, wherein the method comprises the following steps:
the size of the target image in the step 1.1) is the size with the largest number of image sizes in the image sequence.
3. The visual SLAM front end pose estimation method based on deep learning of claim 1, wherein the method comprises the following steps:
the step size N in step 1.2) is selected to be 2.
4. The visual SLAM front end pose estimation method based on deep learning of claim 1, wherein the method comprises the following steps:
m in said step 2.1) is 8.
5. The visual SLAM front end pose estimation method based on deep learning of claim 1, wherein the method comprises the following steps:
t1 in said step 2.2.2) is 8.
6. The visual SLAM front end pose estimation method based on deep learning of claim 1, wherein the method comprises the following steps:
t2 in said step 2.2.3) is 4.
CN201911278664.3A 2019-12-13 2019-12-13 Visual SLAM front-end attitude estimation method based on deep learning Active CN111127557B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911278664.3A CN111127557B (en) 2019-12-13 2019-12-13 Visual SLAM front-end attitude estimation method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911278664.3A CN111127557B (en) 2019-12-13 2019-12-13 Visual SLAM front-end attitude estimation method based on deep learning

Publications (2)

Publication Number Publication Date
CN111127557A true CN111127557A (en) 2020-05-08
CN111127557B CN111127557B (en) 2022-12-13

Family

ID=70498943

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911278664.3A Active CN111127557B (en) 2019-12-13 2019-12-13 Visual SLAM front-end attitude estimation method based on deep learning

Country Status (1)

Country Link
CN (1) CN111127557B (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111598951A (en) * 2020-05-18 2020-08-28 清华大学 Method, device and storage medium for identifying space target
CN111667535A (en) * 2020-06-04 2020-09-15 电子科技大学 Six-degree-of-freedom pose estimation method for occlusion scene
CN111833400A (en) * 2020-06-10 2020-10-27 广东工业大学 Camera position and posture positioning method
CN111931873A (en) * 2020-09-28 2020-11-13 支付宝(杭州)信息技术有限公司 Image recognition method and device
CN111967542A (en) * 2020-10-23 2020-11-20 江西小马机器人有限公司 Meter identification secondary positioning method based on depth feature points
CN112344922A (en) * 2020-10-26 2021-02-09 中国科学院自动化研究所 Monocular vision odometer positioning method and system
CN112446328A (en) * 2020-11-27 2021-03-05 汇纳科技股份有限公司 Monocular depth estimation system, method, device and computer-readable storage medium
CN112733921A (en) * 2020-12-31 2021-04-30 深圳辰视智能科技有限公司 Neural network loss function calculation method and system for predicting rigid body 6D posture
CN113436251A (en) * 2021-06-24 2021-09-24 东北大学 Pose estimation system and method based on improved YOLO6D algorithm
CN113724325A (en) * 2021-05-31 2021-11-30 西安理工大学 Multi-scene monocular camera pose regression method based on graph convolution network
CN113989318A (en) * 2021-10-20 2022-01-28 电子科技大学 Monocular vision odometer pose optimization and error correction method based on deep learning
CN114415698A (en) * 2022-03-31 2022-04-29 深圳市普渡科技有限公司 Robot, positioning method and device of robot and computer equipment
CN115187638A (en) * 2022-09-07 2022-10-14 南京逸智网络空间技术创新研究院有限公司 Unsupervised monocular depth estimation method based on optical flow mask
CN115358962A (en) * 2022-10-18 2022-11-18 中国第一汽车股份有限公司 End-to-end visual odometer method and device
CN117495970A (en) * 2024-01-03 2024-02-02 中国科学技术大学 Template multistage matching-based chemical instrument pose estimation method, equipment and medium
CN117495970B (en) * 2024-01-03 2024-05-14 中国科学技术大学 Template multistage matching-based chemical instrument pose estimation method, equipment and medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106658023A (en) * 2016-12-21 2017-05-10 山东大学 End-to-end visual odometer and method based on deep learning
CN108921893A (en) * 2018-04-24 2018-11-30 华南理工大学 A kind of image cloud computing method and system based on online deep learning SLAM
CN110490928A (en) * 2019-07-05 2019-11-22 天津大学 A kind of camera Attitude estimation method based on deep neural network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106658023A (en) * 2016-12-21 2017-05-10 山东大学 End-to-end visual odometer and method based on deep learning
CN108921893A (en) * 2018-04-24 2018-11-30 华南理工大学 A kind of image cloud computing method and system based on online deep learning SLAM
CN110490928A (en) * 2019-07-05 2019-11-22 天津大学 A kind of camera Attitude estimation method based on deep neural network

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111598951A (en) * 2020-05-18 2020-08-28 清华大学 Method, device and storage medium for identifying space target
CN111598951B (en) * 2020-05-18 2022-09-30 清华大学 Method, device and storage medium for identifying space target
CN111667535A (en) * 2020-06-04 2020-09-15 电子科技大学 Six-degree-of-freedom pose estimation method for occlusion scene
CN111833400A (en) * 2020-06-10 2020-10-27 广东工业大学 Camera position and posture positioning method
CN111833400B (en) * 2020-06-10 2023-07-28 广东工业大学 Camera pose positioning method
CN111931873A (en) * 2020-09-28 2020-11-13 支付宝(杭州)信息技术有限公司 Image recognition method and device
CN111931873B (en) * 2020-09-28 2020-12-22 支付宝(杭州)信息技术有限公司 Image recognition method and device
CN111967542A (en) * 2020-10-23 2020-11-20 江西小马机器人有限公司 Meter identification secondary positioning method based on depth feature points
CN111967542B (en) * 2020-10-23 2021-01-29 江西小马机器人有限公司 Meter identification secondary positioning method based on depth feature points
CN112344922A (en) * 2020-10-26 2021-02-09 中国科学院自动化研究所 Monocular vision odometer positioning method and system
CN112446328A (en) * 2020-11-27 2021-03-05 汇纳科技股份有限公司 Monocular depth estimation system, method, device and computer-readable storage medium
CN112446328B (en) * 2020-11-27 2023-11-17 汇纳科技股份有限公司 Monocular depth estimation system, method, apparatus, and computer-readable storage medium
CN112733921A (en) * 2020-12-31 2021-04-30 深圳辰视智能科技有限公司 Neural network loss function calculation method and system for predicting rigid body 6D posture
CN112733921B (en) * 2020-12-31 2024-05-17 深圳辰视智能科技有限公司 Neural network loss function calculation method and system for predicting rigid body 6D posture
CN113724325A (en) * 2021-05-31 2021-11-30 西安理工大学 Multi-scene monocular camera pose regression method based on graph convolution network
CN113436251A (en) * 2021-06-24 2021-09-24 东北大学 Pose estimation system and method based on improved YOLO6D algorithm
CN113436251B (en) * 2021-06-24 2024-01-09 东北大学 Pose estimation system and method based on improved YOLO6D algorithm
CN113989318A (en) * 2021-10-20 2022-01-28 电子科技大学 Monocular vision odometer pose optimization and error correction method based on deep learning
CN113989318B (en) * 2021-10-20 2023-04-07 电子科技大学 Monocular vision odometer pose optimization and error correction method based on deep learning
CN114415698B (en) * 2022-03-31 2022-11-29 深圳市普渡科技有限公司 Robot, positioning method and device of robot and computer equipment
CN114415698A (en) * 2022-03-31 2022-04-29 深圳市普渡科技有限公司 Robot, positioning method and device of robot and computer equipment
CN115187638A (en) * 2022-09-07 2022-10-14 南京逸智网络空间技术创新研究院有限公司 Unsupervised monocular depth estimation method based on optical flow mask
CN115358962B (en) * 2022-10-18 2023-01-10 中国第一汽车股份有限公司 End-to-end visual odometer method and device
CN115358962A (en) * 2022-10-18 2022-11-18 中国第一汽车股份有限公司 End-to-end visual odometer method and device
CN117495970A (en) * 2024-01-03 2024-02-02 中国科学技术大学 Template multistage matching-based chemical instrument pose estimation method, equipment and medium
CN117495970B (en) * 2024-01-03 2024-05-14 中国科学技术大学 Template multistage matching-based chemical instrument pose estimation method, equipment and medium

Also Published As

Publication number Publication date
CN111127557B (en) 2022-12-13

Similar Documents

Publication Publication Date Title
CN111127557B (en) Visual SLAM front-end attitude estimation method based on deep learning
CN110111366B (en) End-to-end optical flow estimation method based on multistage loss
US10719940B2 (en) Target tracking method and device oriented to airborne-based monitoring scenarios
CN109840471B (en) Feasible road segmentation method based on improved Unet network model
CN111862126B (en) Non-cooperative target relative pose estimation method combining deep learning and geometric algorithm
CN110348445B (en) Instance segmentation method fusing void convolution and edge information
CN110781776B (en) Road extraction method based on prediction and residual refinement network
CN111709410B (en) Behavior identification method for strong dynamic video
Shi et al. Calibrcnn: Calibrating camera and lidar by recurrent convolutional neural network and geometric constraints
CN111986240A (en) Drowning person detection method and system based on visible light and thermal imaging data fusion
RU2476825C2 (en) Method of controlling moving object and apparatus for realising said method
CN114943963A (en) Remote sensing image cloud and cloud shadow segmentation method based on double-branch fusion network
CN109584299B (en) Positioning method, positioning device, terminal and storage medium
CN112419317B (en) Visual loop detection method based on self-coding network
CN110929649B (en) Network and difficult sample mining method for small target detection
CN116434088A (en) Lane line detection and lane auxiliary keeping method based on unmanned aerial vehicle aerial image
Françani et al. Dense prediction transformer for scale estimation in monocular visual odometry
CN113034398A (en) Method and system for eliminating jelly effect in urban surveying and mapping based on artificial intelligence
CN114419421A (en) Subway tunnel crack identification system and method based on images
CN113888629A (en) RGBD camera-based rapid object three-dimensional pose estimation method
CN116883457B (en) Light multi-target tracking method based on detection tracking joint network and mixed density network
CN115063717B (en) Video target detection and tracking method based on real scene modeling of key area
CN115984592A (en) Point-line fusion feature matching method based on SuperPoint + SuperGlue
CN114743105A (en) Depth privilege visual odometer method based on cross-modal knowledge distillation
CN111160115B (en) Video pedestrian re-identification method based on twin double-flow 3D convolutional neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant