CN111127557A - Visual SLAM front-end attitude estimation method based on deep learning - Google Patents

Visual SLAM front-end attitude estimation method based on deep learning

Info

Publication number
CN111127557A
Authority
CN
China
Prior art keywords
network
training
layer
pose
feature extraction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911278664.3A
Other languages
Chinese (zh)
Other versions
CN111127557B (en)
Inventor
高嘉瑜
李斌
李阳
景鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 20 Research Institute
Original Assignee
CETC 20 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 20 Research Institute filed Critical CETC 20 Research Institute
Priority to CN201911278664.3A priority Critical patent/CN111127557B/en
Publication of CN111127557A publication Critical patent/CN111127557A/en
Application granted granted Critical
Publication of CN111127557B publication Critical patent/CN111127557B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/70 - Determining position or orientation of objects or cameras
    • G06T 7/73 - Determining position or orientation of objects or cameras using feature-based methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 - Special algorithmic details
    • G06T 2207/20081 - Training; Learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 - Special algorithmic details
    • G06T 2207/20084 - Artificial neural networks [ANN]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 - Subject of image; Context of image processing
    • G06T 2207/30241 - Trajectory

Abstract

The invention provides a deep-learning-based visual SLAM front-end pose estimation method that estimates the pose transformation between frames in real time. The original data set is first preprocessed, and a Brox network is then constructed to extract dense optical flow from the input consecutive frames. The extracted optical flow map is fed into two feature extraction branches: one branch extracts high-dimensional features from global information, while the other divides the optical flow map into 4 sub-images and samples each separately to obtain image features. Finally, the features produced by the two branches are fused and passed to a cascaded fully connected network that estimates the pose between two adjacent frames. The invention solves the problem of true scale estimation in monocular vision, extracts camera motion and scale information from both global and local information, and improves the learning ability and intelligence level of the robot.

Description

Visual SLAM front-end attitude estimation method based on deep learning
Technical Field
The invention relates to the field of visual navigation, and in particular to a visual SLAM front-end pose estimation method. Continuous image frames are input end to end and the pose transformation between frames is estimated in real time, providing unmanned aerial vehicles with a highly robust visual SLAM method based on deep learning.
Background
Simultaneous localization and mapping (SLAM) is a technology whereby an intelligent agent such as an unmanned aerial vehicle uses its onboard sensors to build a map of the surrounding environment during motion and to localize itself within the map being built. When an unmanned aerial vehicle operates in certain special environments, it is easily disturbed by the environment, so that the GPS signal is weakened or lost entirely. A complete SLAM framework consists of four parts: front-end tracking, back-end optimization, loop detection and map reconstruction. Front-end tracking, i.e. the visual odometer, is responsible for the preliminary estimation of the camera pose between frames and of the positions of map points; back-end optimization receives the pose information measured by the visual odometer at the front end and computes a maximum a posteriori estimate; loop detection judges whether the robot has returned to a previously visited position and corrects the estimation error by closing the loop; and map reconstruction builds a map suited to the task requirements from the camera poses and images.
However, since 2017 the traditional visual SLAM schemes have made no substantial progress, and their robustness remains low under adverse conditions such as poor illumination or large illumination changes.
With the development of deep learning in the field of computer vision, more and more vision problems are being solved with deep learning. Combining deep learning with SLAM relaxes the application limitations caused by hand-crafted features in modules such as visual odometry and scene recognition, and improves the learning ability and intelligence level of the robot. Feature point extraction in traditional SLAM algorithms is easily affected by scene factors, in particular illumination intensity and scene content, whereas the features extracted by a deep network generalize better.
Visual pose estimation is a basic building block of a visual SLAM system and implements the function of the system's front-end visual odometer. Current visual odometers are realized mainly by learning methods and by geometric methods. The geometric method is mainly realized by extracting features (such as ORB features, SIFT features and the like) in two consecutive images, matching them, and computing the motion between the two images.
However, both methods have certain defects. The learning method has poor generality: in particular, when the test scene differs greatly from the training scene or the motion speed changes, the performance of the algorithm is strongly affected.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides a visual SLAM front-end pose estimation method based on deep learning. The method addresses the poor generality of existing visual pose estimation realized with learning methods, as well as the poor real-time performance, difficult feature detection and poor robustness of visual odometers realized with geometric methods.
The technical scheme adopted by the invention for solving the technical problem comprises the following specific steps:
step 1): carrying out data preprocessing on the training data set;
1.1) firstly, cropping the images in the KITTI database so that they all have the same size;
1.2) then utilizing a conversion matrix between adjacent frames to expand the data set;
The expansion uses a step length N: let the number of samples in the original data set be S, and let the pose transformation matrix between time i and time j be T_{i,j}; the pose matrix between time t and time (t+N) is then T_{t,t+N} = T_{t,t+1} · T_{t+1,t+2} · T_{t+2,t+3} · ... · T_{t+N-1,t+N}; using this transformation relation with expansion step length N, the data set is expanded to N·S samples, where S is the number of training samples provided by the KITTI data set;
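By way of illustration only, the following sketch chains adjacent 4x4 homogeneous pose matrices to obtain the relative pose over a step of N frames; the list T_adj of adjacent transforms and the other names are hypothetical and not part of the claimed method.

```python
import numpy as np

def expand_poses(T_adj, N):
    """Chain adjacent relative pose matrices T_adj[t] = T_{t,t+1} (4x4 homogeneous)
    to obtain the relative poses T_{t,t+N} over a step of N frames."""
    expanded = []
    for t in range(len(T_adj) - N + 1):
        T = np.eye(4)
        for k in range(N):
            # T_{t,t+N} = T_{t,t+1} . T_{t+1,t+2} . ... . T_{t+N-1,t+N}
            T = T @ T_adj[t + k]
        expanded.append(T)
    return expanded

# With N = 2, each pair of adjacent transforms T_{t,t+1}, T_{t+1,t+2}
# yields one additional sample T_{t,t+2}, growing S samples towards N*S.
```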
1.3) data conversion;
converting the trajectory data provided by KITTI from pose-matrix form into relative pose transformation vectors between adjacent frames by using Peter Corke's Robotics Toolbox, i.e. converting the rotation matrix into Euler angles and the displacement part into a displacement vector;
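A minimal sketch of this conversion, using SciPy's Rotation class in place of Peter Corke's Robotics Toolbox (a substitution made purely for illustration); the 'zyx' Euler convention is likewise an assumption, since the text does not fix one.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def pose_matrix_to_vector(T_rel):
    """Convert a 4x4 relative pose matrix into a 6-D vector
    [roll, pitch, yaw, tx, ty, tz] between adjacent frames."""
    R = T_rel[:3, :3]                                 # rotation part
    t = T_rel[:3, 3]                                  # displacement part
    euler = Rotation.from_matrix(R).as_euler('zyx')   # rotation matrix -> Euler angles
    return np.concatenate([euler, t])
```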
step 2): constructing an offline deep neural network model;
let the 6-degree-of-freedom inter-frame pose parameter be f; the target f is obtained by mapping an input parameter variable x, where x acts as an auxiliary parameter; w is the training-sequence coefficient obtained from the training data set, and b is the residual between the true value and the calculated value, used for correction;
2.1) division of training set and validation set: since only sequences 00-10 of the data set provided by KITTI can be used for offline training, the first M of the 00-10 sequences are used as the training set and the remaining (11-M) sequences as the test set; the training set is used for network training and the test set for verifying the accuracy of network learning;
2.2) building an offline learning deep neural network model;
2.2.1) building an optical flow extraction network, and finishing the extraction of an initial optical flow field by using adjacent image frames: adopting a Brox algorithm network as an optical flow extractor, calculating optical flow between two frames of images at time t and t +1, and quantizing the calculated optical flow field by using RGB (red, green and blue) coding, so that input data is in a three-channel eight-bit depth image format;
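To illustrate the idea of quantizing a flow field into a three-channel, eight-bit image, the sketch below uses OpenCV's Farnebäck optical flow as a stand-in for the Brox network; both the stand-in flow method and the HSV-style encoding are assumptions for illustration only.

```python
import cv2
import numpy as np

def flow_to_rgb(frame_t, frame_t1):
    """Compute dense optical flow between two consecutive grayscale frames and
    quantize it into a 3-channel, 8-bit image (direction -> hue, magnitude -> value)."""
    flow = cv2.calcOpticalFlowFarneback(frame_t, frame_t1, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hsv = np.zeros((*frame_t.shape, 3), dtype=np.uint8)
    hsv[..., 0] = ang * 180 / np.pi / 2                              # direction as hue
    hsv[..., 1] = 255
    hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX)  # magnitude as value
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2RGB)  # three-channel, eight-bit input image
```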
2.2.2) building a global feature extraction network;
carrying out T1 downsampling on the whole image, then carrying out deep network training, selecting a convolutional neural network for feature extraction, carrying out training by using the global information of the optical flow diagram, and acquiring the global features of the optical flow diagram;
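A minimal Keras sketch of such a global branch follows; the filter counts, kernel sizes and the use of average pooling for the 8x downsampling are illustrative assumptions, since the patent does not fix the layer dimensions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_global_branch(input_shape=(370, 1226, 3), down_factor=8):
    """Global feature extraction branch: downsample the whole optical-flow image
    by a factor T1, then extract global features with a small CNN."""
    flow = layers.Input(shape=input_shape)
    x = layers.AveragePooling2D(pool_size=down_factor)(flow)        # T1 = 8 downsampling
    x = layers.Conv2D(32, 3, activation='relu', padding='same')(x)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Conv2D(64, 3, activation='relu', padding='same')(x)
    x = layers.Flatten()(x)                                         # global feature vector
    return tf.keras.Model(flow, x, name='global_branch')
```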
2.2.3) building a local feature extraction network;
dividing the encoded optical flow image into four sub-images, downsampling each quadrant by a factor of T2 and then training it through CNN filters; each sub-image undergoes two-stage training, CNN1 followed by CNN2, with a fully connected layer cascaded at the end;
the first part of the local feature extraction network consists of four branches, each sub-image being trained separately; each of the four quadrants of the image contains motion information used to compute the motion estimate; the output of the first CNN-pooling layer pair is then correlated with that of the second CNN-pooling layer; CNN1 and CNN2 extract different information from the optical flow images: CNN1 extracts finer details while CNN2 extracts coarser details, and the two do not overlap completely;
the four resulting features are combined into one feature containing global image information, so that the network can resolve motion blur using symmetric information; the last layer connects to a fully connected network that uses the information of all four quadrants at two resolutions, as sketched below;
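The sketch below mirrors this structure under the same illustrative assumptions (layer sizes are not specified by the patent): each quadrant passes through two CNN-pooling stages, the two stage outputs are concatenated per quadrant, and the four quadrant features are then combined.

```python
import tensorflow as tf
from tensorflow.keras import layers

def quadrant_branch(quad_shape, name):
    """One local branch: two CNN-pooling stages (CNN1 finer, CNN2 coarser),
    whose outputs are concatenated to form the quadrant feature."""
    quad = layers.Input(shape=quad_shape)
    x = layers.AveragePooling2D(4)(quad)                            # T2 = 4 downsampling
    c1 = layers.Conv2D(32, 3, activation='relu', padding='same')(x)
    p1 = layers.MaxPooling2D(2)(c1)                                 # CNN1-pooling pair
    c2 = layers.Conv2D(64, 3, activation='relu', padding='same')(p1)
    p2 = layers.MaxPooling2D(2)(c2)                                 # CNN2-pooling pair
    feat = layers.Concatenate()([layers.Flatten()(p1), layers.Flatten()(p2)])
    return tf.keras.Model(quad, feat, name=name)

def build_local_branch(quad_shape=(185, 613, 3)):
    """Split the optical-flow image into four quadrants, train each separately,
    then merge the four quadrant features into one local feature vector."""
    quads = [layers.Input(shape=quad_shape) for _ in range(4)]
    feats = [quadrant_branch(quad_shape, f'quad_{i}')(q) for i, q in enumerate(quads)]
    merged = layers.Concatenate()(feats)    # information of all four quadrants, two resolutions
    return tf.keras.Model(quads, merged, name='local_branch')
```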
2.2.4) merging the global feature extraction network and the local feature extraction network to build an optical flow graph feature extraction network;
the outputs of the CNN filters of the global feature extraction network and of the local feature extraction network are fed to the next fully connected layer, so that the global information of the former is combined with the local information of the latter to improve network performance;
2.2.5) constructing a pose estimation network, wherein the pose estimation network is responsible for integrating all the characteristics and finally finishing an estimation task of an interframe pose vector;
the pose estimation network is the regressor that finally completes the inter-frame pose vector estimation; the overall form remains a combination of convolutional network and fully connected network: all features extracted by the feature extraction network are input into the fully connected network, and the fully connected layers are responsible for the final feature integration and for regressing the nonlinear relation of the pose estimation problem in the geometric mapping; there are three fully connected layers, and the last layer finally fits the feature vector into a six-dimensional pose vector, realizing the estimation of the pose vector y;
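A sketch, again with assumed hidden-layer widths, of how the two feature branches could be fused and regressed into the six-dimensional pose vector by three fully connected layers:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_pose_regressor(global_dim, local_dim):
    """Fuse global and local features and regress the 6-D inter-frame pose vector
    [roll, pitch, yaw, tx, ty, tz] through three fully connected layers."""
    g = layers.Input(shape=(global_dim,))
    l = layers.Input(shape=(local_dim,))
    x = layers.Concatenate()([g, l])               # merge the two feature branches
    x = layers.Dense(1024, activation='relu')(x)   # final feature integration
    x = layers.Dense(128, activation='relu')(x)    # non-linear regression of the mapping
    pose = layers.Dense(6)(x)                      # six-dimensional pose vector y
    return tf.keras.Model([g, l], pose, name='pose_regressor')
```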
step 3): selecting a network loss function;
the Euclidean distance loss function is selected as the error measure: for all samples in the current training batch, the Euclidean distance between the predicted value and the true label is computed, and the mean of the squared distances, i.e. the mean square error (MSE), is taken;
assuming that the predicted values and the true values of the model follow a normal distribution, the model that best fits the measurements is the one that maximizes the product of the sample probabilities; carrying out the corresponding derivation yields the least-squares criterion; the Euclidean distance loss function is therefore
Loss(W) = (1/N) · Σ_{i=1}^{N} ‖f_W(X_i) − Y_i‖²
where W denotes the weight parameters of the network model, f_W(X_i) is the output of the network, Y_i is the ground-truth pose vector, and N is the size of the training set;
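A minimal sketch of this loss in the notation above, where y_pred plays the role of the network output f_W(X_i) and y_true the ground-truth pose vectors Y_i:

```python
import tensorflow as tf

def euclidean_pose_loss(y_true, y_pred):
    """Mean square error over the batch: the average of the squared Euclidean
    distances between predicted and ground-truth 6-D pose vectors."""
    sq_dist = tf.reduce_sum(tf.square(y_pred - y_true), axis=-1)  # ||f_W(X_i) - Y_i||^2
    return tf.reduce_mean(sq_dist)                                # average over N samples
```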
step 4): training a network;
4.1) randomly initializing network parameters by using Gaussian distribution;
4.2) pre-training layer by greedy to optimize network parameters layer by layer;
the greedy layer-by-layer pre-training method finds a good local minimum for the filter coefficients of each layer and then performs global training to fine-tune the weights; for each branch, the CNN1 filter is trained together with its connected fully connected layers, the fully connected layers are then discarded, the output of CNN1 is fed to CNN2 and only the new fully connected layers are trained; the fully connected layers are discarded again, the two CNN outputs are connected and a third estimator is trained; this process is repeated for each branch, after which the last fully connected layers are discarded and the four quadrant outputs are connected to the final fully connected network, which trains the final estimator and fine-tunes the CNN coefficients, as sketched below;
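The sketch below outlines this procedure for one quadrant branch under the same illustrative Keras assumptions; the helper fit_with_temp_head is hypothetical, flows and poses stand in for quadrant flow crops and pose labels, and for brevity the second stage also updates CNN2 rather than strictly only the new fully connected layers.

```python
import tensorflow as tf
from tensorflow.keras import layers

def fit_with_temp_head(inputs, features, flows, poses):
    """Train the layers producing `features` through a temporary FC head,
    then discard the head and keep only the trained feature layers."""
    head = layers.Dense(6)(layers.Dense(128, activation='relu')(features))
    stage = tf.keras.Model(inputs, head)
    stage.compile(optimizer='adam', loss='mse')
    stage.fit(flows, poses, epochs=5, verbose=0)
    return features

def greedy_pretrain_branch(quad_shape, flows, poses):
    quad = layers.Input(shape=quad_shape)
    pooled = layers.AveragePooling2D(4)(quad)
    cnn1 = layers.Conv2D(32, 3, activation='relu', padding='same')
    cnn2 = layers.Conv2D(64, 3, activation='relu', padding='same')
    p1 = layers.MaxPooling2D(2)(cnn1(pooled))
    p2 = layers.MaxPooling2D(2)(cnn2(p1))
    # Stage 1: train CNN1 together with a temporary fully connected head.
    fit_with_temp_head(quad, layers.Flatten()(p1), flows, poses)
    # Stage 2: discard that head, freeze CNN1, feed its output to CNN2
    # and train the new stage only.
    cnn1.trainable = False
    fit_with_temp_head(quad, layers.Flatten()(p2), flows, poses)
    # Stage 3: discard again, concatenate both CNN outputs and train a third estimator.
    both = layers.Concatenate()([layers.Flatten()(p1), layers.Flatten()(p2)])
    fit_with_temp_head(quad, both, flows, poses)
    return tf.keras.Model(quad, both)   # the final fully connected network is trained
                                        # afterwards, fine-tuning all CNN coefficients
```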
4.3) for the batch normalization layers, the use_global_stats parameter is set to false during training and to true during testing; the network uses the Euclidean distance loss function;
all samples are randomly shuffled during training, and the network parameters are adjusted and optimized in the mini-batch fashion of deep learning models; the network is optimized with the Adam algorithm.
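Regarding the use_global_stats setting of step 4.3): in Keras terms (an assumed translation of Caffe's BatchNorm parameter), it corresponds to the training flag of the BatchNormalization layer, as this small sketch shows.

```python
import tensorflow as tf
from tensorflow.keras import layers

bn = layers.BatchNormalization()
x = tf.random.normal((8, 16))

# Training pass: use_global_stats = false -> normalize with batch statistics
# and update the running (global) mean and variance.
y_train = bn(x, training=True)

# Test pass: use_global_stats = true -> normalize with the stored global statistics.
y_test = bn(x, training=False)
```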
The size of the target image in the step 1.1) is the size with the largest number of image sizes in the image sequence.
The step size N in step 1.2) is selected to be 2.
M in said step 2.1) is 8.
T1 in said step 2.2.2) is 8.
T2 in said step 2.2.3) is 4.
The invention has the beneficial effects that:
A) different from the traditional geometric-optimization algorithm, the method combines the characteristics of the deep learning algorithm, learns the fitting pose estimation function through the training process on the premise of not needing any camera external parameters, and simultaneously solves the problem of real scale estimation in monocular vision.
B) The global and local information can be used to extract camera motion and scale information while processing noise in the input, and the new features extracted using CNN are robust in images with different contrast and blur parameters.
C) The combination of deep learning and SLAM improves the application limitation caused by manual design characteristics such as visual odometry, scene recognition and the like, and improves the learning ability and the intelligent level of the robot. The feature point extraction in the traditional SLAM algorithm is easily influenced by scene factors, particularly illumination intensity and scene content, and the features extracted by the deep network have better generalization performance.
Drawings
FIG. 1 is a schematic diagram of the basic flow of the present invention.
FIG. 2 is a method for constructing a local feature extraction network according to the present invention.
FIG. 3 is an offline deep neural network model constructed by the present invention.
Detailed Description
The invention is further illustrated with reference to the following figures and examples.
After continuous image frames are input end to end, the pose transformation between frames is estimated in real time. The original data set is first preprocessed, including data-set expansion and data conversion, and a Brox network is then constructed to extract dense optical flow from the input consecutive frames. The extracted optical flow map is fed into two feature extraction branches: one branch extracts high-dimensional features from the global information, while the other divides the optical flow map into 4 sub-images and samples each separately to obtain image features. The features obtained by training the two branches are then fused, and a cascaded fully connected network performs pose estimation to obtain the pose between two adjacent frames. The network is trained to obtain suitable parameters, and the trained network is used to test accuracy and runtime on the test set. Without any explicit geometric operations or external camera parameters, the network structure automatically learns the functional relation between the input data and the pose estimate. The CNN architecture outperforms existing inter-frame estimation methods and keeps the algorithm robust under image degradations (e.g. blur, large contrast and brightness changes). The deep learning approach addresses the poor real-time performance and poor robustness of the SLAM visual odometer and the problem of rapid localization of intelligent carriers such as unmanned aerial vehicles in unknown environments.
Step 1): carrying out data preprocessing on the training data set;
1.1) firstly, cropping the images in the KITTI database so that they all have the same size;
1.2) then utilizing a conversion matrix between adjacent frames to expand the data set;
because the expanded image pairs need to share a certain overlapping area, a step length N is selected for the expansion: let the number of samples in the original data set be S, and let the pose transformation matrix between time i and time j be T_{i,j}; the pose matrix between time t and time (t+N) is then T_{t,t+N} = T_{t,t+1} · T_{t+1,t+2} · T_{t+2,t+3} · ... · T_{t+N-1,t+N}; using this transformation relation with expansion step length N, the data set is expanded to N·S samples, where S is the number of training samples provided by the KITTI data set;
1.3) data conversion;
converting the trajectory data provided by KITTI from pose-matrix form (stored row-major, flattened into a single row) into relative pose transformation vectors between adjacent frames by means of Peter Corke's Robotics Toolbox, i.e. converting the rotation matrix into Euler angles and the displacement part into a displacement vector;
step 2): and (3) constructing an offline deep neural network model, as shown in FIG. 3.
According to the traditional function mapping f = w·x + b, the target value f is obtained by adding a bias value to the product of the coefficient w and the variable x; let the 6-degree-of-freedom inter-frame pose estimate be f, obtained by mapping the input parameter variable x, where x is an auxiliary parameter; since the auxiliary parameter is defined over a discrete space, x and f can theoretically be mapped one to one; w is regarded as the training-sequence coefficient obtained from the training data set, and b is the residual between the true value and the calculated value, used for correction;
2.1) division of training set and validation set: since only sequences 00-10 of the data set provided by KITTI can be used for offline training, the first M of the 00-10 sequences are used as the training set and the remaining (11-M) sequences as the test set; the training set is used for network training and the test set for verifying the accuracy of network learning;
2.2) building an offline learning deep neural network model;
2.2.1) building an optical flow extraction network, and extracting an initial optical flow field by using adjacent image frames; adopting a Brox algorithm network as an optical flow extractor, calculating optical flow between two frames of images at time t and t +1, and quantizing the calculated optical flow field by using RGB (red, green and blue) coding, so that input data is in a three-channel eight-bit depth image format;
2.2.2) building a global feature extraction network;
carrying out T1 downsampling on the whole image, then carrying out deep network training, selecting a convolutional neural network for feature extraction, carrying out training by using the global information of the optical flow diagram, and acquiring the global features of the optical flow diagram;
2.2.3) building a local feature extraction network;
dividing the depth image into four sub-images, downsampling each quadrant for T2 times, then training through a CNN filter, performing two-stage training on each sub-image, performing CNN1 and CNN2, and finally cascading a full connection layer;
the first part of the local feature extraction network consists of four branches, and each subimage is trained respectively; each of the four quadrants of the image contains some motion information for calculating the motion estimate; then, correlating the output of the first CNN-pooling layer pair with the second CNN-pooling layer; CNN1 and CNN2 extract different information from the optical flow images; assume that CNN1 extracts finer details, while CNN2 extracts coarser details, and that the information does not overlap completely;
after this stage, the four complex features are combined into one feature containing the global image information, so the network can resolve motion blur using symmetric information. The last layer connects to a fully connected network that uses the information of all four quadrants at two resolutions.
2.2.4) merging the global feature extraction network and the local feature extraction network to build an optical flow graph feature extraction network;
the CNN filters of the global feature extraction network and the local feature extraction network are used for feeding the output of the CNN filters to a next layer of full-connection layer network, and the global information of the global feature extraction network and the local information of the local feature extraction network are combined to improve the performance of the network;
2.2.5) constructing a pose estimation network, wherein the pose estimation network is responsible for integrating all the characteristics and finally finishing an estimation task of an interframe pose vector;
the pose estimation network is a regressor for finally finishing interframe pose vector estimation, the form of combining a convolution network and a fully-connected network is still adopted on the whole, all the features extracted by the feature extraction network are input into the fully-connected network, and the fully-connected layer is responsible for final feature integration and regressively fits the nonlinear relation of the pose estimation problem in the geometric mapping; the fully-connected layer comprises three layers, and the last layer finally fits the feature vectors into six-dimensional pose vectors to realize estimation of the pose vector y;
step 3): selecting a network loss function;
because this is a general regression problem, the Euclidean distance loss function is selected as the error measure: the Euclidean distance is first computed between the predicted values and the true labels of all samples in the current training batch, and then the mean of the squared distances, i.e. the mean square error (MSE), is taken;
assuming that the predicted values and the true values of the model follow a normal distribution, the model that best fits the measurements maximizes the product of the sample probabilities; the corresponding derivation yields the least-squares criterion; the Euclidean distance loss function is therefore
Loss(W) = (1/N) · Σ_{i=1}^{N} ‖f_W(X_i) − Y_i‖²
where W denotes the weight parameters of the network model, f_W(X_i) is the output of the network, Y_i is the ground-truth pose vector, and N is the size of the training set;
step 4): training a network;
4.1) randomly initializing network parameters by using Gaussian distribution;
4.2) pre-training layer by greedy to optimize network parameters layer by layer;
the greedy layer-by-layer pre-training method finds the optimal local minimum value for the filter coefficient of each layer, and then carries out global training to carry out fine tuning on the weighted value; for each branch, train the CNN1 filter and fully-connected layers with its connected fully-connected layer, then discard the fully-connected layer, feed the output of CNN1 to CNN2 and train only this new fully-connected layer; discarding the full connection layer again, connecting the two outputs of the CNN and training the third estimator; this process is repeated for each branch, then the last fully connected layer is discarded and the four quadrant outputs are connected to the last fully connected network, which trains the final estimator and fine-tunes the CNNs coefficients;
4.3) aiming at the batch normalization layer, setting a use-global-stats parameter as false in the training process, and setting the use-global-stats parameter as true in the testing process, wherein the network adopts an Euclidean distance loss function;
because the data set contains image sequences with several different camera intrinsics, in order to improve the training speed and quality of the network and to avoid the pose regressor being biased towards one particular distribution, all samples are randomly shuffled during training, and the network parameters are adjusted and optimized in the mini-batch fashion commonly used when optimizing deep learning models. The network is optimized with the Adam algorithm; the value of the learning rate has a large influence on the training of the network and is a good parameter selected through repeated experiments.
The target image size in step 1.1) is the most frequent image size in the image sequence; adjusting all images to a uniform size facilitates subsequent image processing and simplifies training.
The step size N in step 1.2) is selected to be 2. The image sequence has a certain continuity, and pose vectors between image frames at different intervals can be obtained from consecutive adjacent frames by composing pose matrices, thereby expanding the data set. Because adjacent frame images of the expanded data set need to share a certain overlapping area, the step size must not be chosen too large, otherwise there is no overlap between adjacent frames.
M in step 2.1) is 8. A large training set is required for training, but a sufficient number of test sequences is also required for testing; choosing M = 8 guarantees both the size of the training set and the verification on the test set.
T1 in step 2.2.2) is 8: downsampling the image by a factor of 8 preserves the image characteristics while avoiding an excessive amount of computation.
T2 in step 2.2.3) is 4: downsampling each quadrant by a factor of 4 preserves the image characteristics while avoiding an excessive amount of computation.
The embodiment example is shown in figure 1: a visual SLAM front-end attitude estimation method based on deep learning comprises the following specific implementation steps:
step 1): carrying out data preprocessing on the training data set;
1.1) First, the images of the first 4 sequences (00-03) of the KITTI database are cropped to the size of the images of the remaining 7 sequences, namely 1226 × 370;
1.2) The data set is then expanded using the transformation matrices between adjacent frames: the number of samples in the original data set is S, and the pose transformation matrix between time i and time j is T_{i,j}; the pose matrix between time t and time (t+2) is T_{t,t+2} = T_{t,t+1} · T_{t+1,t+2}. Using this transformation, an expansion step size of 2 is selected to expand the data set to 2S, where S is the number of training samples provided by the KITTI data set; to the original data T_{t,t+1}, T_{t+1,t+2} a new sample T_{t,t+2} is added.
1.3) data conversion
Track data provided by KITTI is converted into relative pose transformation vectors between adjacent frames by means of a Robotics Toolbox of Peter Corke, namely a rotation matrix is converted into an Euler angle, and a displacement part is converted into a displacement vector.
Step 2): and (5) constructing an offline deep neural network model.
According to the conventional function mapping f = w·x + b, the target value f is obtained by adding a bias value to the product of the coefficient w and the variable x. Using this idea, let the 6-degree-of-freedom inter-frame pose estimate be f, obtained by mapping the input parameter variable x; x can be understood as an auxiliary parameter, and since the auxiliary parameter is defined over a discrete space, x and f can theoretically be mapped one to one; w can be regarded as the training-sequence coefficients obtained from the training data set, while b can be understood as the residual between the true value and the calculated value, used for correction.
2.1) division of training and validation sets. Only 00-10 sequences of data sets provided by KITTI can be used for off-line training. The first eight sequences (00-07) in the 00-10 sequence pairs are used as training sets, and the last three sequences (08-10) are used as test sets. And performing network training by using the training set, and verifying the accuracy of network learning by using the test set.
2.2) building an offline learning deep neural network model.
2.2.1) building an optical flow extraction network, and extracting an initial optical flow field by using adjacent image frames; a Brox algorithm network is adopted as an optical flow extractor, optical flow between two frames of images at time t and t +1 is calculated, and an RGB code is used for quantizing the calculated optical flow field, so that input data is in a three-channel and eight-bit depth image format.
2.2.2) building a global feature extraction network;
and carrying out eight times of downsampling on the whole image, then carrying out deep network training, and carrying out training by using global information. 2.2.3) constructing a local feature extraction network as shown in FIG. 2;
the depth image is divided into four sub-images. Each quadrant is downsampled 4 times and then trained through the CNN filter of the feature extraction network. The last layer is trained using the output layers of the four CNN networks to derive a global inter-frame estimate.
The first part of the local feature extraction network consists of four branches of the same complexity, trained separately, which perform the first two convolution stages (CNN1 and CNN2); note that each of the four quadrants of the image contains some motion information that can be used to compute the motion estimate. The output of the first CNN-pooling layer pair is then correlated with that of the second. CNN1 and CNN2 extract different information from the optical flow images: it is assumed that CNN1 extracts finer details while CNN2 extracts coarser details, and that the information does not overlap completely.
After this stage, the four complex features are combined into one feature containing the global image information, so the network can resolve motion blur using symmetric information. The last layer connects to a fully connected network that uses the information of all four quadrants at two resolutions.
2.2.4) merging the global feature extraction network and the local feature extraction network to build the optical flow graph feature extraction network
The CNN filters of the global feature extraction network and the local feature extraction network are used to feed their outputs to the next layer of the fully-connected layer network. And the global information of the global feature extraction network is combined with the local information of the local feature extraction network to improve the performance of the network.
2.2.5) constructing a pose estimation network, wherein the pose estimation network is responsible for integrating all the characteristics and finally finishing the estimation task of the pose vectors between frames.
The pose estimation network is a regressor for finally finishing the inter-frame pose vector estimation, and the form of combining a convolution network and a fully-connected network is still adopted on the whole. And inputting all the features extracted by the feature extraction network into a full-connection network, wherein the full-connection layer is responsible for final feature integration and regresses and fits the nonlinear relation of the pose estimation problem in the geometric mapping. The fully-connected layer comprises three layers, and the last layer finally fits the feature vectors into six-dimensional pose vectors to realize estimation of the pose vector y.
Step 3): a network loss function;
because the problem is a general regression problem, the Euclidean distance loss function is selected as an error calculation mode, the Euclidean distance is firstly calculated by using the predicted values and the real labels of all samples in the current batch of training set, and then the mean value, namely the Mean Square Error (MSE), is calculated for the squares of all the distances.
MSE = (1/N) · Σ_{i=1}^{N} ‖ŷ_i − y_i‖²
Assuming that the predicted values and the true values of the model follow a normal distribution, the model that best fits the measurements maximizes the product of the sample probabilities; the corresponding derivation yields the least-squares criterion. The Euclidean distance loss function is therefore:
Loss(W) = (1/N) · Σ_{i=1}^{N} ‖f_W(X_i) − Y_i‖²
where W denotes the weight parameters of the network model, f_W(X_i) is the output of the network, Y_i is the ground-truth pose vector, and N is the size of the training set.
Step 4): training a network;
and carrying out training by using a greedy layer-by-layer pre-training method of a deep network.
4.1) randomly initializing network parameters using a gaussian distribution,
and 4.2) pre-training layer by greedy to optimize network parameters layer by layer.
A greedy, layer-by-layer pre-training approach finds the optimal local minima for the filter coefficients of each layer, and then performs global training to fine tune them. For each branch, the CNN1 filter and the fully-connected layer are trained using its connected fully-connected layer, then the fully-connected layer is dropped, the output of CNN1 is fed to CNN2 and only this new fully-connected layer is trained. The fully-connected layer is again discarded and the two outputs of CNN are connected and the third estimator is trained. This process is repeated for each branch, then the last fully connected layer is discarded and the four quadrant outputs are connected to the last fully connected net, which trains the final estimator and fine-tunes the CNNs coefficients.
4.3) aiming at the batch normalization layer, setting the use-global-stats parameter as false in the training process, and setting the use-global-stats parameter as true in the testing process. The network employs a euclidean distance loss function.
Because the data set contains image sequences with several different camera intrinsics, in order to improve the training speed and quality of the network and to avoid the pose regressor being biased towards one particular distribution, all samples are randomly shuffled during training, and the network parameters are adjusted and optimized in the mini-batch fashion commonly used when optimizing deep learning models; 1024 samples are selected per batch, and on average about 50 iterations traverse the whole training set once. The network is optimized with the Adam algorithm, the learning rate is set to 0.0002 and the momentum parameter to 0.9; the value of the learning rate has a large influence on the training of the network and is a good parameter selected through repeated experiments, as sketched below.
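As a sketch of this training configuration, the stated hyper-parameters map onto Keras roughly as follows; the tiny placeholder model and random stand-in data are assumptions standing in for the full two-branch network and the KITTI samples, and mapping the momentum parameter to Adam's beta_1 is likewise an assumption.

```python
import numpy as np
import tensorflow as tf

# Placeholder model and data standing in for the full two-branch network above.
model = tf.keras.Sequential([tf.keras.layers.Dense(6, input_shape=(128,))])
flows = np.random.rand(2048, 128).astype('float32')    # stand-in feature inputs
poses = np.random.rand(2048, 6).astype('float32')      # stand-in 6-D pose labels

optimizer = tf.keras.optimizers.Adam(learning_rate=0.0002, beta_1=0.9)  # lr 0.0002, momentum 0.9
model.compile(optimizer=optimizer, loss='mse')          # Euclidean distance (MSE) loss

# Randomly shuffle all samples and optimize in mini-batches of 1024 per batch.
model.fit(flows, poses, batch_size=1024, shuffle=True, epochs=10, verbose=0)
```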
Step 5): testing on the data set. To evaluate the proposed method, experiments were performed using the published data set. To further explore the robustness of the architecture, the tested sequences were modified manually to increase blur and to change contrast and brightness, simulating many complex-scene environments such as low brightness and motion blur.
The most common data set in the field of visual SLAM and visual odometry is the KITTI data set, which is a common testing platform for many vision algorithms. The autonomous driving platform collects data on urban streets through left and right cameras and provides accurate trajectory data for monocular, binocular and even laser-based research; for monocular vision research only the image sequence of a single camera is used. The images are undistorted, with a resolution of 1240 × 386 at a frame rate of 10 Hz; some sequences have a slightly higher resolution, and in these cases simple cropping is performed to unify all frames.
The data set provides 22 image frame sequences in total; the first 11 sequences are openly downloadable, while results for the last 11 can only be submitted online and serve as the basis for algorithm benchmarking. Therefore only the 11 sequences 00-10 can be used when designing the pose estimation model. The first 8 are used as the training set, and the three sequences 08, 09 and 10 are finally evaluated as the test set.
The present invention is compared with ORB-SLAM. The comparison shows that the deep learning method is clearly superior to ORB-SLAM in runtime and achieves higher accuracy in the predicted trajectory. The framework design also greatly improves the stability of the algorithm: the average displacement error of the trajectory stays at about 10%, and the performance does not vary drastically between sequences. The network structure is implemented with the TensorFlow framework and trained on an NVIDIA GTX1080Ti GPU; the accumulated time of all stages of the algorithm is below 40 ms, i.e. more than ten frames per second on the KITTI data set, which meets the real-time computation requirement.
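For an evaluation of this kind, the predicted relative poses must be composed back into an absolute trajectory before a displacement error can be measured; a minimal sketch follows, in which the 'zyx' Euler convention and the definition of the percentage error as mean position error over path length are assumptions.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def compose_trajectory(rel_poses):
    """Chain predicted 6-D relative poses [euler(zyx), t] into absolute camera positions."""
    T_abs, trajectory = np.eye(4), [np.zeros(3)]
    for p in rel_poses:
        T = np.eye(4)
        T[:3, :3] = Rotation.from_euler('zyx', p[:3]).as_matrix()
        T[:3, 3] = p[3:]
        T_abs = T_abs @ T
        trajectory.append(T_abs[:3, 3])
    return np.array(trajectory)

def mean_displacement_error(pred_xyz, gt_xyz):
    """Average displacement error of the trajectory, as a fraction of path length."""
    err = np.linalg.norm(pred_xyz - gt_xyz, axis=1).mean()
    path = np.linalg.norm(np.diff(gt_xyz, axis=0), axis=1).sum()
    return err / path
```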

Claims (6)

1. A visual SLAM front-end attitude estimation method based on deep learning is characterized by comprising the following steps:
step 1): carrying out data preprocessing on the training data set;
1.1) firstly, cutting images in a KITTI database until the images have the same size;
1.2) then utilizing a conversion matrix between adjacent frames to expand the data set;
selecting a step length N for the expansion: let the number of samples in the original data set be S, and let the pose transformation matrix between time i and time j be T_{i,j}; the pose matrix between time t and time (t+N) is then T_{t,t+N} = T_{t,t+1} · T_{t+1,t+2} · T_{t+2,t+3} · ... · T_{t+N-1,t+N}; using this transformation relation with expansion step length N, the data set is expanded to N·S samples, where S is the number of training samples provided by the KITTI data set;
1.3) data conversion;
converting the track data provided by KITTI from a pose matrix form into a relative pose transformation vector between adjacent frames by using a Robotics Toolbox of Peter Corke, namely converting a rotation matrix into an Euler angle and converting a displacement part into a displacement vector;
step 2): constructing an offline deep neural network model;
setting the interframe pose estimation 6 freedom parameter as f, mapping an input parameter variable x of a target f to obtain, and then taking x as an auxiliary parameter; w is a training sequence coefficient obtained by a training data set, b is a residual error value of a true value and a calculated value and is used for correction;
2.1) division of training set and validation set: as only 00-10 sequences of a data set provided by KITTI can be used for off-line training, the first M sequences in the 00-10 sequence pairs of the data set provided by KITTI are used as a training set, the last 11-M sequences are used as a test set, the training set is used for network training, and the test set is used for verifying the accuracy of network learning;
2.2) building an offline learning deep neural network model;
2.2.1) building an optical flow extraction network, and finishing the extraction of an initial optical flow field by using adjacent image frames: adopting a Brox algorithm network as an optical flow extractor, calculating optical flow between two frames of images at time t and t +1, and quantizing the calculated optical flow field by using RGB (red, green and blue) coding, so that input data is in a three-channel eight-bit depth image format;
2.2.2) building a global feature extraction network;
carrying out T1 downsampling on the whole image, then carrying out deep network training, selecting a convolutional neural network for feature extraction, carrying out training by using the global information of the optical flow diagram, and acquiring the global features of the optical flow diagram;
2.2.3) building a local feature extraction network;
dividing the depth image into four sub-images, downsampling each quadrant for T2 times, then training through a CNN filter, performing two-stage training on each sub-image, performing CNN1 and CNN2, and finally cascading a full connection layer;
the first part of the local feature extraction network consists of four branches, and each subimage is trained respectively; each of the four quadrants of the image contains motion information for calculating a motion estimate; then, correlating the output of the first CNN-pooling layer pair with the second CNN-pooling layer; CNN1 and CNN2 extract different information from the optical flow images; CNN1 extracts finer details, while CNN2 extracts coarser details, and these information do not overlap completely;
combining four complex features together to form a feature containing global image information, so that the network can resolve motion blur with symmetric information, the last layer connecting a fully connected network using information of all four quadrants at two resolutions;
2.2.4) merging the global feature extraction network and the local feature extraction network to build an optical flow graph feature extraction network;
the CNN filters of the global feature extraction network and the local feature extraction network are used for feeding the output of the CNN filters to a next layer of full-connection layer network, and the global information of the global feature extraction network and the local information of the local feature extraction network are combined to improve the performance of the network;
2.2.5) constructing a pose estimation network, wherein the pose estimation network is responsible for integrating all the characteristics and finally finishing an estimation task of an interframe pose vector;
the pose estimation network is a regressor for finally finishing interframe pose vector estimation, all the features extracted by the feature extraction network are input into the fully-connected network in a form of combining a convolutional network and the fully-connected network, and the fully-connected layer is responsible for final feature integration and regressively fitting the nonlinear relation of the pose estimation problem in the geometric mapping; the fully-connected layer comprises three layers, and the last layer finally fits the feature vectors into six-dimensional pose vectors to realize estimation of the pose vector y;
step 3): selecting a network loss function;
selecting a Euclidean distance loss function as an error calculation mode, wherein the error firstly calculates Euclidean distances by using predicted values and real labels of all samples in a current batch of training set, and then calculates an average value, namely Mean Square Error (MSE), of squares of all the distances;
assuming that the predicted values and the true values of the model follow a normal distribution, the model that best fits the measurements maximizes the product of the sample probabilities; the corresponding derivation yields the least-squares criterion; the Euclidean distance loss function is therefore
Loss(W) = (1/N) · Σ_{i=1}^{N} ‖f_W(X_i) − Y_i‖²
wherein W denotes the weight parameters of the network model, f_W(X_i) is the output of the network, Y_i is the ground-truth pose vector, and N is the size of the training set;
step 4): training a network;
4.1) randomly initializing network parameters by using Gaussian distribution;
4.2) pre-training layer by greedy to optimize network parameters layer by layer;
the greedy layer-by-layer pre-training method finds the optimal local minimum value for the filter coefficient of each layer, and then carries out global training to carry out fine tuning on the weighted value; for each branch, train the CNN1 filter and fully-connected layers with its connected fully-connected layers, then discard the fully-connected layers, feed the output of CNN1 to CNN2 and train only the new fully-connected layers; discarding the full connection layer again, connecting the two outputs of the CNN and training the third estimator; repeating this process for each branch, then discarding the last fully connected layer and connecting the four quadrant outputs to the last fully connected network, which trains the final estimator and fine-tunes the CNNs coefficients;
4.3) aiming at the batch normalization layer, setting a use-global-stats parameter as false in the training process, and setting the use-global-stats parameter as true in the testing process, wherein the network adopts an Euclidean distance loss function;
randomly scattering all samples during training, and adjusting and optimizing network parameters in a mini-batch mode of a deep learning model; and optimizing the network by adopting an adam algorithm.
2. The visual SLAM front end pose estimation method based on deep learning of claim 1, wherein the method comprises the following steps:
the size of the target image in the step 1.1) is the size with the largest number of image sizes in the image sequence.
3. The visual SLAM front end pose estimation method based on deep learning of claim 1, wherein the method comprises the following steps:
the step size N in step 1.2) is selected to be 2.
4. The visual SLAM front end pose estimation method based on deep learning of claim 1, wherein the method comprises the following steps:
m in said step 2.1) is 8.
5. The visual SLAM front end pose estimation method based on deep learning of claim 1, wherein the method comprises the following steps:
t1 in said step 2.2.2) is 8.
6. The visual SLAM front end pose estimation method based on deep learning of claim 1, wherein the method comprises the following steps:
t2 in said step 2.2.3) is 4.
CN201911278664.3A 2019-12-13 2019-12-13 Visual SLAM front-end attitude estimation method based on deep learning Active CN111127557B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911278664.3A CN111127557B (en) 2019-12-13 2019-12-13 Visual SLAM front-end attitude estimation method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911278664.3A CN111127557B (en) 2019-12-13 2019-12-13 Visual SLAM front-end attitude estimation method based on deep learning

Publications (2)

Publication Number Publication Date
CN111127557A true CN111127557A (en) 2020-05-08
CN111127557B CN111127557B (en) 2022-12-13

Family

ID=70498943

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911278664.3A Active CN111127557B (en) 2019-12-13 2019-12-13 Visual SLAM front-end attitude estimation method based on deep learning

Country Status (1)

Country Link
CN (1) CN111127557B (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111598951A (en) * 2020-05-18 2020-08-28 清华大学 Method, device and storage medium for identifying space target
CN111667535A (en) * 2020-06-04 2020-09-15 电子科技大学 Six-degree-of-freedom pose estimation method for occlusion scene
CN111833400A (en) * 2020-06-10 2020-10-27 广东工业大学 Camera position and posture positioning method
CN111931873A (en) * 2020-09-28 2020-11-13 支付宝(杭州)信息技术有限公司 Image recognition method and device
CN111967542A (en) * 2020-10-23 2020-11-20 江西小马机器人有限公司 Meter identification secondary positioning method based on depth feature points
CN112344922A (en) * 2020-10-26 2021-02-09 中国科学院自动化研究所 Monocular vision odometer positioning method and system
CN112446328A (en) * 2020-11-27 2021-03-05 汇纳科技股份有限公司 Monocular depth estimation system, method, device and computer-readable storage medium
CN112733921A (en) * 2020-12-31 2021-04-30 深圳辰视智能科技有限公司 Neural network loss function calculation method and system for predicting rigid body 6D posture
CN113436251A (en) * 2021-06-24 2021-09-24 东北大学 Pose estimation system and method based on improved YOLO6D algorithm
CN113724325A (en) * 2021-05-31 2021-11-30 西安理工大学 Multi-scene monocular camera pose regression method based on graph convolution network
CN113989318A (en) * 2021-10-20 2022-01-28 电子科技大学 Monocular vision odometer pose optimization and error correction method based on deep learning
CN114415698A (en) * 2022-03-31 2022-04-29 深圳市普渡科技有限公司 Robot, positioning method and device of robot and computer equipment
CN115187638A (en) * 2022-09-07 2022-10-14 南京逸智网络空间技术创新研究院有限公司 Unsupervised monocular depth estimation method based on optical flow mask
CN115358962A (en) * 2022-10-18 2022-11-18 中国第一汽车股份有限公司 End-to-end visual odometer method and device
CN117495970A (en) * 2024-01-03 2024-02-02 中国科学技术大学 Template multistage matching-based chemical instrument pose estimation method, equipment and medium
CN117495970B (en) * 2024-01-03 2024-05-14 中国科学技术大学 Template multistage matching-based chemical instrument pose estimation method, equipment and medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106658023A (en) * 2016-12-21 2017-05-10 山东大学 End-to-end visual odometer and method based on deep learning
CN108921893A (en) * 2018-04-24 2018-11-30 华南理工大学 A kind of image cloud computing method and system based on online deep learning SLAM
CN110490928A (en) * 2019-07-05 2019-11-22 天津大学 A kind of camera Attitude estimation method based on deep neural network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106658023A (en) * 2016-12-21 2017-05-10 山东大学 End-to-end visual odometer and method based on deep learning
CN108921893A (en) * 2018-04-24 2018-11-30 华南理工大学 A kind of image cloud computing method and system based on online deep learning SLAM
CN110490928A (en) * 2019-07-05 2019-11-22 天津大学 A kind of camera Attitude estimation method based on deep neural network

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111598951A (en) * 2020-05-18 2020-08-28 清华大学 Method, device and storage medium for identifying space target
CN111598951B (en) * 2020-05-18 2022-09-30 清华大学 Method, device and storage medium for identifying space target
CN111667535A (en) * 2020-06-04 2020-09-15 电子科技大学 Six-degree-of-freedom pose estimation method for occlusion scene
CN111833400A (en) * 2020-06-10 2020-10-27 广东工业大学 Camera position and posture positioning method
CN111833400B (en) * 2020-06-10 2023-07-28 广东工业大学 Camera pose positioning method
CN111931873A (en) * 2020-09-28 2020-11-13 支付宝(杭州)信息技术有限公司 Image recognition method and device
CN111931873B (en) * 2020-09-28 2020-12-22 支付宝(杭州)信息技术有限公司 Image recognition method and device
CN111967542A (en) * 2020-10-23 2020-11-20 江西小马机器人有限公司 Meter identification secondary positioning method based on depth feature points
CN111967542B (en) * 2020-10-23 2021-01-29 江西小马机器人有限公司 Meter identification secondary positioning method based on depth feature points
CN112344922A (en) * 2020-10-26 2021-02-09 中国科学院自动化研究所 Monocular vision odometer positioning method and system
CN112446328A (en) * 2020-11-27 2021-03-05 汇纳科技股份有限公司 Monocular depth estimation system, method, device and computer-readable storage medium
CN112446328B (en) * 2020-11-27 2023-11-17 汇纳科技股份有限公司 Monocular depth estimation system, method, apparatus, and computer-readable storage medium
CN112733921A (en) * 2020-12-31 2021-04-30 深圳辰视智能科技有限公司 Neural network loss function calculation method and system for predicting rigid body 6D posture
CN112733921B (en) * 2020-12-31 2024-05-17 深圳辰视智能科技有限公司 Neural network loss function calculation method and system for predicting rigid body 6D posture
CN113724325A (en) * 2021-05-31 2021-11-30 西安理工大学 Multi-scene monocular camera pose regression method based on graph convolution network
CN113436251A (en) * 2021-06-24 2021-09-24 东北大学 Pose estimation system and method based on improved YOLO6D algorithm
CN113436251B (en) * 2021-06-24 2024-01-09 东北大学 Pose estimation system and method based on improved YOLO6D algorithm
CN113989318A (en) * 2021-10-20 2022-01-28 电子科技大学 Monocular vision odometer pose optimization and error correction method based on deep learning
CN113989318B (en) * 2021-10-20 2023-04-07 电子科技大学 Monocular vision odometer pose optimization and error correction method based on deep learning
CN114415698B (en) * 2022-03-31 2022-11-29 深圳市普渡科技有限公司 Robot, positioning method and device of robot and computer equipment
CN114415698A (en) * 2022-03-31 2022-04-29 深圳市普渡科技有限公司 Robot, positioning method and device of robot and computer equipment
CN115187638A (en) * 2022-09-07 2022-10-14 南京逸智网络空间技术创新研究院有限公司 Unsupervised monocular depth estimation method based on optical flow mask
CN115358962B (en) * 2022-10-18 2023-01-10 中国第一汽车股份有限公司 End-to-end visual odometer method and device
CN115358962A (en) * 2022-10-18 2022-11-18 中国第一汽车股份有限公司 End-to-end visual odometer method and device
CN117495970A (en) * 2024-01-03 2024-02-02 中国科学技术大学 Template multistage matching-based chemical instrument pose estimation method, equipment and medium
CN117495970B (en) * 2024-01-03 2024-05-14 中国科学技术大学 Template multistage matching-based chemical instrument pose estimation method, equipment and medium

Also Published As

Publication number Publication date
CN111127557B (en) 2022-12-13

Similar Documents

Publication Publication Date Title
CN111127557B (en) Visual SLAM front-end attitude estimation method based on deep learning
CN110111366B (en) End-to-end optical flow estimation method based on multistage loss
US10719940B2 (en) Target tracking method and device oriented to airborne-based monitoring scenarios
CN109840471B (en) Feasible road segmentation method based on improved Unet network model
CN111862126B (en) Non-cooperative target relative pose estimation method combining deep learning and geometric algorithm
CN110348445B (en) Instance segmentation method fusing void convolution and edge information
CN110781776B (en) Road extraction method based on prediction and residual refinement network
CN111709410B (en) Behavior identification method for strong dynamic video
Shi et al. Calibrcnn: Calibrating camera and lidar by recurrent convolutional neural network and geometric constraints
CN111986240A (en) Drowning person detection method and system based on visible light and thermal imaging data fusion
RU2476825C2 (en) Method of controlling moving object and apparatus for realising said method
CN114943963A (en) Remote sensing image cloud and cloud shadow segmentation method based on double-branch fusion network
CN109584299B (en) Positioning method, positioning device, terminal and storage medium
CN112419317B (en) Visual loop detection method based on self-coding network
CN110929649B (en) Network and difficult sample mining method for small target detection
CN116434088A (en) Lane line detection and lane auxiliary keeping method based on unmanned aerial vehicle aerial image
Françani et al. Dense prediction transformer for scale estimation in monocular visual odometry
CN113034398A (en) Method and system for eliminating jelly effect in urban surveying and mapping based on artificial intelligence
CN114419421A (en) Subway tunnel crack identification system and method based on images
CN113888629A (en) RGBD camera-based rapid object three-dimensional pose estimation method
CN116883457B (en) Light multi-target tracking method based on detection tracking joint network and mixed density network
CN115063717B (en) Video target detection and tracking method based on real scene modeling of key area
CN115984592A (en) Point-line fusion feature matching method based on SuperPoint + SuperGlue
CN114743105A (en) Depth privilege visual odometer method based on cross-modal knowledge distillation
CN111160115B (en) Video pedestrian re-identification method based on twin double-flow 3D convolutional neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant