CN112801201A - Deep learning visual inertial navigation combined navigation design method based on standardization - Google Patents

Deep learning visual inertial navigation combined navigation design method based on standardization

Info

Publication number
CN112801201A
Authority
CN
China
Prior art keywords
module
deep learning
inertial navigation
main module
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110171232.3A
Other languages
Chinese (zh)
Other versions
CN112801201B (en)
Inventor
胡斌杰
丘金光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202110171232.3A priority Critical patent/CN112801201B/en
Publication of CN112801201A publication Critical patent/CN112801201A/en
Application granted granted Critical
Publication of CN112801201B publication Critical patent/CN112801201B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C 21/00 Navigation; Navigational instruments not provided for in groups G01C 1/00 - G01C 19/00
    • G01C 21/10 Navigation; Navigational instruments not provided for in groups G01C 1/00 - G01C 19/00 by using measurements of speed or acceleration
    • G01C 21/12 Navigation; Navigational instruments not provided for in groups G01C 1/00 - G01C 19/00 by using measurements of speed or acceleration executed aboard the object being navigated; Dead reckoning
    • G01C 21/16 Navigation; Navigational instruments not provided for in groups G01C 1/00 - G01C 19/00 by using measurements of speed or acceleration executed aboard the object being navigated; Dead reckoning by integrating acceleration or speed, i.e. inertial navigation
    • G01C 21/165 Navigation; Navigational instruments not provided for in groups G01C 1/00 - G01C 19/00 by using measurements of speed or acceleration executed aboard the object being navigated; Dead reckoning by integrating acceleration or speed, i.e. inertial navigation combined with non-inertial navigation instruments
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C 21/00 Navigation; Navigational instruments not provided for in groups G01C 1/00 - G01C 19/00
    • G01C 21/26 Navigation; Navigational instruments not provided for in groups G01C 1/00 - G01C 19/00 specially adapted for navigation in a road network
    • G01C 21/28 Navigation; Navigational instruments not provided for in groups G01C 1/00 - G01C 19/00 specially adapted for navigation in a road network with correlation of data from several navigational instruments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30248 Vehicle exterior or interior
    • G06T 2207/30252 Vehicle exterior; Vicinity of vehicle

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Remote Sensing (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Automation & Control Theory (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a standardization-based deep learning visual-inertial combined navigation design method, which comprises the following steps: a standardization operation is designed and applied to the training-set labels, i.e. the mean and variance of the labels are computed, the labels are transformed to a distribution with mean 0 and variance 1, and the obtained mean and variance are stored; in the network design, in order to balance the contributions of the image data and the inertial navigation data, the inertial navigation features and the image features are mapped by the network to the same dimension m; and in the validation or test stage, the result output by the network is inverse-standardized with the previously computed mean and variance to obtain the final result. In this standardization-based deep learning visual-inertial combined navigation design method, standardizing the training-set labels removes the need to select a balance factor in the objective function, improves the generalization ability of the neural network, and improves the accuracy of relative pose prediction.

Description

Deep learning visual inertial navigation combined navigation design method based on standardization
Technical Field
The invention relates to the technical field of sensor fusion and motion estimation, in particular to a deep learning visual inertial navigation combined navigation design method based on standardization.
Background
With the continuous development of autonomous driving and unmanned aerial vehicles, high-precision, highly robust positioning has become an important prerequisite for autonomous navigation and for tasks such as exploring unknown areas. In a purely visual odometry approach, the system acquires information about the surrounding environment with a visual sensor and estimates its motion state by analyzing the image data. A visual-inertial odometer adds Inertial Measurement Unit (IMU) information to the purely visual odometer, which can improve the accuracy of motion-state estimation when visual information is lost.
Conventional visual-inertial odometry has been researched thoroughly, but problems such as data loss and data corruption are still not well solved, and a large amount of manual feature selection and extrinsic calibration between the sensors is required, which is undoubtedly time-consuming. In recent years, deep learning has achieved enormous success in the field of computer vision and is widely applied in many fields. Visual-inertial combined navigation can be treated as a regression task and can likewise be trained with deep learning. In existing deep-learning-based visual-inertial combined navigation tasks, the objective function balances the learning of translation and rotation through a balance factor, and finding this balance factor requires a large amount of training time, which undoubtedly consumes manpower and material resources. To address this problem, a new objective function needs to be designed that avoids setting the balance factor manually.
Disclosure of Invention
The invention aims to overcome the above defects in the prior art and provides a standardization-based deep learning visual-inertial combined navigation design method that avoids manually setting a balance factor between relative translation and relative rotation.
The purpose of the invention can be achieved by adopting the following technical scheme:
a deep learning visual inertial navigation combined navigation design method based on standardization comprises the following steps:
s1, establishing a deep learning network model, wherein the deep learning network model comprises a first main module, a second main module and a third main module, and the first main module is formed by stacking 10 layers of CNNs and is called as a main module A; the second master module comprises two layers of Bi-LSTM, referred to as master module B; the third main module is called as a main module C, the main module C comprises a first sub module, a second sub module and a third sub module, the first sub module is an Attention sub module, the second sub module is a two-layer Bi-LSTM sub module, and the third sub module is a full-connection layer sub module; inputting image data to a main module A to extract image characteristics; inputting inertial navigation data into the main module B to extract inertial navigation characteristics, and ensuring that the dimensionality of the inertial navigation characteristics is consistent with the dimensionality of image characteristics; the image characteristic and the inertial navigation characteristic are serially connected and input into an Attention submodule in a main module C, the output of the Attention submodule is multiplied by the input of the Attention submodule and then input into two layers of Bi-LSTM submodules of the main module C, and the output of the two layers of Bi-LSTM submodules is input into a fully-connected layer submodule to output a result;
s2, designing a loss function, standardizing the labels of the training set, transforming the labels of the training set into a distribution with a mean value of 0 and a variance of 1, storing the mean value and the variance obtained by standardized calculation, and subtracting the standardized labels from the output of the all-connection layer sub-module in the main module C to obtain a final loss function;
s3, training and storing results, wherein the activating functions adopted by the sub-modules of the full connection layer in the main module A and the main module C are Relu, the activating functions adopted by the Attention sub-module in the main module C are Relu and Sigmoid, training data are input to train the deep learning network model constructed in the step S1, and the deep learning network model is stored to an appointed path after training is finished;
and S4, inputting the test data into the deep learning network model obtained by training in the step S3 to obtain an output result, and then carrying out inverse standardization on the mean value and the variance obtained in the step S2 to obtain a prediction result.
Further, the navigation design method further comprises a test verification step, and the process is as follows:
and simulating four extreme cases, namely no data corruption, image data occluded by foreign objects, inertial navigation data loss, and image data loss; the corresponding test data for the four extreme cases are input into the deep learning network model trained in step S3 for testing, and the output of the deep learning network model is inverse-standardized with the mean and variance saved in step S2 to obtain the prediction result.
Further, the navigation design method divides the training set and the test set as follows: sequences 00-08 of the KITTI dataset are used as the training set, and sequences 09 and 10 are used as the test set.
Further, main module A is formed by stacking 10 layers of CNNs in sequence, wherein each CNN layer is a two-dimensional convolution; the convolution kernels of the first three CNN layers are 7×7, 5×5 and 5×5, and the convolution kernels of the last seven CNN layers are 3×3. Main module B consists of two layers of Bi-LSTM, each containing 512 neurons. Main module C comprises an Attention sub-module, a two-layer Bi-LSTM sub-module and a fully-connected layer sub-module, wherein each Bi-LSTM layer in the two-layer Bi-LSTM sub-module contains 1000 neurons. The Attention sub-module consists of two fully-connected layers, the activation function of the first fully-connected layer being ReLU and that of the second being Sigmoid. The fully-connected layer sub-module is formed by cascading four fully-connected layers with 512, 128, 64 and 6 neurons respectively.
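By way of illustration, the following is a minimal PyTorch sketch of main module C (the Attention sub-module, the two-layer Bi-LSTM sub-module and the fully-connected layer sub-module) as just described; main modules A and B are omitted, the hidden size of the Attention layers and the sequence handling are assumptions, and all class and variable names are hypothetical rather than taken from the patent.

    # Hedged sketch of main module C: an Attention sub-module (two fully-connected
    # layers with ReLU and Sigmoid) whose output re-weights its input, followed by a
    # two-layer Bi-LSTM (1000 neurons per layer) and a 512-128-64-6 fully-connected head.
    import torch
    import torch.nn as nn

    class AttentionSubmodule(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.fc1 = nn.Linear(dim, dim)   # first fully-connected layer, ReLU activation
            self.fc2 = nn.Linear(dim, dim)   # second fully-connected layer, Sigmoid activation

        def forward(self, x):
            w = torch.sigmoid(self.fc2(torch.relu(self.fc1(x))))
            return x * w                     # output multiplied element-wise with the input

    class MainModuleC(nn.Module):
        def __init__(self, fused_dim):
            super().__init__()
            self.attention = AttentionSubmodule(fused_dim)
            self.bilstm = nn.LSTM(fused_dim, 1000, num_layers=2,
                                  batch_first=True, bidirectional=True)
            self.head = nn.Sequential(
                nn.Linear(2 * 1000, 512), nn.ReLU(),
                nn.Linear(512, 128), nn.ReLU(),
                nn.Linear(128, 64), nn.ReLU(),
                nn.Linear(64, 6),            # relative translation (x,y,z) and rotation (x,y,z)
            )

        def forward(self, img_feat, imu_feat):
            # image features and inertial navigation features of equal dimension are concatenated
            fused = torch.cat([img_feat, imu_feat], dim=-1)
            out, _ = self.bilstm(self.attention(fused))
            return self.head(out)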
Further, the standardization in step S2 proceeds as follows:
The mean of the training-set labels is computed as

$$u = \frac{1}{n}\sum_{j=1}^{n} Y_{raw}^{(j)}$$

The variance of the training-set labels is computed as

$$\sigma^{2} = \frac{1}{n}\sum_{j=1}^{n}\left(Y_{raw}^{(j)} - u\right)^{2}$$

The training-set labels are standardized as

$$\hat{Y} = \frac{Y_{raw} - u}{\sigma}$$

wherein n is the number of training-set labels; $Y_{raw}$ is an original training-set label, comprising the relative translations along the x, y, z axes and the relative rotations about the x, y, z axes, with dimension 6; $u$ is the mean of the relative translations and relative rotations, with dimension 6; $\sigma^{2}$ is the variance of the relative translations and relative rotations, with dimension 6; $\sigma$ is the standard deviation corresponding to $\sigma^{2}$; and $\hat{Y}$ is the standardized label. All operations are applied element-wise over the 6 label dimensions.
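As an illustration, the standardization and the corresponding inverse operation of step S4 can be written as in the following hedged NumPy sketch; the function and file names are assumptions for illustration only, not the patented code.

    # Hedged sketch: z-score standardization of the 6-D training labels (relative
    # translation x,y,z and relative rotation x,y,z) and its inverse, used at test time.
    import numpy as np

    def standardize_labels(y_raw):
        """y_raw: array of shape (n, 6). Returns standardized labels and the saved mean/std."""
        u = y_raw.mean(axis=0)        # per-dimension mean, shape (6,)
        sigma = y_raw.std(axis=0)     # per-dimension standard deviation, shape (6,)
        return (y_raw - u) / sigma, u, sigma

    def inverse_standardize(y_pred, u, sigma):
        """y_pred: network output of shape (..., 6). Returns the final relative-pose prediction."""
        return y_pred * sigma + u

    # usage: standardize once before training, keep u and sigma for step S4
    # y_norm, u, sigma = standardize_labels(train_labels)
    # np.savez("label_stats.npz", u=u, sigma=sigma)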
Further, the loss function used in step S3 is as follows:

$$Loss = \frac{1}{B}\sum_{i=1}^{B}\sum_{t=1}^{k}\left|\hat{y}_{i,t} - \hat{Y}_{i,t}\right|$$

wherein B is the number of samples in a single input batch during training and i is the index of a sample within the batch; k is the dimension of the label, equal to 6; t indexes the elements shared by the output of the fully-connected layer sub-module in main module C and the standardized label; $\hat{Y}_{i,t}$ is the single element at position t of the standardized label $\hat{Y}_{i}$; $\hat{y}_{i,t}$ is the single element at position t of the output of the fully-connected layer sub-module in the third main module C after the i-th group of data of the batch has been fed into the deep learning network model; and |·| denotes taking the absolute value of the difference.
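For illustration, the loss above can be written in a few lines of PyTorch; this is a hedged sketch rather than the patented implementation, and it assumes the network output and the standardized labels are tensors of shape (B, 6).

    # Hedged sketch: mean over the batch of the summed absolute differences between
    # the network output and the standardized 6-D labels.
    import torch

    def standardized_l1_loss(pred, y_norm):
        """pred, y_norm: tensors of shape (B, 6); returns a scalar loss."""
        return torch.abs(pred - y_norm).sum(dim=1).mean()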
Further, in step S3 the learning rate is fixed at 0.0001, the number of epochs is 200, the batch size is 8, and an Adam optimizer is used.
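The training configuration just stated could be realized, for instance, as in the following hedged PyTorch sketch; here `model` stands for the full network formed by the three main modules, `train_dataset` yields image data, inertial data and already standardized labels, and all names and the (batch, 6) output shape are assumptions rather than the patented code.

    # Hedged sketch: Adam optimizer, fixed learning rate 1e-4, 200 epochs, batch size 8.
    import torch
    from torch.utils.data import DataLoader

    def train(model, train_dataset, device="cuda"):
        train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
        model.to(device).train()
        for epoch in range(200):
            for images, imu, y_norm in train_loader:          # labels standardized in step S2
                images, imu, y_norm = images.to(device), imu.to(device), y_norm.to(device)
                pred = model(images, imu)                      # assumed (batch, 6) output
                loss = torch.abs(pred - y_norm).sum(dim=1).mean()
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        torch.save(model.state_dict(), "model.pth")            # save to the designated path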
Further, the inverse standardization in step S4 is performed as follows:

$$Y_{inv} = \hat{y}\,\sigma + u$$

wherein $\sigma$ is the standard deviation of the training-set labels and $u$ is their mean, both saved in step S2; $\hat{y}$ is the output of the fully-connected layer sub-module in the third main module C of step S1, with dimension 6; and $Y_{inv}$ is the final predicted value after inverse standardization, with dimension 6.
Further, the simulation of the various extreme cases in step S4 is implemented as follows:
for the case of image data occluded by foreign objects, pictures are selected at random from the test set, a pixel coordinate is chosen at random within each selected picture, and a black mask block of size 100×100 centred on that pixel coordinate is added;
for the case of inertial navigation data loss, inertial navigation data are selected at random from the test set and the selected inertial navigation data are set to zero;
for the case of image data loss, pictures are selected at random from the test set and the selected pictures are replaced with pure black pictures.
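The three corruption cases can be reproduced, for example, with the following hedged NumPy sketch; the image layout (H×W×C arrays), the random number generator and the function names are assumptions for illustration only.

    # Hedged sketch of the corrupted test data used for verification:
    # 100x100 black occlusion block, zeroed inertial data, and an all-black image.
    import numpy as np

    rng = np.random.default_rng(0)

    def occlude_image(img):
        """Add a 100x100 black mask block centred on a randomly chosen pixel of img (H, W, C)."""
        h, w = img.shape[:2]
        cy, cx = int(rng.integers(h)), int(rng.integers(w))
        out = img.copy()
        out[max(cy - 50, 0):cy + 50, max(cx - 50, 0):cx + 50] = 0
        return out

    def drop_imu(imu):
        """Simulate inertial navigation data loss by setting the selected samples to zero."""
        return np.zeros_like(imu)

    def drop_image(img):
        """Simulate image data loss by replacing the selected picture with a pure black one."""
        return np.zeros_like(img)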
Compared with the prior art, the invention has the following advantages and effects:
1. In the standardization-based deep learning visual-inertial combined navigation design method of the invention, the real labels of the training set are standardized, which removes the need, present in other deep learning methods, to balance the learning of relative translation and relative rotation by manually setting a balance factor; this improves the generalization ability of the design method and saves the time spent on manually tuning the balance factor.
2. In the standardization-based deep learning visual-inertial combined navigation design method of the invention, the dimension of the image features is reduced to m and the dimension of the inertial navigation features is raised to m, so that the image features and the inertial navigation features have the same dimension.
3. In the standardization-based deep learning visual-inertial combined navigation design method of the invention, an attention mechanism is introduced that adaptively weights the features and suppresses unnecessary features, improving the accuracy of motion-state estimation.
Drawings
FIG. 1 is a flowchart of the standardization-based deep learning visual-inertial combined navigation design method disclosed in the embodiments of the present invention;
FIG. 2 is a structural diagram of the Attention sub-module in an embodiment of the present invention;
FIG. 3 is an overall diagram of the deep learning network model in an embodiment of the present invention;
FIG. 4 is a trajectory plot for scene sequence 09 of the KITTI dataset.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
As shown in FIG. 1, this embodiment discloses a standardization-based deep learning visual-inertial combined navigation design method in which the training set and the test set are divided as follows: sequences 00-08 of the KITTI dataset are used as the training set, and sequences 09 and 10 are used as the test set.
The method comprises the following steps:
s1, establishing a deep learning network model, wherein the deep learning network model comprises a first main module, a second main module and a third main module, and the first main module is formed by stacking 10 layers of CNNs and is called as a main module A; the second master module comprises two layers of Bi-LSTM, referred to as master module B; the third main module is called as a main module C, the main module C comprises a first sub module, a second sub module and a third sub module, the first sub module is an Attention sub module, the second sub module is a two-layer Bi-LSTM sub module, and the third sub module is a full-connection layer sub module; inputting image data to a main module A to extract image characteristics; inputting inertial navigation data into the main module B to extract inertial navigation characteristics, and ensuring that the dimensionality of the inertial navigation characteristics is consistent with the dimensionality of image characteristics; the image characteristic and the inertial navigation characteristic are serially connected and input into an Attention submodule in a main module C, the output of the Attention submodule is multiplied by the input of the Attention submodule and then input into two layers of Bi-LSTM submodules of the main module C, and the output of the two layers of Bi-LSTM submodules is input into a fully-connected layer submodule to output a result;
s2, designing a loss function, standardizing the labels of the training set, transforming the labels of the training set into a distribution with a mean value of 0 and a variance of 1, storing the mean value and the variance obtained by standardized calculation, and subtracting the standardized labels from the output of the all-connection layer sub-module in the main module C to obtain a final loss function;
s3, training and storing results, wherein an activation function adopted by a fully-connected layer sub-module in the main module A and the main module C is Relu, an activation function adopted by an Attention sub-module in the main module C is Relu and Sigmoid, the learning rate is fixed, training data are input to train the deep learning network model constructed in the step S1, and the deep learning network model is stored to an appointed path after training;
and S4, inputting the test data into the deep learning network model obtained by training in the step S3 to obtain an output result, and then carrying out inverse standardization on the mean value and the variance obtained in the step S2 to obtain a prediction result.
In addition, the navigation design method also comprises a test verification step, and the process is as follows:
and simulating four extreme cases, namely no data corruption, image data occluded by foreign objects, inertial navigation data loss, and image data loss; the corresponding test data for the four extreme cases are input into the deep learning network model trained in step S3 for testing, and the output of the deep learning network model is inverse-standardized with the mean and variance saved in step S2 to obtain the prediction result.
Example two
On the basis of the standardization-based deep learning visual-inertial combined navigation design method disclosed in the first embodiment, this embodiment further discloses the structure of the deep learning network model as follows:
The structural parameters of main module A are listed in Table 1. Main module A is formed by stacking 10 layers of CNNs in sequence, wherein each CNN layer is a two-dimensional convolution; the convolution kernel sizes of the first three CNN layers are 7×7, 5×5 and 5×5, and the convolution kernel sizes of the last seven CNN layers are 3×3. Main module B consists of two layers of Bi-LSTM, each containing 512 neurons. Main module C comprises an Attention sub-module, a two-layer Bi-LSTM sub-module and a fully-connected layer sub-module, wherein each Bi-LSTM layer in the two-layer Bi-LSTM sub-module contains 1000 neurons. The Attention sub-module consists of two fully-connected layers, whose structure is shown in FIG. 2; the activation function of the first fully-connected layer is ReLU and that of the second is Sigmoid. The fully-connected layer sub-module is formed by cascading four fully-connected layers with 512, 128, 64 and 6 neurons respectively;
Table 1 Structural parameters of the first main module (main module A) [the table is reproduced as an image in the original publication]

As shown in Table 1, in the parameter column K refers to the convolution kernel size, S to the convolution stride and P to whether zero padding is applied; zero padding is performed when the value of P is 1.
The specific implementation of step S2 is as follows:
The task of the deep learning network model is to predict the relative translations along the x, y, z axes and the relative rotations about the x, y, z axes. Because the magnitudes of the relative translations and of the relative rotations differ greatly, the relative translations are predicted well while the relative rotations are predicted poorly. Training is usually balanced by a balance factor chosen so that the relative translations and the relative rotations have comparable magnitudes, but this balance factor has to be determined through repeated experiments. Here the labels are standardized instead, i.e. the relative translations and the relative rotations are standardized, so that no balance factor needs to be added. The label standardization of step S2 proceeds as follows:
the training set label mean is calculated according to:
$$u = \frac{1}{n}\sum_{j=1}^{n} Y_{raw}^{(j)}$$

The variance of the training-set labels is calculated according to

$$\sigma^{2} = \frac{1}{n}\sum_{j=1}^{n}\left(Y_{raw}^{(j)} - u\right)^{2}$$

The training-set labels are standardized according to

$$\hat{Y} = \frac{Y_{raw} - u}{\sigma}$$

wherein n is the number of training-set labels; $Y_{raw}$ is an original training-set label, comprising the relative translations along the x, y, z axes and the relative rotations about the x, y, z axes, with dimension 6; $u$ is the mean of the relative translations and relative rotations, with dimension 6; $\sigma^{2}$ is the variance of the relative translations and relative rotations, with dimension 6; $\sigma$ is the standard deviation corresponding to $\sigma^{2}$; and $\hat{Y}$ is the standardized label.
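For comparison, prior deep-learning visual-inertial methods typically balance the two error terms with a manually tuned factor; the general shape of such an objective is, for instance,

$$L = \left\lVert \hat{p} - p \right\rVert_{1} + \beta \left\lVert \hat{\varphi} - \varphi \right\rVert_{1}$$

where $p$ and $\varphi$ denote the relative translation and relative rotation and $\beta$ must be found through repeated training runs (this form is shown only as an illustration, not as a quotation of any specific method). After the labels are standardized as above, every one of the six components has zero mean and unit variance, so a single unweighted error term suffices and $\beta$ is no longer needed.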
The loss function described in step S2 is calculated according to the following equation:
$$Loss = \frac{1}{B}\sum_{i=1}^{B}\sum_{t=1}^{k}\left|\hat{y}_{i,t} - \hat{Y}_{i,t}\right|$$

wherein B is the number of samples in a single input batch during training and i is the index of a sample within the batch; k is the dimension of the label, equal to 6; t indexes the elements shared by the output of the fully-connected layer sub-module in main module C and the standardized label; $\hat{Y}_{i,t}$ is the single element at position t of the standardized label $\hat{Y}_{i}$; $\hat{y}_{i,t}$ is the single element at position t of the output of the fully-connected layer sub-module in the third main module C after the i-th group of data of the batch has been fed into the deep learning network model; and |·| denotes taking the absolute value of the difference.
The specific implementation of step S3 is as follows:
the fixed learning rate was 0.0001, epoch was 200, batch was 8, and an Adam optimizer was used.
The specific implementation of step S4 is as follows:
inverse normalization was performed according to the following formula:
$$Y_{inv} = \hat{y}\,\sigma + u$$

wherein $\sigma$ is the standard deviation of the training-set labels and $u$ is their mean, both saved in step S2; $\hat{y}$ is the output of the fully-connected layer sub-module in the third main module C of step S1, with dimension 6; and $Y_{inv}$ is the final predicted value after inverse standardization, with dimension 6.
The various extreme cases in step S4 are simulated as follows:
for the case of image data occluded by foreign objects, pictures are selected at random from the test set, a pixel coordinate is chosen at random within each selected picture, and a black mask block of size 100×100 centred on that pixel coordinate is added;
for the case of inertial navigation data loss, inertial navigation data are selected at random from the test set and the selected inertial navigation data are set to zero;
for the case of image data loss, pictures are selected at random from the test set and the selected pictures are replaced with pure black pictures.
Four folders are created; after the four extreme cases have been simulated, the data for the three corrupted cases (image data occluded by foreign objects, inertial navigation data loss, and image data loss) are stored in the corresponding folders. The trained deep learning network model is then tested under the four extreme cases, with the following results:
Table 2 compares the standardization-based deep learning visual-inertial combined navigation method of the present invention (hereinafter Normalize_VIO) with the deep-learning-based soft fusion method (hereinafter Soft_VIO) in the case of no data corruption; the results show that the method of the present invention outperforms Soft_VIO:
Table 2 Comparison of the two methods in the case of no data corruption [the table is reproduced as an image in the original publication]
In Table 2, m denotes meters and rad denotes radians.
Table 3 compares the performance of the two methods in the case of inertial navigation data loss; the results show that Normalize_VIO achieves higher accuracy than Soft_VIO.
Table 3 Comparison of the two methods in the case of inertial navigation data loss [the table is reproduced as an image in the original publication]
Table 4 compares the two methods in the case of image data occluded by foreign objects; the results show that Normalize_VIO achieves higher accuracy than Soft_VIO.
Table 4 Comparison of the two methods in the case of image data occluded by foreign objects [the table is reproduced as an image in the original publication]
Table 5 compares the two methods in the case of image data loss; the results show that Normalize_VIO achieves higher accuracy than Soft_VIO.
Table 5 Comparison of the two methods in the case of image data loss [the table is reproduced as an image in the original publication]
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (10)

1. A deep learning visual inertial navigation combined navigation design method based on standardization is characterized by comprising the following steps:
s1, establishing a deep learning network model, wherein the deep learning network model comprises a first main module, a second main module and a third main module, and the first main module is formed by stacking 10 layers of CNNs and is called as a main module A; the second master module comprises two layers of Bi-LSTM, referred to as master module B; the third main module is called as a main module C, the main module C comprises a first sub module, a second sub module and a third sub module, the first sub module is an Attention sub module, the second sub module is a two-layer Bi-LSTM sub module, and the third sub module is a full-connection layer sub module; inputting image data to a main module A to extract image characteristics; inputting inertial navigation data into the main module B to extract inertial navigation characteristics, and ensuring that the dimensionality of the inertial navigation characteristics is consistent with the dimensionality of image characteristics; the image characteristic and the inertial navigation characteristic are serially connected and input into an Attention submodule in a main module C, the output of the Attention submodule is multiplied by the input of the Attention submodule and then input into two layers of Bi-LSTM submodules of the main module C, and the output of the two layers of Bi-LSTM submodules is input into a fully-connected layer submodule to output a result;
s2, designing a loss function, standardizing the labels of the training set, transforming the labels of the training set into a distribution with a mean value of 0 and a variance of 1, storing the mean value and the variance obtained by standardized calculation, and subtracting the standardized labels from the output of the all-connection layer sub-module in the main module C to obtain a final loss function;
s3, training and storing results, inputting training data to train the deep learning network model constructed in the step S1, and storing the deep learning network model to a specified path after training;
and S4, inputting the test data into the deep learning network model obtained by training in the step S3 to obtain an output result, and then carrying out inverse standardization on the mean value and the variance obtained in the step S2 to obtain a prediction result.
2. The deep learning visual inertial navigation combined navigation design method based on standardization according to claim 1, wherein the navigation design method further comprises a test verification step, carried out as follows:
four extreme cases are simulated, namely no data corruption, image data occluded by foreign objects, inertial navigation data loss, and image data loss; the corresponding test data for the four extreme cases are input into the deep learning network model trained in step S3 for testing, and the output of the deep learning network model is inverse-standardized with the mean and variance saved in step S2 to obtain the prediction result.
3. The deep learning visual inertial navigation combined navigation design method based on standardization according to claim 1, wherein the training set and the test set are divided as follows: sequences 00-08 of the KITTI dataset are used as the training set, and sequences 09 and 10 are used as the test set.
4. The deep learning visual inertial navigation combined navigation design method based on standardization according to claim 1, wherein main module A is formed by stacking 10 layers of CNNs in sequence, each CNN layer being a two-dimensional convolution; the convolution kernel sizes of the first three CNN layers are 7×7, 5×5 and 5×5, and the convolution kernel sizes of the last seven CNN layers are all 3×3; main module B consists of two layers of Bi-LSTM, each containing 512 neurons; main module C comprises an Attention sub-module, a two-layer Bi-LSTM sub-module and a fully-connected layer sub-module, wherein each Bi-LSTM layer in the two-layer Bi-LSTM sub-module contains 1000 neurons; the Attention sub-module consists of two fully-connected layers, the activation function of the first fully-connected layer being ReLU and that of the second being Sigmoid; and the fully-connected layer sub-module is formed by cascading four fully-connected layers with 512, 128, 64 and 6 neurons respectively.
5. The deep learning visual inertial navigation combined navigation design method based on standardization according to claim 1, wherein the standardization in step S2 proceeds as follows:
the mean of the training-set labels is computed as

$$u = \frac{1}{n}\sum_{j=1}^{n} Y_{raw}^{(j)}$$

the variance of the training-set labels is computed as

$$\sigma^{2} = \frac{1}{n}\sum_{j=1}^{n}\left(Y_{raw}^{(j)} - u\right)^{2}$$

the training-set labels are standardized as

$$\hat{Y} = \frac{Y_{raw} - u}{\sigma}$$

wherein n is the number of training-set labels; $Y_{raw}$ is an original training-set label, comprising the relative translations along the x, y, z axes and the relative rotations about the x, y, z axes, with dimension 6; $u$ is the mean of the relative translations and relative rotations, with dimension 6; $\sigma^{2}$ is the variance of the relative translations and relative rotations, with dimension 6; $\sigma$ is the standard deviation corresponding to $\sigma^{2}$; and $\hat{Y}$ is the standardized label.
6. The deep learning visual inertial navigation combined navigation design method based on standardization according to claim 1, wherein the loss function used in step S3 is as follows:

$$Loss = \frac{1}{B}\sum_{i=1}^{B}\sum_{t=1}^{k}\left|\hat{y}_{i,t} - \hat{Y}_{i,t}\right|$$

wherein B is the number of samples in a single input batch during training and i is the index of a sample within the batch; k is the dimension of the label, equal to 6; t indexes the elements shared by the output of the fully-connected layer sub-module in main module C and the standardized label; $\hat{Y}_{i,t}$ is the single element at position t of the standardized label $\hat{Y}_{i}$; $\hat{y}_{i,t}$ is the single element at position t of the output of the fully-connected layer sub-module in main module C after the i-th group of data of the batch has been fed into the deep learning network model; and |·| denotes taking the absolute value of the difference.
7. The deep learning visual inertial navigation combined navigation design method based on standardization according to claim 1, wherein in step S3 the learning rate is fixed at 0.0001, the number of epochs is 200, the batch size is 8, and an Adam optimizer is used.
8. The deep learning visual inertial navigation combined navigation design method based on standardization according to claim 1, wherein the inverse standardization in step S4 is performed as follows:

$$Y_{inv} = \hat{y}\,\sigma + u$$

wherein $\sigma$ is the standard deviation of the training-set labels and $u$ is their mean, both saved in step S2; $\hat{y}$ is the output of the fully-connected layer sub-module in the third main module C of step S1, with dimension 6; and $Y_{inv}$ is the final predicted value after inverse standardization, with dimension 6.
9. The deep learning visual inertial navigation combined navigation design method based on standardization according to claim 1, wherein the simulation of the extreme cases in step S4 is implemented as follows:
for the case of image data occluded by foreign objects, pictures are selected at random from the test set, a pixel coordinate is chosen at random within each selected picture, and a black mask block of size 100×100 centred on that pixel coordinate is added;
for the case of inertial navigation data loss, inertial navigation data are selected at random from the test set and the selected inertial navigation data are set to zero;
for the case of image data loss, pictures are selected at random from the test set and the selected pictures are replaced with pure black pictures.
10. The deep learning visual inertial navigation combined navigation design method based on standardization according to claim 1, wherein the activation function used by main module A and by the fully-connected layer sub-module in main module C is ReLU, and the activation functions used by the Attention sub-module in main module C are ReLU and Sigmoid.
CN202110171232.3A 2021-02-08 2021-02-08 Deep learning visual inertial navigation combined navigation design method based on standardization Active CN112801201B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110171232.3A CN112801201B (en) 2021-02-08 2021-02-08 Deep learning visual inertial navigation combined navigation design method based on standardization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110171232.3A CN112801201B (en) 2021-02-08 2021-02-08 Deep learning visual inertial navigation combined navigation design method based on standardization

Publications (2)

Publication Number Publication Date
CN112801201A true CN112801201A (en) 2021-05-14
CN112801201B CN112801201B (en) 2022-10-25

Family

ID=75814791

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110171232.3A Active CN112801201B (en) 2021-02-08 2021-02-08 Deep learning visual inertial navigation combined navigation design method based on standardization

Country Status (1)

Country Link
CN (1) CN112801201B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190102692A1 (en) * 2017-09-29 2019-04-04 Here Global B.V. Method, apparatus, and system for quantifying a diversity in a machine learning training data set
US20190325269A1 (en) * 2018-04-20 2019-10-24 XNOR.ai, Inc. Image Classification through Label Progression
US20200160177A1 (en) * 2018-11-16 2020-05-21 Royal Bank Of Canada System and method for a convolutional neural network for multi-label classification with partial annotations
US20200363815A1 (en) * 2019-05-17 2020-11-19 Nvidia Corporation Object pose estimation
CN111210435A (en) * 2019-12-24 2020-05-29 重庆邮电大学 Image semantic segmentation method based on local and global feature enhancement module

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG Qian et al.: "Design of a Strapdown Integrated Navigation System for a Small Unmanned Helicopter" (一种小型无人直升机捷联式组合导航系统设计), Computer Simulation (《计算机仿真》) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113392904A (en) * 2021-06-16 2021-09-14 华南理工大学 LTC-DNN-based visual inertial navigation combined navigation system and self-learning method
WO2022262878A1 (en) * 2021-06-16 2022-12-22 华南理工大学 Ltc-dnn-based visual inertial navigation combined navigation system and self-learning method

Also Published As

Publication number Publication date
CN112801201B (en) 2022-10-25


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant