CN112801201B - Deep learning visual inertial navigation combined navigation design method based on standardization - Google Patents

Deep learning visual inertial navigation combined navigation design method based on standardization

Info

Publication number
CN112801201B
Authority
CN
China
Prior art keywords
module
sub
main module
deep learning
inertial navigation
Prior art date
Legal status
Active
Application number
CN202110171232.3A
Other languages
Chinese (zh)
Other versions
CN112801201A (en)
Inventor
胡斌杰
丘金光
Current Assignee
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date
Filing date
Publication date
Application filed by South China University of Technology SCUT
Priority to CN202110171232.3A
Publication of CN112801201A
Application granted
Publication of CN112801201B
Legal status: Active

Classifications

    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G01C21/165 Dead reckoning by integrating acceleration or speed, i.e. inertial navigation, combined with non-inertial navigation instruments
    • G01C21/28 Navigation in a road network with correlation of data from several navigational instruments
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/045 Combinations of networks
    • G06N3/048 Activation functions
    • G06N3/08 Learning methods
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • G06T2207/30252 Vehicle exterior; Vicinity of vehicle

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Remote Sensing (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Automation & Control Theory (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a deep learning visual inertial navigation combined navigation design method based on standardization, comprising the following steps: a standardization operation is designed for the labels of the training set, in which the mean and variance of the labels are calculated, the labels are transformed into a distribution with mean 0 and variance 1, and the computed mean and variance are stored; in the network design, in order to balance the contributions of the image data and the inertial navigation data, the inertial navigation features and the image features are mapped by the network to the same dimension m; and in the verification or test stage, the result output by the network is inverse-standardized with the stored mean and variance to obtain the final result. By standardizing the training-set labels, the method removes the need to select a balance factor in the objective function, improves the generalization ability of the neural network, and improves the accuracy of relative pose prediction.

Description

Deep learning visual inertial navigation combined navigation design method based on standardization
Technical Field
The invention relates to the technical field of sensor fusion and motion estimation, in particular to a deep learning visual inertial navigation combined navigation design method based on standardization.
Background
With the continuous development of automatic driving and unmanned aerial vehicles, high-precision, high-robustness positioning is an important prerequisite for autonomous navigation and for tasks such as exploring unknown areas. In a purely visual odometry approach, the system acquires information about the surrounding environment with a visual sensor and estimates its motion state by analysing the image data. A visual-inertial odometer adds Inertial Measurement Unit (IMU) information on top of pure visual odometry, and can improve the accuracy of motion-state estimation when the visual input is lost.
Conventional visual-inertial odometry has been studied thoroughly, but problems such as data loss and data corruption are still not well solved, and a large amount of manual feature selection and extrinsic calibration between sensors is required, which is undoubtedly time-consuming. In recent years, deep learning has achieved enormous success in the field of computer vision and is widely applied. Visual-inertial combined navigation can be treated as a regression task and can likewise be trained with deep learning. In existing deep-learning-based visual-inertial combined navigation, the objective function balances the learning of translation and rotation through a balance factor, and finding this factor requires a large amount of training time, which undoubtedly consumes manpower and material resources. To address this problem, a new objective function is needed that avoids manually setting the balance factor.
Disclosure of Invention
The invention aims to overcome the above defects in the prior art and provides a deep learning visual inertial navigation combined navigation design method based on standardization, so that manual setting of balance factors for relative translation and relative rotation is avoided.
The purpose of the invention can be achieved by the following technical solution:
a deep learning visual inertial navigation combined navigation design method based on standardization comprises the following steps:
s1, establishing a deep learning network model, wherein the deep learning network model comprises a first main module, a second main module and a third main module, and the first main module is formed by stacking 10 layers of CNNs and is called a main module A; the second main module comprises two layers of Bi-LSTM, called main module B; the third main module is called as a main module C, the main module C comprises a first sub module, a second sub module and a third sub module, the first sub module is an Attention sub module, the second sub module is a two-layer Bi-LSTM sub module, and the third sub module is a full-connection layer sub module; inputting image data to a main module A to extract image characteristics; inputting inertial navigation data into the main module B to extract inertial navigation characteristics, and ensuring that the dimensionality of the inertial navigation characteristics is consistent with the dimensionality of image characteristics; the image characteristic and the inertial navigation characteristic are serially connected and input into an Attention submodule in a main module C, the output of the Attention submodule is multiplied by the input of the Attention submodule and then input into two layers of Bi-LSTM submodules of the main module C, and the output of the two layers of Bi-LSTM submodules is input into a fully-connected layer submodule to output a result;
s2, designing a loss function, standardizing the labels of the training set, transforming the labels of the training set into distribution with a mean value of 0 and a variance of 1, storing the mean value and the variance obtained by standardized calculation, and subtracting the standardized labels from the output of the all-connection layer sub-module in the main module C to obtain a final loss function;
s3, training and storing results, wherein an activation function adopted by all-connection layer sub-modules in the main module A and the main module C is Relu, an activation function adopted by an Attention sub-module in the main module C is Relu and Sigmoid, training data are input to train the deep learning network model constructed in the step S1, and the deep learning network model is stored to an appointed path after training is finished;
and S4, inputting the test data into the deep learning network model obtained by training in the step S3 to obtain an output result, and then carrying out inverse standardization through the mean value and the variance obtained in the step S2 to obtain a prediction result.
Further, the navigation design method comprises a test verification step, the process being as follows:
Four extreme conditions are simulated: no data damage, image data occluded by a foreign object, inertial navigation data loss, and image data loss. The corresponding test data under these four conditions are input into the deep learning network model trained in step S3 for testing, and the output of the deep learning network model is inverse-standardized with the mean and variance stored in step S2 to obtain the prediction result.
Further, in the navigation design method the training set and the test set are divided as follows: sequences 00-08 of the KITTI dataset are used as the training set, and sequences 09 and 10 are used as the test set.
Further, main module A is formed by stacking 10 layers of CNNs in sequence, where every layer is a two-dimensional convolution; the convolution kernels of the first three CNN layers are 7 × 7, 5 × 5 and 5 × 5, and the convolution kernels of the remaining seven CNN layers are all 3 × 3. Main module B consists of two layers of Bi-LSTM, each containing 512 neurons. Main module C comprises an Attention sub-module, a two-layer Bi-LSTM sub-module and a fully-connected layer sub-module, where each Bi-LSTM layer in the two-layer Bi-LSTM sub-module contains 1000 neurons; the Attention sub-module consists of two fully-connected layers, the activation function of the first fully-connected layer being ReLU and that of the second being Sigmoid; the fully-connected layer sub-module is a cascade of four fully-connected layers whose numbers of neurons are 512, 128, 64 and 6 respectively.
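For illustration only, the following PyTorch sketch follows the module description above. The kernel sizes, Bi-LSTM widths and fully-connected layer sizes come from the text, while the channel widths, strides, the fused feature dimension m, the two-frame image input and the use of a single fused time step are assumptions made to keep the example self-contained.

```python
import torch
import torch.nn as nn

class MainModuleA(nn.Module):
    """Image-feature extractor: 10 stacked 2-D convolutions (7x7, 5x5, 5x5, then seven 3x3)."""
    def __init__(self, m=512):
        super().__init__()
        kernels = [7, 5, 5] + [3] * 7
        chans = [6, 64, 128, 256, 256, 512, 512, 512, 512, 512, 512]  # assumed widths; input = two stacked RGB frames
        layers = []
        for i, k in enumerate(kernels):
            layers += [nn.Conv2d(chans[i], chans[i + 1], k, stride=2, padding=k // 2),  # stride is an assumption
                       nn.ReLU(inplace=True)]
        self.cnn = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.proj = nn.Linear(chans[-1], m)          # reduce the image feature to dimension m

    def forward(self, img_pair):                     # (B, 6, H, W)
        return self.proj(self.pool(self.cnn(img_pair)).flatten(1))   # (B, m)

class MainModuleB(nn.Module):
    """Inertial-feature extractor: two-layer Bi-LSTM with 512 neurons per layer."""
    def __init__(self, imu_dim=6, m=512):
        super().__init__()
        self.lstm = nn.LSTM(imu_dim, 512, num_layers=2, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * 512, m)            # raise the inertial feature to dimension m

    def forward(self, imu_seq):                      # (B, T, 6)
        out, _ = self.lstm(imu_seq)
        return self.proj(out[:, -1])                 # (B, m)

class MainModuleC(nn.Module):
    """Attention sub-module, two-layer Bi-LSTM (1000 neurons per layer), FC sub-module (512-128-64-6)."""
    def __init__(self, m=512):
        super().__init__()
        d = 2 * m                                    # size of the concatenated image + inertial feature
        self.attention = nn.Sequential(nn.Linear(d, d), nn.ReLU(),
                                       nn.Linear(d, d), nn.Sigmoid())
        self.lstm = nn.LSTM(d, 1000, num_layers=2, batch_first=True, bidirectional=True)
        self.fc = nn.Sequential(nn.Linear(2 * 1000, 512), nn.ReLU(),
                                nn.Linear(512, 128), nn.ReLU(),
                                nn.Linear(128, 64), nn.ReLU(),
                                nn.Linear(64, 6))    # 3 relative translations + 3 relative rotations

    def forward(self, img_feat, imu_feat):
        x = torch.cat([img_feat, imu_feat], dim=1)   # concatenate the two same-dimension features
        x = x * self.attention(x)                    # output of the Attention sub-module times its input
        out, _ = self.lstm(x.unsqueeze(1))           # simplification: a single fused time step
        return self.fc(out[:, -1])                   # (B, 6) prediction in standardized label space
```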
Further, the standardization in step S2 proceeds as follows:
The training set label mean is calculated as

u = \frac{1}{n} \sum_{i=1}^{n} Y_{raw,i}

the training set label variance is calculated as

\sigma^{2} = \frac{1}{n} \sum_{i=1}^{n} \left( Y_{raw,i} - u \right)^{2}

and the training set labels are standardized as

\hat{Y} = \frac{Y_{raw} - u}{\sigma}

where n is the number of training set labels; Y_raw is an original training set label containing the relative translation along the x, y, z axes and the relative rotation about the x, y, z axes, of dimension 6; u is the mean of the relative translations and relative rotations, of dimension 6; σ² is the variance of the relative translations and relative rotations, of dimension 6; σ is the standard deviation corresponding to σ²; and Ŷ is the standardized label.
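As an illustration, this standardization can be written in a few lines of NumPy. The array shapes (n labels of dimension 6) follow the text; the stand-in data and the file name used to save the statistics are hypothetical.

```python
import numpy as np

def standardize_labels(y_raw):
    """y_raw: (n, 6) array of relative translations (x, y, z) and relative rotations (x, y, z)."""
    u = y_raw.mean(axis=0)               # per-dimension mean, shape (6,)
    sigma = y_raw.std(axis=0)            # per-dimension standard deviation (ddof=0 matches the 1/n variance)
    y_norm = (y_raw - u) / sigma         # zero-mean, unit-variance labels
    return y_norm, u, sigma

# usage sketch with random stand-in labels; u and sigma are saved for the test stage
y_train = np.random.randn(1000, 6)
y_norm, u, sigma = standardize_labels(y_train)
np.savez("label_stats.npz", u=u, sigma=sigma)   # hypothetical file name
```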
Further, the loss function used in step S3 is

L = \frac{1}{B \cdot k} \sum_{i=1}^{B} \sum_{t=1}^{k} \left| y_{i,t} - \hat{y}_{i,t} \right|

where B is the batch size of a single input during training and i is the index within the batch; k is the dimension of the label, equal to 6; t is the element index into the output of the fully-connected layer sub-module in main module C and into the standardized label; \hat{y}_{i,t} is the element of the standardized label Ŷ at position t for the i-th sample; y_{i,t} is the element of the output of the fully-connected layer sub-module in main module C at position t after the i-th group of data of the batch is fed into the deep learning network model; and the absolute value of their difference is taken.
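Read as the mean absolute error between the network output and the standardized label (the exact averaging constant over B and k is not stated in the text and is assumed here), the loss can be sketched in PyTorch as:

```python
import torch

def normalized_l1_loss(pred, label_norm):
    """pred: (B, 6) output of the fully-connected layer sub-module in main module C.
    label_norm: (B, 6) standardized labels. Mean absolute difference over batch and label dimensions."""
    return (pred - label_norm).abs().mean()

# under this reading the loss is equivalent to torch.nn.L1Loss() applied to the standardized labels
```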
Further, in step S3 a fixed learning rate of 0.0001 is used, the number of epochs is 200, the batch size is 8, and an Adam optimizer is adopted.
Further, the inverse standardization in step S4 is

Y_{inv} = \sigma \cdot Y_{out} + u

where σ is the standard deviation of the training set labels, u is their mean, Y_inv is the final predicted value after inverse standardization, of dimension 6, and Y_out is the output of the fully-connected layer sub-module in the third main module C in step S1, of dimension 6.
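A minimal sketch of this inverse standardization, assuming u and sigma are the statistics saved in step S2 (the values below are stand-ins):

```python
import numpy as np

def inverse_standardize(y_pred_norm, u, sigma):
    """Map the dimension-6 network output back to metric relative translation and rotation."""
    return y_pred_norm * sigma + u

# example with stand-in statistics; in practice u and sigma are the values saved in step S2
u, sigma = np.zeros(6), np.ones(6)
print(inverse_standardize(np.full(6, 0.5), u, sigma))
```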
Further, the various extreme conditions in step S4 are simulated as follows:
for image data occluded by a foreign object, pictures are randomly selected from the test set, a pixel coordinate is randomly chosen within each selected picture, and a black mask block of size 100 × 100 centred on that pixel coordinate is added;
for inertial navigation data loss, inertial navigation data are randomly selected from the test set and the selected inertial navigation data are set to zero;
for image data loss, pictures are randomly selected from the test set and the selected pictures are replaced with pure black pictures.
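The three damage conditions can be simulated with a few lines of NumPy; the image layout (H × W × 3, uint8) and the stand-in frame size are assumptions.

```python
import numpy as np

def occlude_image(img, block=100):
    """Foreign-object occlusion: add a black block of size block x block at a random centre."""
    h, w = img.shape[:2]
    cy, cx = np.random.randint(h), np.random.randint(w)
    out = img.copy()
    out[max(cy - block // 2, 0):cy + block // 2, max(cx - block // 2, 0):cx + block // 2] = 0
    return out

def drop_imu(imu_seq):
    """Inertial navigation data loss: set the selected inertial navigation data to zero."""
    return np.zeros_like(imu_seq)

def drop_image(img):
    """Image data loss: replace the selected picture with a pure black picture."""
    return np.zeros_like(img)

# example on a stand-in KITTI-sized frame and a short IMU segment
frame = np.random.randint(0, 256, (376, 1241, 3), dtype=np.uint8)
imu = np.random.randn(10, 6)
occluded, imu_lost, img_lost = occlude_image(frame), drop_imu(imu), drop_image(frame)
```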
Compared with the prior art, the invention has the following advantages and effects:
1. In the standardization-based deep learning visual inertial navigation combined navigation design method of the invention, the ground-truth labels of the training set are standardized. This avoids the need, present in other deep learning methods, to manually set a balance factor that balances the learning of relative translation and relative rotation, improves the generalization ability of the design method, and saves the time consumed by manually tuning the balance factor.
2. In the method of the invention, the dimension of the image feature is reduced to m and the dimension of the inertial navigation feature is raised to m, so that the image feature and the inertial navigation feature have the same dimension and their contributions are balanced.
3. In the method of the invention, an attention mechanism is introduced: the features are adaptively weighted, unnecessary features are suppressed, and the accuracy of motion-state estimation is improved.
Drawings
FIG. 1 is a flowchart of a standardized deep learning-based visual inertial navigation integrated navigation design method disclosed in the embodiment of the present invention;
FIG. 2 is a diagram of an Attention module in an embodiment of the present invention;
FIG. 3 is an overview of a deep learning network model in an embodiment of the invention;
FIG. 4 is a trajectory diagram for scene sequence 09 of the KITTI dataset.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
As shown in FIG. 1, this embodiment discloses a deep learning visual inertial navigation combined navigation design method based on standardization, in which the training set and the test set are divided as follows: sequences 00-08 of the KITTI dataset are used as the training set, and sequences 09 and 10 are used as the test set.
The method comprises the following steps:
s1, establishing a deep learning network model, wherein the deep learning network model comprises a first main module, a second main module and a third main module, and the first main module is formed by stacking 10 layers of CNNs and is called a main module A; the second main module comprises two layers of Bi-LSTM, called main module B; the third main module is called as a main module C, the main module C comprises a first sub module, a second sub module and a third sub module, the first sub module is an Attention sub module, the second sub module is a two-layer Bi-LSTM sub module, and the third sub module is a full-connection layer sub module; inputting image data into a main module A to extract image characteristics; inputting inertial navigation data into the main module B to extract inertial navigation characteristics, and ensuring that the dimensionality of the inertial navigation characteristics is consistent with the dimensionality of image characteristics; the image characteristic and the inertial navigation characteristic are serially connected and input into an Attention submodule in a main module C, the output of the Attention submodule is multiplied by the input of the Attention submodule and then input into two layers of Bi-LSTM submodules of the main module C, and the output of the two layers of Bi-LSTM submodules is input into a fully-connected layer submodule to output a result;
s2, designing a loss function, standardizing labels of the training set, converting the labels of the training set into distribution with a mean value of 0 and a variance of 1, storing the mean value and the variance obtained by standardized calculation, and subtracting the standardized labels from the output of the all-connection layer sub-module in the main module C to obtain a final loss function;
s3, training and storing results, wherein an activation function adopted by a fully-connected layer sub-module in the main module A and a fully-connected layer sub-module in the main module C is Relu, an activation function adopted by an Attention sub-module in the main module C is Relu and Sigmoid, the learning rate is fixed, training data are input to train the deep learning network model constructed in the step S1, and the deep learning network model is stored to an appointed path after training is finished;
and S4, inputting the test data into the deep learning network model obtained by training in the step S3 to obtain an output result, and then carrying out inverse standardization through the mean value and the variance obtained in the step S2 to obtain a prediction result.
In addition, the navigation design method comprises a test verification step, the process being as follows:
Four extreme conditions are simulated: no data damage, image data occluded by a foreign object, inertial navigation data loss, and image data loss. The corresponding test data under these four conditions are input into the deep learning network model trained in step S3 for testing, and the output of the deep learning network model is inverse-standardized with the mean and variance stored in step S2 to obtain the prediction result.
Example two
On the basis of the method for designing the deep learning visual inertial navigation combination navigation based on standardization disclosed by the embodiment, the embodiment further discloses that the structure of the deep learning network model is as follows:
The structural parameters of main module A are shown in Table 1. Main module A is formed by stacking 10 layers of CNNs in sequence, where every layer is a two-dimensional convolution; the convolution kernels of the first three CNN layers are 7 × 7, 5 × 5 and 5 × 5, and the convolution kernels of the remaining seven CNN layers are all 3 × 3. Main module B consists of two layers of Bi-LSTM, each containing 512 neurons. Main module C comprises an Attention sub-module, a two-layer Bi-LSTM sub-module and a fully-connected layer sub-module, where each Bi-LSTM layer in the two-layer Bi-LSTM sub-module contains 1000 neurons; the Attention sub-module consists of two fully-connected layers, its structure is shown in FIG. 2, the activation function of the first fully-connected layer is ReLU, and the activation function of the second fully-connected layer is Sigmoid; the fully-connected layer sub-module is a cascade of four fully-connected layers whose numbers of neurons are 512, 128, 64 and 6 respectively.
Table 1. Structural parameters of main module A (the table body is provided as an image in the original publication)
In the parameter columns of Table 1, K is the convolution kernel size, S is the convolution stride, and P indicates whether zero padding is applied (zero padding is performed when P is 1).
the specific implementation of step S2 is as follows:
when the task of the deep learning network model is to predict the relative translation of the x, y and z axes and the relative rotation of the x, y and z axes, the prediction effect of the relative translation of the x, y and z axes is good and the prediction effect of the relative rotation of the x, y and z axes is poor due to the fact that the magnitude of the relative translation of the x, y and z axes and the magnitude difference of the relative rotation of the x, y and z axes are extremely large; in order to balance the training of the relative translation of the x, y, and z axes and the relative rotation of the x, y, and z axes, the relative translation of the x, y, and z axes and the relative rotation of the x, y, and z axes are often equal in order of magnitude by a scaling factor, but the selection of the scaling factor requires multiple experiments to determine the scaling factor, and the label is normalized, that is, the normalization of the relative translation of the x, y, and z axes and the relative rotation of the x, y, and z axes does not require the addition of the scaling factor, and the label normalization process in step S2 is as follows:
The training set label mean is calculated according to

u = \frac{1}{n} \sum_{i=1}^{n} Y_{raw,i}

the training set label variance is calculated according to

\sigma^{2} = \frac{1}{n} \sum_{i=1}^{n} \left( Y_{raw,i} - u \right)^{2}

and the training set labels are standardized according to

\hat{Y} = \frac{Y_{raw} - u}{\sigma}

where n is the number of training set labels; Y_raw is an original training set label containing the relative translation along the x, y, z axes and the relative rotation about the x, y, z axes, of dimension 6; u is the mean of the relative translations and relative rotations, of dimension 6; σ² is the variance of the relative translations and relative rotations, of dimension 6; σ is the standard deviation corresponding to σ²; and Ŷ is the standardized label.
The loss function described in step S2 is calculated according to

L = \frac{1}{B \cdot k} \sum_{i=1}^{B} \sum_{t=1}^{k} \left| y_{i,t} - \hat{y}_{i,t} \right|

where B is the batch size of a single input during training and i is the index within the batch; k is the dimension of the label, equal to 6; t is the element index into the output of the fully-connected layer sub-module in main module C and into the standardized label; \hat{y}_{i,t} is the element of the standardized label Ŷ at position t for the i-th sample; y_{i,t} is the element of the output of the fully-connected layer sub-module in main module C at position t after the i-th group of data of the batch is fed into the deep learning network model; and the absolute value of their difference is taken.
The specific implementation of step S3 is as follows:
A fixed learning rate of 0.0001 is used, the number of epochs is 200, the batch size is 8, and the Adam optimizer is adopted.
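Translated into PyTorch, this training configuration might look as follows. Only the optimizer, learning rate, epoch count, batch size and the saving of the model come from the text; the stand-in network, data and file name are placeholders so the snippet runs on its own.

```python
import torch

# Stand-in objects; in the real method "model" is the combination of main modules A, B and C
# and the loader yields (image pair, IMU sequence, standardized label) batches of size 8.
model = torch.nn.Linear(6, 6)                                               # placeholder network
train_loader = [(torch.randn(8, 6), torch.randn(8, 6)) for _ in range(4)]   # (input, standardized label)

def normalized_l1_loss(pred, label_norm):
    return (pred - label_norm).abs().mean()

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)        # fixed learning rate 0.0001

for epoch in range(200):                                         # 200 epochs
    for inputs, label_norm in train_loader:                      # batch size 8
        loss = normalized_l1_loss(model(inputs), label_norm)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

torch.save(model.state_dict(), "normalize_vio.pth")              # save to a specified path
```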
The specific implementation of step S4 is as follows:
Inverse standardization is performed according to

Y_{inv} = \sigma \cdot Y_{out} + u

where σ is the standard deviation of the training set labels, u is their mean, Y_inv is the final predicted value after inverse standardization, of dimension 6, and Y_out is the output of the fully-connected layer sub-module in the third main module C in step S1, of dimension 6.
The various extreme conditions in step S4 are simulated as follows:
for image data occluded by a foreign object, pictures are randomly selected from the test set, a pixel coordinate is randomly chosen within each selected picture, and a black mask block of size 100 × 100 centred on that pixel coordinate is added;
for inertial navigation data loss, inertial navigation data are randomly selected from the test set and the selected inertial navigation data are set to zero;
for image data loss, pictures are randomly selected from the test set and the selected pictures are replaced with pure black pictures.
Four folders are created, and after the four extreme conditions are simulated, the data for the three damaged conditions (image data occluded by a foreign object, inertial navigation data loss and image data loss) are stored in the corresponding folders. The deep learning network model trained in step S3 is then tested under the four extreme conditions, with the following results:
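A sketch of this test procedure with stand-in data is given below; only the four-condition loop and the inverse standardization step come from the text, while the placeholder model, statistics and tensors are assumptions made so the snippet runs on its own.

```python
import torch

# Stand-in model and data; in the real test step the four folders hold the undamaged,
# occluded, IMU-loss and image-loss versions of KITTI test sequences 09 and 10.
model = torch.nn.Linear(6, 6)
u, sigma = torch.zeros(6), torch.ones(6)        # statistics saved in step S2 (placeholders)
conditions = {"no_damage": torch.randn(4, 6), "image_occluded": torch.randn(4, 6),
              "imu_lost": torch.randn(4, 6), "image_lost": torch.randn(4, 6)}

model.eval()
with torch.no_grad():
    for name, batch in conditions.items():
        pred = model(batch) * sigma + u         # inverse standardization back to metric units
        print(name, pred.shape)                 # downstream: accumulate translation/rotation errors
```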
table 2 shows that the method for designing the standardized deep learning-based visual inertial navigation integrated navigation system (hereinafter, referred to as normaize _ VIO) according to the present invention is superior to Soft-fusion method based on deep learning (hereinafter, referred to as Soft _ VIO) in the present invention, as shown in table 2, when the data is not damaged:
Table 2. Comparison of the two methods with undamaged data (the table body is provided as an image in the original publication)
In Table 2, m denotes metres and rad denotes radians.
Table 3 compares the performance of the two methods when inertial navigation data are lost; the results show that Normalize_VIO is more accurate than Soft_VIO.
Table 3. Comparison of the two methods with inertial navigation data loss (the table body is provided as an image in the original publication)
Table 4 compares the two methods when the image data are occluded by a foreign object; the results show that Normalize_VIO is more accurate than Soft_VIO.
Table 4. Comparison of the two methods with image data occluded by a foreign object (the table body is provided as an image in the original publication)
Table 5 compares the two methods when image data are lost; the results show that Normalize_VIO is more accurate than Soft_VIO.
Table 5. Comparison of the two methods with image data loss (the table body is provided as an image in the original publication)
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (8)

1. A deep learning visual inertial navigation combined navigation design method based on standardization is characterized by comprising the following steps:
s1, establishing a deep learning network model, wherein the deep learning network model comprises a first main module, a second main module and a third main module, and the first main module is formed by stacking 10 layers of CNNs and is called a main module A; the second main module comprises two layers of Bi-LSTM, called main module B; the third main module is called as a main module C, the main module C comprises a first sub module, a second sub module and a third sub module, the first sub module is an Attention sub module, the second sub module is a two-layer Bi-LSTM sub module, and the third sub module is a full-connection layer sub module; inputting image data into a main module A to extract image characteristics; inertial navigation data are input into a main module B to extract inertial navigation characteristics, and the dimension of the inertial navigation characteristics is ensured to be consistent with the dimension of the image characteristics; the image characteristic and the inertial navigation characteristic are serially connected and input into an Attention sub-module in a main module C, the output of the Attention sub-module is multiplied by the input of the Attention sub-module and then input into two layers of Bi-LSTM sub-modules of the main module C, and the output of the two layers of Bi-LSTM sub-modules is input into a fully-connected layer sub-module to output a result;
s2, designing a loss function, standardizing labels of the training set, transforming the labels of the training set into distribution with a mean value of 0 and a variance of 1, storing the mean value and the variance obtained by standardized calculation, and subtracting the standardized labels from the output of the all-connection layer sub-module in the main module C to obtain a final loss function, wherein the standardization process comprises the following steps:
the training set label mean is calculated as

u = \frac{1}{n} \sum_{i=1}^{n} Y_{raw,i}

the training set label variance is calculated as

\sigma^{2} = \frac{1}{n} \sum_{i=1}^{n} \left( Y_{raw,i} - u \right)^{2}

and the training set labels are standardized as

\hat{Y} = \frac{Y_{raw} - u}{\sigma}

wherein n is the number of training set labels; Y_raw is an original training set label containing the relative translation along the x, y, z axes and the relative rotation about the x, y, z axes, of dimension 6; u is the mean of the relative translations and relative rotations, of dimension 6; σ² is the variance of the relative translations and relative rotations, of dimension 6; σ is the standard deviation corresponding to σ²; and Ŷ is the standardized label;
the loss function is as follows:

L = \frac{1}{B \cdot k} \sum_{i=1}^{B} \sum_{t=1}^{k} \left| y_{i,t} - \hat{y}_{i,t} \right|

wherein B is the batch size of a single input during training and i is the index within the batch; k is the dimension of the label, equal to 6; t is the element index into the output of the fully-connected layer sub-module in main module C and into the standardized label; \hat{y}_{i,t} is the element of the standardized label at position t for the i-th sample; y_{i,t} is the element of the output of the fully-connected layer sub-module in main module C at position t after the i-th group of data of the batch is fed into the deep learning network model; and the absolute value of their difference is taken;
s3, training and storing results, inputting training data to train the deep learning network model constructed in the step S1, and storing the deep learning network model to a specified path after training is finished;
and S4, inputting the test data into the deep learning network model obtained by training in the step S3 to obtain an output result, and then carrying out inverse standardization through the mean value and the variance obtained in the step S2 to obtain a prediction result.
2. The deep learning visual inertial navigation combined navigation design method based on standardization according to claim 1, wherein the navigation design method further comprises a test verification step, the process being as follows:
four extreme conditions are simulated, namely no data damage, image data occluded by a foreign object, inertial navigation data loss, and image data loss; the corresponding test data under these four conditions are input into the deep learning network model trained in step S3 for testing, and the output of the deep learning network model is inverse-standardized with the mean and variance stored in step S2 to obtain the prediction result.
3. The deep learning visual inertial navigation combined navigation design method based on standardization according to claim 1, wherein the training set and the test set are divided as follows: sequences 00-08 of the KITTI dataset are used as the training set, and sequences 09 and 10 are used as the test set.
4. The deep learning visual inertial navigation combined navigation design method based on standardization according to claim 1, wherein main module A is formed by stacking 10 layers of CNNs in sequence, where every layer is a two-dimensional convolution; the convolution kernel sizes of the first three CNN layers are 7 × 7, 5 × 5 and 5 × 5, and those of the remaining seven CNN layers are all 3 × 3; main module B consists of two layers of Bi-LSTM, each containing 512 neurons; main module C comprises an Attention sub-module, a two-layer Bi-LSTM sub-module and a fully-connected layer sub-module, where each Bi-LSTM layer in the two-layer Bi-LSTM sub-module contains 1000 neurons; the Attention sub-module consists of two fully-connected layers, the activation function of the first fully-connected layer being ReLU and that of the second being Sigmoid; and the fully-connected layer sub-module is a cascade of four fully-connected layers whose numbers of neurons are 512, 128, 64 and 6 respectively.
5. The method according to claim 1, wherein in step S3 a fixed learning rate of 0.0001 is used, the number of epochs is 200, the batch size is 8, and an Adam optimizer is adopted.
6. The deep learning visual inertial navigation combined navigation design method based on standardization according to claim 1, wherein the inverse standardization in step S4 is as follows:

Y_{inv} = \sigma \cdot Y_{out} + u

wherein σ is the standard deviation of the training set labels, u is their mean, Y_inv is the final predicted value after inverse standardization, of dimension 6, and Y_out is the output result of the fully-connected layer sub-module in the third main module C in step S1, of dimension 6.
7. The deep learning visual inertial navigation combined navigation design method based on standardization according to claim 1, wherein the various extreme conditions in step S4 are simulated as follows:
for image data occluded by a foreign object, pictures are randomly selected from the test set, a pixel coordinate is randomly chosen within each selected picture, and a black mask block of size 100 × 100 centred on that pixel coordinate is added;
for inertial navigation data loss, inertial navigation data are randomly selected from the test set and the selected inertial navigation data are set to zero;
for image data loss, pictures are randomly selected from the test set and the selected pictures are replaced with pure black pictures.
8. The deep learning visual inertial navigation combined navigation design method based on standardization according to claim 1, wherein the activation function used in main module A and by the fully-connected layer sub-module in main module C is ReLU, and the activation functions used by the Attention sub-module in main module C are ReLU and Sigmoid.
CN202110171232.3A 2021-02-08 2021-02-08 Deep learning visual inertial navigation combined navigation design method based on standardization Active CN112801201B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110171232.3A CN112801201B (en) 2021-02-08 2021-02-08 Deep learning visual inertial navigation combined navigation design method based on standardization

Publications (2)

Publication Number Publication Date
CN112801201A CN112801201A (en) 2021-05-14
CN112801201B true CN112801201B (en) 2022-10-25

Family

ID=75814791

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110171232.3A Active CN112801201B (en) 2021-02-08 2021-02-08 Deep learning visual inertial navigation combined navigation design method based on standardization

Country Status (1)

Country Link
CN (1) CN112801201B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113392904B (en) * 2021-06-16 2022-07-26 华南理工大学 LTC-DNN-based visual inertial navigation combined navigation system and self-learning method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190102692A1 (en) * 2017-09-29 2019-04-04 Here Global B.V. Method, apparatus, and system for quantifying a diversity in a machine learning training data set
US11030486B2 (en) * 2018-04-20 2021-06-08 XNOR.ai, Inc. Image classification through label progression
CA3061717A1 (en) * 2018-11-16 2020-05-16 Royal Bank Of Canada System and method for a convolutional neural network for multi-label classification with partial annotations
US11670001B2 (en) * 2019-05-17 2023-06-06 Nvidia Corporation Object pose estimation
CN111210435B (en) * 2019-12-24 2022-10-18 重庆邮电大学 Image semantic segmentation method based on local and global feature enhancement module

Also Published As

Publication number Publication date
CN112801201A (en) 2021-05-14


Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant