CN114690230A - Automatic driving vehicle navigation method based on visual inertia SLAM - Google Patents

Automatic driving vehicle navigation method based on visual inertia SLAM

Info

Publication number
CN114690230A
CN114690230A (application CN202210480382.7A)
Authority
CN
China
Prior art keywords
imu
representing
camera
dynamic target
bias
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210480382.7A
Other languages
Chinese (zh)
Inventor
伍锡如
黄凤堂
刘金霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guilin University of Electronic Technology
Original Assignee
Guilin University of Electronic Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guilin University of Electronic Technology filed Critical Guilin University of Electronic Technology
Priority to CN202210480382.7A priority Critical patent/CN114690230A/en
Publication of CN114690230A publication Critical patent/CN114690230A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G01: MEASURING; TESTING
    • G01S: RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S 19/00: Satellite radio beacon positioning systems; Determining position, velocity or attitude using signals transmitted by such systems
    • G01S 19/38: Determining a navigation solution using signals transmitted by a satellite radio beacon positioning system
    • G01S 19/39: Determining a navigation solution using signals transmitted by a satellite radio beacon positioning system the satellite radio beacon positioning system transmitting time-stamped messages, e.g. GPS [Global Positioning System], GLONASS [Global Orbiting Navigation Satellite System] or GALILEO
    • G01S 19/42: Determining position
    • G01S 19/48: Determining position by combining or switching between position solutions derived from the satellite radio beacon positioning system and position solutions derived from a further system
    • G01S 19/485: Determining position by combining or switching between position solutions derived from the satellite radio beacon positioning system and position solutions derived from a further system whereby the further system is an optical system or imaging system
    • G: PHYSICS
    • G01: MEASURING; TESTING
    • G01C: MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C 21/00: Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C 21/10: Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 by using measurements of speed or acceleration
    • G01C 21/12: Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 by using measurements of speed or acceleration executed aboard the object being navigated; Dead reckoning
    • G01C 21/16: Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 by using measurements of speed or acceleration executed aboard the object being navigated; Dead reckoning by integrating acceleration or speed, i.e. inertial navigation
    • G01C 21/165: Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 by using measurements of speed or acceleration executed aboard the object being navigated; Dead reckoning by integrating acceleration or speed, i.e. inertial navigation combined with non-inertial navigation instruments
    • G01C 21/1656: Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 by using measurements of speed or acceleration executed aboard the object being navigated; Dead reckoning by integrating acceleration or speed, i.e. inertial navigation combined with non-inertial navigation instruments with passive imaging devices, e.g. cameras
    • G: PHYSICS
    • G01: MEASURING; TESTING
    • G01C: MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C 3/00: Measuring distances in line of sight; Optical rangefinders
    • G01C 3/02: Details

Landscapes

  • Engineering & Computer Science (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Electromagnetism (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Automation & Control Theory (AREA)
  • Navigation (AREA)

Abstract

The invention discloses an automatic driving vehicle navigation method based on visual inertia SLAM, specifically a navigation method that combines visual inertia SLAM with a target detection network. The method aims to remove the influence of dynamic targets in the road scene and to address the reduced navigation accuracy of automatic driving vehicles in dynamic environments. The method is realized by the following steps. Step one: the vehicle-mounted sensors collect images, acceleration and angular velocity, and the data are then aligned. Step two: dynamic targets are detected and their influence on pose estimation is removed. Step three: the IMU data are pre-integrated so that the integration does not have to be repeated after every optimization update, which reduces the computation. Step four: initialization calculates each parameter. Step five: tightly coupled optimization calculates the marginalized prior information, the IMU measurement residual and the visual observation residual. The invention is suitable for visual inertial navigation systems.

Description

Automatic driving vehicle navigation method based on visual inertia SLAM
Technical Field
The invention relates to the technical field of automatic driving vehicle navigation, in particular to an automatic driving vehicle navigation method based on visual inertia SLAM.
Background
The application of visual inertial SLAM to autonomous vehicles faces many problems and challenges, and several key technical issues remain to be solved. For example, the performance of visual inertial SLAM depends to a large extent on the assumption that the surrounding visual features are static, so the navigation accuracy of the system drops drastically in a dynamic environment. In order to improve the positioning and state estimation accuracy of automatic driving vehicles in dynamic environments, the invention provides a visual inertial navigation system combined with dynamic target detection, and provides theoretical guidance and a solution for applying visual inertial SLAM to automatic driving vehicles.
In order to solve the problems of reduced pose estimation accuracy, trajectory drift and even navigation failure of visual inertia SLAM in dynamic environments, many solutions have been proposed at home and abroad. One existing approach combines a semantic segmentation neural network to remove dynamic feature points, but its computational cost is too large to meet real-time requirements. Another approach uses a random sample consensus algorithm to remove outliers, but when there are too many outliers in the scene it causes drift and can even fail. A third approach detects moving objects with a depth camera and then eliminates the feature points on those objects, but the restricted camera also limits the application scenarios and the on-board sensors. Because of these limitations in real-time performance, computation and sensors, the above solutions still do not allow visual inertia SLAM to be applied well to autonomous vehicles.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides an automatic driving vehicle navigation method based on visual inertia SLAM, which improves the navigation accuracy of automatic driving vehicles.
The invention is realized by adopting the following technical scheme:
the automatic driving vehicle navigation method based on visual inertia SLAM is realized by adopting the following steps:
Step one: a vehicle-mounted binocular camera acquires gray images of the road ahead, an IMU sensor acquires the acceleration and angular velocity of the vehicle, and the main controller aligns the data;
Step two: a dynamic target detection model is trained to detect dynamic targets in the road scene, target bounding boxes are extracted, and the feature points inside the bounding boxes are removed;
Step three: the noise and bias of the IMU are eliminated, the IMU data are pre-integrated, the observed value, the covariance matrix and the Jacobian matrix of the residual are calculated, and the bias is then corrected;
Step four: initialization calculates the absolute scale, the gyroscope bias, the acceleration bias, the gravitational acceleration and the IMU velocity at each moment;
Step five: after initialization, tightly coupled optimization based on a sliding window is performed to obtain a state estimate with higher accuracy and better robustness.
The invention has the following beneficial effects. First, time calibration of the acquired data improves the pose estimation accuracy, and pre-integration of the IMU data reduces the computation consumed during integration. Second, the invention detects dynamic targets in the road environment with a dynamic target detection model trained on the YOLOv5 network. Compared with the model shipped with YOLOv5, this model only detects dynamic targets in road scenes, such as cars, trucks, motorcycles and pedestrians, which reduces the model size and speeds up recognition of dynamic targets. The model is trained on the BDD100K data set; after extensive debugging and parameter tuning, the dynamic target detection model is about 70% smaller than the model shipped with YOLOv5 and its average target detection speed is improved by 63%. The trained dynamic target detection model detects dynamic targets in the road environment; after a target bounding box is extracted with the model, the feature point tracking module at the front end removes the feature points inside the bounding box, thereby eliminating dynamic interference and improving the navigation accuracy of the automatic driving vehicle.
The effect of the invention is verified by the following experiment:
compared with the VINS-Fusion, the method has the functions of closed loop, reloading, high-precision map multiplexing and the like. The present invention has evaluated the proposed method in a common KITTI dataset and in the real world. To align the estimated trajectory with the ground truth, the present invention uses the method of horns's. The method evaluates the accuracy of navigation through the absolute pose error and the absolute track error. The absolute pose error and the absolute trajectory error are calculated by the EVO tool. The invention adopts the absolute track root mean square error to carry out quantitative evaluation. The absolute track error compares the absolute distance between the estimate and the translation component of the ground truth track. The absolute trajectory error at time step i is calculated as follows:
(formula image in the original publication)
where T represents the transformation that aligns the two trajectories; G represents the ground truth trajectory; E represents the estimated trajectory.
For a sequence of N poses, the root mean square error of the absolute trajectory error is calculated as follows:
(formula image in the original publication)
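The two formulas above are reproduced only as images in the published text. For reference, the definitions commonly used by the EVO tool for the absolute trajectory error and its root mean square error, which match the variable descriptions given here, can be written as follows; this is an assumed reconstruction, not a verbatim copy of the patent's images:

```latex
\mathrm{ATE}_{i} = \bigl\| \operatorname{trans}\bigl( G_i^{-1}\, T\, E_i \bigr) \bigr\| ,
\qquad
\mathrm{ATE}_{\mathrm{RMSE}} = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \bigl\| \operatorname{trans}\bigl( G_i^{-1}\, T\, E_i \bigr) \bigr\|^{2} } ,
```

where trans(·) takes the translation component of a pose.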
all values were obtained through five experiments considering the uncertainty of the system, and show the median result of the estimated trajectory accuracy. The present invention uses the KITTI data set to evaluate the proposed method, which is suitable for evaluating the performance of visual inertial SLAM in autonomous vehicles. In this dataset, a real ground is provided for sequences 00-10. In order to provide accurate ground truth, the lidar, camera and GPS are calibrated and synchronized. The KITTI data set is collected from various scene data in the real world, such as rural, highway and urban. The present invention primarily considers sequences 00 and 05-07 because there are moving vehicles and pedestrians in these urban sequences.
In the real-world experiments, a self-made sensor suite is used to demonstrate the proposed method. The car is driven in an outdoor highly dynamic environment, and four different sensor combinations are used for state estimation. For error evaluation, control points are set with an RTK-GPS sensor and serve as ground truth points. To obtain the absolute error, the estimated trajectory is aligned with the control points.
Experiments on KITTI data sets and the real world indicate that the method provided by the invention improves the positioning precision by about 40% in a large-scale dynamic urban environment compared with VINS-Fusion. The comparison experiment result shows that the method improves the accuracy and the robustness in a dynamic environment.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or of the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flow chart of steps one to five in the present invention.
FIG. 2 is the motion trajectory and dense point cloud of the present invention in a KITTI data set during the experiment.
FIG. 3 is a real world experimental set-up of the present invention in an experiment.
FIG. 4 is a plot of the root mean square error of the absolute track versus the absolute track error of the present invention and conventional methods.
Detailed Description
The present invention will be described in detail with reference to specific examples.
The automatic driving vehicle navigation method based on the visual inertia SLAM is realized by adopting the following steps, specifically the flow shown in figure 1:
Step one: a vehicle-mounted binocular camera acquires gray images of the road ahead, an IMU sensor acquires the acceleration and angular velocity of the vehicle, and the main controller aligns the data.
Step two: a dynamic target detection model is trained to detect dynamic targets in the road scene, target bounding boxes are extracted, and the feature points inside the bounding boxes are removed.
Step three: the noise and bias of the IMU are eliminated, the IMU data are pre-integrated, the observed value, the covariance matrix and the Jacobian matrix of the residual are calculated, and the bias is then corrected.
Step four: initialization calculates the absolute scale, the gyroscope bias, the acceleration bias, the gravitational acceleration and the IMU velocity at each moment.
Step five: after initialization, tightly coupled optimization based on a sliding window is performed to obtain a state estimate with higher accuracy and better robustness.
In the first step, the specific steps include:
(1) the vehicle-mounted binocular camera acquires left and right grayscale images at a frequency of 10 Hz;
(2) the vehicle-mounted IMU acquires the three-axis acceleration and angular velocity of the vehicle body at a frequency of 100 Hz;
(3) the main controller aligns the data. Because of trigger delay, transmission delay and unsynchronized clocks, the generated timestamps are not equal to the actual sampling times, resulting in time misalignment between the camera and the IMU. The aligned data are calculated as:
t_IMU = t_cam + t_d
where t_IMU and t_cam represent the timestamps of the IMU and the camera, respectively; t_d represents an unknown constant time offset.
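As an illustration of this alignment step, the following Python sketch shifts the camera timestamps by a constant offset t_d and groups the IMU samples that fall between consecutive frames; the function name and the example offset are assumptions for illustration, not part of the patent.

```python
import numpy as np

def align_camera_imu(cam_stamps, imu_stamps, t_d):
    """Shift camera timestamps by the constant offset t_d (t_IMU = t_cam + t_d)
    and, for each pair of consecutive frames, collect the IMU sample indices
    that fall inside the interval. Timestamps are assumed to be in seconds."""
    cam_in_imu_time = np.asarray(cam_stamps) + t_d   # t_IMU = t_cam + t_d
    imu_stamps = np.asarray(imu_stamps)
    pairs = []
    for k in range(len(cam_in_imu_time) - 1):
        t0, t1 = cam_in_imu_time[k], cam_in_imu_time[k + 1]
        idx = np.where((imu_stamps >= t0) & (imu_stamps < t1))[0]
        pairs.append((k, idx))                       # frame k -> IMU samples in [t_k, t_{k+1})
    return pairs

# Example with a 10 Hz camera, a 100 Hz IMU and a 5 ms offset (illustrative values).
cam = np.arange(0.0, 1.0, 0.1)
imu = np.arange(0.0, 1.0, 0.01)
print(align_camera_imu(cam, imu, t_d=0.005)[:2])
```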
In the second step, the concrete steps include:
(1) Training a dynamic target detection model. A self-trained neural network model can detect the specific dynamic targets more accurately and more quickly. The training images are passed to a neural network to create a custom-trained detector. The BDD100K training set is prepared to train the dynamic target detection model, which is trained using the YOLOv5 network. Like other neural networks, the YOLO network requires a large number of labeled images to be fed into the model to train the parameters of the neural network. The dynamic targets detected by the dynamic target detection model include people, cars, motorcycles, buses and trucks.
The accuracy and running speed of the dynamic target detection model have a decisive influence on the overall result, and much work has been done to improve the accuracy of the model. In practical application, adding the dynamic target detection model to the visual inertia SLAM increases the computation of the algorithm but does not affect the real-time performance of the system.
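A minimal sketch of how such a custom-trained detector could be queried at the front end is given below. It assumes the ultralytics/yolov5 PyTorch Hub interface and a hypothetical weights file best_dynamic.pt trained on BDD100K; it is an illustration under those assumptions, not the patent's implementation.

```python
import torch

# Load a custom-trained YOLOv5 model (the weights path is a hypothetical placeholder).
model = torch.hub.load('ultralytics/yolov5', 'custom', path='best_dynamic.pt')

# Only these classes are treated as dynamic targets in the road scene.
DYNAMIC_CLASSES = {'person', 'car', 'motorcycle', 'bus', 'truck'}

def detect_dynamic_boxes(bgr_image):
    """Run the detector on one image and return [x1, y1, x2, y2] boxes of dynamic targets."""
    results = model(bgr_image)                       # inference on a numpy / OpenCV image
    boxes = []
    for *xyxy, conf, cls in results.xyxy[0].tolist():
        if results.names[int(cls)] in DYNAMIC_CLASSES and conf > 0.4:
            boxes.append([int(v) for v in xyxy])
    return boxes
```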
(2) Extracting the bounding box. The dynamic target detection model provides the classes and two-dimensional bounding boxes of dynamic targets. Three steps are required to obtain the bounding box of a dynamic target. First, the entire image is divided into grid cells of equal size, and the network runs on all cells simultaneously. Second, the network predicts the bounding box of each cell in the output feature map. Finally, in order to obtain the actual coordinates of the predicted bounding box with respect to the original image, the algorithm divides (b_x, b_y, b_w, b_h) by the size of the feature map and multiplies by the size of the original image. The prediction bounding box is calculated as follows:
(formula image in the original publication)
where (b_x, b_y) are the predicted center coordinates in each grid cell; (b_w, b_h) are the width and height of the bounding box; (c_x, c_y) are the top-left coordinates of the grid cell in the feature map; p_w and p_h represent the width and height of the predicted bounding box, respectively.
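Once the dynamic-target boxes are available, the front-end tracker discards the feature points that fall inside them, as described in this step. The sketch below shows that filtering under the assumption that tracked features are stored as (u, v) pixel coordinates; the function name is illustrative.

```python
import numpy as np

def remove_dynamic_features(points_uv, dynamic_boxes):
    """points_uv: (N, 2) array of tracked feature pixel coordinates (u, v).
    dynamic_boxes: list of [x1, y1, x2, y2] boxes from the dynamic target detector.
    Returns only the feature points that lie outside every dynamic bounding box."""
    pts = np.asarray(points_uv, dtype=float)
    keep = np.ones(len(pts), dtype=bool)
    for x1, y1, x2, y2 in dynamic_boxes:
        inside = (pts[:, 0] >= x1) & (pts[:, 0] <= x2) & \
                 (pts[:, 1] >= y1) & (pts[:, 1] <= y2)
        keep &= ~inside
    return pts[keep]

# Example: two features, one inside a detected car box, one outside.
print(remove_dynamic_features([[100, 120], [400, 300]], [[80, 100, 200, 220]]))
```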
In the third step, the concrete steps include:
(1) Eliminating the noise and bias of the IMU. The measured values are affected by the acceleration bias, the gyroscope bias and noise; the formula for eliminating the noise and bias is as follows:
(formula image in the original publication)
where b_a and b_ω represent the random walk noise of the accelerometer and the gyroscope, respectively; a_t represents the ideal value of the accelerometer measurement; ω_t represents the ideal value of the gyroscope measurement; R represents the rotation matrix from the IMU coordinate system to the camera coordinate system; g_w represents the gravity vector. The acceleration noise n_a and the gyroscope noise n_w are Gaussian white noise and are expressed as:
(formula image in the original publication)
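The measurement-model formulas above appear only as images in the published text. The surrounding definitions match the standard model used in VINS-Mono-style visual-inertial systems, which, under that assumption, reads as follows (an assumed reconstruction, not the patent's exact image):

```latex
\hat{a}_t = a_t + b_{a_t} + R\, g^{w} + n_a ,
\qquad
\hat{\omega}_t = \omega_t + b_{\omega_t} + n_{\omega} ,
\qquad
n_a \sim \mathcal{N}\bigl(0, \sigma_a^{2}\bigr), \;\;
n_{\omega} \sim \mathcal{N}\bigl(0, \sigma_{\omega}^{2}\bigr) .
```

Here the hat denotes a raw measurement, b_{a_t} and b_{ω_t} are the accelerometer and gyroscope biases modeled as random walks, R is the rotation matrix referred to in the text, and g^w is the gravity vector.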
(2) Pre-integrating the IMU measurements. For two consecutive frames b_k and b_{k+1}, the inertial measurements over the time interval [t_k, t_{k+1}] are integrated into the local frame b_k; the calculation formula is as follows:
(formula image in the original publication)
where α represents the position pre-integration, β represents the velocity pre-integration, and γ represents the rotation pre-integration.
(formula image in the original publication)
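The pre-integration formulas are likewise given only as images. Under the assumption that the patent follows the usual VINS-Mono formulation, the continuous-time pre-integration terms (with noise omitted) are:

```latex
\alpha_{b_{k+1}}^{b_k} = \iint_{t \in [t_k, t_{k+1}]} R_t^{b_k} \bigl(\hat{a}_t - b_{a_t}\bigr)\, \mathrm{d}t^2 ,
\qquad
\beta_{b_{k+1}}^{b_k} = \int_{t \in [t_k, t_{k+1}]} R_t^{b_k} \bigl(\hat{a}_t - b_{a_t}\bigr)\, \mathrm{d}t ,
\qquad
\gamma_{b_{k+1}}^{b_k} = \int_{t \in [t_k, t_{k+1}]} \tfrac{1}{2}\, \Omega\bigl(\hat{\omega}_t - b_{\omega_t}\bigr)\, \gamma_t^{b_k}\, \mathrm{d}t .
```

Ω(·) denotes the quaternion kinematics matrix of the bias-corrected angular rate; this notation is chosen here and may differ from the patent's image.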
(3) Correcting the bias. If the predicted bias changes only slightly, the pre-integration terms are adjusted by their first-order approximations with respect to the bias. The bias correction is calculated as follows:
(formula image in the original publication)
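For reference, the first-order bias correction commonly paired with the pre-integration above is sketched below; the Jacobian symbols J with respect to the bias are chosen here as an assumption, since the patent's formula is available only as an image.

```latex
\alpha_{b_{k+1}}^{b_k} \approx \hat{\alpha}_{b_{k+1}}^{b_k} + J_{b_a}^{\alpha}\, \delta b_{a_k} + J_{b_\omega}^{\alpha}\, \delta b_{\omega_k} ,
\qquad
\beta_{b_{k+1}}^{b_k} \approx \hat{\beta}_{b_{k+1}}^{b_k} + J_{b_a}^{\beta}\, \delta b_{a_k} + J_{b_\omega}^{\beta}\, \delta b_{\omega_k} ,
\qquad
\gamma_{b_{k+1}}^{b_k} \approx \hat{\gamma}_{b_{k+1}}^{b_k} \otimes
\begin{bmatrix} 1 \\ \tfrac{1}{2} J_{b_\omega}^{\gamma}\, \delta b_{\omega_k} \end{bmatrix} .
```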
in the fourth step, the concrete steps include:
(1) Converting the pose from the camera frame into the IMU frame. The calculation formula is as follows:
(formula image in the original publication)
where s is the scale parameter to be solved, and the remaining terms are the extrinsic parameters between the camera and the IMU.
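A common form of this camera-to-IMU conversion, used in VINS-Mono's initialization and consistent with the scale parameter s mentioned here, is the following; the notation (reference frame c_0, extrinsics q_c^b and p_c^b) is assumed for illustration.

```latex
q_{b_k}^{c_0} = q_{c_k}^{c_0} \otimes \bigl(q_{c}^{b}\bigr)^{-1} ,
\qquad
s\, \bar{p}_{b_k}^{c_0} = s\, \bar{p}_{c_k}^{c_0} - R_{b_k}^{c_0}\, p_{c}^{b} .
```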
(2) Calibrating the gyroscope bias; initial calibration of the gyroscope bias is obtained by linearizing the IMU pre-integration with respect to the bias of the gyroscope and minimizing the cost function. The cost function is calculated by the formula:
(formula image in the original publication)
where B denotes the set of indices of all frames in the window.
(3) Initializing the velocity, gravity vector and metric scale. The velocity, the gravity vector and the metric scale form the vector shown in the following equation:
(formula image in the original publication)
where v_k represents the velocity of the IMU in the k-th image; g represents the gravity vector.
The transformation between two consecutive frames b_k and b_{k+1} in the window is as follows:
(formula image in the original publication)
The following linear model is obtained:
(formula image in the original publication)
where Δt_k is the time interval between two consecutive frames.
By solving the linear least square problem, the velocity of each frame in the window is obtained, and the calculation formula is as follows:
(formula image in the original publication)
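The velocities, gravity vector and scale are thus the unknowns of a stacked linear system solved in the least-squares sense. The Python sketch below is schematic: the matrix A and vector b stand in for the stacked inter-frame constraints, whose exact construction is given by the formula images above.

```python
import numpy as np

def solve_initialization_lls(A, b):
    """Solve the stacked linear system A x = b in the least-squares sense.
    x would contain the per-frame velocities, the gravity vector and the scale."""
    x, residuals, rank, _ = np.linalg.lstsq(A, b, rcond=None)
    return x

# Schematic example: random numbers stand in for the stacked inter-frame
# constraints (2 frames -> 6 velocity unknowns + 3 gravity + 1 scale).
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
b = rng.standard_normal(20)
print(solve_initialization_lls(A, b))
```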
in the fifth step, the concrete steps include:
(1) The cost function. For the camera and the IMU, the overall state to be estimated is defined as:
(formula image in the original publication)
where p represents the estimated position; r represents the estimated orientation; x_cam represents the camera-related state vector; x_imu represents the IMU-related state vector; λ represents the feature depth; v represents the velocity vector; b_a represents the acceleration bias; b_g represents the gyroscope bias.
After the state to be estimated is defined, state estimation is performed. State estimation is a maximum likelihood estimation problem and is defined as follows:
(formula image in the original publication)
where S represents the set of measurements; χ = {x_0, x_1, ..., x_n}, with each x_k containing a position vector and an orientation quaternion; the uncertainty of each measurement is modeled as a Gaussian distribution about the sensor model h(·).
The negative log-likelihood function of the above equation is:
(formula image in the original publication)
where the weighting matrices are the information matrices of the Gaussian distributions, and h(·) represents the sensor model.
The state estimation is then converted into a nonlinear least squares problem. Solving the nonlinear least squares problem requires the factors of the camera and the IMU sensors.
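For readability, the step from the maximum likelihood formulation to the nonlinear least-squares problem can be written as follows, assuming independent Gaussian measurement noise with covariance Σ_k; the symbols z_k, h_k and Σ_k are chosen here and are not taken from the patent's images.

```latex
\hat{\chi} = \arg\max_{\chi} \prod_{k \in S} \mathcal{N}\bigl( z_k ;\; h_k(\chi),\; \Sigma_k \bigr)
           = \arg\min_{\chi} \sum_{k \in S} \bigl\| z_k - h_k(\chi) \bigr\|_{\Sigma_k^{-1}}^{2} .
```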
(2) Camera and IMU sensor factors. The camera factor consists of the features of each frame of image. The residual of a feature observed in image t is calculated as:
(formula image in the original publication)
where π_c and π_c^{-1} are the projection and back-projection functions of the camera model, respectively; T represents a 4x4 homogeneous transformation matrix; the off-line calibrated transformation matrix from the IMU center to the camera center is also used.
The IMU residual error calculation formula is:
(formula image in the original publication)
where α represents the position term; β represents the velocity term; γ represents the rotation term; the minus operation is applied specifically to the nonlinear rotation part; g represents the gravity vector, whose magnitude equals 9.81 m/s².
(3) Optimization. After the camera factor and the IMU factor are obtained in step (2), the nonlinear least squares problem is optimized. The cost function is calculated as:
(formula image in the original publication)
where J represents the Jacobian matrix of each factor with respect to the current state χ.
After the nonlinear problem is approximated, the above equation yields a closed-form solution for δχ. The derived formula for the solution is as follows:
(formula image in the original publication)
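The closed-form increment referred to here is normally the Gauss-Newton normal equation; the form below is an assumed reconstruction using the Jacobians J and residuals r of the factors.

```latex
\Bigl( \sum_{k} J_k^{\top} \Sigma_k^{-1} J_k \Bigr)\, \delta\chi = - \sum_{k} J_k^{\top} \Sigma_k^{-1} r_k
\quad\Longrightarrow\quad
\delta\chi = -H^{-1} b ,
\qquad
H = \sum_{k} J_k^{\top} \Sigma_k^{-1} J_k , \;\;
b = \sum_{k} J_k^{\top} \Sigma_k^{-1} r_k .
```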
(4) Marginalization. To reduce the amount of computation, marginalization is introduced without losing useful information. Summarizing all marginalization factors with the above equation and reordering H and b according to the rearranged states, the following relationship is obtained:
(formula image in the original publication)
where χ_m represents the set of states to be marginalized and χ_r represents the set of remaining states.
The formula for marginalization is as follows:
(formula image in the original publication)
The above formula yields a new prior H_p and b_p for the remaining states. After the prior information about the current state is obtained, Bayes' rule is used to compute the posterior as the product of the likelihood and the prior:
p(χ | z) ∝ p(z | χ) p(χ)
The state estimation then becomes a maximum a posteriori problem, whose calculation formula is as follows:
(formula image in the original publication)
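Marginalization of this kind is usually carried out with the Schur complement over the partition χ_m / χ_r defined above; the block form below is a reconstruction under that assumption.

```latex
\begin{bmatrix} H_{mm} & H_{mr} \\ H_{rm} & H_{rr} \end{bmatrix}
\begin{bmatrix} \delta\chi_m \\ \delta\chi_r \end{bmatrix}
=
\begin{bmatrix} b_m \\ b_r \end{bmatrix} ,
\qquad
H_p = H_{rr} - H_{rm} H_{mm}^{-1} H_{mr} ,
\quad
b_p = b_r - H_{rm} H_{mm}^{-1} b_m .
```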
FIG. 2 shows the sparse map and motion trajectory constructed by the proposed method. The equipment used to collect the real-world data set is shown in FIG. 3, which includes a binocular depth camera (MYNTEYE-S1030), an inertial measurement unit (IMU) and a GPS sensor (GPS-U-Blox). After the data set is obtained, the accuracy of the algorithm is verified; the accuracy comparison results are shown in FIG. 4, where the conventional algorithm is VINS-Fusion.
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (6)

1. An automatic driving vehicle navigation method based on visual inertia SLAM is characterized by comprising the following steps:
step one: a vehicle-mounted binocular camera acquires gray images of the road ahead, an IMU sensor acquires the acceleration and angular velocity of the vehicle, and the main controller aligns the data;
step two: a dynamic target detection model is trained to detect dynamic targets in the road scene, target bounding boxes are extracted, and the feature points inside the bounding boxes are removed;
step three: the noise and bias of the IMU are eliminated, the IMU data are pre-integrated, the observed value, the covariance matrix and the Jacobian matrix of the residual are calculated, and the bias is then corrected;
step four: initialization calculates the absolute scale, the gyroscope bias, the acceleration bias, the gravitational acceleration and the IMU velocity at each moment;
step five: after initialization, tightly coupled optimization based on a sliding window is performed to obtain a state estimate with higher accuracy and better robustness.
2. The visual inertial SLAM-based autonomous vehicle navigation method of claim 1, wherein: in the first step, the specific steps include:
(1) the vehicle-mounted binocular camera acquires left and right grayscale images at a frequency of 10 Hz;
(2) the vehicle-mounted IMU acquires the three-axis acceleration and angular velocity of the vehicle body at a frequency of 100 Hz;
(3) the main controller aligns the data; because of trigger delay, transmission delay and unsynchronized clocks, the generated timestamps are not equal to the actual sampling times, resulting in time misalignment between the camera and the IMU; the aligned data are calculated as:
t_IMU = t_cam + t_d
where t_IMU and t_cam represent the timestamps of the IMU and the camera, respectively; t_d represents an unknown constant time offset.
3. The visual inertial SLAM-based autonomous vehicle navigation method of claim 1, wherein: in the second step, the concrete steps include:
(1) training a dynamic target detection model; a self-trained neural network model can detect the specific dynamic targets more accurately and more quickly; the training images are passed to a neural network to create a custom-trained detector; a BDD100K training set is prepared to train the dynamic target detection model; the dynamic target detection model is trained using the YOLOv5 network; like other neural networks, the YOLO network requires a large number of labeled images to be fed into the model to train the parameters of the neural network; the dynamic targets detected by the dynamic target detection model include people, cars, motorcycles, buses and trucks;
the accuracy and running speed of the dynamic target detection model have a decisive influence on the overall result, and much work has been done to improve the accuracy of the model; in practical application, adding the dynamic target detection model to the visual inertia SLAM increases the computation of the algorithm but does not affect the real-time performance of the system;
(2) extracting the bounding box; the dynamic target detection model provides the classes and two-dimensional bounding boxes of dynamic targets; three steps are required to obtain the bounding box of a dynamic target; first, the entire image is divided into grid cells of equal size, and the network runs on all cells simultaneously; second, the network predicts the bounding box of each cell in the output feature map; finally, in order to obtain the actual coordinates of the predicted bounding box with respect to the original image, the algorithm divides (b_x, b_y, b_w, b_h) by the size of the feature map and multiplies by the size of the original image; the prediction bounding box is calculated as follows:
(formula image in the original publication)
where (b_x, b_y) are the predicted center coordinates in each grid cell; (b_w, b_h) are the width and height of the bounding box; (c_x, c_y) are the top-left coordinates of the grid cell in the feature map; p_w and p_h represent the width and height of the predicted bounding box, respectively.
4. The visual inertial SLAM-based autonomous vehicle navigation method of claim 1, wherein: in the third step, the concrete steps include:
(1) eliminating the noise and bias of the IMU; the measured values are affected by the acceleration bias, the gyroscope bias and noise, and the formula for eliminating the noise and bias is as follows:
(formula image in the original publication)
where b_a and b_ω represent the random walk noise of the accelerometer and the gyroscope, respectively; a_t represents the ideal value of the accelerometer measurement; ω_t represents the ideal value of the gyroscope measurement; R represents the rotation matrix from the IMU coordinate system to the camera coordinate system; g_w represents the gravity vector; the acceleration noise n_a and the gyroscope noise n_w are Gaussian white noise and are expressed as:
(formula image in the original publication)
(2) pre-integrating the IMU measurements; for two consecutive frames b_k and b_{k+1}, the inertial measurements over the time interval [t_k, t_{k+1}] are integrated into the local frame b_k, and the calculation formula is as follows:
(formula image in the original publication)
where α represents the position pre-integration, β represents the velocity pre-integration, and γ represents the rotation pre-integration;
(formula image in the original publication)
(3) correcting the bias; if the predicted bias changes only slightly, the pre-integration terms are adjusted by their first-order approximations with respect to the bias, and the calculation formula is as follows:
(formula image in the original publication)
5. the visual inertial SLAM-based autonomous vehicle navigation method of claim 1, wherein: in the fourth step, the concrete steps include:
(1) converting the pose from the camera frame into the IMU frame; the calculation formula is as follows:
(formula image in the original publication)
where s is the scale parameter to be solved, and the remaining terms are the extrinsic parameters between the camera and the IMU;
(2) calibrating the gyroscope bias; obtaining an initial calibration of the gyroscope bias by linearizing the IMU pre-integration for the bias with respect to the gyroscope and minimizing a cost function; the cost function is calculated by the formula:
(formula image in the original publication)
wherein, B represents the index of all frames in the window;
(3) initializing the velocity, the gravity vector and the metric scale; the velocity, the gravity vector and the metric scale form the vector shown in the following equation:
(formula image in the original publication)
where v_k represents the velocity of the IMU in the k-th image; g represents the gravity vector;
the transformation between two consecutive frames b_k and b_{k+1} in the window is:
(formula image in the original publication)
the following linear model is obtained:
(formula image in the original publication)
where Δt_k is the time interval between two consecutive frames;
by solving the linear least square problem, the velocity of each frame in the window is obtained, and the calculation formula is as follows:
(formula image in the original publication)
6. the visual inertial SLAM-based autonomous vehicle navigation method of claim 1, wherein: in the fifth step, the concrete steps include:
(1) the cost function; for the camera and the IMU, the overall state to be estimated is defined as:
(formula image in the original publication)
where p represents the estimated position; r represents the estimated orientation; x_cam represents the camera-related state vector; x_imu represents the IMU-related state vector; λ represents the feature depth; v represents the velocity vector; b_a represents the acceleration bias; b_g represents the gyroscope bias;
after the state to be estimated is defined, state estimation is performed; state estimation is a maximum likelihood estimation problem and is defined as follows:
(formula image in the original publication)
where S represents the set of measurements; χ = {x_0, x_1, ..., x_n}, with each x_k containing a position vector and an orientation quaternion; the uncertainty of each measurement is modeled as a Gaussian distribution about the sensor model h(·);
the negative log-likelihood function of the above equation is:
(formula image in the original publication)
where the weighting matrices are the information matrices of the Gaussian distributions, and h(·) represents the sensor model;
the state estimation is then converted into a nonlinear least squares problem; solving the nonlinear least squares problem requires the factors of the camera and the IMU sensors;
(2) camera and IMU sensor factors; the camera factor consists of the features of each frame of image; the residual of a feature observed in image t is calculated as:
(formula image in the original publication)
where π_c and π_c^{-1} are the projection and back-projection functions of the camera model, respectively; T represents a 4x4 homogeneous transformation matrix; the off-line calibrated transformation matrix from the IMU center to the camera center is also used;
the IMU residual is calculated as:
(formula image in the original publication)
where α represents the position term; β represents the velocity term; γ represents the rotation term; the minus operation is applied specifically to the nonlinear rotation part; g represents the gravity vector, whose magnitude equals 9.81 m/s²;
(3) optimization; after the camera factor and the IMU factor are obtained in step (2), the nonlinear least squares problem is optimized; the cost function is calculated as:
(formula image in the original publication)
where J represents the Jacobian matrix of each factor with respect to the current state χ;
after the nonlinear problem is approximated, the above equation yields a closed-form solution for δχ; the derived formula for the solution is as follows:
(formula image in the original publication)
(4) marginalization; in order to reduce the amount of computation, marginalization is introduced without losing useful information; summarizing all marginalization factors with the above equation and reordering H and b according to the rearranged states, the following relationship is obtained:
(formula image in the original publication)
where χ_m represents the set of states to be marginalized and χ_r represents the set of remaining states;
the formula for marginalization is as follows:
(formula image in the original publication)
the above formula yields a new prior H_p and b_p for the remaining states; after the prior information about the current state is obtained, Bayes' rule is used to compute the posterior as the product of the likelihood and the prior: p(χ | z) ∝ p(z | χ) p(χ); the state estimation then becomes a maximum a posteriori problem, whose calculation formula is as follows:
(formula image in the original publication)
CN202210480382.7A 2022-05-05 2022-05-05 Automatic driving vehicle navigation method based on visual inertia SLAM Pending CN114690230A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210480382.7A CN114690230A (en) 2022-05-05 2022-05-05 Automatic driving vehicle navigation method based on visual inertia SLAM

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210480382.7A CN114690230A (en) 2022-05-05 2022-05-05 Automatic driving vehicle navigation method based on visual inertia SLAM

Publications (1)

Publication Number Publication Date
CN114690230A true CN114690230A (en) 2022-07-01

Family

ID=82145617

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210480382.7A Pending CN114690230A (en) 2022-05-05 2022-05-05 Automatic driving vehicle navigation method based on visual inertia SLAM

Country Status (1)

Country Link
CN (1) CN114690230A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115688610A (en) * 2022-12-27 2023-02-03 泉州装备制造研究所 Wireless electromagnetic six-dimensional positioning method and system, storage medium and electronic equipment
CN115688610B (en) * 2022-12-27 2023-08-15 泉州装备制造研究所 Wireless electromagnetic six-dimensional positioning method, system, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
CN112347840B (en) Vision sensor laser radar integrated unmanned aerial vehicle positioning and image building device and method
CN108731670B (en) Inertial/visual odometer integrated navigation positioning method based on measurement model optimization
CN112083725B (en) Structure-shared multi-sensor fusion positioning system for automatic driving vehicle
CN110726406A (en) Improved nonlinear optimization monocular inertial navigation SLAM method
CN114526745A (en) Drawing establishing method and system for tightly-coupled laser radar and inertial odometer
CN113052908A (en) Mobile robot pose estimation method based on multi-sensor data fusion
CN114758504B (en) Online vehicle overspeed early warning method and system based on filtering correction
CN114019552A (en) Bayesian multi-sensor error constraint-based location reliability optimization method
CN113776519B (en) AGV vehicle mapping and autonomous navigation obstacle avoidance method under lightless dynamic open environment
CN113920198B (en) Coarse-to-fine multi-sensor fusion positioning method based on semantic edge alignment
CN115479602A (en) Visual inertial odometer method fusing event and distance
CN114690229A (en) GPS-fused mobile robot visual inertial navigation method
CN110929402A (en) Probabilistic terrain estimation method based on uncertain analysis
CN115540850A (en) Unmanned vehicle mapping method combining laser radar and acceleration sensor
CN114754768A (en) Visual inertial navigation method based on point-line fusion
CN114690230A (en) Automatic driving vehicle navigation method based on visual inertia SLAM
CN114608568A (en) Multi-sensor-based information instant fusion positioning method
Peng et al. Vehicle odometry with camera-lidar-IMU information fusion and factor-graph optimization
CN113759364A (en) Millimeter wave radar continuous positioning method and device based on laser map
CN112945233A (en) Global drift-free autonomous robot simultaneous positioning and map building method
CN112862818A (en) Underground parking lot vehicle positioning method combining inertial sensor and multi-fisheye camera
CN115218889A (en) Multi-sensor indoor positioning method based on dotted line feature fusion
CN114964276A (en) Dynamic vision SLAM method fusing inertial navigation
CN115453599A (en) Multi-sensor-cooperated pipeline robot accurate positioning method
Liu et al. 6-DOF motion estimation using optical flow based on dual cameras

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication