WO2023155258A1 - Visual inertial odometry method that contains self-calibration and is based on keyframe sliding window filtering - Google Patents

Visual inertial odometry method that contains self-calibration and is based on keyframe sliding window filtering Download PDF

Info

Publication number
WO2023155258A1
WO2023155258A1 · PCT/CN2022/080346 · CN2022080346W
Authority
WO
WIPO (PCT)
Prior art keywords
feature
frame
bundle
bundles
trajectory
Prior art date
Application number
PCT/CN2022/080346
Other languages
French (fr)
Chinese (zh)
Inventor
槐建柱
庄园
林煜凯
Original Assignee
武汉大学
Priority date
Filing date
Publication date
Application filed by 武汉大学 (Wuhan University)
Publication of WO2023155258A1 publication Critical patent/WO2023155258A1/en

Classifications

    • G: PHYSICS
    • G01: MEASURING; TESTING
    • G01C: MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C 21/00: Navigation; Navigational instruments not provided for in groups G01C 1/00 - G01C 19/00
    • G01C 21/005: Navigation with correlation of navigation data from several sources, e.g. map or contour matching
    • G01C 21/10: Navigation by using measurements of speed or acceleration
    • G01C 21/12: Navigation by measurements of speed or acceleration executed aboard the object being navigated; Dead reckoning
    • G01C 21/16: Dead reckoning by integrating acceleration or speed, i.e. inertial navigation
    • G01C 21/165: Inertial navigation combined with non-inertial navigation instruments
    • G01C 21/1656: Inertial navigation combined with passive imaging devices, e.g. cameras
    • G01C 21/18: Stabilised platforms, e.g. by gyroscope
    • G01C 21/183: Compensation of inertial measurements, e.g. for temperature effects

Definitions

  • the invention belongs to the field of multi-sensor fusion navigation and positioning, and in particular relates to the estimation of position, velocity and attitude based on a combined system of a monocular or binocular camera and an inertial measurement unit (IMU) and real-time self-calibration of sensor parameters in the system.
  • IMU inertial measurement unit
  • Visual-inertial odometry can track the pose of a device in real time using the data collected by a camera and an IMU rigidly mounted on the device. This technology combines computer 3D vision with inertial dead reckoning and, using low-cost sensors, achieves centimeter-level real-time positioning accuracy.
  • Visual-inertial odometry is also widely used in industries such as drones and autonomous driving. DJI's new UAVs can accurately estimate their pose in complex scenes through visual-inertial odometry technology [5], enabling flight maneuvers such as obstacle avoidance and traversal. Several autonomous-driving suites use a visual-inertial odometry approach to assist automated parking.
  • the present invention proposes a visual-inertial odometry method based on a keyframe-based sliding window filter (Keyframe-based Sliding Window Filter, KSWF) with self-calibration, suitable for combined camera-IMU systems.
  • KSWF Keyframe-based Sliding Window Filter
  • This method is based on Extended Kalman Filter (EKF) for state estimation, so it has good real-time performance.
  • EKF Extended Kalman Filter
  • This method can use scene features to calibrate (i.e., self-calibrate) the geometric parameters of the camera and IMU in real time, and it filters the variables in the sliding window based on the concept of keyframes to handle degenerate motion.
  • state quantities such as position, velocity, attitude, and sensor parameters estimated by the filter are collectively referred to as the state vector.
  • the filter also estimates the uncertainty of these state quantities, i.e., the covariance of the state vector.
  • the position and attitude are referred to as pose for short, and the pose and velocity at the same time t are referred to as navigation state quantities.
  • sensors generally refer to cameras and IMUs
  • devices refer to devices that carry these sensors, such as mobile phones
  • filters refer to programs that fuse data for state estimation.
  • KSWF is suitable for a system consisting of one or two cameras rigidly mounted together with an IMU; a sliding window filter fuses their collected data to estimate the state vector and covariance of the system (or of the carrying device).
  • KSWF can calibrate in real time the biases, scale factors, misalignment, and acceleration sensitivity of the IMU, as well as the cameras' projection geometry parameters, extrinsic parameters, time delay, and rolling-shutter effect.
  • Its feature association front-end obtains a series of frames from a monocular camera or a series of roughly synchronized frame pairs from a binocular camera as input.
  • the monocular frames or binocular frame pairs are collectively referred to as frame bundles.
  • the front-end extracts feature points in frame bundles and matches feature points between frame bundles based on the concept of keyframes, thereby obtaining the trajectories of feature-point observations (i.e., feature trajectories), and selects keyframe bundles according to the proportion of view overlap between the current frame bundle and historical frame bundles.
  • the back-end filter uses IMU data for pose propagation, then uses the observations in the feature trajectories to update the state vector and covariance of the filter, and finally manages the state quantities in the filter based on the concept of keyframes to ensure real-time operation.
  • because KSWF includes the geometric calibration parameters of the camera and IMU in the state vector, it can estimate the sensors' calibration parameters in real time to cope with inaccurate calibration.
  • the calibration parameters in the state vector can be fixed to reduce the amount of calculation.
  • this method selects and removes redundant state quantities based on the concept of key frames.
  • the process of removing redundant state quantities is also called marginalization.
  • KSWF determines the marginalization order according to whether a navigation state quantity is associated with a keyframe bundle, so as to prolong the lifetime of keyframe bundles. In this way, when the device undergoes degenerate motion, the keyframe bundles remain essentially unchanged, so this method avoids the pose drift of traditional methods.
  • the invention provides a visual-inertial odometry method with self-calibration based on keyframe sliding window filtering.
  • At least one IMU including a three-axis gyroscope and a three-axis accelerometer
  • at least one camera such as a wide-view camera or a fisheye camera
  • the shutter mode of the camera can be rolling shutter or global shutter.
  • the present invention assumes that the camera and IMU on the device have been roughly calibrated. Examples of such devices are smartphones, AR glasses, delivery robots, etc.
  • the device collects a sequence of image frame bundles through N cameras (each frame bundle contains N images captured at similar times by the N cameras, N ≥ 1) and collects 3-axis accelerometer and 3-axis gyroscope data through an IMU; this method then fuses the camera and IMU data to estimate the state and parameters of the device, as well as their uncertainty.
  • the position, velocity, attitude, and parameters of the camera and IMU are collectively called the state vector.
  • the uncertainty of the state vector is described by its covariance; for simplicity, the pose and velocity at the same time t are simply referred to as the navigation state quantity;
  • the implementation of the present invention comprises the following steps:
  • Step 1: when the j-th frame bundle arrives, extract the feature points and the corresponding descriptors on its N images, and obtain the set of feature points for each image k.
  • the feature-point detection methods in step 1 include ORB and BRISK
  • the descriptor-generation methods include BRISK, FREAK, and deep-learning-based feature-descriptor generation methods; a minimal extraction sketch is given below.
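A minimal sketch of step 1, assuming OpenCV is available and choosing ORB detection with BRISK descriptors (only two of the options named above); the parameter values are illustrative, not values prescribed by the patent.

```python
import cv2

def extract_features(images, n_features=1000):
    """images: list of N grayscale images forming one frame bundle."""
    detector = cv2.ORB_create(nfeatures=n_features)  # FAST-based corner detector
    describer = cv2.BRISK_create()                   # 512-bit binary descriptors
    bundle_features = []
    for img in images:
        keypoints = detector.detect(img, None)
        keypoints, descriptors = describer.compute(img, keypoints)
        bundle_features.append((keypoints, descriptors))
    return bundle_features
```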
  • Step 2 match the frame bundle j with the previous K keyframe bundles to expand the feature trajectories of feature points.
  • the matching of two frame bundles includes the matching of N pairs of images, and for each pair of images, the feature points in the two frames of images are matched according to the feature descriptor.
  • two matched feature points correspond to a landmark point in the scene; the observations of that landmark point, where u and v are the pixel coordinates of the feature point in the k-th image of the j-th frame bundle, form a feature trajectory.
  • if the previous frame bundle j-1 is not a keyframe bundle, frame bundle j and frame bundle j-1 will also be matched to extend the feature trajectories of the feature points.
  • for N > 1, pairwise matching is also performed between the N images within frame bundle j to extend the feature trajectories of the feature points.
  • the matching between two images is realized by finding feature descriptors with small Hamming distance.
  • implementations of the search for similar feature descriptors include brute-force search, k-nearest-neighbor search, or window-based search; a matching sketch follows.
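A minimal sketch of descriptor matching between two images, assuming OpenCV's brute-force Hamming matcher with cross-checking; the distance threshold is an assumption, not a value from the text.

```python
import cv2

def match_descriptors(desc_a, desc_b, max_hamming=64):
    """desc_a, desc_b: binary descriptor arrays produced in step 1."""
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(desc_a, desc_b)
    # keep only reasonably close descriptors
    return [m for m in matches if m.distance < max_hamming]
```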
  • Step 3: if the sliding window filter has not been initialized, the state vector (pose, velocity, sensor parameters, etc.) and its covariance are initialized from the feature trajectories and IMU data in several frame bundles (such as two keyframe bundles).
  • the state vector X(t) contains the current navigation state quantity π(t), the IMU parameters x imu, the parameters of the N cameras, the navigation state quantities corresponding to the past N kf keyframe bundles and the most recent N tf +1 frame bundles, and m landmark points (a bookkeeping sketch follows).
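A minimal bookkeeping sketch of this state-vector layout; the field names and dimensions are illustrative, not the patent's notation.

```python
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class NavigationState:              # pose + velocity at one time t
    position: np.ndarray            # (3,)
    attitude: np.ndarray            # (3, 3) rotation matrix
    velocity: np.ndarray            # (3,)

@dataclass
class StateVector:
    current: NavigationState
    imu_params: np.ndarray              # biases, scale factors, misalignment, g-sensitivity
    camera_params: List[np.ndarray]     # per camera: projection, extrinsics, time delay, rolling shutter
    keyframe_states: List[NavigationState]   # past N_kf keyframe bundles
    recent_states: List[NavigationState]     # most recent N_tf + 1 frame bundles
    landmarks: List[np.ndarray] = field(default_factory=list)  # landmark points L_1 ... L_m
```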
  • Step 4: once the filter has been initialized, for the current frame bundle j whose feature matching is complete, recursively propagate the pose and velocity from those of the previous frame bundle j-1 to the time t j corresponding to frame bundle j using the IMU data, and clone the propagated pose and velocity to augment the state vector and covariance of the filter (see the propagation sketch below).
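A minimal sketch of step 4 under simplifying assumptions: discrete IMU dead reckoning between the two frame-bundle times, followed by cloning the propagated navigation-state block into the covariance (mirroring Figs. 6-7). Bias and noise handling, the error-state dimension, and the state ordering are assumptions.

```python
import numpy as np

GRAVITY = np.array([0.0, 0.0, -9.81])  # world z-axis up, as set at initialization

def skew_exp(w):
    """Rotation matrix for a small rotation vector w (Rodrigues formula)."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    k = w / theta
    K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def propagate(p, v, R, imu_samples):
    """p, v: 3-vectors; R: world-from-body rotation; imu_samples: list of
    (dt, gyro, accel) tuples between frame bundles j-1 and j."""
    for dt, gyro, accel in imu_samples:
        a_world = R @ accel + GRAVITY          # specific force rotated to world + gravity
        p = p + v * dt + 0.5 * a_world * dt**2
        v = v + a_world * dt
        R = R @ skew_exp(gyro * dt)            # first-order attitude update
    return p, v, R

def clone_navigation_state(P, idx, dim=9):
    """Augment covariance P by duplicating the rows/columns of the navigation
    state block that starts at index `idx` (pose + velocity, assumed 9 error states)."""
    J = np.vstack([np.eye(P.shape[0]), np.eye(dim, P.shape[0], k=idx)])
    return J @ P @ J.T
```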
  • Step 5 Using multiple feature trajectories in frame bundle j as observation information, update the state vector and covariance of the filter.
  • the updated state vector and covariance can be released to downstream applications, such as projection of virtual scenes in AR and robot path planning.
  • according to whether the feature trajectory disappears in the current frame bundle j, or whether its corresponding landmark point is in the state vector, the processing mode of the feature trajectory is determined.
  • observations in these feature trajectories are used to update the state vector and covariance. This method supports real-time self-calibration because the geometric parameters of the sensor can be placed in the state vector to be updated.
  • the three cases include that the feature track disappears in the current frame bundle j, the feature track corresponds to a landmark point in the state vector, and the feature track corresponds to a well-observed landmark point that is not in the state vector.
  • in the first case, a landmark point is first triangulated from the feature trajectory, and the reprojection errors and Jacobian matrices of all observations in the feature trajectory are calculated; the Jacobian with respect to the landmark-point parameters is then eliminated by matrix null-space projection, and outliers are removed by a Mahalanobis test.
  • in the second case, the reprojection error and Jacobian matrix of the feature trajectory's observation in the current frame bundle j are first calculated, and then outliers are removed by a Mahalanobis test.
  • in the third case, a landmark point is first triangulated from the feature trajectory, then the reprojection errors and Jacobian matrices of all observations in the feature trajectory are calculated, outliers are removed by a Mahalanobis test, and the landmark point is added to the state vector and covariance matrix.
  • all the feature trajectories are divided into the above three categories; for each category, the reprojection errors and Jacobian matrices are stacked, and the state vector and covariance matrix are updated.
  • the update follows the classic EKF equations; a sketch of the null-space projection and the Mahalanobis gating test follows.
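A minimal sketch of the null-space projection and the Mahalanobis gating used in the first case, assuming stacked residuals r ≈ H_x·δx + H_f·δL + n; matrix names, shapes, and thresholds are illustrative, not the patent's notation.

```python
import numpy as np

def nullspace_project(residual, H_x, H_f):
    """residual: (2M,) stacked reprojection errors of one feature trajectory;
    H_x: (2M, n) Jacobian w.r.t. the filter state; H_f: (2M, 3) Jacobian w.r.t.
    the landmark point. Projects both onto the left null space of H_f so the
    landmark error drops out."""
    Q, _ = np.linalg.qr(H_f, mode='complete')
    N = Q[:, 3:]                 # basis of the left null space of H_f
    return N.T @ residual, N.T @ H_x

def mahalanobis_gate(r0, H0, P, R_noise, chi2_threshold):
    """Outlier test: accept the trajectory only if the squared Mahalanobis
    distance of the projected residual is below a chi-square threshold."""
    S = H0 @ P @ H0.T + R_noise
    d2 = float(r0 @ np.linalg.solve(S, r0))
    return d2 < chi2_threshold
```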
  • Step 6: to control the amount of computation, redundant navigation state quantities in the current state vector are detected after the update is completed. These redundant state quantities are selected according to whether they correspond to keyframe bundles, and are then moved out of the filter.
  • when the number of navigation state quantities in the sliding window exceeds the sum of the allowed number of keyframe bundles N kf and recent frame bundles N tf, redundant frame bundles are selected and marginalized. Each marginalization operation selects at least N r redundant frame bundles (N r is 3 for a monocular camera and 2 for a binocular camera). To meet this requirement, redundant frame bundles are first selected among the most recent non-keyframe bundles, excluding the latest N tf frame bundles, and then among the earliest keyframe bundles (a selection sketch follows this step).
  • the filter can then be updated with feature trajectories that contain more than two observations: if such a feature trajectory can successfully triangulate a landmark point from all of its observations, the feature-point observations lying in the redundant frame bundles on that trajectory are used for an EKF update, similar to the first case in step 5; feature points in the redundant frame bundles belonging to other feature trajectories are discarded.
  • after the EKF update, the rows and columns of the covariance matrix and the navigation state quantities corresponding to these redundant frame bundles are deleted.
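A minimal sketch of the redundant-frame-bundle selection of step 6 (cf. Fig. 9), assuming frame bundles are stored oldest-first as (id, is_keyframe) pairs; the data layout is an assumption.

```python
def select_redundant(bundles, n_kf, n_tf, n_r):
    """bundles: list of (bundle_id, is_keyframe) ordered oldest-first.
    Returns ids to marginalize once the window exceeds n_kf + n_tf states."""
    if len(bundles) <= n_kf + n_tf:
        return []
    redundant = []
    # 1) most recent non-keyframe bundles, excluding the latest n_tf bundles
    for bid, is_kf in reversed(bundles[:-n_tf]):
        if len(redundant) >= n_r:
            return redundant
        if not is_kf:
            redundant.append(bid)
    # 2) then the earliest keyframe bundles, always keeping at least one keyframe
    keyframe_ids = [bid for bid, is_kf in bundles if is_kf]
    for bid in keyframe_ids[:-1]:
        if len(redundant) >= n_r:
            break
        redundant.append(bid)
    return redundant
```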
  • steps 1-6 will be repeated.
  • Each loop will publish the filter estimated state vector and covariance to serve downstream applications. These published information describe state quantities such as pose, velocity, and sensor parameters of the device.
  • because the odometry method can estimate the IMU intrinsic parameters, camera intrinsic parameters, camera extrinsic parameters, time delay, and rolling-shutter effect during filtering, it can be applied to camera-IMU combinations with inaccurate initial calibration or with rolling shutters; an example is the camera-IMU combination on a mobile phone.
  • the odometry method uses keyframe-based feature association and state quantity management, so it can robustly handle degenerate motion situations, such as stationary, hovering, and approximate pure rotation.
  • Figure 1 is a schematic diagram of the visual inertial odometry method based on key frame sliding window filtering.
  • modules with a gray background use the concept of keyframes.
  • Figure 2 is a schematic diagram of the multi-thread implementation of the visual inertial odometry method based on key frame sliding window filtering.
  • modules with a gray background use the concept of keyframes.
  • Figure 3 is a flow chart of steps ①-⑦ of keyframe-based image feature-point matching, taking a binocular camera as an example.
  • Fig. 4 is a schematic diagram of two examples of state quantities contained in a state vector.
  • Fig. 5 is a schematic diagram of the state vector before and after cloning the navigation state quantity for the current frame bundle j for the situation in Fig. 3, where π j is the newly added state quantity.
  • FIG. 6 is a schematic diagram of the covariance matrix before cloning the navigation state quantity of the current frame bundle j for the situation in FIG. 3 .
  • Figure 7 is a schematic diagram of the covariance matrix after cloning the navigation state quantity of the current frame bundle j for the situation in Figure 3.
  • the squares filled with diagonal lines correspond to the newly added rows and columns; note that the rows and columns filled with diagonal lines have elements equal to those of the rows and columns filled with the checkerboard pattern.
  • Figure 8 is a schematic diagram of the three ways in which a landmark point's feature trajectory is used to update the state vector and covariance. If the landmark point disappears in the current frame bundle and is not in the state vector, a projection removes its Jacobian coefficient matrix before the trajectory is used for the filter update; if the landmark point is in the state vector, it is used directly for the filter update; if the landmark point is observed in the current frame bundle and can be successfully triangulated, it is added to the state vector and then used for the filter update.
  • FIG. 9 is a schematic diagram of a method for selecting redundant frame bundles.
  • the figure shows the navigation state quantities corresponding to key frame bundles (gray background) and normal frame bundles (white background). Note that at least one keyframe bundle needs to be preserved after marginalizing redundant frame bundles.
  • the white background marks ordinary frame bundles
  • the dotted line marks the marginalized redundant frame bundles.
  • using the data collected by the camera and IMU, and considering the problems caused by residual calibration errors of low-end sensors and by degenerate motion such as being stationary, the present invention proposes a keyframe-based sliding window filter with self-calibration for real-time pose estimation.
  • This filter is suitable for the combined system of monocular or binocular camera and IMU.
  • the shutter mode of the camera can be rolling shutter or global shutter.
  • the method provided by the present invention can be implemented with computer software technology; the main implementation steps are image feature extraction, keyframe-based feature association, filter initialization, IMU-based state propagation, filter update using feature observations, and keyframe-based state-quantity management, see Figure 1.
  • Embodiment: taking a combined binocular-camera-and-IMU system as an example, the process of the present invention is elaborated as follows:
  • Step 1 for a frame bundle j containing N frames of time-similar images captured by N cameras, detect the feature points in each image, and generate descriptors (ie, feature vectors) for these feature points.
  • the feature point detection methods include ORB[6] or BRISK[7], and methods for generating descriptors include BRISK[7] or FREAK[8].
  • Detect and extract image features for frame bundle j captured by 2 cameras.
  • for each feature point, concentric circles with different radii are constructed around the feature point, and a number of equally spaced sampling points are taken on each circle, giving 60 sampling points including the feature point itself. The sampling points are Gaussian filtered to suppress aliasing, and the principal gray-scale direction formed by these sampling points is calculated; the sampling pattern around the feature point is then rotated to this principal direction to obtain a new sampling pattern, and 512 sampling-point pairs are constructed. The gray-value differences of these pairs form a 512-dimensional binary code, i.e., each feature vector is 64 bytes.
  • Step 2 use the image feature descriptor extracted in step 1 to perform feature association based on keyframes (also called feature matching).
  • the matching process associates the features of the current frame bundle j with the features of several previous frame bundles, including the K most recent keyframe bundles and the previous frame bundle j-1.
  • after the matching with the K keyframe bundles, it is decided according to the ratio of feature matches whether frame bundle j is selected as a keyframe bundle. If there are multiple cameras, feature association can also be performed between the N images of the current frame bundle j.
  • the matching of feature descriptors between two frames of images is done by brute force search, k-nearest neighbor or window search.
  • Figure 3 shows the process of feature point matching for the data collected by a device with a binocular camera, where the frame bundles with a gray background represent key frame bundles, and the frame bundles with a white background represent ordinary frame bundles.
  • feature matching comprises steps ①-⑦: ① matching of the current frame bundle j with the left-camera image of the previous frame bundle j-1; ② matching of frame bundle j with the left-camera image of keyframe bundle j 1; ③ matching of frame bundle j with the left-camera image of keyframe bundle j 2; ④ matching of frame bundle j with the right-camera image of the previous frame bundle j-1; ⑤ matching of frame bundle j with the right-camera image of keyframe bundle j 1; ⑥ matching of frame bundle j with the right-camera image of keyframe bundle j 2; ⑦ matching between the left- and right-camera images within the current frame bundle j.
  • feature matching first performs the matching between the current frame bundle j and the two preceding keyframe bundles j 1 and j 2 (②③⑤⑥), and then judges whether the current frame bundle j is selected as a keyframe bundle: if the number of feature matches in the current frame bundle is less than 20% of its number of feature points, j is selected as a keyframe bundle. Next, the matching between the current frame bundle j and the previous frame bundle j-1 is performed (①④). Finally, the matching between the two images within the current frame bundle j is performed (⑦).
  • the matching of each pair of images consists of two steps: first, the nearest feature matches are found by brute-force search according to the Hamming distance between the feature descriptors, and then outlier matches are removed by a RANSAC scheme based on the five-point method (see the sketch below).
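A minimal sketch of the RANSAC outlier-rejection step, assuming OpenCV's five-point essential-matrix estimator; the intrinsic matrix K and the thresholds are placeholders.

```python
import cv2
import numpy as np

def reject_outliers(pts_a, pts_b, K, ransac_thresh_px=1.0):
    """pts_a, pts_b: (M, 2) float arrays of matched pixel coordinates."""
    E, inlier_mask = cv2.findEssentialMat(
        pts_a, pts_b, K, method=cv2.RANSAC, prob=0.999, threshold=ransac_thresh_px)
    keep = inlier_mask.ravel().astype(bool)
    return pts_a[keep], pts_b[keep]
```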
  • Step 3 if the sliding window filter has not been initialized, initialize the state vector and covariance using the feature trajectory and IMU data in several frame bundles (such as 2 keyframes).
  • x imu can include the bias, scale factor, misalignment and acceleration sensitivity of the IMU (Fig. 4 instance 2)
  • x C can include the projection geometry parameters of each camera, extrinsic parameters, camera time delay and rolling shutter effect (Fig. 4 instance 2).
  • given the two keyframe bundles j 1 and j 2 from step 2 and the corresponding IMU data, the filter is initialized.
  • the initial position and velocity of the filter are set to zero, and the initial attitude is determined from the initial accelerometer data, so that the z-axis of the filter's reference world coordinate system {W} points along the negative gravity direction.
  • for the IMU biases: if the pixel motion (i.e., optical flow) of frame bundle j 2 relative to j 1 is small, the gyroscope and accelerometer biases are set by averaging the IMU data; otherwise, the gyroscope and accelerometer biases are set to zero.
  • the time delay of the camera relative to the IMU is initialized to 0.
  • the covariance of the other sensor parameters and state quantities is set from data sheets or experience. If a sensor has been well calibrated in advance, the entries of the covariance matrix corresponding to its parameters can be set to zero to fix these parameters, i.e., these parameters will not be updated; an initialization sketch follows.
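A minimal sketch of this initialization under assumed conventions: the world z-axis is aligned with the negative gravity direction measured by the accelerometer, and the biases are averaged only when the optical flow between the two keyframe bundles is small; the flow threshold and gravity magnitude are assumptions.

```python
import numpy as np

def initial_attitude_from_accel(accel_mean):
    """accel_mean: mean accelerometer reading while (nearly) static, which points
    opposite to gravity in the body frame. Returns the world-from-body rotation."""
    z_body = accel_mean / np.linalg.norm(accel_mean)
    tmp = np.array([1.0, 0.0, 0.0])
    if abs(np.dot(tmp, z_body)) > 0.9:           # avoid a degenerate cross product
        tmp = np.array([0.0, 1.0, 0.0])
    x_body = np.cross(tmp, z_body); x_body /= np.linalg.norm(x_body)
    y_body = np.cross(z_body, x_body)
    return np.vstack([x_body, y_body, z_body])   # rows map body vectors to world axes

def initial_biases(gyro_samples, accel_samples, R_wb, optical_flow_px, flow_thresh=1.0):
    """R_wb: world-from-body rotation from initial_attitude_from_accel."""
    if optical_flow_px >= flow_thresh:
        return np.zeros(3), np.zeros(3)          # large motion: leave biases at zero
    bg = np.mean(gyro_samples, axis=0)                    # static gyro reads its bias
    expected_f = R_wb.T @ np.array([0.0, 0.0, 9.81])      # expected static specific force
    ba = np.mean(accel_samples, axis=0) - expected_f
    return bg, ba
```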
  • Step 5 update the state vector and covariance using the feature trajectory in the current frame bundle j.
  • different ways are adopted to perform EKF update. These three cases include that the feature trajectory disappears in the current frame bundle, the landmark point corresponding to the feature trajectory is in the state vector, and the feature trajectory corresponds to a well-observed landmark point that is not in the state vector.
  • well-observed landmark points must meet two conditions: one is that the observations are sufficiently long, for example the point is observed in 7 images, and the other is that the corresponding landmark point can be successfully triangulated, i.e., the observations have sufficient parallax.
  • in the first case, a landmark point is first triangulated from the feature trajectory, then the reprojection errors and Jacobian matrices of all observations on the feature trajectory are calculated, the Jacobian with respect to the landmark-point parameters is eliminated by matrix null-space projection, and outliers are removed by a Mahalanobis test.
  • in the second case, the reprojection error and Jacobian matrix of the observation of the feature trajectory in the current frame bundle j are first calculated, and then the outliers are removed by a Mahalanobis test.
  • in the third case, a landmark point is first triangulated from the feature trajectory, then the reprojection errors and Jacobian matrices of all observations on the feature trajectory are calculated, outliers are removed by a Mahalanobis test, and the landmark point is then added to the state vector and covariance matrix.
  • a feature trajectory that disappears in the current frame bundle j can be used as a complete feature trajectory to triangulate a landmark point if it contains at least 3 observations. If the triangulation succeeds (see the sketch below), the reprojection errors and Jacobian matrices of all observations on the feature trajectory are calculated, the matrix null-space projection is used to eliminate the Jacobian of the landmark-point parameters, and outliers are then removed by a Mahalanobis test. For the feature trajectory of a landmark point that is in the state vector, the reprojection error and Jacobian matrix of the landmark-point observation in the current frame bundle j are first calculated, and then outliers are removed by a Mahalanobis test.
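A minimal sketch of triangulating a landmark point from a feature trajectory (linear DLT over all observations) with a crude parallax check; the 3x4 projection matrices and the parallax threshold are placeholders for illustration.

```python
import numpy as np

def triangulate(observations, projections, min_parallax_deg=1.0):
    """observations: list of (u, v) pixel coordinates along one feature trajectory;
    projections: list of corresponding 3x4 matrices K[R|t]. Returns the 3D point,
    or None when the parallax is insufficient."""
    A = []
    for (u, v), P in zip(observations, projections):
        A.append(u * P[2] - P[0])
        A.append(v * P[2] - P[1])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    X = Vt[-1]
    X = X[:3] / X[3]
    # parallax between the first and last viewing rays
    c0 = -np.linalg.inv(projections[0][:, :3]) @ projections[0][:, 3]
    c1 = -np.linalg.inv(projections[-1][:, :3]) @ projections[-1][:, 3]
    r0 = (X - c0) / np.linalg.norm(X - c0)
    r1 = (X - c1) / np.linalg.norm(X - c1)
    parallax = np.degrees(np.arccos(np.clip(np.dot(r0, r1), -1.0, 1.0)))
    return X if parallax >= min_parallax_deg else None
```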
  • Step 6: to limit the amount of computation, when the number of navigation state quantities in the sliding window exceeds the sum of the allowed number of keyframe bundles N kf and the number of recent frame bundles N tf, redundant frame bundles are selected from the sliding window and marginalized. Because two observations of an unknown landmark point cannot provide an effective constraint on the pose, at least N r redundant frame bundles are selected in one marginalization operation (N r is 3 for a monocular camera and 2 for a binocular camera). To meet this requirement, redundant frame bundles are first selected among the most recent non-keyframe bundles, excluding the latest N tf frame bundles, and then among the earliest keyframe bundles.
  • the sliding window filter can then be updated using the feature trajectories whose length exceeds two observations: if such a feature trajectory can successfully triangulate a landmark point from all of its observations, the feature-point observations in the redundant frame bundles on that trajectory are used for an EKF update, similar to the first case in step 5; feature points in the redundant frame bundles belonging to other feature trajectories are discarded. After the EKF update, the rows and columns of the covariance matrix and the navigation state quantities corresponding to these redundant frame bundles are deleted.
  • at least N r redundant frame bundles are selected (N r is 3 for a monocular camera and 2 for a binocular camera).
  • the latest N tf frame bundles are excluded first, then redundant frame bundles are selected from the most recent non-keyframe bundles (assume n r are selected), and then N r - n r redundant frame bundles are selected from the earliest keyframe bundles. Note that after removing these redundant frame bundles, at least one keyframe bundle should be kept.
  • the redundant frame bundles are keyframe bundle j 1 and ordinary frame bundle j-3 (as shown by the dotted-line box in the left part of FIG. 10).
  • the redundant frame bundles are the two oldest frame bundles.
  • the filter is updated with observations of landmark points that are observed more than twice in the redundant frame bundles. If such a landmark point can be successfully triangulated using all of its observations, the observations in the redundant frame bundles on its feature trajectory are used for an EKF update, similar to the first case in step 5; the other feature-point observations in the redundant frame bundles are discarded.
  • after the EKF update, the rows and columns of the covariance matrix and the navigation state quantities corresponding to these redundant frame bundles are deleted.
  • the method provided by the present invention can also be implemented as a corresponding program by utilizing multi-thread design, as shown in FIG. 2 .
  • the multi-threaded program of visual-inertial odometry based on keyframe sliding window filtering includes a feature extraction thread, a keyframe-based feature association thread, a camera-IMU synchronization thread, and a sliding window filtering thread. This multi-threaded implementation can significantly increase the throughput of the program; a queue-based sketch of the pipeline is given after this list.
  • the feature extraction thread is used to detect the feature points of each frame image captured by the camera, and use a feature description method to generate descriptors for the feature points.
  • BRISK uses the FAST[9] algorithm for feature point detection, and detects features by constructing image pyramids in the scale space.
  • the binary code of the descriptor is obtained by comparing the gray value of 512 pixel pairs in the neighborhood of the feature point.
  • the key frame-based feature association thread is used to associate the image features of the current frame bundle and the previous frame bundle, and perform key frame-based feature point matching according to the image features extracted by the feature extraction thread.
  • the specific matching steps include matching the current frame bundle with K key frame bundles, matching the current frame bundle with the previous frame bundle, and matching the N images of the current frame bundle two by two. After the matching with the K key frame bundles is completed, it will be judged whether the current frame bundle is selected as the key frame bundle according to the ratio of feature matching.
  • one way to achieve feature matching between two images is to find the closest descriptor to each feature descriptor by brute-force search. Finally, these matches are passed through a RANSAC scheme based on the five-point method to remove outliers.
  • Camera-IMU synchronization thread for synchronizing camera and IMU data.
  • the thread can block the other threads through a condition variable while waiting for the arrival of the IMU data corresponding to the current frame bundle.
  • the EKF update is performed for three cases: feature trajectories that disappear in the current frame bundle j, feature trajectories whose corresponding landmark point is in the state vector, and feature trajectories of landmark points that are not in the state vector but have sufficiently long observations.
  • the three cases differ in how the update is prepared.
  • the matrix null-space projection is utilized to eliminate the Jacobian matrix of the landmark point parameters.
  • the triangulated new landmark points will be added to the state vector and covariance matrix.
  • redundant frame bundles are selected and marginalized. At least N r redundant frame bundles are selected each time (for monocular devices, N r is 3, and for binocular devices, N r is 2). When these frame bundles are selected, the nearest N tf frame bundles will be excluded, then redundant frame bundles will be selected among non-key frame bundles, and finally redundant frame bundles will be selected among the earliest key frame bundles. After the redundant frame bundle is selected, the filter can be updated with observations of landmark points observed more than twice in the redundant frame bundle.
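A minimal sketch of the four-thread pipeline of Figure 2 using producer-consumer queues; the stage functions are identity stubs standing in for feature extraction, keyframe-based association, camera-IMU synchronization, and the sliding window filter, and all names are illustrative.

```python
import queue
import threading

def pipeline_stage(in_q, out_q, work):
    """Generic worker: repeatedly take an item, process it, pass the result on."""
    while True:
        out_q.put(work(in_q.get()))

frame_q   = queue.Queue()   # raw frame bundles from the cameras
feature_q = queue.Queue()   # per-bundle keypoints and descriptors
track_q   = queue.Queue()   # keyframe-based feature trajectories
synced_q  = queue.Queue()   # trajectories paired with their IMU segment
state_q   = queue.Queue()   # published state vector and covariance

# Identity stubs so the sketch runs; a real system would plug in the methods
# described in steps 1-6 above.
extract = associate = sync_with_imu = filter_update = lambda item: item

stages = [
    (frame_q, feature_q, extract),        # feature extraction thread
    (feature_q, track_q, associate),      # keyframe-based feature association thread
    (track_q, synced_q, sync_with_imu),   # camera-IMU synchronization thread
    (synced_q, state_q, filter_update),   # sliding window filtering thread
]
for in_q, out_q, work in stages:
    threading.Thread(target=pipeline_stage, args=(in_q, out_q, work), daemon=True).start()
```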

Landscapes

  • Engineering & Computer Science (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Automation & Control Theory (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Image Analysis (AREA)
  • Navigation (AREA)

Abstract

A visual-inertial odometry method that contains self-calibration and is based on keyframe sliding window filtering, which belongs to the field of multi-sensor fusion navigation and positioning. In a traditional filtering method, old state quantities are continuously removed as time elapses, and in a situation of degenerate motion there is not enough parallax between the frame bundles corresponding to the retained state quantities, making it difficult to constrain the motion and leading to drift. The visual-inertial odometry method that contains self-calibration and is based on keyframe sliding window filtering comprises several steps: image feature extraction, keyframe-based feature association, filter initialization, IMU-based state propagation, filter updating using feature observations, and keyframe-based state quantity management; the geometric parameters of the sensors can be estimated in real time. The described method organizes state quantities on the basis of keyframes; state quantities corresponding to keyframe bundles are not removed during degenerate motion, so good observation can still be ensured and drift is avoided. The described method is the first keyframe-based sliding window filtering method supporting self-calibration.

Description

A visual-inertial odometry method with self-calibration based on keyframe sliding window filtering
Technical field
The invention belongs to the field of multi-sensor fusion navigation and positioning, and in particular relates to the estimation of position, velocity and attitude based on a combined system of a monocular or binocular camera and an inertial measurement unit (IMU), and to the real-time self-calibration of the sensor parameters in the system.
Background art
How to estimate the position and attitude (i.e., the pose) of a device in a scene in real time is one of the core problems in application fields such as augmented reality (AR), mobile robots, and drones. Visual-inertial odometry can track the pose of the device in real time using the data collected by a camera and an IMU rigidly mounted on the device. This technology combines computer 3D vision with inertial dead reckoning and, using low-cost sensors, can achieve centimeter-level real-time positioning accuracy.
Many AR-related companies have developed this technology. In the 2014 Tango project, Google demonstrated that a mobile phone containing a fisheye camera and an IMU can estimate the phone's pose and reconstruct the scene in three dimensions by fusing the information from these sensors, proving that mobile devices can possess pose and spatial awareness without relying on satellite navigation systems. Google subsequently cooperated with mobile phone manufacturers to launch Tango-based smartphones, which can track their own three-dimensional motion trajectory with the equipped fisheye camera and IMU. In 2017, Apple launched the ARKit framework for iOS handheld devices containing a camera and an IMU for developing augmented-reality applications; its motion-tracking technology is a typical visual-inertial odometry [1], [2]. In the same year, Google launched the ARCore framework with visual-inertial odometry [3], [4] for Android phones.
Visual-inertial odometry is also widely used in industries such as drones and autonomous driving. DJI's new UAVs can accurately estimate their pose in complex scenes through visual-inertial odometry technology [5], enabling flight maneuvers such as obstacle avoidance and traversal. Several autonomous-driving suites use a visual-inertial odometry method to assist automated parking.
Current visual-inertial odometry methods still have some problems, such as how to cope with flawed parameter calibration and how to cope with degenerate motion (such as being stationary or undergoing approximately pure rotation). For a large number of consumer-grade products, low-cost sensors usually do not have accurate calibration, and this inaccurate calibration significantly reduces the pose-estimation accuracy of visual-inertial odometry. On the other hand, such devices (e.g., mobile phones) often experience degenerate motion in practical use, and existing real-time odometry methods often produce obvious pose drift in these situations.
Summary of the invention
In order to solve the above two problems, the present invention proposes a visual-inertial odometry method based on a keyframe-based sliding window filter (Keyframe-based Sliding Window Filter, KSWF) with self-calibration, suitable for combined camera-IMU systems. The method performs state estimation based on an extended Kalman filter (EKF), so it has good real-time performance. The method can use scene features to calibrate (i.e., self-calibrate) the geometric parameters of the camera and IMU in real time, and it filters the variables in the sliding window based on the concept of keyframes to handle degenerate motion. In the following description, the state quantities estimated by the filter, such as position, velocity, attitude, and sensor parameters, are collectively called the state vector. The filter also estimates the uncertainty of these state quantities, i.e., the covariance of the state vector. For brevity, position and attitude are referred to as the pose, and the pose and velocity at the same time t are referred to as the navigation state quantity. In this document, sensors generally refer to cameras and IMUs, the device refers to the apparatus carrying these sensors, such as a mobile phone, and the filter refers to the program that fuses the data for state estimation. KSWF is applicable to a system consisting of one or two cameras rigidly mounted together with one IMU; a sliding window filter fuses the collected data to estimate the state vector and covariance of the system (or of the carrying device). KSWF can calibrate in real time the biases, scale factors, misalignment, and acceleration sensitivity of the IMU, as well as the cameras' projection geometry parameters, extrinsic parameters, time delay, and rolling-shutter effect. Its feature-association front end takes as input a sequence of frames from a monocular camera or a sequence of roughly synchronized frame pairs from a binocular camera; for brevity, monocular frames or binocular frame pairs are hereafter collectively called frame bundles. The front end extracts feature points in the frame bundles and matches feature points between frame bundles based on the concept of keyframes, thereby obtaining the trajectories of feature-point observations (i.e., feature trajectories), and selects keyframe bundles according to the proportion of view overlap between the current frame bundle and historical frame bundles. The back-end filter performs pose propagation with the IMU data, then updates the filter's state vector and covariance with the observations in the feature trajectories, and finally manages the state quantities in the filter based on the concept of keyframes to guarantee real-time operation.
Because KSWF includes the geometric calibration parameters of the camera and IMU in the state vector, it can estimate the sensors' calibration parameters in real time to cope with inaccurate calibration. When accurate calibration parameters are available, the calibration parameters in the state vector can be fixed to reduce the amount of computation.
To prevent the amount of computation from growing continuously over time, this method selects and removes redundant state quantities based on the concept of keyframes. The process of removing redundant state quantities is also called marginalization. KSWF determines the marginalization order according to whether a navigation state quantity is associated with a keyframe bundle, so as to prolong the lifetime of keyframe bundles. In this way, when the device undergoes degenerate motion, the keyframe bundles remain essentially unchanged, so the method avoids the pose drift of traditional methods.
Aimed at the difficulty that existing camera-IMU fusion state-estimation techniques have in handling calibration errors and degenerate motion, the present invention provides a visual-inertial odometry method with self-calibration based on keyframe sliding window filtering. The device to which the invention applies has at least one IMU (containing a three-axis gyroscope and a three-axis accelerometer) and at least one camera (for example a wide-angle camera or a fisheye camera) rigidly mounted on it, and the camera shutter may be a rolling shutter or a global shutter. The invention assumes that the camera and IMU on the device have been roughly calibrated. Examples of such devices are smartphones, AR glasses, and delivery robots. The device collects a sequence of image frame bundles through N cameras (each frame bundle contains N images captured at similar times by the N cameras, N ≥ 1) and collects three-axis accelerometer data and three-axis gyroscope data through one IMU; this method then fuses the camera and IMU data to estimate the state and parameters of the device as well as their uncertainty. The position, velocity, attitude, and the parameters of the camera and IMU are collectively called the state vector, and the uncertainty of the state vector is described by its covariance; for brevity, the pose and velocity at the same time t are referred to as the navigation state quantity.
The implementation of the present invention comprises the following steps:
Step 1: when the j-th frame bundle arrives, extract the feature points and the corresponding descriptors on its N images, and obtain the set of feature points for each image k.
Further, the feature-point detection methods in step 1 include ORB and BRISK, and the descriptor-generation methods include BRISK, FREAK, and deep-learning-based feature-descriptor generation methods.
Step 2: match frame bundle j with the previous K keyframe bundles to extend the feature trajectories of the feature points. The matching of two frame bundles comprises the matching of N pairs of images; for each pair of images, the feature points in the two images are matched according to their feature descriptors. Two matched feature points correspond to one landmark point L m in the scene, m = 0, 1, ...; the observations of the landmark point L m in multiple frame bundles (j = 0, 1, ...), each consisting of the pixel coordinates (u, v) of the feature point in the k-th image of the j-th frame bundle, form a feature trajectory. After the current frame bundle j has been matched with the K keyframe bundles, whether frame bundle j is selected as a keyframe bundle is decided according to the ratio of the number of feature matches to the number of feature points in frame bundle j.
Next, if the previous frame bundle j-1 is not a keyframe bundle, frame bundle j and frame bundle j-1 are also matched to extend the feature trajectories of the feature points.
For the case N > 1, the N images within frame bundle j are also matched pairwise to extend the feature trajectories of the feature points.
The matching between two images is realized by searching for feature descriptors with small Hamming distance. Implementations of this search include brute-force search, k-nearest-neighbor search, or window-based search.
Step 3: if the sliding window filter has not been initialized, the state vector (pose, velocity, sensor parameters, etc.) and its covariance are initialized from the feature trajectories and IMU data in several frame bundles (such as two keyframe bundles).
In step 3, the state vector X(t) contains the current navigation state quantity π(t), the IMU parameters x imu, the parameters of the N cameras, the navigation state quantities corresponding to the past N kf keyframe bundles, the navigation state quantities corresponding to the most recent N tf +1 frame bundles, and m landmark points L 1, L 2, ..., L m; here x imu includes the biases, scale factors, misalignment and acceleration sensitivity of the IMU, and x C includes each camera's projection geometry parameters, extrinsic parameters, time delay and rolling-shutter effect; the past N kf keyframe bundles and the most recent N tf frame bundles form a sliding window of navigation state quantities.
Step 4: once the filter has been initialized, for the current frame bundle j whose feature matching is complete, recursively propagate the pose and velocity from those of the previous frame bundle j-1 to the time t j corresponding to frame bundle j according to the IMU data, and clone the propagated pose and velocity to augment the state vector and covariance of the filter.
Step 5: use the multiple feature trajectories in frame bundle j as observation information to update the state vector and covariance of the filter. The updated state vector and covariance can be published to downstream applications, for example the projection of virtual scenes in AR and robot path planning. According to whether a feature trajectory disappears in the current frame bundle j or whether its corresponding landmark point is in the state vector, the processing mode of the feature trajectory is determined, and the observations in these feature trajectories are then used to update the state vector and covariance. Because the geometric parameters of the sensors can be placed in the state vector and updated, this method supports real-time self-calibration.
Specifically, each feature trajectory is handled in one of three possible cases, each treated differently. The three cases are: the feature trajectory disappears in the current frame bundle j, the landmark point corresponding to the feature trajectory is in the state vector, and the feature trajectory corresponds to a well-observed landmark point that is not in the state vector. In the first case, a landmark point is first triangulated from the feature trajectory, the reprojection errors and Jacobian matrices of all observations in the feature trajectory are calculated, the Jacobian with respect to the landmark-point parameters is eliminated by matrix null-space projection, and outliers are removed by a Mahalanobis test. In the second case, the reprojection error and Jacobian matrix of the feature trajectory's observation in the current frame bundle j are first calculated, and then outliers are removed by a Mahalanobis test. In the third case, a landmark point is first triangulated from the feature trajectory, the reprojection errors and Jacobian matrices of all observations in the feature trajectory are calculated, outliers are removed by a Mahalanobis test, and the landmark point is then added to the state vector and covariance matrix. Finally, all feature trajectories are divided into the above three categories; for each category, the reprojection errors and Jacobian matrices are stacked, and the state vector and covariance matrix are updated in the same way as in a classic EKF.
Step 6: to control the amount of computation, redundant navigation state quantities in the current state vector are detected after the update is completed. These redundant state quantities are selected according to whether they correspond to keyframe bundles, and are then moved out of the filter.
Specifically, when the number of navigation state quantities in the filter's state vector exceeds the sum of the allowed number of keyframe bundles N kf and the number of recent frame bundles N tf, redundant frame bundles are selected and marginalized. Each marginalization operation selects at least N r redundant frame bundles (N r is 3 for a monocular camera and 2 for a binocular camera). To meet this requirement, redundant frame bundles are first selected among the most recent non-keyframe bundles, excluding the latest N tf frame bundles, and then among the earliest keyframe bundles. After N r redundant frame bundles have been found, the filter can be updated with the feature trajectories among them whose length exceeds two observations: if such a feature trajectory can successfully triangulate a landmark point from all of its observations, the feature-point observations in the redundant frame bundles on that trajectory are used for an EKF update, similar to the first case of step 5; feature points in the redundant frame bundles belonging to other feature trajectories are discarded. After the EKF update, the rows and columns of the covariance matrix and the navigation state quantities corresponding to these redundant frame bundles are deleted.
For the next frame bundle j+1, steps 1-6 are repeated. Each cycle publishes the filter's estimated state vector and covariance to serve downstream applications. The published information describes state quantities such as the pose, velocity, and sensor parameters of the device.
Compared with the prior art, the advantages and beneficial effects of the present invention are as follows:
1. Because the odometry method can estimate the IMU intrinsic parameters, camera intrinsic parameters, camera extrinsic parameters, time delay and rolling-shutter effect during filtering, it is applicable to camera-IMU combinations with inaccurate initial calibration or with rolling-shutter cameras, such as the camera-IMU combination in a mobile phone.
2. The odometry method adopts keyframe-based feature association and state management, so it can robustly handle degenerate motions such as standing still, hovering and near-pure rotation.
附图说明Description of drawings
Figure 1 is a schematic diagram of the visual-inertial odometry method based on keyframe sliding-window filtering; the modules shaded in gray use the concept of keyframes.
Figure 2 is a schematic diagram of a multi-threaded implementation of the visual-inertial odometry method based on keyframe sliding-window filtering; the modules shaded in gray use the concept of keyframes.
图3为基于关键帧的图像特征点匹配的步骤①-⑦的流程图,以双目相机为例。Figure 3 is a flow chart of steps ①-⑦ of image feature point matching based on key frames, taking a binocular camera as an example.
图4为状态向量中所含状态量的两个实例示意图。Fig. 4 is a schematic diagram of two examples of state quantities contained in a state vector.
图5为针对图3的情况,对当前帧束j进行导航状态量克隆之前和之后的状态向量示意图,π j是新增的状态量。 Fig. 5 is a schematic diagram of state vectors before and after cloning the navigation state quantity for the current frame bundle j for the situation in Fig. 3 , where π j is a newly added state quantity.
图6为针对图3的情况,在对当前帧束j导航状态量克隆之前的协方差矩阵示意图。FIG. 6 is a schematic diagram of the covariance matrix before cloning the navigation state quantity of the current frame bundle j for the situation in FIG. 3 .
Figure 7 is a schematic diagram of the covariance matrix after cloning the navigation state of the current frame bundle j for the situation in Figure 3; the squares filled with diagonal lines correspond to the newly added rows and columns, and note that the rows and columns filled with diagonal lines have elements equal to those of the rows and columns filled with the checkerboard pattern.
Figure 8 is a schematic diagram of the three ways a landmark feature track is used to update the state vector and covariance: if the landmark disappears in the current frame bundle and is not in the state vector, its Jacobian block is removed by projection and the track is then used for the filter update; if the landmark is in the state vector, it is used directly for the filter update; if the landmark is observed in the current frame bundle and can be successfully triangulated, it is added to the state vector and then used for the filter update.
图9为冗余帧束选取的方法示意图。图中显示了关键帧束(灰色底)和普通帧束(白色底)对应的导航状态量。注意在边缘化冗余帧束后需要保留至少一个关键帧束。FIG. 9 is a schematic diagram of a method for selecting redundant frame bundles. The figure shows the navigation state quantities corresponding to key frame bundles (gray background) and normal frame bundles (white background). Note that at least one keyframe bundle needs to be preserved after marginalizing redundant frame bundles.
Figure 10 is a schematic diagram of two examples of redundant frame-bundle selection, assuming N_tf = 3 and N_r = 2; gray backgrounds mark keyframe bundles, white backgrounds mark ordinary frame bundles, and dashed boxes mark the redundant frame bundles to be marginalized. Left: for the situation in Figure 3, the redundant frame bundles to be selected after the filter update of the current frame bundle is completed. Right: the redundant frame bundles to be selected in another hypothetical situation.
Detailed Description of the Embodiments
The present invention uses the data collected by a camera and an IMU and, taking into account the residual calibration errors of low-end sensors and the problems caused by degenerate motions such as standing still, proposes a keyframe-based sliding-window filter with self-calibration for real-time pose estimation. The filter is applicable to combined systems of a monocular or stereo camera and an IMU, and the camera shutter may be a rolling shutter or a global shutter. Estimating the geometric calibration parameters of the sensors in real time improves the accuracy of pose estimation, and using the concept of keyframes avoids pose drift during degenerate motion.
The method provided by the present invention can be implemented with computer software technology. The main implementation steps are: image feature extraction, keyframe-based feature association, filter initialization, IMU-based state propagation, filter update using feature observations, and keyframe-based state management; see Figure 1. The embodiment takes a stereo camera plus IMU system as an example to describe the process of the present invention in detail, as follows:
Step 1: for a frame bundle j containing N temporally close images captured by N cameras, detect the feature points in each image and generate descriptors (i.e., feature vectors) for them. Feature-point detection methods include ORB [6] or BRISK [7], and descriptor-generation methods include BRISK [7] or FREAK [8].
实施例具体的实施过程说明如下(例如N=2,特征检测和描述方法都为BRISK):The specific implementation process of the embodiment is described as follows (for example, N=2, both feature detection and description methods are BRISK):
Detect and extract image features for the frame bundle j captured by the 2 cameras. To detect feature points in each image, a scale space with 4 octaves and 4 intra-octaves is first constructed, yielding 8 images of that frame at different sampling rates. The FAST [9] detector is then used to locate feature points, and non-maximum suppression is performed on the located feature points in the multi-scale space: each feature point is compared with its 26 neighbours, namely its 8-neighbourhood in image space and the 9×2 = 18 neighbours in the scale layers above and below, only feature points whose FAST response is maximal within this neighbourhood are kept, and non-maxima are removed. Finally, quadratic fitting yields the sub-pixel position of each feature point and its precise scale in scale space. To describe each feature point, concentric circles of different radii are constructed around it and a fixed number of equally spaced sample points are taken on each circle, giving 60 sample points including the feature point itself; Gaussian filtering is applied to these sample points to suppress aliasing, and the dominant intensity direction they form is computed. The sampling pattern around the feature point is then rotated to this dominant direction to obtain a new sampling pattern, and 512 sample-point pairs are formed; the intensity differences of these pairs yield a 512-bit binary code, i.e., each feature vector occupies 64 bytes.
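For reference, this detect-and-describe step can be prototyped with OpenCV's BRISK implementation, which follows the scheme just described (FAST-based detection in a scale-space pyramid plus a 512-bit binary descriptor). The detection threshold and octave count below are illustrative assumptions, not the exact configuration of the embodiment.

```python
import cv2

# Illustrative BRISK configuration; the embodiment's exact settings may differ.
brisk = cv2.BRISK_create(thresh=30, octaves=4)

def extract_features(gray_image):
    """Detect BRISK keypoints and compute their 64-byte binary descriptors."""
    keypoints, descriptors = brisk.detectAndCompute(gray_image, None)
    return keypoints, descriptors
```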
步骤2,利用步骤1提取的图像特征描述子进行基于关键帧的特征关联(也称为特征匹配)。如图3所示,匹配过程将当前帧束j的特征与几个先前帧束的特征相关联,包括K个 最近的关键帧束和上一帧束j-1。在与K关键帧束的匹配完成后,如果当前帧束中特征匹配个数与特征点个数的比例<T r,则将帧束j选为关键帧束。如果有多个相机,也可以在当前帧束j的N张图像之间进行特征关联。两帧图像之间特征描述子的匹配通过暴力搜索、k最近邻或窗口搜索的方法完成。 Step 2, use the image feature descriptor extracted in step 1 to perform feature association based on keyframes (also called feature matching). As shown in Fig. 3, the matching process associates the features of the current frame bundle j with the features of several previous frame bundles, including the K most recent keyframe bundles and the previous frame bundle j−1. After the matching with the K key frame bundle is completed, if the ratio of the number of feature matches to the number of feature points in the current frame bundle is < T r , frame bundle j is selected as the key frame bundle. If there are multiple cameras, feature association can also be performed between the N images of the current frame bundle j. The matching of feature descriptors between two frames of images is done by brute force search, k-nearest neighbor or window search.
The specific implementation process of the embodiment is described as follows (for example, N = 2, K = 2, brute-force search, T_r = 0.2):
Figure 3 shows the feature-point matching process for data collected by a device with a stereo camera, where frame bundles with a gray background are keyframe bundles and frame bundles with a white background are ordinary frame bundles. Feature matching consists of steps ①-⑦: ① matches the left-camera image of the current frame bundle j with that of the previous frame bundle j-1; ② matches the left-camera image of the current frame bundle j with that of the keyframe bundle j-2; ③ matches the left-camera image of the current frame bundle j with that of the keyframe bundle j_2; ④ matches the right-camera image of the current frame bundle j with that of the previous frame bundle j-1; ⑤ matches the right-camera image of the current frame bundle j with that of the keyframe bundle j-2; ⑥ matches the right-camera image of the current frame bundle j with that of the keyframe bundle j_2; and ⑦ matches the two images of the left and right cameras within the current frame bundle j. Feature matching first performs the matching of the current frame bundle j against the two most recent keyframe bundles j-2 and j_2 (②③⑤⑥) and then decides whether to select the current frame bundle j as a keyframe bundle: if the number of feature matches in the current frame bundle is less than 20% of its number of feature points, j is selected as a keyframe bundle. Next, the current frame bundle j is matched against the previous frame bundle j-1 (①④). Finally, the matching between the 2 images within the current frame bundle j is performed (⑦). Matching two images involves two steps: the nearest feature matches are first found by brute-force search on the Hamming distance between descriptors, and abnormal matches are then removed by a RANSAC scheme built around the 5-point algorithm.
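The two-image matching step can be sketched with OpenCV as below. The function name, the cross-check option, the RANSAC confidence and the 1-pixel threshold are assumptions for illustration; the essential-matrix estimator is OpenCV's 5-point algorithm run inside RANSAC.

```python
import cv2
import numpy as np

def match_and_filter(desc1, kpts1, desc2, kpts2, K):
    """Brute-force Hamming matching followed by 5-point RANSAC outlier removal."""
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(desc1, desc2)
    if len(matches) < 5:               # the 5-point algorithm needs >= 5 matches
        return matches

    pts1 = np.float32([kpts1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kpts2[m.trainIdx].pt for m in matches])
    _, mask = cv2.findEssentialMat(pts1, pts2, K,
                                   method=cv2.RANSAC, prob=0.999, threshold=1.0)
    if mask is None:
        return matches
    return [m for m, ok in zip(matches, mask.ravel()) if ok]
```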
Step 3: if the sliding-window filter has not yet been initialized, the state vector and covariance are initialized from the feature tracks in a few frame bundles (e.g., 2 keyframe bundles) and the IMU data. In the general case, the state vector X(t) formed by all state quantities at time t is as shown in Figure 4; it contains the navigation state π(t) at the current time t, the IMU parameters x_imu, the parameters of the N cameras (one parameter block x_C per camera), the navigation states corresponding to the past N_kf keyframe bundles, the navigation states corresponding to the most recent N_tf+1 frame bundles, and m landmark points L_1, L_2, ..., L_m (in Figure 4, N_kf = 2 and N_tf = 3). Here x_imu may include the IMU biases, scale factors, axis misalignments and acceleration sensitivity (Figure 4, example 2), and x_C may include each camera's projection geometry parameters, extrinsic parameters, time delay and rolling-shutter effect (Figure 4, example 2). The past N_kf keyframe bundles and the most recent N_tf frame bundles form a sliding window of navigation states.
A specific implementation of the embodiment is as follows:
After step 2 has produced the two keyframe bundles j_1 and j_2 and the corresponding IMU data, the filter is initialized. The initial position and velocity of the filter are set to zero, and the initial attitude is determined from the initial accelerometer data so that the z-axis of the filter's reference world frame {W} points along the negative gravity direction. For the IMU biases, if the pixel motion (i.e., optical flow) of frame bundle j_2 relative to j_1 is small, the gyroscope and accelerometer biases are set by averaging the IMU data; otherwise they are set to zero. The time delay of the camera relative to the IMU is initialized to 0. The covariances of the other sensor parameters and state quantities are set from datasheets or experience. If a sensor has already been well calibrated beforehand, the entries of the covariance matrix corresponding to its parameters can be set to zero so that these parameters are fixed, i.e., they are not updated.
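The attitude initialization from accelerometer data can be sketched as follows: while the device is roughly static, the averaged specific force points opposite to gravity and defines the world z-axis. The construction of the remaining axes and the zero initial yaw are assumptions made for illustration (yaw is unobservable from gravity alone).

```python
import numpy as np

def initial_attitude_from_accel(accel_samples):
    """Return R_WB (body -> world) so that the world z-axis points up.

    accel_samples: (N, 3) raw accelerometer readings taken while the device
    is roughly static; the mean specific force then points opposite to gravity.
    """
    f = np.mean(np.asarray(accel_samples, dtype=float), axis=0)
    z_w = f / np.linalg.norm(f)                  # world z-axis in body coordinates
    helper = np.array([1.0, 0.0, 0.0])
    if abs(np.dot(helper, z_w)) > 0.9:           # avoid a degenerate cross product
        helper = np.array([0.0, 1.0, 0.0])
    y_w = np.cross(z_w, helper)
    y_w /= np.linalg.norm(y_w)
    x_w = np.cross(y_w, z_w)
    R_BW = np.column_stack([x_w, y_w, z_w])      # world axes expressed in the body frame
    return R_BW.T                                # R_WB
```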
Step 4: for the current frame bundle j, propagate the navigation state π(t) and the covariance matrix to the time t_j of that frame bundle using the IMU data. Then clone the navigation state, π(t_j) = π(t), append it to the state vector, and add the rows and columns corresponding to π(t_j) to the covariance matrix, copying their values from the rows and columns corresponding to π(t).
A specific implementation of the embodiment is as follows:
For frame bundle j, the navigation state π(t) and the covariance matrix are propagated to the time t_j of that frame bundle using the IMU data. The navigation state π(t) is then cloned as π(t_j) and added to the state vector (as shown in Figure 5, where π_j := π(t_j)), and the covariance matrix (shown before cloning in Figure 6) is augmented with the corresponding rows and columns (as shown in Figure 7).
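The covariance augmentation of Figures 6-7 amounts to duplicating the rows and columns of the cloned navigation-state block. A minimal sketch, assuming the indices of π(t) inside the state vector are known:

```python
import numpy as np

def clone_nav_state(P, idx):
    """Stochastic cloning: append a copy of the navigation-state block.

    P: current covariance; idx: indices of pi(t) inside the state vector
    (the layout is an assumption). The appended rows/columns duplicate the
    corresponding rows/columns of P, as illustrated in Figures 6-7.
    """
    n = P.shape[0]
    J = np.vstack([np.eye(n), np.eye(n)[idx, :]])   # Jacobian of the cloning map
    return J @ P @ J.T
```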
Step 5: update the state vector and covariance using the feature tracks in the current frame bundle j. The EKF update is performed in different ways for the three situations a feature track may fall into: the feature track disappears in the current frame bundle; the feature track corresponds to a landmark point that is in the state vector; or the feature track corresponds to a well-observed landmark point that is not in the state vector. A well-observed landmark point must satisfy two conditions: it has a sufficiently long observation history, for example being observed in 7 images, and the corresponding landmark point can be successfully triangulated, i.e., the observations have sufficient parallax. In the first case, a landmark point is triangulated from the feature track, the reprojection errors and Jacobians of all observations on the track are computed, the Jacobian with respect to the landmark parameters is eliminated by projection onto the matrix null space, and outliers are removed with a Mahalanobis test. In the second case, the reprojection error and Jacobian of the track's observation in the current frame bundle j are computed, and outliers are removed with a Mahalanobis test. In the third case, a landmark point is triangulated from the feature track, the reprojection errors and Jacobians of all observations on the track are computed, outliers are removed with a Mahalanobis test, and the landmark point is added to the state vector and covariance matrix. Finally, all feature tracks are divided into three classes according to these three situations; for each class, the reprojection errors and Jacobians above are stacked and used to update the state vector and covariance matrix in the same way as a classical EKF.
A specific implementation of the embodiment is as follows:
As shown in Figure 8, a feature track that disappears in the current frame bundle j can be used as a complete track to triangulate a landmark point if it contains at least 3 observations. If the triangulation succeeds, the reprojection errors and Jacobians of all observations on the track are computed, the Jacobian with respect to the landmark parameters is eliminated by projection onto the matrix null space, and outliers are removed with a Mahalanobis test. For a feature track whose landmark point is in the state vector, the reprojection error and Jacobian of the landmark observation in the current frame bundle j are computed first, and outliers are then removed with a Mahalanobis test. For a sufficiently long feature track (e.g., more than 7 observations) of a landmark point that is not in the state vector, the landmark point is first triangulated; if the triangulation succeeds, the reprojection errors and Jacobians of all observations on the track are computed, outliers are removed with a Mahalanobis test, and the landmark point is added to the state vector while the covariance matrix is expanded accordingly. Finally, all feature tracks are divided into three classes according to the above situations; the reprojection errors and Jacobians of all tracks in each class are stacked together, and one EKF update step is executed per class (three updates in total).
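The null-space projection used in the first case removes the landmark parameters from the stacked measurement model before the EKF update. A minimal sketch, assuming the landmark Jacobian H_f has full column rank (3 for a Euclidean landmark):

```python
import numpy as np

def nullspace_project(r, H_x, H_f):
    """Project the stacked residuals onto the left null space of H_f.

    This removes the landmark parameters from the measurement model so that
    the EKF update only involves states kept in the filter.
    """
    U, _, _ = np.linalg.svd(H_f, full_matrices=True)
    A = U[:, H_f.shape[1]:]            # basis of the left null space of H_f
    return A.T @ r, A.T @ H_x          # projected residual and state Jacobian
```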
Step 6: to limit the computational cost, when the number of navigation states in the sliding window exceeds the sum of the allowed number of keyframe bundles N_kf and the allowed number of recent frame bundles N_tf, redundant frame bundles are selected from the sliding window and marginalized. Because two observations of an unknown landmark point cannot provide an effective constraint on the pose, at least N_r redundant frame bundles are selected in each marginalization operation (N_r is 3 for a monocular camera and 2 for a stereo camera). To meet this requirement, redundant frame bundles are first selected among the most recent non-keyframe bundles, excluding the most recent N_tf frame bundles, and then among the earliest keyframe bundles. After N_r redundant frame bundles are found, the feature tracks whose length exceeds two observations can be used to update the sliding-window filter: specifically, if such a feature track can successfully triangulate a landmark point from all of its observations, the feature-point observations inside the redundant frame bundles on that track are used for an EKF update, similar to the first case of step 5; feature points in the redundant frame bundles that belong to other feature tracks are discarded. After the EKF update, the navigation states of these redundant frame bundles and the corresponding rows and columns of the covariance matrix are deleted.
A specific implementation of the embodiment is as follows:
To select N_r redundant frame bundles (N_r is 3 for a monocular camera and 2 for a stereo camera), as shown in Figure 9, the most recent N_tf frame bundles are first excluded, then redundant frame bundles are chosen among the most recent non-keyframe bundles (say n_r of them), and finally N_r - n_r redundant frame bundles are chosen among the earliest keyframe bundles. Note that after removing these redundant frame bundles, at least one keyframe bundle must remain. For the case of N_kf = 2 and N_tf = 3 in Figure 3, the redundant frame bundles are the keyframe bundle j_1 and the ordinary frame bundle j-3 (the dashed boxes in the left part of Figure 10). For another hypothetical situation, shown in the right part of Figure 10, the redundant frame bundles are the two oldest frame bundles. Once the redundant frame bundles are selected, the filter is updated with the observations of landmark points that are observed more than twice in the redundant frame bundles. If such a landmark point can be successfully triangulated from all of its observations, the observations inside the redundant frame bundles on its feature track are used for an EKF update, similar to the first case of step 5; the other feature-point observations in the redundant frame bundles are discarded. After the EKF update, the navigation states of these redundant frame bundles and the corresponding rows and columns of the covariance matrix are deleted.
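After the final EKF update that consumes their observations, the marginalized bundles are simply dropped, which for the state vector and covariance means deleting the corresponding entries, rows and columns. A minimal sketch (the index bookkeeping is assumed to be handled elsewhere):

```python
import numpy as np

def marginalize_states(x, P, remove_idx):
    """Drop the marginalized navigation-state entries from x and P."""
    keep = np.setdiff1d(np.arange(P.shape[0]), remove_idx)
    return x[keep], P[np.ix_(keep, keep)]
```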
本发明提供的方法,也可以利用多线程设计实现为相应程序,如图2所示。基于关键帧滑窗滤波的视觉惯性里程计的多线程程序包含特征提取线程、基于关键帧的特征关联线程、相机-IMU同步线程和滑窗滤波线程。该多线程实现可以显著提高程序的吞吐率。The method provided by the present invention can also be implemented as a corresponding program by utilizing multi-thread design, as shown in FIG. 2 . The multi-threaded program of visual-inertial odometry based on key frame sliding window filtering includes feature extraction thread, key frame based feature association thread, camera-IMU synchronization thread and sliding window filtering thread. This multi-threaded implementation can significantly increase the throughput of the program.
The feature-extraction thread detects the feature points of every image captured by the cameras and generates descriptors for them using a feature-description method. For a frame bundle j containing N images with close timestamps captured by the N cameras, feature points are detected and descriptors extracted with a method such as BRISK [7]. BRISK uses the FAST [9] algorithm for feature-point detection and detects features by building an image pyramid in scale space. To generate the descriptor of a feature point, the gray values of 512 pixel pairs in the neighbourhood of the feature point are compared, yielding the binary code of the descriptor.
The keyframe-based feature-association thread associates the image features of the current frame bundle with those of previous frame bundles, performing keyframe-based feature-point matching on the image features produced by the feature-extraction thread. The specific matching steps are: matching the current frame bundle with K keyframe bundles, matching the current frame bundle with the previous frame bundle, and matching the N images of the current frame bundle pairwise. After the matching with the K keyframe bundles is completed, whether the current frame bundle is selected as a keyframe bundle is decided from the feature-matching ratio. One implementation of two-image feature matching finds the descriptor closest to each feature descriptor by brute-force search; the resulting matches are then passed through a RANSAC scheme built around the 5-point algorithm to remove outliers.
The camera-IMU synchronization thread synchronizes the camera and IMU data. Through a condition variable, this thread can block the other threads until the IMU data corresponding to the current frame bundle has arrived.
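The synchronization idea can be sketched with Python's threading primitives as below; the class and method names and the deque buffer are illustrative assumptions rather than the patent's implementation.

```python
import threading
from collections import deque

class CameraImuSynchronizer:
    """Minimal sketch of the camera-IMU synchronization idea.

    The filter thread blocks on a condition variable until the buffered IMU
    data covers the timestamp of the current frame bundle; the IMU callback
    notifies waiters whenever a new sample arrives.
    """

    def __init__(self):
        self._cond = threading.Condition()
        self._imu = deque()

    def add_imu(self, stamp, accel, gyro):
        with self._cond:
            self._imu.append((stamp, accel, gyro))
            self._cond.notify_all()

    def wait_for_imu(self, bundle_stamp):
        """Block until IMU data up to bundle_stamp is available, then return it."""
        with self._cond:
            self._cond.wait_for(
                lambda: self._imu and self._imu[-1][0] >= bundle_stamp)
            return [s for s in self._imu if s[0] <= bundle_stamp]
```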
The sliding-window filtering thread updates the filter with the feature tracks produced by the feature-association thread and the IMU data. If the filter has not been initialized, its state vector and covariance are initialized from the feature tracks between frame bundles, the IMU data, datasheets and empirical values. Once initialized, when frame bundle j arrives, the navigation state π(t) and the covariance matrix are propagated to the time t_j of frame bundle j using the IMU data; the navigation state is then cloned, π(t_j) = π(t), and added to the state vector, with the covariance matrix expanded accordingly. Next, the EKF update is performed in three cases according to the properties of the feature tracks: feature tracks that disappear in the current frame bundle j, feature tracks whose landmark points are in the state vector, and feature tracks of landmark points that are not in the state vector but have sufficiently long observations. The three cases differ in how the update is prepared. In the first case, projection onto the matrix null space is used to eliminate the Jacobian with respect to the landmark parameters. In the third case, the newly triangulated landmark points are added to the state vector and covariance matrix. The feature tracks are divided into three classes according to these three cases, and one classical EKF update is executed per class. When the number of navigation states in the state vector exceeds the sum of the allowed number of keyframe bundles N_kf and the allowed number of recent frame bundles N_tf, redundant frame bundles are selected and marginalized. At least N_r redundant frame bundles are selected each time (N_r is 3 for a monocular device and 2 for a stereo device). When selecting these frame bundles, the most recent N_tf frame bundles are excluded; redundant frame bundles are then chosen among the non-keyframe bundles and finally among the earliest keyframe bundles. Once the redundant frame bundles are selected, the filter can be updated with the observations of landmark points observed more than twice in the redundant frame bundles. Among these landmark points, those that can be successfully triangulated are selected, and their observations in the redundant frame bundles are used for one EKF update, similar to the first case of the update described above. The other feature-point observations in the redundant frame bundles are discarded. After the update, the navigation states of these redundant frame bundles and the corresponding entries of the covariance matrix are deleted.
本文中所描述的具体实施例仅仅是对本发明精神作举例说明。本发明所属技术领域的技术人员可以对所描述的具体实施例做各式修改、补充或采用类似的方式替代,但并不会偏离本发明的精神或者超越所附权利要求书所定义的范围。The specific embodiments described herein are merely illustrative of the spirit of the invention. Those skilled in the art to which the present invention pertains may make various modifications, supplements or similar substitutions to the described specific embodiments without departing from the spirit of the present invention or exceeding the scope defined in the appended claims.
[1] A. Flint, O. Naroditsky, C. P. Broaddus, A. Grygorenko, S. Roumeliotis, and O. Bergig, "Visual-based inertial navigation," US9424647B2, Aug. 23, 2016. Accessed: Apr. 26, 2021. [Online]. Available: https://patents.google.com/patent/US9424647/nl
[2] S. I. Roumeliotis and K. J. Wu, "Inverse sliding-window filters for vision-aided inertial navigation systems," US9658070B2, May 23, 2017. Accessed: Apr. 26, 2021. [Online]. Available: https://patents.google.com/patent/US9658070B2/en?oq=US9658070B2-patent-Inverse+sliding-window+filters+for+vision-aided+inertial+navigation
[3] S. I. Roumeliotis and A. I. Mourikis, "Vision-aided inertial navigation," US9766074B2, Sep. 19, 2017. Accessed: Apr. 26, 2021. [Online]. Available: https://patents.google.com/patent/US9766074B2/en
[4] M. Li and A. Mourikis, "Real-time pose estimation system using inertial and feature measurements," US9798929B2, Oct. 24, 2017. Accessed: Feb. 18, 2022. [Online]. Available: https://patents.google.com/patent/US9798929B2/en
[5] T. Qin, P. Li, and S. Shen, "VINS-Mono: A robust and versatile monocular visual-inertial state estimator," IEEE Trans. Robot., vol. 34, no. 4, pp. 1004-1020, Aug. 2018, doi: 10.1109/TRO.2018.2853729.
[6] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, "ORB: An efficient alternative to SIFT or SURF," in 2011 International Conference on Computer Vision, Nov. 2011, pp. 2564-2571, doi: 10.1109/ICCV.2011.6126544.
[7] S. Leutenegger, M. Chli, and R. Y. Siegwart, "BRISK: Binary robust invariant scalable keypoints," in Intl. Conf. on Computer Vision (ICCV), Barcelona, Spain, Nov. 2011, pp. 2548-2555, doi: 10.1109/ICCV.2011.6126542.
[8] A. Alahi, R. Ortiz, and P. Vandergheynst, "FREAK: Fast retina keypoint," in 2012 IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2012, pp. 510-517, doi: 10.1109/CVPR.2012.6247715.
[9] E. Rosten and T. Drummond, "Machine learning for high-speed corner detection," in European Conference on Computer Vision, 2006, pp. 430-443.

Claims (6)

  1. A visual-inertial odometry method with self-calibration based on keyframe sliding-window filtering, characterized by comprising the following steps:
    First, an IMU and N cameras are fixedly mounted on the device; the cameras collect a sequence of image frame bundles, and the IMU collects 3-axis accelerometer data and 3-axis gyroscope data; a filter then performs the following steps to estimate the state and parameters of the device as well as their uncertainty; hereinafter, the position, velocity, attitude and the camera and IMU parameters are collectively referred to as the state vector, and the uncertainty of the state vector is described by its covariance; for simplicity, the pose and velocity at the same time t are referred to as the navigation state;
    Step 1, for a frame bundle j containing N temporally close images captured by the N cameras, detecting the feature points in each image and generating descriptors, i.e., feature vectors, for these feature points;
    步骤2,利用步骤1提取的特征描述子进行基于关键帧的特征匹配;Step 2, using the feature descriptor extracted in step 1 to perform keyframe-based feature matching;
    步骤3,滤波器尚未初始化时,将通过帧束中的特征轨迹和IMU数据初始化状态向量和协方差;Step 3, when the filter has not been initialized, the state vector and covariance will be initialized through the feature trajectory and IMU data in the frame bundle;
    Step 4, when the filter has been initialized, for a frame bundle j that has completed feature matching, recursively estimating, from the pose and velocity corresponding to frame bundle j-1 and the IMU data, the pose and velocity at the time t_j corresponding to frame bundle j, and cloning the estimated pose and velocity to augment the state vector and covariance of the filter;
    步骤5,利用特征点的特征轨迹更新滤波器的状态向量和协方差;Step 5, update the state vector and covariance of the filter by using the feature trajectory of the feature point;
    步骤6,更新完毕后,检测是否存在冗余帧,并删除冗余帧对应的导航状态量和协方差矩阵相应的行和列;Step 6, after the update is completed, detect whether there is a redundant frame, and delete the corresponding row and column of the navigation state quantity and the covariance matrix corresponding to the redundant frame;
    对下一帧束j+1,将重复执行步骤1-6;For the next frame bundle j+1, steps 1-6 will be repeated;
    每次循环都发布滤波器估计的状态向量和协方差,以服务下游的应用。Each loop publishes the filter estimated state vector and covariance to serve downstream applications.
  2. The visual-inertial odometry method with self-calibration based on keyframe sliding-window filtering according to claim 1, characterized in that: the feature-point detection methods in step 1 include ORB and BRISK, and the descriptor-generation methods include BRISK, FREAK and deep-learning-based feature-descriptor generation methods.
  3. The visual-inertial odometry method with self-calibration based on keyframe sliding-window filtering according to claim 1, characterized in that: the feature-matching process in step 2 associates the features of the current frame bundle j with the features of several previous frame bundles, including the K most recent keyframes and the previous frame bundle j-1; after the matching with the K keyframes is completed, if the ratio of the number of feature matches to the number of feature points in the current frame is less than T_r, frame bundle j is selected as a keyframe; if each frame bundle contains N images from multiple cameras, feature association is also performed among the N images of the current frame bundle j; the matching of feature descriptors between two images is done by brute-force search, k-nearest-neighbour search or window search.
  4. The visual-inertial odometry method with self-calibration based on keyframe sliding-window filtering according to claim 1, characterized in that: the state vector X(t) in step 3 contains the current navigation state π(t), the IMU parameters x_imu, the parameters of the N cameras, the navigation states corresponding to the past N_kf keyframe bundles, the navigation states corresponding to the most recent N_tf+1 frame bundles, and m landmark points L_1, L_2, ..., L_m; wherein x_imu includes the biases, scale factors, axis misalignments and acceleration sensitivity of the IMU, and x_C contains each camera's projection geometry parameters, extrinsic parameters, time delay and rolling-shutter effect; the past N_kf keyframe bundles and the most recent N_tf frame bundles form a sliding window of navigation states.
  5. The visual-inertial odometry method with self-calibration based on keyframe sliding-window filtering according to claim 1, characterized in that: in step 5, the EKF update is performed in different ways for the three situations that may occur for each feature track, namely that the feature track disappears in the current frame, that the landmark point corresponding to the feature track is in the state vector, and that the feature track corresponds to a well-observed landmark point that is not in the state vector; for the first situation, a landmark point is first triangulated from the feature track, the reprojection errors and Jacobian matrices of all observations on the feature track are then computed, the Jacobian matrix of the landmark-point parameters is eliminated by projection onto the matrix null space, and outliers are removed by a Mahalanobis test; for the second situation, the reprojection error and Jacobian matrix of the observation of the feature track in the current frame bundle j are first computed, and outliers are then removed by a Mahalanobis test; for the third situation, a landmark point is first triangulated from the feature track, the reprojection errors and Jacobian matrices of all observations on the feature track are then computed, outliers are removed by a Mahalanobis test, and the landmark point is added to the state vector and covariance matrix; finally, all feature tracks are divided into three classes according to the above three observation situations, and for each class of feature tracks the state vector and covariance matrix are updated using the above reprojection errors and Jacobian matrices, in the same way as in a classical EKF.
  6. The visual-inertial odometry method with self-calibration based on keyframe sliding-window filtering according to claim 5, characterized in that step 6 is specifically implemented as follows:
    when the number of navigation states in the filter state vector exceeds the sum of the allowed number of keyframes N_kf and the allowed number of recent frames N_tf, redundant frame bundles are selected and marginalized; each marginalization operation selects at least N_r redundant frame bundles, and to meet this requirement redundant frames are first selected among the most recent non-keyframe bundles while excluding the most recent N_tf frame bundles, and then among the earliest keyframe bundles; after N_r redundant frames are found, the sliding-window filter is updated using the feature tracks whose length exceeds two observations, i.e., if such a feature track can successfully triangulate a landmark point from all of its observations, the feature observations in the redundant frames on that track are used for an EKF update in the same way as the first situation in step 5; the feature points in the redundant frames that belong to other feature tracks are discarded, and after the EKF update the navigation states of these redundant frames and the corresponding rows and columns of the covariance matrix are deleted.