GB2599948A - Initialising keyframes for visual-inertial localisation and/or mapping - Google Patents
- Publication number
- GB2599948A (application GB2016445.5A / GB202016445A)
- Authority
- GB
- United Kingdom
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/50—Depth or shape recovery
- G06T7/55—Depth or shape recovery from multiple images
- G06T7/579—Depth or shape recovery from multiple images from motion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30244—Camera pose
Abstract
Provided are a mobile device and computer-implemented method for selecting and initialising keyframes to describe a 3-D environment, to support visual-inertial localisation in the environment and/or mapping of the environment. The method comprises selecting one or more image frames to be keyframes based at least in part on inertial data. The method comprises obtaining visual data generated at a mobile device, comprising a series of image frames captured by a camera of the mobile device, obtaining inertial data generated at the mobile device, comprising at least one or any combination of two or more of: accelerometer, gyroscope and compass measurements, selecting one or more of the image frames to be keyframes and initialising one or more new keyframes in the map from the respective selected one or more image frames wherein the selecting may be based at least in part on the inertial data and may include detecting a saturation event in the inertial data, and in response, rejecting image frames captured during a first time interval after the saturation event. The selecting may also include selecting an image frame to be a keyframe if the number of keyframes in a sliding temporal window before the image frame is below a first threshold.
Description
DESCRIPTION
INITIALISING KEYFRAMES FOR VISUAL-INERTIAL LOCALISATION AND/OR
MAPPING
FIELD OF THE INVENTION
This invention relates to localisation, meaning the estimation of the pose of a mobile device in a 3-D environment, and/or mapping, meaning the mapping of the 3-D environment. It relates in particular to visual-inertial localisation and/or mapping, in which visual data and inertial measurements are combined in the pose estimation and/or mapping process.
BACKGROUND OF THE INVENTION
Visual-inertial systems for simultaneous localisation and mapping (SLAM) are known in the art. These systems combine an inertial measurement unit (IMU) and a camera. The IMU provides one or more of accelerometer, gyroscope and compass measurements; and the camera provides visual input, in the form of a sequence of images. These may be either monocular images, stereo image-pairs, or a set of images from a multi-camera rig.
The mobile device comprising the camera and IMU may be handheld or mounted on a vehicle or robot moving through an environment. The goals of the system are twofold: to localise -that is, estimate the position and orientation ("pose") of the camera -and also to create a map of the environment through which the camera is moving.
Keyframe creation is an important part of the mapping and localisation process.
Keyframes seek to capture visual information of the 3-D environment in a compact form. It is desirable to create a sufficient number of keyframes to fully describe the environment. Equally, it is desirable not to create too many keyframes, in order to ensure a concise description of the environment which minimises unnecessary redundancy.
Better selection of keyframes can support more efficient and more accurate mapping and localisation. Conventionally, image frames are selected as keyframes based on the visual content of the image frame.
SUMMARY OF THE INVENTION
The invention is defined by the claims.
According to an aspect of the invention there is provided a computer-implemented method for initialising keyframes in a map of a 3-D environment, for visual-inertial localisation in the environment and/or mapping of the environment, according to claim 1.
The inventors have recognised that inertial measurements offer a rich source of information for better selection of keyframes, in a visual-inertial localisation system or a visual-inertial SLAM system. Inertial information can either be used alone or in combination with visual information, to guide the selection of keyframes.
The selecting may be performed "online", by evaluating each new image frame and deciding whether or not to create a keyframe based on that image frame.
The selecting may include: detecting a saturation event in the inertial data; and in response, rejecting image frames captured during a first time interval after the saturation event.
A saturation event occurs when one or more inertial sensors reports an inertial measurement equal to a maximum measurable value. This typically means (i) that there is uncertainty in the inertial measurement, because the true value is likely to be larger than the maximum measurable value; and (ii) that the mobile device is subject to relatively energetic motion. The inventors have recognised that it is advantageous not to create a keyframe for a certain period after such a saturation event. Creating a keyframe soon after saturation events may mean that there is significant motion blur in the new keyframe, which can make the tracking of visual features unstable.
Rejecting image frames means preventing the relevant frames from being selected as keyframes.
The first time interval may be defined explicitly or may be defined implicitly as a number of frames. For example, the method may reject the first 15 frames after a saturation event.
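By way of illustration only, the frame-count version of this gate might be implemented along the following lines (the class name, the sensor limits and the 15-frame window are assumptions of the sketch, not prescribed values):

```python
import numpy as np

SATURATION_REJECT_FRAMES = 15  # assumed length of the first time interval, in frames

class SaturationGate:
    """Rejects keyframe candidates for a fixed number of frames after IMU saturation."""

    def __init__(self, accel_limit, gyro_limit, reject_frames=SATURATION_REJECT_FRAMES):
        self.accel_limit = accel_limit      # maximum measurable acceleration (from the sensor datasheet)
        self.gyro_limit = gyro_limit        # maximum measurable angular rate
        self.reject_frames = reject_frames
        self.frames_since_saturation = reject_frames  # start in the unblocked state

    def update(self, accel, gyro):
        """Call once per image frame with the IMU samples accumulated since the last frame."""
        saturated = (np.max(np.abs(accel)) >= self.accel_limit or
                     np.max(np.abs(gyro)) >= self.gyro_limit)
        if saturated:
            self.frames_since_saturation = 0
        else:
            self.frames_since_saturation += 1
        # Keyframe creation is allowed only once the rejection window has elapsed.
        return self.frames_since_saturation >= self.reject_frames
```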
The selecting optionally includes selecting an image frame to be a keyframe if the number of keyframes in a sliding temporal window before the image frame is below a first threshold.
The image frame may be selected based on this criterion provided that none of the negative criteria summarised herein is met -for example, provided that there has not recently been a saturation event, and provided that a sufficient number of visual features has been extracted from the image frame (see below). The first threshold may be 2 frames, for example.
The selecting may include selecting an image frame to be a keyframe if a second time interval has elapsed since the preceding keyframe in the series of image frames. This may be beneficial for stability of tracking, because the covariance of pre-integrated IMU measurements will grow with longer time differences between consecutive key frames. The second time interval may be 3 seconds, for example. The method may further comprise calculating a measure of uncertainty for the inertial data, and the selecting may include selecting an image frame to be a keyframe if, for that image frame, a value of the measure of uncertainty is greater than a second threshold.
It can be beneficial to create a new keyframe if there is a large degree of uncertainty in the inertial measurements (provided that the inertial measurements are not saturated). The uncertainty tends to grow with the length of time that has elapsed since the preceding keyframe. However, it may grow at different rates depending on the circumstances. Consequently, considering the measure of uncertainty may offer a better way to trigger creation of a keyframe than simply considering the length of time that has elapsed.
The measure of uncertainty may comprise an error covariance of the inertial data. The error covariance may be calculated based on a comparison of the inertial data with an estimated pose of the mobile device based on visual and inertial data. The method may comprise: calculating a first measure of uncertainty for rotational inertial data; comparing the first measure with the second threshold; if the first measure is above the second threshold, selecting the corresponding image frame to be a keyframe; and, if the first measure is not above the second threshold, calculating a second measure of uncertainty for acceleration data, comparing the second measure with a third threshold, and, if the second measure is above the third threshold, selecting the corresponding image frame to be a keyframe.
The rotational inertial data may be associated with measurements from a gyroscope and/or a compass. The acceleration data may be associated with measurements from an accelerometer.
Accelerometer measurements are typically more uncertain than rotational measurements; therefore, the present approach can save computational effort by avoiding the need to calculate the second measure of uncertainty if the first measure of uncertainty (rotational uncertainty) is already above the threshold.
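A minimal sketch of this two-stage check is given below; the use of the covariance trace as the scalar summary of uncertainty, and the function and argument names, are assumptions of the sketch rather than part of the described method:

```python
import numpy as np

def uncertainty_triggers_keyframe(gyro_covariance, accel_covariance,
                                  rot_threshold, accel_threshold):
    """Two-stage uncertainty check: rotational uncertainty first, acceleration only if needed.

    The covariances are 3x3 error covariances of the pre-integrated rotational and
    acceleration measurements; the trace is used here as a scalar summary (assumption).
    """
    rot_uncertainty = np.trace(gyro_covariance)
    if rot_uncertainty > rot_threshold:           # "second threshold" in the text
        return True
    # The accelerometer measure is only computed when the rotational test does not trigger.
    accel_uncertainty = np.trace(accel_covariance)
    return accel_uncertainty > accel_threshold    # "third threshold" in the text
```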
The method may further comprise extracting visual features from an image frame in the series of image frames, and the selecting may include rejecting the image frame if the number of visual features extracted from that image frame is below a fourth threshold.
If only a small number of visual features can be extracted from a given image frame, then this suggests that the image frame does not contain a very rich description of the environment. It is unlikely to be very useful as a keyframe, because there will not be many features available for matching. The fourth threshold may be 20 visual features, for example.
In some embodiments, visual features may be extracted from each image frame in the series of image frames.
Extracting visual features and matching them with landmarks in existing keyframes may be performed as part of the pose estimation process in a SLAM system. Therefore, using these visual features and/or landmarks to help guide the keyframe selection process need not impose any additional computational burden.
Each existing keyframe may have associated with it a plurality of landmarks. Each landmark may be a detected visual feature that is present in two or more keyframes. Each landmark defines a 3-D position for the respective visual feature. The method may comprise extracting visual features from an image frame in the series of image frames, and the selecting may include comparing the extracted visual features of said image frame with the landmarks of the preceding keyframe.
It can be beneficial to consider the relationship between the visual features in the current frame and the landmarks of the preceding keyframe, when deciding whether to create a new keyframe based on the current frame. For example, considering this relationship can enable the visual novelty of the current frame, relative to the preceding keyframe, to be taken into account.
The method may comprise: matching the extracted visual features of said image frame with the landmarks of the preceding keyframe, to identify matched visual features; calculating a first convex hull, being the two-dimensional convex hull of the matched visual features in said image frame; projecting the landmarks of the preceding keyframe into said image frame; calculating a second convex hull, being the two-dimensional convex hull of the projected landmarks in said image frame; calculating an area of overlap between the first and second convex hulls; calculating an overlap ratio, being the ratio of the area of overlap to the area of the second convex hull; and selecting the image frame to be a keyframe if the overlap ratio is below a fifth threshold.
In general, some of the projected landmarks will be outside the field of view of the current image frame. Furthermore, even for those projected landmarks that are present in the field of view, not all will be successfully matched with visual features in the current image frame. In this way, the first convex hull and the second convex hull will tend to shrink as the viewpoint changes between the current frame and the preceding keyframe, and so will the overlap between them. When the overlap ratio drops below the fifth threshold (for example, 90%), creation of a new keyframe is triggered.
The inventors have found that this criterion can perform better than other approaches. For example, alternative approaches might rely on the number of detected features in the current frame, or the convex hull of those detected features. However, these can be unreliable -a feature detector may find a significant number of "weak" features in the current image frame, which are not useful for matching/tracking and may skew any criterion based on them. In contrast, by examining only those features that match with existing landmarks, the present approach implicitly concentrates on features that are robust enough to be tracked between image frames.
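A possible implementation of the overlap-ratio test described above is sketched below; the use of the shapely library and the helper names are illustrative choices only, not part of the described method:

```python
from shapely.geometry import MultiPoint

FIFTH_THRESHOLD = 0.9  # example value of the fifth threshold (90%)

def overlap_ratio(matched_features_2d, projected_landmarks_2d):
    """Overlap ratio between the convex hull of the matched features and the convex
    hull of the previous keyframe's landmarks projected into the current frame.

    Both inputs are sequences of (x, y) pixel coordinates.
    """
    hull_matched = MultiPoint([tuple(p) for p in matched_features_2d]).convex_hull
    hull_projected = MultiPoint([tuple(p) for p in projected_landmarks_2d]).convex_hull
    if hull_projected.area == 0.0:
        return 1.0  # degenerate case: too few projected landmarks to form a polygon
    overlap_area = hull_matched.intersection(hull_projected).area
    return overlap_area / hull_projected.area

def overlap_criterion_selects(matched_features_2d, projected_landmarks_2d):
    # Trigger keyframe creation when the overlap ratio drops below the fifth threshold.
    return overlap_ratio(matched_features_2d, projected_landmarks_2d) < FIFTH_THRESHOLD
```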
Projecting the landmarks from the preceding keyframe into the current image frame may be based on an estimated pose of the mobile device associated with the respective image frames. The estimated pose may be based on the visual and inertial data. Methods of estimating the pose are summarised below.
The method may comprise: matching the extracted visual features of said image frame with the landmarks of the preceding keyframe, to identify matched visual features; estimating a pose for the image frame; identifying a set of inlier correspondences, being the set of matched visual features, and respective landmarks, that deviate from the estimated pose by less than an error threshold; and if the number of inlier correspondences is less than a sixth threshold, rejecting the image frame. It is important for each keyframe to have a reliable estimated pose associated with it. If the number of inlier correspondences is low, then there is low confidence that the estimated pose is reliable; therefore, a keyframe should not be created based on the image frame in question. The sixth threshold may be 20 inliers, for example.
Estimating the pose may comprise deriving at least one pose estimate by applying a perspective-3-point method to the matched visual features and respective landmarks. Perspective-n-point methods can allow the pose of a camera to be estimated from a set of n 3-D points (in this case, the landmarks) and their corresponding 2-D projections in the image frame (in this case, the matched visual features). In a P3P algorithm, the number, n, of points is 3.
Calculating the estimated pose may comprise applying a random sample consensus (RANSAC) algorithm. In a RANSAC algorithm, the pose can be estimated over multiple iterations, each iteration using a randomly selected group (for example, a triplet) of the matched 2-D visual features (and their corresponding 3-D landmarks) to derive a pose estimate. In each iteration, the quality of the pose estimate is checked by assessing how well that pose estimate fits the set of landmarks and matched visual features. The second pose may be generated based on the pose estimate giving the best fit. How well the pose estimate fits the landmarks and matched visual features can be assessed, for example, based on the number of landmarks and matched visual features that deviate from the pose estimate by less than a predetermined threshold error. These landmarks / matched visual features may be denoted "inlier" correspondences. The second pose may be generated by minimising the re-projection error of the inlier correspondences.
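For illustration, a central-camera approximation of this P3P+RANSAC step can be sketched with OpenCV as follows; the description above concerns a P3P variant for non-central (multi-camera) rigs, so this stand-in, its parameter values and the threshold constant are assumptions:

```python
import cv2
import numpy as np

SIXTH_THRESHOLD = 20  # example minimum number of inlier correspondences

def estimate_pose_p3p_ransac(landmarks_3d, features_2d, camera_matrix, dist_coeffs=None):
    """P3P inside a RANSAC loop, followed by refinement on the inliers (sketch only).

    landmarks_3d is (N, 3) and features_2d is (N, 2), matched row by row.
    """
    if len(landmarks_3d) < 4:
        return None  # not enough correspondences for the minimal solver
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        landmarks_3d.astype(np.float64), features_2d.astype(np.float64),
        camera_matrix, dist_coeffs,
        flags=cv2.SOLVEPNP_P3P, reprojectionError=2.0, iterationsCount=200)
    if not ok or inliers is None or len(inliers) < SIXTH_THRESHOLD:
        return None  # too few inliers: do not create a keyframe from this frame
    idx = inliers.ravel()
    # Refine the pose by minimising the reprojection error of the inlier correspondences.
    rvec, tvec = cv2.solvePnPRefineLM(
        landmarks_3d[idx].astype(np.float64), features_2d[idx].astype(np.float64),
        camera_matrix, dist_coeffs, rvec, tvec)
    return rvec, tvec, idx
```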
The extracted visual features may be multiscale features in a nonlinear scale space, optionally Accelerated KAZE features. In the AKAZE approach, feature points can be detected by identifying maxima of a determinant of a Hessian matrix in the nonlinear scale space.
The extracted visual features and/or the landmarks may be described using rotation-and scale-invariant binary descriptors, optionally M-LDB descriptors. These descriptors may be generated by estimating a dominant orientation of the visual feature, rotating the visual feature based on the dominant orientation, and extracting a binary descriptor from the rotated visual feature.
The extracted visual features may be compared and matched with the landmarks by comparing their binary descriptors.
The descriptors may be generated using a gravity direction derived from the inertial data.
Initialising a keyframe may comprise any one or combination of two or more of: storing the relevant image frame as a keyframe; storing extracted visual features of the image frame; storing the pose estimate of the image frame; and creating new landmarks associated with the keyframe.
Also provided is a computer program comprising computer program code configured to cause one or more physical computing devices to perform all the steps of a method as summarised above, when said computer program is run on the one or more physical computing devices. The computer program may be stored on a computer-readable storage medium, optionally a non-transitory computer-readable storage medium.
Also provided is a mobile device configured to initialise keyframes using a method as summarised above.
According to another aspect of the invention, there is provided a mobile device configured to initialise keyframes in a map of a 3-D environment, for visual-inertial localisation in the environment and/or mapping of the environment, according to claim 19. The mobile device may be comprised in a handheld device, a robot or a vehicle. The vehicle may be an unmanned and/or autonomous vehicle. The vehicle may be a land vehicle, an aerial vehicle, a marine vehicle or a submarine vehicle.
BRIEF DESCRIPTION OF THE DRAWINGS
The invention will now be described by way of example with reference to the accompanying drawings, in which:
Fig. 1 is a block diagram of a mobile device according to an example;
Fig. 2 is a flowchart illustrating a method for visual-inertial SLAM;
Fig. 3 is a flowchart illustrating the initial pose estimation step in the method of Fig. 2;
Fig. 4 is a flowchart illustrating the local map tracking step in the method of Fig. 2;
Fig. 5 is a flowchart illustrating a process for initialising a new keyframe;
Fig. 6 is a flowchart illustrating a sliding window estimator;
Fig. 7 is a flowchart illustrating a method for visual-inertial localisation in an existing map, according to an example;
Fig. 8 illustrates an algorithm for predicting landmarks that are likely to be visible, in the method of Fig. 7;
Fig. 9 is a flowchart illustrating an algorithm for calculating the second pose, in the method of Fig. 7;
Fig. 10 illustrates a method for selecting an image frame for keyframe creation according to an example;
Fig. 11 is a flowchart illustrating a method for comparing a current image frame with an existing keyframe, to assist in deciding whether to select the current image frame as a new keyframe;
Fig. 12 is a flowchart illustrating a method for evaluating a number of inliers;
Fig. 13 illustrates an additional criterion for selecting keyframes; and
Fig. 14 illustrates the method of Fig. 13 in greater detail.
It should be noted that these figures are diagrammatic and not drawn to scale. Relative dimensions and proportions of parts of these figures have been shown exaggerated or reduced in size, for the sake of clarity and convenience in the drawings.
DETAILED DESCRIPTION
Fig. 1 shows a block diagram of a mobile device 100 according to an example. As will be described below, the mobile device 100 is configured to operate in two modes. In a first mode, it performs simultaneous localisation and mapping (SLAM) using visual and inertial data. In this mode, it produces a map of the environment in which it is navigating. In a second mode, the mobile device performs a localisation method -not mapping the environment, but localising itself within an existing map, generated previously while it was in the first mode. Of course, it will be understood that the map need not be generated by the same device that uses the map for localisation in the second mode. It is possible that one mobile device may generate the map (operating in the first mode) and another device may consume the map, to localise itself within the mapped environment (operating in the second mode). The mobile device may be part of a handheld device, a robot, or a vehicle (such as a drone), for example.
As shown in Fig. 1, the mobile device comprises: a memory 110; two cameras, operating as a stereo pair 120; an inertial measurement unit (IMU) 130; and a microprocessor 140. The IMU includes an accelerometer 132 and a gyroscope 134. In some examples, it may further comprise a compass.
Fig. 2 illustrates a method performed by the processor of the mobile device, when the device is operating in the first mode (visual-inertial SLAM). In step 254, the processor obtains inertial data generated by the IMU 130. In step 252, the processor obtains visual data (image frames) generated by the stereo camera pair 120.
In step 300, the processor extracts visual features from the image frames captured by the cameras. This includes generating 310 binary descriptors of the visual features. In step 500, the processor performs an initial pose estimation. In step 640, local map tracking is performed. If the current frame is selected for creation of a keyframe, then the processor initialises the new keyframe in step 600.
In step 650, the processor carries out a sliding window estimation, to refine the initial pose estimate. Meanwhile, in step 690, loop closure processing is performed. The individual steps will be described in greater detail below, with reference to Figs. 3-6. Firstly, we need to introduce some notation.
Coordinate Frames and Navigation State
The system tracks a moving body with a mounted IMU and a stereo camera rig relative to a static world coordinate frame denoted as F_W. The IMU sensor coordinate frame is denoted as F_S and the stereo camera frames are represented as F_Ci with i = 1, 2. The following SLAM navigation state representation is used:

x_nav^T = [x_WS^T, v_W^T, b_g^T, b_a^T] ∈ R^3 × S^3 × R^9    (1)

where x_WS represents the location and attitude of the sensor (i.e. its pose), namely the pose of the F_S coordinate frame relative to the F_W coordinate frame. The vector v_W represents the linear velocity of the sensor in the F_W coordinate frame. Vectors b_g and b_a represent the gyro and accelerometer biases respectively. The navigation state x_nav^n is estimated at every time step n for which frames from the stereo camera rig are available.
Each pose is represented by the tuple x_WS = (t_W, q_WS), consisting of a translation vector and a unit quaternion that represents its orientation. Whenever needed, a sensor pose can be converted into its corresponding Euclidean transformation matrix as follows:

T_WS = [ R(q_WS)  t_W ; 0^T  1 ] ∈ R^{4×4}    (2)

where, for a unit quaternion q_WS, the expression R(q_WS) ∈ SO(3) is the corresponding rotation matrix.
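For reference, the conversion of Eq. (2) corresponds to the following sketch (the Hamilton (w, x, y, z) quaternion convention is an assumption here):

```python
import numpy as np

def quaternion_to_rotation(q_ws):
    """Rotation matrix R(q_WS) in SO(3) for a unit quaternion q = (w, x, y, z)."""
    w, x, y, z = q_ws
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def pose_to_transform(t_w, q_ws):
    """Homogeneous transform T_WS of Eq. (2) from the pose tuple (t_W, q_WS)."""
    T = np.eye(4)
    T[:3, :3] = quaternion_to_rotation(q_ws)
    T[:3, 3] = t_w
    return T
```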
Visual Inertial SLAM Mode
In this section, we describe the main components of our visual inertial SLAM system (when operating in the first mode). This is used to create the global 3D maps, optimised camera motion, and landmark visibility information that will be used subsequently by the localisation system (when operating in the second mode). Stereo image data and IMU readings are processed by a "frontend" algorithm, which performs feature extraction and creates 3D-2D correspondences and keyframes, which are then fed into the estimator alongside pre-integrated IMU measurements. The loop closure module processes keyframes and performs pose graph optimisation once a candidate loop closure is found.
Frontend
The system uses Accelerated KAZE (A-KAZE) to extract 300 and track features for each frame, due to its high detector repeatability. The detected feature points are described by generating 310 Modified Local Difference Binary (M-LDB) descriptors. These are found to provide good performance compared with other binary feature descriptors. (For details of the A-KAZE and M-LDB algorithms, see P. F. Alcantarilla, J. Nuevo, and A. Bartoli, "Fast explicit diffusion for accelerated features in nonlinear scale spaces," in British Machine Vision Conf. (BMVC), 2013.) Rotation-invariant descriptors are extracted by projecting the gravity direction (provided by the accelerometer in the IMU) into the image.
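A minimal sketch of A-KAZE / M-LDB extraction using OpenCV is shown below; note that OpenCV's implementation estimates a dominant orientation per keypoint rather than using the gravity direction, so the gravity-aligned variant described here would need a custom extraction step, and the detector threshold is an assumed value:

```python
import cv2

def extract_akaze_mldb(gray_image):
    """Extract A-KAZE keypoints with M-LDB binary descriptors (illustration only)."""
    akaze = cv2.AKAZE_create(descriptor_type=cv2.AKAZE_DESCRIPTOR_MLDB,
                             threshold=0.001)   # detector response threshold (tunable)
    keypoints, descriptors = akaze.detectAndCompute(gray_image, None)
    return keypoints, descriptors
```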
Initial Pose Estimation
The system is initialised when a stereo frame contains a certain number of feature correspondences that are triangulated into 3D landmarks. This first frame is added into the system as a keyframe and defines the world coordinate frame F_W. After the first frame, for incoming stereo frames and IMU data, the system aims first to find a good initial pose with respect to the world frame.
Fig. 3 depicts the different steps in the initial pose estimation process 500. First, 3D-2D putative correspondences between the last two consecutive frames are established by means of guided matching 510, using confidence regions predicted by IMU measurements. A similarity-based Hamming distance threshold is used for matching binary descriptors. If guided matching does not return a successful set of putative correspondences, brute force matching 520 with the nearest neighbour distance ratio (NNDR) is used instead. (See D. Lowe, "Distinctive image features from scale-invariant keypoints," Intl. J. of Computer Vision, vol. 60, pp. 91-110, 2004.)
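By way of illustration, brute-force NNDR matching of binary descriptors might look like the following sketch (the 0.8 ratio is an assumed value):

```python
import cv2

def match_nndr(descriptors_query, descriptors_train, ratio=0.8):
    """Brute-force matching of binary descriptors with Lowe's nearest-neighbour
    distance ratio (NNDR) test."""
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
    knn = matcher.knnMatch(descriptors_query, descriptors_train, k=2)
    good = []
    for pair in knn:
        if len(pair) == 2:
            best, second = pair
            # Keep a match only if it is clearly better than the second-best candidate.
            if best.distance < ratio * second.distance:
                good.append(best)
    return good
```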
The set of putative correspondences is fed into a RANSAC framework 530 to estimate an absolute pose by means of a P3P method for non-central cameras. (See M. Fischler and R. Bolles, "Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography," Commun. Assoc. Comp. Mach., vol. 24, pp. 381-395, 1981.) The initial pose is refined by minimising the reprojection error of the inlier correspondences. The set of inliers is added as observations into the backend. For those cases in which the system is unable to find a good initial pose, the predicted pose from pre-integrated IMU measurements is used instead. (See: C. Forster, L. Carlone, F. Dellaert, and D. Scaramuzza, "On-manifold preintegration for real-time visual-inertial odometry," IEEE Trans. Robotics, vol. 33, no. 1, pp. 1-21, 2017.)
Local Map Tracking
In the local map tracking step 640, the refined initial pose is used to match (track) additional 3D landmarks from the local map by means of guided matching 644.
The final set of putative correspondences contains putatives from the last frame to the current frame and from the local map to the current frame, to improve accuracy. RANSAC+P3P is used again to discard potential outliers, and the final pose is refined by minimising the reprojection error of the inlier correspondences. We also perform map maintenance in order to keep only trackable 3D visual landmarks in the system. Those 3D landmarks that are observed by only a small number of keyframes (fewer than 3) are removed 646 from the system.
Keyframe Creation and Initialisation
For keyframe creation 600, our approach combines different cues, such as a minimum number of detections and inliers, and the ratio between the convex hulls of the matched landmarks from the current frame and projected 3D landmarks from the last keyframe. The process of keyframe selection will be described in greater detail below.
When a keyframe is created, the relevant image frame is stored 605 as a keyframe; and the corresponding extracted 2D visual features are stored 615. New 3D landmarks are triangulated and added 625 into the system. For creating new landmarks, we find and triangulate stereo correspondences between the two images of the stereo pair and across stereo image pairs from the last keyframe and the current frame. Those 3D landmarks with large depth uncertainty after triangulation are not added into the system.
In SLAM mode, each new keyframe is added 635 to the covisibility graph. The covisibility graph is an undirected weighted graph that encodes covisibility information between keyframes. Each node in the graph represents a keyframe, whereas an edge between nodes exists only if two keyframes share a minimum number of common landmark observations.
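A simple sketch of such a covisibility graph is given below; the data structure and the minimum number of shared observations are assumptions for illustration:

```python
from collections import defaultdict

MIN_SHARED_LANDMARKS = 15  # assumed minimum number of common observations for an edge

class CovisibilityGraph:
    """Undirected weighted graph: nodes are keyframes, edge weights are the number
    of landmarks two keyframes observe in common (edge only if above a minimum)."""

    def __init__(self, min_shared=MIN_SHARED_LANDMARKS):
        self.min_shared = min_shared
        self.landmarks_of = {}              # keyframe id -> set of landmark ids
        self.edges = defaultdict(dict)      # keyframe id -> {neighbour id: weight}

    def add_keyframe(self, kf_id, landmark_ids):
        observed = set(landmark_ids)
        for other_id, other_landmarks in self.landmarks_of.items():
            shared = len(observed & other_landmarks)
            if shared >= self.min_shared:
                self.edges[kf_id][other_id] = shared
                self.edges[other_id][kf_id] = shared
        self.landmarks_of[kf_id] = observed

    def neighbours(self, kf_id):
        return self.edges[kf_id]
```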
For each new keyframe, we compute 610 its Bag of Words (BoW) encoding by using DBoW3 and a pre-built visual vocabulary. (See D. Galvez-Lopez and J. D. Tardos, "Bags of binary words for fast place recognition in image sequences," IEEE Trans. Robotics, vol. 28, no. 5, pp. 1188-1197, 2012.) The new keyframe is added 630 to the sliding window estimator (further details below).
Sliding Window Visual-Inertial Estimator
Our visual inertial estimator 650 combines reprojection and IMU errors. We formulate the multi-sensor fusion SLAM problem as a factor graph, in which each factor encodes the connectivity between the unknown variable nodes and the sensor measurements. (Factor graphs are described in: F. R. Kschischang, B. J. Frey, and H.-A. Loeliger, "Factor graphs and the sum-product algorithm," IEEE Trans. Inform. Theory, vol. 47, no. 2, pp. 498-519, 2001.) The objective is to estimate a set of navigation states X* = {x_nav^n}, n = 1, ..., N_states, one for each time step n, and additionally to estimate a collection of visual landmarks Y* = {Y_j}, j = 1, ..., M_landmarks, given a set of sensor measurements Z that includes visual and IMU measurements. Here, each landmark Y_j ∈ R^3 is expressed in the world coordinate frame; N_states and M_landmarks represent the total number of states and landmarks.
The joint probability distribution of the navigation states and landmarks Θ = {X*, Y*}, given the measurements Z, can be factorised as the product of the contribution of each individual factor in the graph:

P(X, Y; Z) ∝ ∏_i f_i(θ_i)    (3)

where θ_i represents a subset of the variable nodes. Each factor f_i represents an error function that connects a set of variables and measurements. Our goal is to estimate the navigation states X* and visual landmarks Y* that maximise the factorisation in Eq. 3.
We assume a Gaussian measurement model for the factors in the graph, of the form:

f_i(θ_i) ∝ exp( -(1/2) ||z_i - h_i(θ_i)||²_{Σ_i} )    (4)

The factor graph in Eq. 3 can be estimated via maximum a posteriori (MAP) estimation, which is equivalent to solving the following non-linear least squares problem that comprises a sum of different error terms:

argmin_Θ ( -log f(Θ) ) = argmin_Θ (1/2) Σ_i ||z_i - h_i(θ_i)||²_{Σ_i}    (5)

where h_i(θ_i) is a measurement prediction function, z_i is a sensor measurement, and || · ||_{Σ_i} is the Mahalanobis distance induced by the measurement covariance Σ_i. Let us discuss in more detail the error terms for each sensor.
Reprojection Error
This error measures the reprojection error of the j-th landmark Y_j into the i-th camera of the stereo rig at time step k, with respect to the corresponding image measurement z^{i,j,k} ∈ R^2:

e_rep^{i,j,k} = || z^{i,j,k} - π_i( T_{SC_i}^{-1} T_{WS_k}^{-1} Y_j ) ||    (6)

where π_i(·) is the camera projection function. In order to deal with outliers and spurious measurements, we use the Cauchy loss function as a robust kernel.
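As an illustration of Eq. (6) combined with the Cauchy robust kernel, a single robustly weighted reprojection residual might be computed as follows (a sketch under the stated notation; the kernel constant c is an assumed value):

```python
import numpy as np

def cauchy_weight(squared_error, c=1.0):
    """Cauchy robust kernel weight; c is an assumed tuning constant."""
    return 1.0 / (1.0 + squared_error / (c * c))

def reprojection_residual(landmark_w, T_WS, T_SC, K, measurement_px):
    """Residual of Eq. (6) for one landmark/observation pair.

    T_WS and T_SC are 4x4 homogeneous transforms as in the notation above, K is the
    3x3 intrinsic matrix, landmark_w a 3-vector in the world frame, and
    measurement_px the observed 2D feature location.
    """
    p_w = np.append(landmark_w, 1.0)
    p_c = np.linalg.inv(T_SC) @ np.linalg.inv(T_WS) @ p_w   # landmark in the camera frame
    uvw = K @ p_c[:3]
    projection = uvw[:2] / uvw[2]
    residual = measurement_px - projection
    w = cauchy_weight(residual @ residual)
    return np.sqrt(w) * residual  # robustly weighted residual for the least squares solver
```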
IMU Error
The IMU error measures the error between navigation parameters (pose, speed and biases) from two different time steps k and r, with r > k, given the IMU measurements z_S^{k,r} (gyro and accelerometer readings) between both navigation states.
This factor uses pre-integrated IMU measurements:
e_imu^{k,r} = || x_nav^r ⊖ x̂_nav(x_nav^k, z_S^{k,r}) ||²_{Σ_{k,r}}    (7)

where x̂_nav(x_nav^k, z_S^{k,r}) denotes the predicted navigation state at time step r based on a previous state estimate x_nav^k and the IMU readings. The operator ⊖ calculates the difference in the tangent space by applying the log map. (Further details of how to pre-integrate IMU measurements can be found in: T. Lupton and S. Sukkarieh, "Visual-inertial-aided navigation for high dynamic motion in built environments without initial conditions," IEEE Trans. Robotics, vol. 28, no. 1, pp. 61-76, 2012; and C. Forster, L. Carlone, F. Dellaert, and D. Scaramuzza, "On-manifold preintegration for real-time visual-inertial odometry," IEEE Trans. Robotics, vol. 33, no. 1, pp. 1-21, 2017.) The calculation of the errors is represented in Fig. 6 by step 670. Eq. 5 represents a bundle-adjustment problem. We solve 675 this non-linear least squares problem by means of Levenberg-Marquardt and a Schur-complement linear solver with Jacobi preconditioning. (See B. Triggs, P. F. McLauchlan, R. I. Hartley, and A. W. Fitzgibbon, "Bundle Adjustment - A Modern Synthesis," in International Workshop on Vision Algorithms: Theory and Practice, 1999.) We use a sliding window with N keyframes and M frames (in the present example, we use M = 5 and N = 3) and also perform marginalisation of variables. When states need to be marginalised 680 from the sliding window, all the error terms related to the variables to be marginalised are transformed into one linear error term that is added to the optimisation problem. In addition to reprojection and IMU errors, we use a prior pose error for the first pose of the sliding window in order to fix the coordinate frame of the reconstruction.
The sliding window estimator outputs 685 a refined pose estimate, calculated by solving the bundle adjustment problem.
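The Schur-complement elimination mentioned above can be illustrated with a dense toy solver; a real implementation exploits the block-diagonal structure of the landmark block and uses Jacobi-preconditioned iterative solves, so the function below is only a sketch:

```python
import numpy as np

def solve_schur(H_pp, H_pl, H_ll, b_p, b_l):
    """Solve the bundle-adjustment normal equations
        [H_pp  H_pl ] [dx_p]   [b_p]
        [H_pl' H_ll ] [dx_l] = [b_l]
    by eliminating the landmark block with the Schur complement (dense toy version).
    """
    H_ll_inv = np.linalg.inv(H_ll)
    S = H_pp - H_pl @ H_ll_inv @ H_pl.T           # Schur complement (reduced state system)
    r = b_p - H_pl @ H_ll_inv @ b_l
    dx_p = np.linalg.solve(S, r)                  # navigation-state update
    dx_l = H_ll_inv @ (b_l - H_pl.T @ dx_p)       # back-substituted landmark update
    return dx_p, dx_l
```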
Loop Closure Detection and Correction
Our loop closure detection routine 690 runs on a separate thread from the visual inertial SLAM system. When a new keyframe has been added into the system, the loop closure detection module aims to find loop closure candidates from among the previous keyframes in the covisibility graph. Loop closure allows the mobile device to determine when it has returned to a part of the environment that has already been mapped. We use covisibility information to incorporate spatial and temporal checks for extra robustness when accepting query keyframes as potential loop closure candidates. Valid loop closure candidates go through a final geometric verification check, which comprises: finding putative 3D-2D correspondences between the query keyframe and the candidate keyframe; and estimating an absolute pose by means of RANSAC. Those candidates that show a large number of inliers and a high inlier ratio in the RANSAC algorithm are considered in the loop closure correction module.
The loop closure correction module performs pose graph optimisation for the keyframes in the covisibility graph by using, as measurements, relative pose transformations between nodes in the factor graph, along with the relative transformations from the loop closure candidates and corresponding covariance matrices. The pose graph optimisation aims to estimate a factor graph via MAP estimation, considering that the variables are poses and the measurements are relative pose transformations. The error term between two poses is as follows:

e_between^{k,r} = || z_S^{k,r} ⊖ ( x_WS_k^{-1} · x_WS_r ) ||²_{Σ_rel}    (8)

where z_S^{k,r} is the relative pose measurement between poses x_WS_k and x_WS_r.
If the pose graph optimisation problem converges, the keyframes in the covisibility graph are updated with the new corrections alongside the corrected 3D landmarks anchored to their corresponding keyframes. Additionally, any duplicated map points after the loop closure correction are merged into single points.
Global Bundle Adjustment
The steps performed by the frontend - namely, steps 300, 310, 500, and 640 - as well as the steps performed by the backend 650 and the loop closure processing 690 are run iteratively for each new image frame captured by the cameras. In this way, a map is accumulated, containing keyframes, their estimated poses, and their 3D landmarks.
At the end of the mapping session, we perform offline global bundle adjustment, minimising the reprojection errors of all the 3D landmarks visible in the keyframes from
the covisibility graph. In our global bundle adjustment cost we also add the loop closure constraints. We also remove landmarks that have large reprojection errors or poor visibility -in particular, any landmarks being observed in fewer than 2 keyframes are removed.
Learning Visibility of Landmarks
Once the visual-inertial SLAM system finishes processing a sequence (that is, finishes a mapping session), we use the set of optimised keyframe poses and 3D landmarks Θ* = {X*, Y*} for learning how to predict the visibility of known 3D points with respect to a query camera pose in the large-scale environment. Note that all the data needed for this purpose is already present in the covisibility graph. The visibility of each 3D landmark with respect to a camera pose is modelled by means of a memory-based learning approach, in which a distance metric between poses is learned in an entirely non-parametric way. (For reference, some related work can be found in: P. F. Alcantarilla, K. Ni, L. M. Bergasa, and F. Dellaert, "Visibility learning for large-scale urban environment," in IEEE Intl. Conf. on Robotics and Automation (ICRA), 2011, pp. 6205-6212.) In the visibility prediction problem, we are interested in the posterior distribution of a certain 3D landmark Y_j being visible given a query camera pose x, denoted as P(v_j = 1 | x). For this purpose, we apply the principles of a memory-based learning approach known as Locally Weighted Learning. (See C. Atkeson, A. Moore, and S. Schaal, "Locally weighted learning," AI Review, vol. 11, pp. 11-73, 1997.) Given training data (keyframe poses, 3D landmarks and visibility information) and a query pose, we form a locally weighted average at the query pose and take that as an estimate of the visibility posterior, as follows:

P(v_j = 1 | x) = Σ_{i=1}^{n_j} k(x, x_i) / Σ_{i=1}^{N} k(x, x_i)    (9)

where {x_i}_{i=1}^{n_j} are the camera poses from the training data in which the 3D point Y_j is visible, and the denominator sums over all N training poses. The function k(x, x_i) is a weighting function or kernel function that is used to calculate a similarity score between camera poses.
The kernel function that we use combines the Mahalanobis distance and the Gaussian kernel, and it applies to any distance-based kernel function with differentiable dependencies on parameters specifying the distance function. The kernel function to measure the similarity between two camera poses is as follows:

k(x_i, x_j) = exp( -||A (x_i - x_j)||² )    (10)

where the matrix A is the square-root information matrix and hence also has full rank. A very high similarity score (approaching a value of 1) suggests that two camera poses will observe the same set of landmarks, while a low score (approaching 0) means that the two cameras are unlikely to share any landmarks in common. The vectors x_i and x_j encode information about the camera pose and its appearance information. In particular, we use three cues in our kernel function: differences in camera translation, viewing direction, and appearance by means of BoW encoding. In greater detail, the cues comprise:
* Translation: measures the difference between the translations of the two cameras in F_W.
* Orientation: measures orientation differences by using the normalised inner product between the viewing directions of the two cameras - that is, 1 - ⟨vd_i, vd_j⟩ / (|vd_i| |vd_j|), where vd_i is the camera viewing direction, extracted from the pose quaternion.
* Appearance: measures an L1 similarity score between two BoW vectors, namely 1 - (1/2) || vbow_i / |vbow_i| - vbow_j / |vbow_j| ||_1. The BoW vectors vbow_i and vbow_j are computed from the raw binary descriptors and the visual vocabulary.
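A sketch of the kernel of Eq. (10) built from these three cues is shown below; how the cue values are stacked into the difference vector, and whether the appearance cue enters as a similarity or a dissimilarity, are assumptions of the sketch:

```python
import numpy as np

def cue_vector(t_i, t_j, vd_i, vd_j, vbow_i, vbow_j):
    """Three cues used by the kernel of Eq. (10); the stacking is an assumption."""
    d_trans = np.linalg.norm(t_i - t_j)                     # translation difference
    d_orient = 1.0 - np.dot(vd_i, vd_j) / (np.linalg.norm(vd_i) * np.linalg.norm(vd_j))
    v_i = vbow_i / np.sum(np.abs(vbow_i))                   # L1-normalised BoW vectors
    v_j = vbow_j / np.sum(np.abs(vbow_j))
    s_appear = 1.0 - 0.5 * np.sum(np.abs(v_i - v_j))        # BoW similarity score
    return np.array([d_trans, d_orient, 1.0 - s_appear])    # appearance used as a dissimilarity

def kernel_similarity(d, A):
    """k = exp(-||A d||^2), Eq. (10); A is the learned square-root information matrix."""
    Ad = A @ d
    return float(np.exp(-Ad @ Ad))
```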
Once the global bundle adjustment has converged after a mapping session (as discussed above), the processor 140 learns an appropriate kernel function by minimising a loss function between the predicted values from the kernel function and the target values y_ij:

L_visibility = Σ_{i,j} ( k(x_i, x_j) - y_ij )²    (11)

The learning algorithm uses the map contents as training data. As target values y_ij, we use the mean of the ratios between the intersection of the common 3D landmarks relative to the number of 3D points visible to each of the two cameras:

y_ij = (1/2) ( |Y_i ∩ Y_j| / |Y_i| + |Y_i ∩ Y_j| / |Y_j| )    (12)

where |Y_i| denotes the number of 3D landmarks observed from camera pose x_i. We minimise this non-linear least squares function by using the Levenberg-Marquardt algorithm. In order to reduce the number of parameters in the estimation, we model A as a symmetric matrix.
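The target values of Eq. (12) might be computed as follows (a sketch; landmark observations are represented here as sets of landmark identifiers):

```python
def visibility_target(landmarks_i, landmarks_j):
    """Target value y_ij of Eq. (12): mean of the two ratios between the number of
    commonly observed landmarks and the number observed by each camera."""
    set_i, set_j = set(landmarks_i), set(landmarks_j)
    if not set_i or not set_j:
        return 0.0
    common = len(set_i & set_j)
    return 0.5 * (common / len(set_i) + common / len(set_j))
```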
A key observation is that the results of the visibility prediction in Eq. 9 will be mostly zero for many of the 3D landmarks in a large map, since most of the landmarks would not be observed at all by the K-Nearest Neighbors (KNNs) of the query pose. As a consequence, rather than predicting the visibility individually for each landmark (which could be prohibitively expensive in large-scale maps), we find the KNNs of the current query pose, and then predict the visibility for the subset of landmarks seen at least once by these KNNs. Then, we can set the visibility to zero for the rest of the landmarks without computing them at all. Finally, the locally weighted K-nearest-neighbour approximation for the visibility posterior is:

P(v_j = 1 | x) = Σ_{x_i ∈ N_K(x) : v_ij = 1} k(x, x_i) / Σ_{x_i ∈ N_K(x)} k(x, x_i)    (13)

where only the nearest K samples N_K(x) of the query pose are considered in the average.
Visual Inertial Localisation System
The operation of the system in the second mode (visual inertial localisation) is based on the first mode (SLAM mode) described above. However, the key difference now is that the 3D landmarks in the existing map, generated in the previous session, are kept constant. More formally, we want to estimate the joint probability distribution of the navigation states and a new set of landmarks for the localisation run, Θ_loc = {X, Y}, given the landmarks from the existing map Y_map and the measurements Z:

P(X, Y | Y_map, Z) ∝ ∏_i f_i(θ_i)    (14)

We use the same error terms as in the first mode (described above). However, the localisation system introduces several differences - for example, when bootstrapping the system and also in the frontend and backend.
Fig. 7 illustrates a localisation method performed in the second mode, according to an example. In step 750, the processor 140 obtains the existing map, from the previous SLAM session. The map may be stored in the memory 110. In step 254, the processor obtains inertial data from the IMU 130. In step 252, the processor obtains visual data (image frames) from the camera pair 120.
In step 300, the processor extracts visual features from the image frames (again using A-KAZE, in this example), and generates (in step 310) binary descriptors (again based on M-LDB descriptors, in this example). As before, rotation-invariant descriptors are extracted, using gravity estimation from the IMU measurements. The raw descriptors are converted into the Bag of Words (BoW) representation.
Bootstrapping and Relocalisation
When the new session starts (in localisation-only mode), the system does not initialise until a successful relocalisation with respect to the previously built global map is found. The relocalisation involves localising 400 a first pose of a first image frame. To do this, descriptors generated in step 310 are compared (step 450) with the keyframes in the covisibility graph, using the BoW representation. For those candidates with a high similarity score, a further geometric check based on P3P+RANSAC is performed to estimate the initial pose in F_W, in order to initialise the localisation system. In this way, the first pose of the first image frame is localised purely using visual data. Note that the "first image frame" is not necessarily the very first frame in the sequence from the cameras 120. The frames from the camera are processed until a pose can be relocalised.
The relocalisation process is triggered whenever tracking is lost. Even though the localisation system is able to produce accurate motion estimates in the surroundings of a previously built map, the tracking can inevitably sometimes be lost. This may be caused by several factors - for example, kidnapping, motion blur, or textureless scenes where feature extraction fails. When tracking is lost, the system enters the initialisation phase until a good relocalisation with respect to the previously built map is found.
Optionally, if the mobile device 100 comprises a satellite positioning receiver, a position estimate derived using satellite positioning signals may be used to assist the relocalisation.
Pose estimation for subsequent frames after relocalisation
Once the system is relocalised with respect to the global map, we use the visibility-prediction framework described above to compute a set of putative 3D-2D correspondences between the global map and the current frame. This means that, for a second frame, subsequent to the first frame mentioned above, the localisation is performed using a combination of visual and inertial data.
In step 810, the processor 140 calculates a predicted pose for the second image frame. This is done by extrapolating from the pose of a previous image frame - for example, the first image frame - using IMU measurements.
Given the predicted pose from the IMU measurements, x_WS, the corresponding poses for each stereo camera, x_WCi, are computed by using the known extrinsic parameters between the IMU 130 and the cameras 120.
Next, in step 830, the processor uses the visibility classifier (described above) to predict the likely visible 3D landmarks for each camera pose x_WCi. To avoid matching landmarks that have a very small value for the posterior probability in Eq. 13, we use a visibility probability threshold to filter out landmarks with low visibility scores. Optionally, if a position fix computed by satellite positioning is available, this can be used to assist the prediction of visible landmarks. Landmarks far from the position fix are unlikely to be visible in the current frame. This can be used to constrain the visibility prediction, potentially reducing the computational effort.
Figure 8 illustrates the visibility prediction 830 of Eq. (13) graphically. As described above, the kernel function incorporates a translation cue 832, an orientation cue 833, and an appearance cue 834. In step 836, the processor 140 calculates the similarity k(x, xi) between the IMU-based predicted pose and each of the K nearest poses among the keyframes. In step 837, the processor combines the similarity values, to predict the visibility of each landmark. This comprises, for each landmark, summing the similarity values of the set of keyframes containing that landmark. This is the summation in the numerator of Eq. (13). This is normalized by the sum of the kernel-similarity values of all of the K nearest neighbors -that is, the denominator in Eq. (13).
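Steps 836 and 837 might be sketched as follows; the number of neighbours K, the visibility threshold and the data layout are assumptions of the sketch:

```python
import numpy as np

def predict_visible_landmarks(query_cue_vectors, kf_landmark_sets, A, k_nn=10,
                              visibility_threshold=0.5):
    """Locally weighted KNN visibility prediction as in Eq. (13).

    query_cue_vectors[i] is the cue difference vector between the query pose and
    keyframe i; kf_landmark_sets[i] is the set of landmark ids observed by keyframe i.
    """
    # Step 836: kernel similarity between the predicted pose and each keyframe pose.
    weights = np.array([np.exp(-(A @ d) @ (A @ d)) for d in query_cue_vectors])
    nearest = np.argsort(-weights)[:k_nn]          # K most similar keyframes
    denom = float(np.sum(weights[nearest]))
    if denom == 0.0:
        return set()
    # Step 837: sum similarities per landmark over the keyframes that observe it,
    # normalised by the sum over all K nearest neighbours (Eq. 13); landmarks not
    # seen by any of the KNNs implicitly receive zero visibility.
    scores = {}
    for i in nearest:
        for lm in kf_landmark_sets[i]:
            scores[lm] = scores.get(lm, 0.0) + weights[i]
    return {lm for lm, s in scores.items() if s / denom >= visibility_threshold}
```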
Figs. 7 and 9 together show the steps leading to the calculation of a second pose for the second image frame, in the localisation system.
In step 840, the visual features extracted in step 300 are matched with the landmarks that were predicted (in step 830) to be visible. In the present example, the predicted 3D landmarks are used to find 3D-2D correspondences using brute force matching. In other examples, guided matching could instead be used.
In step 850, the matched visual features and landmarks are used to calculate the second pose for the second image frame. This is illustrated in greater detail in Fig. 9. If the matching in step 840 was successful, the set of putative correspondences from the existing map is fed into a RANSAC framework (step 530) to estimate an absolute pose by means of a P3P method for non-central cameras. The initial estimate of the second pose is improved by minimising the reprojection error of the inlier correspondences.
For those cases in which the matching in step 840 is unsuccessful, the system falls back on the visual-inertial SLAM matching process 500, described above in the context of the first mode, in the section "Initial Pose Estimation". In this way, the mobile device 100 is able to work in new areas that were not captured in the previously built map. The matching process 500 proceeds by finding putative correspondences between two consecutive frames and estimating an absolute pose in a RANSAC-based framework (step 530). If it is not possible to find good correspondences, the propagated pose from the IMU is used as the second pose (step 540).
Local map tracking
Similarly to the process described above, illustrated in Fig. 4, local map tracking is performed. The inlier correspondences from the matching are added as observations into the backend, and the refined initial pose is used to match (track) additional landmarks from the local map (step 644). The main difference from the map tracking in the first mode (SLAM mode) is that, for those error terms that contain observations from global map landmarks, the 3D landmarks are kept constant in the optimisation. In other words, only new landmarks from the current (localisation) session are adjusted - the "old" landmarks from the existing map are left unchanged. Similarly, landmarks are removed 646 if they are only visible in a small number of keyframes. This also applies only to landmarks from the current (localisation) session - landmarks from the existing map are left unchanged.
Keyframe Creation and Initialisation
The localisation mode uses a similar keyframe creation and initialisation strategy 600 as described above with reference to Fig. 5. In localisation-only mode, new keyframes from the current session are not added as part of the covisibility graph.
However, they are used as keyframes in the sliding window estimator.
Sliding Window Visual-Inertial Estimator for Localisation
The second pose calculated in step 850 is refined in step 650. As described above, in the present example, this is achieved using bundle adjustment in a sliding window. As above, we keep only N keyframes and M frames, and perform marginalisation 680 of variables. When states need to be marginalised from the sliding window, all the error terms related to the variables to be marginalised are transformed into one linear error term that is added to the optimisation problem. For variables corresponding to the current (localisation-only) session, once they are marginalised they are completely removed from the system.
The method of Fig. 7 enables IMU measurements and visual data to be integrated to localise the mobile device 100 within an existing map. The method is suitable for real-time implementation on board the mobile device. This is achieved by seeking to focus the computational effort efficiently at each stage. For example, the use of the sliding window avoids the need to optimise over the full set of keyframes and frames. Likewise, the use of the predicted pose from the IMU helps to better target the computational effort in the matching step 840, by means of the visibility prediction in step 830.
Keyframe selection
In both the first mode (SLAM mode) and the second mode (localisation mode) described above, the system selected new keyframes for initialisation 600 as a pose was calculated for each image frame. The criteria used in the selection process will now be described. The same criteria may be used in both modes.
Keyframe creation is an important step for building the map and also for guaranteeing good tracking quality throughout the whole environment. Keyframes refer to particular frames of interest within the sequence, which can be used to describe the mapped environment. In this way, keyframes also help to make the estimation problem tractable and bounded, since processing each single state in a bundle adjustment problem tends to be computationally expensive and may have large memory requirements.
Fig. 10 is a flowchart illustrating a method of selecting frames to be keyframes, according to an example. This method can be applied to each current image frame, once a pose estimate is available. In particular, although not shown in Fig. 2 or Fig. 7, the method is implemented just before the step 600 of initialising a keyframe. In other words, image frames are first selected 550 to be keyframes, and then 600 initialised as keyframes. Consequently, it will be understood that, by the time the keyframe selection method is executed, the processor 140 of the mobile device 100 has already obtained 252 the visual data and obtained 254 the inertial data. It has also extracted 300 the visual features from the current image frame; matched 510 the extracted visual features with landmarks of at least one existing keyframe; and estimated 530 the pose of the current frame. (If matching fails completely, and the predicted pose is based solely on inertial data, there is no keyframe selection or initialisation.) The selecting 550 of an image frame to be a keyframe is based on the evaluation of a number of criteria, in a particular sequence. It will be understood that the sequence illustrated in Fig. 10 is exemplary and other sequences (and some alternative or additional criteria) may be used instead. Nevertheless, the set of criteria and sequence of evaluation illustrated in Fig. 10 have been found to produce good results.
The method begins with step 555. In step 555, the processor detects whether there has been a saturation event in the inertial data recently. In particular, if there has been an IMU saturation event (accelerometer or gyroscope), the processor waits at least 15 frames before initialising any new keyframes into the system. In this way, the system avoids inserting keyframes very close to IMU saturation events, where the tracking may be unstable due to motion blur. In other words, image frames captured during a first time interval after a saturation event are rejected 590 by the keyframe selection process. In this example, the first time interval is defined as 15 frames.
If there have been no IMU saturation events, or if a time period longer than the first time interval has elapsed since the most recent IMU saturation event, then the image frame is not rejected, and the processor proceeds to step 565. In this step, the processor checks the number of visual features that were extracted from the current image frame in step 300. A frame needs to have a minimum number of 2D detections to be considered suitable as a keyframe. If the number of visual features extracted is below a threshold number, the processor rejects 590 the image frame. (This threshold number will be referred to as the "fourth threshold".) In the present example, it is set to 20 detected feature points.
If the number of extracted features is sufficient, then the image frame is not rejected and the processor proceeds to step 570. In this step, the processor evaluates whether there are enough keyframes for the sliding window estimator 650. The sliding window is defined in part by having a set number of keyframes (namely, 3 keyframes, in the examples above). However, immediately after initialisation of the system, there will be no keyframes. In step 570, the current image frame is selected 595 for initialisation as a keyframe as long as there are fewer than the required number of keyframes in the sliding window. (This required number will be referred to as the "first threshold".) If there is already a sufficient number of keyframes in the sliding window, then the current frame is not (yet) selected as a keyframe, and the processor proceeds to step 575. In step 575, the processor evaluates the time difference between the current image frame and the previous keyframe. If this is larger than a second time interval, then the current image frame is selected 595 to be a keyframe. This promotes better tracking stability, since the covariance of pre-integrated IMU measurements will grow with longer time differences between consecutive keyframes. In the present example, the second time interval is set to 3 seconds.
If the time since the preceding keyframe is less than the second time interval, then the current image frame is not (yet) selected as a keyframe, and the processor proceeds to step 580. In step 580, the processor compares extracted visual features of the current image frame with the landmarks of the preceding keyframe. This is illustrated in greater detail in Fig. 11.
The visual features extracted from the current image frame will already have been matched with the landmarks of at least one previous keyframe (for example, in step 510 of Fig. 3). In step 581, the processor calculates the 2-D convex hull of the matched visual features in the current image frame - in other words, it calculates the convex hull of those visual features that were extracted from the current image frame and have been matched with respective landmarks in the previous keyframe. Extracted visual features that did not match with landmarks are ignored. In step 582, the processor projects the landmarks of the previous keyframe into the current image frame. This is done based on the relationship between the pose estimate for the current image frame and the pose of the keyframe. Next, in step 583, the processor calculates the 2-D convex hull of the projected landmarks in the current image frame. Then, in step 584, the processor calculates the area of overlap between the two convex hulls, and the overlap ratio. The overlap ratio is defined as the area of overlap divided by the area of the convex hull of the projected landmarks. In step 580, the processor checks whether the overlap ratio is below a threshold (for example, 90%). If so, the current image frame is selected 595 to become a keyframe. If not, the method proceeds to step 585. Evaluating the overlap ratio in this way can allow the system to cope robustly with changes in viewpoint, while avoiding creating unnecessary additional keyframes. (The threshold against which the overlap ratio is evaluated will be referred to as the "fifth threshold".)
In step 585, the processor determines whether there were enough inliers after pose estimation for the current image frame. This is illustrated in greater detail in Fig. 12. As noted already, matching has already been carried out between the visual features of the current image frame and the landmarks of at least one previous keyframe (for example, in step 510). Furthermore, a pose estimate is already available for the current image frame (for example, from step 530). In step 586, the processor identifies the set of inliers following the pose estimation process - that is, the set of matched feature points that are consistent with the pose estimate. The "consistency" with the pose estimate may be determined by calculating how far each matched extracted feature deviates from its position as predicted by the pose estimate. Features that deviate by less than an error threshold are deemed to be inliers. In step 585, the processor determines whether the number of inliers is less than a threshold - in this example, 21. If so, the image frame is rejected 590. Otherwise (for example, if there are 21 or more inliers), the image frame is selected 595 to be a keyframe. (The threshold on the minimum number of inliers will be referred to as the "sixth threshold".)
Fig. 13 shows an additional criterion that may be useful for selecting keyframes. In step 560, the processor evaluates whether a measure of uncertainty for the IMU measurements is above a threshold. If so, the processor selects 595 the current image frame to be a keyframe. In the present example, the measure of uncertainty comprises an error covariance of the inertial data. Inertial data tends to be unreliable over long time periods, because IMU errors are integrated over time. When the error covariance grows too large, it indicates that it is time to create a new keyframe - effectively recalibrating the IMU measurements.
If, in step 560, the measure of uncertainty for the IMU measurements is not above the threshold, then there is no urgent need to create a new keyframe. The processor may continue to evaluate other criteria, or it may determine that the current image frame is to be rejected 590 as a keyframe.
Fig. 14 illustrates one exemplary implementation of Fig. 13, in greater detail. In step 561, the processor calculates a measure of uncertainty for the rotational inertial data - for example, for the inertial measurements from the gyroscope. In step 560, the processor compares the rotational uncertainty measure against a rotational uncertainty threshold. If the rotational uncertainty measure is above the rotational uncertainty threshold, the processor selects 595 the current image frame as a keyframe. Otherwise, the processor proceeds to calculate a measure of uncertainty for the acceleration data, in step 563. In step 564, the processor compares this measure against an acceleration uncertainty threshold. If the measure is above the threshold, the processor selects 595 the current image frame as a keyframe. Otherwise, the processor can either continue to evaluate other criteria, or it can reject 590 the current image frame as a keyframe.
The rotational uncertainty threshold is an example of what will be referred to as the "second threshold"; the acceleration uncertainty threshold will be referred to as the "third threshold".
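To make the two-stage test of Fig. 14 concrete, a minimal sketch is given below. The use of the trace of the propagated error covariance as the scalar uncertainty measure, the function name and the example threshold values are illustrative assumptions only; the embodiments do not prescribe a particular implementation.

```python
import numpy as np


def keyframe_needed_by_uncertainty(gyro_cov, accel_cov,
                                   second_threshold=1e-3,   # rotational threshold (assumed units)
                                   third_threshold=5e-2):   # acceleration threshold (assumed units)
    """Two-stage IMU-uncertainty test of Fig. 14 (steps 561, 560, 563, 564)."""
    rotational_uncertainty = float(np.trace(np.asarray(gyro_cov)))      # step 561
    if rotational_uncertainty > second_threshold:                       # step 560
        return True                                                     # select (595)
    acceleration_uncertainty = float(np.trace(np.asarray(accel_cov)))   # step 563
    return acceleration_uncertainty > third_threshold                   # step 564
```

If the function returns False, other criteria may still select the frame, mirroring the "reject or continue" branch of Fig. 14.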
The criteria of Figs. 13-14 can be used in addition to the criteria of Fig. 10, or in place of any of those criteria. In particular, it may make sense to replace the criterion of step 575 with the criterion of step 560/564. This is because the length of time since the last keyframe, in step 575, effectively acts as a proxy estimate for the uncertainty of the IMU measurements. By examining one or more actual measures of uncertainty of the inertial data, the system can distinguish between situations in which the uncertainty grows more slowly and those in which it grows more quickly (rather than assuming that it grows at a constant rate, which is effectively the underlying assumption in step 575).
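The visual checks of Figs. 11 and 12 discussed above can be summarised in a similar sketch. Here a simple pinhole projection is assumed and the convex hulls and their overlap are computed with the shapely library; the helper names, the use of shapely and the default threshold values are assumptions for illustration, not the disclosed implementation.

```python
import numpy as np
from shapely.geometry import MultiPoint


def project_landmarks(landmarks_w, R_cw, t_cw, K):
    """Project 3-D landmarks (world frame, (N, 3)) into the current image (step 582)."""
    p_cam = (R_cw @ landmarks_w.T + t_cw.reshape(3, 1)).T    # world -> camera frame
    p_hom = (K @ p_cam.T).T                                  # pinhole projection
    return p_hom[:, :2] / p_hom[:, 2:3]                      # pixel coordinates, (N, 2)


def overlap_ratio(matched_features_2d, projected_landmarks_2d):
    """Steps 581, 583, 584: area of overlap / area of the projected-landmark hull."""
    hull_features = MultiPoint([tuple(p) for p in matched_features_2d]).convex_hull
    hull_projected = MultiPoint([tuple(p) for p in projected_landmarks_2d]).convex_hull
    if hull_projected.area == 0.0:
        return 0.0
    return hull_features.intersection(hull_projected).area / hull_projected.area


def count_inliers(matched_features_2d, predicted_2d, error_threshold_px=3.0):
    """Step 586: matches whose deviation from the pose-predicted position is small."""
    errors = np.linalg.norm(matched_features_2d - predicted_2d, axis=1)
    return int(np.sum(errors < error_threshold_px))


def visual_keyframe_decision(matched_features_2d, landmarks_w, R_cw, t_cw, K,
                             fifth_threshold=0.9, sixth_threshold=21):
    """Combined decision of Figs. 11-12: True = select (595), False = reject (590).

    matched_features_2d[i] is assumed to correspond to landmarks_w[i].
    """
    projected = project_landmarks(landmarks_w, R_cw, t_cw, K)
    if overlap_ratio(matched_features_2d, projected) < fifth_threshold:   # step 580
        return True
    # Step 585: enough matches must be consistent with the pose estimate.
    return count_inliers(matched_features_2d, projected) >= sixth_threshold
```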
The memory 110 may store one or more computer programs (or software or code) and/or data. The computer programs may include an operating system for the processor 140 to execute in order for the mobile device 100 to function. The computer programs may include computer programs according to embodiments of the invention or computer programs that, when executed by the processor 140, cause the processor 140 to carry out a method according to an embodiment of the invention. The computer programs may be stored in a non-transitory computer-readable storage medium as well as, or instead of, the memory 110.
The processor 140 may be any data processing unit suitable for executing one or more computer readable program instructions, such as those belonging to computer programs stored in the computer-readable storage medium and/or the memory 110. The processor 140 may comprise a single data processing unit or multiple data processing units operating in parallel or in cooperation with each other. The processor 140 may, as part of the execution of one or more computer readable program instructions, store data to and/or read data from the computer-readable storage medium and/or the memory 110.
It should be noted that the above-mentioned embodiments illustrate, rather than limit, the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims.
For example, the stereo camera pair 120 may be replaced by a single camera, or by a multi-camera rig (for example, providing an omnidirectional view of the environment).
In the examples described above, A-KAZE features were used with M-LDB descriptors, to extract and characterise the visual features (and landmarks). However, it will be appreciated that other combinations of detectors and descriptors can be used. The descriptors may be binary descriptors, like the M-LDB descriptor, but this is not essential.
Other suitable algorithms for feature detection, description, and matching include, but are not limited to, the Scale-Invariant Feature Transform (SIFT) and the Binary Robust Invariant Scalable Keypoints (BRISK) algorithm. Features may also be detected using a corner detector such as the Harris corner detector. These algorithms will be well-known to those skilled in the art. In general, any (human-) hand-crafted or (machine-) learned features could be used for the visual features / landmarks.
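By way of illustration only, the sketch below shows how such a detector/descriptor pair might be selected using OpenCV; the use of OpenCV and the function names are assumptions for the example, not something required by the embodiments.

```python
import cv2


def make_feature_extractor(kind="akaze"):
    """Return a detector/descriptor object; any of these could back the pipeline."""
    if kind == "akaze":
        # A-KAZE with (modified) LDB binary descriptors, as in the examples above.
        return cv2.AKAZE_create(descriptor_type=cv2.AKAZE_DESCRIPTOR_MLDB)
    if kind == "brisk":
        return cv2.BRISK_create()    # binary descriptors
    if kind == "sift":
        return cv2.SIFT_create()     # floating-point descriptors
    raise ValueError(f"unknown feature type: {kind}")


def extract_features(image_gray, kind="akaze"):
    """Detect keypoints and compute their descriptors for one greyscale image."""
    keypoints, descriptors = make_feature_extractor(kind).detectAndCompute(image_gray, None)
    return keypoints, descriptors
```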
Nevertheless, in general it may be advantageous for the descriptors to be rotation invariant. As mentioned previously, this can be achieved by calculating a descriptor based on an estimate of the gravity direction, from the IMU.
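One possible way to realise this, sketched below under assumed conventions (gravity along -z in the world frame, a rotation R_cw taking world-frame vectors into the camera frame, and keypoints exposing a writable orientation attribute), is to assign every keypoint a common reference angle derived from the projection of gravity into the image plane before the descriptors are computed. This is an illustrative sketch, not the specific mechanism of the embodiments.

```python
import math
import numpy as np


def gravity_reference_angle_deg(R_cw):
    """Direction (degrees) of gravity projected into the image plane."""
    g_cam = R_cw @ np.array([0.0, 0.0, -1.0])          # unit gravity in the camera frame
    return math.degrees(math.atan2(g_cam[1], g_cam[0]))


def orient_keypoints_by_gravity(keypoints, R_cw):
    """Give all keypoints the same gravity-derived reference orientation."""
    angle = gravity_reference_angle_deg(R_cw)
    for kp in keypoints:
        kp.angle = angle   # descriptor extraction then uses this orientation
    return keypoints
```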
Comparisons may be made between descriptors using any suitable distance metric. For example, a similarity-based distance threshold may be used for matching descriptors (whether binary or not). In the examples described above, a Bag of (binary) Words representation was used to characterise the keyframes and new image frames. Alternatively or in addition, another way of characterising and comparing image frames could be used. The appropriate comparison metric may depend on the type of descriptor used to characterise the individual landmarks / visual features. A visual vocabulary based on the BOW representation has been found to work well for binary descriptors, for example.
Other suitable representations include but are not limited to: the Vector of Locally Aggregated Descriptors (VLAD) (see Relja Arandjelovic and Andrew Zisserman, "All About VLAD", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 1578-1585); and the vocabulary tree (see David Nister and Henrik Stewenius, "Scalable Recognition with a Vocabulary Tree", CVPR 2006: Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Volume 2, June 2006, pp. 2161-2168).
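As a concrete and deliberately simplified illustration of the Bag-of-Words idea, the sketch below builds a visual vocabulary by k-means clustering of descriptors and compares images via L1-normalised word histograms. The clustering library, vocabulary size and similarity score are assumptions for the example; a production system for binary descriptors would typically cluster under the Hamming metric (as in DBoW-style libraries) instead.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans


def build_vocabulary(training_descriptors, n_words=1000, seed=0):
    """Cluster descriptors gathered from many training images into visual words."""
    vocabulary = MiniBatchKMeans(n_clusters=n_words, random_state=seed)
    vocabulary.fit(training_descriptors)
    return vocabulary


def bow_vector(descriptors, vocabulary):
    """L1-normalised histogram of visual-word occurrences for one image."""
    words = vocabulary.predict(descriptors)
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    return hist / (hist.sum() + 1e-12)


def bow_similarity(v1, v2):
    """Similarity in [0, 1] between two L1-normalised BoW vectors."""
    return 1.0 - 0.5 * float(np.abs(v1 - v2).sum())
```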
In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps other than those listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The embodiments may be implemented by means of hardware comprising several distinct elements. In a device claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to some advantage. Furthermore, in the appended claims, lists comprising "at least one of: A; B; and C" should be interpreted as (A and/or B) and/or C. Furthermore, in general, the various embodiments may be implemented in hardware or special purpose circuits, software, logic, or any combination thereof. For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor, or other computing device, although these are not limiting examples.
While various aspects described herein may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof. The embodiments described herein may be implemented by computer software executable by a data processor of the apparatus, such as in the processor entity, or by hardware, or by a combination of software and hardware. Further in this regard, it should be noted that any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on such physical media as memory chips or memory blocks implemented within the processor, magnetic media such as hard disks or floppy disks, and optical media such as, for example, DVD and the data variants thereof, or CD.
The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory, and removable memory. The data processors may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application-specific integrated circuits (ASIC), gate level circuits, and processors based on multi-core processor architecture, as non-limiting examples.
Embodiments as discussed herein may be practiced in various components such as integrated circuit modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
Claims (20)
CLAIMS

- 1. A computer-implemented method for initialising keyframes in a map of a 3-D environment, for visual-inertial localisation in the environment and/or mapping of the environment, the method comprising: obtaining (252) visual data generated at a mobile device, comprising a series of image frames captured by a camera of the mobile device; obtaining (254) inertial data generated at the mobile device, comprising at least one or any combination of two or more of: accelerometer, gyroscope and compass measurements; selecting (550) one or more of the image frames to be keyframes; and initialising (600) one or more new keyframes in the map from the respective selected one or more image frames, wherein the selecting (550) is based at least in part on the inertial data.
- 2. The method of claim 1, wherein the selecting includes: detecting (555) a saturation event in the inertial data; and in response, rejecting (590) image frames captured during a first time interval after the saturation event.
- 3. The method of claim 1 or claim 2, wherein the selecting includes selecting (595) an image frame to be a keyframe if (570) the number of keyframes in a sliding temporal window before the image frame is below a first threshold.
- 4. The method of any one of the preceding claims, wherein the selecting includes selecting (595) an image frame to be a keyframe if (575) a second time interval has elapsed since the preceding keyframe in the series of image frames.
- 5. The method of any one of the preceding claims, further comprising calculating a measure of uncertainty for the inertial data, and wherein the selecting includes selecting (595) an image frame to be a keyframe if (560), for that image frame, a value of the measure of uncertainty is greater than a second threshold.
- 6. The method of claim 5, wherein the measure of uncertainty comprises an error covariance of the inertial data.
- 7. The method of claim 5 or 6, comprising: calculating (561) a first measure of uncertainty for rotational inertial data; comparing the first measure with the second threshold; if (560) the first measure is above the second threshold, selecting (595) the corresponding image frame to be a keyframe; if the first measure is not above the second threshold calculating (563) a second measure of uncertainty for acceleration data, comparing the second measure with a third threshold, and if (564) the second measure is above the third threshold, selecting (595) the corresponding image to be a keyframe.
- 8. The method of any one of the preceding claims, further comprising extracting (300) visual features from an image frame in the series of image frames, wherein the selecting includes rejecting (590) the image frame if (565) the number of visual features extracted from that image frame is below a fourth threshold.
- 9. The method of any one of the preceding claims, wherein each existing keyframe has associated with it a plurality of landmarks, each landmark being a detected visual feature that is present in two or more keyframes and each landmark defining a 3-D position for the respective visual feature, the method comprising extracting (300) visual features from an image frame in the series of image frames, and wherein the selecting includes comparing (580) the extracted visual features of said image frame with the landmarks of the preceding keyframe.
- 10. The method of claim 9, comprising: matching (510, 520, 840) the extracted visual features of said image frame with the landmarks of the preceding keyframe, to identify matched visual features; calculating (581) a first convex hull, being the two-dimensional convex hull of the matched visual features in said image frame; projecting (582) the landmarks of the preceding keyframe into said image frame; calculating (583) a second convex hull, being the two-dimensional convex hull of the projected landmarks in said image frame; calculating an area of overlap between the first and second convex hulls; calculating (584) an overlap ratio, being the ratio of the area of overlap to the area of the second convex hull; and selecting (595) the image frame to be a keyframe if (580) the overlap ratio is below a fifth threshold.
- 11. The method of claim 9 or claim 10, comprising: matching (510) the extracted visual features of said image frame with the landmarks of the preceding keyframe, to identify matched visual features; estimating (530) a pose for the image frame; identifying (586) a set of inlier correspondences, being the set of matched visual features, and respective landmarks, that deviate from the estimated pose by less than an error threshold; and if (585) the number of inlier correspondences is less than a sixth threshold, rejecting the image frame.
- 12. The method of claim 11, wherein estimating (530) the pose comprises deriving at least one pose estimate by applying a perspective-3-point method to the matched visual features and respective landmarks.
- 13. The method of claim 11 or 12, wherein calculating the estimated pose comprises applying a random sample consensus algorithm.
- 14. The method of any one of claims 8-13, wherein the extracted visual features are multiscale features in a nonlinear scale space, optionally Accelerated KAZE features.
- 15. The method of one of claims 8-14, wherein the extracted visual features and/or the landmarks are described using rotation- and scale-invariant binary descriptors, optionally M-LDB descriptors.
- 16. The method of claim 15 wherein the descriptors are generated using a gravity direction derived from the inertial data.
- 17. A computer program comprising computer program code configured to cause one or more physical computing devices to perform all the steps of the method of any one of the preceding claims when said computer program is run on the one or more physical computing devices.
- 18. A mobile device (100) configured to initialise keyframes using the method of any one of claims 1-16.
- 19. A mobile device (100) configured to initialise keyframes in a map of a 3-D environment, for visual-inertial localisation in the environment and/or mapping of the environment, the mobile device comprising: a memory (110), for storing the map; a camera (120), configured to capture visual data comprising a series of image frames; an inertial measurement unit (130), configured to generate inertial data, wherein the inertial measurement unit comprises at least one or any combination of two or more of: an accelerometer, a gyroscope and a compass; and one or more processors (140), configured to: select (550) one or more of the image frames to be keyframes; and initialise (600) one or more new keyframes in the map from the respective selected one or more image frames, wherein the selecting (550) is based at least in part on the inertial data.
- 20. The mobile device of claim 19, wherein the mobile device (100) is comprised in a handheld device, a robot or a vehicle.
Priority Applications (5)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB2016445.5A GB2599948A (en) | 2020-10-16 | 2020-10-16 | Initialising keyframes for visual-inertial localisation and/or mapping |
US17/995,195 US11830218B2 (en) | 2020-10-16 | 2021-10-15 | Visual-inertial localisation in an existing map |
PCT/EP2021/078643 WO2022079258A1 (en) | 2020-10-16 | 2021-10-15 | Visual-inertial localisation in an existing map |
PCT/EP2021/078701 WO2022079292A1 (en) | 2020-10-16 | 2021-10-15 | Initialising keyframes for visual-inertial localisation and/or mapping |
EP21794151.7A EP4229598B1 (en) | 2020-10-16 | 2021-10-15 | Initialising keyframes for visual-inertial localisation and/or mapping |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB2016445.5A GB2599948A (en) | 2020-10-16 | 2020-10-16 | Initialising keyframes for visual-inertial localisation and/or mapping |
Publications (2)
Publication Number | Publication Date |
---|---|
GB202016445D0 GB202016445D0 (en) | 2020-12-02 |
GB2599948A true GB2599948A (en) | 2022-04-20 |
Family
ID=73598375
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
GB2016445.5A Pending GB2599948A (en) | 2020-10-16 | 2020-10-16 | Initialising keyframes for visual-inertial localisation and/or mapping |
Country Status (1)
Country | Link |
---|---|
GB (1) | GB2599948A (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113763468B (en) * | 2021-01-21 | 2023-12-05 | 北京京东乾石科技有限公司 | Positioning method, device, system and storage medium |
CN113432593B (en) * | 2021-06-25 | 2023-05-23 | 北京华捷艾米科技有限公司 | Centralized synchronous positioning and map construction method, device and system |
CN114821113A (en) * | 2021-11-19 | 2022-07-29 | 江苏科技大学 | Monocular vision inertia SLAM method and system based on adaptive robust kernel |
CN114724056B (en) * | 2022-02-17 | 2024-09-06 | 杭州华橙软件技术有限公司 | Loop detection method, related equipment and device |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190220991A1 (en) * | 2014-10-31 | 2019-07-18 | Fyusion, Inc. | Multi-directional structured image array capture on a 2d graph |
CN107818592A (en) * | 2017-11-24 | 2018-03-20 | 北京华捷艾米科技有限公司 | Method, system and the interactive system of collaborative synchronous superposition |
Non-Patent Citations (15)
Title |
---|
B. TRIGGS, P. F. MCLAUCHLAN, R. I. HARTLEY, A. W. FITZGIBBON: "Bundle Adjustment - A Modern Synthesis", INTERNATIONAL WORKSHOP ON VISION ALGORITHMS: THEORY AND PRACTICE (IWVA), 1999 |
C. ATKESON, A. MOORE, S. SCHAAL: "Locally weighted learning", AI REVIEW, vol. 11, 1997, pages 11 - 73 |
C. FORSTER, L. CARLONE, F. DELLAERT, D. SCARAMUZZA: "On-manifold preintegration for real-time visual-inertial odometry", IEEE TRANS. ROBOTICS, vol. 33, no. 1, 2017, pages 1 - 21, XP011640492, DOI: 10.1109/TRO.2016.2597321 |
D. GALVEZ-LOPEZ, J. D. TARDOS: "Bags of binary words for fast place recognition in image sequences", IEEE TRANS. ROBOTICS, vol. 28, no. 5, 2012, pages 1188 - 1197, XP011474263, DOI: 10.1109/TRO.2012.2197158 |
D. LOWE: "Distinctive image features from scale-invariant keypoints", INTL. J. OF COMPUTER VISION, vol. 60, 2004, pages 91 - 110, XP019216426, DOI: 10.1023/B:VISI.0000029664.99615.94 |
DAVID NISTER, HENRIK STEWENIUS: "Scalable Recognition with a Vocabulary Tree", CVPR 2006: PROCEEDINGS OF THE 2006 IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, vol. 2, June 2006 (2006-06-01), pages 2161 - 2168, XP010923119, DOI: 10.1109/CVPR.2006.264 |
F. R. KSCHISCHANG, B. J. FREY, H.-A. LOELIGER: "Factor graphs and the sum-product algorithm", IEEE TRANS. INFORM. THEORY, vol. 47, no. 2, 2001, pages 498 - 519 |
M. FISCHLER, R. BOLLES: "Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography", COMMUN. ASSOC. COMP. MACH., vol. 24, 1981, pages 381 - 395 |
P. F. ALCANTARILLA, J. NUEVO, A. BARTOLI: "Fast explicit diffusion for accelerated features in nonlinear scale spaces", BRITISH MACHINE VISION CONF. (BMVC), 2013 |
P. F. ALCANTARILLA, J. NUEVO, A. BARTOLI: "Fast explicit diffusion for accelerated features in nonlinear scale spaces", BRITISH MACHINE VISION CONF. (BMVC), 2013, XP002803472 * |
P. F. ALCANTARILLA, K. NI, L. M. BERGASA, F. DELLAERT: "Visibility learning for large-scale urban environment", IEEE INTL. CONF. ON ROBOTICS AND AUTOMATION (ICRA), 2011, pages 6205 - 6212 |
PIAO JIN-CHUN ET AL: "Real-Time Visual-Inertial SLAM Based on Adaptive Keyframe Selection for Mobile AR Applications", IEEE TRANSACTIONS ON MULTIMEDIA, IEEE SERVICE CENTER, US, vol. 21, no. 11, 1 November 2019 (2019-11-01), pages 2827 - 2836, XP011752245, ISSN: 1520-9210, [retrieved on 20191024], DOI: 10.1109/TMM.2019.2913324 * |
RELJA ARANDJELOVIC, ANDREW ZISSERMAN: "All About VLAD", PROCEEDINGS OF THE IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2013, pages 1578 - 1585, XP032492932, DOI: 10.1109/CVPR.2013.207 |
T. LUPTON, S. SUKKARIEH: "Visual-inertial-aided navigation for high dynamic motion in built environments without initial conditions", IEEE TRANS. ROBOTICS, vol. 28, no. 1, 2012, pages 61 - 76, XP011403476, DOI: 10.1109/TRO.2011.2170332 |
TIEFENBACHER PHILIPP ET AL: "Sensor Fusion for Sparse SLAM with Descriptor Pooling", 24 November 2016, ADVANCES IN INTELLIGENT DATA ANALYSIS XIX [LECTURE NOTES IN COMPUTER SCIENCE], SPRINGER INTERNATIONAL PUBLISHING, CHAM, pages 698 - 710, ISBN: 978-3-030-71592-2, ISSN: 0302-9743, XP047566203 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117705107A (en) * | 2024-02-06 | 2024-03-15 | 电子科技大学 | Visual inertial positioning method based on two-stage sparse Shuerbu |
CN117705107B (en) * | 2024-02-06 | 2024-04-16 | 电子科技大学 | Visual inertial positioning method based on two-stage sparse Shuerbu |