WO2000034803A2

WO2000034803A2 - Vision-assisted camera pose determination

Info

Publication number: WO2000034803A2
Application number: PCT/US1999/027483
Authority: WO
Inventors: Jim Kain; Charlie Yates; Arthur Zwern; Sandor Fejes; Jinlong Chen; Marc Jablonski
Original assignee: Geometrix, Inc.; Korbin Systems, Inc.
Priority date: 1998-11-20
Filing date: 1999-11-19
Publication date: 2000-06-15
Also published as: JP2002532770A; EP1068489A2; WO2000034803A3

Abstract

A completely passive and self-contained system for determining pose information of a platform comprises a motion sensing device and an imaging device, operating together such that each of images generated from the imaging device corresponds to a set of motion data provided by the motion sensing device. The motion sensing device and the imaging device may be integrated and/or operate synchronously. The imaging device senses the surrounding scene from which features are extracted and tracked to determine the imaging device motion. Hence no advance information regarding the scene or no special scene preparation is required. Further, a statistic estimation process, such as the Kalman filter, is employed to assist the feature tracking. To determine the pose information, the features and the motion data propagated by a strapdown navigation process are provided to the statistic estimation process. Errors from the statistic estimation process are used to refine the features and the motion data.

Description

specially-designed visual targets pre-placed within a scene and US Patent No.: 5,517,419 describes a system for geolocating arbitrary points in the terrain with the vehicle trajectory determined primarily by GPS. Preplanning is required for all of these prior art systems. Moreover, US Patent No.: 5,699,444 describes a system in which the camera position and in-scene features are located from multiple camera views from multiple aspects. The system uses vision sensors only without a requirement for separate motion sensing components. The overall performance of the system may suffer because no pose information is provided until all views are available and further the feature selection is a manual process. In addition, the imagery-only solution for pose determination may be constrained by the geometry of the various view aspects. US Patent No.: 5,51 1 ,153 improves upon other methods in that a high-rate sequence of images is processed offering recursive estimation for feature localization and thus offers the possibility of automation of the feature tracking process. However, US Patent No.: 5,51 1 , 153 requires that at least seven features be continuously tracked, each feature is represented as a single range parameter rather than a complete 3D position vector. In addition, such imagery-only solutions impose serious constraints on allowable camera motion, tending to fail for rotations about the camera nodal point, rapid rotation, looming motion, and narrow field-of-view cameras. Finally, imagery-only methods such as US Patent No.: 5,51 1 ,153 have noindependent means of predicting the feature locations on future frames, so feature tracking search windows may have to be large enough to accommodate all potential camera motions, resulting in prolonged computation times.

In reality, however, there are many applications or areas in which a priori knowledge of a scene is not available while precise navigation guidance is required. There is, therefore, a great need for a system in which the post determination of a platform can be automatically and efficiently obtained in conjunction with a surrounding scene without a priori knowledge of the scene. SUMMARY OF THE INVENTION

The present invention relates to techniques that extract and process visual information from imagery in order to assist an instrument-based navigation system in achieving high levels of pose determination accuracy, especially under conditions when one or more instruments fail to provide accurate data. A navigation system employing the invention disclosed herein can be used to obtain highly accurate platform pose information despite large drift and noise components in motion sensing data, and despite large uncertainties in the camera/lens/optics used for imagery collection. According to one aspect of the present invention, images are acquired by a camera that is either rigidly connected to or whose orientation is controlled relative to a cluster of motion sensing instruments in a platform that may include, but not be limited to, a moving system (i.e. a vehicle) and a flying system (i.e. a missile). An example of motion sensing instruments includes the Inertial

Measurement Unit (IMU) comprising devices for measuring acceleration and rotation rate along three orthogonal axes rigidly attached to the platform. This IMU may be augmented with a GPS sensor for measuring the position and velocity of the platform relative to a geodetic reference frame. A salient feature operator is applied to the images obtained from the camera to detect regions within the images that are relatively unique in an image processing sense. Image templates of these regions are stored, and visual feature tracking operators attempt to identify the corresponding regions in later images. If the corresponded features resulted from stationery objects or terrain (e.g. buildings), then the platform position can be determined with respect to a coordinate system fixed relative to these features. Motion sensors, such as the accelerometers and rate gyros that are included in an IMU instrument cluster, have heretofore been complex individual sensor subsystems. Recent development in the field of Micro-Mechanical Electromechanical Systems (MEMS) are revolutionizing the concept of the IMU in terms of size and power. However, MEMS IMU components are much less accurate than traditional IMU components used for navigation. Moreover, digital camera components have evolved to highly integrated single-chip cameras containing all light-sensing, digitization, and image processing elements. These cameras can be fitted with miniature lenses to arrive at a very small digital camera component. As with the MEMS IMU components, the new camera optical components are not metrically accurate. That is, the light rays are not mapped with high accuracy onto the light-sensing two-dimensional array. Further complicating this issue, the lens/optics may be a varying focal length so that the optical parameters can be highly variable through the course of imagery collection. In both the IMU and camera situations, accurate system mechanization must deal with the likely inaccuracies of the constituent subsystems. One of the features in the present invention is the built-in tolerance to inaccuracy from subsystems. The present invention includes substantial and easily expandable models for errors in the sensing components that are parameterized with statistical-based coefficients. The errors in these coefficients are accounted for in the processing and, in typical circumstances, the errors are estimated with high accuracy.

Systems that employ the present invention may include self- contained motion sensing and camera devices. Navigation using these sensors can be relative to a local scene and the axes of the pose information are related arbitrarily to the early detected features. All subsequent features and platform data are provided relative to this initial coordinate system. According to one embodiment, geodetic pose information is provided. The merging of GPS and IMU information is well known to the art and such integration may be optionally included to work together with the present invention. As a result, only sparse GPS information is required, as maybe available within an inner city environment, to enable pose to be referenced to geographic coordinates. Unlike the prior art systems, this embodiment provides pose determination in geographic coordinates even during long periods of GPS unavailability such as indoors, in tunnels, and in densely built- up urban environments

Some naturally occurring scenes may be known to include features that can be classified by their imagery makeup. Such features may include intersections of straight lines (e.g., building corner) or an occurrence of a vertical line (e.g., a building edge or a tree trunk) or other characteristics reflecting objects internationally planted in a scene. The invention also includes the optional capability of restricting features to fall in special classes. Some navigation scenarios may be such that repeated navigation within an area is required. In this instance, it may be suitable to initiate the navigation by exercising the invention as described above in a learning mode to collect the necessary navigation feature archive. Thereafter, the same pre-learned features will be used without a need to add new features. The archive may also be passed among multiple platforms that wish to navigate the area. Similarly, it may be suitable to insert arbitrary features at arbitrary non-surveyed locations (traffic cones, colored balls, and posts) that are easily recognizable within the imagery. These features can then be treated preferentially within the feature selection process. However, the localization and feature tracking of these preferential features will proceed exactly as did the naturally occurring features.

The present invention may be implemented in numerous ways, including a method, a system and a computer readable medium containing program code for automatically obtaining pose information of a platform including a motion sensor and an imaging device. Different embodiments or implementations may yield one or more of the following unique advantages and benefits.

One of the advantages and benefits in the present invention is that pose for an arbitrarily moving platform can be determined with a high degree of accuracy without a requirement for external information or prior mission planning. Moreover, the invention has a built-in resilience to subsystem error enabling the use of less accurate inertial sensing and camera components. Finally, the precise pose information can be tagged to specific image frames allowing extensive metric use of the imagery set for 3D scene reconstruction and mensuration.

Other advantages, benefits, objects and features of the present invention, together with the foregoing, are attained in the exercise of the invention in the following description and resulting in the embodiment illustrated in the accompanying drawings. BRIEF DESCRIPTION OF THE DRAWINGS

These and other features, aspects, and advantages of the present invention will become better understood with regard to the following description, appended claims, and accompanying drawings where:

Figure 1 demonstrates a system in which the present invention may be practiced.

Figure 2 shows a block diagram of a preferred internal construction of a computer system that may be used in the system of Figure 1.

Figure 3 illustrates a 3D drawing of an intensity image that includes a white area and a dark area;

Figure 4A shows two exemplary consecutive images and successively received from an imager; Figure 4B shows that an exemplary multi-resolution hierarchical feature structure for extracting a feature in one of the images in Figure 4A;

Figure 4C shows K image structures from a single image and each of image structures is for one feature; Figure 4D shows, as an example, what is called herein a

"features tracking map", or simply features map;

Figure 4E shows a flowchart of the feature extraction process;

Figure 4F illustrates a template update in feature tracking among a set of consecutive images; Figure 4G shows a process flowchart of the enforcement of an epipolar geometry between a current image frame and previous image frames;

Figure 5A shows a process flowchart of the integration of feature tracking and navigation according to one embodiment of the present invention;

Figure 5B illustrates an aspect of the motion-sensor-derived information;

Figure 6 shows the overall process functional diagram of the navigation computation elements in the system;

Figure 7 shows that a feature is a discrete point detected within an image observed from the digital camera; and

Figure 8 shows a set of exemplary states that are received from an inertial device and further propagated by the Kalman filter process in Figure 6.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

Notation and Nomenclature

In the following detailed description of the present invention, numerous specific details are set forth in order to provide a thorough understanding of the present invention. However, it will become obvious to those skilled in the art that the present invention may be practiced without these specific details. In other instances, well known methods, procedures, components, and circuitry have not been described in detail to avoid unnecessarily obscuring aspects of the present invention. The detailed description of the present invention in the following are presented largely in terms of procedures, steps, logic blocks, processing, and other symbolic representations that resemble of data processing in computing devices. These process descriptions and representations are the means used by those experienced or skilled in the art to most effectively convey the substance of their work to others skilled in the art. The method along with the system and the computer readable medium to be described in detail below is a self-consistent sequence of processes or steps leading to a desired result. These steps or processes are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities may take the form of electrical signals capable of being stored, transferred, combined, compared, displayed and otherwise manipulated in a computer system or electronic computing devices.

It proves convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, operations, messages, terms, numbers, or the like. It should be borne in mind that all of these similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the following description, it is appreciated that throughout the present invention, discussions utilizing terms such as "processing" or "computing" or "verifying" or "comparing" or the like, refer to the actions and processes of a computing device that manipulates and transforms data represented as physical quantities within the computing device's registers and memories into other data similarly represented as physical quantities within the computing device or other electronic devices.

10 System Overview and Data Acquisition

Referring now to the drawings, in which like numerals refer to like parts throughout the several views. Figure 1 demonstrates a configuration in which the present invention may be practiced. An object 100 is typically a distinctive aspect of a landscape around which a system 110 employing the present invention is transported in an arbitrary unplanned route 101. The object may include, but may not be limited to, a natural scene, terrain, and man-made architecture. Examples of system 110 may include a vehicle, a plane, a human operator and a self-navigated missile. Typically, system 100 comprises an imaging device 103 and a motion sensing device 104. According to one embodiment, imaging device or imager 103 is a digital camera producing digital images and motion sensor 104 is an Inertial Measurement Unit (IMU) that provides a full six degree of measurement (i.e. pose information) of system 110 with respect to a common coordinate space. Imaging device 103 produces an image archive 102 of the object where each image is tagged with the pose that occurred at the instant the image was taken. To image such object, imager 103 is attached to and operates in concert with IMU device 104 and thus produces a sequence of images in a format of video frames 105 or a sequence of pictures by gradually moving the imager around or relatively within the scene. For example, an imager may be attached to a flying vehicle if a particular imagery set from the overflown terrain is required, or a camera may be attached to an automobile rooftop if imagery features will be used to aid the automobile navigation, or a camera with attached IMU may be carried by a human operator if a 3D reconstruction is planned. A sequence of images of the particular area is thus generated when the vehicle moves with

11 respect to the urban terrain. In the present invention, a motion sensing device, such as IMU 104 that senses rotation rate and acceleration along orthogonal axes, is coupled with the imaging process. Moreover, the timing of the discrete IMU measurements is precisely synchronized with the timing of each of the imagery frames.

Synchronization of the imagery frames and motion sensing data may be achieved by a variety of means by those skilled in the art. For example, camera frames may be precisely triggered by the IMU sample clock, or both camera frames and IMU samples may be time-tagged using a common clock.

To facilitate the description of the present invention, the object 100 is assumed to be a building. Thus a user or operator can traverse around object 100 to produce a sequence of images providing a surrounding view of object 100. Imager 103 may be a video camera whose focal length may be varied when the surrounding imagery is generated. Imager 103 is associated with a digital sampling process that produces digital imagery on a suitable media 105 along with time-tags that relate the images to the IMU data. Imager 103 and its companion IMU instrument 104 are coupled either during imaging or later to computer system 106 that includes a means for accepting images and companion motion sensing instrument data. Thus suitable media 105 may in fact be a dedicated data bus. Within the camera mechanization or in a post- collection step, digital imagery is produced. Figure 1 shows a Digital Video (DV) camera that results in direct video storage to magnetic tape. Alternatively, a camera may be used to produce analog signals that are digitized via a frame grabber, either in realtime or in a post-processing step. The exact configuration of imager

12 104 does not affect the operation of the present invention. In the following, it is assumed imager 104 produces a sequence of digital images C-i , C₂, ... C_N, typically in a commonly used color format, coordinates or space. One of the commonly used color spaces is the RGB color space in which each of the image color pixels is represented as a vector C(i, j) = [R(i, j), G(i, j), B(i, j)]^τ, where (i, j) ate coordinates of an image pixel C(i, j) and R, G and B are the respective three intensity images in color image C. It is understood that the R, G, and B color image data representation is not necessarily the best color space for certain desired computations, there are many other color spaces that may be particularly useful for one purpose or another.

Computer system 106 may be a computing system that may include, but not be limited to, a desktop computer, a laptop computer or a portable device integral to the imaging system. Figure 2 shows a block diagram showing an exemplary internal construction of computer system 106. As shown in Figure 2, computer system 106 includes a central processing unit (CPU) 122 interfaced to a data bus 120 and a device interface 124. CPU 122 executes certain instructions to manage all devices and interfaces coupled to data bus 120 for synchronized operations and device interface 124 may be coupled to an external device such as imaging system 103 and IMU instrument 104 hence image data and IMU data therefrom are received into a memory or storage through data bus 120. Also interfaced to data bus 120 is a display interface 126, network interface 128, printer interface 130 and floppy disk drive interface 138. Generally, a complied and linked version of one embodiment of the present invention is loaded into

13 storage 136 through floppy disk drive interface 138, network interface 128, device interface 124 or other interfaces coupled to data bus 120.

Main memory 132 such as random access memory (RAM) is also interfaced to data bus 120 to provide CPU 122 with the instructions and access to memory storage 136 for data and other instructions. In particular, when executing stored application program instructions, such as the complied and linked version of the present invention, CPU 122 is caused to manipulate the image data to achieve desired results. ROM (read only memory) 134 is provided for storing invariant instruction sequences such as a basic input/output operation system (BIOS) for operation of keyboard 140, display 126 and pointing device 142 if there are any.

Feature Extraction and Tracking One of the features in the present invention is to provide an automatic mechanism that extracts and tracks only the most salient features in the image sequence, and uses them, in concert with the IMU measurements, to automatically generate the motion of the imager. The features used in the present invention are those that are characterized as least altered visually from one frame to an adjacent frame and can be most accurately located in the image by automatic image processing methods. For example, a salient feature may be characterized by a sharp peak in an autocorrelation surface or by corner-like features in each of the image frames. There are many known techniques for extracting salient features and locating them in subsequent frames and the method detailed here should not be considered limitative of the present invention.

To accelerate the process of feature extraction and tracking, the present invention uses a salient feature operator to detect the features

14 only in an initial image or in those images that appear to have lost some of the features being tracked. For images subsequent to the images applied with the salient feature operator, the present invention utilizes multi-resolution hierarchical feature tracking to establish feature correspondences to the features detected by the salient feature operator. The search space for the corresponding feature is initiated at a point as directed by the predicted feature location from the navigation processing subsystem.

Extraction of Salient Features According to one embodiment, the salient features to be extracted are typically those corner-like features in the images. Figure 3 illustrates a 3D drawing 202 of an intensity image 200 that includes a white area 204 and a dark area 206. Drawing 202 shows a raised stage 208 corresponding to white area 204 and a flat plane 210 corresponding to dark area 206. Corner 212 is the salient feature of interest whose location change can be the most accurately determined and typically least affected from one frame to a next frame.

A salient feature detection processing is designed to detect all the salient features in an image. The salient feature detection processing is to apply a feature detection operator to an image to detect the salient features therein. According to one embodiment, the feature detection operator or feature operator O(/) on an image / is a function of the Hessian matrix of a local area of the image that is based on the Laplacian operator performed on the area. Specifically, the salient feature operator O(/) can be defined as:

/ = O(I) = Det[H(I)] - λG(I)

15 where I_f is, as a result, defined as a feature image resulting from the salient feature detection processing by 0(1). Det( ) is the determinant of matrix H and λ is a controllable scaling constant and:

G(I) = Ixx + Iyy

The Hessian matrix can be further expressed as follows:

Ixx Ixy (/) = i^χy lyy

where x and y are the horizontal and vertical direction, respectively, and the second order derivatives:

d²ι. 5

/ = dx² dxdy yy ay

and l_s is a smooth version of image / by performing an image convolution with a 2D Gaussian kernel that is typically 1 1x1 1 to 15 x 15 pixels in size.

One of the unique features of the salient feature operator described herein is the ability of emphasizing only the corner-like regions, such as 212, while suppressing edge or homogeneous regions, such as 214, 208 and 210 in Figure 3. This is useful as only corner-like features provide constraints in two axes. After image / is processed by the salient feature operator, the local maximums of salient image I_f are then extracted which correspond to the salient features. Typically, image / is an intensity image that may be an intensity component in the HIS color space or a luminance component derived from the original color image.

16 Generally each of the salient features is presented as a template, such as a 1 1 -by-1 1 or 13-by-13 image template. The characteristicsor attributes of a salient feature template may comprise the location of the feature in the image, color information and strength thereof. The location indicates where the detected salient feature or the template is located within the image, commonly expressed in coordinates (i, j). The color information may carry color information of the template centered at (i, j). The strength may include information on how strongly the salient feature is extracted or computed as l_f(i, j). In operation, there are N color images sequentially received from an imager. As each color image is received, it is first transformed to a color space in which the luminance or intensity component may be separated from the chrominance components. As understood by those skilled in the art, the color image conversation is only needed when the original color image is presented in a format that is not suitable for the feature extraction process. For example, many color images are in the RGB color space and therefore may be preferably transformed to a color space in which the luminance component may be consolidated into an image. The above feature operator is then applied to the luminance component to produce a plurality of the salient features that preferably are indexed and kept in a table as a plurality of templates. Each of the templates may record the characteristics or attributes of each feature.

By the time the N color images are processed, there shall be N corresponding feature tables, each comprising a plurality of the salient features. The tables can then be organized as a map, referred to herein as a feature tracking map, that can be used to detect how each of the features is moving from one image frame to another.

17 Tracking of Salient Features

The goal of tracking is to find image features in consecutive image frames, which correspond to the same physical point in the world. Since computerized tracking of salient features in the image frames are based on finding image regions which appear visually similar to those centered at the salient features, it is often the case that incorrect matches are found. This is usually due to changes in imaging conditions or in viewing perspectives or occurrence of repetitive patterns. If the feature tracker estimates a bad feature correspondence then this error may have a detrimental effect on the subsequent pose estimation. Therefore it is of great importance to eliminate these bad feature matches and maintain the highest possible number of correct matches. According to one aspect of the present invention, a combination of techniques, such as Multi- resolution feature tracking, Epipolar geometry-based tracking and

Navigation-assisted tracking are designed to find only correct feature correspondence in the image sequence using the least possible computation.

Multi-resolution Feature Tracking In a preferred embodiment, a multi-resolution hierarchical feature structure is used to extract the features for tracking. To be specific, Figure 4A shows two consecutive images 402 and 404 are successively received from an imager. After the salient feature operator is applied to image 402, it is assumed that one feature 406 is detected and the characteristics thereof are recorded. When second image 404 comes in, a multi-resolution hierarchical image pyramid from the image is generated.

Figure 4B shows an exemplary multi-resolution hierarchical feature structure 408 for extracting feature 406 in image 404. There

18 are a number of image layers 410 (e.g. L layers) in image structure 408. Each of the image layers 410 is successively generated from the original image 404 by a decimation process around the feature location. For example, layer 410-L is generated by decimating layer 410-(L-1). The decimation factor is typically a constant, preferably equal to 2. Given the characteristics of the feature found in image 402 and knowing that two image 402 and 404 are two successive images, the feature and its location 405 in image 404 shall not alter drastically. Therefore an approximate search area for the feature can be defined in the second image and centered at the original location of the feature.

More specifically, if feature 406 is located at coordinates (152, 234) in image 402, the window to search for the same feature may be defined as a square centered at (152, 234) in image 404.

Multi-resolution hierarchical feature structure 408 shows that as the number of layers 410 is increased upward, the resolution of each layer 410 decreases. In other words, when the size of the search window remains the same, the search area is essentially enlarged. As shown in the figure, search window 412 covers relatively a larger area in layer 410-L than in layer 410-(L-1 ). In operation, layer 410-L is first used to find an approximated location of the feature within search window 412. One of the available methods for finding the location of the corresponding function in the consecutive images is to use a template matching process. The template is defined as typically a square image region (11-by-1 1 to 15-by-15) centered at the location of the original feature extracted by the salient feature operator, or at the predicted location of feature in a new frame. Then the corresponding subpixel accurate location of the match can be found at that position where the normalized cross-correlation of the two corresponding images regions is the largest (ideally "1 " for a complete match). Layer

19 410-(L-1 ) is then used to refine the approximated location of the feature within the closest area in the same window size and finally layer 410 is used to precisely determine the exact location (x, y) of the feature. It can be appreciated that the use of the feature structure has many advantages over prior art feature extraction approaches. In essence, an effectively larger representation of the feature template can be achieved, which makes it possible to track a feature effectively and precisely and is directly suitable to the hierarchical tracking mechanism. Generally there are K salient features in an image and K can be in a range of 1 ~ 1000. Hence there are K feature structures like the one in Figure 4B. Figure 4C shows K feature structures 420 from a single image, each of feature structures 420 is for one feature. As a result of the feature extraction, a set of attributes F(...) describing each of the K features are produced and may comprise information of the location, strength and color of the feature.

With N image frames and K sets of the attributes Fj(...), i = 1 , 2, ... K, Figure 4D shows what is called herein a "feature tracking map", or simply, a feature map that illustrates collectively all the features found for N images and is used for tracking the features so as to estimate the motion of the imager. In addition, Figure 4E shows a flowchart of the feature extraction process. Both of Figures 4D-4E are described conjointly to fully understand the feature detection and tracking process in the present invention. At 452, color images are successively received from the imager.

A dominant component, preferably the luminance or intensity component is extracted from the color images at 454. In one embodiment, the color images are simply transformed to another color space that provides a separate luminance component. At 456, the

20 process looks up, for example, a memory area, for any features or feature templates stored there. If there are sufficient number of feature templates in the memory area, that means that the process needs to proceed with feature tracking in the next image, otherwise, the process needs to check if new features must be extracted at 458. In operation, the first received image always invokes the feature extraction operation with the salient feature operator as there are no stored features or feature templates to perform the feature tracking process. So the process now goes to 460. At 460, the feature extraction process generates K features in the received image (e.g. frame #1 ). As illustrated in Figure 4D, there are K features in the received image frame #1 . Preferably, the attributes of the K features, as feature templates, are stored in a memory space for subsequent feature extraction process. When a next image comes in at 462, the process goes to 464 to generate the multiple-resolution hierarchical image pyramid preferably having the newly arrived image as the base. As described above and shown in Figure 4C, the tracking process searches for locations in the image pyramid which demonstrate most similarity to the respective layers of the feature templates stored in the feature structures. With each of the K multi-resolution feature structures, K or less corresponding features are localized from each corresponding layer in the image pyramid at 466 and the K feature locations are then collected and appended to the features map for frame 2. Similarly, for the next n1 frames, the process goes to 462 via 456 repeatedly to extract K features from each of the n1 frames.

It may be observed that, as images are generated, the imager may have been moved around the object considerably with respect to the initial position from which the first image is captured. Some of the K

21 features may not necessarily be found in those late generated images. Because of the perspective changes and motion of the imager, those features may be either out of the view or completely changed so that they could be no longer tracked. For example, a corner of a roof of a house may be out of the view or lose its salient feature when viewed from a particular perspective. Therefore, the representation 430 of the K features for n1 images in Figure 4D shows the dropping of the number of the features.

As one of the features in the present invention, the generation of additional new features is invoked when the number of features drops more than a predefined threshold (T). At 456, when it is found that a certain number of the features can not be found in an incoming image, the process goes to 458 to determine if it is necessary to extract new features to make up the K features. As described above, when the number of the features drops due to a perspective change or occlusion, new features may have to be extracted and added to maintain sufficient amount of features to be tracked in an image. The process restarts the feature detection at 460, namely applying the salient feature operator to the image to generate a set of salient features to make up for those that have been lost. The process is shown, as an example, to restart the feature detection at frame n1 in Figure 4D.

As indicated in Figure 4E and also shown in Figure 4D, the feature templates to be matched with consecutive images remain as the original set in tracking the features in subsequent images and do not change from one frame to another. Typically establishing feature correspondence between consecutive image frames can be accomplished by two ways. One is to achieve this in directly consecutive image pairs, the other one is by fixing the first frame as

22 reference and finding the corresponding locations in all other frames with respect to this reference frame. In one embodiment, the second approach is used since it minimizes possible bias or drifts in finding the accurate feature locations, as opposed to the first approach where significant drifts can be accumulated over several image frames.

However, the second approach permits only short-lived feature persistence over a few frames as the scene viewed by the camera undergoes large changes of view as the camera covers large displacements, which ultimately causes the tracking process proceed to 472.

In order to maintain feature tracking over many of subsequent frames using the second approach, a feature template update mechanism is incorporated in 474. As shown in Figure 4F, if no corresponding feature locations can be found in the most recent frame 492 with respect to the features in the reference frame 490, the templates of the lost features are replaced by the ones located in the most recent frame 494 in which they have been successfully tracked, i.e. at 494. The template update at 474 of Figure 4E provides the benefits that features can be successfully tracked even if they may have had a significant perspective view change by minimizing accumulative drift typical for the first approach.

Understandably, the feature regeneration is invoked after every certain number of frames. Figure 4D shows, respectively, feature sets 432 - 436 for images at frame number n1 , n2, n3, n4, n5 and n6. The frame number n1 , n2, n3, n4, n5 and n6 may not necessarily have an identical number of frames in between. As the imager further moves and generates more images, some of the features may reappear in some of the subsequent images, are shown as 438-440, and may be reused depending on the

23 implementation preference. At 470, the process ensures that all the frames are processed and features thereof are obtained. As a result, a feature map, as an example in Figure 4D, is obtained.

Epipolar constraint-based Feature Tracking While the multi-resolution approach can significantly reduce the search area for the tracker, it does not guarantee that the feature is matched correctly since it uses a similarity measure of the feature template only and does not use any geometric constraint of the 3D points in the world reference frame. As well-known by those skilled in the art, the epipolar constraint defines a one-dimensional restriction on the projected image locations of stationary points in the 3D environment being observed by the imaging device. According to this constraint, given pairs of corresponding points p1=(x1 ,y1 ,1)^τ and p2=(x2,y2,1)^τ in two image frames, they have to satisfy the equation p1 ^{τ *}F^*p2 = 0, where F is the

Fundamental matrix. This matrix can be computed from 7 or more corresponding point-pairs. Having computed this matrix one can derive the epipolar constraints for the correspondence of all the rest of the feature points in the particular pair of image frames. In order to guarantee that the original 7 point-pairs are correctly matched we apply a technique based on RANSAC (RANdom Sample Concensus) principles - known by those skilled in the art. After this constraint is computed the rest of the feature matches can be found by searching along these epipolar lines only. This approach greatly reduces the area of search and eliminates the problem of false feature matches which fall outside of this search area.

Combining the pair-wise epipolar constraint in three image frames results an even stronger constraint called the trifocal constraint -known by those skilled in the art. This latter constraint fully

24 determines the locations of the feature points in the third frame given their locations in the two other frames and that the translation motion is not co-linear. Figure 4G shows a process flowchart of the enforcement of the epipolar geometry between the current image frame #n 480 and previous image frames, which results in a multiple enforcement 482 and verification of the trifocal constraint on the location of feature points in fame #n given their locations in the earlier frames.

Kalman filter navigation processing is used as a subsequent process following the extraction of suitable features from the imagery. As used herein, Kalman navigation refers to navigation based on the

Kalman Filter process that is well known in the art. Often times one wants to estimate the state of a system given a set of measurements taken over an interval of time. The state of the system refers to a set of variables that describe the inherent properties of the system at a specific instant of time. The Kalman filter is a useful technique for estimating, or updating the previous estimate of, a system's state by, for example, using indirect measurements of the state variables and using the covariance information of both the state variables and the indirect measurements.

It is noted that falsely matched features may cause serious problems for the Kalman navigation as may those which are not stationary, i.e., moving independently. Using the epipolar and the trifocal constraints in tracking, it is now possible to eliminate most of these wrong or undesired correspondences between the current frame (Frame #n) and the previous ones without knowing the motion of the camera platform. It should be pointed out that the Kalman filter is one of many statistic estimation processes that may be used to provide navigation feedback to the feature tracking. For clarity, the Kalman filter is used to describe the invention according to one embodiment.

25 The combination of hierarchical feature tracking and the epipolar-constrained tracking provide a powerful tool for accurate tracking of features during the initialization of the Kalman navigation process or the estimation of a newly extracted feature, thereby it provides optimal convergence during the critical initialization phase of the Kalman Filter.

Navigation-Assisted Feature Tracking

As indicated above, the Kalman Filter process produces the optimal estimates of the platform position and feature locations given the feature set provided by the vision processing. The performance is achieved with maximum feature persistence and re-acquisition of archived features.

Using the estimates of the platform position and feature locations we can accurately predict the image locations of those salient features whose 3D locations are already known, or we can constrain the tracking of newly acquired features to the epipolar lines. As soon as the camera motion estimate becomes accurate we can compute the epipolar constraint from this parameter as well and then use this constraint for those newly acquired features whose 3D locations is yet to be determined.

The combination of these techniques provides key assistance for the feature tracker. The predicted platform position and attitude, along with the current feature location estimate, allows the prediction of the pixel values at the next image occurrence. Thus the Kalman filter provides a natural linkage to the feature tracking process. This navigation feedback to the tracker not only greatly reduces the risk of false feature mismatches but it also reduces the search area thereby accelerating the tracking process. The more accurate the Kalman Filter

26 navigation estimates are, the more accurate the tracking assistance and thus the tracking process.

The benefits of the integration of the feature tracking and the Kalman Filter process are one of the most important aspects of the current invention. It not only improves accuracy and robustness of the navigation method but it also makes it possible to use significantly lower quality navigation and imaging components in the apparatus than in similar approaches of the prior art systems.

Figure 5A shows a process flowchart of the integration of feature tracking and navigation according to one embodiment of the present invention and shall be understood in conjunction with Figure 6 that shows a functional block diagram of a system employing the present invention. The Kalman filter 604 predicts ahead the platform states using the motion-sensor-derived information shown in Figure 5B. The predicted platform position and attitude, along with the current feature location estimate, allows the prediction of the pixel values at the next image occurrence 603. This information is provided to the feature estimation process. The search area for the next feature occurrence is steered by the Kalman filter. If the expected feature is not found for an established number of frames, then the feature is declared lost and its image templates and position estimates are archived 610.

As described above, one of the benefits in the navigation prediction of the invention is the improvement upon an imagery-only solution by predicting the camera translation and rotation motion between each image collection thereby avoiding a major cause of feature tracking failure often seen in the prior art systems. Now it is possible to tolerate a large angular motion between frames including features that leave and return to the field of view of the imager. The

27 residual feature motion on a subsequent frame is predominantly a result of estimation errors in the relative platform-to-feature position defined according to the epipolar geometry. This tight coupling between the navigation and feature tracking process offers major benefits both in feature track reliability as well as reduced search window size.

Management of Features in the Kalman Filter

The dynamic nature of tracking features requires the process of feature management for the Kalman Filter, which is illustrated in Figure 5A. The feature manager attempts to keep the number of features currently in the Kalman filter at a constant value.

If a feature cannot be detected on a series of N frames, then the feature is declared as lost. When a feature is lost, a new feature is selected 601. Attempts are made to re-acquire an archived feature 606 that is expected to be visible. The expected visibility is determined by the stored feature location, the current platform location, and the aspect of prior stored feature templates. That is, if the platform-to- feature line-of-sight is near a previous line-of-sight where a template is available from the archive, then this feature is declared to be tentatively visible. The feature tracker then attempts to acquire this archived feature. If successful, then the feature is re-inserted into the Kalman filter as an active feature 607. If no archived feature is available, or if no archived feature can be successfully acquired, then a new feature is inserted into the filter 608 and initialized to be at a location along the LOS ray at a pre-set range 609

The number of features in the Kalman filter can be set at 1 -20 in one embodiment. The upper limit to the number of features is dictated only by the processing capability in a system. All features may not be visible on a given frame, however, the complete covariance is maintained for all features kept in the Kalman filter.

Navigational Data Processing

For a rigid body moving in relation to the earth, the following equations of motion describe how position and attitude are computed from the acceleration and rotation rate.

r = v (2

^b — ^b (ω_b/l - C_e ω_e/l) (3

Where

v^e = velocity vector in the Earth-Centered Earth-Fixed (ECEF) coordinate system

r^e = position vector in the ECEF coordinates

Cl = Direction Cosine matrix from the boby-frame to the ECEF frame

f^h = specific force (non-gravitational acceleration) in body axes

(explicitly measured by the IMU accelerometer instruments) Ω_e = rotation rate of the earth

g^e = earth's gravity vector in the ECEF coordinates

29 ω_bh = body-to-inertial angular rotation rate in body axes (explicitly measured by the IMU rate gyro sensors)

ω_ell = earth-to-inertial angular rotation rate in ECEF axes

The resulting position and attitude are relative to an axes system fixed to the earth - and the equations account for the rotation of the earth. The invention requires such attention to detail for two reasons. First, we use rate gyros that may have the capability to sense the earth rotation rate. Second, we allow, as an optional input, the use of GPS which naturally provides measurements with respect to Earth-Centered-Earth- Fixed (ECEF) coordinates.

In the above computation process , the IMU measurements are passed into a strapdown navigation algorithm that is well known in the state of the art. The strapdown navigation algorithm results in computation for position, velocity, and attitude relative to the ECEF coordinates. This trajectory formed by the nonlinear differential equations presented above, forms the basis for a linearized Kalman Filter algorithm. The state vector for this Kalman filter may include errors in position, velocity, and attitude for the system relative to the strapdown navigation trajectory estimate.

Because of errors inherent in the IMU measurements and errors in intializing the strapdown navigation solution, the strapdown navigation solution will drift significantly from the true trajectory. Thus some external aid must be applied to maintain pose estimates close to the true solution. In traditional aided-IMU systems, the external aids may come from GPS or other independent navigation systems. In tradition vision-aided navigation such as stellar-aided navigation or DSMAC, aiding may come from known georeferenced attributes of the visual features such as known geodetic position or known geodetic aspect. For example, with stellar-aided navigation, the earth-referenced star sightings are known to be along a

30 prior-known line-of-sight. For DSMAC, the feature within the visual reference scene is known to be at a geodetic earth latitude, longitude, and altitude. In the present invention, unknown features are acquired and used as previously described. As shown in Figure 6, the central estimation processing uses an

Extended Kalman filter process 614. The Kalman filter process is governed by the following factors:

(1 ) A "whole value" strapdown algorithm 604 runs in parallel with the Kalman filter. The strapdown algorithm propagates position and attitude from the high-rate measurements of acceleration and rotation rate from the IMU 600.

(2) The Kalman filter determines errors with respect to the strapdown navigator.

(3) The Kalman filter state vector is based upon the strapdown instrument error model 607.

(4) The instrument errors are fed back into the strapdown navigation model 604 to prevent large drift leading to nonlinear effects.

(5) The Kalman filter uses partial derivatives of the nonlinear system dynamics with respect to the system states to form the propagation 605 and update 611 steps.

The external aiding is accomplished as measurement updates to the basic extended Kalman filter. These updates can be completely asynchronous from the IMU measurements but their temporal relationship must be precisely known. The vision processing provides feature measurements 610 that are represented by the pixel coordinates of the feature within the camera field- of-view. The Kalman filter requires that the definition of the mathematics

31 relating the pixel measurement to the position and attitude states ofthe platform and features 609. We model each feature as an X-Y-Z triplet in the ECEF space of the navigation process. Thus if there are a total of 10 features in effect, than we require a total of 30 new states within the Kalman filter 610. The formulation is flexible as to the number of features that are carried within the Kalman filter.

As illustrated in Figure 7, a feature 702 is a discrete point detected within an image 700 observed from the digital camera. The digital camera boresight 709 is approximately known relative the IMU axes set 701. This relationship may be rigidly fixed or the camera may be moved in a precisely known manner relative to the IMU axes. The pixel location 700 within an image frame that is tracked over time by the vision subsystem can provide motion information that will aid the strapdown navigation algorithm. It is very important to model the camera image collection process and the geometry of the camera 609 relative to the IMU axes - and to ensure that the temporal relationship of the camera frames are known with respect to the IMU samples.

The most critical aspect of the feature is that each feature must be stationary within the surrounding scene and, most importantly, the "feature correspondence" is correct. Feature correspondence implies that a feature observed in one frame corresponds to the same physical point as the feature observed in a prior frame. Features will be detected automatically, based upon the properties of the scene content, and thousands of features may beconsidered for processing by the Kalman filter. The feature processing 603 & 606 is conducted in parallel by the vision processor.

For the purposes of modeling the feature measurement, each feature is represented precisely by three components of position within an earth-fixed reference frame 704. Thus for each feature currently being

32 formally processed by the Kalman filter, there are three time-invariant unknowns. These unknowns are added to the state vector formed by the IMU instrument errors.

The Kalman filter process requires the propagation 605 of the covariance matrix of the state vector using methods that are well known in the state of the art. This propagation computational burden is typically assumed to grow with the cube of the state vector length. To avoid undue computational burden, the Kalman filter includes the ability of maintaining an adaptive state vector where feature states are archived and de- archived 613 according to their presence in the Field-Of-View (FOV) and the current features 610 in the state vector.

Features are "archived" as they are observed to leave the camera FOV. Feature archival implies storing the current feature position estimate, the feature component covariance matrix, feature component-to-platform correlation matrix, and the feature reference image template. The de- archival process replaces the feature within the Kalman filter formulation and resets the feature position and correlation properties.

The relationship between a Y, Z feature location in the imager sensor plane 700 (in pixels) and the unit vector u to the feature in camera axes is given by

We can also write, by inspection:

33 Y = l- u^c u_x ^c (5) ε ^}

Z : / u / u' (6)

The physical measurement of the feature location is made from interpretation of patterns formed by greyscale values from the CCD elements. This measurement process contains a random frame-to-frame noise associated with image processing techniques. Errors also result from the CCD physical layout, the CCD signal sampling process, and the lens/optics path where the light is presented to the CCD array. Nominally, the net resulting pixel space represents samples from a perfectly rectangular grid. The exact grid dimensions (in pixels) and the approximate pixel-spacing in physical dimensions are available from the manufacturer.

An empirical model is often used to represent the consistent measurement errors associated with the feature location on the focal plane. This model is given by

Y

Y'= Y+ ε_uY+ ε_uZ + Y\ κ {γ² + z²) + κ₂(γ² +z²)² 1 ¹ P1X (7)

2

1 Z

Z' = Z + ε ,Y + ε Z + Z\ K,(Y² + Z²) + K₂(Y² +Z²f (8)

\ 2

where

34 Y,Z = true physical displacement of the feature ray on the focal plane

T ,T = measured pixel count of the feature location

Y_P1X ,Z_PIX = number of pixels in the Y and Z dimensions

ε , ε₂₂ = error in the pixel spacing in both dimensions 706

ε_u ,ε₂ = non-rectangular pixel skew terms 707

K , K₂ = First- and second-order radial distortion terms 708

This describes the basic camera-based measurements for a point feature and are related to "error states" that are estimated within the filter. These error states are

• Errors (3) in the platform position 705 (δr_^e _p )

• Errors (3) in the feature position 704 (δr^e _f )

• Errors (3) in camera-to-IMU alignment 701 ( Δφ)

• Errors (3) in ECEF (e-frame) alignment ( Δθ )

To complete the definition of the Extended Kalman Filter for the camera update processing, we must linearize the camera measurements ( Y and Z) in terms of these error states. This linearization is defined by the following expressions

35 dY dY du dZ dZ du

(9) dδf_f du dδrf_j dδf_f du dδ^_f

Where, from Eq 5 and Eq 6,

dY _τ = {-(f/ε /uf (f/ε)/, (10) du

dZ _T = {-(f/ε u 0 (f/ε)/u (11)

The expressions for the necessary partial derivatives relating measurements to states is given by

dY dY

= u x (15) dΔφ du

36 It is evident to those skilled in the art that the similar partial derivatives for the "Z" feature measurement are identical to the above expressions except that "Y" is replaced with "Z".

These equations represent a complete 3D linearized model of the feature measurement process. The feature update within the Kalman filter is performed using these partial derivatives completely asynchronously with any optional GPS pseudorange or delta-pseudo-range processing. The extended Kalman filter feature update processing uses methods well known to the state of the art. Optional GPS and/or feature update processing can occur as they are available and the Kalman filter ensures optimal processing based upon the assumed statistics and math models.

In practice, we attempt to keep at least three reference features "active" from the archive of candidate features. We expand this active feature set to 10-20 (or more) if needed. An active feature will have its covariance properties maintained and updated within the Kalman filter whereas the feature archive may have thousands of candidate features in various stages of estimation accuracy available for use. A feature can also become sufficiently well localized so that its position is no longer updated by the Kalman filter - the feature then becomes a "landmark". When a feature standard deviation becomes sufficiently small, further attempts to update the feature may, in fact, lead to numerical instability in the filter computations. In some applications (e.g., robotic navigation of a closed circuit), we expect that a local scene may become "calibrated" in that its features have been sufficiently localized so that all are considered as landmarks.

37 The Kalman filter requires an initial estimate of the feature location at each occurrence of a new feature. This seemingly contradicts the notion that we have no a priori information regarding the feature set. Our solution is to initialize each feature as lying along a ray defined by the center of its template projected through the camera geometric model, and assuming a range along that ray. Typically, this range is assumed to have a value of 100m. The processing results are insensitive to this initial guess within a factor of about 10. This is, the true range may be from a distance anywhere between 10m and 1000m. An alternate solution is sometimes used where the Kalman filter processing is iterated by processing forward with an assumed initial range by several seconds. By observing the behavior of the mean-square Kalman filter residuals, we can infer that the initial estimate is poor. An iteration is performed by re-initializing the range with a different estimate until acceptable convergence is obtained. For general urban navigation situations, this feature is not required.

According to another embodiment, GPS is considered as an optional measurement for the vision data processing system in the invention. The GPS measurement update process treats each satellite measurement separately. Typically, there may be eight (or more) satellites in view and each satellite provides a code-based range measurement as well as a carrier phase-based delta-range measurement. Both the range and delta-range measurement include a bias error associated with the unknown parameters of the GPS receiver clock. The GPS-based Kalman filter process thus includes two measurement update steps for each in- view satellite at a once-per-sec rate.

Figure 8 shows a set of exemplary states that that include sensor error model parameters and dynamic platform motion parameters. These states are propagated by the Kalman filter process 614 of Figure 6 between asynchronous measurements and can be updated at each measurement

(Per Arthur: copied from the background portion in the original version) One of the key aspects of the present invention is that the pose of a platform can be determined in relationship to a surrounding scene without any advance knowledge of the scene and no special pre-surveyed targets placed in the scene. The invention is achieved by a tight integration of inertial navigation, image processing and photogrammetric processes. Furthermore, because of the use of the two distinct navigation modalities, the instruments that mechanize the separate navigation solutions can be calibrated as part of the navigation process so that very low-precision instruments can be used.

The invention described above obviates the limitations in the prior art systems by including a second motion sensing modality. The current invention requires the independent measurement of motion relating the camera pose at each imagery collection time. By integrating the frame-to- frame pose information, a pose time-history can be generated. An initialization of this solution uses an approximate pose derived either from reference features or from some external geodetic information. The independently-derived pose time-history will drift from the true pose unless it is updated by observing reference features fixed in the surrounding scene. The camera subsystem views its surroundings and imagery analysis methods are used to automatically select a feature set that will be suitable for navigation reference points. This automated process uses the nature of the feature within the context of its localized image characteristics and the spatial diversity of the feature set within the scene.

The feature set is tracked by the camera/image processing system through sequential image frames. Feature tracking is simplified because the

39 feature location at the next image point can be predicted by the independent navigation process and assumptions that the feature is stationary in the scene (with non-stationary features automatically detected and discarded). The measurement of the feature location within the camera Field-Of-View (FOV) is compared with the predicted value to provide converging estimates of the location of the selected features with the surrounding scene. As the platform moves around the scene, the features become successively better localized and the platform pose is determined relative to a coordinate system fixed with respect to the scene. Features from the feature set are occasionally lost from the camera FOV but their last-known properties (image reference templates, and location estimation) are retained in a database. The archived features can be reacquired and used again as reference features without a requirement for re-location. By navigating within the scene, the feature archive is built up and the navigation precision becomes more and more precise. Excellent camera position estimation in the horizontal plane can be recovered using only observation of a single feature.

The process described above contains redundant information about the platform pose dynamics derived from imagery and motion measurements. This redundancy is most useful in that additional information can be learned regarding the errors in the sensing components. By modeling the mathematics and statistics of errors in the accelerometers, rate gyros, and digital camera sensors, we can estimate their respective errors in parallel with the estimation of feature locations and the platform pose. This self-calibration objective is very important in this invention because many applications demand very small, low power, and low cost sensing components. These factors prevent the use of high-precision components. Our invention is insensitive to significant accelerometer and gyro error, misalignments between motion sensors and camera axes, and scale, skew, and radial distortion inherent in low-precision camera systems. This insensitivity is a

40 result of modeling these known error phenomenons and estimating the required model coefficients simultaneously with the pose information

It may be appreciated by those skilled in the art that the present invention, significantly apart from the prior art systems, introduces a system for automatically obtaining pose information of a platform including a motion sensing system and an imaging device. The motion sensing system and the imaging device are configured to work together to provide the pose information of the platform without a priori knowledge of the environment in which the platform navigates. The present invention has been described in sufficient detail with a certain degree of particularity. It is understood to those skilled in the art that the present disclosure of embodiments has been made by way of examples only and that numerous changes in the arrangement and combination of parts may be resorted without departing from the spirit and scope of the invention as claimed. Further the disclosed invention may be implemented in numerous ways, including a method, a system and a computer readable medium. Accordingly, the scope of the present invention is defined by the appended claims rather than the forgoing description of embodiments.

4 1 WORLD INTELLECTUAL PROPERTY ORGANIZAΉON

International Bureau

INTERNATIONAL APPLICATION PUBLISHED UNDER THE PATENT COOPERAΗON TREATY (PCT)

(51) International Patent Classification 7 (11) International Publication Number: WO 00/34803 G01S 17/00 A2

(43) International Publication Date: 15 June 2000 (15.06.00)

(21) International Application Number: PCT/US99/27483 (81) Designated States: JP, European patent (AT, BE, CH, CY, DE, DK, ES, FI, FR, GB, GR, IE, IT, LU, MC, NL, PT, SE).

(22) International Filing Date: 19 November 1999 (19.11.99)

Published

(30) Priority Data: Without international search report and to be republished

60/109,295 20 November 1998 (20.11.98) US upon receipt of that report. 60/110,016 25 November 1998 (25.11.98) US

(71) Applicants: GEOMETRIX, INC. [US/US]; 124 Race Street,

San Jose, CA 95131 (US). KORBIN SYSTEMS, INC. [US/US]; 924 Central Avenue, Ft. Walton Beach, FL 32547 (US).

(72) Inventors: KAIN, Jim; 501 Lexington Street, Waltham, MA

02452 (US). YATES, Charlie; 924 Central Avenue, Ft. Walton Beach, FL 32547 (US). ZWERN, Arthur; 2226 Coastland Avenue, San Jose, CA 95125 (US). FEJES, Sandor; 4859 Clydelle Avenue, San Jose, CA 95124 (US). CHEN, Jinlong; 4664 Cheeney Street, Santa Clara, CA 95054 (US). JABLONSKI, Marc; 3625A 17th Street, San Francisco, CA 94114 (US).

(74) Agents: ZHENG, Joe et al.; Geometrix, Inc., 124 Race Street, San Jose, CA 95131 (US).

(54) Title: VISION-ASSISTED CAMERA POSE DETERMINATION

(57) Abstract

A completely passive and self-contained system for determining pose information of a platform is disclosed. The system comprises a motion sensing device and an imaging device, both operating together in a known temporal relationship, so that each of the images generated from the imaging device corresponds to a set of motion data provided by the motion sensing device. In a preferred embodiment, the motion sensing device and the imaging device are integrated together and or operate synchronously. The imaging device senses the surrounding scene from which features are extracted and tracked to determine the imaging device motion. Hence no advance information regarding the scene or no special scene preparation is required. Further, a statistic estimation process, such as the Kalman filter, is employed to assist the feature tracking. To determine the pose information, the features and the motion data propagated by a strapdown navigation process are provided to the statistic estimation process. Errors from the statistic estimation process are used to refine the features and the motion data. As a result, the pose information outputted from the statistic estimation process is of high accuracy regardless the accuracy of the motion data and features as well as the associated equipment.

FOR THE PURPOSES OF INFORMATION ONLY

Codes used to identify States party to the PCT on the front pages of pamphlets publishing international applications under the PCT.

AL Albania ES Spain LS Lesotho SI Slovenia

AM Armenia FI Finland LT Lithuania SK Slovakia

AT Austria FR France LU Luxembourg SN Senegal

AU Australia GA Gabon LV Latvia SZ Swaziland

AZ Azerbaijan GB United Kingdom MC Monaco TD Chad

BA Bosnia and Herzegovina GE Georgia MD Republic of Moldova TG Togo

BB Barbados GH Ghana MG Madagascar TJ Tajikistan

BE Belgium GN Guinea MK The former Yugoslav TM Turkmenistan

BF Burkina Faso GR Greece Republic of Macedonia TR Turkey

BG Bulgaria HU Hungary ML Mali TT Trinidad and Tobago

BJ Benin IE Ireland MN Mongolia UA Ukraine

BR Brazil IL Israel MR Mauritania UG Uganda

BY Belarus IS Iceland MW Malawi US United States of America

CA Canada IT Italy MX Mexico UZ Uzbekistan

CF Central African Republic P Japan NE Niger VN Viet Nam

CG Congo KE Kenya NL Netherlands YU Yugoslavia

CH Switzerland KG Kyrgyzstan NO Norway ZW Zimbabwe

CI Cδte d'lvoire KP Democratic People's NZ New Zealand

CM Cameroon Republic of Korea PL Poland

CN China KR Republic of Korea PT Portugal

CU Cuba KZ Kazakstan RO Romania

CZ Czech Republic LC Saint Lucia RU Russian Federation

DE Germany LI Liechtenstein SD Sudan

DK Denmark LK Sri Lanka SE Sweden

EE Estonia LR Liberia SG Singapore CLAIMS

We claim:

1. A method for obtaining pose information of a platform including a motion sensing device and an imaging device, the method comprising: receiving from the imaging device a sequence of images of a surrounding scene when the platform navigates therein, the imaging device operating in a known temporal relationship with the motion sensing device that supplies motion data about the platform so that each of the images corresponds to a respective set of the motion data; and deriving the pose information of the platform, through an estimation process, with respect to features extracted from each of the images.

2. The method of claim 1 , wherein each of the features reflects certain characteristics of objects in the surrounding scene and the method further comprising processing at least one of the images to extract the features therefrom.

3. The method of claim 2 further comprising tracking successively the features in each of the images subsequent to the at least one processed image.

4. The method of claim 3 further comprising determining respective locations of the features with respect to a first coordinate space collaborating with the motion measurement device.

5. The method of claim 2, wherein each of the features is salient and least variant from one image to another in the sequence of images. 6. The method of claim 3, wherein the certain characteristics of objects are incorporated into said tracking the features so that said tracking the features can be operated more efficiently.

7. The method of claim 6, wherein the certain characteristics of objects include pre-defined characteristics of objects intentionally placed in the scene.

8. The method of claim 7, wherein the pre-defined characteristics includes at least one of (i) intersection of line-like features, (ii) verticality of line-like features, and (iii) color-based features.

9. The method of claim 2, wherein said processing at least one of the images to extract features of the objects comprises detecting each of the features by applying a salient feature operator, wherein the salient feature operator, when applied to the at least one of the images, emphases corner-like regions while suppressing edge-like and homogeneous regions in the at least one of the images.

10. The method of claim 9, wherein said feature operator is based on a function of the Hessian matrix comprising the Laplacian operator and performing on a smoothed version of the at least one of the images.

11.The method of claim 9, wherein said tracking successively the features in each of the images comprises detecting the features in each of the images subsequent to the at least one processed image along an epipolar line in accordance with the features extracted from the at least one processed image. 12. The method of claim 9, wherein said tracking successively the features in each of the images comprises predicting locations of the features in each of the images subsequent to the at least one processed image by using the estimation process so that a search area for each of the features in each of the subsequent images can be considerably focused.

13. The method of claim 12, wherein the estimation process is based on the Kalman Filter.

14. The method of claim 1 further comprising computing a set of parameters from the motion data using a strapdown navigation process.

15. The method of claim 14, wherein the estimation process is based on the Kalman Filter.

16. The method of claim 15, wherein the set of parameters include at least one of (i) position, (ii) velocity, and (iii) attitude information of the platform.

17. The method of claim 16, wherein each of the parameters is not necessarily precise.

18. The method of claim 17 further comprising updating the set of parameters with inputs from an external positioning system when the external positioning system becomes available.

19. The method of claim 18, wherein the external positioning system is the Global Positioning System (GPS). 20. The method of claim 19, wherein the parameters further include the features from the images.

21.The method of claim 20, wherein said deriving the pose information of the platform further comprises: obtaining an imaging model of the imaging device, the imaging model reflecting how each of the features is transformed from the objects in the surrounding to the images; and providing the parameters to the estimation process to estimate the pose information of the platform.

22. The method of claim 21 , wherein said obtaining an imaging model further comprises updating the imaging model upon receiving error data from the estimation process so that the imaging model is constantly refined.

23. The method of claim 1 , wherein the platform is selected from a group consisting of a vehicle, a plane, a boat, a human operator and a missile, wherein each in the group is equipped with the motion sensor integrated with the imaging device.

24. A method for obtaining pose information of a platform including a motion sensing device and an imaging device, the method comprising: generating from the imaging device a sequence of images of a surrounding in which the platform is navigating, the imaging device operating in a known temporal relationship with the motion sensing device providing motion data about the platform so that each of the images corresponds to a respective set of the motion data; processing at least one of the images to extract features reflecting certain characteristics of objects in the surrounding; tracking, following the extracted features, successively each of the features in the images subsequent to the at least one processed image; obtaining a set of parameters from the motion data though a strapdown navigation process; and deriving the pose information of the platform using an estimation process operating with the features as part of inputs thereto, wherein the estimation process is coupled to the strapdown navigation process and receives the parameters so that the pose information of the platform can be statistically estimated from the estimation process.

25. The method of claim 24, wherein the certain characteristics are incorporated into said processing so that said processing can be operated more efficiently.

26. The method of claim 24, wherein said processing at least one of the images to extract features comprises detecting each of the features by applying a feature operator.

27. The method of claim 26, wherein the feature operator is a salient feature operator, wherein the salient feature operator, when applied to the at least one of the images, emphases corner-like regions while suppressing edge-like and homogeneous regions in the at least one of the images.

28. The method of claim 25, wherein said tracking successively the features in each of the images comprises predicting locations of the features in each of the images subsequent to the at least one processed image by using the estimation process so that a search area for each of the features in each of the subsequent images can be considerably focused.

29. The method of claim 28, wherein the estimation process is based on the Kalman Filter.

30. The method of claim 24, wherein said tracking successively each of the features in the images comprises: maintaining a feature list including the extracted features; and updating the feature list every time one of the extracted features disappears in one of the subsequent images.

31. The method of claim 30, wherein said updating the feature list further comprises: processing the one of the subsequent images to extract new features to be inserted in the feature list so that the number of the features in the feature list can be maintained constant.

32. The method of claim 31 further comprising: determining location information of each of the features with respect to a coordinate space in which the motion sensing device operates; providing the motion data along with the location information to the estimation process to estimate errors of an imaging model, wherein the imaging model shows a mapping relationship from the objects to the features in the images; and refining the imaging model upon receiving the error data from the estimation process so that the imaging model has minimum errors.

33. The method of claim 32 further comprising: refining the location information of each of the features upon receiving the errors data from the estimation process so that the location information has minimum errors.

34. The method of claim 33, wherein the motion data include at least one of (i) rotational data and (ii) translational data about the platform.

35. The method of claim 24, wherein the motion data need not be precise and said deriving the pose information operates with the motion data along with the extracted features using the estimation process.

36. The method of claim 35, wherein the motion sensing device is a global positioning system (GPS) sensing device providing pseudorange and pseudorange rates from the imaging device to GPS satellites

37. The method of claim 35, wherein the motion sensing device is an inertial measurement unit (IMU) including sensor is at least one rate gyro and one accelerometer providing respectively rotational and translational data.

38. A system for obtaining pose information of a platform in a scene without any advanced knowledge of the scene, the system comprising: a motion sensing device providing motion data about the platform, wherein the motion sensing device is integrated to the platform; an imaging device integrated to and working in a known temporal relationship with the motion sensing device, the imaging device configured to generate a sequence of images of the scene, each of the images corresponding to one set of the motion data; a computing system, coupled to the motion sensing device and the imaging device, receiving the motion data and the images and comprising a processor and a memory space for storing code for an application module, the code, when executed by the processor, causing the application module to perform operations of: processing at least one of the images to extract features reflecting certain characteristics of objects in the surrounding; tracking, following the extracted features, successively each of the features in the images subsequent to the at least one processed image; obtaining a set of parameters from the motion data though a strapdown navigation process; and deriving the pose information of the platform using an estimation process operating with the features as part of inputs thereto, wherein the estimation process is coupled to the strapdown navigation process and receives the parameters so that the pose information of the platform can be statistically estimated from the estimation process.

39. The system of claim 38, wherein each of the features is extracted by using a feature operator in accordance with the certain characteristics of the features. 40. The system of claim 39, wherein each of the features is salient; and wherein the feature operator is a salient feature operator, the salient feature operator, when applied to the at least one of the images, emphases corner-like regions while suppressing edgelike and homogeneous regions in the at least one of the images.

41. The system of claim 40, wherein said tracking successively each of the features in the images comprises predicting locations of the features in each of the images subsequent to the at least one processed image by using the estimation process so that a search area for each of the features in each of the subsequent images can be considerably focused.

42. The system of claim 41 , wherein the estimation process is based on a Kalman Filter.

43. The system of claim 38, wherein said tracking successively each of the features in the images comprises: maintaining a feature list including the extracted features; and updating the feature list every time one of the extracted features disappears in one of the subsequent images.

44. The system of claim 43 wherein said updating the feature list further comprises: processing the one of the subsequent images to extract new features to be inserted in the feature list so that the number of the features in the feature list can be maintained a constant. 45. The system of claim 38, wherein the application module is further caused to perform operations of: determining location information of each of the features with respect to a coordinate space in which the motion sensing device operates; providing the motion data along with the location information to the estimation process to estimate errors of an imaging model, wherein the imaging model showing a mapping relationship from the objects to the features in the images; and refining the imaging model upon receiving the errors data from the estimation process so that the imaging model has minimum errors.

46. The system of claim 45, wherein the application module is further caused to perform operations of: refining the location information of each of the features upon receiving the errors data from the estimation process so that the location information has minimum errors.

47. The system of claim 46, wherein the motion data include at least one of (i) rotational data and (ii) translational data about the platform.

48. The system of claim 47, wherein the motion data need not be precise and said deriving the pose information operates with the motion data along with the extracted features using the estimation process.

Claims

EMI1.1 <tb>

<SEP> INTERNATIONAL <SEP> APPLICATION <SEP> PUBLISHED <SEP> UNDER <SEP> THE <SEP> PATENT <SEP> COOPERATION <SEP> TREATY <SEP> (PCT) <tb> (51) <SEP> International <SEP> Patent <SEP> Classification <SEP> 7 <SEP> : <SEP> (11) <SEP> International <SEP> Publication <SEP> Number: <SEP> WO <SEP> 00/34803 <tb> <SEP> G01S <SEP> 17/00 <SEP> A2 <tb> <SEP> (43) <SEP> International <SEP> Publication <SEP> Date: <SEP> 15 <SEP> June <SEP> 2000 <SEP> (15.06.00) <tb> (21) <SEP> International <SEP> Application <SEP> Number: <SEP> PCT/US99/27483 <SEP> (81) <SEP> Designated <SEP> States: <SEP> JP, <SEP> European <SEP> patent <SEP> (AT, <SEP> BE, <SEP> CH, <SEP> CY, <SEP> DE, <tb> <SEP> DK, <SEP> ES, <SEP> FI, <SEP> FR, <SEP> GB, <SEP> GR, <SEP> IE, <SEP> IT, <SEP> LU, <SEP> MC, <SEP> NL, <SEP> PT, <SEP> SE). <tb>

(22) <SEP> International <SEP> Filing <SEP> Date: <SEP> 19 <SEP> November <SEP> 1999 <SEP> (19.11.99) <tb> <SEP> Published <tb> (30) <SEP> Priority <SEP> Data <SEP> : <SEP> Without <SEP> international <SEP> search <SEP> report <SEP> and <SEP> to <SEP> be <SEP> republished <tb> <SEP> 60/109,295 <SEP> 20 <SEP> November <SEP> 1998 <SEP> (20.11.98) <SEP> US <SEP> upon <SEP> receipt <SEP> of <SEP> that <SEP> report. <tb>

<SEP> 60/110,016 <SEP> 25 <SEP> November <SEP> 1998 <SEP> US <tb> (71) <SEP> Applicants: <SEP> GEOMETRIX, <SEP> INC. <SEP> [US/US]; <SEP> 124 <SEP> Race <SEP> Street, <tb> <SEP> San <SEP> Jose, <SEP> CA <SEP> 95131 <SEP> (US). <SEP> KORBIN <SEP> SYSTEMS, <SEP> INC. <tb>

<SEP> US/US; <SEP> 924 <SEP> Central <SEP> Avenue, <SEP> Ft. <SEP> Walton <SEP> Beach, <SEP> FL <SEP> 32547 <tb> <SEP> (US). <tb>

(72) <SEP> Inventors: <SEP> KAIN, <SEP> Jim; <SEP> 501 <SEP> Lexington <SEP> Street, <SEP> Waltham, <SEP> MA <tb> <SEP> 02452 <SEP> (US). <SEP> YATES, <SEP> Charlie; <SEP> 924 <SEP> Central <SEP> Avenue, <SEP> Ft. <tb>

<SEP> Walton <SEP> Beach, <SEP> FL <SEP> 32547 <SEP> (US). <SEP> ZWERN, <SEP> Arthur; <SEP> 2226 <tb> <SEP> Coastland <SEP> Avenue, <SEP> San <SEP> Jose, <SEP> CA <SEP> 95125 <SEP> (US). <SEP> FEJES, <tb> <SEP> Sandor; <SEP> 4859 <SEP> Clydelle <SEP> Avenue, <SEP> San <SEP> Jose, <SEP> CA <SEP> 95124 <SEP> (US). <tb>

<SEP> CHEN, <SEP> Jinlong; <SEP> 4664 <SEP> Cheeney <SEP> Street, <SEP> Santa <SEP> Clara, <SEP> CA <tb> <SEP> 95054 <SEP> (US). <SEP> JABLONSKI, <SEP> Marc; <SEP> 3625A <SEP> 17th <SEP> Street, <SEP> San <tb> <SEP> Francisco, <SEP> CA <SEP> 94114 <SEP> (US). <tb>

(74) <SEP> Agents: <SEP> ZHENG, <SEP> Joe <SEP> et <SEP> al.; <SEP> Geometrix, <SEP> Inc., <SEP> 124 <SEP> Race <SEP> Street, <tb> <SEP> San <SEP> Jose, <SEP> CA <SEP> 95131 <SEP> (US). <tb>

(54) <SEP> Title: <SEP> VISION-ASSISTED <SEP> CAMERA <SEP> POSE <SEP> DETERMINATION <tb> <SEP> 600 <SEP> 60o > <SEP> g6M <SEP> sou <tb> <SEP> rate <tb> <SEP> error <tb> <SEP> Hardware <SEP> accel <SEP> Navigation <SEP> I <SEP> el'ror <SEP> ¯. <SEP> ¯ <tb> <SEP> vig <SEP> model <tb> <SEP> O <tb> <SEP> y. <SEP> ¯.--¯¯¯¯¯¯¯¯¯¯¯¯. <SEP> ¯¯¯¯¯¯¯¯¯ <SEP> ¯¯¯¯¯¯¯¯¯¯¯. <SEP> ¯¯¯¯¯¯¯¯¯¯. <SEP> ¯¯¯¯¯-¯¯¯¯¯. <SEP> ¯¯¯¯¯¯¯¯¯¯¯¯ <SEP> ¯¯¯¯¯¯¯¯.. <SEP> ¯¯. <SEP> ¯¯¯ <tb> <SEP> 605 <SEP> 60B <SEP> 611 <tb> <SEP> KF <SEP> propagate <SEP> Meement <SEP> KF <SEP> Update <SEP> i"er <tb> <SEP> KF <SEP> propagate <tb> <SEP> & . <SEP> & <SEP> < t4 <SEP> Extended <SEP> Kahmm <SEP> Fitter------ <SEP> :

<SEP> :---------------" <tb> <SEP> camera <tb> <SEP> V1S10I1 <SEP> 609 <SEP> 6W <SEP> Calibiation <tb> <SEP> Camera <SEP> Processor <SEP> Feature <SEP> Camera <SEP> Camera <tb> <SEP> Hardware <SEP> & <SEP> Feature <SEP> Manager <SEP> model <SEP> Error <SEP> model <tb> <SEP> Tracker <SEP> , <SEP> 1 <tb> <SEP> Pose-tagged <SEP> Aid <SEP> to <SEP> feariue <SEP> Current <SEP> feahue <tb> <SEP> fPose-tagged <SEP> lAjd <SEP> to <SEP> < <SEP> L <SEP> wImage <SEP> Archive <SEP> feature <SEP> Archive <SEP> database <tb> <SEP> Optimal <SEP> pose <SEP> estimates <tb> (57) <SEP> Abstract <tb> <SEP> A <SEP> completely <SEP> passive <SEP> and <SEP> self-contained <SEP> system <SEP> for <SEP> determining <SEP> pose <SEP> information <SEP> of <SEP> a <SEP> platform <SEP> is <SEP> disclosed.

<SEP> The <SEP> system <SEP> comprises <tb> a <SEP> motion <SEP> sensing <SEP> device <SEP> and <SEP> an <SEP> imaging <SEP> device, <SEP> both <SEP> operating <SEP> together <SEP> in <SEP> a <SEP> known <SEP> temporal <SEP> relationship, <SEP> so <SEP> that <SEP> each <SEP> of <SEP> the <SEP> images <tb> generated <SEP> from <SEP> the <SEP> imaging <SEP> device <SEP> corresponds <SEP> to <SEP> a <SEP> set <SEP> of <SEP> motion <SEP> data <SEP> provided <SEP> by <SEP> the <SEP> motion <SEP> sensing <SEP> device. <SEP> In <SEP> a <SEP> preferred <SEP> embodiment, <tb> the <SEP> motion <SEP> sensing <SEP> device <SEP> and <SEP> the <SEP> imaging <SEP> device <SEP> are <SEP> integrated <SEP> together <SEP> and/or <SEP> operate <SEP> synchronously.

<SEP> The <SEP> imaging <SEP> device <SEP> senses <SEP> the <tb> surrounding <SEP> scene <SEP> from <SEP> which <SEP> features <SEP> are <SEP> extracted <SEP> and <SEP> tracked <SEP> to <SEP> determine <SEP> the <SEP> imaging <SEP> device <SEP> motion. <SEP> Hence <SEP> no <SEP> advance <SEP> information <tb> regarding <SEP> the <SEP> scene <SEP> or <SEP> no <SEP> special <SEP> scene <SEP> preparation <SEP> is <SEP> required. <SEP> Further, <SEP> a <SEP> statistic <SEP> estimation <SEP> process, <SEP> such <SEP> as <SEP> the <SEP> Kalman <SEP> filter, <SEP> is <SEP> employed <tb> to <SEP> assist <SEP> the <SEP> feature <SEP> tracking.

<SEP> To <SEP> determine <SEP> the <SEP> pose <SEP> information, <SEP> the <SEP> features <SEP> and <SEP> the <SEP> motion <SEP> data <SEP> propagated <SEP> by <SEP> a <SEP> strapdown <SEP> navigation <tb> process <SEP> are <SEP> provided <SEP> to <SEP> the <SEP> statistic <SEP> estimation <SEP> process. <SEP> Errors <SEP> from <SEP> the <SEP> statistic <SEP> estimation <SEP> process <SEP> are <SEP> used <SEP> to <SEP> refine <SEP> the <SEP> features <SEP> and <SEP> the <tb> motion <SEP> data. <SEP> As <SEP> a <SEP> result, <SEP> the <SEP> pose <SEP> information <SEP> outputted <SEP> from <SEP> the <SEP> statistic <SEP> estimation <SEP> process <SEP> is <SEP> of <SEP> high <SEP> accuracy <SEP> regardless <SEP> the <SEP> accuracy <SEP> of <tb> the <SEP> motion <SEP> data <SEP> and <SEP> features <SEP> as <SEP> well <SEP> as <SEP> the <SEP> associated <SEP> equipment. <tb>

FOR THE PURPOSES OF INFORMATION ONLY Codes used to identify States party to the PCT on the front pages of pamphlets publishing international applications under the PCT.

AL Albania AM Armenia AT Austria AU Australia AZ Azerbaijan BA Bosnia and Herzegovina BB Barbados BE Belgium BF Burkina Faso BG Bulgaria BJ Benin BR Brazil BY Belarus CA Canada CF Central African Republic CG Congo CH Switzerland CI CBte d'Ivoire CM Cameroon CN China CU Cuba CZ Czech Republic DEGermany DK Denmark EE Estonia ES Spain FIFinland FRFrance GA Gabon GB United Kingdom GE Georgia GHGhana GN Guinea GR Greece HU Hungary IE Ireland IL Israel IS and IT Italy JP Japan KE Kenya KG Kyrgyzstan KP Democratic People's Republic of Korea KR Republic of Korea KZ Kazakstan LC Saint Lucia Ll Liechtenstein LK Sri Lanka LR Liberia LS Lesotho LT Lithuania LULuxembourg LV Latvia MC Monaco MD Republic of Moldova MG Madagascar MK The former Yugoslav Republic of Macedonia ML Mali MN Mongolia MRMauritania MW Malawi MX Mexico NE Niger NL Netherlands NO Norway NZ New Zealand PL and PT

Portugal RORomania RU Russian Federation SD Sudan SE Sweden SG Singapore Sl Slovenia SK Slovakia SN Senegal SZSwaziland TD Chad TG Togo TJTajikistan TMTurkmenistan TR Turkey TT Trinidad and Tobago UA Ukraine UG Uganda US United States of America UZUzbekistan VN Viet Nam YU Yugoslavia ZW Zimbabwe