WO2011048497A2

WO2011048497A2 - Computer vision based hybrid tracking for augmented reality in outdoor urban environments

Info

Publication number: WO2011048497A2
Application number: PCT/IB2010/002885
Authority: WO
Inventors: Wee Teck Fong; Soh Khim Ong; Andrew Yeh-Ching Nee
Original assignee: National University Of Singapore
Priority date: 2009-10-19
Filing date: 2010-10-18
Publication date: 2011-04-28
Also published as: WO2011048497A3

Abstract

A computer vision based hybrid tracking system integrates computer vision, global positioning services and inertial sensing to provide robust, accurate and jitter- free augmentation in outdoor environments. The hybrid tracking system operates in two phases: configuration phase followed by tracking phase. In configuration phase, the hybrid tracking system selects a reference frame from an input video sequence captured by a camera and detects feature signatures of each planar surface of objects in the reference frame. In tracking phase, the hybrid tracking system uses the selected reference frame and detected feature signatures from configuration phase to estimate 3D positions and orientations of objects in the video sequence relative to camera motion.

Description

Computer Vision Based Hybrid Tracking for Augmented Reality in

Outdoor Urban Environments

CROSS REFERENCE TO RELATED APPLICATIONS

[0001] This application claims the benefit of U.S. Provisional Application No.

61/252,726, filed October 19, 2009, entitled "Markerless 6DOF Computer Vision Tracker for 3D Augmentation of Patterned Planar Surfaces," which is incorporated by reference in its entirety.

BACKGROUND

[0002] This invention relates to computer augmented reality, and in particular to a hybrid tracking system and method using computer vision, global positioning services (GPS) and inertial sensing for augmented reality in outdoor environments.

[0003] Augmented Reality (AR) can be broadly defined as the splicing of virtual elements onto real world so that both virtual elements and real world can be perceived by users at the same time. A typical augmented reality system allows a user to view a real scene augmented by virtual elements (e.g., computer graphics generated from both virtual object and real object in the real scene). For example, an AR system uses a camera (e.g., webcam) to capture a video of one or more planar objects and computes three-dimensional (3D) orientations and positions of the objects relative to the camera. The orientations and positions of the planar objects are used by the AR system to render computer generated 3D graphics onto the captured video so that the graphics appear to be attached to the real world scene of the video and move along as the video scene changes.

[0004] In order to have correct alignment of the virtual elements with the real scene, an AR system needs to estimate the position and orientation of a camera that captures the real scene. The position and orientation of a camera is commonly referred to as "pose" of the camera and the process of estimating the pose of a camera is called "tracking." In the context of computer vision, tracking refers to camera pose estimation through a video sequence. One conventional scheme of tracking is based on the features points of moving objects in a video sequence using multiple feature points for correspondences to the reference view of the object to calculate the pose. [0005] An outdoor AR system requires a tracking system to operate under a wide range of environmental conditions and motion. Robustness, precision, low jitter and ease of use are important requirements for satisfactory augmented reality, and present significant challenges in uncontrolled environments. Furthermore, the tracking system is often required to operate without modifications to the outdoor environment, where the conventional tracking technologies for AR have to rely on the natural properties of the environment to perform the tracking.

[0006] One conventional outdoor AR tracking system uses GPS and a magnetic compass for registering buildings. It is used for navigation and display of interesting information about buildings. Due to the limited computational power of mobile systems and the ineffectiveness of conventional computer vision tracking, the AR tracking system is limited to the accuracy of GPS.

[0007] Another conventional computer vision based AR tracking method uses Fourier- based two-dimensional (2D) image registration to accurately augment missing parts of outdoor environment (e.g., archaeological sites). However, the method limits the users to stand at several predetermined locations to view the augmented buildings and is not truly mobile. Other existing outdoor AR tracking systems uses textured 3D models, inertial and computer vision tracking to improve the performance of outdoor AR tracking. These AR tracking systems suffer from poor performance scalability because the map searching of correspondence between consecutive video frames of a real scene in these methods are not scaled up and the tracking needs support from a reliable mobile positioning systems (e.g., GPS).

SUMMARY OF THE INVENTION

[0008] According to an embodiment of the invention, a computer-implemented method is provided for determining a pose (e.g., position and orientation) of an object of a video frame relative to camera motion. In one embodiment, object pose determination comprises selecting a reference frame comprising a plurality of objects and computing a signature for each keypoint feature of the plurality of the objects, where the signature of a keypoint feature is a descriptor of the keypoint feature. The method divides the reference frame into multiple sub-grids, computes an average gradient of each sub-grid and selects sub-grids with average gradient greater than a threshold value for pose estimation. The method further comprises extracting keypoint features of each video frame of a video sequence and determining a set of matches for each video frame of the video sequence based on signatures of the extracted keypoint features of the video frame ant he signatures of the keypoint features of at least one of the plurality of the objects of the reference frame. Based on the set of matches, the method estimates the pose of at least one of the plurality of the objects.

BRIEF DESCRIPTION OF THE DRAWINGS

[0009] Fig. 1 shows a block diagram of a computer vision based hybrid tracking system, in accordance with an embodiment of the invention.

[0010] Fig. 2 is a flow diagram of steps performed by configuration module shown in Fig. 1.

[0011] Fig. 3 shows selection of planar surfaces for feature matching based on GPS position and North-East-Down (NED) orientation of a camera.

[0012] Fig. 4 is a flow diagram of a tracking method for augmented reality, in accordance with an embodiment of the invention.

[0013] Fig. 5 shows a block diagram of an augmented reality system using the computer vision based hybrid tracking system shown in Fig. 1, in accordance with an embodiment of the invention.

[0014] Fig. 6 is an illustration of results from a computer vision based hybrid tracking with rotation and scale changes in a real scene, in accordance with an embodiment of the invention.

[0015] Fig. 7 is an illustration of augmentation onto various types of surfaces from a computer vision based hybrid tracking, in accordance with an embodiment of the invention.

[0016] The figures depict various embodiments of the invention for purposes of illustration only. One skilled in the art will readily recognize from the following discussion that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles of the invention described herein.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

[0017] Referring to Fig.1 for purposes of explanation, embodiments of the invention provide a computer vision based hybrid tracking system 100 to track poses (i.e., position and orientation) of a camera 102 without using any visual markers. The hybrid tracking system 100 makes use of the strengths of computer vision, inertial and GPS tracking technologies, which satisfy the requirements of accuracy, robustness, low-jitter and ease of use for an AR system in outdoor environments. For example, inertial systems are robust and self contained. Current human-scale inertial measurement systems are capable of providing accurate orientation relative to Earth local level, but have large positional drifts. In contrast, standalone GPS positions do not drift, but have errors in the order of tens of meters and have high jitter. The weakness of GPS positioning can be compensated by using computer vision algorithms, which have high accuracy and low jitter. However, computer vision based tracking is not as robust and has high computational load. The hybrid tracking system 100 uses inertial and GPS technologies to initialize conditions so as to reduce the computational load, improve the robustness of computer vision based tracking, and to enable automated initialization of computer vision based tracking with improved usability.

[0018] The hybrid tracking system 100 comprises a configuration module 200 and a computer vision based (CV-based) hybrid tracker 300. The hybrid tracking system 100 operates in two phases: configuration phase by the configuration module 200 followed by tracking phase by the CV-based hybrid tracker 300. The configuration module 200 is configured to select a reference frame from an input video sequence 104 captured by a camera 102 and detect feature signatures of each planar surface of objects in the reference frame. The CV-based hybrid tracker 300 uses the selected reference frame and detected feature signatures from the configuration module 200 to perform the tracking with the input video sequence 104.

CONFIGURATION MODULE

[0019] The configuration module 200 is configured to select a reference frame from an input video sequence 104 captured by the camera 102. In one embodiment, the configuration module 200 receives a reference frame selected by a user of the camera 102 and stores the user selected reference frame for feature matching. For example, the user selects a reference frame by selecting a video frame and placing the planar surface to be tracked within a square of 192x 192 pixels that is positioned at the centre of the video frame. The plane of the planar surface should be as parallel to the camera image plane as possible for performance efficiency.

[0020] In response to the user selection of the reference fame and placement of the object to be tracked (i.e., the 192x192 pixels square), the configuration module 200 divides the square into 24x24 pixels sub-grids. The configuration module 200 computes the average image gradient within each sub-grid, and selects only those sub-grids where the gradient is above 10 grey levels per pixel to be used in tracking. For example, the tracking module 330 of the CV-based hybrid tracker 300 uses the image gradient to construct the Jacobian matrices used in second-order minimization of matching errors from feature matching.

Experimental observation shows that image regions with low gradients do not contribute additional information for second-order minimization convergence, and in certain cases causes convergence towards the wrong minima. To aid the user in selecting surfaces with high image gradient, only sub-grids with sufficient gradient are rendered during the selection process. This provides an indication of the suitability of a surface for tracking with minimized matching errors.

[0021] After the selection of sub-grids, the user moves the camera 102 so that the normal of the planar surface of objects in a scene can be obtained using decomposition of

homography, which can be computed using a second-order minimization algorithm. In the formulation of the homography, the normal vector is defined in the reference frame, and does not change as the camera moves. In one embodiment, the normal vector that does not change when the sideway motion is greater than 0.5% of the perpendicular distance between the camera and the plane, is chosen as the normal of the planar surface. After the normal vector in the reference frame has been obtained, subsequent homography decomposition can be performed using a computer vision algorithm, which gives only one set of rotation matrix and translation vector required for the augmentation of virtual 3D objects.

[0022] For outdoor tracking, the distances encountered can be more than a hundred meters for building facades, which in turn means a sideway motion of more than half a meter. To check the accuracy of the normal vector, the hybrid tracking system 100 can track the objects in the reference frame with a virtual 3D object augmented onto the planar surface after the normal vector is determined. This allows the user to visually check the accuracy of the normal vector with camera motion.

[0023] The configuration module 200 further detects and extracts reliable keypoint features of the reference frame of the input video sequence 104. In a video sequence, moving objects are often represented by their detected feature points, which are referred to as keypoint features throughout the description. For example, the keypoint features of a video frame may include features such as points, lines, and circles of objects in the video frame. In one embodiment, the configuration module 200 is configured to detect keypoint features of objects in a video frame by a feature detector, which relies on an intensity threshold to determine whether a pixel in the video frame is a corner-like feature.

[0024] Generally, high intensity threshold values limit the features to the ones with high contrast, while low intensity threshold values may cause an excessive number of features to be extracted. The configuration module 200 uses adaptive thresholding to control the numbers of features detected in the video frame. For example, the configuration module 200 divides the entire video frame into smaller grids (e.g., 8x8 pixels blocks) and controls the number of features detected in each grid through adjusting the threshold. If there are too many features in a grid, the threshold is increased in proportion to the number of features in excess of a predetermined target value. The threshold is reduced when the number of features is low. For example for an 8 bit (256 grey-levels) camera, the configuration module 200 uses a minimum threshold of 25, which prevents spurious features from being detected due to noise. This adaptive thresholding enables feature matching under varying contrast and illumination conditions, which is important for effective computer vision based tracking in outdoor environments.

[0025] The configuration module 200 further computes keypoint signatures from the extracted keypoint features. A keypoint signature is a descriptor of corner- like features, which are rotation and scale invariant. In one embodiment, the signature is generated using machine learning techniques and can be computed efficiently. The keypoint signatures form a framework for object detection and pose estimation to be used by the configuration module 200 and the CV-based hybrid tracker 300. Furthermore, the keypoint signature is highly discriminative, allowing for higher confidence in feature matching and eliminating the need for conventional 3D models with poor geometric consistency of feature matching.

[0026] The keypoint signature, s, is a vector where each element, Sj, is the response for the z^'-th base class in a training set of a set of binary features (e.g., the generic Ferns), for a keypoint. For each incoming image, the signature of image keypoints, si, are computed once and matched against the signatures of objects keypoints, so. Due to real time constraints, the number of Ferns used is limited. This results in significant noise in s and an increased number of false positives, which limits the effectiveness of keypoint correspondence using traditional nearest neighbors algorithms.

[0027] To overcome the ineffectiveness of keypoint correspondence described above, the configuration module 200 computes probabilities of peaks for each detected feature's keypoint signature of the reference frame. It is noted that for a keypoint signature, the probability of occurrence, p of the z^'-th base class in the set of /c-largest values, ¾, is more stable and discriminative than using magnitudes of the keypoint signature. Due to signature noise, there is variation of the base classes in ¾ of a keypoint, for different perspective projections. For each keypoint, certain base classes occur within ¾ with high probability, and this is effective for discriminating between keypoints. As only the ordering of ¾ is required, normalization of s, is not necessary, thus further reducing the computational load.

[0028] In one embodiment, the configuration module 200 uses the probability of the peaks occurring within the 15 largest peaks of the keypoint signatures of one feature undergoing appearance changes due to camera motion. These probabilities can be obtained through training (e.g., keypoint recognition using generic Ferns). For example, the configuration module 200 uses 16 generic Ferns for training the base classes with 14 random point pair tests for each Fern. The multiple bins for orientation and the random point pair tests of the Ferns are pre -rotated so that they are performed relative to the dominant orientation of the feature. This reduces the number of Ferns required to achieve the same accuracy where no orientation information is used. The size of planar patch surrounding a feature is chosen to be 41x41 pixels and 15,000 homographies are used for training the base classes.

[0029] Turning now to Fig. 2 for the functionality of the configuration module 220, Fig. 2 is a flow diagram of steps performed by configuration module 200 shown in Fig. 1. Initially, the configuration module 200 receives 202 a video frame of an input video sequence and divides 204 the video frame into sub-grids and display the sub-grids of the video frame. The configuration module 200 computes 206 the average gradient of each sub-grid of the video frame and presents the video frame to a user for reference frame selection. Based on the computed gradients of the sub-grids of the video frame, the user determines 208 whether to select the video frame as a reference image for tracking. Responsive to the user selecting the video frame as a reference image, the configuration module 200 stores 210 the frame as a reference image and detects 212 the keypoint features of objects in the selected reference image. To have efficient keypoint feature matching, the configuration module 200 computes 214 the probabilities of peaks for each detected feature's keypoint signature.

COMPUTER VISION BASED HYBRID TRACKER

[0030] The CV-based hybrid tracker 300 uses the selected reference frame and the detected feature signatures from the configuration module 200 to perform the tracking with the input video sequence 104. In the embodiment illustrated in Fig. 1, the CV-based hybrid tracker 300 comprises an initialization module 310, a feature matching module 320, a tracking module 330 and a relocalization module 340. Other embodiments of the hybrid computer vision tracker 300 may include additional module, e.g., an augmentation module, and different modules.

[0031] The initialization module 310 is configured to initialize current tracking session including estimating the 3D positions and orientations of the planar objects with respect to the camera that captured the video fame. In one embodiment, initialization takes place when the hybrid tracking system 100 is first started. For efficient tracking in outdoor environments, the initialization module 310 is designed to recover accurate camera positions using distinct planar surfaces patches within a limited search radius, which is defined by the currently estimated GPS positions of the camera. The initialization module 310 further uses the North-East-Down (NED) orientation of the camera measured by Inertial Measurement Units (IMU) to further reduce the search set by eliminating planar surfaces that are not visible from the current camera orientation. The GPS positions and NED orientations of a camera can also be used to determine the positions and orientations of objects during training phase.

[0032] The availability of miniature Micro-Electro-Mechanical Systems (MEMS) inertial sensors, such as accelerometer and gyroscope chips, enables the creation of low weight and low power IMU. Such units are generally robust to external interferences, except for temperature changes, and have low latency and jitter. However, the performance of MEMS inertial sensors suffers from positional drifts observed in MEMS-based IMU (e.g., the random zero bias drifts of MEMS gyroscopes). This is because of the requirement of obtaining the position measurements of the inertial sensors in the Earth local level, or the NED frame. As the IMU is of the strap-down configuration, the axes of the accelerometers are not aligned to NED. This requires the measurement of the orientation of the

accelerometers with respect to NED. The IMU can be used to provide reliable NED orientation of a camera by independently measuring the gravity vector and the Earth magnetic north vector respectively using the accelerometers and magnetometers.

[0033] Referring to Fig. 3, Fig. 3 shows selection of planar surfaces for feature matching based on GPS position 320 and NED orientation 310 of a camera. The GPS position 320 and the expected GPS error 330 are used to define a circular region encompassing the possible positions of the user of the camera. All planar patches 340a-f with reference camera GPS positions within this region are tentatively included in the search set. A planar patch corresponds to an observation of a locally planar surface in a 3D scene. The current NED orientation of each of the planar patches 340a-f measured using the IMU is used to reduce the search set by eliminating surfaces where the NED orientation of the surface normal 340 is greater than 45° from the current NED orientation. Based on the GPS positions and NED orientations of the planar surfaces, the initialization module 310 selects the planar patches 340a and 340b, where the GPS positions of the selected patches are within the defined circular region and the NED orientations of the selected patches are smaller than 45° from the NED orientation of surface normal.

[0034] The feature matching module 320 is configured to find matches between features tracked through a video sequence using keypoint signatures. For each video frame of a video sequence, s* of each image keypoint, s^t, is computed. For matching each object keypoint, only the logarithms of its p_{ corresponding to base classes in _k are summed to obtain a response, r. This is equivalent to a multiplication of the probabilities and object keypoints with large r are considered potential matches. Thus, signature matching is cast as a statistical classification problem instead of the ^-nearest neighbors. In one embodiment, the recommended value for k is 15, which keeps computational time low as ten p_t are added for each object keypoint during matching. It is observed that larger values of k do not increase the rate of finding correct matches as more unstable peaks are included.

[0035] In ideal cases, for each image keypoint, values of r for all object keypoints form a sharp peak at the correct match. Due to keypoint similarity and signature noise, the object keypoint with the largest r is not necessarily the correct match. It is observed that the correct match occurs within the four highest r values for 80% of the time. Therefore, four object keypoints with the highest r for each image keypoint, subject to a threshold, t\, are retained as potential matches. To reduce false positives, potential matches with r less than a second threshold, t₂, above the mean r are rejected. This ensures that potential matches are peaks, similar to the ideal case, as certain image keypoints have been observed to generate large r for many object keypoints. Refinement of potential matches described here is optional for the computer vision based hybrid tracking system 100.

[0036] During matching process, the signature of an image feature in a video frame currently being processed is computed, and the top fifteen peaks are found. The fifteen base classes corresponding to these 15 peaks are used for matching the object keypoint signatures in a database. For each object feature to be matched, the probabilities obtained from training for the 15 base classes are multiplied together. The object feature with the largest multiplied probability is considered as the best match because both the video frame and object feature have a high probability of sharing the same set of peaks.

[0037] The matching module 320 is further configured to estimate object pose (i.e., position and orientation) from the potential feature matches. In one embodiment, the matching module 320 uses Random Sample Consensus (RANSAC) estimation of

homography of the detected features. The homography of the detected features correctly model the perspective transformation of the detected features due to camera motion. In one embodiment, the minimum number of inliers for RANSAC is set at 15 to ensure that the initial pose is geometrically consistent. This threshold is chosen because ten inliers are typically observed when the planar surface is not present. As feature matching is imperfect, the number of features can be approximately 30 to 100 for RANSAC to be effective. [0038] The tracking module 330 is configured to continuously track object poses to obtain the 3D position and orientation for augmenting the virtual objects by frame-by-frame pose refinement. In one embodiment, the tracking module 330 uses efficient second-order minimization (ESM) algorithm to refine the initial poses of the potential surfaces obtained using RANSAC. ESM is computationally efficient and has a convergence region that is sufficient for a majority of the estimated poses and refined poses from the previous frame. Therefore, relatively slow keypoint matching is avoided except after an ESM tracking failure. With a suitable parameterization of small motion about the current camera pose, the ESM can iteratively converge to the camera pose that gives the minimal image error between the reference and warped images. As a large number of pixels are used in an efficient manner, the end result is highly accurate and jitter- free 3D augmentation. This scheme can be extended to the tracking of non-planar surfaces using suitable transformation models, such as the tri-focal tensor transfer.

[0039] After initialization, the tracking module 330 continuously tracks the detected surfaces using ESM. Surfaces with an average pixel error below a pre-defined threshold of 20 are considered to have their pose accurately determined. Tracking failures are detected when the average pixel error goes above the threshold of 20. Recently lost surfaces are given the highest priority for feature matching, which reduces or decays with time. The feature matching module 320 performs feature matching in the background to detect new surfaces. The initialization module 310 continuously tracks GPS positions and inertial measurement to speed up recovery from complete tracking failure.

[0040] The formulation of ESM tracking in terms of sub-grids also improves the tolerance to illumination changes and partial occlusion. In reality, illumination is rarely constant. In one embodiment, the tracking module 330 uses an illumination model for adjusting the pixel intensities in the warped image. A reference image is divided into sub- grids, where illumination changes are applied equally within. Illumination changes are estimated directly from the warped and reference images because the predicted pose is close to the current one during ESM tracking. For illumination changes, the mean and standard deviation of the pixel intensities within each warped sub-grid is adjusted to match those of the corresponding reference sub-grid.

[0041] Specifically, the illumination change is modeled as follows. Let 1) be the intensity of pixel i in the sub-grid j for the warped image. Let ni_j and d_j be the mean and standard deviation of the pixel intensities in the sub-grid j in the warped image, and m and d be the corresponding values for the reference image. The modified pixel intensity, Γ , is obtained using the illumination model shown in Equation (1) below:

The proposed illumination model equalizes ni_j and m , as well as d_j and d .

[0042] There are several advantages using the proposed illumination model. First, the model accuracy is high. This is because both the mean illumination and the spread of values within a sub-grid are adjusted for the proposed illumination model instead of a single scaling factor in the conventional discrete illumination model. Second, the computational load is reduced significantly as parameters are directly estimated without the use of large sparse Jacobian matrices. Third, the detection of occlusion is improved. In a conventional illumination model, parameters can be over adjusted within ESM to compensate for intensity errors caused by occlusion until they reach normal error levels, and this complicates occlusion detection. For the proposed illumination model, over adjustment is avoided as parameters are directly obtained from the images. Occluded sub-grids are simply those with error levels above a predetermined threshold.

[0043] This simple illumination model is found to produce an average pixel error of less than three grey levels between the warped and reference images. As both the transformation and illumination models are accurate, the occlusion of a sub-grid can be simply detected when its average pixel error is above a pre-defined threshold, which is set to 25 in one embodiment.

[0044] The relocalization module 340 is configured to recover tracking failures of planar surfaces processed by the tracking module 330, where tracking failures are detected when the average pixel error goes above the threshold value of 20. The relocalization module 340 gives recently lost surfaces highest priority for feature matching. In one embodiment, the relocaliation module 340 repeats the operations performed by the initialization module 310, the feature matching module 320 and the tracking module 330 except that the lost surfaces are given higher priority for tracking failure recovery.

HYBRID COMPUTER VISION TRACKING SYSTEM OPERATION

[0045] Fig. 4 is a flow diagram of a tracking method for augmented reality performed by the CV-based hybrid tracker 300. Initially, a camera captures 402 a video sequence containing a plurality objects for tracking. The hybrid tracker 300 receives 404 a video frame of the video sequence and determines 406, for each object of the video frame, whether the object was tracked in a previous frame. Responsive to the object not being tracked before, the hybrid tracker 300 obtains 408 keypoint features of the video frame, and computes 410 signatures of the detect keypoint features. The hybrid tracker 300 further finds 412 matches between the signatures of keypoint features of the video frame and the object and estimates 414 the poses of surfaces of the video frame based on the matches. The hybrid tracker 300 checks 416 whether the pose estimation is successful (e.g., comparing the average pixel errors the surface with a threshold value). Responsive to failure of the pose estimation, the hybrid tracker 300 flags 418 the object as not being tracked and decreases re -tracking priority of the object to recover the tracking failure. With tracking operations performed by step 406 through step 416 over the frames of the video sequence, the re-tracking priority of objects identified by failed pose estimation is reduced to a minimum level because it is very unlikely to re-track these identified objects in video frames.

[0046] On the other hand, responsive to the object being tracked in the previous frame (i.e., the "YES" path from step 406 determination), the hybrid tracker 300 obtains 420 the pose information of the object from the previous frame. The hybrid tracker 300 determines 422 whether to eliminate surfaces which are more likely to produce accurate pose estimation by comparing the average pixel intensity error of the surface against a threshold value.

Responsive to a positive determination (i.e., the "YES" path), the hybrid tracker 300 augments the object onto the video frame and fetches the next video frame for tracking. Responsive to the inertial measurement of the surface is larger than the threshold value, the hybrid tracker 300 flags 426 the object as not being tracked and increases re-tracking priority of the object to recover the tracking failure.

[0047] To further illustrate the operations of the CV-based hybrid tracker 300, following is pseudo-code for the tracking process performed by the CV-based hybrid tracker 300 in one embodiment:

Before tracking (one time process)

Step 1 : obtain generic Ferns using randomly chosen keypoints for all objects;

Step 2: for each object, generate logarithms of p_{ for all object keypoints using 500 random warps.

During tracking

For each incoming image (video frame):

Step 1 : extract image keypoints;

Step 2: computes for all image keypoints. For each object:

Step 3: for each image keppoint, computes response r of the object keypoints and retain potential matches;

Step 4: estimate pose of the object using RANSAC;

If RANSAC is successful:

Step 5 : perform pose refinement using ESM;

Step 6: augment the object onto the video frame.

[0048] Fig. 5 shows a block diagram of an augmented reality system using the computer vision based hybrid tracking system 100 shown in Fig. 1. The augmented reality system comprises a video camera 510 to capture a video sequence 520 to be tracked by the hybrid tracking system 100. The hybrid tracking system 100 receives the video sequence 520, selects a reference fame and detects feature signatures of each planar surfaces of objects in the reference frame by the configuration module 200. Using the detected feature signatures of the planar surfaces, the CV-based hybrid tracker 300 tracks the objects of the video sequence and determines the 3D positions and orientations of the objects with respect to the camera motion. The information of the 3D positions and orientations of the objects allows a user to use the information for applications such as augmented reality, where the information is used to render computer-generated virtual objects onto the real objects of the video sequence without markers. For example, the augmentation module 530 is configured to use the 3D positions and orientations of the objects determined by the hybrid tracking system 100 to generate an output video sequence 540 with augmentation.

[0049] In one implementation of the computer vision based hybrid tracking system 100, the base classes for training the feature keypoints (e.g., generic Ferns) consist of 256 corner keypoints randomly selected from five sample images with a rich variety of keypoints. Using 256 corner keypoints enables base classes to be indexed using a single 8-bit byte without significant impact on tracking performance. A reference frame is selected for the ESM tracking, where the surface plane of the selected reference frame is set approximately parallel to the camera image plane. The probability for object keypoints are obtained from this reference image using 500 random warps. The maximum number of image keypoints per frame is limited to 1000, and the maximum time for obtaining the keypoint signatures is set to 30 milliseconds.

[0050] Fig. 6 is an illustration of results from the computer vision based hybrid tracking with rotation and scale changes in a real scene, in accordance with an embodiment of the invention. Fig. 6 demonstrates a scenario where a user moves around an apartment building and attempts to augment a computer-generated teapot object onto the various locations in the outdoor environment.

[0051] In Fig. 610(a), the computer-generated teapot is augmented onto one side of the apartment building shown in the image. The hybrid computer vision tracking system 100 tracks the keypoint features of the image (e.g., the selected planar surfaces of the apartment building) and accurately estimates the positions and orientations of the objects in the image with respect to camera motion. The hybrid tracking system 100 uses the information of the positions and orientations of the objects in the image to position the teapot onto the image. Fig. 610(b) shows the teapot augmented onto the same image with camera rotation and scale changes.

[0052] Fig. 7 is an illustration of augmentation onto various types of surfaces from the computer vision based hybrid tracking, in accordance with an embodiment of the invention. For example, Fig. 710(a) shows a teapot augmented onto a surface of a sign image, and Fig. 710(b) shows the teapot augmented onto a surface of an image of road marking. Fig. 710(c) shows the teapot augmented onto the same image of Fig. 710(b) with different camera rotation and scale. Similarly, Fig. 710(d) shows the teapot augmented onto the same image of Fig. 710(b) with different camera orientation.

APPLICATIONS OF HYBRID COMPUTER VISION TRACKING SYSTEM

[0053] The computer vision based hybrid tracking system 100 integrates GPS, inertial and computer vision tracking systems, where their complementary properties are combined to achieve robust, accurate and jitter- free augmentation. Comparing with conventional tracking systems, the hybrid tracking system 100 has advantages in terns of markerless operation, low jitter with high accuracy, robustness to illumination changes and partial occlusion and high computational efficiency.

[0054] Most commercial and conventional computer vision based trackers use markers to aid the tracking. The CV-based hybrid tracker 300 does not require any marker and it can track and augment virtual objects and media onto planar surfaces that have varied patterns and images. The training time to recognize the patterned planar surface is short and can be optimized in less than a minute.

[0055] Existing trackers typically exhibit high jitter. This is highly undesirable for augmented reality as the virtual objects appear to vibrate or jump about, as opposed to the real objects which are stable. This destroys the realism of the augmentation and reduces the utility of AR for many applications where stability of augmentation is critical. The CV-based hybrid tracker 300 is highly stable with high accuracy due to the use of second-order optimization (e.g., using ESM for pose refinement).

[0056] The computer vision based hybrid tracking system 100 uses a new illumination model, which allows the CV-based hybrid tracker 300 to continue with accurate tracking even if there are general lighting changes, shadows and particular glare. The hybrid tracking system 100 can continue tracking even if part of a planar surface is not visible. This is due to the improved accuracy of the illumination model used by the hybrid tracking system 100, which allows for easy detection of occluded portions of the surface and omit them from further computation.

[0057] The computer vision based hybrid system 100 further optimizes feature matching using keypoint signature and pose refinement using ESM to achieve high computational efficiency. For keypoint signature, the features in a current image are matched to those in the database using a method that allows for fewer computations while maintaining the accuracy. For ESM, the new illumination model is more efficient to compute, while at the same time more accurate.

[0058] The advantages provided by the computer vision based hybrid tracking system 100 allows the system to be easily applied to a wide range of applications that requires 3D positions and orientations of a moving planar surface. For example, in addition to single object tracking described above, the hybrid tracking system 100 is able to track multiple independently moving planar surfaces simultaneously. In the area of AR, the hybrid tracking system 100 can be applied to applications of human computer interface (e.g., static map augmented with dynamic information obtained from the tracking system 100), entertainment and advertisement, design visualization and mobile augmentation in outdoor urban environment. Other non-AR applications, such as precision navigation and robotics, can also use the hybrid tracking system 100 to measure precise positions and orientations of a camera relative to the planar surface being tracked.

[0059] The methods and techniques described herein can be performed by a computer program product and/or on a computer-implemented system. For example, to perform the steps described, appropriate modules are designed to implement the method in software, hardware, firmware, or a combination thereof. The invention therefore encompasses a system, such as a computer system installed with appropriate software, that is adapted to perform these techniques for creating soft shadows. Similarly, the invention includes a computer program product comprising a computer-readable medium containing computer program code for performing these techniques for creating soft shadows, and specifically for determining an extent to which an area light source is occluded at a particular shading point in an image.

[0060] The foregoing description of the embodiments of the invention has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Persons skilled in the relevant art can appreciate that many modifications and variations are possible in light of the above teaching. It is therefore intended that the scope of the invention be limited not by this detailed description, but rather by the claims appended hereto.

Claims

What is claimed is:

1. A computer-implemented method for determining a pose of an object of a video frame relative to camera motion, the method comprising:

selecting a reference frame, the reference frame comprising a plurality of objects; computing a signature for each keypoint feature of the plurality of the objects of the reference frame, wherein the signature of a keypoint feature comprises a descriptor of the keypoint feature;

extracting keypoint features of each video frame of a video sequence captured by a camera, the video sequence comprising a plurality of video frames;

determining a set of matches for each video frame of the video sequence based on signatures of the extracted keypoint features of the video frame and the signatures of the keypoint features of at least one of the plurality of the objects in the reference frame; and

estimating the pose of at least one of the plurality of the objects based on the set of matches, the pose of an object comprising a three-dimensional position and orientation of the object relative to a motion of the camera.

2. The method of claim 1, further comprising:

refining the estimated pose of the object using a matching error minimization scheme.

3. The method of claim 2, wherein the matching error minimization scheme is efficient second-order minimization algorithm.

4. The method of claim 1, wherein selecting the reference frame comprises: receiving a video frame from the video sequence, the video sequence comprising a plurality of video frames;

dividing the video frame into a plurality of sub-grids, wherein each sub-grid is a portion of the video frame;

computing an average gradient of each sub-grids of the video frame; and selecting sub-grids with average gradient greater than a threshold value for pose estimation.

5. The method of claim 4, wherein the threshold value is 10 gray levels per pixel of the sub-grid.

6. The method of claim 1, wherein computing a signature for each keypoint features of the plurality of the objects comprises:

detecting the keypoint features of the plurality of the objects of the reference

frame, wherein the keypoint features are corner-like features; and computing a signature for each detected corner-like feature, wherein the signature for a corner-like feature is a response for the base class of the corner-like feature in a training set of binary features.

7. The method of claim 1, wherein determining the set of matches for each video frame of the video sequence comprises:

computing probabilities of a pre-determined number of peaks for each signature of keypoint features of the reference frame;

computing probabilities of a pre-determined number of peaks for each signature of keypoint features of the video frame;

summing the logarithms of the probabilities of each signature of keypoint features of the video frame to generate a matching response; and determining whether the keypoint feature of the video frame matches a keypoint feature of the reference frame based on the matching response.

8. The method of claim 7, wherein the pre-determined number of peaks is 15.

9. The method of claim 1, further comprising:

determining a search set of camera positions while capturing the video sequence using distinct planar surface patches within a limited search radius, a planar surface patch corresponding to an observation of a planar surface in a 3D video scene.

10. The method of claim 9, further comprising:

reducing the search set of camera positions using camera orientation information to eliminate planar surfaces not visible from current camera position.

11. The method of claim 1 , further comprising:

modeling illumination changes over the video frames of the video sequence,

wherein the illumination changes are estimated directly from the video frame and warped image of the video frame.

12. The method of claim 1, further comprising:

detecting pose estimation failure of an object; and

responsive to pose estimation failure, flagging the object for further feature

matching.

13. The method of claim 1, further comprising:

generating a modified video sequence, wherein the modified video sequence

comprises a computer generated graphic augmented onto each video frame of the video sequence, and the position of the augmented graphic on each video frame is based on the estimated pose of an object of the video frame.

14. A computer program product for determining a pose of an object of a video frame relative to camera motion, the computer program product comprising a non-transitory computer-readable storage medium containing computer program code for performing the operations:

15. The computer program product of claim 14, further comprising computer program code for refining the estimated pose of the object using a matching error

minimization scheme.

16. The computer program product of claim 15, wherein the matching error minimization scheme is efficient second-order minimization algorithm.

17. The computer program product of claim 14, wherein the computer program code for selecting the reference frame comprises computer program code for:

receiving a video frame from the video sequence, the video sequence comprising a plurality of video frames;

18. The computer program product of claim 17, wherein the threshold value is 10 gray levels per pixel of the sub-grid.

19. The computer program product of claim 14, wherein the computer program code for computing a signature for each keypoint features of the plurality of the objects comprises computer program code for:

20. The computer program product of claim 14, wherein the computer program code for determining the set of matches for each video frame of the video sequence comprises computer program code for:

summing the logarithms of the probabilities of each signature of keypoint features of the video frame to generate a matching response; and determine whether the keypoint feature of the video frame matches a keypoint feature of the reference frame based on the matching response.

21. The computer program product of claim 14, further comprises computer program code for determining a search set of camera positions while capturing the video sequence using distinct planar surface patches within a limited search radius, a planar surface patch corresponding to an observation of a planar surface in a 3D video scene.

22. The computer program product of claim 21, further comprises computer program code for reducing the search set of camera positions using camera orientation information to eliminate planar surfaces not visible from current camera position.

23. The computer program product of claim 14, further comprises computer program code for modeling illumination changes over the video frames of the video sequence, wherein the illumination changes are estimated directly from the video frame and warped image of the video frame.

24. The computer program product of claim 14, further comprises computer program code for:

detecting pose estimation failure of an object; and

responsive to pose estimation failure, flagging the object for further feature

matching.

25. The computer program product of claim 14, further comprising computer program code for generating a modified video sequence, wherein the modified video sequence comprises a computer generated graphic augmented onto each video frame of the video sequence, and the position of the augmented graphic on each video frame is based on the estimated pose of an object of the video frame.

26. A computer system for determining a pose of an object of a video frame relative to camera motion, the system comprising:

a non-transitory computer-readable storage medium storing executable computer program modules comprising:

a configuration module configured to:

select a reference frame, the reference frame comprising a plurality of objects; and

compute a signature for each keypoint feature of the plurality of the

objects of the reference frame, wherein the signature of a keypoint feature comprises a descriptor of the keypoint feature; and

a computer vision based hybrid tracker configured to: extract keypoint features of each video frame of a video sequence

captured by a camera, the video sequence comprising a plurality of video frames;

determine a set of matches for each video frame of the video sequence based on signatures of the extracted keypoint features of the video frame and the signatures of the keypoint features of at least one of the plurality of the objects in the reference frame; and estimate the pose of at least one of the plurality of the objects based on the set of matches, the pose of an object comprising a three-dimensional position and orientation of the object relative to a motion of the camera; and

a processor for executing the computer program modules.

27. The system of claim 26, wherein the computer vision based hybrid tracker is further configured to refine the estimated pose of the object using a matching error minimization scheme.

28. The system of claim 27, wherein the matching error minimization scheme is efficient second-order minimization algorithm.

29. The system of claim 26, wherein the configuration module is further configured to:

receive a video frame from the video sequence, the video sequence comprising a plurality of video frames;

divide the video frame into a plurality of sub-grids, wherein each sub-grid is a portion of the video frame;

compute an average gradient of each sub-grids of the video frame; and select sub-grids with average gradient greater than a threshold value for pose estimation.

30. The system of claim 26, wherein the configuration module is further configured to:

detect the keypoint features of the plurality of the objects of the reference frame, wherein the keypoint features are corner-like features; and compute a signature for each detected corner-like feature, wherein the signature for a corner-like feature is a response for the base class of the corner-like feature in a training set of binary features.

31. The system of claim 26, wherein the computer vision based hybrid tracker is further configured to:

compute probabilities of a pre-determined number of peaks for each signature of keypoint features of the reference frame;

compute probabilities of a pre-determined number of peaks for each signature of keypoint features of the video frame;

sum the logarithms of the probabilities of each signature of keypoint features of the video frame to generate a matching response; and

determine whether the keypoint feature of the video frame matches a keypoint feature of the reference frame based on the matching response.

32. The system of claim 26, wherein the computer vision based hybrid tracker is further configured to determine a search set of camera positions while capturing the video sequence using distinct planar surface patches within a limited search radius, a planar surface patch corresponding to an observation of a planar surface in a 3D video scene.

33. The system of claim 32, wherein the computer vision based hybrid tracker is further configured to reduce the search set of camera positions using camera orientation information to eliminate planar surfaces not visible from current camera position.

34. The system of claim 26, wherein the computer vision based hybrid tracker is further configured to:

detect pose estimation failure of an object; and

responsive to pose estimation failure, flag the object for further feature matching.

35. The system of claim 26, further comprising an augmentation module configured to generate a modified video sequence, wherein the modified video sequence comprises a computer generated graphic augmented onto each video frame of the video sequence, and the position of the augmented graphic on each video frame is based on the estimated pose of an object of the video frame.