GB2549940A - Discovering points of interest and identifying reference images in video processing and efficient search and storage therefor

Info

Publication number
GB2549940A
GB2549940A · GB1607577.2A · GB201607577A
Authority
GB
United Kingdom
Prior art keywords
frame
reference image
sub
scene
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB1607577.2A
Other versions
GB201607577D0 (en)
Inventor
John M Williams
Ono Tomohiro
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kudan Ltd
Original Assignee
Kudan Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kudan Ltd
Priority to GB1607577.2A priority Critical patent/GB2549940A/en
Publication of GB201607577D0 publication Critical patent/GB201607577D0/en
Priority to PCT/EP2017/059431 priority patent/WO2017186578A1/en
Publication of GB2549940A publication Critical patent/GB2549940A/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • G06T7/74Determining position or orientation of objects or cameras using feature-based methods involving reference images or patches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00Manipulating 3D models or images for computer graphics
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00Manipulating 3D models or images for computer graphics
    • G06T19/006Mixed reality
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/174Segmentation; Edge detection involving the use of two or more images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/62Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/20Scenes; Scene-specific elements in augmented reality scenes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20021Dividing image into blocks, subimages or windows
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20112Image segmentation details
    • G06T2207/20164Salient point detection; Corner detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V10/464Salient features, e.g. scale invariant feature transforms [SIFT] using a plurality of salient features, e.g. bag-of-words [BoW] representations

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Computer Graphics (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

Discovering points of interest in a video or scene by selecting a first sub-portion 220 of a first frame and performing feature extraction to provide a first set of features 211, 212. Feature extraction is performed on a second sub-portion 230 of a second frame to identify a second set of features 213, while tracking the first set of features in the second frame. The method may comprise tracking the second set of features in a third frame and performing feature extraction on a third sub-portion 240 in the third frame to provide a third set of features 215. The method may comprise dividing the image into N sub-portions. Also claimed is a method of identifying reference images by performing feature extraction, identifying a reference image, sending the reference image identity to a server and receiving a description of the reference image from the server (Figs. 1 and 9).

Description

DISCOVERING POINTS OF INTEREST AND IDENTIFYING REFERENCE IMAGES IN VIDEO PROCESSING AND EFFICIENT SEARCH AND STORAGE THEREFOR
Field of the invention [0001] The invention relates to augmented reality in which video elements (e.g. elements of a live direct or indirect view of a physical, real-world environment) are augmented or supplemented by computer-generated video or graphical input. In particular, it relates to a method, an apparatus and a computer program product for discovering points of interest of a marker in a sequence of video frames, e.g. for purposes of superimposing graphical input on an object represented by the marker.
Background to the invention [0002] Augmented reality is a form of video processing by which real images captured by a camera are overlaid with virtual images (or additional data), or virtual images (stored in memory or created by computer-generated imagery (CGI)) are overlaid with real images.
[0003] Applications for augmented reality (AR) are many and varied. They include, for example, creation of publicity video in which a virtual image of a product (e.g. a car) is superimposed into a real image of a scene (e.g. a customer's forecourt) to enable visualization of the product in situ. By AR, many aspects of a video (e.g. a live video in real time) can be altered.
[0004] Typically AR involves a marker or reference image in a real scene upon which a substitute image will be projected - i.e. the reference image will assume some characteristic of or be entirely replaced by the substitute image.
[0005] In early forms of AR, a highly identifiable marker was placed into the scene, for example a sheet of white paper with a cross on it, or just a sheet of white paper with four identifiable corners. With more recent AR processes, almost anything can be chosen as a marker provided that the AR software has prior knowledge of what the marker is - i.e. a reference image is stored in memory. In this way, a reference image may be used that already exists in a scene - for example a dollar bill lying on a table or a double-decker bus moving along a road.
[0006] EP2597623A2 describes a method for providing augmented reality service in a mobile terminal using a marker. Real information acquired from the surrounding environment is matched to a marker and the marker is supplemented by augmented information.
[0007] If complex reference images are to be used (or even if a simple marker is used), there is a high processing burden in finding the reference image in the scene. For example, a simple black-on-white cross on paper may call for a limited amount of processing, but a complex reference image that is itself an image of a real object may require identification of many points of interest (POIs).
[0008] A reference image with many POIs can be image processed as a 3D item in a 3D environment. E.g. many POIs beyond the mere corners of the reference image may be used. It can be a moving object and can turn around three axes (pitch, roll and yaw). It can also bend and fold. The image to be superimposed in place of the reference image can move, turn, bend and fold like the reference image.
[0009] Discovering points of interest for a reference image is a processor-intensive operation. It is not unusual, for example, to have 1000 POIs in a reference image that may have to be discovered. It may not be necessary to use all the POIs, but 100 POIs is a useful number to work with. Searching for 1000 POIs and identifying the best 100 matches may take several seconds. This can be an irritating delay. It is sufficiently long as to frustrate a user and put in question whether the process is operating properly. If the user is unsure whether the process is operating, he or she may move the camera relative to the scene to try again, either with a different reference image or returning to the same reference image. Movement of the camera while trying to discover a reference image may delay the process further.
[0010] A further problem with existing processes for discovering reference images is the amount of memory required. A reference image with 1000 POIs is a very large file. It may include, for example, the feature descriptors of the POIs and their relative locations.
[0011] There is a need for a process of discovering POIs for a reference image that is less processor intensive or can discover a required number of POIs in less time.
[0012] Separately, there is a need for a process of discovering POIs that takes up less memory.
Summary of the Invention [0013] In accordance with a first aspect of the invention, a method of discovering points of interest (POIs) of a reference image in a scene is provided, comprising: (a) selecting a first sub-portion of a first frame and performing feature extraction for the sub-portion to provide a first set of features; (b) selecting a second sub-portion of a second frame and performing feature extraction for the second sub-portion to provide a second set of features; and (c) tracking the first set of features in the second frame.
[0014] Step (b) is preferably repeated for a third frame, a third sub-portion and a third set of features. The first and second sets of features are tracked in the third frame, for example using frame-to-frame motion vectors. The first, second and third sub-portions may make up the entire scene.
[0015] More generally, the scene may be divided into N sub-portions and, after repeating for frame 2, step (b) may be repeated for frames 3 to N and sub-portions 3 to N, extracting further features. In the meantime, previously discovered features are tracked in frames 3 to N.
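By way of a non-limiting illustration, the following Python sketch shows one way steps (a) to (c), repeated over N sub-portions, might be arranged. The patent does not prescribe any particular library or algorithm; ORB detection and Lucas-Kanade optical flow are used here only as stand-ins for the feature extraction and tracking steps, and the function and parameter names are assumptions made for the example.

```python
import cv2
import numpy as np

def discover_over_frames(frames, n_strips=3):
    """frames: iterable of same-size greyscale (uint8) images."""
    orb = cv2.ORB_create(nfeatures=500)              # stand-in feature extractor
    tracked = np.empty((0, 1, 2), dtype=np.float32)  # feature locations carried forward
    prev = None
    for i, frame in enumerate(frames):
        h, w = frame.shape[:2]
        # (c) track features found in earlier frames into this frame
        if prev is not None and len(tracked) > 0:
            nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev, frame, tracked, None)
            tracked = nxt[status.ravel() == 1].reshape(-1, 1, 2)
        # (a)/(b) extract features in the i-th vertical strip only
        if i < n_strips:
            mask = np.zeros_like(frame)
            x0, x1 = (i * w) // n_strips, ((i + 1) * w) // n_strips
            mask[:, x0:x1] = 255
            kps = orb.detect(frame, mask)
            if kps:
                new = np.float32([kp.pt for kp in kps]).reshape(-1, 1, 2)
                tracked = np.concatenate([tracked, new], axis=0)
        prev = frame
    return tracked  # accumulated feature locations after the N-th frame
```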
[0016] Sub-portions may be selected in accordance with a direction of relative movement of the scene and a camera capturing the scene. In particular, the scene may be separated into sub-portions in a direction of movement of the camera relative to the scene.
[0017] The scene may be separated into vertical sub-portions, with the sub-portions being selected from left to right when the camera is moving from left to right relative to the scene (and vice-versa when moving from right to left), or it may be separated into horizontal sub-portions with the sub-portions being selected from bottom to top when the camera is moving upwards relative to the scene (and vice-versa when moving downwards). Alternatively, the scene may be separated into annular sub-portions. When the camera moves inwards relative to the scene, the sub-portions are selected from outermost to innermost, and vice-versa when moving outwards.
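The ordering just described might be expressed, purely illustratively, as in the following sketch. The layout labels, the sign convention (dx > 0 taken to mean the image content moves rightwards in the frame, dy > 0 downwards, index 0 being the leftmost, topmost or innermost sub-portion) and the zoom flag are assumptions made for the example, not details taken from the patent.

```python
def subportion_order(dx, dy, n=3, zoom=None):
    """Return an ordered list of (layout, index) pairs for the n sub-portions."""
    if zoom == "in":                       # camera zooming in: outermost ring first
        return [("annular", k) for k in range(n - 1, -1, -1)]
    if zoom == "out":                      # camera zooming out: innermost ring first
        return [("annular", k) for k in range(n)]
    if abs(dx) >= abs(dy):                 # mostly horizontal motion: vertical strips
        order = range(n) if dx > 0 else range(n - 1, -1, -1)
        return [("vertical", k) for k in order]
    # mostly vertical motion: horizontal strips, selected in the direction of motion
    order = range(n) if dy > 0 else range(n - 1, -1, -1)
    return [("horizontal", k) for k in order]
```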
[0018] Motion above a predetermined threshold may be detected, e.g. using a motion detector such as a gyroscope, or by frame-to-frame image motion analysis. Template matching (or other POI discovery) may be inhibited until motion has slowed to a lower threshold. This allows a "settling down" of the image to a relatively stable state before commencing template matching. It has been found that delaying the start of template matching can, paradoxically, speed up completion of the process.
[0019] In accordance with a second aspect of the invention, a device with augmented reality capability is provided, comprising: means for capturing or receiving image frames; means for selecting a first sub-portion of a first frame and performing feature extraction for the sub-portion and for selecting a second sub-portion of a second frame and performing feature extraction for the second sub-portion; and means for tracking features from the first frame to the second frame.
[0020] In accordance with a third aspect of the invention, a method of discovering points of interest (POIs) of a reference image in a scene is provided, comprising: searching for locally stored feature descriptors in a captured frame or part thereof (e.g. by a process of template matching); identifying a reference image from feature descriptors found; sending the identity of the reference image to a remote memory such as a server and receiving a description of the reference image from the remote memory.
[0021] Thus, a method of performing augmented reality in a portable device is provided, comprising: capturing an image; performing feature extraction (and/or template matching against a set of locally stored templates to discover a set of POIs); identifying a reference image from feature descriptors found; sending an identifier of the reference image to a remote memory; receiving a description of the reference image from the remote memory; and using the reference image description to track the reference image from frame to frame and replace it or supplement it with alternative graphical input.
[0022] In this way, the local memory of the device need only store the feature descriptors and some means (e.g. a histogram or other summary) of relating feature descriptors discovered to reference images. The local memory need not store the entire description of the reference image. E.g. it need not store the relative positions of the POIs in the reference image. This and any other additional information can be downloaded when a reference image has been identified.
[0023] A given feature descriptor may be applicable to different reference images.
[0024] The description of the reference image received is a complete description including relative positions of all POIs in the reference image and can be used for tracking POIs from frame to frame.
[0025] A device and system with augmented reality capability are provided. The device comprises: means for capturing or receiving image frames; means for performing feature extraction, in a captured frame or part thereof, for feature descriptors stored in the memory and identifying a reference image therefrom; means for sending a summary of extracted features to a server; and means for receiving a description of the reference image from the server. The system additionally has a server for storing descriptions of reference images and for returning to the device a description of a reference image identified to the server by the device. The server may store many reference image descriptions and returns a selected reference image description, selected according to an identifier given to it by the device.
Brief Description of the Drawings [0026] Fig. 1 is an overview of a system implementing embodiments of the invention.
[0027] Figs. 2a, 2b and 2c are schematic representations of a static image captured by a camera across three frames.
[0028] Fig. 3 shows representations of POIs and a reference image as stored in memory.
[0029] Fig. 4 shows the image of Fig. 2a with an alternative subdivision.
[0030] Figs. 5a, 5b and 5c show the image of Fig. 2a moving from frame to frame.
[0031] Figs. 6, 7 and 8 show alternatives to the arrangements of Figs. 2 and 4.
[0032] Fig. 9 is a flow diagram illustrating a process carried out by elements of the system of Fig. 1.
Detailed Description of Preferred Embodiments [0033] Referring to Fig. 1, an image capture device 100 is shown, for example in the form of a mobile phone or tablet computer or the like, having a camera 101 and a screen 102. The device has a processor 105 and memory 108 and optionally other sensors 110, for example a gyroscope. The device 100 communicates wirelessly with an access point or base station 120, which is connected to a server 130 via a network 140.
[0034] In an augmented reality application, one of the tasks of the processor 105 is to identify, in an image captured by the camera 101, a reference image identified or stored in the memory 108. This is described with reference to Fig. 2a.
[0035] In Fig. 2a, a scene 200 captured by camera 101 is shown. Let it be supposed that in the scene there is an image 210 which matches a reference image stored in memory 108. The task of the processor 105 is to identify points of interest (POIs - see points 211, 212, 213 and 215) on the captured image 210. If these POIs can be identified (discovered), it can be verified that the image 210 matches the reference image. Moreover, when the sub-image has successfully been matched against the reference image, it can be used as a "marker" onto which an alternative image can be projected from the memory 108.
[0036] For example, the image 200 may be that of a house with a forecourt. The reference image may be a car sitting on the forecourt. If the reference image is successfully identified, it can be substituted with an alternative image from the memory, for example a different car, or the same car with a different colour, etc. As another example, the image 210 may be moving, for example it may be a bus, and its image is to be augmented by projecting an advertisement on the side of the bus. As another example, it may be a banknote, and in augmented reality it may be folded into a paper aeroplane.
[0037] To perform augmented reality on the image (e.g. to augment or supplement the image), the POIs in the image need to be discovered and tracked.
[0038] The process of discovering the POIs 211, 212, 213 and 215 involves performing feature extraction on the image 200. Feature extraction is a known process, and is computationally complex. It involves detecting and isolating various desired portions or shapes (features) of the image. It typically includes edge detection and corner detection, but can also involve blob detection, ridge detection and detection of other features. It is typically scale-invariant. The result of feature extraction is a set of points (sometimes referred to as keypoints) at which features have been identified. Each point has, at a minimum, a short descriptor (with or without orientation assignment) and a location (x, y).
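As a concrete but non-limiting example of such an extraction step, the sketch below uses OpenCV's ORB detector on an optional sub-portion of a greyscale frame; the choice of detector and the region parameter are assumptions, not requirements of the method.

```python
import cv2
import numpy as np

def extract_features(grey, region=None):
    """grey: greyscale frame; region: optional (x0, y0, x1, y1) sub-portion."""
    orb = cv2.ORB_create(nfeatures=300)
    mask = None
    if region is not None:
        x0, y0, x1, y1 = region
        mask = np.zeros_like(grey)
        mask[y0:y1, x0:x1] = 255               # restrict detection to the sub-portion
    keypoints, descriptors = orb.detectAndCompute(grey, mask)
    points = [kp.pt for kp in keypoints]       # (x, y) location per keypoint
    return points, descriptors                 # short descriptor plus location per point
```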
[0039] A task of the processor is to identify, from the identified features, a reference image (stored in memory) and to match the features identified with feature descriptors for the reference image. For example, with reference to Fig. 3, three feature descriptors 311, 312 and 313 are shown, labelled here as "P", "Q" and "R". Each feature descriptor is a small patch of image with an assigned orientation. Representation of features as feature descriptors is described, for example, in the paper Distinctive Image Features from Scale-Invariant Keypoints by David G. Lowe, Computer Science Department, University of British Columbia, 5 January 2004. In this description, the expression "feature descriptor" will be used for a scale-invariant feature transform (known as SIFT and also described in Video Object Tracking using SIFT and Mean Shift by Chaoyang Zhu, Chalmers University of Technology, Gothenburg, 2011). The term "point descriptor" or "short descriptor" will be used to refer to a mere type of point in feature extraction (e.g. a type of local extremum), which is less distinctive than a SIFT feature descriptor (e.g. is devoid of orientation).
[0040] In the example given in Fig. 3, the image 210 is a simple triangle and the points of interest are the three corners of that triangle. In this simple example, there are only three points of interest to be found and three patches to be searched for. In practice, the POIs may be many and varied depending on the reference image.
[0041] The reference image may comprise a thousand points of interest (or some number, N, say between about 50 and 1000), so the search algorithm may be searching for N patches, and it seeks the best one hundred matches (or some number less than N, say between about 20 and 200) to thereby select a lesser number of POIs for the reference image.
[0042] Also stored in memory 108 are the relative positions of the POIs in the reference image, represented by the triangle 314 in Fig. 3, which indicates that points P, Q and R are normally expected to be found at certain points relative to each other in the reference image (when the image to be found is presented face on). Note that the relative positions of the POIs and other data (including the feature descriptors for each of the POIs) describing the reference image may represent a large file, typically a larger file than the sum of the short descriptors of the extracted features (the keypoints).
[0043] In practice, when a user turns on the camera 101 or commences an augmented reality process, the task of searching for the reference image is slow. Note also that when a user first turns on a camera 101, there may be significant movement of the camera relative to the scene. There may be significant movement in the scene 200 and the image 210 from frame to frame.
[0044] To address these problems, the scene (frame) 200 is split into sub-portions 220, 230 and 240 as shown in Fig. 2a. In this example, the sub-portions are vertical strips of the scene. In this example there are three sub-portions.
[0045] In operation, a first frame is captured by the camera 101 and feature extraction is performed in sub-portion 220 (the left-hand vertical strip) in the first frame. In the first frame, it can be expected that features 211 and 212 are detected/discovered.
[0046] A second frame (Fig. 2b) is captured by the camera 101 and in the second frame, feature extraction is performed in the centre portion 230. In this frame, it can be expected that feature 213 is discovered. In the meantime, features 211 and 212 are tracked in the second frame. This process of tracking involves determining a motion vector (or motion vectors) for the image or for regions of the image between frames one and two. The motion vector is applied to features 211 and 212. Thus, the positions of these features are located in the second frame.
[0047] In a third frame (Fig. 2c), feature extraction is performed on right-hand sub-portion 240. In this example, feature 215 is found and, after three frames, the entire scene has been searched and all the features that will be extracted have been extracted. In the meantime, features 211, 212 and 213 extracted in the first and second frames are tracked in the third frame.
[0048] The process used for tracking features from frame to frame can be performed by a known process sometimes referred to as "optical flow". Note that there may be more than one vector for a frame. For example, if there is rotational movement of the image 210 (or of the scene 200), there will be different vectors applicable to different regions of the frame. Thus different features may be subject to different vectors.
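A minimal sketch of such per-point tracking is given below, using pyramidal Lucas-Kanade optical flow as one possible implementation (an assumption; any optical-flow or motion-vector method could serve). Because each point is followed individually, features lying in regions moving under different vectors are handled separately.

```python
import cv2
import numpy as np

def track_points(prev_grey, next_grey, points):
    """points: list of (x, y) feature locations found in prev_grey."""
    if not points:
        return []
    prev_pts = np.float32(points).reshape(-1, 1, 2)
    next_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_grey, next_grey, prev_pts, None)
    ok = status.ravel() == 1
    # keep only points successfully followed into the next frame
    return [tuple(p.ravel()) for p, good in zip(next_pts, ok) if good]
```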
[0049] When sufficient portions of the image have undergone feature extraction, the process is ready to attempt to identify a reference image. This is done by simple comparison of a histogram (or other summary) of the features extracted with histograms (or corresponding summaries) of features for candidate reference images. This is described in greater detail below.
[0050] Once a reference image (or part thereof) has been found, the process of tracking that image or partial image into a next frame may be computationally less onerous than discovering it in the first place.
[0051] Tracking features extracted in one frame into the next frame makes it possible, before performing reference image identification, to: (i) check that features previously extracted are still present in the subsequent frame; and (ii) avoid confusing them with features extracted in the subsequent frame. Thus, if extracted features are on a motion vector taking them out-of-frame, they can be eliminated (e.g. not included in the histogram). Conversely, if they are on a vector toward the next portion, double counting of that feature can be avoided. Tracking of the features can be performed from the knowledge of the (x, y) coordinates of the features extracted in the first frame and the optical flow of that region of the image from the first frame to the next frame. Tracking of the features can be performed before look-up of the reference image and therefore before identifying the full feature descriptors of the POIs in the reference image that correspond to the feature points extracted.
[0052] When the reference image has been identified and the full feature descriptors for all the POIs in the reference image are available, matching can be performed between the feature descriptors and areas of the image around the corresponding extracted features. This process will be referred to as "POI discovery". This matching is facilitated by knowledge of the locations of the extracted features. The search process is reduced by limiting the search for different feature descriptors to small bounded areas around the extracted features. This in turn is facilitated by having tracked features extracted in a first frame into the second and subsequent frames.
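One possible realisation of this bounded matching step is sketched below: each stored reference-image patch is matched only inside a small window centred on a tracked feature location, which bounds the search. The window radius, the score threshold and the use of normalised cross-correlation are illustrative assumptions, not details taken from the patent.

```python
import cv2

def discover_poi(grey, patch, centre, radius=24, threshold=0.7):
    """grey: current greyscale frame; patch: reference-image patch (same dtype) for one POI;
    centre: (x, y) of a tracked extracted feature near which to search."""
    h, w = grey.shape[:2]
    ph, pw = patch.shape[:2]
    x, y = int(centre[0]), int(centre[1])
    x0, y0 = max(x - radius, 0), max(y - radius, 0)
    x1, y1 = min(x + radius + pw, w), min(y + radius + ph, h)
    window = grey[y0:y1, x0:x1]
    if window.shape[0] < ph or window.shape[1] < pw:
        return None                                  # window too small to search
    result = cv2.matchTemplate(window, patch, cv2.TM_CCOEFF_NORMED)
    _, score, _, loc = cv2.minMaxLoc(result)
    if score < threshold:
        return None                                  # no sufficiently good match here
    return (x0 + loc[0], y0 + loc[1]), score         # matched location and match quality
```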
[0053] The POIs that have been discovered are rank ordered, for example by quality of match between the feature descriptors for those POIs and the corresponding portions of the scene (e.g. the best quality match is ranked highest). Having rank ordered the POIs, a set of POIs is determined, e.g. the best one hundred, for the reference image and these POIs can be tracked into future frames. When selecting discovered POIs to create a selected set, it is possible to select the POIs that best match the corresponding feature descriptors, but it is preferable to also select POIs that are distributed across the reference image (and not, for example, select many POIs clustered in one part of the image where the image quality is highest). Accordingly, the POIs may first be rank ordered by quality and then selected, in decreasing order of quality of match, skipping POIs that are in close proximity to already selected POIs until a satisfactory dispersion of POIs is achieved, and then returning to skipped POIs until a satisfactory number of POIs have been selected to create the discovered set.
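The rank-and-disperse selection just described might be implemented along the following lines; the target count and minimum separation are illustrative values only.

```python
def select_dispersed_pois(pois, target=100, min_dist=20.0):
    """pois: list of ((x, y), score) tuples, e.g. from POI discovery."""
    ranked = sorted(pois, key=lambda p: p[1], reverse=True)  # best match first
    selected, skipped = [], []

    def too_close(pt):
        return any((pt[0] - s[0][0]) ** 2 + (pt[1] - s[0][1]) ** 2 < min_dist ** 2
                   for s in selected)

    for cand in ranked:
        if len(selected) >= target:
            break
        # skip candidates crowding an already selected POI; keep them in reserve
        (skipped if too_close(cand[0]) else selected).append(cand)
    # return to the skipped POIs (still in quality order) to reach the target count
    selected.extend(skipped[:max(0, target - len(selected))])
    return selected
```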
[0054] By this process, the task of discovering points of interest for a given frame is reduced. Little or nothing is lost by not performing feature extraction, reference image identification and POI discovery all in the first frame, because in most applications a user will require some time to position the image in the frame before expecting to augment that image according to the augmented reality process. The first few frames are frequently subject to high relative motion or blur, until the user has positioned the camera to focus on the image. During this period of settling down, the feature extraction process can begin to extract features and can have completed the process in just a few frames, by which time the user is ready to identify the reference image and apply the augmented reality.
[0055] Also note that the tracking or optical flow process is subject to error/drift, but the amount of error is small across only a few frames (e.g. three frames).
[0056] As an alternative to dividing the image into three vertical strips, it can be divided into horizontal strips 420, 430 and 440 as shown in Fig. 4. There may be fewer or more than three vertical or horizontal strips; there may, for example, be two or four, or up to eight.
[0057] Referring now to Figs. 5a, 5b and 5c, the same image (described above with reference to Fig. 2a) is shown, subject to movement between frames. The movement is illustrated by vectors 510 and 520. As can be seen, the image 210 is moving from left to right between frames. When it is known (e.g. it has been identified from standard image processing algorithms) that there is movement in the frame (or movement of the camera relative to the scene), the selection of the sub-portions for feature extraction is made dependent on the movement. Preferably, sub-portions are selected in order in the direction of movement. Thus, in the example given, the order of selection of the sub-portions is 220, 230 and 240, from left to right. In this way, sub-portions are selected into which the image is moving (rather than selecting sub-portions from which the image is moving away).
[0058] Other arrangements are described with reference to Figs. 6, 7 and 8. In Fig. 6, the image 200 is split into four quadrants 610, 620, 630 and 640. This sub-dividing may be preferable if, for example, the image is rotating. If the image is rotating clockwise, the quadrants will be selected in clockwise order. If the image is rotating anticlockwise, the quadrants will be selected in anticlockwise order. There may be fewer or more than four sectors; there may, for example, be three sectors, or six or more.
[0059] Referring to Fig. 7, the image is divided into three (or two or more than three) diagonal slices 710, 720 and 730. If, for example, the image is moving from top left to bottom right, the slices will be selected in the order: top-left to bottom-right. Equally, if the image is moving from bottom right to top left, they will be selected in the reverse order. If the image is moving from bottom-left to top-right or vice-versa, the diagonal slices can be selected on an alternative diagonal in that order.
[0060] Referring to Fig. 8, the image is segmented into concentric portions (two, three or more). In this example, the portions are rectangular, but any ring or annular shape can be used. There is an outer portion 810, a middle portion 820 and an inner portion 830. In this particular example, there is a vanishing point 840 around which the concentric portions are centred. This point (which is optional) may be in the centre of the frame or, as illustrated, it may be offset. This configuration is useful if the camera is zooming in or zooming out. For example if the camera is zooming in to the point 840, then the portions will be selected from outermost to innermost. If the camera is zooming out, they will be selected from innermost to outermost.
[0061] The process performed by processor 105 will be described with greater detail with reference to the flowchart of Fig. 9.
[0062] The process in question begins with the capture of an image frame by the camera 101 in step 910. At step 920, an optional step is performed whereby a stability test or simple timeout is conducted. In one example, this step involves receiving two frames and determining an amount of motion between them, or an amount of blur (e.g. lack of correlation) between them, and determining that the image is unstable. In such a situation, the process returns to step 910 to capture another frame.
[0063] Alternatively, in step 920, there is a simple settling down time of some milliseconds or some number of frames to allow the image to settle down to a stable position.
[0064] As an alternative, step 920 takes place further down in the process, as will be described.
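A hedged sketch of the stability test or timeout of step 920 follows; the mean-absolute-difference measure and the thresholds are assumptions chosen for illustration, and a gyroscope reading could equally be used, as described later.

```python
import cv2
import numpy as np

def is_stable(prev_grey, grey, diff_threshold=8.0):
    """Mean absolute per-pixel difference as a crude motion/blur proxy."""
    return float(np.mean(cv2.absdiff(prev_grey, grey))) < diff_threshold

def wait_for_stability(frames, max_wait=10):
    """Defer processing until the image settles, or a frame-count timeout expires."""
    prev = None
    for i, frame in enumerate(frames):
        if prev is not None and (is_stable(prev, frame) or i >= max_wait):
            return frame          # settled (or timed out): proceed to step 930
        prev = frame
    return prev
```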
[0065] At step 930, a sub-portion of the image frame is selected. Step 930 may utilise motion already detected between frames captured in step 910 to select a manner of subdividing the image according to the detected motion. Thus, it may subdivide the image into sub-portions that divide the image in the direction of motion. Alternatively, e.g. if there is just one frame captured, step 930 simply defaults to a particular sub-portion (e.g. a centre slice of the image).
[0066] At step 940 feature extraction is performed in the sub-portion of the image selected. If, at this point, previous features have already been extracted (e.g. features that have been extracted in a previous pass of step 940), these are tracked in step 950.
[0067] At step 960, the process determines whether a complete set of sub-portions has been searched for features. For example, if the image has been divided into three sub-portions, step 960 may be complete after three passes of step 940. Alternatively, step 960 may be complete when a desired number of features have been extracted. Alternatively, if the image is divided into N sub-portions, step 960 may wait until N+1 or N+2 frames have been searched, or N+M frames, where M depends on the amount of motion, for example M is greater if the motion is high and lower or zero if the motion is low or there is no discernible motion. If insufficient frame portions have been processed, the process returns to step 910 and another frame is captured.
[0068] When feature extraction has been performed on sufficient frame sub-portions the process is ready to identify a reference image. The reference image is identified in step 970. A histogram for the reference image (or each reference image) is stored in memory 108, as shown at 975 in Fig. 9. The particular reference image is identified by comparison between a histogram of the features extracted and the stored histogram. The result of this comparison is a simple reference image identifier. Other ways of summarizing the results of the feature extraction process can be used in place of a histogram to allow a reference image to be selected from among a set of candidate reference images.
[0069] Thus, the reference image is identified by generating a histogram of features (e.g. frequency of occurrence of keypoint descriptors or short feature descriptors) found, and comparing the histogram with locally stored histograms of candidate reference images.
[0070] Storing histograms of reference images is very memory efficient, requiring only a few tens of megabytes of memory 108. Note that a given short feature descriptor (or SIFT feature descriptor) may be applicable for more than one reference image or indeed many reference images.
[0071] The process is akin to a bag-of-words model in language processing. In this model, image segments (feature descriptors) are represented as a multi-set without regard to relative spatial positioning of different features in the reference image. The relative spatial position information 314 shown in Figure 3 does not need to be stored in the memory 108.
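A minimal sketch of such a look-up is given below: extracted descriptors are assigned to their nearest visual "word", the resulting frequency histogram is normalised, and the closest stored histogram identifies the candidate reference image. The vocabulary, the normalisation and the L1 distance are assumptions made for the example, not details taken from the patent.

```python
import numpy as np

def identify_reference(descriptors, vocabulary, stored_histograms):
    """descriptors: M x D array of extracted descriptors;
    vocabulary: K x D array of cluster centres (the visual 'words');
    stored_histograms: dict of reference-image id -> normalised length-K histogram."""
    d = descriptors.astype(np.float32)
    v = vocabulary.astype(np.float32)
    # assign each descriptor to its nearest visual word
    dists = np.linalg.norm(d[:, None, :] - v[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=len(v)).astype(np.float32)
    hist /= max(hist.sum(), 1.0)
    # choose the stored reference whose histogram is closest (L1 distance)
    return min(stored_histograms,
               key=lambda rid: np.abs(stored_histograms[rid] - hist).sum())
```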
[0072] When a sufficient set of matches has been found in the multi-set of feature descriptors (the bag of words), the identifier (only) of the corresponding reference image is sent to the server in step 980, via the access point 120 and the network 140. The identifier is a very small amount of data and is quick and easy to send across the network. The server 130 acts as a simple look-up memory. It fetches the reference image (the marker) according to the identifier presented, and it delivers (replies with) the set of features (SIFT feature descriptors) defining that image. This information includes all the spatial relationship information between the POIs. The complete description of the reference image is sent in step 985 from the server 130 to the device 100. At this point, the device 100 is ready to perform augmented reality, by, for example, superimposing an alternative image on the reference image.
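The exchange of steps 980 and 985 can be pictured with the following sketch, in which a plain in-memory store stands in for server 130; the transport, the field names and the JSON encoding are assumptions made for illustration and not part of the claimed method.

```python
import json

class ReferenceImageServer:
    """Stand-in for the remote look-up memory (server 130)."""
    def __init__(self, store):
        # store: identifier -> {"descriptors": [...], "poi_positions": [...]}
        self.store = store

    def fetch(self, identifier):
        return json.dumps(self.store[identifier])   # full description, serialised

def lookup_reference(server, identifier):
    """Device side: send only the small identifier, receive the full description."""
    description = json.loads(server.fetch(identifier))
    return description   # thereafter used to track the reference image frame to frame
```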
[0073] The further information 314 that the device 100 needs to track the reference image may also be stored in memory 108.
[0074] Note that in step 980 more than one identifier may be sent to the server. For example the reference image may not be uniquely identified, but the best N matches may be sent to the server, and the server can send N reference images. In this way, if in future frames a better match is performed, for example additional templates 950 are matched in step 940 and the resulting histogram more closely resembles one of the images, that reference image can be selected from among the N images fetched.
[0075] It has been mentioned that step 920 can be performed lower down the process of Fig. 9. For example, in step 940, the feature extraction process can be performed on only a small subset of the feature descriptors 950 and, when a few points of interest have been found, these can be tracked in step 950 and the motion of these points of interest can determine whether to select a different sub-portion in step 930 and whether to perform full template matching in step 940.
[0076] As an alternative to performing feature extraction on each sub-portion, a template matching process can be used, in which sub-portions are searched for matches with POI templates (SIFT feature descriptors) from the stored reference image, and when sufficient subportions have been searched and templates matched, the optical flow process can be used to track the corresponding POIs.
[0077] The processes described herein, e.g. the process described with reference to Fig. 9, can be expressed in computer program language which can be stored on a medium or in memory for execution by a processor.
[0078] As an alternative to performing frame-to-frame motion detection (either for purposes of the stability test 920 or for purposes of selecting sub-portions in step 930 as previously described with reference to Figs. 4, 6, 7 and 8), a gyroscope 110 can be used to detect motion. The gyroscope can provide direction of motion for selection of how to subdivide the frame, or angular direction of motion for determining order of selection of subsections. It can also determine whether there is sufficient stability to begin the process - i.e. proceed from step 920 to step 930.

Claims (23)

Claims
1. A method of discovering points of interest (POIs) for matching in a scene, comprising: (a) selecting a first sub-portion of a first frame and performing feature extraction for the sub-portion to provide a first set of features; (b) selecting a second sub-portion of a second frame and performing feature extraction for the second sub-portion to provide a second set of features; and (c) tracking the first set of features in the second frame.
2. The method of claim 1, further comprising repeating step (b) for a third frame, a third sub-portion and a third set of features and tracking the first and second sets of features in the third frame, wherein the first, second and third sub-portions make up the entire scene.
3. The method of claim 1, further comprising dividing the scene into N sub-portions and repeating step (b) for frames 3 to N, sub-portions 3 to N and further features and tracking previously extracted features in frames 3 to N.
4. The method of claim 1, 2 or 3, wherein tracking of discovered POIs is performed using frame-to-frame motion vectors.
5. The method of claim 1, 2, 3 or 4, wherein feature extraction comprises generating feature descriptors for the respective sub-portion and wherein a reference image is identified by generating a histogram of extracted feature descriptors and comparing the histogram with locally stored histograms of candidate reference images.
6. The method of claim 5, wherein the histogram comparison is performed only after feature extraction across a plurality of sub-portions of a plurality of frames.
7. The method of any of the preceding claims, further comprising selecting the sub-portions in accordance with a direction of relative movement of the scene and a camera capturing the scene.
8. The method of claim 7, wherein the scene is separated into sub-portions in a direction of movement of the camera relative to the scene.
9. The method of claim 8, wherein the scene is separated into vertical sub-portions and wherein, when the camera is moving from left to right relative to the scene, the sub-portions are selected from left to right, and vice-versa when moving from right to left.
10. The method of claim 8, wherein the scene is separated into horizontal sub-portions and wherein, when the camera is moving upwards relative to the scene, the sub-portions are selected from bottom to top, and vice-versa when moving downwards.
11. The method of claim 7, wherein the scene is separated into annular sub-portions and wherein, when the camera is moving inwards relative to the scene, the sub-portions are selected from outermost to innermost, and vice-versa when moving outwards.
12. The method of any one of the preceding claims comprising a step of detecting motion above a predetermined threshold and inhibiting template matching until motion has slowed to a lower threshold.
13. A device with augmented reality capability comprising means for capturing or receiving image frames; means for selecting a first sub-portion of a first frame and performing feature extraction for the sub-portion and for selecting a second sub-portion of a second frame and performing feature extraction for the second sub-portion; and means for tracking features from the first frame to the second frame.
14. The device of claim 13 comprising a camera and means for determining a direction of relative movement of the scene and the camera, wherein the means for selecting the sub-portions are arranged to select in accordance with the direction of relative movement.
15. A method of discovering points of interest (POIs) of a reference image in a scene, comprising: performing feature extraction in a captured frame or part thereof; identifying a reference image from feature descriptors found; sending the identity of the reference image to a server and receiving a description of the reference image from the server.
16. The method of claim 15, wherein the reference image is identified by generating a histogram of feature descriptors found and comparing the histogram with locally stored histograms of candidate reference images.
17. The method of claim 15 or 16, wherein the description of the reference image received includes relative positions of POIs in the reference image.
18. The method of claim 17, wherein the description of the reference image received is a complete description including relative positions of all POIs in the reference image.
19. The method of any one of claims 15 to 18, further comprising using the received description of the reference image for tracking POIs from frame-to-frame.
20. A device with augmented reality capability comprising: means for capturing or receiving image frames; means for performing feature extraction, in a captured frame or part thereof; means for sending a summary of extracted features to a server and means for receiving a description of the reference image from the server based on the summary of extracted features.
21. A system comprising the device of claim 20 in combination with a server for storing descriptions of reference images and for returning to the device a description of a reference image identified to the server by the device.
22. The system of claim 21 wherein the server stores many reference image descriptions and returns a selected reference image description, selected according to a summary given to it by the device.
23. A computer program product having computer code stored thereon which when executed by a processor, causes the processor to perform the method of any one of claims 1 to 12 and 15 to 19.
GB1607577.2A 2016-04-29 2016-04-29 Discovering points of interest and identifying reference images in video processing and efficient search and storage therefor Withdrawn GB2549940A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
GB1607577.2A GB2549940A (en) 2016-04-29 2016-04-29 Discovering points of interest and identifying reference images in video processing and efficient search and storage therefor
PCT/EP2017/059431 WO2017186578A1 (en) 2016-04-29 2017-04-20 Method, device, system and computer program product for discovering points of interest of a reference image in a scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1607577.2A GB2549940A (en) 2016-04-29 2016-04-29 Discovering points of interest and identifying reference images in video processing and efficient search and storage therefor

Publications (2)

Publication Number Publication Date
GB201607577D0 GB201607577D0 (en) 2016-06-15
GB2549940A true GB2549940A (en) 2017-11-08

Family

ID=56234191

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1607577.2A Withdrawn GB2549940A (en) 2016-04-29 2016-04-29 Discovering points of interest and identifying reference images in video processing and efficient search and storage therefor

Country Status (2)

Country Link
GB (1) GB2549940A (en)
WO (1) WO2017186578A1 (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012040099A1 (en) * 2010-09-20 2012-03-29 Qualcomm Incorporated An adaptable framework for cloud assisted augmented reality

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130287259A1 (en) * 2011-11-17 2013-10-31 Yasunori Ishii Image processing device, image capturing device, and image processing method
WO2015179133A1 (en) * 2014-05-21 2015-11-26 Alcatel Lucent Accelerated image processing
CN104574440A (en) * 2014-12-30 2015-04-29 安科智慧城市技术(中国)有限公司 Video movement target tracking method and device

Also Published As

Publication number Publication date
GB201607577D0 (en) 2016-06-15
WO2017186578A1 (en) 2017-11-02

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)