WO2005096178A1 - Method and apparatus for retrieving visual object categories from a database containing images - Google Patents

Method and apparatus for retrieving visual object categories from a database containing images

Info

Publication number
WO2005096178A1
Authority
WO
WIPO (PCT)
Prior art keywords
images
model
visual object
image
object category
Prior art date
Application number
PCT/GB2005/001124
Other languages
French (fr)
Other versions
WO2005096178A8 (en
Inventor
Andrew Zisserman
Robert Fergus
Pietro Perona
Original Assignee
Isis Innovation Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Isis Innovation Limited
Priority to JP2007505620A (published as JP2007531136A)
Priority to EP05729251A (published as EP1730658A1)
Publication of WO2005096178A1
Publication of WO2005096178A8

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5854 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using shape and object relationship
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features

Definitions

  • This invention relates to a method and apparatus for retrieving visual object categories from a database containing images and, more particularly, to an improved method and apparatus for searching for, and retrieving, relevant images corresponding to visual object categories specified by a user by means of, for example, an Internet search engine or the like.
  • the most relevant returned items i.e. those containing precisely the keyword(s) entered, are identified and then ranked according to a numeric value based on the number of links existing to each respective web page in other web pages.
  • the results likely to be of most relevance to the user are listed in the first few pages of the search results.
  • the results most likely to be of relevance are not likely to be returned in the first few pages of the search results, but instead are more likely to be evenly mixed with unrelated images.
  • This method is highly effective in quickly gathering related images from the millions available across the World Wide Web, but the final outcome is far from perfect in the sense that the user may then have to go through tens or even hundreds or thousands of result entries to find the images of interest. We have now devised an improved arrangement.
  • apparatus for determining the relevance of images retrieved from a database relative to a specified visual object category, the apparatus comprising means for transforming a visual object category into a model defining features of said visual object category and a spatial relationship therebetween.
  • Means may be provided for storing said model.
  • means are provided for comparing a set of images retrieved from a database with the stored model and calculating a likelihood value relating to each image based on its correspondence with said model.
  • Means may further be provided for ranking the images in order of the respective likelihood values; and/or for retrieving further images corresponding to the specified visual object category.
  • a method for determining the relevance of images retrieved from a database relative to a specified visual object category comprising transforming a visual object category into a model defining features of said visual object category and a spatial relationship therebetween.
  • the method may further include the step of storing said model.
  • the method may further include the steps of comparing a set of images retrieved from the database with the stored model and calculating a likelihood value relating to each image based on its correspondence with the model.
  • the method includes ranking the images in order of the respective likelihood values, and/or finding further images corresponding to the specified visual object category.
  • the set of images may be retrieved from a database during a search of that database, using for example, a search engine.
  • each part is represented by one or more of its appearance and/or geometry, its scale relative to the model, and its occlusion probability, which parameters may be modelled by probability density functions, such as Gaussian probability functions or the like.
  • the step of comparing an image with the models preferably includes identifying features of the image and evaluating the features using the above-mentioned probability densities.
  • the method may include the step of selecting a sub-set of the images retrieved during the database search, and creating the model from this sub-set of images.
  • substantially all of the images retrieved during the database search may be used to create the model.
  • at least two different models may be created in respect of a set of images retrieved during, for example, a database search, say patches and curves, although other features are envisaged.
  • a heterogeneous model made up of a combination of features may be created.
  • the method preferably includes the step of selecting the nature or type of model to be used for the comparison and ranking steps in respect of a particular set of images.
  • the selective step may be performed by calculating a differential ranking measure in respect of each model, and selecting the model having the largest differential ranking measure.
  • Figure 1 is a schematic block diagram illustrating the principal steps of a method according to a first exemplary embodiment of the present invention
  • Figure 2 is a schematic block diagram illustrating the principal components of a method according to a second exemplary embodiment of the present invention.
  • Figure 3 is a schematic block diagram illustrating the principal steps of a patch feature extraction method for use in the method of Figure 1 or Figure 2;
  • Figure 4 is a schematic block diagram illustrating the principal steps of a curve feature extraction method for use in a method of Figure 1 or Figure 2;
  • Figure 5 is a schematic block diagram illustrating the principal steps of a model learning method in the supervised case used in the method of Figure 1;
  • Figure 6 is a schematic block diagram illustrating the principal steps of a model learning method in the unsupervised case used in the method of Figure 2 (note: a rectangle denotes a process while a parallelogram denotes data).
  • the present invention is based on the principle that, even without improving the performance of a search engine per se the above-mentioned problems related to image-based Internet searching may be alleviated by measuring 'visual consistency' amongst the images that are returned by the search and re-ranking them on the basis of this consistency, thereby increasing the proportion of relevant images returned to the user within the first few entries in the search results.
  • This concept is based on the assumption that images related to the search requirements will typically be visually similar, while images that are unrelated to the search requirements will typically look different from each other as well.
  • the problem of how to measure 'visual consistency' is approached in the following exemplary embodiments of the present invention as one of probabilistic modelling and robust statistics.
  • the algorithm employed therein robustly learns the common visual elements in a set of returned images so that the unwanted (non-category) images can be rejected, or at least so that the returned images can be ranked according to their resemblance to this commonality. More precisely, a visual object model is learned which can accommodate the intra-class variation in the requested category.
  • the apparatus and method of these exemplary embodiments of the invention employ an extension of a constellation model, and are designed to learn object categories from images containing clutter, thereby at least minimising the requirement for human intervention.
  • An object or constellation model consists of a number of parts which are spatially arranged over the object, wherein each part has an appearance and can be occluded or not.
  • a part in this case may, for example, be a patch of picture elements (pixels) or a curve segment.
  • a part is represented by its intrinsic description (appearance or geometry), its scale relative to the model, and its occlusion probability.
  • the shape of the object is represented by the mutual position of the parts.
  • the entire model is generative and probabilistic, in the sense that part description, scale, model shape and occlusion are all modelled by probability density functions, which in this case are Gaussians.
  • the process of learning an object category is one of first detecting features with characteristic scales, and then estimating the parameters of the above densities from these features, such that the model gives a maximum-likelihood description of the training data.
  • a model consists of P parts and is specified by parameters θ.
  • the model is scale invariant. Full details of this model and its fitting to training data using the EM algorithm are given by R. Fergus, P. Perona, and A. Zisserman in Object Class Recognition by Unsupervised Scale-Invariant Learning, In Proc. CVPR, 2003, and essentially the same representations and estimation methods are used in the following exemplary embodiments of the present invention.
  • An interest operator such as that described by T. Kadir and M. Brady in Scale, Saliency and Image Description, IJCV, 45(2):83-105, 2001, may be used to find regions that are salient over both location and scale. It is based on measurements of the grey level histogram and entropy over the entire region. The operator detects a set of circular regions so that both position (the circle centre) and scale (the circle radius) are determined. The operator is largely invariant to scale changes and rotation of the image. Thus, for example, if the image is doubled in size, then the corresponding set of regions will be detected (at twice the scale).
  • extended edge chains may be used as detected, for example, by the edge operator described by J.F. Canny in A Computational Approach to Edge Detection, IEEE PAMI, 8(6):679-698, 1986.
  • the chains are then divided into segments between bitangent points, i.e. points at which a line has two points of tangency with the curve.
  • bitangency is covariant with projective transformations. This means that for near planar curves the segmentation is invariant to viewpoint, an important requirement if the same, or similar, objects are imaged at different scales and orientations.
  • each patch exists in a predetermined dimensional space. Since the appearance densities of the model must also exist in this space, it is necessary from a practical point-of-view to somehow reduce the dimensionality of each patch whilst retaining its distinctiveness.
  • principal component analysis (PCA)
  • the patches from all images are collected and PCA performed on them.
  • the appearance of each patch is then a vector of the coordinates within the first predetermined number k principal components, thereby giving A. This results in a good reconstruction of the original patch whilst using a moderate number of parameters per part.
  • an image search using a search engine such as Google® may be used to download a set of images and the integrity of the downloaded images is checked. In addition, those outside a reasonable size range (say, between 100 and 600 pixels on the major axis) are discarded.
  • a typical image search is likely to return in the region of 450-700 usable images and a script may be employed to automate the procedure.
  • the images returned can be divided into three distinct types: • Good images, i.e. good examples of the keyword category, lacking major occlusion, although there may be a variety of viewpoints, scalings and orientations. • Intermediate images, i.e.
  • each image is converted into greyscale (because colour information is not used in the model described above, although colour information may be used in other models applied to embodiments of the present invention, and the invention is not intended to be limited in this regard), and curves and regions of interest are identified within the images.
  • a predetermined number of regions with the highest saliency are used from each image.
  • the learning process takes one of two distinct forms: unsupervised learning ( Figure 6) and limited supervision ( Figure 5).
  • unsupervised learning a model is learnt using all images in a dataset. No human intervention is required in the process.
  • limited supervision an alternative approach using relevance feedback is used, whereby a user selects, say, 10 or so images from the dataset that are close to the required image, and a model is learnt using these selected images.
  • the learning task takes the form of estimating the parameters θ of the model discussed above.
  • the model is learnt using the EM algorithm as described by R. Fergus et al in the reference specified above.
  • a variety of models may be learned, each made up of a variety of feature types (e.g. patches, curves, etc), and a decision must then be made as to which should give the final ranking that will be presented to a user.
  • this is done by using a second set of images, consisting entirely of "junk" images (i.e. images which are totally unrelated to the specified visual object category). These may be collected by, for example, typing "things" into a search engine's image search facility.
  • each model evaluates the likelihood of images from both datasets and a differential ranking measure is computed between them, for example, by looking at the area under an ROC curve between the two data sets. The model which gives the largest differential ranking measure is selected to give the final ranking presented to the user.
  • the model fitting situation dealt with herein is equivalent to that faced in the area of robust statistics, in the sense that there is an attempt to learn a model from a dataset which contains valid data (the good images) but also outliers (the intermediate and junk images) which cannot be fitted by the model. Consequently, a robust fitting algorithm, RANSAC, may be adapted to the needs of the present invention.
  • a set of images sufficient to train a model (10, in this case) is randomly sampled from the images retrieved during a database search. This model is then scored on the remaining images by the differential ranking measure explained above. The sampling process is repeated a sufficient number of times to ensure a good chance of a sample set consisting entirely of inliers (good images).
  • the models of a category have been shown to be capable of being learnt from training sets containing large amounts of unrelated images (say up to 50% and beyond) and it is this ability that allows the present invention to handle the type of datasets returned by conventional Internet search engines.
  • the algorithm only requires images as its input, so the method and apparatus of the present invention can be used in conjunction with any existing search engine. Still further, it will be appreciated by a person skilled in the art that the present invention has as a significant advantage that it is scale invariant in its ability to retrieve/rank relevant images.
  • category keyword (needed for (i) above) can be automatically selected by choosing the most commonly searched for categories.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Library & Information Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Processing Or Creating Images (AREA)

Abstract

A method for determining the relevance of images retrieved from a database to a specified visual object category, represented by the query keyword, the method comprising learning a model defining features of said visual object category and a spatial relationship therebetween, storing said model, comparing a set of images retrieved from said database with said stored model and calculating a likelihood value relating to each image based on its correspondence with said model.

Description

Method and Apparatus for Retrieving Visual Object Categories from a Database Containing Images
This invention relates to a method and apparatus for retrieving visual object categories from a database containing images and, more particularly, to an improved method and apparatus for searching for, and retrieving, relevant images corresponding to visual object categories specified by a user by means of, for example, an Internet search engine or the like.
It is relatively simple to conduct a search of the World Wide Web for images by simply entering one or more keywords into a search engine, in response to which, hundreds and sometimes thousands of related images may be returned in the search results for selection by the user. However, not all of the images returned in the results will be particularly relevant to the search. In fact, many of the images returned are likely to be completely unrelated.
In a text-based Internet search, the most relevant returned items (i.e. those containing precisely the keyword(s) entered) are identified and then ranked according to a numeric value based on the number of links existing to each respective web page in other web pages. As a result, the results likely to be of most relevance to the user are listed in the first few pages of the search results.
In the case of an image-based search, however, the results most likely to be of relevance are not likely to be returned in the first few pages of the search results, but instead are more likely to be evenly mixed with unrelated images. This is because current Internet image search technology is based on words, rather than image content, such that the images returned in the results contain the entered keyword(s) in either the filename of the image or text appearing near the image on a web page, and the results are then ranked as described above with reference to a text-based search. This method is highly effective in quickly gathering related images from the millions available across the World Wide Web, but the final outcome is far from perfect in the sense that the user may then have to go through tens or even hundreds or thousands of result entries to find the images of interest. We have now devised an improved arrangement.
In accordance with the present invention, there is provided apparatus for determining the relevance of images retrieved from a database relative to a specified visual object category, the apparatus comprising means for transforming a visual object category into a model defining features of said visual object category and a spatial relationship therebetween.
Means may be provided for storing said model. In one exemplary embodiment of the invention, means are provided for comparing a set of images retrieved from a database with the stored model and calculating a likelihood value relating to each image based on its correspondence with said model. Means may further be provided for ranking the images in order of the respective likelihood values; and/or for retrieving further images corresponding to the specified visual object category.
Also in accordance with the present invention, there is provided a method for determining the relevance of images retrieved from a database relative to a specified visual object category, the method comprising transforming a visual object category into a model defining features of said visual object category and a spatial relationship therebetween. The method may further include the step of storing said model. In one exemplary embodiment of the invention, the method may further include the steps of comparing a set of images retrieved from the database with the stored model and calculating a likelihood value relating to each image based on its correspondence with the model. Preferably, the method includes ranking the images in order of the respective likelihood values, and/or finding further images corresponding to the specified visual object category.
In any event, it will be appreciated that the set of images may be retrieved from a database during a search of that database, using for example, a search engine.
The features beneficially comprise at least two types, which categories may include pixel patches, curve segments, corners and texture. In a preferred embodiment, each part is represented by one or more of its appearance and/or geometry, its scale relative to the model, and its occlusion probability, which parameters may be modelled by probability density functions, such as Gaussian probability functions or the like.
The step of comparing an image with the models preferably includes identifying features of the image and evaluating the features using the above-mentioned probability densities.
The method may include the step of selecting a sub-set of the images retrieved during the database search, and creating the model from this sub-set of images. Alternatively, substantially all of the images retrieved during the database search may be used to create the model. In either case, at least two different models may be created in respect of a set of images retrieved during, for example, a database search, say patches and curves, although other features are envisaged. Alternatively, and more preferably, a heterogeneous model made up of a combination of features may be created. In any event, the method preferably includes the step of selecting the nature or type of model to be used for the comparison and ranking steps in respect of a particular set of images.
In one embodiment, the selective step may be performed by calculating a differential ranking measure in respect of each model, and selecting the model having the largest differential ranking measure.
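By way of illustration only, the selection step may be sketched as follows. The sketch assumes that the differential ranking measure is the area under an ROC curve separating the likelihood values a model assigns to the retrieved images from those it assigns to a second set of unrelated images. The function names, the model names and the toy scores are hypothetical and do not come from the specification.

```python
def roc_auc(query_scores, junk_scores):
    """Area under the ROC curve: the fraction of (query, junk) pairs in
    which the query image receives the higher score; ties count half."""
    wins = 0.0
    for q in query_scores:
        for j in junk_scores:
            if q > j:
                wins += 1.0
            elif q == j:
                wins += 0.5
    return wins / (len(query_scores) * len(junk_scores))


def select_model(scores_by_model):
    """scores_by_model maps a model name to (query_scores, junk_scores).
    Returns the name of the model with the largest measure."""
    return max(scores_by_model, key=lambda name: roc_auc(*scores_by_model[name]))


# Hypothetical likelihood values for two candidate models.
toy_scores = {
    "patches": ([2.0, 1.5, 0.2, 0.1], [0.3, 0.0, -0.5]),
    "curves": ([0.4, 0.1, 0.0, -0.2], [0.3, 0.2, -0.1]),
}
best_model = select_model(toy_scores)
```

The pairwise form is quadratic in the number of images, which is acceptable for sets of a few hundred images; a sort-based computation would be preferable for larger sets.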
These and other aspects of the present invention will be apparent from, and elucidated with reference to, the embodiments described herein.
Embodiments of the present invention will now be described by way of examples only and with reference to the accompanying drawings, in which:
Figure 1 is a schematic block diagram illustrating the principal steps of a method according to a first exemplary embodiment of the present invention;
Figure 2 is a schematic block diagram illustrating the principal components of a method according to a second exemplary embodiment of the present invention;
Figure 3 is a schematic block diagram illustrating the principal steps of a patch feature extraction method for use in the method of Figure 1 or Figure 2;
Figure 4 is a schematic block diagram illustrating the principal steps of a curve feature extraction method for use in a method of Figure 1 or Figure 2;
Figure 5 is a schematic block diagram illustrating the principal steps of a model learning method in the supervised case used in the method of Figure 1; and
Figure 6 is a schematic block diagram illustrating the principal steps of a model learning method in the unsupervised case used in the method of Figure 2 (note: a rectangle denotes a process while a parallelogram denotes data).
Thus, the present invention is based on the principle that, even without improving the performance of a search engine per se the above-mentioned problems related to image-based Internet searching may be alleviated by measuring 'visual consistency' amongst the images that are returned by the search and re-ranking them on the basis of this consistency, thereby increasing the proportion of relevant images returned to the user within the first few entries in the search results. This concept is based on the assumption that images related to the search requirements will typically be visually similar, while images that are unrelated to the search requirements will typically look different from each other as well.
The problem of how to measure 'visual consistency' is approached in the following exemplary embodiments of the present invention as one of probabilistic modelling and robust statistics. The algorithm employed therein robustly learns the common visual elements in a set of returned images so that the unwanted (non-category) images can be rejected, or at least so that the returned images can be ranked according to their resemblance to this commonality. More precisely, a visual object model is learned which can accommodate the intra-class variation in the requested category. It will be appreciated by a person skilled in the art that this is an extremely challenging visual task: not only are there visual difficulties in learning from images, such as lighting and viewpoint variations (scale, foreshortening) and partial occlusion, but the object may only actually be present in a sub-set of the returned images, and this subset (and even its size) is unknown.
Referring to Figures 1 and 2 of the drawings, the apparatus and method of these exemplary embodiments of the invention employ an extension of a constellation model, and are designed to learn object categories from images containing clutter, thereby at least minimising the requirement for human intervention.
An object or constellation model consists of a number of parts which are spatially arranged over the object, wherein each part has an appearance and can be occluded or not. A part in this case may, for example, be a patch of picture elements (pixels) or a curve segment. In either case, a part is represented by its intrinsic description (appearance or geometry), its scale relative to the model, and its occlusion probability. The shape of the object (or overall model shape) is represented by the mutual position of the parts. The entire model is generative and probabilistic, in the sense that part description, scale, model shape and occlusion are all modelled by probability density functions, which in this case are Gaussians.
The process of learning an object category is one of first detecting features with characteristic scales, and then estimating the parameters of the above densities from these features, such that the model gives a maximum-likelihood description of the training data.
In this exemplary embodiment, a model consists of P parts and is specified by parameters θ. Given N detected features with locations X, scales S, and descriptions D, the likelihood that an image contains an object is assumed to have the following form:
p(X, S, D \mid \theta) = \sum_{h} \underbrace{p(D \mid h, \theta)}_{\text{Part Description}} \, \underbrace{p(X \mid S, h, \theta)}_{\text{Shape}} \, \underbrace{p(S \mid h, \theta)}_{\text{Rel. Scale}} \, \underbrace{p(h \mid \theta)}_{\text{Other}}
where the summation is over allocations, h, of parts to features. Typically, a model has 5 to 7 parts and there will be up to forty features in an image. Similarly, it is assumed that non-object background images can be modelled by a likelihood of the same form with parameters θ_bg. The decision as to whether a particular image contains an object or not is determined by the likelihood ratio:
R = \frac{p(X, S, D \mid \theta)}{p(X, S, D \mid \theta_{bg})}
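A much-reduced sketch of the likelihood ratio is given below, purely to make the sum over allocations concrete. It rests on several assumptions that are not in the specification: descriptions are single numbers, each part has a one-dimensional Gaussian description density, the shape and relative scale terms are taken as 1, every allocation is equally probable, no part is occluded, and features not allocated to a part are explained by a single background Gaussian. All names and numbers are hypothetical.

```python
import math
from itertools import permutations


def gaussian_pdf(x, mean, std):
    """Density of a one-dimensional Gaussian."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def likelihood_ratio(descriptions, parts, background):
    """descriptions: one number per detected feature.
    parts: list of (mean, std), one per model part.
    background: (mean, std) used for every non-object feature."""
    n, p = len(descriptions), len(parts)
    hypotheses = list(permutations(range(n), p))  # allocations h of parts to features
    prior_h = 1.0 / len(hypotheses)               # uniform p(h | theta), an assumption
    bg = [gaussian_pdf(d, *background) for d in descriptions]
    denominator = math.prod(bg)                   # background-only likelihood
    numerator = 0.0
    for h in hypotheses:
        term = prior_h
        for part_index, feature_index in enumerate(h):
            term *= gaussian_pdf(descriptions[feature_index], *parts[part_index])
        for feature_index in range(n):
            if feature_index not in h:            # unallocated features are background
                term *= bg[feature_index]
        numerator += term
    return numerator / denominator


# Hypothetical two-part model and two toy images.
toy_parts = [(0.0, 0.5), (3.0, 0.5)]
toy_background = (1.5, 3.0)
ratio_object = likelihood_ratio([0.1, 2.9, 1.0], toy_parts, toy_background)
ratio_clutter = likelihood_ratio([-4.0, 6.0, 7.5], toy_parts, toy_background)
```

Exhaustive enumeration grows rapidly with the number of parts and features, so a sketch of this kind is only practical for very small P and N.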
The model, at both the fitting and recognition stages, is scale invariant. Full details of this model and its fitting to training data using the EM algorithm are given by R. Fergus, P. Perona, and A. Zisserman in Object Class Recognition by Unsupervised Scale-Invariant Learning, In Proc. CVPR, 2003, and essentially the same representations and estimation methods are used in the following exemplary embodiments of the present invention.
Existing approaches to recognition learn a model based on a single type of feature, for example, image patches, texture regions or Haar wavelets. However, the different visual nature of objects means that this approach is limiting. For some objects, say for example, wine bottles, the essence of the object is captured far better with geometric information (i.e. the outline) rather than by patches of pixels and, of course, the reverse is true for many objects, for example, human faces. Consequently, for a flexible visual recognition system, it is necessary to have multiple feature types. The flexible nature of the constellation model described above permits this in view of the fact that, because the description densities of each part are independent, each can use a different type of feature.
In the following description, and referring to Figure 3 of the drawings, only two types of features are considered, although more (e.g. corners, texture, etc.) can easily be added. The first of these types consists of regions of pixels, and the second consists of curve segments. It will be appreciated that these types of feature are complementary in the sense that the first represents the appearance of an object, whereas the other represents the object geometry.
An interest operator, such as that described by T. Kadir and M. Brady in Scale, Saliency and Image Description, IJCV, 45(2):83-105, 2001, may be used to find regions that are salient over both location and scale. It is based on measurements of the grey level histogram and entropy over the entire region. The operator detects a set of circular regions so that both position (the circle centre) and scale (the circle radius) are determined. The operator is largely invariant to scale changes and rotation of the image. Thus, for example, if the image is doubled in size, then the corresponding set of regions will be detected (at twice the scale).
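The entropy measurement on which such an operator is based can be sketched as follows. This is not the cited operator itself: it computes only the Shannon entropy of the grey level histogram inside a circular region and picks the radius with the highest entropy, and the function names, the bin count and the toy image are assumptions made for illustration.

```python
import math


def region_entropy(image, cx, cy, radius, bins=16):
    """Shannon entropy (in bits) of the grey level histogram of the pixels
    inside the circle centred at (cx, cy). image is a list of rows of
    integer grey levels in the range 0..255."""
    counts = [0] * bins
    total = 0
    for y, row in enumerate(image):
        for x, grey in enumerate(row):
            if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                counts[min(grey * bins // 256, bins - 1)] += 1
                total += 1
    return -sum((c / total) * math.log2(c / total) for c in counts if c)


def most_salient_scale(image, cx, cy, radii):
    """Radius, among the candidates, at which the region entropy is largest."""
    return max(radii, key=lambda r: region_entropy(image, cx, cy, r))


# Toy images: a flat grey square and a two-level checkerboard.
flat = [[128] * 9 for _ in range(9)]
checkerboard = [[255 if (x + y) % 2 else 0 for x in range(9)] for y in range(9)]
flat_entropy = region_entropy(flat, 4, 4, 3)
textured_entropy = region_entropy(checkerboard, 4, 4, 3)
best_radius = most_salient_scale(checkerboard, 4, 4, [1, 2, 3])
```

A flat region has zero entropy and so would never be selected, whereas a textured region scores highly; the circle centre and chosen radius then give the position and scale of the region.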
In order to determine curve segments, rather than only considering very local spatial arrangements of edge points, extended edge chains may be used as detected, for example, by the edge operator described by J.F. Canny in A Computational Approach to Edge Detection, IEEE PAMI, 8(6):679-698, 1986. The chains are then divided into segments between bitangent points, i.e. points at which a line has two points of tangency with the curve. This decomposition is used herein for two reasons. First, bitangency is covariant with projective transformations. This means that for near planar curves the segmentation is invariant to viewpoint, an important requirement if the same, or similar, objects are imaged at different scales and orientations. Second, by segmenting curves using a bi-local property, interesting segments can be found consistently despite imperfect edgel data. Bitangent points are found on each chain using the method described by C. Rothwell, A. Zisserman, D. Forsyth and J. Mundy in Planar Object Recognition Using Projective Shape Representation, IJCV, 16(2), 1995. Since each pair of bitangent points defines a curve which is a sub-section of the chain, there may be multiple decompositions of the chain into curved sections. In practice, many curve segments are straight lines (within a threshold for noise) and these are discarded as they are far less informative than curves. In addition, the entire chain is also used, thereby retaining convex curve portions.
Thus, the above-mentioned feature detectors result in the provision of patches and curves of interest within each image. In order to use them in the model of the present invention, it is necessary to parameterise their properties to form D = [A, G], where A is the appearance of the regions within the image and G is the shape of the curves within the image.
Once the regions are identified, they are cropped from the image and rescaled to a smaller pixel patch. Each patch exists in a predetermined dimensional space. Since the appearance densities of the model must also exist in this space, it is necessary from a practical point-of-view to somehow reduce the dimensionality of each patch whilst retaining its distinctiveness. This is achieved in accordance with this exemplary embodiment of the invention using principal component analysis (PCA). In the learning stage, the patches from all images are collected and PCA performed on them. The appearance of each patch is then a vector of the coordinates within the first predetermined number k principal components, thereby giving A. This results in a good reconstruction of the original patch whilst using a moderate number of parameters per part.
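The dimensionality reduction step can be sketched as follows, under stated assumptions. Patches are taken to be already cropped, rescaled and flattened into equal-length lists of numbers. The principal components are found here by power iteration with deflation, a simple stand-in for a full eigen-decomposition that keeps the sketch free of external libraries; the text does not specify how the PCA is computed. The names fit_pca and appearance_vector are hypothetical, and k is assumed not to exceed the rank of the data.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fit_pca(patches, k, iterations=200):
    """Return (mean, components) holding the first k principal components.

    Power iteration with deflation is used instead of a full
    eigen-decomposition; this is an illustrative choice, not the patent's.
    """
    n, dim = len(patches), len(patches[0])
    mean = [sum(p[j] for p in patches) / n for j in range(dim)]
    centred = [[p[j] - mean[j] for j in range(dim)] for p in patches]
    components = []
    for _ in range(k):
        v = [1.0 / (j + 1) for j in range(dim)]  # deterministic start vector
        for _ in range(iterations):
            # w = X^T (X v), proportional to the covariance matrix times v
            scores = [dot(row, v) for row in centred]
            w = [sum(s * row[j] for s, row in zip(scores, centred))
                 for j in range(dim)]
            # remove the directions already captured by earlier components
            for c in components:
                proj = dot(w, c)
                w = [wj - proj * cj for wj, cj in zip(w, c)]
            norm = math.sqrt(dot(w, w))
            if norm < 1e-12:  # no variance left in the remaining directions
                break
            v = [wj / norm for wj in w]
        components.append(v)
    return mean, components


def appearance_vector(patch, mean, components):
    """Coordinates of one flattened patch within the learnt PCA basis."""
    centred = [x - m for x, m in zip(patch, mean)]
    return [dot(centred, c) for c in components]
```

In this sketch fit_pca would be run once over the patches collected from all training images, and appearance_vector would then be applied to each patch to give its row of A.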
Each curve is transformed to a canonical position using a similarity transformation such that it starts at the origin and ends at the point (1,0). If the centroid of the curve is below the x-axis then it is flipped both in the x-axis and the line y = 0.5, so that the same curve is obtained independent of the edgel ordering. The y value of the curve in this canonical position is sampled at a number of equally spaced x intervals between (0,0) and (1,0). Since the model is not orientation-invariant, the original orientation of the curve is concatenated to the vector of sampled values for each curve, giving an extended vector. Combining the vectors from all curves within the images gives G.
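The canonical transformation and sampling can be sketched as below. Several simplifications are assumptions of the sketch rather than statements of the described method: the centroid-based flip is omitted, the canonical curve is assumed to be single-valued in x so that linear interpolation is meaningful, and the original orientation is encoded as a single angle in radians. The names canonical_curve and interpolate are hypothetical.

```python
import cmath


def interpolate(xs, ys, x):
    """Piecewise-linear interpolation of y at x; xs must be sorted ascending."""
    if x <= xs[0]:
        return ys[0]
    for i in range(1, len(xs)):
        if x <= xs[i]:
            x0, y0, x1, y1 = xs[i - 1], ys[i - 1], xs[i], ys[i]
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return ys[-1]


def canonical_curve(points, samples=5):
    """Map a curve so it runs from (0,0) to (1,0), then sample its y values.

    `points` is an ordered list of (x, y) edgel positions with distinct
    end points. Returns the sampled y values followed by the original
    orientation of the curve (the angle of its chord, in radians).
    """
    z = [complex(x, y) for x, y in points]
    start, chord = z[0], z[-1] - z[0]
    orientation = cmath.phase(chord)
    # Subtracting the start and dividing by the chord is a similarity
    # transformation: translation, rotation and uniform scaling.
    canon = sorted(((p - start) / chord for p in z), key=lambda c: c.real)
    xs = [c.real for c in canon]
    ys = [c.imag for c in canon]
    sampled = [interpolate(xs, ys, i / (samples - 1)) for i in range(samples)]
    return sampled + [orientation]
```

Treating each point as a complex number makes the similarity transformation a single subtraction and division, which maps the first point to 0 and the last point to 1 by construction.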
In the following, the exemplary implementation of the gathering of images, and the main steps in applying the above-described algorithm (namely, feature detection, model learning and ranking) will be described in more detail.
For a given keyword, an image search using a search engine such as Google® may be used to download a set of images, and the integrity of the downloaded images is checked. In addition, those outside a reasonable size range (say, between 100 and 600 pixels on the major axis) are discarded. A typical image search is likely to return in the region of 450-700 usable images, and a script may be employed to automate the procedure. To evaluate the algorithms, the images returned can be divided into three distinct types:
• Good images, i.e. good examples of the keyword category, lacking major occlusion, although there may be a variety of viewpoints, scalings and orientations.
• Intermediate images, i.e. those images which are in some way related to the keyword category, but are of lower quality than the good images; they may have extensive occlusion, substantial image noise, be a caricature or cartoon of the category, or the category may be rather insignificant in the overall image, or there may be some other fault.
• Junk images, i.e. those images which are totally unrelated to the keyword category.
In this particular case, each image is converted into greyscale (because colour information is not used in the model described above, although colour information may be used in other models applied to embodiments of the present invention, and the invention is not intended to be limited in this regard), and curves and regions of interest are identified within the images. This produces X, D and S for use in learning or recognition. A predetermined number of regions with the highest saliency are used from each image.
The learning process takes one of two distinct forms: unsupervised learning (Figure 6) and limited supervision (Figure 5). In unsupervised learning, a model is learnt using all images in a dataset. No human intervention is required in the process. In learning with limited supervision, an alternative approach using relevance feedback is used, whereby a user selects, say, 10 or so images from the dataset that are close to the required image, and a model is learnt using these selected images.
In both approaches, the learning task takes the form of estimating the parameters $\theta$ of the model discussed above. The goal is to find the parameters $\theta_{ML}$ which best explain the data X, D, S from the chosen training images (be it 10 or the whole dataset), i.e. maximise the likelihood: $\theta_{ML} = \arg\max_{\theta} p(X, D, S \mid \theta)$. The model is learnt using the EM algorithm as described by R. Fergus et al in the reference specified above.
Given the learnt model, all hypotheses within a particular image are evaluated, and this determines the likelihood ratio for that image. This likelihood ratio is then used to rank all the images in the dataset. For each set of images, a variety of models may be learned, each made up of a variety of feature types (e.g. patches, curves, etc), and a decision must then be made as to which should give the final ranking that will be presented to a user. In accordance with an exemplary embodiment of the present invention, this is done by using a second set of images, consisting entirely of "junk" images (i.e. images which are totally unrelated to the specified visual object category). These may be collected by, for example, typing "things" into a search engine's image search facility. Thus, there are now two sets of images, or datasets: a) the one to be ranked (consisting of a mixture of junk and good images) and b) the junk dataset. In accordance with this exemplary embodiment of the invention, each model evaluates the likelihood of images from both datasets and a differential ranking measure is computed between them, for example, by looking at the area under an ROC curve between the two data sets. The model which gives the largest differential ranking measure is selected to give the final ranking presented to the user.
The rationale behind this exemplary approach is as follows. It can be assumed that the statistics of the junk images in the junk dataset b) are the same as those of the junk images in dataset a) to be ranked, such that by looking at a differential ranking measure, the contributions of the junk images in both datasets cancel, giving a measure of the good images alone. The higher their ranking, the better the model should be.
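A minimal sketch of this selection step is given below. It assumes that each model can be treated as a callable returning a likelihood ratio for an image, and it computes the area under the ROC curve directly as the fraction of (dataset image, junk image) pairs in which the dataset image receives the higher score, counting ties as half. The names roc_auc and select_model are hypothetical.

```python
def roc_auc(target_scores, junk_scores):
    """Area under the ROC curve separating target scores from junk scores.

    Equivalent to the probability that a randomly chosen target image
    outscores a randomly chosen junk image (ties count as one half).
    """
    wins = 0.0
    for t in target_scores:
        for j in junk_scores:
            if t > j:
                wins += 1.0
            elif t == j:
                wins += 0.5
    return wins / (len(target_scores) * len(junk_scores))


def select_model(models, dataset, junk):
    """Return the model with the largest differential ranking measure.

    `models` is assumed to be a list of callables mapping an image to a
    likelihood ratio; `dataset` is the set to be ranked and `junk` is the
    separately gathered junk set.
    """
    def measure(model):
        return roc_auc([model(im) for im in dataset],
                       [model(im) for im in junk])
    return max(models, key=measure)
```

A measure of 0.5 indicates that a model cannot distinguish the dataset to be ranked from the junk set, while values closer to 1 indicate that many images in the dataset are scored above typical junk.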
The model fitting situation dealt with herein is equivalent to that faced in the area of robust statistics, in the sense that there is an attempt to learn a model from a dataset which contains valid data (the good images) but also outliers (the intermediate and junk images) which cannot be fitted by the model. Consequently, a robust fitting algorithm, RANSAC, may be adapted to the needs of the present invention. A set of images sufficient to train a model (10, in this case) is randomly sampled from the images retrieved during a database search. This model is then scored on the remaining images by the differential ranking measure explained above. The sampling process is repeated a sufficient number of times to ensure a good chance of a sample set consisting entirely of inliers (good images). The models of a category have been shown to be capable of being learnt from training sets containing large amounts of unrelated images (say up to 50% and beyond) and it is this ability that allows the present invention to handle the type of datasets returned by conventional Internet search engines. Further, in the present invention, as described above with respect to the two exemplary embodiments, the algorithm only requires images as its input, so the method and apparatus of the present invention can be used in conjunction with any existing search engine. Still further, it will be appreciated by a person skilled in the art that the present invention has as a significant advantage that it is scale invariant in its ability to retrieve/rank relevant images.
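The sampling loop can be sketched as follows. The callables train and score are placeholders for the model learning and differential ranking steps and are assumptions of this sketch, as are the function names and default arguments. The helper trials_needed uses the standard RANSAC estimate of how many samples are required, which treats the draws as independent and assumes an inlier fraction strictly between 0 and 1.

```python
import math
import random


def trials_needed(inlier_fraction, sample_size, confidence=0.99):
    """Standard RANSAC estimate of samples needed for one all-inlier set."""
    p_clean = inlier_fraction ** sample_size
    return math.ceil(math.log(1 - confidence) / math.log(1 - p_clean))


def ransac_select(images, junk, train, score, sample_size=10, trials=20, seed=0):
    """Repeatedly train on a random sample and keep the best-scoring model.

    `train(sample)` returns a model; `score(model, held_out, junk)` returns
    the differential ranking measure. Both are hypothetical callables.
    """
    rng = random.Random(seed)
    best_model, best_score = None, float("-inf")
    for _ in range(trials):
        chosen = set(rng.sample(range(len(images)), sample_size))
        model = train([images[i] for i in sorted(chosen)])
        held_out = [im for i, im in enumerate(images) if i not in chosen]
        s = score(model, held_out, junk)
        if s > best_score:
            best_model, best_score = model, s
    return best_model, best_score
```

With samples of 10 images and half of the dataset being good images, the estimate from trials_needed runs to several thousand samples, which is consistent with the observation that models can in practice be learnt from samples that are not entirely free of outliers.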
Two specific exemplary embodiments of the invention have been described: in the first, a user is required to spend a limited amount of time (say 20-30 seconds) selecting a small proportion of images of which they require examples (i.e. a simple form of relevance feedback or supervised learning), as illustrated in Figure 1; in the second, there is no requirement for user intervention in the learning (i.e. it is completely unsupervised), as illustrated in Figure 2.
The speed of the algorithm is of great practical importance: web-usage studies show that users are prepared to wait only a few seconds for a web-page to load. The timings given below are for a 3GHz machine.
In the case of the Internet search engine application, a large set of category keywords can be automatically obtained by choosing the most commonly searched for image categories (information that existing search engines can easily compile).
In the unsupervised learning case, since no user input is required, everything can be pre-computed off-line for this set of category keywords. Therefore there is no time penalty for the algorithm. Although the off-line computation may take some time (perhaps even several days, depending on the number of models learnt in the RANSAC approach), it only needs to be done once.
In the supervised learning case the situation is harder. Once the user has selected a few images, several models (corresponding to different combinations of feature types) must be learnt and then those models must be run over the entire dataset (~1000 images), all within a few seconds. To make this possible the following measures are undertaken:
(i) extract features from all images in the dataset off-line and store them; this only needs to be done once;
(ii) learn the different models in parallel;
(iii) run the different models over the entire dataset in parallel.
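Measure (iii) can be sketched as below, assuming that the stored features for each image can be passed to a model as a single object and that each model is a callable returning a score. A thread pool is used only to keep the sketch short; for CPU-bound scoring in CPython a process pool or separate machines would be the more realistic choice. The name rank_with_models is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def rank_with_models(models, features):
    """Rank all stored feature sets under every model, one task per model.

    Returns, for each model, the image indices ordered from highest to
    lowest score.
    """
    def run(model):
        scores = [model(f) for f in features]
        return sorted(range(len(features)),
                      key=lambda i: scores[i], reverse=True)

    with ThreadPoolExecutor() as pool:
        # pool.map preserves the order of `models` in its results
        return list(pool.map(run, models))
```

Because the features are extracted and stored once under measure (i), each task in this sketch only has to evaluate its model, so the elapsed time is governed by the slowest single model rather than by the sum over all models.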
These measures mean that the speed bottle-necks are dependent on how quickly a model can be learnt and how quickly it can be used to evaluate an image. With the current non-optimized development implementation, the whole process takes around a minute, but with professional grade coding and optimisation this can be reduced to a few seconds.
Again, the choice of category keyword (needed for (i) above) can be automatically selected by choosing the most commonly searched for categories.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be capable of designing many alternative embodiments without departing from the scope of the invention as defined by the appended claims. In the claims, any reference signs placed in parentheses shall not be construed as limiting the claims. The words "comprising" and "comprises", and the like, do not exclude the presence of elements or steps other than those listed in any claim or the specification as a whole. The singular reference of an element does not exclude the plural reference of such elements and vice-versa. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a device claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.

Claims

1. Apparatus for determining the relevance of a set of images retrieved from a database relative to a specified visual object category, the apparatus comprising means for transforming a visual object category into a model defining features of said visual object category and a spatial relationship therebetween, means for storing said model, means for comparing a set of images identified during said database search with said stored model and calculating a likelihood value relating to each image based on its correspondence with said model.
2. Apparatus according to claim 1, wherein the means for comparing an image with the model includes means for identifying features of the image and estimating the probability densities of parameters of said features to determine a maximum likelihood description of the image.
3. Apparatus according to claim 1 or claim 2, further including means for ranking said images in order of said respective likelihood values; and/or for retrieving further images corresponding to said specified visual object category.
4. Apparatus according to any one of claims 1 to 3, wherein said features comprise at least two types of parts of an object.
5. Apparatus according to claim 4, wherein said types include pixel patches, curve segments, corners and texture.
6. Apparatus according to any one of claims 1 to 5, wherein each feature is represented by one or more parameters, which parameters include its appearance and/or geometry, its scale relative to the model, and its occlusion probability.
7. Apparatus according to claim 6, wherein said parameters are modelled by probability density functions.
8. Apparatus according to claims 1 to 7, wherein said set of images is obtained during a database search.
9. Apparatus according to any one of claims 1 to 8, further including means for selecting a sub-set of said set of images, and creating the model from said subset of images.
10. Apparatus according to claim 9, wherein said sub-set of images may be selected by a user.
11. Apparatus according to any one of claims 1 to 8, wherein substantially all of the images in said set of images are used to create the model.
12. Apparatus according to any one of claims 1 to 11, wherein at least two different models, each comprising one or a combination of feature types, are created in respect of said set of images.
13. Apparatus according to claim 12, further including means for selecting one of said at least two models for use by said comparing means.
14. Apparatus according to claim 13, wherein said selecting means is arranged to calculate a differential ranking measure in respect of each model, and select the model having the largest differential ranking measure.
15. A method for determining the relevance of images retrieved from a database for a specified visual object category, the method comprising transforming a visual object category into a model defining features of said visual object category and a spatial relationship therebetween, storing said model, comparing a set of images retrieved from said database with said stored model and calculating a likelihood value relating to each image based on its correspondence with said model.

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2007505620A JP2007531136A (en) 2004-03-31 2005-03-11 Method and apparatus for extracting visual object categories from a database with images
EP05729251A EP1730658A1 (en) 2004-03-31 2005-03-11 Method and apparatus for retrieving visual object categories from a database containing images

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB0407252A GB2412756A (en) 2004-03-31 2004-03-31 Method and apparatus for retrieving visual object categories from a database containing images
GB0407252.6 2004-03-31

Publications (2)

Publication Number Publication Date
WO2005096178A1 true WO2005096178A1 (en) 2005-10-13
WO2005096178A8 WO2005096178A8 (en) 2006-02-09

Family

ID=32247573

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2005/001124 WO2005096178A1 (en) 2004-03-31 2005-03-11 Method and apparatus for retrieving visual object categories from a database containing images

Country Status (4)

Country Link
EP (1) EP1730658A1 (en)
JP (1) JP2007531136A (en)
GB (1) GB2412756A (en)
WO (1) WO2005096178A1 (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6240424B1 (en) * 1998-04-22 2001-05-29 Nbc Usa, Inc. Method and system for similarity-based image classification
FR2779848B1 (en) * 1998-06-15 2001-09-14 Commissariat Energie Atomique INVARIANT METHOD OF INDEXING AN IMAGE USING FRACTAL CHARACTERIZATIONS AND AT TIMES
US7200270B2 (en) * 2001-12-13 2007-04-03 Kabushiki Kaisha Toshiba Pattern recognition apparatus and method using distributed model representation of partial images
US20030123737A1 (en) * 2001-12-27 2003-07-03 Aleksandra Mojsilovic Perceptual method for browsing, searching, querying and visualizing collections of digital images

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
DESELAERS T ET AL: "Features for Image Retrieval Clustering Visually Similar Images to Improve Image Search Engines", PROCEEDINGS INFORMATIKTAGE 2003 DER GESELLSCHAFT FÜR INFORMATIK, BAD SCHUSSENRIED, GERMANY, November 2003 (2003-11-01), XP002333390, Retrieved from the Internet <URL:http://www-i6.informatik.rwth-aachen.de/~deselaers/files/deselaers-clustering.pdf> *
FERGUS R ET AL.: "A Visual Category Filter for Google Images", ECCV 2004: 8TH EUROPEAN CONFERENCE ON COMPUTER VISION, PRAGUE, CZECH REPUBLIC, MAY 11-14, 2004. PROCEEDINGS, vol. 3021, 11 May 2004 (2004-05-11), LECTURE NOTES IN COMPUTER SCIENCE, SPRINGER-VERLAG, pages 242 - 256, XP002333388, ISBN: 3-540-21984-6 *
FERGUS R ET AL: "Object class recognition by unsupervised scale-invariant learning", PROCEEDINGS 2003 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION. CVPR 2003. MADISON, WI, JUNE 18 - 20, 2003, vol. 2, 18 June 2003 (2003-06-18), IEEE CS, USA, pages 264 - 271, XP010644682, ISBN: 0-7695-1900-8 *
HONG-JIANG ZHANG: "Learning semantics in content based image retrieval", ISPA 2003: PROCEEDINGS OF THE 3RD INTERNATIONAL SYMPOSIUM ON IMAGE AND SIGNAL PROCESSING AND ANALYSIS, 2003. ROME, ITALY SEPT. 18-20, 2003, vol. 1, 18 September 2003 (2003-09-18), IEEE, USA, pages 284 - 288, XP010703873, ISBN: 953-184-061-X *
LAVRENKO V ET AL: "A Model for Learning the Semantics of Pictures", PROCEEDINGS NIPS 2003 ANNUAL CONFERENCE ON NEURAL INFORMATION PROCESSING SYSTEMS, DECEMBER 9, 2003 WHISTLER, BC, CANADA, 9 December 2003 (2003-12-09), MIT PRESS, pages 1 - 8, XP002333389, Retrieved from the Internet <URL:http://web.archive.org/web/20040202121612/http://books.nips.cc/nips16.html> *
SU ZHONG ET AL: "Relevance feedback using a Bayesian classifier in content-based image retrieval", PROCEEDINGS OF SPIE, vol. 4315, December 2000 (2000-12-01), pages 97 - 106, XP002333392 *
VASCONCELOS N ET AL: "Learning from User Feedback in Image Retrieval Systems", PROCEEDINGS NIPS 1999 ANNUAL CONFERENCE ON NEURAL INFORMATION PROCESSING SYSTEMS, DENVER, CO, USA, 29 NOVEMBER - 4 DECEMBER 1999, 29 November 1999 (1999-11-29), MIT PRESS, pages 977 - 983, XP002333391, Retrieved from the Internet <URL:http://www.svcl.ucsd.edu/publications> *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1936536A3 (en) * 2006-12-22 2012-05-09 Palo Alto Research Center Incorporated System and method for performing classification through generative models of features occuring in an image
JP2008159056A (en) * 2006-12-22 2008-07-10 Palo Alto Research Center Inc Classification through generative model of feature occurring in image
EP2300947A2 (en) * 2008-06-16 2011-03-30 Microsoft Corporation Adaptive visual similarity for text-based image search results re-ranking
CN102144231A (en) * 2008-06-16 2011-08-03 微软公司 Adaptive visual similarity for text-based image search results re-ranking
EP2300947A4 (en) * 2008-06-16 2012-09-05 Microsoft Corp Adaptive visual similarity for text-based image search results re-ranking
US8364462B2 (en) 2008-06-25 2013-01-29 Microsoft Corporation Cross lingual location search
US8457441B2 (en) 2008-06-25 2013-06-04 Microsoft Corporation Fast approximate spatial representations for informal retrieval
EP2350884A4 (en) * 2008-10-24 2012-11-07 Yahoo Inc Digital image retrieval by aggregating search results based on visual annotations
EP2350884A2 (en) * 2008-10-24 2011-08-03 Yahoo! Inc. Digital image retrieval by aggregating search results based on visual annotations
US8527564B2 (en) 2010-12-16 2013-09-03 Yahoo! Inc. Image object retrieval based on aggregation of visual annotations
WO2014151035A1 (en) * 2013-03-15 2014-09-25 Toyota Motor Engineering & Manufacturing North America, Inc. Computer-based method and system of dynamic category object recognition
US9111348B2 (en) 2013-03-15 2015-08-18 Toyota Motor Engineering & Manufacturing North America, Inc. Computer-based method and system of dynamic category object recognition
GB2529427A (en) * 2014-08-19 2016-02-24 Cortexica Vision Systems Ltd Image processing
GB2529427B (en) * 2014-08-19 2021-12-08 Zebra Tech Corp Processing query image data

Also Published As

Publication number Publication date
JP2007531136A (en) 2007-11-01
EP1730658A1 (en) 2006-12-13
GB0407252D0 (en) 2004-05-05
GB2412756A (en) 2005-10-05
WO2005096178A8 (en) 2006-02-09

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SM SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): BW GH GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
CFP Corrected version of a pamphlet front page
CR1 Correction of entry in section i

Free format text: IN PCT GAZETTE 41/2005 UNDER (72, 75) REPLACE "PERANA, PIETRO " BY "PERONA, PIETRO"

WWE Wipo information: entry into national phase

Ref document number: 2005729251

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 2007505620

Country of ref document: JP

NENP Non-entry into the national phase

Ref country code: DE

WWW Wipo information: withdrawn in national office

Country of ref document: DE

WWP Wipo information: published in national office

Ref document number: 2005729251

Country of ref document: EP