WO2012156774A1 - Method and apparatus for detecting visual words which are representative of a specific image category - Google Patents

Method and apparatus for detecting visual words which are representative of a specific image category Download PDF

Info

Publication number
WO2012156774A1
Authority
WO
WIPO (PCT)
Prior art keywords
visual
visual words
category
training set
reference codebook
Prior art date
Application number
PCT/IB2011/001459
Other languages
French (fr)
Inventor
Simon Dolle
Ayoub Massoudi
Yannick ALLUSSE
Frédéric JAHARD
Alexandre Winter
Original Assignee
Ltu Technologies
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ltu Technologies filed Critical Ltu Technologies
Priority to PCT/IB2011/001459
Publication of WO2012156774A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V 10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V 10/464 Salient features, e.g. scale invariant feature transforms [SIFT] using a plurality of salient features, e.g. bag-of-words [BoW] representations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/211 Selection of the most significant subset of features
    • G06F 18/2113 Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation

Abstract

The present invention concerns a method for detecting visual words which are representative of a specific image category from a reference codebook obtained from visual local descriptors extracted from training images, said method comprising the following steps: - forming (1) a general training set (S) which gathers images belonging to multiple image categories, - forming (2) a specific category training set (D) which gathers images belonging to the specific image category to detect, - obtaining (3) a reference codebook from visual local descriptors extracted from images of said general training set (S), - computing (4) a distribution of visual words of the reference codebook over the specific category training set and a distribution of visual words of the reference codebook over the general training set (S), and - determining (5) which visual words of the reference codebook are representative of the specific image category by comparing the two distributions of visual words. The present invention also concerns an apparatus, a computer program and an information storage means.

Description

Method and apparatus for detecting visual words which are representative of a specific image category
The present invention concerns a method for detecting visual words which are representative of a specific image category from a visual codebook obtained from visual local descriptors extracted from training images.
Because of the rapid growth of networks and the fast evolution of content-sharing solutions, people are getting ever easier access to a variety of digital documents in an 'anytime, anywhere' fashion. This raises the challenge of data traceability and authentication.
Indeed, a document has both cultural and commercial value. On the one hand, from the cultural point of view, the traceability of a document can be of great interest in database search. This can be useful in history, the human sciences or any other research domain: it could be interesting, for example, to analyze the viewing statistics of a given political speech or documentary film. Traceability is thus an important feature that enriches documents by making their indexing and retrieval more efficient and exploitable. On the other hand, digital content producers, distributors and copyright holders are keen to guarantee the traceability of their content and to protect their commercial rights over distributed content. This legal aspect of document traceability is fundamental.
Without loss of generality, the aim of document traceability (or authentication) can be:
• Copyright enforcement: copyright holders need to know where and how their content is used.
• Monitoring: supervising a multimedia source (web, magazines, newspapers, TV...) and detecting any usage of a given document.
• Copy detection: identifying the original content. The main difference between monitoring and copy detection is the nature of the transformations that are managed. Indeed, monitoring usually handles soft distortions whereas copy detection handles strong attacks. Thus, copy detection is more general than monitoring.
A document may comprise text and/or images and/or video and/or any other multimedia content.
The traceability of a document which includes text means detecting text documents which belong to a specific category, i.e. documents which deal with the same subject matter. This is usually carried out using text-based matching, in which a list of words representative of the specific category is built. Such a list is usually called a list of stop words in the state of the art.
For example, L. Hao ("Automatic Identification of Stop Words in Chinese Text Classification," 2008 International Conference on Computer Science and Software Engineering, 2008, pp. 718-722) builds a list of stop words by gathering words which are both frequent in the text of a document and not correlated with existing document categories. The method implies the collection of large sets for various categories and is limited to building a list of stop words characterising a specific category of documents from text analysis.
W.J. Wilbur and K. Sirotkin ("The automatic identification of stop words," Journal of Information Science, vol. 18, 1992, p. 45) identify stop words as words having the same probability of occurring in a pair of related text documents as in a random pair. They collect pairs of related text documents and random pairs of documents and identify the words that are almost as frequent in the two collections. As they consider random pairs of documents to detect stop words, their method is unable to detect words which characterise a specific category; it can only detect general stop words.
In the context of images, traceability means the traceability of images but also of any other documents which include, for example, text. It is usually carried out using visual-word-based indexing and matching.
A visual word is an index of a reference codebook which gathers a finite set of visual words previously computed from a set of training images.
A set of visual words is standalone, i.e. it may be used independently of the documents from which this set has been computed. Visual words may thus be used for different purposes such as false positive removal in image or video retrieval, image or video retrieval recall improvement, image or video classification, image or video annotation, image or video segmentation and image or video clustering. This list of applications is not restrictive but is given only to illustrate the broad scope of the present invention in image retrieval and, more generally, in document retrieval.
J. Sivic and A. Zisserman ("Video Google: A Text Retrieval Approach to Object Matching in Videos," Computer Vision, IEEE International Conference on, Los Alamitos, CA, USA: IEEE Computer Society, 2003, p. 1470) propose a visual-word-based indexing and matching technique, called Inverse Document Frequency (IDF) weighting, where a histogram is built, each bin of which relates to a visual word i computed from a video and weighted by a coefficient c(i). The coefficient c(i) is designed to be high if the visual word i is rare in the reference codebook and low if it is frequent. This approach is useful for handling visual words that are common in all the images of a video, but it is inefficient for characterising a specific image category. To make an analogy with text retrieval, the IDF weighting scheme will correctly handle words like "the", "has" or "like", but category-specific words such as "gene" or "cell" will still have a high weight because these words are rare overall. This is problematic when searching for a document similar to another document already categorised as belonging to the biological domain, because all the documents containing these rare words would appear very similar.
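To make the weighting concrete, here is a minimal sketch of an IDF-style weighting in Python (assuming NumPy); it illustrates the general idea rather than the exact formula used by Sivic and Zisserman.

```python
import numpy as np

def idf_weights(word_document_counts, n_documents):
    """IDF-style weight per visual word: high for rare words, low for
    frequent ones. word_document_counts[j] is the number of images in
    which visual word j occurs. Standard IDF formula; the authors'
    exact variant may differ."""
    return np.log(n_documents / np.maximum(word_document_counts, 1))

# Example: word 0 occurs in 900 of 1000 images (common, low weight),
# word 1 in only 5 images (rare, high weight).
print(idf_weights(np.array([900, 5]), 1000))
```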
H. Jegou, M. Douze and C. Schmid ("On the burstiness of visual elements," Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, 2009, pp. 1169-1176) try to overcome the problem of recurrent visual words (burstiness) with a more sophisticated weighting. They modify the scoring scheme to take burstiness into account. Their modification takes place at two different levels: image-to-image comparison and image-to-base comparison. In both cases, they are unable to capture the specificity of one image category, and they never explicitly build a list of visual words relative to a specific image category.
The "Bag of Features" indexation (BoF) is another visual-word-based indexing and matching technique which transposes techniques that have been successfully employed in text retrieval to image retrieval.
The BoF indexation of an image I, as illustrated in Fig. 1, is a process that takes as input the image I and computes the set of visual words VWj it contains, their positions in the image I and possibly other relevant information. It implies the detection of interest points on the image, the extraction of patches, the computation of visual descriptors on each patch and their assignment to visual words of a reference codebook.
Detecting interest points means computing a set of points from the input image I. The points are usually characterized by their position but also by their scale. The aim of interest point detection is to detect points that are stable under global or local transformations. Given the image I, an interest point detector is applied to it and the resulting interest points are stored. There are many interest point detection algorithms in the state of the art: the DoG detector (David G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, 2004, pp. 91-110), Hessian-Affine, etc. (T. Tuytelaars and K. Mikolajczyk, "Local invariant feature detectors: a survey," Foundations and Trends® in Computer Graphics and Vision, vol. 3, 2008, pp. 177-280.)
Patch extraction consists in extracting small parts of the image around each interest point. It takes into account the positions of the points in the image I (as illustrated in Fig. 1 by dotted lines) but also their scales and possibly other parameters obtained from the interest point detection. The extracted patches pm are then resized to a fixed size.
A visual descriptor Vm is then computed by gathering a set of features that characterize an extracted patch pm. Visual descriptors are usually vectors of floats. They are meant to be compared with one another to determine how close the neighborhoods they represent are to each other. The comparison is often made with the Euclidean distance or the cosine similarity. A visual descriptor Vm should be both robust to transformations and discriminative. Many different visual descriptors are available in the state of the art: SIFT, GLOH, CSLBP (K. Mikolajczyk and C. Schmid, "A Performance Evaluation of Local Descriptors," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, 2005, pp. 1615-1630.)
The assignment consists in finding the visual words VWj of the reference codebook that are the most representative of a visual descriptor Vm. Classical algorithms choose the visual word VWj whose visual descriptor Vj is the closest to Vm; we then say that the visual descriptor Vm is assigned to the visual word VWj.
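As an illustration of the pipeline just described, here is a minimal sketch of BoF indexation. It assumes OpenCV's SIFT implementation (the text names the DoG detector and SIFT descriptors but prescribes no library), and the function name bof_index and the brute-force nearest-neighbour assignment are illustrative choices, not a prescribed implementation.

```python
import cv2  # assumption: OpenCV (>= 4.4) provides the DoG/SIFT implementation
import numpy as np

def bof_index(image_path, codebook):
    """Sketch of BoF indexation: interest points -> patches/descriptors
    -> assignment to the nearest visual word of the reference codebook.

    codebook: (J, 128) float array, one SIFT-space centroid per visual word.
    Returns the list of visual-word indices assigned to the image."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()            # DoG interest points + SIFT descriptors
    _, descriptors = sift.detectAndCompute(img, None)
    if descriptors is None:             # e.g. a uniform image with no keypoints
        return []
    # Hard assignment by Euclidean distance (brute force; fine for a sketch,
    # a KD-tree or approximate search would be used at scale).
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1).tolist()
```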
Once two images I and J have been indexed using the BoF indexation, a similarity score between them, based on histogram comparison, is computed. For this purpose, a histogram Hi (respectively Hj) of the visual words present in the image I (respectively J) is computed and normalized to unit length. Hi (and Hj) has size |C|, where |.| denotes the cardinality of the reference codebook. The similarity score between the two images I and J is then usually defined as the scalar product of Hi and Hj. Sometimes, the similarity score is refined by using a geometric consistency check which removes geometrically incoherent pairs of corresponding image parts.
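A minimal sketch of this histogram comparison, reusing the word assignments produced above; the geometric consistency check is omitted and the function name is illustrative.

```python
import numpy as np

def bof_similarity(words_i, words_j, codebook_size):
    """Similarity of two BoF-indexed images: scalar product of their
    unit-length visual-word histograms."""
    h_i = np.bincount(np.asarray(words_i, dtype=int),
                      minlength=codebook_size).astype(float)
    h_j = np.bincount(np.asarray(words_j, dtype=int),
                      minlength=codebook_size).astype(float)
    h_i /= np.linalg.norm(h_i) or 1.0   # normalize to unit length
    h_j /= np.linalg.norm(h_j) or 1.0   # (guard against an empty histogram)
    return float(h_i @ h_j)
```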
The BoF indexation is problematic for some specific image categories, such as faces in scanned documents where the documents contain essentially text and few and/or small images, because BoF indexing of such documents assigns very few visual words to the image parts. Most visual words relate to text parts and represent a black corner on a white background or a black edge on a white background, while the visual words assigned to image parts such as an eye, a nose or an upper lip are very common in face images. Thus, during the similarity score computation, many candidate corresponding pairs of image parts and text parts are generated, and most of these pairs are irrelevant for determining the category of the image included in the document. This is problematic since the cosine similarity is no longer relevant for sorting the images (false correspondences generate much noise). Given a scanned document I as a query, a set of scanned documents R as references and J a near duplicate of I, J ∈ R, many images in R will have approximately the same score as the scalar product of Hi and Hj. This is also problematic when the similarity score takes into account the spatial position of visual words, because it is often possible to find a spatially coherent set of correspondences among the large set of possible correspondences.
This phenomenon reduces both the recall of the algorithm (we are likely to miss the images we are looking for) and its precision (irrelevant images are more likely to be returned). M. Marszalek ("Past the limits of bag-of-features," PhD thesis, 2008, Institut Polytechnique de Grenoble) proposes a method for detecting visual words which are representative of a specific image category. The author considers an image database which is split into training sets, each gathering the images belonging to a specific category. A codebook of visual words is learnt for each of these training sets and the codebooks are concatenated to get a general reference codebook for the image database. The method thus allows an image database to be clustered into multiple sets of specific image categories and yields visual words which are representative of each of these image categories, but it does not allow any new specific image category to be detected from the image database without a full re-indexation of the whole database.
The problem solved by the present invention is to remedy the above-cited drawbacks.
To this aim, the present invention concerns a method which comprises the following steps:
- forming a general training set which gathers images belonging to multiple image categories,
- forming a specific category training set which gathers images belonging to a specific image category to detect,
- obtaining a reference codebook from visual local descriptors extracted from images of said general training set,
- computing a distribution of visual words of the reference codebook over the specific category training set and a distribution of visual words of the reference codebook over the general training set, and
- determining which visual words of the reference codebook are representative of the specific image category by comparing the two distributions of visual words.
The method is advantageous compared to the state of the art because visual words which are specific to an image category are detected from a reference codebook without requiring the re-computation of this reference codebook. Thus, when a new specific training set has been formed, visual words representative of a new image category are detected from this reference codebook without recomputing either the distribution of visual words of the reference codebook over the general training set or the visual words which are representative of the already identified image categories. Moreover, the computation of visual words for a specific image category, which is usually done off-line because the computing time is long, need only be done once.
Another advantage is that the computation of a new set of visual words to characterize an image category requires only a collection of images which belong to this category.
Moreover, the method allows specific processing of the visual words of the reference codebook during the detection of visual words, such as removal (occurrences of some visual words of the reference codebook are systematically rejected) or weighting (visual words of the reference codebook are assigned different weights to emphasize the meaning of some visual words).
According to another aspect, the present invention concerns an apparatus adapted to implement the aforementioned method.
The present invention also concerns, in at least one embodiment, a computer program that can be downloaded from a communication network and/or stored on a medium that can be read by a computer and run by a processor. This computer program comprises instructions for implementing the aforementioned methods in any one of their various embodiments, when said program is run by the processor.
The present invention also concerns an information storage means, storing a computer program comprising a set of instructions that can be run by a processor for implementing the aforementioned methods in any one of their various embodiments, when the stored information is read by a computer and run by a processor.
The characteristics of the invention mentioned above, as well as others, will emerge more clearly from a reading of the following description given in relation to the accompanying figures, amongst which:
Fig. 1 represents a diagram of the BoF indexing of an image,
Fig. 2 represents a diagram of the method according to the invention,
Fig. 3 represents a chronogram of an embodiment of the method,
Fig. 4 schematically represents architecture of an apparatus according to the present invention.
The present invention is about a method for detecting visual words which are representative of a specific image category, but it is also applicable, more generally, to detecting words which are representative of a specific document category; such words can be, for example, text words. Generally speaking, the method according to the present invention is applicable any time document matching is needed, or when a specific process shall be applied to specific documents or to specific parts of documents of a collection.
Fig. 2 represents a diagram of the method according to the invention.
The method, for detecting visual words which are representative of a specific image category from a reference codebook obtained from visual local descriptors extracted from training images, comprises the following steps:
- forming (1) a general training set S which gathers images belonging to multiple image categories,
- forming (2) a specific category training set D which gathers images belonging to the specific image category to detect,
- obtaining (3) a reference codebook from visual local descriptors extracted from images of said general training set S,
- computing (4) a distribution of visual words of the reference codebook over the specific category training set and a distribution of visual words of the reference codebook over the general training set S, and
- determining (5) which visual words of the reference codebook are representative of the specific image category by comparing the two distributions of visual words.
The general training set S and the specific category training set D are both relatively large (a typical size is 100,000 images).
The general training set S contains images that belong to different categories and/or are obtained from different devices, without any preference.
The specific category training set D contains images that belong to a specific image category to detect.
According to an embodiment, the general training set S and the specific category training set D can be built without any loss of generality from an image index of a search engine, from data of any image specialist and/or from a web crawl.
According to a characteristic of the invention, a reference codebook VCB is obtained from images of the general training set S using the BoF indexing described in the introduction. In the following, it is assumed that the cardinality of the reference codebook VCB equals J, i.e. the reference codebook VCB is formed of J visual words VWj. According to an embodiment, the visual words which are representative of the specific image category to detect are visual words which are much more frequent in the specific category training set D than in the general training set S.
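The text does not prescribe how the J visual words of VCB are learnt; the sketch below assumes the usual choice of k-means clustering over the local descriptors pooled from all images of the general training set S.

```python
import numpy as np
from sklearn.cluster import KMeans  # assumption: k-means, the standard way
                                    # to build a BoF codebook

def build_codebook(pooled_descriptors, n_words):
    """Learn a reference codebook VCB of n_words visual words VWj.

    pooled_descriptors: (N, 128) stack of all local descriptors extracted
    from the images of the general training set S."""
    km = KMeans(n_clusters=n_words, n_init=10, random_state=0)
    km.fit(pooled_descriptors)
    return km.cluster_centers_      # one centroid per visual word VWj
```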
According to a preferred embodiment, the distributions of visual words of the reference codebook VCB over the training sets are computed by:
- computing a first histogram Hs for the general training set S, where the value of bin j, Hs(j), is the frequency of the visual word VWj of the reference codebook VCB in this set;
- computing a second histogram Hd for the specific category training set D, where the value of bin j, Hd(j), is the frequency of the visual word VWj of the reference codebook VCB in this set.
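A minimal sketch of computing these two distributions, assuming every image of each training set has already been BoF-indexed as above; the variable names are illustrative.

```python
import numpy as np

def word_frequencies(indexed_images, codebook_size):
    """Histogram whose bin j is the frequency of visual word VWj over a
    training set. indexed_images: one visual-word index list per image,
    as returned by bof_index()."""
    counts = np.zeros(codebook_size)
    for words in indexed_images:
        counts += np.bincount(np.asarray(words, dtype=int),
                              minlength=codebook_size)
    return counts / max(counts.sum(), 1.0)

# Hs = word_frequencies(indexed_general_set_S, J)
# Hd = word_frequencies(indexed_specific_set_D, J)
```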
Then, the visual words of the reference codebook which are representative of the specific image category are determined by computing, for each bin j of the histograms, a score r(j) which indicates the likelihood of each visual word of the reference codebook VCB being representative of the specific image category to detect.
According to an embodiment, the two histograms Hs and Hd are compared bin by bin. For each visual word VWj of the reference codebook VCB, a score r(j) is computed. For instance, r(j) is defined by r(j) = Hd(j) / Hs(j). This quantity induces a ranking of the visual words of the reference codebook VCB according to their "affinity" with the specific image category to detect. The visual word with the highest score r is the one that is the most characteristic of the specific image category to detect. The set of visual words representative of this category is then built from this score r.
According to an embodiment, the visual words which are representative of the specific image category to detect are the visual words with a score r(j) greater than a given threshold TH or, alternatively, the N (an integer) visual words with the highest scores.
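A minimal sketch of the scoring and selection step. The score formula appears only as an embedded image in the source document; the ratio r(j) = Hd(j)/Hs(j) used below is an assumption, consistent with the surrounding text and with the Fig. 3 example (TH = 2 selects words roughly twice as frequent in D as in S).

```python
import numpy as np

def category_words(Hd, Hs, threshold=None, top_n=None, eps=1e-12):
    """Detect the visual words representative of the specific category.

    Assumes r(j) = Hd(j) / Hs(j); eps guards against empty bins of Hs."""
    r = Hd / np.maximum(Hs, eps)
    if threshold is not None:
        return np.flatnonzero(r > threshold)    # words with r(j) > TH
    return np.argsort(r)[::-1][:top_n]          # the N highest-scoring words

# Fig. 3 style usage: category_words(Hd, Hs, threshold=2) would return the
# bins (e.g. 2 and 6) where Hd exceeds twice Hs.
```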
Fig. 3 illustrates the former alternative. The cardinality of the reference codebook VCB equals 10 in this example. The black rectangles represent the values of bin j of Hs and the white rectangles represent the values of bin j of Hd. The threshold TH equals 2. In this example, the score for bins 2 and 6 is greater than TH. Consequently, the visual words VW2 and VW6 are detected as being representative of the specific image category to detect. According to another aspect, the present invention concerns an apparatus 40 which comprises means for implementing the aforementioned method for detecting visual words which are representative of a specific image category from a reference codebook VCB.
Fig. 4 schematically represents an architecture of the apparatus 40. According to the shown architecture, the apparatus 40 comprises the following components interconnected by a communications bus 410: a processor, microprocessor, microcontroller or CPU (Central Processing Unit) 400; a RAM (Random-Access Memory) 401; a ROM (Read-Only Memory) 402; and an HDD (Hard-Disk Drive) 403, or any other device adapted to read information stored on storage means.
CPU 400 is capable of executing instructions loaded into RAM 401 from ROM 402 or from an external memory, such as HDD 403. After the apparatus 40 has been powered on, CPU 400 is capable of reading instructions from RAM 401 and executing them. The instructions form one computer program that causes CPU 400 to perform some or all of the steps of the method disclosed in relation to Fig. 2. Any and all steps of this method may be implemented in software by the execution of a set of instructions or a program by a programmable computing machine, such as a PC (Personal Computer), a DSP (Digital Signal Processor) or a microcontroller, or else implemented in hardware by a machine or a dedicated component, such as an FPGA (Field-Programmable Gate Array) or an ASIC (Application-Specific Integrated Circuit).
According to an embodiment, the apparatus 40 also comprises at least one communication interface, for example a first wireless communication interface 404 and a wired communication interface 405.
The images which form the general training set S and/or the specific category training set D may be stored in the ROM, but they may also be, at least partially, obtained from a communication interface by querying, for example, an external database or by using a search engine.
The reference codebook VCB and/or the distributions of visual words and/or the visual words may be stored in the ROM or in external storage means. In the latter case, a communication interface is used to read or store them.

Claims

1) A method for detecting visual words which are representative of a specific image category from a reference codebook obtained from visual local descriptors extracted from training images, said method comprising the following steps:
- forming (1) a general training set (S) which gathers images belonging to multiple image categories,
- forming (2) a specific category training set (D) which gathers images belonging to the specific image category to detect,
- obtaining (3) a reference codebook from visual local descriptors extracted from images of said general training set (S),
- computing (4) a distribution of visual words of the reference codebook over the specific category training set and a distribution of visual words of the reference codebook over the general training set (S), and
- determining (5) which visual words of the reference codebook are representative of the specific image category by comparing the two distributions of visual words.
2) A method according to claim 1, wherein the visual words which are representative of the specific image category to detect are visual words which are much more frequent in the specific category training set (D) than in the general training set (S).
3) A method according to claim 2, in which computing the two distributions comprises:
- computing a first histogram (Hs) for the general training set, the value of a bin of which is the frequency of a visual word of the reference codebook in this set,
- computing a second histogram (Hd) for the specific category training set, the value of a bin of which is the frequency of a visual word of the reference codebook in this set, and
comparing comprises
- computing, for each bin of the histograms, a score which indicates the likelihood for each of the visual words of the reference codebook to be representative of the specific image category to detect.
4) A method according to claim 3, in which the visual words which are representative of the specific image category to detect are the visual words with a score higher than a given threshold or a predefined number (N) of visual words with the highest scores.
5) A method according to any one of the previous claims, wherein the specific training set (D) gathers images obtained from different devices, and/or from an image index of a search engine, and/or from a web crawl.
6) An apparatus for detecting visual words which are representative of a specific image category from a reference codebook obtained from visual local descriptors extracted from training images, said apparatus comprising means for:
- forming (1) a general training set (S) which gathers images belonging to multiple image categories,
- forming (2) a specific category training set (D) which gathers images belonging to the specific image category to detect,
- obtaining (3) a reference codebook from visual local descriptors extracted from images of said general training set (S),
- computing (4) a distribution of visual words of the reference codebook over the specific category training set and a distribution of visual words of the reference codebook over the general training set (S), and
- determining (5) which visual words of the reference codebook are representative of the specific image category by comparing the two distributions of visual words.
7) A computer program characterized in that it comprises program code instructions which can be loaded in a programmable apparatus for implementing the method according to any one of claims 1 to 6, when the program code instructions are run by the programmable apparatus.
8) Information storage means, characterized in that they store a computer program comprising program code instructions which can be loaded in a programmable apparatus for implementing the method according to any one of claims 1 to 6, when the program code instructions are run by the programmable apparatus.
PCT/IB2011/001459 2011-05-18 2011-05-18 Method and apparatus for detecting visual words which are representative of a specific image category WO2012156774A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/IB2011/001459 WO2012156774A1 (en) 2011-05-18 2011-05-18 Method and apparatus for detecting visual words which are representative of a specific image category

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/IB2011/001459 WO2012156774A1 (en) 2011-05-18 2011-05-18 Method and apparatus for detecting visual words which are representative of a specific image category

Publications (1)

Publication Number Publication Date
WO2012156774A1 (en)

Family

ID=44544285

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2011/001459 WO2012156774A1 (en) 2011-05-18 2011-05-18 Method and apparatus for detecting visual words which are representative of a specific image category

Country Status (1)

Country Link
WO (1) WO2012156774A1 (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1739593A1 (en) * 2005-06-30 2007-01-03 Xerox Corporation Generic visual categorization method and system

Non-Patent Citations (15)

* Cited by examiner, † Cited by third party
Title
BRIAN FULKERSON ET AL: "Localizing Objects with Smart Dictionaries", 12 October 2008, COMPUTER VISION - ECCV 2008; [LECTURE NOTES IN COMPUTER SCIENCE], SPRINGER BERLIN HEIDELBERG, BERLIN, HEIDELBERG, PAGE(S) 179 - 192, ISBN: 978-3-540-88681-5, XP019109136 *
DAVID G. LOWE: "Distinctive image features from scale-invariant keypoints", INTERNATIONAL JOURNAL OF COMPUTER VISION, vol. 60, no. 2, 2004, pages 91 - 110, XP019216426, DOI: doi:10.1023/B:VISI.0000029664.99615.94
FLORENT PERRONNIN ET AL: "Adapted Vocabularies for Generic Visual Categorization", 1 January 2006, COMPUTER VISION - ECCV 2006 LECTURE NOTES IN COMPUTER SCIENCE;;LNCS, SPRINGER, BERLIN, DE, PAGE(S) 464 - 475, ISBN: 978-3-540-33838-3, XP019036558 *
H. JEGOU, M. DOUZE, C. SCHMID: "On the burstiness of visual elements", COMPUTER VISION AND PATTERN RECOGNITION, 2009. CVPR 2009. IEEE CONFERENCE ON, 2009, pages 1169 - 1176
J. SIVIC, A. ZISSERMAN: "Video Google: A Text Retrieval Approach to Object Matching in Videos", COMPUTER VISION, IEEE INTERNATIONAL CONFERENCE ON, 2003, pages 1470, XP010662565, DOI: doi:10.1109/ICCV.2003.1238663
JIANG HAO ET AL: "Improved bags-of-words algorithm for scene recognition", SIGNAL PROCESSING SYSTEMS (ICSPS), 2010 2ND INTERNATIONAL CONFERENCE ON, IEEE, PISCATAWAY, NJ, USA, 5 July 2010 (2010-07-05), pages V2 - 279, XP031737959, ISBN: 978-1-4244-6892-8 *
K. MIKOLAJCZYK, C. SCHMID: "A Performance Evaluation of Local Descriptors", IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, vol. 27, 2005, pages 1615 - 1630
L. HAO: "Automatic Identification of Stop Words in Chinese Text Classification", 2008 INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE AND SOFTWARE ENGINEERING, 2008, pages 718 - 722, XP031376980
LEI WANG ED - PHILIP TUDDENHAM ET AL: "Toward A Discriminative Codebook: Codeword Selection across Multi-resolution", CVPR '07. IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION; 18-23 JUNE 2007; MINNEAPOLIS, MN, USA, IEEE, PISCATAWAY, NJ, USA, 1 June 2007 (2007-06-01), pages 1 - 8, XP031114604, ISBN: 978-1-4244-1179-5 *
M. MARSZALEK: "Past the limits of bag-of-features", PhD thesis, INSTITUT POLYTECHNIQUE DE GRENOBLE, 2008
SHILIANG ZHANG ET AL: "Descriptive visual words and visual phrases for image applications", PROCEEDINGS OF THE SEVENTEEN ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM '09, 1 January 2009 (2009-01-01), New York, New York, USA, pages 75, XP055007017, ISBN: 978-1-60-558608-3, DOI: 10.1145/1631272.1631285 *
T. TUYTELAARS, K. MIKOLAJCZYK: "Local invariant feature detectors: a survey", FOUNDATIONS AND TRENDS® IN COMPUTER GRAPHICS AND VISION, vol. 3, 2008, pages 177 - 280, XP002616097, DOI: doi:10.1561/0600000017
W.J. WILBUR, K. SIROTKIN: "The automatic identification of stop words", JOURNAL OF INFORMATION SCIENCE, vol. 18, 1992, pages 45
WINN J ET AL: "Object Categorization by Learned Universal Visual Dictionary", COMPUTER VISION, 2005. TENTH IEEE INTERNATIONAL CONFERENCE ON BEIJING, CHINA 17-20 OCT. 2005, PISCATAWAY, NJ, USA,IEEE, vol. 2, 17 October 2005 (2005-10-17), pages 1800 - 1807, XP010857031, ISBN: 978-0-7695-2658-4, DOI: 10.1109/ICCV.2005.171 *
WOJCIKIEWICZ ET AL: "Optimizing Visual Vocabularies for Image Classification", 17 November 2009 (2009-11-17), XP055007022 *

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8983201B2 (en) * 2012-07-30 2015-03-17 Microsoft Technology Licensing, Llc Three-dimensional visual phrases for object recognition
US20140029856A1 (en) * 2012-07-30 2014-01-30 Microsoft Corporation Three-dimensional visual phrases for object recognition
WO2015017796A2 (en) 2013-08-02 2015-02-05 Digimarc Corporation Learning systems and methods
US9832353B2 (en) 2014-01-31 2017-11-28 Digimarc Corporation Methods for encoding, decoding and interpreting auxiliary data in media signals
US11170261B2 (en) 2014-02-13 2021-11-09 Nant Holdings Ip, Llc Global visual vocabulary, systems and methods
US10521698B2 (en) 2014-02-13 2019-12-31 Nant Holdings Ip, Llc Global visual vocabulary, systems and methods
US9922270B2 (en) 2014-02-13 2018-03-20 Nant Holdings Ip, Llc Global visual vocabulary, systems and methods
US10049300B2 (en) 2014-02-13 2018-08-14 Nant Holdings Ip, Llc Global visual vocabulary, systems and methods
US11348678B2 (en) 2015-03-05 2022-05-31 Nant Holdings Ip, Llc Global signatures for large-scale image recognition
US9721186B2 (en) 2015-03-05 2017-08-01 Nant Holdings Ip, Llc Global signatures for large-scale image recognition
US10565759B2 (en) 2015-03-05 2020-02-18 Nant Holdings Ip, Llc Global signatures for large-scale image recognition
US10042038B1 (en) 2015-09-01 2018-08-07 Digimarc Corporation Mobile devices and methods employing acoustic vector sensors
US10594689B1 (en) 2015-12-04 2020-03-17 Digimarc Corporation Robust encoding of machine readable information in host objects and biometrics, and associated decoding and authentication
US11979399B2 (en) 2015-12-04 2024-05-07 Digimarc Corporation Robust encoding of machine readable information in host objects and biometrics, and associated decoding and authentication
US11102201B2 (en) 2015-12-04 2021-08-24 Digimarc Corporation Robust encoding of machine readable information in host objects and biometrics, and associated decoding and authentication
US10853903B1 (en) 2016-09-26 2020-12-01 Digimarc Corporation Detection of encoded signals and icons
US10803272B1 (en) 2016-09-26 2020-10-13 Digimarc Corporation Detection of encoded signals and icons
US11257198B1 (en) 2017-04-28 2022-02-22 Digimarc Corporation Detection of encoded signals and icons
US11922532B2 (en) 2020-01-15 2024-03-05 Digimarc Corporation System for mitigating the problem of deepfake media content using watermarking
WO2022148372A1 (en) * 2021-01-05 2022-07-14 瞬联软件科技(南京)有限公司 Visual phrase construction method and apparatus based on image feature space and spatial-domain space
CN113255493A (en) * 2021-05-17 2021-08-13 南京信息工程大学 Video target segmentation method fusing visual words and self-attention mechanism
CN113255493B (en) * 2021-05-17 2023-06-30 南京信息工程大学 Video target segmentation method integrating visual words and self-attention mechanism

Similar Documents

Publication Publication Date Title
WO2012156774A1 (en) Method and apparatus for detecting visual words which are representative of a specific image category
Sivic et al. Video Google: Efficient visual search of videos
Grauman et al. Efficient image matching with distributions of local invariant features
Alkhawlani et al. Content-based image retrieval using local features descriptors and bag-of-visual words
Jenni et al. Content based image retrieval using colour strings comparison
Jiang et al. Randomized visual phrases for object search
US20160188633A1 (en) A method and apparatus for tracking microblog messages for relevancy to an entity identifiable by an associated text and an image
Chu et al. Image Retrieval Based on a Multi‐Integration Features Model
Letessier et al. Scalable mining of small visual objects
Walia et al. An effective and fast hybrid framework for color image retrieval
Nunes et al. Shape based image retrieval and classification
Vieux et al. Content based image retrieval using bag-of-regions
Kalaiarasi et al. Clustering of near duplicate images using bundled features
Paisitkriangkrai et al. Scalable clip-based near-duplicate video detection with ordinal measure
Úbeda et al. Improving pattern spotting in historical documents using feature pyramid networks
Sun et al. Similar manga retrieval using visual vocabulary based on regions of interest
Ghanmi et al. A new descriptor for pattern matching: application to identity document verification
Bakić et al. Inria IMEDIA2's participation at ImageCLEF 2012 plant identification task
JP6017277B2 (en) Program, apparatus and method for calculating similarity between contents represented by set of feature vectors
Le et al. Improving logo spotting and matching for document categorization by a post-filter based on homography
JP5833499B2 (en) Retrieval device and program for retrieving content expressed by high-dimensional feature vector set with high accuracy
Kavya Feature extraction technique for robust and fast visual tracking: a typical review
Bakheet et al. Content-based image retrieval using brisk and surf as bag-of-visual-words for naïve Bayes classifier
JP5959446B2 (en) Retrieval device, program, and method for high-speed retrieval by expressing contents as a set of binary feature vectors
Bhat et al. An Insight into Content-Based Image Retrieval Techniques, Datasets, and Evaluation Metrics

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11743332

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 11743332

Country of ref document: EP

Kind code of ref document: A1