WO2012156774A1

WO2012156774A1 - Method and apparatus for detecting visual words which are representative of a specific image category

Info

Publication number: WO2012156774A1
Application number: PCT/IB2011/001459
Authority: WO
Inventors: Simon Dolle; Ayoub Massoudi; Yannick ALLUSSE; Frédéric JAHARD; Alexandre Winter
Original assignee: Ltu Technologies
Priority date: 2011-05-18
Filing date: 2011-05-18
Publication date: 2012-11-22

Abstract

The present invention concerns a method for detecting visual words which are representative of a specifier image category from a reference codebook obtained from visual local descriptors extracted from training images, said method comprising the folio wing steps: - forming (1) a general training set (S) which gathers images belonging to multiple image categories, - forming (2) a specific category training set (D) which gathers images belonging to the specifier image category to detect, - obtaining (3) a reference codebook from visual local descriptors extracted from images of said general training sets (S), - Computing (4) a distribution of visual words of the reference codebook over the specifier category training set and a distribution of visual words of the reference codebook over the general training set (S), and - determining (5) which visual words of the reference codebook are representative of the specific image category by comparing the two distributions of visual words. The present invention concerns also an apparatus, a computer program and an information storage means.

Description

Method and apparatus for detecting visual words which are representative of a specific image category

The present invention concerns a method for detecting visual words which are representative of a specific image category from a visual codebook obtained from visual local descriptors extracted from training images.

Because of the rapid growth of networks and the fast evolution of content sharing solutions, people are getting more and more easy access to a variety of numerical documents in an 'Anytime, Anywhere'. This fact raises the challenge of data traceability and authentication.

Indeed, a document has cultural and commercial value at the same time. On one hand, from the cultural point of view, the traceability of document can be of great interest in database search. This can be useful in history, human science or any other research domain. It could be interesting, for example, to analyze viewing statistics of a given political speech or documentary film... Thus traceability is an important feature that contributes to the enrichment of documents by making its indexing and retrieval more efficient and exploitable. On the other hand, digital content producers, distributors as well as copyright holders are careful to guarantee the traceability of their content and protect their commercial rights on distributed content. This legal aspect of documents traceability is fundamental.

Without loss of generality, the aim of document traceability (or authentication) can be:

· Copyright enforcement: copyright holders need to know where and how their content is used.

• Monitoring: it consists in supervising a multimedia source (web, magazines, newspapers, TV...) and detect any usage of a given document.

• Copy detection: copy detection consists in identifying the original content. The main difference between monitoring and copy detection is the nature of transformations that are managed. Indeed, monitoring usually handles soft distortions whereas copy detection handles strong attacks. Thus, copy detection is more general than monitoring.

A document may comprise text and/or images and/or video and/or any other multimedia content.

The traceability of document which includes text means detecting text documents which belong to a specific category, i.e. documents which talk about a same subject-matter. This is usually carried out using text-based matching in which a list of words which are representative of the specific category is built. Such a list is usually called list of stop words in the state of the art.

For example, L. Hao ("Automatic Identification of Stop Words in Chinese Text Classification," 2008 International Conference on Computer Science and Software Engineering, 2008, pp. 718-722) builds a list of stop words by gathering words which are both frequent in the text of a document and not correlated with existing document categories. The method implies the collection of large sets for various categories and is limited to build a list of stop words for characterising a specific category of documents from text analysis.

W.J. Wilbur and K. Sirotkin ("The automatic identification of stop words," Journal of information science, vol. 18, 1992, p. 45) identify stop words as words having the same probability of occurring in a pair of related text documents as in random pairs. They collect pairs of related text documents and random pairs of documents and identify words that are almost as frequent in both these two collections. As they consider random pair of documents to detect stop words, their method is unable to detect words which characterise a specific category. It is only able to detect general stop words.

Traceability means traceability of images but also traceability of any other documents which include, for example, text. Traceability is usually carried out using visual-word-based indexing and matching.

A visual word is an index of a reference codebook which gathers a finite set of visual words previously computed from a set of training images.

A set of visual words is standalone, i.e. it may be used independently of the documents from which this set has been computed. Visuals words may thus be used for different purposes such as false positives removal in image or video retrieval, image or video retrieval recall improvement, image or video classification, image or video annotation, image or video segmentation and image or video clustering. This list of applications is not restrictive but is given only to illustrate the broad scope of the present invention in image and more generally in document retrieval.

J. Sivic and A. Zisserman ("Video Google: A Text Retrieval Approach to Object

Matching in Videos," Computer Vision, IEEE International Conference on, Los Alamitos, CA, USA: IEEE Computer Society, 2003, p. 1470) propose a visual-word- based indexing and matching techniques, called the Inverse Document Frequency weighting (IDF), where an histogram is built, each bin of which being relative to a visual word i computed from a video and weighted by a coefficient c(i). The coefficient c(i) is designed to be high if the visual word i is rare in a reference codebook and low if the visual word i is frequent in the reference codebook. This approach is useful to handle visual words that are common in all the images of a video but is inefficient to characterise a specific image category. To make an analogy with text retrieval, the IDF weighting scheme will handle correctly words like "the", "has", "like" but category specific words such as "gene" or "cell" will still have a high weight because these words are rare in a text. This will be problematic if a document, similar to another document, already categorised as belonging to biological domain, is looking for because all the documents, which would contain rare words, would be very similar .

H. Jegou, M. Douze, and C. Schmid, ("On the burstiness of visual elements," Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, 2009, pp. 1169-1176) try to overcome the problems of recurrent visual words (burstiness) with a more sophisticated weighting. They modify the scoring scheme to take burstiness into account. Their modification takes place at two different levels: image to image comparison and image to base comparison. In both cases, they are unable to capture the specificity of one image category and they never explicitly build a list of visual words which is relative to a specific image category.

The "Bag of Features" indexation (BoF) is another visual-word-based indexing and matching technique which transposes techniques that have been successfully employed in text retrieval to image retrieval.

The BoF indexation of an image I, as illustrated in Fig. 1, is a process that takes as input the image I and computes a set of visual words VWj it contains, their position in the image I and eventually other relevant information. It implies the detection of interest points on the image, the extraction of patches, the computation of visual descriptors on each patch and their assignment to visual words of a reference codebook.

Detecting interest points means computing a set of points from the input image I. The points are usually characterized by their position but also by their scale. The aim of the interest point detection is to detect points that are stable under global or local transformation. Given the image I, an interest point detector is applied on it and the resulting interest points are stored. There are many different interest point detector algorithms in the state of the art : DoG detector (David G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, 60, 2 (2004), pp. 91-110), Hessian Affme... (T. Tuytelaars and K. Mikolajczyk, "Local invariant feature detectors: a survey," Foundations and Trends® in Computer Graphics and Vision, vol. 3, 2008, pp. 177-280.)

Patch extraction consists in extracting small parts of the image around each interest point. Patch extraction takes into account the positions of the points in the image I (as illustrated in Fig. 1 by dotted lines) but also its scale and eventually other parameters obtained from the interest point detection. The extracted patches pm are then resized to a fixed size.

A visual descriptor Vm is then computed by gathering a set of features that characterize an extracted patch pm. Visual descriptors are usually vector of floats. They are meant to be compared together to determine how close the neighborhood they represent are close to each other. The comparison is often made with the Euclidean distance or the cosine similarity. A visual descriptor Vm should be both robust to transformations and discriminative. Many different visual descriptors are available in the state of the art : SIFT, GLOH, CSLBP (K. Mikolajczyk and C. Schmid, "A Performance Evaluation of Local Descriptors," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, 2005, pp. 1615-1630.)

The assignment consists in finding the visual words VWj of the reference codebook that are the most representative of a visual descriptor Vm. Classical algorithms choose the visual words VWj which visual descriptors Vj are the closest to Vm. We say that the visual descriptor Vm is assigned to the visual word VWj.

Once two images I and J have been indexed using the BoF indexation, a similarity score between them, based on histogram comparison, is computed. For this purpose, a histogram Hi (respectively Hj) of visual words present in the image I (respectively J) is computed and normalized to unit length. Hi (and Hj) has size |c|, where |.| designates the cardinality of the reference codebook. The similarity score between the two images I and J is then usually defined as the scalar product of Hi and Hj. Sometimes, the similarity score is refined by using a geometric consistency check which removes the geometrically incoherent pairs of corresponding image parts.

The BoF indexation is problematic in some specific image categories such as faces of scanned documents where the documents contain essentially texts and few and/or small images because BoF indexing such documents leads to assign very few visual words relative to images. Mainly visual words are relative to text parts and represent black corner on a white background, or black edge on a white background, while visual words assigned to image parts such as eye, nose, upper lip are very common in face images. Thus, during the similarity score computation, a lot of possible corresponding pairs of image parts and text parts are generated and most of these pairs are irrelevant to determine the category of the image included in this document. This is problematic since the cosine similarity is no more relevant to sort the images (false correspondences generate much noise). Given a scanned document I as a query, a set of scanned documents R as references and J a near duplicate of I, j £R, many images in R will have approximately the same score as Hj-Hy. This is also problematic when the similarity score takes into account the spatial position of visual words because it is often possible to find a spatially coherent set of correspondences among the large set of possible correspondences.

This phenomenon reduces both the recall of the algorithm (we are likely to miss the images to find) and its precision (irrelevant images are more likely to be returned). M. Marszalek ("Past the limits of bag-of-features," rapport de these, 2008, Institut Polytechnique de Grenoble) proposes a method for detecting visual words which are representative of a specific image category. The author considers an image database which is split into training sets which gather each image belonging to a specific category. A codebook of visual words for each of this training set is learnt and the codebooks are concatenated to get a general reference codebook of the image database. The method allows thus to cluster an image database into multiple sets of specific image categories and to get visual words which are representative of each of these image categories, but this method does not allow to detect any new specific image category from the image database without a full re -indexation of the whole database.

The problem solved by the present invention is to remedy the above-cited drawbacks.

To this aim, the present invention concerns a method which comprises the following steps:

- forming a general training set which gathers images belonging to multiple image categories,

- forming a specific category training set which gathers images belonging to a specific image category to detect,

- obtaining a reference codebook from visual local descriptors extracted from images of said general training sets,

- computing a distribution of visual words of the reference codebook over the specific category training set and a distribution of visual words of the reference codebook over the general training set, and

- determining which visual words of the reference codebook are representative of the specific image category by comparing the two distributions of visual words.

The method is advantageous compared to the state of the art because visual words which are specific to an image category are detected from a reference codebook without requiring the re-computation of this reference codebook. Thus, when a new specific training set has been formed, visual words representative of a new image category are detected from this reference codebook without computing again neither the distribution of visual words of the reference codebook over the general training set nor the visuals words which are representative of the already identified image categories. Moreover, the computation of visual words for a specific image category, which is usually done off-line because the computing time is long, shall be done only once.

Another advantage is that the computation of a new set of visual words to characterize an image category requires only a collection of images which belong to this category.

Moreover, the method allows specific processing of visual words of the reference codebook during the detection of visual words such as removal (occurrences of some visual words of the reference codebook are rejected systematically) or weighting (visual word of the reference codebook are associated to different weight to emphasize the meaning of some visual words).

According to another aspect, the present invention concerns an apparatus adapted to implement the aforementioned method.

The present invention also concerns, in at least one embodiment, a computer program that can be downloaded from a communication network and/or stored on a medium that can be read by a computer and run by a processor. This computer program comprises instructions for implementing the aforementioned methods in any one of their various embodiments, when said program is run by the processor.

The present invention also concerns an information storage means, storing a computer program comprising a set of instructions that can be run by a processor for implementing the aforementioned methods in any one of their various embodiments, when the stored information is read by a computer and run by a processor.

The characteristics of the invention mentioned above, as well as others, will emerge more clearly from a reading of the following description given in relation to the accompanying figures, amongst which:

Fig. 1 represents a diagram of the BoF indexing of an image,

Fig. 2 represents a diagram of the method according to the invention,

Fig. 3 represents a chronogram of an embodiment of the method,

Fig. 4 schematically represents architecture of an apparatus according to the present invention.

The present invention is about a method for detecting visual words which are representative of a specific image category but is also applicable for detecting words more generally which are representative of a specific document category, words can be seen for example as text words. Generally speaking, the method according to the present invention is applicable any time document matching is needed or when a specific process shall be applied on specific documents or on specific parts of documents of a collection.

Fig. 2 represents a diagram of the method according to the invention.

The method, for detecting visual words which are representative of a specific image category from a reference codebook obtained from visual local descriptors extracted from training images, comprises the following steps:

- forming (1) a general training set S which gathers images belonging to multiple image categories,

- forming (2) a specific category training set D which gathers images belonging to the specific image category to detect,

- obtaining (3) a reference codebook from visual local descriptors extracted from images of said general training sets S,

- computing (4) a distribution of visual words of the reference codebook over the specific category training set and a distribution of visual words of the reference codebook over the general training set S, and

- determining (5) which visual words of the reference codebook are representative of the specific image category by comparing the two distributions of visual words.

The general training set S and the specific category training set D are both relatively large (a typical size is 100000).

The general training set S contains images that belong to different categories and/or obtained from different devices without any preference.

The specific category training set D contains images that belong to a specific image category to detect.

According to an embodiment, the general training set S and the specific category training set D can be built without any loss of generality from an image index of a search engine, from data of any image specialist and/or from a web crawl.

According to a characteristic of the invention, a reference codebook VCB is obtained from images of the training set S using the BoF indexing described in the introductive approach. In the following, it is assumed that the cardinality of the reference codebook VCB equals J, i.e. the reference codebook VCB is formed by J visual words VWj. According to an embodiment, the visual words, which are representative of the specific image category to detect, are visual words which are much more frequent in the specific category training set D than in the general training set S.

According to a preferred embodiment, the distributions of visual words of the reference codebook VCB over the training sets are computed by:

- computing a first histogram Hs(j) for the general training set S. The value of a bin j of Hs(j) is the frequency of a visual word VWj of the reference codebook VCB in this set;

- computing a second histogram Hd j) for the specific category training set D. The value of a bin j of Hd j) is the frequency of a visual word VWj of the reference codebook VCB in this set.

Then, the visual words of the reference codebook which are representative of the specific image category are determined by computing for each bin j of the histograms, a score r j) which indicates the likelihood for each of the visual words of the reference codebook VCB to be representative of the specific image category to detect.

According to an embodiment, the two histograms Hs and Hd are compared bin to bin. For each visual word VWj of the reference codebook VWj, a score r j) is computed. For instance, r j) is defined by: r(J) = This quantity induces a

ranking of the visual words of the reference codebook VCB according to their "affinity" with the specific image category to detect. The visual word with the highest score r is the visual word that is the most characteristic of the specific image category to detect. The visual words representative of this category are then built thanks to this score r.

According to an embodiment, the visual words which are representative of the specific image category to detect are visual words with a score r j) greater than a given threshold TH or, alternatively, the N (integer) visual words with the highest scores.

Fig. 3 illustrates the former alternative. The cardinality of the reference codebook VCB equals 10 in this example. The black rectangles represent the values of the bin j of Hs and white rectangles represent the values of the bin j of Hd. The threshold TH equals 2. According to this example, the score for the bins 2 and 6 is greater than TH. Consequently, the visual words VW2 and VW6 are detected as being representative to the specific image category to detect. According to another aspect, the present invention concerns an apparatus 30 which comprises means for implementing the aforementioned method for detecting visual words which are representative of a specific image category from a reference codebook VCB.

Fig. 4 schematically represents an architecture of the apparatus 40. According to the shown architecture, the apparatus 40 comprises the following components interconnected by a communications bus 410: a processor, microprocessor, microcontroller or CPU {Central Processing Unit) 400; a RAM {Random-Access Memory) 401 ; a ROM {Read-Only Memory) 402; a HDD {Hard-Disk Drive) 403, or any other device adapted to read information stored on storage means.

CPU 400 is capable of executing instructions loaded into RAM 401 from ROM 402 or from an external memory, such as HDD 403. After the apparatus 400 has been powered on, CPU 400 is capable of reading instructions from RAM 401 and executing these instructions. The instructions form one computer program that causes CPU 400 to perform some or all of the steps of the methods disclosed in relation with Fig. 2. Any and all steps of these methods may be implemented in software by execution of a set of instructions or program by a programmable computing machine, such as a PC {Personal Computer), a DSP {Digital Signal Processor) or a microcontroller; or else implemented in hardware by a machine or a dedicated component, such as an FPGA {Field-Programmable Gate Array) or an ASIC {Application-Specific Integrated Circuit).

According to an embodiment, the apparatus 40 also comprises at least one communication interface, for example a first wireless communication interface 404 and a wired communication interface 405.

The images which form the general training set S and/or the specific category training set D may be stored in the ROM but they may be also, at least partially, obtained from one communication interface by querying, for example, an external database or by using a search engine.

The reference codebook VCB and/or the distributions of visual words and/or the visual words may be stored in the ROM or in external storing means. In the latter case, a communication interface is used to read or store them.

Claims

1) A method for detecting visual words which are representative of a specific image category from a reference codebook obtained from visual local descriptors extracted from training images, said method comprising the following steps:

- forming (1) a general training set (S) which gathers images belonging to multiple image categories,

- forming (2) a specific category training set (D) which gathers images belonging to the specific image category to detect,

- obtaining (3) a reference codebook from visual local descriptors extracted from images of said general training sets (S),

- computing (4) a distribution of visual words of the reference codebook over the specific category training set and a distribution of visual words of the reference codebook over the general training set (S), and

2) A method according to claim 1, wherein the visual words which are representative of the specific image category to detect are visual words which are much more frequent in the specific category training set (D) than in the general training set (S).

3) A method according to claim 2, in which computing the two distributions comprises:

- computing a first histogram (Hs) for the general training set, the value of a bin of which is the frequency of a visual word of the reference codebook in this set,

- computing a second histogram (Hd) for the specific category training set, the value of a bin of which is the frequency of a visual word of the reference codebook in this set, and

comparing comprises

- computing for each bin of the histograms, a score which indicates the likelihood for each of the visual words of the reference codebook to be representative of the specific image category to detect. 4) A method according to the claim 3, in which the visual words which are representative of the specific image category to detect are visual words with a score higher than a given threshold or a predefined number (N) of visual words with the highest scores.

5) Method according to any one of previous claims, wherein the specific training set (D) gathers images obtained from different devices and or from an image index of a search engine, and/or from a web crawl.

6) An apparatus for detecting visual words which are representative of a specific image category from a reference codebook obtained from visual local descriptors extracted from training images, said apparatus comprising means for:

7) A computer program characterized in that it comprises program code instructions which can be loaded in a programmable apparatus for implementing the method according to any one of claims 1 to 6, when the program code instructions are run by the programmable apparatus.

8) Information storage means, characterized in that they store a computer program comprising program code instructions which can be loaded in a programmable apparatus for implementing the method according to any one of claims 1 to 6, when the program code instructions are run by the programmable apparatus.