WO2009158135A2 - Statistical approach to large-scale image annotation - Google Patents
- Publication number
- WO2009158135A2 (PCT/US2009/045764)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- image
- images
- annotating
- recited
- language models
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/35—Categorising the entire scene, e.g. birthday party or wedding scene
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/70—Labelling scene content, e.g. deriving syntactic or semantic representations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/10—Recognition assisted with metadata
Definitions
- automatic image annotation which annotates images with keywords automatically.
- automatic image annotation is either classification-based or probabilistic modeling-based.
- Classification-based methods attempt to associate words or concepts by learning classifiers (e.g., Bayes point machine, support vector machine, etc.).
- probabilistic modeling methods attempt to infer the correlations or joint probabilities between images and the annotations (e.g., translation model, cross-media relevance model, continuous relevance model, etc.).
- classification-based and probabilistic-based image annotation algorithms are able to annotate small scale image databases, they are generally incapable of annotating large-scale databases with realistic images (e.g., digital pictures).
- a method of annotating an image may comprise compiling visual features and textual information from a number of images, hashing the images' visual features, and clustering the images based on their hash values. Statistical language models are then built from the clustered images and the image is annotated using one of the statistical language models.
- a computer readable storage medium comprising computer executable instructions that when executed by a processor may perform a method comprising crawling a large-scale image database to gather images and their corresponding textual information. Visual information is then extracted from the images using a gray block methodology and the extracted images are reduced by employing a projection matrix. The reduced visual information is hashed and the images are clustered according to their hash values.
- an item record data structure is embodied on a computer readable medium; the data structure consists of a digital image and a textual annotation corresponding to the digital image.
- the textual annotation is associated with the digital image by compiling visual features and textual information from a number of images, hashing the images' visual features, and clustering the images based on the hash value.
- Statistical language models are then built based on the clustered images and the digital image is annotated using one of the statistical language models.
- Figure 1 is a block diagram illustrating one implementation of a large-scale image annotation technique.
- Figure 2 is a diagram illustrating how images and their accompanying annotations may be collected using a web crawler and archived in a database.
- Figure 3 is a block diagram illustrating how a digital image's visual features may be reduced, the reduced features grouped into clusters, and then based on the clusters a statistical language model developed.
- Figure 4 is a block diagram depicting an illustrative method of annotating a personal image.
- Figure 5 is a flow diagram depicting an illustrative method of annotating a web image.
- images in the "real world image database" are typically grouped into clusters according to the images' similarities. Then for a given query image, the most similar image cluster is selected and the "best description" associated with the image cluster is selected to annotate the query image. While these conventional image annotation algorithms are capable of annotating most images, there is significant room for improvement.
- This disclosure relates to various statistical approaches to large-scale image annotation. These statistical approaches can annotate personal images which generally have limited or no annotations and web-based images, which generally have noisy and incomplete annotations.
- an image annotation technique leverages large-scale web-based image databases to model a nearly unlimited number of semantic concepts.
- FIG. 1 illustrates one implementation of the large-scale image annotation technique 100.
- a large-scale image database 102 is crawled and both visual features and textual information are extracted and indexed as structural data 104 (i.e., training set).
- the complexity of the image data is reduced by projecting the high-dimensional image features onto a sub-space with lower dimensionality while maintaining the majority of the image's information 106.
- an efficient hash-based clustering algorithm is applied to the training set and the images with the same hash codes are grouped into "clusters" 108.
- the query image is selected 114 and its visual features (e.g., color, texture, geometric features, etc.) and textual features (e.g., titles, key words, URLs, surrounding text, etc.) are extracted 116.
- the query image's features are hashed 118 and a language model is selected 120 based on the words with the maximum joint probability with the query image.
- the image is then annotated 122 based on the text, title, annotations, and/or key word(s) associated with the selected language model.
- images 202 along with their text, title, annotations, and/or keywords 204 are collected from the Internet using a web crawler and archived in a database 206.
- images 202 may be collected, as large sample sizes assure a good correlation between the visual models and the query image.
- approximately 2.4 million high quality web images with meaningful descriptions were collected from online photo forums (e.g., GOOGLE IMAGES™, YAHOO IMAGE SEARCH™, and the University of Washington image data set, to name a few).
- annotated images may be collected randomly from the Internet or other sources and assembled into an image collection.
- any type of image can be collected so long as it is annotated with some form of text, title, annotation, or key words.
- the images and associated text or key words are then indexed in a database.
- the images 202 and text 204 can be indexed (e.g., key word, text string, image features, to name a few).
- the images are sorted and grouped by the key word or text 204 associated with the image 202. For example, if there are a number of images that contain sunsets, those images can be indexed and grouped together 208.
- One goal of dimension reduction is to reduce the complexity of the image data while maintaining as much of the original information as possible.
- a second goal of dimension reduction is to reduce noise and value drifting by omitting the least significant dimensions. Both of these goals are achieved in the following illustrative technique.
- an image's visual features should generally represent its content, its structure, and be robust to variations in the image itself (e.g., scale, color, storage format, to name a few). Accordingly, a gray block methodology may be employed.
- the gray block features may appear as small thumbnails of the original image.
- the gray block methodology maintains the image's primary content and structure, and is invariant to scale change.
- Each feature vector is the mean of many individual pixels, so the methodology is robust to variances in pixel values. Moreover, since each vector feature is based on the image's luminance, the methodology is also robust to color changes.
- each collected image is divided into 8 by 8 pixel blocks and for each block the average luminance "L" is calculated, at block 302.
- the k-th dimensional value of each feature may be calculated as:
- $B_k$ corresponds to block k
- $N_k$ is the number of pixels in $B_k$
- L(i,j) is the pixel luminance at coordinates i, j.
- the image may be partitioned into a 7 x 7 gray block, a 9 x 9 gray block, or any other suitable number of feature vectors.
- the high-dimensional features may then be projected into a subspace with much lower dimensionality while maintaining most of the image's information, at block 304.
- the image's dimensions are reduced by employing a projection matrix "A".
- Clustering is the classification of objects into classes, categories, or partitions based on a high degree of similarity between object members.
- a hash-based clustering algorithm is applied to the training set, at block 306.
- the K-dimensional feature vector is transformed into a K-bit binary string, which becomes the image's hash code.
- the K-bit string is constrained to no more than 32 bits, although other bit string sizes, such as 64 bits, may be employed.
- the images with the same 32 bit hash code are then grouped into "clusters", at block 308.
- a statistical language model may be developed to model the textual information from the images in each cluster, at block 310.
- Unigram models and modified bigram models may be constructed to calculate single word probabilities and conditional word probabilities for each of the image clusters.
- personal images may lack textual information or annotation, and are accordingly annotated by employing a probabilistic approach.
- the query image may be annotated by selecting keyword(s), a phrase, or text with the maximum joint probability with the query or target image, as illustrated below in equation (4).
- Unigram models assume that a particular piece of text or key word is generated by each term independently. Accordingly, unigram models calculate the probability that a specific keyword, phrase, or text is associated with the query image.
- p(w/c) is the unigram word probability (i.e., probability that a keyword, phrase, or term "w" occurs in an image cluster "c")
- p(I/c) is the visual similarity between the query image "I" and the image cluster "c"
- p(c) is the prior probability of cluster "c", which is often initialized uniformly when no prior information is available in advance.
- if the first keyword appears in five images and the second keyword appears in two images, there is a two in seven chance (29%) that the second keyword should be associated with the query image and a five in seven chance (71%) that the first keyword should be associated with the query image. Accordingly, since the first keyword has a greater probability than the second keyword of being associated with the query image (i.e., 71% versus 29%), the first keyword is used to annotate the query image.
- the image cluster whose visual features are the most similar to the query image is selected and its keyword, phrase, and/or terms are used to annotate the query image.
- the number of words in a cluster is limited because of the small number of images in a cluster. Therefore, when there are a limited number of words, the unigram model may be smoothed using Bayesian models with Dirichlet priors.
- p(w/C) is the unigram probability of a specific keyword "w" occurring in a standard corpus "C".
- the typical web image contains noisy and incomplete textual information. Accordingly, a two step probabilistic model may be employed to annotate the web images.
- p(n,I) is the probability that keyword, phrase, and/or term "n" is associated with web image "I"
- p(n/c) is the probability that term "n" is associated with image cluster "c"
- p(I/c) is the probability that web image "I" is associated with image cluster "c".
- the new annotations "w*" are acquired and ranked by determining the average conditional probability p(w,I/n*) for each candidate annotation.
- the candidate annotations with highest average conditional probabilities may then be selected to annotate the web image.
- $w^* = \arg\max_w \{ p(w, I/n^*) \} = \arg\max_w \{ \sum_c p(w/c)\, p(n^*/w,c)\, p(I/c)\, p(n^*/I,c)\, p(c) \}$ (7)
- p(n*/w,c) is the bigram word probability (i.e., average conditional probability that each keyword, term, or annotation "n*" is associated with image cluster "c" given that "w" is already associated with "c").
- the exemplary image annotation technique is efficient and does not introduce noisy information.
- FIG. 4 illustrates an illustrative method for annotating personal images 400 according to one implementation.
- the term "personal image" is to be interpreted broadly and is generally any image without textual information such as keyword(s), labels, etc.
- the personal image (i.e., query image) can be downloaded from a website, retrieved from a computing device (e.g., personal computer, digital camera, picture phone, personal digital assistant, to name a few), scanned from hard copy, or acquired from any other source of digital images, at block 402.
- its visual features may be extracted using a gray block technique, at block 404.
- the query image is divided into 8 x 8 blocks and for each block the average luminance "L" is calculated.
- the query image may be partitioned into a 7 x 7 gray block, a 9 x 9 gray block, or any other suitable number of gray blocks.
- the vector image may then be reduced by employing a projection matrix.
- the projection matrix "A" is determined by performing principal component analysis (PCA) on the feature matrix.
- the image vectors are then ranked and the vectors corresponding to the largest eigenvalues are retained to form the projection matrix A.
- an efficient hash-based clustering algorithm may be performed on the query image, at block 406.
- the mean value of the image vector, "mean_k", is calculated; for values above mean_k the image vector is assigned a value of 1 and for values below mean_k the image vector is assigned a value of 0. This transforms the K-dimensional image vector into a K-bit binary string, which becomes the query image's hash code.
- the query image's hash code is then compared to the hash codes of the various image clusters.
- the cluster with the same hash code as the query image is selected, at block 408
- the annotation of the selected cluster is used to annotate the query image, at block 410.
- cluster models may be selected that are both textually similar to the web image's textual information and visually similar to the web image.
- Figure 5 shows an illustrative method for annotating web images
- the term "web image" is to be interpreted broadly and is generally any image with textual information such as keyword(s), labels, etc.
- the web image could be downloaded from an Internet website, retrieved from a computing device (e.g., personal computer, digital camera, picture phone, personal digital assistant, to name a few), scanned from hard copy, or retrieved from any other source of digital images, at block 502.
- the image's visual features are extracted using a gray block technique and the vector image is reduced by employing a projection matrix, at block 504.
- the associated textual features are recorded in a database or other form of archive.
- the query image's hash value is calculated by using the mean value of the image vector, "mean_k": for values above mean_k the image vector is assigned a value of 1, and for values below mean_k the image vector is assigned a value of 0. This transforms the K-dimensional image vector into a K-bit binary string, which becomes the query image's hash code, at block 506.
- a two-step probabilistic model is used to annotate web images.
- the available texts "n" may be ranked based on the probability that query image "I" is associated with the image cluster "c" (i.e., p(I/c)) and the text n is associated with the cluster c (i.e., p(n/c)).
- the lowest ranked words are discarded and the highest ranked words serve as the candidate annotations n*, at block 508.
- the new candidate annotations "w*" are acquired and ranked by computing the average conditional probability p(w,I/n*) for each candidate annotation.
- the candidate annotations "w*" with the highest average conditional probabilities are selected to annotate the web image, at block 510.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Processing Or Creating Images (AREA)
- Image Analysis (AREA)
Abstract
Statistical approaches to large-scale image annotation are described. Generally, the annotation technique includes compiling visual features and textual information from a number of images, hashing the images' visual features, and clustering the images based on their hash values. An example system builds statistical language models from the clustered images and annotates the image by applying one of the statistical language models.
Description
Statistical Approach to Large-Scale Image Annotation
BACKGROUND
[0001] With the advent of inexpensive digital cameras, camera phones, and other imaging devices, the number of digital images being taken and posted on the internet has grown dramatically. However, to use these images they must be identified and organized so that they may be browsed, searched, or retrieved.
[0002] One solution is manual image annotation in which a person manually enters descriptive text or keywords when the image is taken, uploaded, or registered. Although manual image annotations are generally very accurate (e.g., people generally select accurate descriptions), manual image annotation is time consuming and consequently many digital images are not annotated. In addition, manual image annotation can be subjective in that the person annotating the image may disregard the key features of an image (e.g., people typically annotate images based on the person in the image, when the image is taken, or the location of the image).
[0003] Another solution is automatic image annotation, which annotates images with keywords automatically. Generally, automatic image annotation is either classification-based or probabilistic modeling-based. Classification-based methods attempt to associate words or concepts by learning classifiers (e.g., Bayes point machine, support vector machine, etc.). Probabilistic modeling methods, by contrast, attempt to infer the correlations or joint probabilities between images and the annotations (e.g., translation model, cross-media relevance model, continuous relevance model, etc.).
[0004] While classification-based and probabilistic-based image annotation algorithms are able to annotate small scale image databases, they are generally incapable of annotating large-scale databases with realistic images (e.g., digital pictures).
[0005] Moreover, these image annotation algorithms are generally incapable of annotating all the various types of realistic images. For example, many personal images do not contain textual information while web images may include incomplete or erroneous textual information. While current image annotation algorithms are capable of annotating personal images or web images, these algorithms are typically incapable of annotating both types of images.
[0006] Furthermore, in large-scale collections of realistic images the number of concepts that can be applied as annotation tags across numerous images is nearly unlimited, and depends on the annotation strategy. Therefore, to annotate large-scale realistic image collections the annotation method should be able to handle the unlimited concepts and themes that may occur in numerous images.
[0007] Lastly, given the sizeable number of images being generated every day, the annotation method must be fast and efficient. For example, approximately one million digital images are uploaded to the FLICKR™ image sharing website each day. To annotate one million images per day, approximately ten images per second must be annotated. Since the best image annotation algorithm annotates an image in about 1.4 seconds, it is incapable of annotating the large number of images that are generated daily.
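The throughput figures in paragraph [0007] can be checked with a short worked calculation. It assumes, for illustration only, a single annotator that processes images one after another around the clock; that assumption is not stated in the source.

```latex
\frac{1{,}000{,}000\ \text{images/day}}{24 \times 3600\ \text{s/day}}
  = \frac{1{,}000{,}000}{86{,}400} \approx 11.6\ \text{images/s}
\qquad
\frac{86{,}400\ \text{s/day}}{1.4\ \text{s/image}} \approx 61{,}700\ \text{images/day}
```

Under that assumption, an algorithm needing 1.4 seconds per image covers roughly 6% of a one-million-image daily load.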
[0008] Accordingly, there is a need for a large-scale image annotation technique that can annotate all types of real-life images, containing an unlimited number of visual concepts, and that can annotate images in near real time.
SUMMARY
[0009] This summary is provided to introduce simplified concepts relating to automated image annotation, which is further described below in the Detailed Description. This summary is not intended to identify essential features of the claimed subject matter, nor is it intended for use in determining the scope of the claimed subject matter.
[0010] In one aspect, a method of annotating an image may comprise compiling visual features and textual information from a number of images, hashing the images' visual features, and clustering the images based on their hash values. Statistical language models are then built from the clustered images and the image is annotated using one of the statistical language models.
[0011] In another aspect, a computer readable storage medium comprises computer executable instructions that, when executed by a processor, may perform a method comprising crawling a large-scale image database to gather images and their corresponding textual information. Visual information is then extracted from the images using a gray block methodology and the extracted images are reduced by employing a projection matrix. The reduced visual information is hashed and the images are clustered according to their hash values. One or more statistical language models are built from the clustered images and a query image is annotated using one or more of the statistical language models.
[0012] In yet another aspect, an item record data structure is embodied on a computer readable medium; the data structure consists of a digital image and a textual annotation corresponding to the digital image. The textual annotation is associated with the digital image by compiling visual features and textual information from a number of images, hashing the images' visual features, and clustering the images based on the hash value. Statistical language models are then built based on the clustered images and the digital image is annotated using one of the statistical language models.
[0013] While described individually, the foregoing aspects are not mutually exclusive and any number of aspects may be present in a given implementation.
BRIEF DESCRIPTION OF THE DRAWINGS
[0014] The detailed description is set forth with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The use of the same reference numbers in different figures indicates similar or identical items.
[0015] Figure 1 is a block diagram illustrating one implementation of a large-scale image annotation technique.
[0016] Figure 2 is a diagram illustrating how images and their accompanying annotations may be collected using a web crawler and archived in a database.
[0017] Figure 3 is a block diagram illustrating how a digital image's visual features may be reduced, the reduced features grouped into clusters, and then based on the clusters a statistical language model developed.
[0018] Figure 4 is a block diagram depicting an illustrative method of annotating a personal image.
[0019] Figure 5 is a flow diagram depicting an illustrative method of annotating a web image.
DETAILED DESCRIPTION
[0020] In a theoretically ideal situation, given a well-annotated image database of unlimited scale, image annotation is relatively straightforward. For a given query image an exact duplicate is found in the image database and that image's annotation is propagated to the query image.
[0021] However, in the "real world" image databases are generally limited in scale and contain many inaccurate descriptions. Accordingly, images in the "real world image database" are typically grouped into clusters according to the images' similarities. Then for a given query image, the most similar image cluster is selected and the "best description" associated with the image cluster is selected to annotate the query image. While these conventional image annotation algorithms are capable of annotating most images, there is significant room for improvement.
[0022] This disclosure relates to various statistical approaches to large-scale image annotation. These statistical approaches can annotate personal images which generally have limited or no annotations and web-based images, which generally have noisy and incomplete annotations. In one implementation, an image annotation technique leverages large-scale web-based image databases to model a nearly unlimited number of semantic concepts.
[0023] Figure 1 illustrates one implementation of the large-scale image annotation technique 100. First, a large-scale image database 102 is crawled and both visual features and textual information are extracted and indexed as structural data 104 (i.e., training set). The complexity of the image data is reduced by projecting the high-dimensional image features onto a sub-space with lower dimensionality while maintaining the majority of the image's information 106. Then an efficient hash-based clustering algorithm is applied to the training set and the images with the same hash codes are grouped into "clusters" 108. Once the images have been clustered into groups 110, a statistical language model (SLM) is developed to model the textual information from the images in each cluster 112.
[0024] To annotate an image, the query image is selected 114 and its visual features (e.g., color, texture, geometric features, etc.) and textual features (e.g., titles, key words, URLs, surrounding text, etc.) are extracted 116. The query image's features are hashed 118 and a language model is selected 120 based on the words with the maximum joint probability with the query image. The image is then annotated 122 based on the text, title, annotations, and/or key word(s) associated with the selected language model.
Collecting Images from the Web
[0025] Referring to Figure 2, in one implementation images 202 along with their text, title, annotations, and/or keywords 204 are collected from the Internet using a web crawler and archived in a database 206. In general, as many images as possible may be collected, as large sample sizes assure a good correlation between the visual models and the query image. For example, in one implementation, approximately 2.4 million high quality web images with meaningful descriptions were collected from online photo forums (e.g., GOOGLE IMAGES™, YAHOO IMAGE SEARCH ™, and the University of Washington image data set, to name a few).
[0026] Alternatively, annotated images may be collected randomly from the Internet or other sources and assembled into an image collection. Generally, any type of image can be collected so long as it is annotated with some form of text, title, annotation, or key words.
[0027] The images and associated text or key words are then indexed in a database. There are many ways in which the images 202 and text 204 can be indexed (e.g., key word, text string, image features, to name a few). In one implementation, the images are sorted and grouped by the key word or text 204 associated with the image 202. For example, if there are a number of images that contain sunsets, those images can be indexed and grouped together 208.
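The keyword grouping described above can be sketched as a small inverted index. This is only an illustration: the function name `build_keyword_index`, the record layout, and the lower-casing of keywords are assumptions made here, not details given in the source.

```python
from collections import defaultdict


def build_keyword_index(records):
    """Map each lower-cased keyword to the sorted list of image ids tagged with it."""
    index = defaultdict(set)
    for image_id, keywords in records:
        for keyword in keywords:
            # Normalising case is an assumption; the source does not specify it.
            index[keyword.strip().lower()].add(image_id)
    return {keyword: sorted(ids) for keyword, ids in index.items()}


# Hypothetical crawled records: (image id, keywords collected with the image).
records = [
    ("img_001", ["Sunset", "beach"]),
    ("img_002", ["sunset", "mountain"]),
    ("img_003", ["dog"]),
]
index = build_keyword_index(records)
print(index["sunset"])
```

Here the two sunset images end up in the same group, mirroring the sunset example in the text.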
Dimension Reduction
[0028] Traditional clustering algorithms are time consuming and computationally inefficient because digital images are generally complex (e.g., highly dimensional). Accordingly, the exemplary technique employs a compact representation of the collected images to achieve fast and efficient image clustering.
[0029] One goal of dimension reduction is to reduce the complexity of the image data while maintaining as much of the original information as possible. A second goal of dimension reduction is to reduce noise and value drifting by omitting the least significant dimensions. Both of these goals are achieved in the following illustrative technique.
[0030] Referring to Figure 3, an image's visual features should generally represent its content and its structure, and be robust to variations in the image itself (e.g., scale, color, storage format, to name a few). Accordingly, a gray block methodology may be employed. The gray block features may appear as small thumbnails of the original image. The gray block methodology maintains the image's primary content and structure, and is invariant to scale change. Each feature vector is the mean of many individual pixels, so the methodology is robust to variances in pixel values. Moreover, since each vector feature is based on the image's luminance, the methodology is also robust to color changes.
[0031] In one implementation, each collected image is divided into 8 by 8 pixel blocks and for each block the average luminance "L" is calculated, at block 302. The k-th dimensional value of each feature may be calculated as:
[0032] $f_k = \frac{1}{N_k} \sum_{(i,j) \in B_k} L(i,j), \quad k = 1, 2, \ldots, n^2$ (1)
[0033] Where $B_k$ corresponds to block k, $N_k$ is the number of pixels in $B_k$, and L(i,j) is the pixel luminance at coordinates i, j. Thus, the image is represented by the vector $F_i = (f_1, f_2, f_3, \ldots, f_{n \times n})^T$. In alternate implementations, the image may be partitioned into a 7 x 7 gray block, a 9 x 9 gray block, or any other suitable number of feature vectors.
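A minimal sketch of equation (1) in plain Python follows. It reads the partition as an n-by-n grid of blocks, which matches the index range k = 1, ..., n² in equation (1); that reading, the function name `gray_block_features`, and the nested-list input format are assumptions made for illustration.

```python
def gray_block_features(luma, n=8):
    """Return the n*n vector of block-mean luminance values (f_1 ... f_{n*n}).

    `luma` is a 2-D list of pixel luminance values L(i, j), indexed [row][column].
    """
    height, width = len(luma), len(luma[0])
    if height < n or width < n:
        raise ValueError("image must be at least n x n pixels")
    features = []
    for block_row in range(n):
        # Integer bounds give contiguous, non-overlapping, non-empty blocks.
        y0, y1 = block_row * height // n, (block_row + 1) * height // n
        for block_col in range(n):
            x0, x1 = block_col * width // n, (block_col + 1) * width // n
            total = sum(luma[y][x] for y in range(y0, y1) for x in range(x0, x1))
            pixel_count = (y1 - y0) * (x1 - x0)  # N_k in equation (1)
            features.append(total / pixel_count)  # f_k in equation (1)
    return features


# Tiny 4 x 4 example split into a 2 x 2 grid of blocks.
luma = [
    [0, 0, 100, 100],
    [0, 0, 100, 100],
    [50, 50, 200, 200],
    [50, 50, 200, 200],
]
features = gray_block_features(luma, n=2)
print(features)
```

Because each value is a block mean, scaling the image up or down leaves the feature vector largely unchanged, which is the scale robustness the text describes.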
[0034] The high-dimensional features may then be projected into a subspace with much lower dimensionality while maintaining most of the image's information, at block 304. In one implementation, the image's dimensions are reduced by employing a projection matrix "A".
$G_i = A F_i$ (2)
[0035] To determine the projection matrix A, principal component analysis (PCA) is performed on the feature matrix of a sufficiently large image collection. The image vectors may then be ranked and the vectors corresponding to the largest eigenvalues retained to form the projection matrix A. It should be noted that the projection matrix is generally the same for most of the gray block images. Although an image may lose some information through this technique, it has been shown that high precision and fast cluster grouping are achieved.
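For reference, the textbook PCA construction of such a projection matrix can be written as follows. The symbols M (number of images), $\bar{F}$ (mean feature vector), C (covariance matrix), $u_j$ and $\lambda_j$ (eigenvectors and eigenvalues) are introduced here for illustration and do not appear in the source. Mean-centering is part of standard PCA, whereas the source's equation (2) applies A to the feature vector directly.

```latex
\bar{F} = \frac{1}{M} \sum_{i=1}^{M} F_i, \qquad
C = \frac{1}{M} \sum_{i=1}^{M} (F_i - \bar{F})(F_i - \bar{F})^T

C\,u_j = \lambda_j u_j, \qquad \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_{n^2}

A = \begin{bmatrix} u_1 & u_2 & \cdots & u_K \end{bmatrix}^T \in \mathbb{R}^{K \times n^2},
\qquad G_i = A\,F_i
```

Keeping only the K eigenvectors with the largest eigenvalues is what reduces each $n^2$-dimensional gray block vector to a K-dimensional vector.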
Clustering by Hashing
[0036] Clustering is the classification of objects into classes, categories, or partitions based on a high degree of similarity between object members. In one implementation, a hash-based clustering algorithm is applied to the training set, at block 306. Such hash code generation is essentially a vector quantization process. Since the final quantized vector has K bits, the method in which the bits are allocated to each dimension is important. In one implementation, for image vector values at or above "mean_k" the corresponding bit has a value of "1" and for image vector values below "mean_k" the corresponding bit has a value of "0":
[0037] $H_{i,k} = \begin{cases} 1 & \text{if } G_{i,k} \geq \mathrm{mean}_k \\ 0 & \text{if } G_{i,k} < \mathrm{mean}_k \end{cases}$ (3)
where mean_k is the mean value of dimension k. By employing this technique, the K-dimensional feature vector is transformed into a K-bit binary string, which becomes the image's hash code.
[0038] In one implementation, the K-bit string is constrained to no more than 32 bits, although other bit string sizes, such as 64 bits, may be employed. The images with the same 32-bit hash code are then grouped into "clusters", at block 308.
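The hashing and grouping steps can be sketched as follows, assuming the reduced vectors are plain lists of floats. The helper names (`dimension_means`, `hash_code`, `cluster_by_hash`) and the use of a text string of "0" and "1" characters for the hash code are illustrative assumptions.

```python
from collections import defaultdict


def dimension_means(vectors):
    """Mean value of each dimension k over the whole training set (mean_k)."""
    count = len(vectors)
    return [sum(v[k] for v in vectors) / count for k in range(len(vectors[0]))]


def hash_code(vector, means):
    """K-bit string per equation (3): bit k is 1 when vector[k] >= means[k]."""
    return "".join("1" if g >= m else "0" for g, m in zip(vector, means))


def cluster_by_hash(vectors):
    """Group image indices whose reduced vectors share the same hash code."""
    means = dimension_means(vectors)
    clusters = defaultdict(list)
    for index, vector in enumerate(vectors):
        clusters[hash_code(vector, means)].append(index)
    return dict(clusters), means
```

Because images are grouped by exact hash equality, clustering is a single pass over the collection with one dictionary lookup per image.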
Building a Statistical Language Model
[0039] Once the images have been clustered into groups, a statistical language model (SLM) may be developed to model the textual information from the images in each cluster, at block 310. Unigram models and modified bigram models may be constructed to calculate single word probabilities and conditional word probabilities for each of the image clusters.
[0040] In general, personal images may lack textual information or annotation, and are accordingly annotated by employing a probabilistic approach. Specifically, the query image may be annotated by selecting keyword(s), a phrase, or text with the maximum joint probability with the query or target image, as illustrated below in equation (4).
[0041] Unigram models assume that a particular piece of text or key word is generated by each term independently. Accordingly, unigram models calculate the probability that a specific keyword, phrase, or text is associated with the query image.
[0042] $w^* = \arg\max_w \{p(w, I)\}$ (4)
$= \arg\max_w \{\sum_c p(w \mid c)\, p(I \mid c)\, p(c)\}$
[0043] In equation (4), $p(w \mid c)$ is the unigram word probability (i.e., the probability that a keyword, phrase, or term "w" occurs in an image cluster "c"), $p(I \mid c)$ is the visual similarity between the query image "I" and the image cluster "c", and $p(c)$ is the prior probability of cluster "c", which is often initialized uniformly when no prior information is available.
[0044] For example, suppose there are ten images in a cluster and two keywords are associated with that cluster. If the first keyword appears in five images and the second keyword appears in two images, there is a two in seven chance (29%) that the second keyword should be associated with the query image and a five in seven chance (71%) that the first keyword should be associated with the query image. Accordingly, since the first keyword has a greater probability than the second keyword of being associated with the query image (i.e., 71% versus 29%), the first keyword is used to annotate the query image.
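The worked example and equation (4) can be sketched together as follows. The function names and the dictionary layout are hypothetical, the unigram probability is estimated from keyword occurrence counts as in the example above, and the cluster prior defaults to uniform.

```python
def unigram_probabilities(keyword_counts):
    """Estimate p(w|c) from how many images in the cluster carry each keyword."""
    total = sum(keyword_counts.values())
    return {word: count / total for word, count in keyword_counts.items()}


def annotate(cluster_word_probs, similarity, prior=None):
    """Pick w* = argmax_w sum_c p(w|c) p(I|c) p(c), as in equation (4).

    cluster_word_probs: {cluster_id: {word: p(w|c)}}
    similarity:         {cluster_id: p(I|c)}
    prior:              {cluster_id: p(c)}; uniform when omitted.
    """
    if prior is None:
        prior = {c: 1 / len(cluster_word_probs) for c in cluster_word_probs}
    scores = {}
    for c, word_probs in cluster_word_probs.items():
        for word, p_w_c in word_probs.items():
            scores[word] = scores.get(word, 0.0) + p_w_c * similarity[c] * prior[c]
    best = max(scores, key=scores.get)
    return best, scores
```

With counts of five and two for the two keywords, `unigram_probabilities` gives 5/7 and 2/7, and `annotate` selects the first keyword.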
[0045] In an alternate implementation, the image cluster whose visual features are the most similar to the query image is selected and its keyword, phrase, and/or terms are used to annotate the query image.
[0046] Generally, the number of words in a cluster is limited because of the small number of images in a cluster. Therefore, when there are a limited number of words, the unigram model may be smoothed using Bayesian models with Dirichlet priors.
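[0047] $p_\mu(w \mid c) = \dfrac{\mathrm{count}(w, c) + \mu\, p(w \mid C)}{\sum_{w'} \mathrm{count}(w', c) + \mu}$ (5)
[0048] Here, $p(w \mid C)$ is the unigram probability of a specific keyword "w" occurring in a standard corpus "C", $\mathrm{count}(w, c)$ is the number of occurrences of "w" in cluster "c", and $\mu$ is the smoothing parameter of the Dirichlet prior.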
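A minimal sketch of Dirichlet-prior smoothing in its standard form is shown below. The function name, the default value of `mu`, and the dictionary inputs are assumptions of this illustration and are not taken from the text above.

```python
def dirichlet_smoothed(word, cluster_counts, corpus_prob, mu=100.0):
    """Smoothed unigram probability of `word` in a cluster.

    cluster_counts: {word: occurrences of the word in the cluster}
    corpus_prob:    {word: p(w|C)} estimated on a standard corpus
    mu:             strength of the Dirichlet prior (illustrative default)
    """
    total = sum(cluster_counts.values())
    observed = cluster_counts.get(word, 0)
    background = corpus_prob.get(word, 0.0)
    return (observed + mu * background) / (total + mu)
```

A word that never occurs in the cluster still receives a nonzero probability as long as it occurs in the corpus, which is the purpose of the smoothing.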
[0049] In general, the typical web image contains noisy and incomplete textual information. Accordingly, a two step probabilistic model may be employed to annotate the web images.
[0050] First, available texts "n" are ranked using equation (6), and the lowest ranked words, which may be noisy, are discarded. The highest ranked words are then used as candidate annotations "n*".
[0051] $n^* = \arg\max_n \{p(n, I)\}$ (6)
$= \arg\max_n \{\sum_c p(n \mid c)\, p(I \mid c)\, p(c)\}$
[0052] In equation (6), $p(n, I)$ is the probability that keyword, phrase, and/or term "n" is associated with web image "I", $p(n \mid c)$ is the probability that term "n" is associated with image cluster "c", and $p(I \mid c)$ is the probability that web image "I" is associated with image cluster "c".
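The first step can be sketched as a ranking of the image's own texts by the score in equation (6). The function name `rank_candidates`, the `keep` cutoff, and the uniform default prior are assumptions; the text above does not state how many of the highest ranked words are retained.

```python
def rank_candidates(texts, cluster_word_probs, similarity, keep=2, prior=None):
    """Rank available texts n by sum_c p(n|c) p(I|c) p(c) and keep the top ones.

    texts:              words already attached to the web image
    cluster_word_probs: {cluster_id: {word: p(n|c)}}
    similarity:         {cluster_id: p(I|c)}
    keep:               how many top-ranked words become candidates n*
    """
    if prior is None:
        prior = {c: 1 / len(cluster_word_probs) for c in cluster_word_probs}

    def score(word):
        return sum(
            cluster_word_probs[c].get(word, 0.0) * similarity[c] * prior[c]
            for c in cluster_word_probs
        )

    ranked = sorted(texts, key=score, reverse=True)
    return ranked[:keep]
```

Texts that no visually similar cluster supports, such as a file name, score zero and fall to the bottom of the ranking.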
[0053] Next, the new annotations "w*" are acquired and ranked by determining the average conditional probability $p(w, I \mid n^*)$ for each candidate annotation. The candidate annotations with the highest average conditional probabilities may then be selected to annotate the web image.
[0054] $w^* = \arg\max_w \{p(w, I \mid n^*)\}$ (7)
$= \arg\max_w \{\sum_c p(w \mid c)\, p(n^* \mid w, c)\, p(I \mid c)\, p(n^* \mid I, c)\, p(c)\}$
[0055] In equation (7), $p(n^* \mid w, c)$ is the bigram word probability (i.e., the average conditional probability that each keyword, term, or annotation "n*" is associated with image cluster "c" given that "w" is already associated with "c").
[0056] For example, suppose a web image is a picture of the sky with clouds and was annotated with "sky". Clusters with the annotations "sky" and "clouds" would have a high probability that their annotations correlate to the image, while clusters with the annotations "water" and "sky" would have a lower probability and would accordingly be discarded.
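The second step can be sketched as follows under two simplifying assumptions that are not stated in the text: the factor p(n*|I, c) is treated as a constant and dropped, and words that are already candidates are not scored again. The function name and the dictionary layout are likewise hypothetical.

```python
def expand_annotations(candidates, unigram, bigram, similarity, top=1, prior=None):
    """Score new words w by sum_c p(w|c) * avg_n p(n|w,c) * p(I|c) * p(c).

    candidates: candidate annotations n* from the first step
    unigram:    {cluster_id: {w: p(w|c)}}
    bigram:     {cluster_id: {(n, w): p(n|w,c)}}
    similarity: {cluster_id: p(I|c)}
    Assumption: p(n*|I,c) is treated as constant and omitted.
    """
    if not candidates:
        raise ValueError("at least one candidate annotation is required")
    if prior is None:
        prior = {c: 1 / len(unigram) for c in unigram}
    scores = {}
    for c, word_probs in unigram.items():
        for w, p_w_c in word_probs.items():
            if w in candidates:
                continue  # assumption: only words not already attached are scored
            pairs = bigram.get(c, {})
            # Average conditional probability of the candidates given w in cluster c.
            avg = sum(pairs.get((n, w), 0.0) for n in candidates) / len(candidates)
            scores[w] = scores.get(w, 0.0) + p_w_c * avg * similarity[c] * prior[c]
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top], scores
```

In the sky example, a cluster pairing "sky" with "clouds" that is visually close to the image contributes far more to the score of "clouds" than a distant cluster pairing "sky" with "water" contributes to "water".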
Annotating Images
[0057] Since only a small number of cluster models are typically used to compute the joint probabilities, the exemplary image annotation technique is efficient and does not introduce noisy information.
[0058] For personal image annotation, cluster models are selected which are visually similar to the images. Accordingly, the personal images are annotated based on the closest visual image model and textual similarity is not considered.
[0059] Figure 4 shows an illustrative method for annotating personal images 400 according to one implementation. The term "personal image" is to be interpreted broadly and is generally any image without textual information such as keyword(s), labels, etc. The personal image can be downloaded from a website, retrieved from a computing device (e.g., personal computer, digital camera, picture phone, personal digital assistant, to name a few), scanned from hard copy, or acquired from any other source of digital images, at block 402.
[0060] Once the personal image (i.e., query image) has been selected, its visual features may be extracted using a gray block technique, at block 404. In one implementation, the query image is divided into 8 x 8 blocks and for each block the average luminance "L" is calculated. The query image is then represented as a vector of the average luminance values, $F_i = (f_1, f_2, f_3, \ldots, f_{n \times n})^T$. In an alternate implementation, the query image may be partitioned into a 7 x 7 gray block, a 9 x 9 gray block, or any other suitable number of gray blocks.
[0061] The image vector may then be reduced by employing a projection matrix. The projection matrix "A" is determined by performing principal components analysis (PCA) on the feature matrix. The eigenvectors are then ranked by eigenvalue, and those corresponding to the largest eigenvalues are retained to form the projection matrix A.
[0062] Next, an efficient hash-based clustering algorithm may be performed on the query image, at block 406. In one implementation, the mean value "mean_k" of the image vector is calculated; values at or above mean_k are assigned a bit value of 1 and values below mean_k are assigned a bit value of 0. This transforms the K-dimensional image vector into a K-bit binary string, which becomes the query image's hash code.
[0063] The query image's hash code is then compared to the hash codes of the various image clusters. The cluster with the same hash code as the query image is selected, at block 408.
[0064] Finally, the annotation of the selected cluster is used to annotate the query image, at block 410.
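Blocks 406 through 410 can be sketched as a single lookup. The function name, the dictionary of cluster annotations keyed by hash code, and the reuse of the per-dimension means computed on the training set are assumptions; the text above does not state what happens when no cluster shares the query image's hash code, so this sketch returns `None` in that case.

```python
def annotate_personal_image(query_vector, means, cluster_annotations):
    """Hash the reduced query vector and look up the matching cluster.

    query_vector:        reduced feature vector of the query image
    means:               per-dimension mean values (assumed to come from training)
    cluster_annotations: {hash_code: list of annotation words for that cluster}
    Returns (hash_code, annotations); annotations is None when no cluster matches.
    """
    code = "".join("1" if g >= m else "0" for g, m in zip(query_vector, means))
    return code, cluster_annotations.get(code)
```

Because the match is an exact dictionary lookup on the hash code, the cost of annotating a query image does not grow with the number of clusters.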
[0065] For web images, cluster models may be selected that are both textually similar to the web image's textual information and visually similar to the web image. Figure 5 shows an illustrative method for annotating web images 500 according to one implementation. The term "web image" is to be interpreted broadly and is generally any image with textual information such as keyword(s), labels, etc. Like the personal image, the web image could be downloaded from an Internet website, retrieved from a computing device (e.g., personal computer, digital camera, picture phone, personal digital assistant, to name a few), scanned from hard copy, or retrieved from any other source of digital images, at block 502.
[0066] Once the web image (i.e., query image) has been selected, the image's visual features are extracted using a gray block technique and the image vector is reduced by employing a projection matrix, at block 504. The associated textual features are recorded in a database or other form of archive.
[0067] The query image's hash value is calculated by using the mean value "mean_k" of the image vector: values at or above mean_k are assigned a value of 1, and values below mean_k are assigned a value of 0. This transforms the K-dimensional image vector into a K-bit binary string, which becomes the query image's hash code, at block 506.
[0068] A two-step probabilistic model is used to annotate web images. First, the available texts "n" may be ranked based on the probability that query image "I" is associated with the image cluster "c" (i.e., $p(I \mid c)$) and the probability that the text n is associated with the cluster c (i.e., $p(n \mid c)$). The lowest ranked words are discarded and the highest ranked words serve as the candidate annotations n*, at block 508.
[0069] The new candidate annotations "w*" are acquired and ranked by computing the average conditional probability $p(w, I \mid n^*)$ for each candidate annotation. The candidate annotations "w*" with the highest average conditional probabilities are selected to annotate the web image, at block 510.
CONCLUSION
[0070] Although implementations have been described in language specific to structural features and/or methodological acts, it is to be understood that the invention defined in the appended claims is not necessarily limited to the specific features or acts described. Rather, the specific features and acts are disclosed as illustrative forms of implementing the claimed invention.
Claims
1. A method of annotating an image comprising: compiling visual features and textual information from a plurality of images (104, 504); hashing the plurality of visual features, and clustering the plurality of images based on the hash value (108, 306, 308); building one or more statistical language models based on the clustered images (110, 112); and annotating the image using one or more of the statistical language models (122).
2. A method of annotating an image as recited in Claim 1, wherein the plurality of images are gathered by crawling one or more large-scale image databases.
3. A method of annotating an image as recited in Claim 1, wherein hashing the plurality of visual features comprises a vector quantization process in which the visual features are transformed into a binary string.
4. A method of annotating an image as recited in Claim 1, wherein the images with the same hash code are grouped into clusters.
5. A method of annotating an image as recited in Claim 1, wherein the one or more statistical language models is a unigram model.
6. A method of annotating an image as recited in Claim 1, wherein the one or more statistical language models is a bigram model.
7. A method of annotating an image as recited in Claim 1, wherein the image is a personal image, and the image is annotated by selecting words with a maximum joint probability between the image and the clustered images.
8. A method of annotating an image as recited in Claim 1, wherein the image is a web image, and the image is annotated by a two step probabilistic modeling technique.
9. A method of annotating an image as recited in Claim 1, further comprising extracting visual information from the plurality of images by using a gray block methodology.
10. A method of annotating an image as recited in Claim 9, wherein the gray block methodology comprises: partitioning the image into equal size blocks, measuring an average luminance for each block, and representing the image as a vector.
11. A method of annotating an image as recited in Claim 9, further comprising reducing the visual information of the plurality of images by employing a projection matrix.
12. A computer readable storage medium comprising computer executable instructions that when executed by a processor perform the method of one of Claims 1-11.
13. A data structure embodied on a computer readable media to represent an item in an item catalog, the data structure comprising: a digital image (202); and a textual annotation corresponding to the digital image (204), the textual annotation being associated with the digital image by: compiling visual features and textual information from a plurality of images (104, 504); hashing the plurality of visual features, and clustering the plurality of images based on the hash value (108, 306, 308); building one or more statistical language models based on the clustered images (110, 112); and annotating the image using one or more of the statistical language models (122).
14. A data structure embodied on a computer readable media to represent an item in an item catalog as recited in Claim 13, wherein the plurality of images are gathered by crawling one or more large-scale image databases.
15. A data structure embodied on a computer readable media to represent an item in an item catalog as recited in Claim 13, further comprising extracting visual information from the plurality of images by using a gray block methodology.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP09770670.9A EP2291765A4 (en) | 2008-05-30 | 2009-05-30 | Statistical approach to large-scale image annotation |
CN200980131159.4A CN102112987B (en) | 2008-05-30 | 2009-05-30 | Statistical approach to large-scale image annotation |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/130,943 US8150170B2 (en) | 2008-05-30 | 2008-05-30 | Statistical approach to large-scale image annotation |
US12/130,943 | 2008-05-30 |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2009158135A2 (en) | 2009-12-30 |
WO2009158135A3 (en) | 2010-04-15 |
Family ID: 41379902
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2009/045764 WO2009158135A2 (en) | 2008-05-30 | 2009-05-30 | Statistical approach to large-scale image annotation |
Country Status (4)
Country | Link |
---|---|
US (2) | US8150170B2 (en) |
EP (1) | EP2291765A4 (en) |
CN (1) | CN102112987B (en) |
WO (1) | WO2009158135A2 (en) |
2008
- 2008-05-30 US US12/130,943 patent/US8150170B2/en not_active Expired - Fee Related
2009
- 2009-05-30 WO PCT/US2009/045764 patent/WO2009158135A2/en active Application Filing
- 2009-05-30 CN CN200980131159.4A patent/CN102112987B/en active Active
- 2009-05-30 EP EP09770670.9A patent/EP2291765A4/en not_active Ceased
2012
- 2012-02-28 US US13/406,804 patent/US8594468B2/en active Active
Also Published As
Publication number | Publication date |
---|---|
WO2009158135A3 (en) | 2010-04-15 |
US8150170B2 (en) | 2012-04-03 |
CN102112987A (en) | 2011-06-29 |
US8594468B2 (en) | 2013-11-26 |
CN102112987B (en) | 2015-03-04 |
US20090297050A1 (en) | 2009-12-03 |
EP2291765A2 (en) | 2011-03-09 |
US20120155774A1 (en) | 2012-06-21 |
EP2291765A4 (en) | 2016-12-21 |
Legal Events
Code | Title | Description |
---|---|---|
WWE | Wipo information: entry into national phase | Ref document number: 200980131159.4; Country of ref document: CN |
121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 09770670; Country of ref document: EP; Kind code of ref document: A2 |
WWE | Wipo information: entry into national phase | Ref document number: 2009770670; Country of ref document: EP |
NENP | Non-entry into the national phase | Ref country code: DE |