US8447119B2 - Method and system for image classification - Google Patents

Method and system for image classification

Info

Publication number
US8447119B2
Authority
US
United States
Prior art keywords
image
vector
coding
local
descriptors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active, expires
Application number
US12/818,156
Other versions
US20110229045A1 (en)
Inventor
Kai Yu
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NEC Corp
Original Assignee
NEC Laboratories America Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NEC Laboratories America Inc filed Critical NEC Laboratories America Inc
Priority to US12/818,156 priority Critical patent/US8447119B2/en
Publication of US20110229045A1 publication Critical patent/US20110229045A1/en
Application granted granted Critical
Publication of US8447119B2 publication Critical patent/US8447119B2/en
Assigned to NEC CORPORATION reassignment NEC CORPORATION ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: NEC LABORATORIES AMERICA, INC.
Assigned to NEC CORPORATION reassignment NEC CORPORATION CORRECTIVE ASSIGNMENT TO CORRECT THE REMOVE 8538896 AND ADD 8583896 PREVIOUSLY RECORDED ON REEL 031998 FRAME 0667. ASSIGNOR(S) HEREBY CONFIRMS THE ASSIGNMENT. Assignors: NEC LABORATORIES AMERICA, INC.
Active legal-status Critical Current
Adjusted expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/10 Terrestrial scenes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/762 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • G06V10/763 Non-hierarchical techniques, e.g. based on statistics of modelling distributions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects

Definitions

  • the invention relates to a method and system for image classification.
  • Image classification, including object recognition and scene classification, remains a major challenge to the computer vision community. Perhaps one of the most significant developments in the last decade is the application of local features to image classification, including the introduction of the “bag-of-visual-words” representation.
  • VQ vector quantization
  • a further extension is to incorporate the spatial information of local descriptors in an image, by partitioning images into regions in different locations and scales and computing region-based histograms, instead of computing the global histogram for the entire image. These region-based histograms are concatenated to form the feature vector for the image. A nonlinear SVM is then applied for classification. This approach is called the “spatial pyramid matching kernel” (SPMK) method. SPMK is regarded as the state-of-the-art method for image classification.
  • known approaches include SVMs using pyramid matching kernels, biologically-inspired models, and KNN methods.
  • SPM spatial pyramid matching
  • the recent improvements were often achieved by combining different types of local descriptors, without any fundamental change of the underlying classification method.
  • Nonlinear SVMs scale at least quadratically to the size of training data, which makes it nontrivial to handle large-scale training data. It is thus necessary to design algorithms that are computationally more efficient.
  • methods and systems for image classification coding an image by nonlinearly mapping an image descriptor to form a high-dimensional sparse vector; spatially pooling each local region to form an image-level feature vector using a probability kernel incorporating a similarity metric of local descriptors; and classifying the image.
  • a method for image classification includes nonlinearly mapping one or more descriptors of an image to form a high-dimensional sparse vector using Super-Vector nonlinear coding; spatial pooling each local region by aggregating codes of the descriptors in each local region to form a single vector, and concatenating vectors of different regions to form the image-level feature vector using probability kernel incorporating the similarity metric of local descriptors; and image classifying by normalizing image-level feature vector using linear SVMs.
  • a system for image classification includes means for coding descriptor of an image by nonlinearly mapping to form a high-dimensional sparse vector using Super-Vector nonlinear coding method; means for spatial pooling each local region by aggregating the codes of all the descriptors in each local region to form a single vector, and concatenating vectors of different regions to form the image-level feature vector using probability kernel incorporating the similarity metric of local descriptors; and means for image classifying by normalizing image-level feature vector using linear SVMs.
  • a method for image classification includes extracting local image descriptors from a grid of locations in an image; nonlinearly coding extracted image descriptors to form a high-dimensional sparse vector; spatial pooling each image by partitioning into regions in different scales and locations, aggregating the codes of the descriptors in each region to form a single vector, and concatenating vectors of different regions to form the image-level feature vector; and linear classifying image-level feature vector.
  • the system for image classification includes means for extracting local image descriptors from a grid of locations in an image; means for nonlinearly coding extracted image descriptors to form a high-dimensional sparse vector; means for spatial pooling each image by partitioning into regions in different scales and locations, aggregating the codes of all the descriptors in each region to form a single vector, means for concatenating vectors of different regions to form the image-level feature vector; and means for linear classifying image-level feature vector.
  • Image classification can be done using local visual descriptors.
  • the system is more scalable in computation, more transparent in classification, and more accurate than conventional systems.
  • the overall image classification framework enjoys a linear training complexity, and also a great interpretability that is missing from conventional systems.
  • FIG. 1 is a flow chart showing image classification method.
  • FIG. 2 shows an exemplary system to perform image classification.
  • the method receives an input image in 110 .
  • the method performs a descriptor extraction in 120 .
  • This operation extracts local image descriptors, such as SIFT, SURF, or any other local features, from a grid of locations in the image.
  • the image is represented as a set of descriptor vectors with their 2D location coordinates.
  • the method performs nonlinear coding in 130 . Each descriptor of an image is nonlinearly mapped to form a high-dimensional sparse vector.
  • the invention proposes a novel nonlinear coding method called Super-Vector (SV) coding, which enjoys better theoretical properties than Vector Quantization (VQ) coding.
  • the method performs spatial pooling where each image is partitioned into regions in different scales and locations. For each region, the codes of all the descriptors in it are aggregated to form a single vector, then vectors of different regions are concatenated to form the image-level feature vector.
  • a probability kernel incorporating the similarity metric of local descriptors can be used in one embodiment as described in detail below.
  • the process performs linear classification in 150 .
  • the image-level feature vector is normalized and fed into a classifier to detect an object such as a cat in 160 .
  • Linear SVMs, which scale linearly with the size of the training data, are used in the method. In contrast, previous state-of-the-art systems used nonlinear SVMs, which require quadratic or higher-order computational complexity for training.
  • the descriptor coding enjoys appealing theoretical properties. The goal is to learn a smooth nonlinear function $f(x)$ defined on a high-dimensional space $\mathbb{R}^d$.
  • the question is how to derive a good coding scheme (or nonlinear mapping) $\phi(x)$ such that $f(x)$ can be well approximated by a linear function on it, namely $w^T \phi(x)$. The assumption here is that $f(x)$ should be sufficiently smooth.
  • VQ Vector Quantization
  • $v_*(x) = \arg\min_{v \in C} \|x - v\|$, where $\|\cdot\|$ is the Euclidean norm (2-norm).
  • $f(x)$ is β-Lipschitz derivative smooth if for all $x, x' \in \mathbb{R}^d$:
  • $\phi(x) = [\underbrace{0, \dots, 0}_{d+1\ \text{dim.}},\ \underbrace{s,\ (x - v)^T}_{d+1\ \text{dim.}},\ \underbrace{0, \dots, 0}_{d+1\ \text{dim.}}]^T \quad (3)$
  • SV coding may achieve a lower function approximation error than VQ coding. It should be noted that the popular bag-of-features image classification method essentially employs VQ to obtain histogram representations. The proposed SV coding is a simple extension of VQ, and may lead to a better approach to image classification.
  • Each image can be represented as a set of descriptor vectors x that follows an image-specific distribution, represented as a probability density function $p(x)$ with respect to an image-independent background measure $d\mu(x)$. Let's first ignore the spatial locations of x, and address the spatial pooling later.
  • a kernel-based method for image classification is based on a kernel on the probability distributions over $x \in \Omega$, $K: P \times P \to \mathbb{R}$.
  • $K: P \times P \to \mathbb{R}$. A well-known example is the Bhattacharyya kernel:
  • $K_b(p, q) = \int_\Omega p(x)^{1/2}\, q(x)^{1/2}\, d\mu(x)$.
  • KL Kullback Leibler
  • $K(X, X') = \frac{1}{NN'} \sum_{x \in X} \sum_{x' \in X'} p(x)^{-1/2}\, q(x')^{-1/2}\, \kappa(x, x') \quad (4)$
  • N and N′ are the sizes of the descriptor sets from two images.
  • $X_k$ is the subset of X that falls into the k-th cluster.
  • weighting by histogram $p_k$ is equivalent to treating density $p(x)$ as piece-wise constant around each VQ basis, under a specific choice of background measure $\mu(x)$ that equalizes different partitions.
  • This representation is not sensitive to the choice of background measure $\mu(x)$, which is image independent.
  • each image be evenly partitioned into 1×1, 2×2, and 3×1 blocks, respectively, in 3 different levels. Based on which block each descriptor comes from, the whole set X of an image is then organized into three levels of subsets: $X^1_{11}$, $X^2_{11}$, $X^2_{12}$, $X^2_{21}$, $X^2_{22}$, $X^3_{11}$, $X^3_{12}$, and $X^3_{13}$. Then the pooling operation introduced in the last subsection can be applied to each of the subsets.
  • the above equation provides an interesting insight to the classification process: a patch-level pattern matching is operated everywhere in the image, and the responses are then aggregated together to generate the score indicating how likely a particular category of objects is present. This observation is well-aligned with the biologically-inspired vision models, like Convolution Neural Networks and HMAX model, which mostly employ feed-forward pattern matching for object recognition.
  • the classification model enjoys the advantages of interpretability and computational scalability.
  • Eq. (5) suggests that one can compute a response map based on g(x), which visualizes where the classifier focuses on in the image. Since the proposed method naturally requires a linear classifier, it enjoys a training scalability which is linear to the number of training images, while nonlinear kernel-based methods suffer quadratic or higher complexity.
  • the classification model is more related to local coordinate coding (LCC), which points out that in some cases a desired sparsity of $\phi(x)$ should come from a locality of the coding scheme.
  • LCC local coordinate coding
  • the proposed SV coding leads to a highly sparse representation $\phi(x)$, as defined by Eq. (2), which activates those coordinates associated with the neighborhood of x.
  • the computation of SV coding is much simpler than sparse coding approaches.
  • the method can be further improved by considering a soft assignment of x to bases C.
  • the underlying interpretation of $f(x) \approx w^T \phi(x)$ is the approximation $f(x) \approx f(v_*(x)) + \nabla f(v_*(x))^T (x - v_*(x))$, which essentially uses the unknown function's Taylor expansion at a nearby location $v_*(x)$ to interpolate $f(x)$.
  • One natural idea to improve this is using several neighbors in C instead of the nearest one. Let's consider a soft K-means that computes $p_k(x)$, the posterior probability of cluster assignment for x. Then the function approximation can be handled as the expectation
  • the invention may be implemented in hardware, firmware or software, or a combination of the three.
  • the invention is implemented in a computer program executed on a programmable computer having a processor, a data storage system, volatile and non-volatile memory and/or storage elements, at least one input device and at least one output device.
  • the computer preferably includes a processor, random access memory (RAM), a program memory (preferably a writable read-only memory (ROM) such as a flash ROM) and an input/output (I/O) controller coupled by a CPU bus.
  • the computer may optionally include a hard drive controller which is coupled to a hard disk and CPU bus. Hard disk may be used for storing application programs, such as the present invention, and data. Alternatively, application programs may be stored in RAM or ROM.
  • I/O controller is coupled by means of an I/O bus to an I/O interface.
  • I/O interface receives and transmits data in analog or digital form over communication links such as a serial link, local area network, wireless link, and parallel link.
  • a display, a keyboard and a pointing device may also be connected to I/O bus.
  • separate connections may be used for I/O interface, display, keyboard and pointing device.
  • Programmable processing system may be preprogrammed or it may be programmed (and reprogrammed) by downloading a program from another source (e.g., a floppy disk, CD-ROM, or another computer).
  • the system of FIG. 2 receives images to be classified. Each image is represented by a set of local descriptors with their spatial coordinates.
  • the descriptor can be SIFT, or any other local features, computed from image patches at locations on a 2D grid.
  • the image is processed by a descriptor coding module where each descriptor of an image is nonlinearly mapped to form a high-dimensional sparse vector.
  • a nonlinear coding method called vector machine coding can be used, which is an extension of Vector Quantization (VQ) coding.
  • VQ Vector Quantization
  • the codes of all the descriptors in it are aggregated to form a single vector, then vectors of different regions are concatenated to form the image-level feature vector.
  • This step is based on a novel probability kernel incorporating the similarity metric of local descriptors.
  • the image-level feature vector is normalized and fed into a classifier. Linear SVMs, which scale linearly with the size of the training data, are used in this step.
  • Each computer program is tangibly stored in a machine-readable storage media or device (e.g., program memory or magnetic disk) readable by a general or special purpose programmable computer, for configuring and controlling operation of a computer when the storage media or device is read by the computer to perform the procedures described herein.
  • the inventive system may also be considered to be embodied in a computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner to perform the functions described herein.

Abstract

Methods and systems are disclosed for image classification coding an image by nonlinearly mapping an image descriptor to form a high-dimensional sparse vector; spatially pooling each local region to form an image-level feature vector using a probability kernel incorporating a similarity metric of local descriptors; and classifying the image.

Description

The application claims priority to U.S. Provisional Application Ser. No. 61/314,386 filed Mar. 16, 2010, the content of which is incorporated by reference.
BACKGROUND
The invention relates to a method and system for image classification.
Image classification, including object recognition and scene classification, remains a major challenge to the computer vision community. Perhaps one of the most significant developments in the last decade is the application of local features to image classification, including the introduction of the “bag-of-visual-words” representation.
One conventional approach applies probabilistic generative models with the objective towards understanding the semantic content of images. Typically those models extend topic models on bag-of-word representation by further considering the spatial information of visual words.
Certain existing approaches apply vector quantization (VQ) coding on local image descriptors, for example SIFT features or SURF features, and then average pooling to obtain the so-called “bag-of-visual-words” representation, which is fed into a nonlinear classifier based on SVMs using Chi-square or intersection kernel.
A further extension is to incorporate the spatial information of local descriptors in an image, by partitioning images into regions in different locations and scales and computing region-based histograms, instead of computing the global histogram for the entire image. These region-based histograms are concatenated to form the feature vector for the image. A nonlinear SVM is then applied for classification. This approach is called the “spatial pyramid matching kernel” (SPMK) method. SPMK is regarded as the state-of-the-art method for image classification.
Known approaches include SVMs using pyramid matching kernels, biologically-inspired models, and KNN methods. Over the past years, the nonlinear SVM method using spatial pyramid matching (SPM) kernels seems to be dominant among the top performers in various image classification benchmarks, including Caltech-101, PASCAL, and TRECVID. Recent improvements were often achieved by combining different types of local descriptors, without any fundamental change to the underlying classification method. In addition to the demand for more accurate classifiers, more practical methods are needed. Nonlinear SVMs scale at least quadratically with the size of the training data, which makes it nontrivial to handle large-scale training data. It is thus necessary to design algorithms that are computationally more efficient.
SUMMARY
In one aspect, methods and systems are disclosed for image classification coding an image by nonlinearly mapping an image descriptor to form a high-dimensional sparse vector; spatially pooling each local region to form an image-level feature vector using a probability kernel incorporating a similarity metric of local descriptors; and classifying the image.
In another aspect, a method for image classification includes nonlinearly mapping one or more descriptors of an image to form a high-dimensional sparse vector using Super-Vector nonlinear coding; spatial pooling each local region by aggregating codes of the descriptors in each local region to form a single vector, and concatenating vectors of different regions to form the image-level feature vector using probability kernel incorporating the similarity metric of local descriptors; and image classifying by normalizing image-level feature vector using linear SVMs.
In a related aspect, a system for image classification includes means for coding descriptor of an image by nonlinearly mapping to form a high-dimensional sparse vector using Super-Vector nonlinear coding method; means for spatial pooling each local region by aggregating the codes of all the descriptors in each local region to form a single vector, and concatenating vectors of different regions to form the image-level feature vector using probability kernel incorporating the similarity metric of local descriptors; and means for image classifying by normalizing image-level feature vector using linear SVMs.
In yet another aspect, a method for image classification includes extracting local image descriptors from a grid of locations in an image; nonlinearly coding extracted image descriptors to form a high-dimensional sparse vector; spatial pooling each image by partitioning into regions in different scales and locations, aggregating the codes of the descriptors in each region to form a single vector, and concatenating vectors of different regions to form the image-level feature vector; and linear classifying image-level feature vector.
In another related aspect, the system for image classification includes means for extracting local image descriptors from a grid of locations in an image; means for nonlinearly coding extracted image descriptors to form a high-dimensional sparse vector; means for spatial pooling each image by partitioning into regions in different scales and locations, aggregating the codes of all the descriptors in each region to form a single vector, means for concatenating vectors of different regions to form the image-level feature vector; and means for linear classifying image-level feature vector.
Advantages of the preferred embodiments may include one or more of the following. Image classification can be done using local visual descriptors. The system is more scalable in computation, more transparent in classification, and more accurate than conventional systems. The overall image classification framework enjoys linear training complexity, as well as an interpretability that is missing from conventional systems.
BRIEF DESCRIPTION OF THE DRAWING
FIG. 1 is a flow chart showing image classification method.
FIG. 2 shows an exemplary system to perform image classification.
DESCRIPTION
FIG. 1 is a flow chart showing an image classification method. As shown in FIG. 1, the method receives an input image in 110. Next, the method performs descriptor extraction in 120. This operation extracts local image descriptors, such as SIFT, SURF, or any other local features, from a grid of locations in the image. As a result, the image is represented as a set of descriptor vectors with their 2D location coordinates. Next, the method performs nonlinear coding in 130. Each descriptor of an image is nonlinearly mapped to form a high-dimensional sparse vector. The invention proposes a novel nonlinear coding method called Super-Vector (SV) coding, which enjoys better theoretical properties than Vector Quantization (VQ) coding. Next, in 140, the method performs spatial pooling, where each image is partitioned into regions in different scales and locations. For each region, the codes of all the descriptors in it are aggregated to form a single vector, and the vectors of different regions are then concatenated to form the image-level feature vector. A probability kernel incorporating the similarity metric of local descriptors can be used in one embodiment, as described in detail below. Next, the process performs linear classification in 150. The image-level feature vector is normalized and fed into a classifier to detect an object such as a cat in 160. Linear SVMs, which scale linearly with the size of the training data, are used in the method. In contrast, previous state-of-the-art systems used nonlinear SVMs, which require quadratic or higher-order computational complexity for training.
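The descriptor extraction in 120 samples local features on a regular grid. As a minimal sketch of that sampling step only, the patch positions could be enumerated as below. The patch size, the stride, and the helper name `grid_locations` are illustrative assumptions and are not values given in the text; computing SIFT or SURF at each position is left to whichever feature library is in use.

```python
def grid_locations(width, height, patch=16, stride=8):
    """Top-left corners of patch x patch windows on a regular grid.

    patch and stride are hypothetical defaults chosen for illustration.
    """
    xs = range(0, width - patch + 1, stride)
    ys = range(0, height - patch + 1, stride)
    return [(x, y) for y in ys for x in xs]


locations = grid_locations(32, 32)
print(len(locations))  # 9 windows: 3 columns x 3 rows
```

Each returned location would be paired with its descriptor vector, giving the set of descriptors with 2D coordinates that the later steps consume.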
Next, one embodiment of the descriptor coding is described. In this embodiment, the coding method enjoys appealing theoretical properties. The goal is to learn a smooth nonlinear function $f(x)$ defined on a high-dimensional space $\mathbb{R}^d$. The question is how to derive a good coding scheme (or nonlinear mapping) $\phi(x)$ such that $f(x)$ can be well approximated by a linear function on it, namely $w^T \phi(x)$. The assumption here is that $f(x)$ should be sufficiently smooth.
In a general unsupervised learning setting, a set of bases $C \subset \mathbb{R}^d$, called a codebook or dictionary, is employed to approximate any x, namely,
$$x \approx \sum_{v \in C} \gamma_v(x)\, v,$$
where $\gamma(x) = [\gamma_v(x)]_{v \in C}$ are the coefficients, and sometimes $\sum_v \gamma_v(x) = 1$. By restricting the cardinality of nonzeros of $\gamma(x)$ to be 1 and $\gamma_v(x) \ge 0$, the Vector Quantization (VQ) method is obtained as:
$$v_*(x) = \arg\min_{v \in C} \|x - v\|,$$
where $\|\cdot\|$ is the Euclidean norm (2-norm). The VQ method uses the coding $\gamma_v(x) = 1$ if $v = v_*(x)$ and $\gamma_v(x) = 0$ otherwise. $f(x)$ is β-Lipschitz derivative smooth if for all $x, x' \in \mathbb{R}^d$:
$$\left| f(x) - f(x') - \nabla f(x')^T (x - x') \right| \le \frac{\beta}{2} \|x - x'\|^2.$$
It immediately implies the following simple function approximation bound via VQ coding: for all $x \in \mathbb{R}^d$:
$$\left| f(x) - f(v_*(x)) - \nabla f(v_*(x))^T (x - v_*(x)) \right| \le \frac{\beta}{2} \|x - v_*(x)\|^2. \quad (1)$$
This bound simply states that one can approximate $f(x)$ by $f(v_*(x)) + \nabla f(v_*(x))^T (x - v_*(x))$, and the approximation error is upper bounded by the quality of VQ. It further suggests that the function approximation can be improved by learning the codebook C to minimize this upper bound. One way is the K-means algorithm
$$C = \arg\min_{C} \left\{ \sum_{x} \min_{v \in C} \|x - v\|^2 \right\}.$$
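The codebook objective above is commonly minimized with Lloyd's iterations, alternating nearest-codeword assignment (the VQ step) with centroid updates. The following is a small sketch under stated assumptions: the function names, the random initialization from data points, and the fixed iteration count are illustrative choices, not details taken from the text.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def vq_assign(x, codebook):
    """Index of the nearest codeword v*(x)."""
    return min(range(len(codebook)), key=lambda k: sq_dist(x, codebook[k]))


def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm; initial centres are k distinct data points (an assumption)."""
    rng = random.Random(seed)
    codebook = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[vq_assign(p, codebook)].append(p)
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centre
                dim = len(members[0])
                codebook[j] = [sum(m[i] for m in members) / len(members)
                               for i in range(dim)]
    return codebook


def distortion(points, codebook):
    """The K-means objective: sum over x of min_v ||x - v||^2."""
    return sum(sq_dist(p, codebook[vq_assign(p, codebook)]) for p in points)


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
codebook = kmeans(points, k=2)
print(distortion(points, codebook))
```

On this toy set the two well-separated groups end up with one codeword each, so every descriptor lies close to its codeword and the right-hand side of Eq. (1) stays small.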
Eq. (1) also suggests that the approximation to $f(x)$ can be expressed as a linear function on a nonlinear coding scheme
$$f(x) \approx g(x) \equiv w^T \phi(x),$$
where $\phi(x)$ is called the Super-Vector (SV) coding of x, defined by
$$\phi(x) = \left[ s\,\gamma_v(x),\ \gamma_v(x)\,(x - v)^T \right]_{v \in C}^T \quad (2)$$
where s is a nonnegative constant. It is not difficult to see that
$$w = \left[ \tfrac{1}{s} f(v),\ \nabla f(v)^T \right]_{v \in C}^T,$$
which can be regarded as unknown parameters to be estimated. Because $\gamma_v(x) = 1$ if $v = v_*(x)$, otherwise $\gamma_v(x) = 0$, the obtained $\phi(x)$ is a highly sparse representation, with dimensionality $|C|(d+1)$. For example, if $|C| = 3$ and $\gamma(x) = [0, 1, 0]$, then
$$\phi(x) = [\underbrace{0, \dots, 0}_{d+1\ \text{dim.}},\ \underbrace{s,\ (x - v)^T}_{d+1\ \text{dim.}},\ \underbrace{0, \dots, 0}_{d+1\ \text{dim.}}]^T \quad (3)$$
$w^T \phi(x)$ provides a piece-wise linear function to approximate a nonlinear function $f(x)$, while with VQ coding $\phi(x) = [\gamma_v(x)]_{v \in C}^T$, the same formulation $w^T \phi(x)$ gives a piece-wise constant approximation. SV coding may achieve a lower function approximation error than VQ coding. It should be noted that the popular bag-of-features image classification method essentially employs VQ to obtain histogram representations. The proposed SV coding is a simple extension of VQ, and may lead to a better approach to image classification.
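To make the difference between piece-wise linear and piece-wise constant approximation concrete, the sketch below builds the SV code of Eq. (2) for a toy one-dimensional target $f(x) = x^2$ (so β = 2) and compares both estimates. The helper names, the toy function, and the choice s = 1 are assumptions made for illustration; s is taken to be strictly positive so that 1/s is defined.

```python
def sv_code(x, codebook, s=1.0):
    """SV code: |C| blocks of length d+1; only the nearest codeword's block is nonzero."""
    d = len(x)
    k = min(range(len(codebook)),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(x, codebook[j])))
    code = [0.0] * (len(codebook) * (d + 1))
    code[k * (d + 1)] = s
    for i in range(d):
        code[k * (d + 1) + 1 + i] = x[i] - codebook[k][i]
    return code


def sv_weights(f, grad_f, codebook, s=1.0):
    """w = [f(v)/s, grad f(v)] stacked over codewords (requires s > 0)."""
    w = []
    for v in codebook:
        w.append(f(v) / s)
        w.extend(grad_f(v))
    return w


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


# Toy target f(x) = x^2 with codebook {0, 1, 2}; all values are illustrative.
f = lambda v: v[0] ** 2
grad_f = lambda v: [2.0 * v[0]]
codebook = [[0.0], [1.0], [2.0]]
x = [1.2]

phi = sv_code(x, codebook)            # only the block of codeword 1.0 is nonzero
w = sv_weights(f, grad_f, codebook)
sv_estimate = dot(w, phi)             # f(1) + f'(1) * 0.2, about 1.4
vq_estimate = f(codebook[1])          # piece-wise constant value, 1.0
print(f(x), sv_estimate, vq_estimate)
```

Here the true value is about 1.44, so the SV estimate is off by roughly 0.04, which matches the bound $\frac{\beta}{2}\|x - v_*(x)\|^2$ from Eq. (1), while the VQ estimate is off by roughly 0.44.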
Next, one embodiment of spatial pooling is discussed. Each image can be represented as a set of descriptor vectors x that follows an image-specific distribution, represented as a probability density function $p(x)$ with respect to an image-independent background measure $d\mu(x)$. Let's first ignore the spatial locations of x, and address the spatial pooling later. A kernel-based method for image classification is based on a kernel on the probability distributions over $x \in \Omega$, $K: P \times P \to \mathbb{R}$. A well-known example is the Bhattacharyya kernel:
$$K_b(p, q) = \int_\Omega p(x)^{1/2}\, q(x)^{1/2}\, d\mu(x).$$
Here $p(x)$ and $q(x)$ represent two images as distributions over local descriptor vectors, and $\mu(x)$ is the image-independent background measure. The Bhattacharyya kernel is closely associated with the Hellinger distance, defined as $D_h(p, q) = 2 - K_b(p, q)$, which can be seen as a principled symmetric approximation of the Kullback-Leibler (KL) divergence. Despite the popular application of both the Bhattacharyya kernel and the KL divergence, a significant drawback is that they ignore the underlying similarity metric of x. In order to avoid this problem, one has to work with very smooth distribution families that are inconvenient to work with in practice. This invention proposes a novel formulation that explicitly takes the similarity of x into account:
$$K_s(p, q) = \int_\Omega \int_\Omega p(x)^{1/2}\, q(x')^{1/2}\, \kappa(x, x')\, d\mu(x)\, d\mu(x') = \int_\Omega \int_\Omega p(x)^{-1/2}\, q(x')^{-1/2}\, \kappa(x, x')\, p(x)\, q(x')\, d\mu(x)\, d\mu(x')$$
where $\kappa(x, x')$ is an RKHS kernel on $\Omega$ that reflects the similarity structure of x. In the extreme case where $\kappa(x, x') = \delta(x - x')$ is the delta function with respect to $\mu(\cdot)$, the above kernel reduces to the Bhattacharyya kernel.
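The reduction to the Bhattacharyya kernel can be checked on a discrete analogue. The sketch below assumes a finite descriptor alphabet with counting measure, so the integrals become sums and the delta function becomes an identity matrix; the function names and the example similarity matrix are hypothetical.

```python
import math


def bhattacharyya(p, q):
    """Discrete K_b: sum_i sqrt(p_i * q_i)."""
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))


def similarity_kernel(p, q, kappa):
    """Discrete K_s: sum_i sum_j sqrt(p_i) * sqrt(q_j) * kappa[i][j]."""
    return sum(math.sqrt(pi) * math.sqrt(qj) * kappa[i][j]
               for i, pi in enumerate(p)
               for j, qj in enumerate(q))


p = [0.5, 0.5, 0.0]
q = [0.0, 0.5, 0.5]

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Illustrative similarity: neighbouring bins are treated as half similar.
smooth = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.5], [0.0, 0.5, 1.0]]

print(bhattacharyya(p, q))                  # about 0.5
print(similarity_kernel(p, q, identity))    # same value as K_b
print(similarity_kernel(p, q, smooth))      # about 1.0, credits near matches
```

With the identity matrix only exactly matching bins contribute, which is the drawback described above; a smoother $\kappa$ also rewards mass that falls in similar but not identical bins.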
The system cannot directly observe $p(x)$ from any image, only a set X of local descriptors. Therefore, based on the empirical approximation to $K_s(p, q)$, a kernel between sets of vectors is defined as:
$$K(X, X') = \frac{1}{NN'} \sum_{x \in X} \sum_{x' \in X'} p(x)^{-1/2}\, q(x')^{-1/2}\, \kappa(x, x') \quad (4)$$
where N and N′ are the sizes of the descriptor sets from two images.
Let $\kappa(x, x') = \langle \phi(x), \phi(x') \rangle$, where $\phi(x)$ is the SV coding defined in the previous section. It is easy to see that $\kappa(x, x') = 0$ if x and x′ fall into different clusters. Then Eq. (4) can be written as follows:
$$K(X, X') = \frac{1}{NN'} \sum_{k=1}^{|C|} \sum_{x \in X_k} \sum_{x' \in X'_k} p(x)^{-1/2}\, q(x')^{-1/2}\, \kappa(x, x')$$
where $X_k$ is the subset of X that falls into the k-th cluster. Furthermore, assume that $p(x)$ remains constant within each cluster partition, i.e., $p(x)$ gives rise to a histogram $[p_k]_{k=1}^{|C|}$, then
$$K(X, X') = \frac{1}{NN'} \sum_{k=1}^{|C|} \left\langle \frac{1}{\sqrt{p_k}} \sum_{x \in X_k} \phi(x),\ \frac{1}{\sqrt{q_k}} \sum_{x' \in X'_k} \phi(x') \right\rangle$$
The above kernel can be re-written as an inner product kernel of the form $K(X, X') = \langle \Phi(X), \Phi(X') \rangle$, where
$$\Phi(X) = \frac{1}{N} \sum_{k=1}^{|C|} \frac{1}{\sqrt{p_k}} \sum_{x \in X_k} \phi(x).$$
Therefore functions in the reproducing kernel Hilbert space for this kernel has a linear representation ƒ(X)=wTΦ(X). In other words, Φ(X) can be used simply as nonlinear feature vector and then a linear classifier is learned using this feature vector. The effect is equivalent to using nonlinear kernel K(X,X′) between image pairs X and X′.
Finally, weighting by the histogram pk is equivalent to treating the density p(x) as piece-wise constant around each VQ basis, under a specific choice of background measure μ(x) that equalizes different partitions. This representation is not sensitive to the choice of background measure μ(x), which is image independent. In particular, a change of measure μ(•) (still piece-wise constant in each partition) leads to a rescaling of different components in Φ(X). This means that the space of linear classifiers ƒ(X)=wTΦ(X) remains the same.
To incorporate the spatial location information of x, the idea of spatial pyramid matching is applied. Let each image be evenly partitioned into 1×1, 2×2, and 3×1 blocks at three different levels, respectively. Based on which block each descriptor comes from, the whole set X of an image is then organized into three levels of subsets: X_{11}^{1}; X_{11}^{2}, X_{12}^{2}, X_{21}^{2}, X_{22}^{2}; and X_{11}^{3}, X_{12}^{3}, X_{13}^{3}, where the superscript is the level and the subscript is the block index. The pooling operation introduced in the last subsection can then be applied to each of the subsets. An image's spatial pyramid representation is obtained by concatenating the results of local pooling:
$$\Phi_s(X) = \left[ \Phi(X_{11}^{1}),\, \Phi(X_{11}^{2}),\, \Phi(X_{12}^{2}),\, \Phi(X_{21}^{2}),\, \Phi(X_{22}^{2}),\, \Phi(X_{11}^{3}),\, \Phi(X_{12}^{3}),\, \Phi(X_{13}^{3}) \right]$$
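The partition-and-concatenate step can be sketched as follows. This is an illustrative sketch, not the patented implementation: the grid shapes are given as (rows, cols) pairs, the orientation of the 3×1 level is an assumption, and `pool_fn` is a hypothetical stand-in for the pooling operator Φ (a simple mean is used here only to keep the example short).

```python
def block_index(x, y, width, height, rows, cols):
    """Map a pixel location to the index of its block in a rows-by-cols grid."""
    r = min(int(y * rows / height), rows - 1)
    c = min(int(x * cols / width), cols - 1)
    return r * cols + c

def spatial_pyramid(descriptors, coords, width, height, pool_fn,
                    levels=((1, 1), (2, 2), (3, 1))):
    """Pool descriptors inside every block of every level and concatenate."""
    feature = []
    for rows, cols in levels:
        blocks = [[] for _ in range(rows * cols)]
        for desc, (x, y) in zip(descriptors, coords):
            blocks[block_index(x, y, width, height, rows, cols)].append(desc)
        for block in blocks:
            feature.extend(pool_fn(block))
    return feature

def mean_pool(block, dim=1):
    """Hypothetical stand-in for Phi: per-dimension mean, zeros for an empty block."""
    if not block:
        return [0.0] * dim
    return [sum(d[i] for d in block) / len(block) for i in range(dim)]

# Two 1-D descriptors in a 100x100 image: 1 + 4 + 3 = 8 pooled blocks.
descriptors = [[1.0], [3.0]]
coords = [(10, 10), (90, 90)]
feature = spatial_pyramid(descriptors, coords, 100, 100, mean_pool)
print(feature)
```

With a pooling operator that returns a D-dimensional vector per block, the concatenated representation has 8·D entries for the three levels described above.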
Next, one embodiment of image classification is described. Image classification is done by applying classifiers to the image representations obtained from the pooling step. The task is to determine whether a particular category of objects is contained in an image, which can be translated into a binary classification problem. This is performed by applying a linear SVM that employs a hinge loss to learn g(X)=wTΦs(X). It should be noted that the function is nonlinear in X since Φs(X) is a nonlinear operator.
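As an illustration of learning g(X)=wTΦs(X) with a hinge loss, the sketch below minimizes (λ/2)‖w‖² plus the average hinge loss by full-batch subgradient descent. The solver, the hyperparameter values (`lam`, `epochs`, `lr`), and the toy feature vectors are assumptions chosen for brevity; the source does not specify which SVM solver is used.

```python
def score(w, phi):
    """Linear decision value g = w^T phi."""
    return sum(wi * pi for wi, pi in zip(w, phi))

def train_linear_svm(features, labels, lam=0.01, epochs=200, lr=0.1):
    """Subgradient descent on (lam/2)*||w||^2 + mean(max(0, 1 - y * w^T phi))."""
    dim = len(features[0])
    n = len(features)
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [lam * wi for wi in w]  # gradient of the regularizer
        for phi, y in zip(features, labels):
            if y * score(w, phi) < 1.0:  # margin violated: hinge term is active
                for i in range(dim):
                    grad[i] -= y * phi[i] / n
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w

# Toy image-level feature vectors with labels +1 (object present) and -1 (absent).
features = [[2.0, 0.0], [1.5, 0.5], [-2.0, 0.0], [-1.5, -0.5]]
labels = [1, 1, -1, -1]
w = train_linear_svm(features, labels)
print([1 if score(w, phi) > 0 else -1 for phi in features])
```

Each epoch touches every training vector once, so the cost per epoch grows linearly with the number of training images.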
The image-level classification function is closely connected to a real-valued function on local descriptors. Without loss of generality, let's assume that only global pooling is used, which means Φs(X)=Φ(X) in this case.
$$g(X) = w^T \Phi(X) = \frac{1}{N} \sum_{k=1}^{|C|} \frac{1}{\sqrt{p_k}} \sum_{x \in X_k} w^T \phi(x) = \frac{1}{N} \sum_{k=1}^{|C|} \frac{1}{\sqrt{p_k}} \sum_{x \in X_k} g(x) \qquad (5)$$
where g(x)=wTφ(x). The above equation provides an interesting insight into the classification process: patch-level pattern matching is performed everywhere in the image, and the responses are then aggregated to generate a score indicating how likely it is that a particular category of objects is present. This observation is well aligned with biologically inspired vision models, such as Convolutional Neural Networks and the HMAX model, which mostly employ feed-forward pattern matching for object recognition.
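Eq. (5) can be read as a two-stage computation: evaluate g(x) on every local descriptor, then aggregate with the per-cluster weights. The sketch below assumes the descriptor codes φ(x) and their cluster assignments have already been computed, and that p_k is estimated as the fraction of descriptors in cluster k; the function names and toy values are hypothetical.

```python
import math

def response_map(w, codes):
    """Patch-level responses g(x) = w^T phi(x), one per local descriptor."""
    return [sum(wi * ci for wi, ci in zip(w, code)) for code in codes]

def image_score(w, codes, assignments, num_clusters):
    """Image-level score g(X) of Eq. (5), aggregated from patch responses."""
    n = len(codes)
    counts = [0] * num_clusters
    for k in assignments:
        counts[k] += 1
    total = 0.0
    for g, k in zip(response_map(w, codes), assignments):
        p_k = counts[k] / n  # assumed histogram estimate of p_k
        total += g / math.sqrt(p_k)
    return total / n

# Three coded descriptors (two clusters, 1-D descriptors, s = 1).
codes = [[1.0, 0.5, 0.0, 0.0], [1.0, -0.5, 0.0, 0.0], [0.0, 0.0, 1.0, 0.25]]
assignments = [0, 0, 1]
w = [2.0, 1.0, -1.0, 4.0]
print(response_map(w, codes), image_score(w, codes, assignments, 2))
```

Plotting the values of `response_map` at the descriptors' image locations gives the kind of response map discussed in the text.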
This connection stresses the importance of learning a good coding scheme on local descriptors x, because φ(x) solely defines the function space of g(x)=wTφ(x), which consequently determines whether the unknown classification function can be learned well. The connection also implies that supervised training of φ(x) could potentially lead to further improvements.
Furthermore, the classification model enjoys the advantages of interpretability and computational scalability. Once the model is trained, Eq. (5) suggests that one can compute a response map based on g(x), which visualizes where the classifier focuses in the image. Since the proposed method naturally requires a linear classifier, its training cost scales linearly with the number of training images, whereas nonlinear kernel-based methods suffer quadratic or higher complexity.
The classification model is closely related to local coordinate coding (LCC), which points out that in some cases a desired sparsity of φ(x) should come from the locality of the coding scheme. Indeed, the proposed SV coding leads to a highly sparse representation φ(x), as defined by Eq. (2), which activates only those coordinates associated with the neighborhood of x. As a result, g(x)=wTφ(x) gives rise to a local linear function (i.e., piece-wise linear) that approximates the unknown nonlinear function ƒ(x). However, the computation of SV coding is much simpler than that of sparse coding approaches.
The method can be further improved by considering a soft assignment of x to bases C. The underlying interpretation of ƒ(x)≈wTφ(x) is the approximation
$$f(x) \approx f(v_*(x)) + \nabla f(v_*(x))^T (x - v_*(x))$$
which essentially uses the unknown function's Taylor expansion at a nearby location v*(x) to interpolate ƒ(x). One natural improvement is to use several neighbors in C instead of only the nearest one. Consider a soft K-means that computes pk(x), the posterior probability of cluster assignment for x. Then the function approximation can be handled as the expectation
$$f(x) \approx \sum_{k=1}^{|C|} p_k(x) \left[ f(v_k) + \nabla f(v_k)^T (x - v_k) \right]$$
Then the pooling step becomes a computation of the expectation
$$\Phi(X) = \frac{1}{N} \left[ \frac{1}{\sqrt{p_k}} \sum_{x \in X} p_k(x)\, (x - v_k + s) \right]_{k=1}^{|C|}$$
where
$$p_k = \frac{1}{N} \sum_{x \in X} p_k(x),$$
and s comes from Eq. (2). This approach differs from image classification using Gaussian mixture models (GMMs). Basically, those GMM methods consider the distribution kernel, while the inventive method incorporates nonlinear coding into the distribution kernel. Furthermore, the model according to the invention adheres to VQ: the soft version requires all the components to share the same isotropic diagonal covariance. That means far fewer parameters to estimate, and therefore a significantly higher accuracy can be obtained.
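The soft-assignment quantities pk(x) and pk can be sketched as below. The sketch assumes equal mixing weights and a single shared isotropic variance `sigma`**2, so that pk(x) is proportional to exp(−‖x − vk‖² / (2σ²)); the equal-weight assumption, the parameter `sigma`, and the function names are illustrative and are not stated in the source. Only the posteriors and their image-level average are computed here, not the soft pooled feature.

```python
import math

def soft_assign(x, codebook, sigma=1.0):
    """Posterior p_k(x) under equal-weight Gaussians with shared isotropic variance."""
    sq = [sum((xi - vi) ** 2 for xi, vi in zip(x, v)) for v in codebook]
    m = min(sq)  # subtract the smallest distance for numerical stability
    weights = [math.exp(-(d - m) / (2.0 * sigma ** 2)) for d in sq]
    total = sum(weights)
    return [wk / total for wk in weights]

def cluster_mass(X, codebook, sigma=1.0):
    """p_k = (1/N) * sum_{x in X} p_k(x)."""
    posteriors = [soft_assign(x, codebook, sigma) for x in X]
    return [sum(p[k] for p in posteriors) / len(X) for k in range(len(codebook))]

codebook = [[0.0], [2.0]]
X = [[0.0], [1.0], [2.0]]
print(soft_assign([1.0], codebook), cluster_mass(X, codebook))
```

As `sigma` shrinks, each posterior approaches a one-hot vector and the hard VQ assignment is recovered.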
As suggested by Eq. (5), a distinctive aspect of this method is the “transparency” of the classification model. Once the image classifier is trained, a real-valued function g(x) is automatically obtained at the local descriptor level. Therefore a response map of g(x) can be visualized on test images.
The invention may be implemented in hardware, firmware or software, or a combination of the three. Preferably the invention is implemented in a computer program executed on a programmable computer having a processor, a data storage system, volatile and non-volatile memory and/or storage elements, at least one input device and at least one output device.
By way of example, a block diagram of a computer to support the system is shown in FIG. 2. The computer preferably includes a processor, random access memory (RAM), a program memory (preferably a writable read-only memory (ROM) such as a flash ROM) and an input/output (I/O) controller coupled by a CPU bus. The computer may optionally include a hard drive controller which is coupled to a hard disk and CPU bus. Hard disk may be used for storing application programs, such as the present invention, and data. Alternatively, application programs may be stored in RAM or ROM. I/O controller is coupled by means of an I/O bus to an I/O interface. I/O interface receives and transmits data in analog or digital form over communication links such as a serial link, local area network, wireless link, and parallel link. Optionally, a display, a keyboard and a pointing device (mouse) may also be connected to I/O bus. Alternatively, separate connections (separate buses) may be used for I/O interface, display, keyboard and pointing device. Programmable processing system may be preprogrammed or it may be programmed (and reprogrammed) by downloading a program from another source (e.g., a floppy disk, CD-ROM, or another computer).
The system of FIG. 2 receives images to be classified. Each image is represented by a set of local descriptors with their spatial coordinates. The descriptors can be SIFT or any other local features, computed from image patches at locations on a 2D grid. In one embodiment, the image is processed by a descriptor coding module where each descriptor of an image is nonlinearly mapped to form a high-dimensional sparse vector. A nonlinear coding method called vector machine coding can be used, which is an extension of Vector Quantization (VQ) coding. Next, the coded descriptors are provided to a spatial pooling module. For each local region, the codes of all the descriptors in it are aggregated to form a single vector; the vectors of different regions are then concatenated to form the image-level feature vector. This step is based on a novel probability kernel incorporating the similarity metric of local descriptors. The image-level feature vector is normalized and fed into a classifier. Linear SVMs, which scale linearly with the size of the training data, are used in this step.
Each computer program is tangibly stored in a machine-readable storage media or device (e.g., program memory or magnetic disk) readable by a general or special purpose programmable computer, for configuring and controlling operation of a computer when the storage media or device is read by the computer to perform the procedures described herein. The inventive system may also be considered to be embodied in a computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner to perform the functions described herein.
The invention has been described herein in considerable detail in order to comply with the patent Statutes and to provide those skilled in the art with the information needed to apply the novel principles and to construct and use such specialized components as are required. However, it is to be understood that the invention can be carried out by specifically different equipment and devices, and that various modifications, both as to the equipment details and operating procedures, can be accomplished without departing from the scope of the invention itself.

Claims (6)

What is claimed is:
1. A computer-implemented method for image classification, comprising:
a. coding an image by nonlinearly mapping an image descriptor to form a high-dimensional sparse vector;
b. spatially pooling each local region to form an image-level feature vector using a probability kernel incorporating a similarity metric of local descriptors; and
c. classifying the image;
wherein the nonlinear mapping of f(x) is approximated by a linear function wTφ(x), where φ(x) is called the Super-Vector (SV) coding of x, defined by

$$\phi(x) = \left[ s\gamma_v(x),\; \gamma_v(x)(x - v)^T \right]^T_{v \in C},$$
and wherein the spatially pooling forms a modified Bhattacharyya kernel.
2. The method of claim 1, comprising representing the image as a set of descriptor vectors with their 2D location coordinates.
3. The method of claim 1, wherein the descriptor comprises one or more local features.
4. The method of claim 1, wherein the pooling comprises, for each local region, aggregating codes of all the descriptors to form a single vector and concatenating vectors of different regions to form an image-level feature vector.
5. The method of claim 1, comprising performing spatial pyramid matching to incorporate spatial location information.
6. The method of claim 1, comprising applying a linear support vector machine (SVM) to classify the image.
US12/818,156 2010-03-16 2010-06-18 Method and system for image classification Active 2031-07-07 US8447119B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US12/818,156 US8447119B2 (en) 2010-03-16 2010-06-18 Method and system for image classification

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US31438610P 2010-03-16 2010-03-16
US12/818,156 US8447119B2 (en) 2010-03-16 2010-06-18 Method and system for image classification

Publications (2)

Publication Number Publication Date
US20110229045A1 US20110229045A1 (en) 2011-09-22
US8447119B2 true US8447119B2 (en) 2013-05-21

Family

ID=44647307

Family Applications (1)

Application Number Title Priority Date Filing Date
US12/818,156 Active 2031-07-07 US8447119B2 (en) 2010-03-16 2010-06-18 Method and system for image classification

Country Status (1)

Country Link
US (1) US8447119B2 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9424493B2 (en) 2014-10-09 2016-08-23 Microsoft Technology Licensing, Llc Generic object detection in images
CN106203396A (en) * 2016-07-25 2016-12-07 南京信息工程大学 Aerial Images object detection method based on degree of depth convolution and gradient rotational invariance
CN107071858A (en) * 2017-03-16 2017-08-18 许昌学院 A kind of subdivision remote sensing image method for parallel processing under Hadoop
US9858496B2 (en) 2016-01-20 2018-01-02 Microsoft Technology Licensing, Llc Object detection and classification in images
US20180053057A1 (en) * 2016-08-18 2018-02-22 Xerox Corporation System and method for video classification using a hybrid unsupervised and supervised multi-layer architecture
US11134221B1 (en) 2017-11-21 2021-09-28 Daniel Brown Automated system and method for detecting, identifying and tracking wildlife

Families Citing this family (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120213426A1 (en) * 2011-02-22 2012-08-23 The Board Of Trustees Of The Leland Stanford Junior University Method for Implementing a High-Level Image Representation for Image Analysis
JP6050223B2 (en) 2011-11-02 2016-12-21 パナソニック インテレクチュアル プロパティ コーポレーション オブ アメリカPanasonic Intellectual Property Corporation of America Image recognition apparatus, image recognition method, and integrated circuit
US9412020B2 (en) 2011-11-09 2016-08-09 Board Of Regents Of The University Of Texas System Geometric coding for billion-scale partial-duplicate image search
CN103164713B (en) * 2011-12-12 2016-04-06 阿里巴巴集团控股有限公司 Image classification method and device
JP2015529365A (en) * 2012-09-05 2015-10-05 エレメント,インク. System and method for biometric authentication associated with a camera-equipped device
CN103049760B (en) * 2012-12-27 2016-05-18 北京师范大学 Based on the rarefaction representation target identification method of image block and position weighting
CN103106265B (en) * 2013-01-30 2016-10-12 北京工商大学 Similar image sorting technique and system
US9141885B2 (en) * 2013-07-29 2015-09-22 Adobe Systems Incorporated Visual pattern recognition in an image
CN103499584B (en) * 2013-10-16 2016-02-17 北京航空航天大学 Railway wagon hand brake chain bar loses the automatic testing method of fault
CN103544483B (en) * 2013-10-25 2016-09-14 合肥工业大学 A kind of joint objective method for tracing based on local rarefaction representation and system thereof
CN103927540B (en) * 2014-04-03 2019-01-29 华中科技大学 A kind of invariant feature extraction method based on biological vision hierarchical mode
US9913135B2 (en) 2014-05-13 2018-03-06 Element, Inc. System and method for electronic key provisioning and access management in connection with mobile devices
CN106664207B (en) 2014-06-03 2019-12-13 埃利蒙特公司 Attendance verification and management in relation to mobile devices
EP3204888A4 (en) * 2014-10-09 2017-10-04 Microsoft Technology Licensing, LLC Spatial pyramid pooling networks for image processing
US10339422B2 (en) * 2015-03-19 2019-07-02 Nec Corporation Object detection device, object detection method, and recording medium
US9514391B2 (en) * 2015-04-20 2016-12-06 Xerox Corporation Fisher vectors meet neural networks: a hybrid visual classification architecture
CN105512677B (en) * 2015-12-01 2019-02-01 南京信息工程大学 Classifying Method in Remote Sensing Image based on Hash coding
CN105654122B (en) * 2015-12-28 2018-11-16 江南大学 Based on the matched spatial pyramid object identification method of kernel function
CN106909895B (en) * 2017-02-17 2020-09-22 华南理工大学 Gesture recognition method based on random projection multi-kernel learning
CN107220659B (en) * 2017-05-11 2019-10-25 西安电子科技大学 High Resolution SAR image classification method based on total sparse model
CA3076038C (en) 2017-09-18 2021-02-02 Element Inc. Methods, systems, and media for detecting spoofing in mobile authentication
US11080324B2 (en) * 2018-12-03 2021-08-03 Accenture Global Solutions Limited Text domain image retrieval
KR20220004628A (en) 2019-03-12 2022-01-11 엘리먼트, 인크. Detection of facial recognition spoofing using mobile devices
CN110198473B (en) * 2019-06-10 2021-07-20 北京字节跳动网络技术有限公司 Video processing method and device, electronic equipment and computer readable storage medium
US11507248B2 (en) 2019-12-16 2022-11-22 Element Inc. Methods, systems, and media for anti-spoofing using eye-tracking
CN112001399B (en) * 2020-09-07 2023-06-09 中国人民解放军国防科技大学 Image scene classification method and device based on local feature saliency
CN113326880A (en) * 2021-05-31 2021-08-31 南京信息工程大学 Unsupervised image classification method based on community division
CN113793327B (en) * 2021-09-18 2023-12-26 北京中科智眼科技有限公司 Token-based high-speed rail foreign matter detection method
CN116092701B (en) * 2023-03-07 2023-06-30 南京康尔健医疗科技有限公司 Control system and method based on health data analysis management platform

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050058339A1 (en) * 2003-09-16 2005-03-17 Fuji Xerox Co., Ltd. Data recognition device
US20110044530A1 (en) * 2009-08-19 2011-02-24 Sen Wang Image classification using range information

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Jianchao Yang,et al., Linear Spatial Pyramid Matching Using Sparse Coding for Image Classification, IEEE Conference on Computer Vision and Pattern Recognition, 2009. CVPR 2009.
Svetlana Lazebnik, et al., Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories, IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2006.

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9424493B2 (en) 2014-10-09 2016-08-23 Microsoft Technology Licensing, Llc Generic object detection in images
US9858496B2 (en) 2016-01-20 2018-01-02 Microsoft Technology Licensing, Llc Object detection and classification in images
CN106203396A (en) * 2016-07-25 2016-12-07 南京信息工程大学 Aerial Images object detection method based on degree of depth convolution and gradient rotational invariance
CN106203396B (en) * 2016-07-25 2019-05-10 南京信息工程大学 Aerial Images object detection method based on depth convolution sum gradient rotational invariance
US20180053057A1 (en) * 2016-08-18 2018-02-22 Xerox Corporation System and method for video classification using a hybrid unsupervised and supervised multi-layer architecture
US9946933B2 (en) * 2016-08-18 2018-04-17 Xerox Corporation System and method for video classification using a hybrid unsupervised and supervised multi-layer architecture
CN107071858A (en) * 2017-03-16 2017-08-18 许昌学院 A kind of subdivision remote sensing image method for parallel processing under Hadoop
US11134221B1 (en) 2017-11-21 2021-09-28 Daniel Brown Automated system and method for detecting, identifying and tracking wildlife

Also Published As

Publication number Publication date
US20110229045A1 (en) 2011-09-22

Similar Documents

Publication Publication Date Title
US8447119B2 (en) Method and system for image classification
Zhou et al. Image classification using super-vector coding of local image descriptors
US10796145B2 (en) Method and apparatus for separating text and figures in document images
US8374442B2 (en) Linear spatial pyramid matching using sparse coding
US9978002B2 (en) Object recognizer and detector for two-dimensional images using Bayesian network based classifier
Wu et al. Conformal transformation of kernel functions: A data-dependent way to improve support vector machine classifiers
US9053392B2 (en) Generating a hierarchy of visual pattern classes
US8532399B2 (en) Large scale image classification
US9031331B2 (en) Metric learning for nearest class mean classifiers
JP5373536B2 (en) Modeling an image as a mixture of multiple image models
US8233711B2 (en) Locality-constrained linear coding systems and methods for image classification
US9141885B2 (en) Visual pattern recognition in an image
Kiang et al. An evaluation of self-organizing map networks as a robust alternative to factor analysis in data mining applications
US8428397B1 (en) Systems and methods for large scale, high-dimensional searches
US20170061257A1 (en) Generation of visual pattern classes for visual pattern regonition
US20210110215A1 (en) Information processing device, information processing method, and computer-readable recording medium recording information processing program
De la Torre et al. Multimodal oriented discriminant analysis
Wei et al. Region ranking SVM for image classification
Kumar et al. Semi-supervised robust mixture models in RKHS for abnormality detection in medical images
Han et al. High-order statistics of microtexton for hep-2 staining pattern classification
Yang et al. Projective non-negative matrix factorization with applications to facial image processing
Kumar et al. Kernel generalized Gaussian and robust statistical learning for abnormality detection in medical images
Sohail et al. Classification of ultrasound medical images using distance based feature selection and fuzzy-SVM
US11398057B2 (en) Imaging system and detection method
Hammer et al. How to visualize large data sets?

Legal Events

Date Code Title Description
STCF Information on status: patent grant

Free format text: PATENTED CASE

AS Assignment

Owner name: NEC CORPORATION, JAPAN

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:NEC LABORATORIES AMERICA, INC.;REEL/FRAME:031998/0667

Effective date: 20140113

FPAY Fee payment

Year of fee payment: 4

AS Assignment

Owner name: NEC CORPORATION, JAPAN

Free format text: CORRECTIVE ASSIGNMENT TO CORRECT THE REMOVE 8538896 AND ADD 8583896 PREVIOUSLY RECORDED ON REEL 031998 FRAME 0667. ASSIGNOR(S) HEREBY CONFIRMS THE ASSIGNMENT;ASSIGNOR:NEC LABORATORIES AMERICA, INC.;REEL/FRAME:042754/0703

Effective date: 20140113

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Year of fee payment: 8