WO2016142293A1 - Method and apparatus for image search using sparsifying analysis and synthesis operators - Google Patents
- Publication number
- WO2016142293A1 (PCT/EP2016/054664)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- image
- sparse representation
- operator
- triplet
- similarity metric
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/583—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/5838—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using colour
Definitions
- This invention relates to a method and an apparatus for image search, and more particularly, to a method and an apparatus for image search using sparsifying analysis and synthesis operators.
- a general image search algorithm can be seen as having various goals including: i) finding correctly matching images given task-specific search criteria and ii) doing so in a time- and resource-efficient manner, particularly in the context of large image databases.
- discriminative Mahalanobis metric learning methods have become an important part of the research toolbox.
- Such methods can be seen as applying an explicit linear transform to the image feature vector with the goal of making distance computations between transformed feature vectors better correspond to the search criteria.
- the linear transform can be learned using one of a variety of objectives in order to adapt it to various possible search criteria including image classification, face verification, or image ranking. Common to all these methods is the fact that the learned linear transform is a complete or undercomplete matrix that is constant for all image feature vectors.
- a method for performing image search comprising: accessing at least one of a first feature vector corresponding to a query image and a first sparse representation of the query image, the first sparse representation being based on an analysis operator and the first feature vector; determining a similarity metric between the query image and a second image of an image database using a synthesis operator, responsive to a second sparse representation and one of the first feature vector and the first sparse representation, the second sparse representation being based on the second feature vector corresponding to the second image and the analysis operator; and generating an image search output based on the similarity metric between the query image and the second image.
- the image search output may indicate one of (1) a rank of the second image and (2) whether the second image matches the query image.
- the method for performing image search may receive at least one of the first feature vector and the first sparse representation from a user device via a communication network, and may transmit the image search output to the user device via the communication network.
- the method for performing image search may determine the synthesis operator based on a set of pair-wise constraints, wherein each pair-wise constraint indicates whether a corresponding pair of training images are similar or dissimilar.
- the method for performing image search may determine the synthesis operator based on a set of triplet constraints, wherein each triplet constraint indicates that a first training image of the triplet is more similar to a second training image of the triplet than to a third training image of the triplet.
- the synthesis operator may be trained such that a similarity metric determined for training images corresponding to a pair-wise constraint or a triplet constraint is consistent with what the pair-wise constraint or the triplet constraint indicates.
- an apparatus for performing image search comprising: an input configured to access at least one of a first feature vector corresponding to a query image and a first sparse representation of the query image, the first sparse representation being based on an analysis operator and the first feature vector; and one or more processors configured to: determine a similarity metric between the query image and a second image of an image database using a synthesis operator, responsive to a second sparse representation and one of the first feature vector and the first sparse representation, the second sparse representation being based on the second feature vector corresponding to the second image and the analysis operator, and generate an image search output based on the similarity metric between the query image and the second image.
- the apparatus for performing image search may further comprise a communication interface configured to receive the at least one of the first feature vector and the first sparse representation from a user device via a communication network, and to transmit the image search output to the user device via the communication network.
- the apparatus for performing image search may determine the synthesis operator based on a set of pair-wise constraints, wherein each pair-wise constraint indicates whether a corresponding pair of training images are similar or dissimilar.
- the apparatus for performing image search may determine the synthesis operator based on a set of triplet constraints, wherein each triplet constraint indicates that a first training image of the triplet is more similar to a second training image of the triplet than to a third training image of the triplet.
- the synthesis operator may be trained such that a similarity metric determined for training images corresponding to a pair-wise constraint or a triplet constraint is consistent with what the pair-wise constraint or the triplet constraint indicates.
- the present embodiments also provide a non-transitory computer readable storage medium having stored thereon instructions for performing any of the methods described above.
- FIG. 1 illustrates an exemplary method for performing image search, according to an embodiment of the present principles.
- FIG. 2 illustrates an exemplary method for determining the operator for the similarity function.
- FIG. 3 shows exemplary pictures from an image training database, where similar images are grouped together, and images from different groups are dissimilar.
- FIG. 4 illustrates an exemplary analysis encoding process for generating a sparse code z for vector y, according to an embodiment of the present principles.
- FIG. 5A shows exemplary sparse codes for images in a database, where FIG. 5B is an expanded view of the upper-right portion of FIG. 5A.
- FIG. 6A shows a parametrized hinge loss function.
- FIG. 6B shows a continuous hinge loss function.
- FIG. 7A illustrates an exemplary learning process for learning operator B using pair-wise constraints for a symmetric similarity metric, according to an embodiment of the present principles.
- FIG. 7B illustrates an exemplary learning process for learning operator B using pair-wise constraints for an asymmetric similarity metric, according to an embodiment of the present principles.
- FIG. 8A illustrates an exemplary process for performing image matching, according to an embodiment of the present principles.
- FIG. 8B illustrates an exemplary process for performing image ranking for a query image, according to an embodiment of the present principles.
- FIG. 9 illustrates a block diagram of an exemplary system in which multiple user devices are connected to an image search engine according to the present principles.
- FIG. 10 illustrates a block diagram of an exemplary system in which various aspects of the exemplary embodiments of the present principles may be implemented.
- the present principles are directed to image search and provide various features compared to the existing methods.
- the proposed approaches may rely on a correlation metric to compare different items instead of a distance metric as in the majority of earlier works. This enables a more flexible framework than those based on the distance metric while offering computational efficiency.
- the proposed methods may use sparse representations in the proposed correlation metrics. This enables efficient storage of the data items in a database and improves the computation speed when used together with the proposed correlation metrics.
- the proposed methods can also be adapted for use with query items for which the sparse representation is not initially available so that correlation comparison can still be performed quickly while still providing the advantages mentioned above.
- scalars, vectors and matrices are denoted using, respectively, standard, bold, and uppercase-bold typeface (e.g., scalar $a$, vector $\mathbf{a}$ and matrix $\mathbf{A}$).
- $\mathbf{v}_k$ denotes a vector from a sequence $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_N$, and $v_k$ denotes the $k$-th coefficient of vector $\mathbf{v}$.
- $[\mathbf{a}_k]_k$ (respectively, $[a_k]_k$) denotes concatenation of the vectors $\mathbf{a}_k$ (scalars $a_k$) to form a single column vector.
- FIG. 1 illustrates an exemplary method 100 for performing image search, according to an embodiment of the present principles.
- a query image is input and image search will be performed in an image database to return one or more matching images for the query image.
- a feature vector is calculated for the query image.
- a feature vector of an image contains information describing an image's important characteristics.
- Image search algorithms usually rely on an image encoding function to compute the feature vector $\mathbf{y} \in \mathbb{R}^N$ from a given image.
- Common image feature construction approaches consist of first densely extracting local descriptors $\mathbf{x}_i \in \mathbb{R}^d$, such as SIFT (Scale-invariant feature transform), from multiple resolutions of the input image and then aggregating these descriptors into a single vector $\mathbf{y}$.
- SIFT Scale-invariant feature transform
- Common aggregation techniques include methods based on K-means models of the local descriptor distribution, such as bag-of-words and VLAD (Vector of Locally Aggregated Descriptors) encoding, and Fisher encoding, which is based on a GMM (Gaussian Mixture Model) model of the local descriptor distribution.
- VLAD Vector of Locally Aggregated Descriptors
- a compact representation is calculated for the feature vector.
- a compact representation of a given data is a point of interest since these representations provide a better understanding of the underlying structures in the data.
- Compact representation can be any representation that represents original vectors by smaller data.
- Compact representation can be obtained by linear projection on a subspace resulting in smaller vectors than the original data size, or can be sparse representation, for example, obtained using a synthesis model and an analysis model as described below.
- vector x is called the representation of vector y in dictionary D.
- This representation is often more useful when the representation $\mathbf{x}$ has only a few non-zero entries, i.e., when $\mathbf{x}$ is sparse.
- $\mathbf{x} = E(\mathbf{y}, \mathbf{D})$
- the encoder function $E(\cdot)$ enforces sparsity on $\mathbf{x}$ while keeping the distance to the original data vector, $d(\mathbf{y}, \mathbf{D}\mathbf{x})$, sufficiently small.
- a common example of such an encoder function is the lasso regression defined as
- the regression parameter $\lambda$ in Eq. (2) defines the tradeoff between the sparsity and the distance.
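- The body of Eq. (2) is not present in this extract. For orientation only, the lasso encoder is conventionally written as below; the squared Euclidean distance and the symbol $\lambda$ follow the standard textbook formulation and are assumptions, not a transcription of the original equation.

```latex
E(\mathbf{y}, \mathbf{D}) = \arg\min_{\mathbf{x}} \; \|\mathbf{y} - \mathbf{D}\mathbf{x}\|_2^2 + \lambda \|\mathbf{x}\|_1
```

- In this form, a larger $\lambda$ favors a sparser $\mathbf{x}$ at the cost of a larger distance between $\mathbf{y}$ and $\mathbf{D}\mathbf{x}$.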
- the output vector z contains essential information on y.
- the analysis operators can be very useful if the output, $\mathbf{z}$, is known to be sparse. However, unlike synthesis representations, given the vector $\mathbf{z}$ the original vector $\mathbf{y}$ is often not unique. Hence one can distinguish two ways of using analysis operators and sparsity. The first one is finding a vector $\mathbf{y}_s$ close to $\mathbf{y}$ that would have a sparse output vector (or sparse code) with $\mathbf{A}$, where $\mathbf{y}_s$ represents a vector for which $\mathbf{A}\mathbf{y}_s$ is sparse and $\mathbf{y}_s$ and $\mathbf{y}$ are as close as possible.
- distance or similarity measures are calculated between the query image and database images at step 140.
- the measures can be calculated using the feature vectors of the query image and database images, or using compact representations of the query image and the database images, for example, using a Mahalanobis metric.
- images are ranked at step 150.
- One or more matching images are then output at step 160.
- $\mathbf{M}$ (or $\mathbf{M}^T\mathbf{M}$) is the Mahalanobis metric transformation matrix.
- the Mahalanobis metric can also be used in nearest-neighbor-based classification methods.
- a set of labeled image feature vectors $\{\mathbf{y}_i\}_i$, each belonging to one of $C$ classes with a label in $\{1, \ldots, C\}$, is used as a classifier.
- the class label assigned to it is that of the nearest $\mathbf{y}_i$ under the Mahalanobis metric.
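- As a minimal sketch of the nearest-neighbor rule above, the following assumes the squared distance takes the common form $(\mathbf{y}_1 - \mathbf{y}_2)^T \mathbf{M}^T \mathbf{M} (\mathbf{y}_1 - \mathbf{y}_2)$; the function names and toy data are hypothetical and chosen only for illustration.

```python
import numpy as np

def mahalanobis_sq(y1, y2, M):
    """Squared distance (y1 - y2)^T M^T M (y1 - y2), computed as ||M (y1 - y2)||^2."""
    d = M @ (y1 - y2)
    return float(d @ d)

def nearest_neighbor_label(y, train_vectors, train_labels, M):
    """Return the label of the training vector closest to y under the metric M."""
    dists = [mahalanobis_sq(y, t, M) for t in train_vectors]
    return train_labels[int(np.argmin(dists))]

# Toy data (assumed): M stretches the first axis, so differences there cost more.
M = np.array([[3.0, 0.0], [0.0, 1.0]])
train_vectors = np.array([[1.0, 0.0], [0.0, 2.0]])
train_labels = [1, 2]
query = np.array([0.0, 0.0])
label = nearest_neighbor_label(query, train_vectors, train_labels, M)
```

- In this toy case the first training vector is closer in plain Euclidean distance, but the transform makes the second one nearer, which is the kind of re-weighting a learned metric is meant to provide.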
- FIG. 2 illustrates an exemplary method 200 for determining the operator for the similarity function.
- a training set is input at step 210, which may be a database with annotations, for example, indicating whether pictures are similar or dissimilar.
- the database imposes constraints on the similarity function at step 220, for example, if two pictures are indicated as similar in the training database, the learned similarity function should provide a high similarity score between these two pictures.
- the operator for the similarity function can be learned at step 230. In the following, the similarity constraints and various learning methods are described in further detail.
- each constraint is defined by a pair of data points and an indicator variable as
- FIG. 3 shows exemplary pictures from an image training database, where similar images are grouped together, and images from different groups are dissimilar. Particularly, pictures in the same row (310, 320, 330, or 340) are grouped together in FIG. 3. The pair-wise constraints between two images within a group are set to 1, and the pair-wise constraints between two images from different groups are set to -1.
- the task of matching can be described as determining whether a given query data belongs to a cluster in a dataset. For example, in face recognition systems, the given facial picture of a person is compared to other facial data of the same person within the database to perform verification. It is also possible to perform matching between two given data points even though these points belong to a cluster different from the observed clusters in the database.
- a more informative set of constraints is defined by a triplet of data points as
- the task of ranking can be defined as finding a function $S(\mathbf{v}_1, \mathbf{v}_2)$, given the constraints, such that for any given triplet of items $(q_1, q_2, q_3)$ obeying $S^*(q_1, q_2) > S^*(q_1, q_3)$, the function $S(\cdot)$ satisfies the condition $S(\mathbf{y}_{q_1}, \mathbf{y}_{q_2}) > S(\mathbf{y}_{q_1}, \mathbf{y}_{q_3})$.
- Ranking enables sorting the database items based on their similarity to a query item, and it is an essential part of applications such as data search and retrieval. An example of such an application is image-based search in a large database of images according to specific similarity criteria.
- the analysis encoder computes $\mathbf{A}\mathbf{y}$, for example, using linear projection.
- the analysis encoder generates sparse code $\mathbf{z}$ using a non-linear sparsifying function.
- the non-linear sparsifying function can be, for example, but not limited to, hard thresholding, soft thresholding, or a function that sets some values to zero and modifies other values.
- the non-linear sparsifying function can also be a step function or a sigmoid function.
- the processed vector is then output as sparse code z.
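- The encoding steps above (linear projection followed by a sparsifying non-linearity) can be sketched as follows. This is an illustrative sketch, not the patented encoder: the threshold parameter `t`, the function names and the toy operator `A` are assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    """Shrink every entry toward zero by t; entries with |v| <= t become zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def hard_threshold(v, t):
    """Keep entries with |v| > t unchanged and set the others to zero."""
    return np.where(np.abs(v) > t, v, 0.0)

def analysis_encode(y, A, t, sparsify=soft_threshold):
    """Sparse code z: linear projection A @ y, then a non-linear sparsifying function."""
    return sparsify(A @ y, t)

# Toy analysis operator (assumed) with 4 rows acting on 2-dimensional features.
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]])
y = np.array([2.0, -0.5])
z_soft = analysis_encode(y, A, t=1.0)
z_hard = analysis_encode(y, A, t=1.0, sparsify=hard_threshold)
```

- Both variants zero out the small projection coefficient; soft thresholding additionally shrinks the surviving entries, while hard thresholding leaves them unchanged.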
- FIG. 5A shows exemplary sparse codes for images in a database
- FIG. 5B is an expanded view of the upper-right portion of FIG. 5A.
- a dark pixel indicates "0" in the vector
- a gray pixel indicates the magnitude of a non-zero entry in the vector.
- the sparse code z based on analysis operator A can be used with a synthesis operator B to generate a similarity metric.
- the synthesis operator B here applies to sparse vectors.
- When the sparse code of a query image is needed, it is computed online, while sparse codes of the database images can be computed offline beforehand without affecting the speed of the comparison.
- an asymmetric similarity function as described in Eq. (19) without requiring the sparse representation of the query image can be very useful since skipping the computation of the sparse code can provide significant speed improvement.
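- Eqs. (18) and (19) are not reproduced in this extract. The sketch below assumes the bilinear forms $\mathbf{z}_1^T \mathbf{B} \mathbf{z}_2$ (symmetric) and $\mathbf{y}_1^T \mathbf{B} \mathbf{z}_2$ (asymmetric), and shows how only the non-zero entries of the sparse codes need to be touched. Function names and toy values are hypothetical.

```python
import numpy as np

def similarity_sym(z1, z2, B_sm):
    """Assumed symmetric form z1^T B z2, restricted to non-zero entries of both codes."""
    i = np.flatnonzero(z1)
    j = np.flatnonzero(z2)
    return float(z1[i] @ B_sm[np.ix_(i, j)] @ z2[j])

def similarity_asym(y1, z2, B_asm):
    """Assumed asymmetric form y1^T B z2: no sparse code is needed for the query."""
    j = np.flatnonzero(z2)
    return float(y1 @ (B_asm[:, j] @ z2[j]))

# Toy values (assumed): 3-dimensional sparse codes, 2-dimensional feature vector.
z1 = np.array([0.0, 2.0, 0.0])
z2 = np.array([1.0, 0.0, 3.0])
B_sm = np.arange(9, dtype=float).reshape(3, 3)
y1 = np.array([1.0, -1.0])
B_asm = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]])
s_sym = similarity_sym(z1, z2, B_sm)
s_asym = similarity_asym(y1, z2, B_asm)
```

- The cost of the symmetric form scales with the product of the numbers of non-zero entries in the two codes rather than with the full size of the operator, which is where the claimed speed benefit of sparse codes would come from under these assumptions.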
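- Eq. (21) is a minimization over $\mathbf{B}$ accumulated over the training constraints (the equation body is not recoverable from this extract), in which the similarity function is set as either $S_{sm}(\mathbf{z}_i, \mathbf{z}_j)$ or $S_{asm}(\mathbf{y}_i, \mathbf{z}_j)$, as defined in Eq. (18) or (19), respectively.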
- the function $\ell(\cdot)$ in Eq. (21) is a function that penalizes the incorrectly estimated similarities in the training set, i.e., when $\gamma_p S_{i_p j_p}$ is negative.
- FIG. 7A illustrates an exemplary learning process 700A for learning operator B using pair-wise constraints for a symmetric similarity metric, according to an embodiment of the present principles.
- the set of annotations, $\{\gamma_i\}_i$, is also input to the learning process.
- analysis encoder (710) can generate sparse codes $\mathbf{z}_{1i}$ and $\mathbf{z}_{2i}$, respectively.
- a penalty function (730), for example, as described in Eq. (21).
- the penalty function sets a large value when the estimated similarity metric does not match the annotated result.
- the penalty function is accumulated over the training vector pairs, and the synthesis operator that minimizes the penalty function, i.e., the synthesis operator that provides the closest similarity metric to the annotation results, is chosen as the solution $\mathbf{B}_{sm}$.
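- A minimal sketch of such a learning loop is given below. It assumes the bilinear similarity $\mathbf{z}_1^T \mathbf{B} \mathbf{z}_2$, a plain hinge penalty $\max(0, 1 - \gamma S)$ and stochastic subgradient updates; the learning rate, epoch count, function names and toy data are all hypothetical, and the parametrized or continuous hinge variants of FIGs. 6A and 6B are not modeled.

```python
import numpy as np

def hinge(u):
    """Plain hinge penalty (an assumed choice): zero once u reaches the margin of 1."""
    return max(0.0, 1.0 - u)

def total_penalty(B, Z1, Z2, gamma):
    """Penalty accumulated over all annotated training pairs."""
    return sum(hinge(g * (z1 @ B @ z2)) for z1, z2, g in zip(Z1, Z2, gamma))

def learn_B_sym(Z1, Z2, gamma, lr=0.1, epochs=50, seed=0):
    """Stochastic subgradient descent on the accumulated hinge penalty."""
    rng = np.random.default_rng(seed)
    K = Z1.shape[1]
    B = np.zeros((K, K))
    for _ in range(epochs):
        for p in rng.permutation(len(gamma)):
            s = Z1[p] @ B @ Z2[p]
            if 1.0 - gamma[p] * s > 0.0:  # pair p still violates the margin
                # d(z1^T B z2)/dB = outer(z1, z2); step so that gamma * s increases
                B += lr * gamma[p] * np.outer(Z1[p], Z2[p])
    return B

# Toy sparse codes (assumed): annotations are +1 for similar, -1 for dissimilar.
Z1 = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
Z2 = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
gamma = np.array([1.0, -1.0, 1.0])
penalty_before = total_penalty(np.zeros((2, 2)), Z1, Z2, gamma)
B_sm = learn_B_sym(Z1, Z2, gamma)
penalty_after = total_penalty(B_sm, Z1, Z2, gamma)
```

- On this toy set each pair touches a different entry of the operator, so the accumulated penalty can be driven to zero; real training sets would generally leave a non-zero residual.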
- FIG. 7B illustrates an exemplary asymmetric learning process 700B for learning operator B using pair-wise constraints for an asymmetric similarity metric, according to an embodiment of the present principles.
- the input of the learning process includes many training vector pairs, $\{\mathbf{y}_{1i}, \mathbf{y}_{2i}\}_i$, and the annotation set $\{\gamma_i\}_i$.
- analysis encoder 750
- a similarity function 760
- $S_{asm}(\mathbf{y}_{1i}, \mathbf{y}_{2i}) = \mathbf{y}_{1i}^T \mathbf{B} \mathbf{z}_{2i}$
- the solution $\mathbf{B}_{asm}$ to the penalty function is output as the synthesis operator.
- the learning process when triplet constraints are used for training is similar to process 700A or 700B.
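- the input now includes training vector triplets $\{\mathbf{y}_{1i}, \mathbf{y}_{2i}, \mathbf{y}_{3i}\}_i$, where $\mathbf{y}_{1i}$ and $\mathbf{y}_{2i}$ are more similar than $\mathbf{y}_{1i}$ and $\mathbf{y}_{3i}$ are.
- the analysis encoder generates sparse codes $\mathbf{z}_{1i}, \mathbf{z}_{2i}, \mathbf{z}_{3i}$ for each training vector triplet $\mathbf{y}_{1i}, \mathbf{y}_{2i}, \mathbf{y}_{3i}$, respectively.
- the similarity function is applied to $\mathbf{z}_{1i}, \mathbf{z}_{2i}$, and to $\mathbf{z}_{1i}, \mathbf{z}_{3i}$ to get $S_{sm}(\mathbf{y}_{1i}, \mathbf{y}_{2i})$ and $S_{sm}(\mathbf{y}_{1i}, \mathbf{y}_{3i})$, respectively.
- the penalty function takes $S_{sm}(\mathbf{y}_{1i}, \mathbf{y}_{2i})$ and $S_{sm}(\mathbf{y}_{1i}, \mathbf{y}_{3i})$ as input, and penalizes when $S_{sm}(\mathbf{y}_{1i}, \mathbf{y}_{2i})$ indicates less similarity than $S_{sm}(\mathbf{y}_{1i}, \mathbf{y}_{3i})$.
- For the asymmetric learning process, the analysis encoder generates sparse codes $\mathbf{z}_{2i}$ and $\mathbf{z}_{3i}$ for training vectors $\mathbf{y}_{2i}$ and $\mathbf{y}_{3i}$, respectively.
- the similarity function is applied to $\mathbf{y}_{1i}, \mathbf{z}_{2i}$, and to $\mathbf{y}_{1i}, \mathbf{z}_{3i}$ to get $S_{asm}(\mathbf{y}_{1i}, \mathbf{y}_{2i})$ and $S_{asm}(\mathbf{y}_{1i}, \mathbf{y}_{3i})$, respectively.
- the penalty function takes $S_{asm}(\mathbf{y}_{1i}, \mathbf{y}_{2i})$ and $S_{asm}(\mathbf{y}_{1i}, \mathbf{y}_{3i})$ as input, and penalizes when $S_{asm}(\mathbf{y}_{1i}, \mathbf{y}_{2i})$ indicates less similarity than $S_{asm}(\mathbf{y}_{1i}, \mathbf{y}_{3i})$.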
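- The triplet case can be sketched in the same spirit. The snippet assumes the bilinear symmetric similarity and a hinge on the difference between the two similarities with a margin of 1; the margin, step size and names are hypothetical choices, not values taken from the source.

```python
import numpy as np

def triplet_penalty(B, z1, z2, z3, margin=1.0):
    """Hinge on S(1,2) - S(1,3): zero once item 2 beats item 3 by the margin."""
    return max(0.0, margin - (z1 @ B @ z2 - z1 @ B @ z3))

def triplet_sgd_step(B, z1, z2, z3, lr=0.1, margin=1.0):
    """One subgradient step; B is returned unchanged if the triplet is satisfied."""
    if triplet_penalty(B, z1, z2, z3, margin) > 0.0:
        # d/dB of z1^T B (z2 - z3) is outer(z1, z2 - z3)
        B = B + lr * np.outer(z1, z2 - z3)
    return B

# Toy triplet (assumed): code 1 should end up more similar to code 2 than to code 3.
z1 = np.array([1.0, 0.0])
z2 = np.array([1.0, 0.0])
z3 = np.array([0.0, 1.0])
B = np.zeros((2, 2))
penalty_start = triplet_penalty(B, z1, z2, z3)
for _ in range(20):
    B = triplet_sgd_step(B, z1, z2, z3)
penalty_end = triplet_penalty(B, z1, z2, z3)
```

- For the asymmetric variant, the same step would apply with the feature vector of the first item in place of its sparse code, under the assumed form of the asymmetric similarity.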
- FIG. 8A illustrates an exemplary process 800A for performing image matching, according to an embodiment of the present principles.
- Two input images are represented by feature vectors $\mathbf{y}_1$ and $\mathbf{y}_2$, respectively (810).
- Analysis encoder (820) is used to sparsify vectors $\mathbf{y}_1$ and $\mathbf{y}_2$ to generate sparse codes $\mathbf{z}_1$ and $\mathbf{z}_2$, respectively.
- a similarity metric (830) can be calculated based on the sparse codes, for example, using the similarity function as in Eq. (18) using the symmetric operator $\mathbf{B}_{sm}$. Based on whether the similarity metric exceeds a threshold or not, i.e., indicates a high similarity or not, the image matching process decides whether the two input images are matching or not.
- FIG. 8B illustrates an exemplary process 800B for performing image ranking for a query image, according to an embodiment of the present principles.
- feature vectors $\mathbf{y}_q$ and $\mathbf{y}_1, \ldots, \mathbf{y}_n$ are generated (850), respectively.
- Analysis encoder (860) is used to sparsify vectors $\mathbf{y}_q$ and $\mathbf{y}_1$, . . .
- a post-processing step can be added to the encoder function that appends an extra entry equal to 1 to vector $\mathbf{z}$, which would further improve the flexibility of the proposed encoding algorithms.
- a pre-processing step can be added to the encoder function that appends an extra entry equal to 1 to vector $\mathbf{y}$, which would further improve the flexibility of the proposed matching and ranking algorithms.
- the process of computing the sparse codes $\mathbf{z}_1, \ldots, \mathbf{z}_n$ can be performed offline.
- the corresponding encoding functions can be pre-computed offline and the sparse codes can be stored.
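- With the database codes stored offline, ranking for a query reduces to scoring every stored code and sorting. The sketch below assumes the bilinear similarity $\mathbf{z}_q^T \mathbf{B} \mathbf{z}_k$ and dense NumPy arrays for brevity; `rank_database` and the toy values are hypothetical.

```python
import numpy as np

def rank_database(z_query, Z_db, B):
    """Return database indices sorted from most to least similar, plus the scores.

    Z_db holds one precomputed sparse code per row; scores[k] = z_query^T B z_k.
    """
    scores = Z_db @ (B.T @ z_query)
    return np.argsort(-scores), scores

# Toy operator and codes (assumed); Z_db would be computed and stored offline.
B = np.array([[1.0, 0.0], [0.0, 2.0]])
Z_db = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
z_query = np.array([1.0, 1.0])
order, scores = rank_database(z_query, Z_db, B)
```

- Only the product of the operator with the query code is computed online; a matching decision for a single database image could then be made by comparing its score to a threshold.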
- the penalty function $\ell(\cdot)$ can be selected as the continuous hinge function as shown in FIG. 6B.
- the objective includes two regularization functions, one for each of the matrices $\mathbf{A}$ and $\mathbf{B}$.
- Examples of regularization functions for the operators are functions enforcing normalized rows, a sparse structure or a diagonal structure on the operators.
- Although the objective function in Eq. (25) is non-linear and non-convex, it can still be minimized using off-the-shelf optimization methods such as stochastic gradient descent.
- FIG. 9 illustrates an exemplary system 900 that has multiple user devices connected to an image search engine according to the present principles.
- one or more user devices (910, 920, and 930) can communicate with image search engine 960 through network 940.
- the image search engine is connected to multiple users, and each user may communicate with the image search engine through multiple user devices.
- the user interface devices may be remote controls, smart phones, personal digital assistants, display devices, computers, tablets, computer terminals, digital video recorders, or any other wired or wireless devices that can provide a user interface.
- the image search engine 960 may implement various methods as discussed above.
- Image database 950 contains one or more databases that can be used as a data source for searching images that match a query image or for training the parameters.
- a user device may request, through network 940, a search to be performed by image search engine 960 based on a query image.
- Upon receiving the request, the image search engine 960 returns one or more matching images and/or their rankings.
- the image database 950 provides the matched image(s) to the requesting user device or another user device (for example, a display device).
- the user device may send the query image directly to the image search engine.
- the user device may process the query image and send a signal representative of the query image.
- the user device may perform feature extraction on the query image and send the feature vector to the search engine.
- the user device may further apply the sparsifying function and send the sparse representation of the query image to the image search engine.
- the image search may also be implemented in a user device itself. For example, a user may decide to use a family photo as a query image, and to search other photos in his smartphone with the same family members.
- FIG. 10 illustrates a block diagram of an exemplary system 1000 in which various aspects of the exemplary embodiments of the present principles may be implemented.
- System 1000 may be embodied as a device including the various components described below and is configured to perform the processes described above. Examples of such devices include, but are not limited to, personal computers, laptop computers, smartphones, tablet computers, digital multimedia set top boxes, digital television receivers, personal video recording systems, connected home appliances, and servers.
- System 1000 may be communicatively coupled to other similar systems, and to a display via a communication channel as shown in FIG. 10 and as known by those skilled in the art to implement the exemplary video system described above.
- the system 1000 may include at least one processor 1010 configured to execute instructions loaded therein for implementing the various processes as discussed above.
- Processor 1010 may include embedded memory, input output interface and various other circuitries as known in the art.
- the system 1000 may also include at least one memory 1020 (e.g., a volatile memory device, a non-volatile memory device).
- System 1000 may additionally include a storage device 1040, which may include non-volatile memory, including, but not limited to, EEPROM, ROM, PROM, RAM, DRAM, SRAM, flash, magnetic disk drive, and/or optical disk drive.
- the storage device 1040 may comprise an internal storage device, an attached storage device and/or a network accessible storage device, as non-limiting examples.
- System 1000 may also include an image search engine 1030 configured to process data to provide image matching and ranking results.
- Image search engine 1030 represents the module(s) that may be included in a device to perform the image search functions.
- Image search engine 1030 may be implemented as a separate element of system 1000 or may be incorporated within processors 1010 as a combination of hardware and software as known to those skilled in the art.
- Program code to be loaded onto processors 1010 to perform the various processes described hereinabove may be stored in storage device 1040 and subsequently loaded onto memory 1020 for execution by processors 1010.
- one or more of the processor(s) 1010, memory 1020, storage device 1040 and image search engine 1030 may store one or more of the various items during the performance of the processes discussed herein above, including, but not limited to a query image, the analysis operator, synthesis operator, sparse codes, equations, formula, matrices, variables, operations, and operational logic.
- the system 1000 may also include communication interface 1050 that enables communication with other devices via communication channel 1060.
- the communication interface 1050 may include, but is not limited to a transceiver configured to transmit and receive data from communication channel 1060.
- the communication interface may include, but is not limited to, a modem or network card and the communication channel may be implemented within a wired and/or wireless medium.
- the various components of system 1000 may be connected or communicatively coupled together using various suitable connections, including, but not limited to internal buses, wires, and printed circuit boards.
- the exemplary embodiments according to the present principles may be carried out by computer software implemented by the processor 1010 or by hardware, or by a combination of hardware and software.
- the exemplary embodiments according to the present principles may be implemented by one or more integrated circuits.
- the memory 1020 may be of any type appropriate to the technical environment and may be implemented using any appropriate data storage technology, such as optical memory devices, magnetic memory devices, semiconductor-based memory devices, fixed memory and removable memory, as non-limiting examples.
- the processor 1010 may be of any type appropriate to the technical environment, and may encompass one or more of microprocessors, general purpose computers, special purpose computers and processors based on a multi-core architecture, as non-limiting examples.
- the implementations described herein may be implemented in, for example, a method or a process, an apparatus, a software program, a data stream, or a signal. Even if only discussed in the context of a single form of implementation (for example, discussed only as a method), the implementation of features discussed may also be implemented in other forms (for example, an apparatus or program).
- An apparatus may be implemented in, for example, appropriate hardware, software, and firmware.
- the methods may be implemented in, for example, an apparatus such as, for example, a processor, which refers to processing devices in general, including, for example, a computer, a microprocessor, an integrated circuit, or a programmable logic device. Processors also include communication devices, such as, for example, computers, cell phones, portable/personal digital assistants ("PDAs"), and other devices that facilitate communication of information between end-users.
- PDAs portable/personal digital assistants
- the appearances of the phrase “in one embodiment” or “in an embodiment” or “in one implementation” or “in an implementation”, as well as any other variations, appearing in various places throughout the specification are not necessarily all referring to the same embodiment.
- Determining the information may include one or more of, for example, estimating the information, calculating the information, predicting the information, or retrieving the information from memory.
- Accessing the information may include one or more of, for example, receiving the information, retrieving the information (for example, from memory), storing the information, processing the information, transmitting the information, moving the information, copying the information, erasing the information, calculating the information, determining the information, predicting the information, or estimating the information.
- Receiving is, as with “accessing”, intended to be a broad term.
- Receiving the information may include one or more of, for example, accessing the information, or retrieving the information (for example, from memory).
- “receiving” is typically involved, in one way or another, during operations such as, for example, storing the information, processing the information, transmitting the information, moving the information, copying the information, erasing the information, calculating the information, determining the information, predicting the information, or estimating the information.
- implementations may produce a variety of signals formatted to carry information that may be, for example, stored or transmitted.
- the information may include, for example, instructions for performing a method, or data produced by one of the described implementations.
- a signal may be formatted to carry the bitstream of a described embodiment.
- Such a signal may be formatted, for example, as an electromagnetic wave (for example, using a radio frequency portion of spectrum) or as a baseband signal.
- the formatting may include, for example, encoding a data stream and modulating a carrier with the encoded data stream.
- the information that the signal carries may be, for example, analog or digital information.
- the signal may be transmitted over a variety of different wired or wireless links, as is known.
- the signal may be stored on a processor-readable medium.
Landscapes
- Engineering & Computer Science (AREA)
- Library & Information Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Image Analysis (AREA)
Abstract
In a particular implementation, images are represented by feature vectors, whose sparse representations are computed using an analysis operator. The sparse representations of the images and a synthesis operator are used to efficiently compute similarity metrics between the images. When the sparse representation of the query image is readily available, a symmetric similarity metric is calculated using the sparse representation of the query image and database images. Otherwise, when the sparse representation of the query image is not available, an asymmetric similarity metric can be calculated using the feature vector of the query image and sparse representations of the database images. Given pair-wise constraints or triplet constraints on similarity, the synthesis operator can be computed using an optimization problem based on a penalty function. Also, the synthesis operator can be learned jointly with the analysis operator.
Description
Method and Apparatus for Image Search Using Sparsifying Analysis and Synthesis Operators
TECHNICAL FIELD
[1] This invention relates to a method and an apparatus for image search, and more particularly, to a method and an apparatus for image search using sparsifying analysis and synthesis operators.
BACKGROUND
[2] With the increasing size of image collections and the related difficulty in manually annotating them, automated image search and comparison methods have become crucial when searching for relevant images in a large collection. Many systems that currently exist enable such search approaches, including commercial web search engines that admit an image as the query and return a ranked list of relevant web images; copyright infringement detection methods that are robust to image manipulations such as cropping, rotation, mirroring and picture-in-picture artifacts; semantic search systems that enable querying of an unannotated private image collection based on visual concepts (e.g., cat); object detection systems that are robust to the image background content; automatic face verification methods; and vision-based navigation systems used, for example, as part of the control mechanism of self-driving cars.
[3] A general image search algorithm can be seen as having various goals including: i) finding correctly matching images given task-specific search criteria and ii) doing so in a time and resource efficient manner, particularly in the context of large image databases. In addressing the first goal, discriminative Mahalanobis metric learning methods have become an important part of the research toolbox. Such methods can be seen as applying an explicit linear transform to the image feature vector with the goal of making distance computations between transformed feature vectors better correspond to the search criteria. The linear transform can be learned using one of a variety of objectives in order to adapt it to various possible search criteria including image classification, face verification, or image ranking. Common to all these methods is the fact that the learned linear transform is a complete or undercomplete matrix that is constant for all image feature vectors.
[4] Concerning the second goal of image search systems, that of time and resource efficiency of the search process, one recent successful method represents all the database images using a product quantizer that enables both a very compact representation and a very efficient comparison to query features. Methods based on sparse coding have also been attempted, as sparse coding can be seen as a generalization of vector quantization. The general aim of these methods is to produce a very compact representation of the input feature vector consisting, for example, of codeword indices and possibly accompanying weights.
SUMMARY
[5] According to a general aspect, a method for performing image search is presented, comprising: accessing at least one of a first feature vector corresponding to a query image and a first sparse representation of the query image, the first sparse representation being based on an analysis operator and the first feature vector; determining a similarity metric between the query image and a second image of an image database using a synthesis operator, responsive to a second sparse representation and one of the first feature vector and the first sparse representation, the second sparse representation being based on the second feature vector corresponding to the second image and the analysis operator; and generating an image search output based on the similarity metric between the query image and the second image.
[6] The image search output may indicate one of (1) a rank of the second image and (2) whether the second image matches the query image.
[7] The method for performing image search may receive at least one of the first feature vector and the first sparse representation from a user device via a communication network, and may transmit the image search output to the user device via the communication network.
[8] The method for performing image search may determine the synthesis operator based on a set of pair-wise constraints, wherein each pair-wise constraint indicates whether a corresponding pair of training images are similar or dissimilar.
[9] The method for performing image search may determine the synthesis operator based on a set of triplet constraints, wherein each triplet constraint indicates that a first training image of the triplet is more similar to a second training image of the triplet than to a third training image of the triplet.
[10] The synthesis operator may be trained such that a similarity metric determined for training images corresponding to a pair-wise constraint or a triplet constraint is consistent with what the pair-wise constraint or the triplet constraint indicates.
[11] The similarity metric may be determined as $S_{sm}(\mathbf{z}_q, \mathbf{z}_2) = \mathbf{z}_q^T \mathbf{B}^T \mathbf{B} \mathbf{z}_2$, wherein $S_{sm}()$ is the similarity metric, $\mathbf{z}_q$ is the first sparse representation of the query image, $\mathbf{z}_2$ is the second sparse representation of the second image in the image database, and $\mathbf{B}$ is the synthesis operator.
[12] The similarity metric may also be determined as $S_{asm}(\mathbf{y}_q, \mathbf{z}_2) = \mathbf{y}_q^T \mathbf{B} \mathbf{z}_2$, wherein $S_{asm}()$ is the similarity metric, $\mathbf{y}_q$ is the first feature vector of the query image, $\mathbf{z}_2$ is the second sparse representation of the second image in the image database, and $\mathbf{B}$ is the synthesis operator.
[13] According to another general aspect, an apparatus for performing image search is presented, comprising: an input configured to access at least one of a first feature vector corresponding to a query image and a first sparse representation of the query image, the first sparse representation being based on an analysis operator and the first feature vector; and one or more processors configured to: determine a similarity metric between the query image and a second image of an image database using a synthesis operator, responsive to a second sparse representation and one of the first feature vector and the first sparse representation, the second sparse representation being based on the second feature vector corresponding to the second image and the analysis operator, and generate an image search output based on the similarity metric between the query image and the second image.
[14] The apparatus for performing image search may further comprise a communication interface configured to receive the at least one of the first feature vector and the first sparse representation from a user device via a communication network, and to transmit the image search output to the user device via the communication network.
[15] The apparatus for performing image search may determine the synthesis operator based on a set of pair-wise constraints, wherein each pair-wise constraint indicates whether a corresponding pair of training images are similar or dissimilar.
[16] The apparatus for performing image search may determine the synthesis operator based on a set of triplet constraints, wherein each triplet constraint indicates that a first training image of the triplet is more similar to a second training image of the triplet than to a third training image of the triplet.
[17] The synthesis operator may be trained such that a similarity metric determined for training images corresponding to a pair-wise constraint or a triplet constraint is consistent with what the pair-wise constraint or the triplet constraint indicates.
[18] The similarity metric may be determined as $S_{sm}(\mathbf{z}_q, \mathbf{z}_2) = \mathbf{z}_q^T \mathbf{B}^T \mathbf{B} \mathbf{z}_2$, wherein $S_{sm}()$ is the similarity metric, $\mathbf{z}_q$ is the first sparse representation of the query image, $\mathbf{z}_2$ is the second sparse representation of the second image in the image database, and $\mathbf{B}$ is the synthesis operator.
[19] The similarity metric may also be determined as $S_{asm}(\mathbf{y}_q, \mathbf{z}_2) = \mathbf{y}_q^T \mathbf{B} \mathbf{z}_2$, wherein $S_{asm}()$ is the similarity metric, $\mathbf{y}_q$ is the first feature vector of the query image, $\mathbf{z}_2$ is the second sparse representation of the second image in the image database, and $\mathbf{B}$ is the synthesis operator.
[20] The present embodiments also provide a non-transitory computer readable storage medium having stored thereon instructions for performing any of the methods described above.
BRIEF DESCRIPTION OF THE DRAWINGS
[21] FIG. 1 illustrates an exemplary method for performing image search, according to an embodiment of the present principles.
[22] FIG. 2 illustrates an exemplary method for determining the operator for the similarity function.
[23] FIG. 3 shows exemplary pictures from an image training database, where similar images are grouped together, and images from different groups are dissimilar.
[24] FIG. 4 illustrates an exemplary analysis encoding process for generating a sparse code z for vector y, according to an embodiment of the present principles.
[25] FIG. 5A shows exemplary sparse codes for images in a database, where FIG. 5B is an expanded view of the upper-right portion of FIG. 5A.
[26] FIG. 6A shows a parametrized hinge loss function, and FIG. 6B shows a continuous hinge loss function.
[27] FIG. 7A illustrates an exemplary learning process for learning operator B using pair-wise constraints for a symmetric similarity metric, according to an embodiment of the present principles, and FIG. 7B illustrates an exemplary learning process for learning operator B using pair-wise constraints for an asymmetric similarity metric, according to an embodiment of the present principles.
[28] FIG. 8A illustrates an exemplary process for performing image matching, according to an embodiment of the present principles, and FIG. 8B illustrates an exemplary process for performing image ranking for a query image, according to an embodiment of the present principles.
[29] FIG. 9 illustrates a block diagram of an exemplary system in which multiple user devices are connected to an image search engine according to the present principles.
[30] FIG. 10 illustrates a block diagram of an exemplary system in which various aspects of the exemplary embodiments of the present principles may be implemented.
DETAILED DESCRIPTION
[31] The present principles are directed to image search and provide various features compared to the existing methods. First, the proposed approaches may rely on a correlation metric to compare different items instead of a distance metric as in the majority of earlier works. This enables a more flexible framework than those based on the distance metric while offering computational efficiency. Secondly, the proposed methods may use sparse representations in the proposed correlation metrics. This enables efficient storage of the data items in a database and improves the computation speed when used together with the proposed correlation metrics. Thirdly, the proposed methods can also be adapted for use with query items for which the sparse representation is not initially available so that correlation comparison can still be performed quickly while still providing the advantages mentioned above.
[32] In the present application, we denote scalars, vectors and matrices using, respectively, standard, bold, and uppercase-bold typeface (e.g., scalar $a$, vector $\mathbf{a}$ and matrix $\mathbf{A}$). We use $\mathbf{v}_k$ to denote a vector from a sequence $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_N$, and $v_k$ to denote the k-th coefficient of vector $\mathbf{v}$. We let $[\mathbf{a}_k]_k$ (respectively, $[a_k]_k$) denote concatenation of the vectors $\mathbf{a}_k$ (scalars $a_k$) to form a single column vector. The transpose of a vector $\mathbf{v}$ is denoted by $\mathbf{v}^T$ (similarly for a matrix). Finally, we use $\frac{\partial \mathbf{y}}{\partial \mathbf{x}}$ to denote the Jacobian matrix with (i, j)-th entry $\frac{\partial y_i}{\partial x_j}$.
[33] FIG. 1 illustrates an exemplary method 100 for performing image search, according to an embodiment of the present principles. At step 110, a query image is input and image search will be performed in an image database to return one or more matching images for the query image. At step 120, a feature vector is calculated for the query image.
[34] A feature vector of an image contains information describing an image's important characteristics. Image search algorithms usually rely on an image encoding function to compute the feature vector $\mathbf{y} \in \mathbb{R}^N$ from a given image. Common image feature construction approaches consist of first densely extracting local descriptors $\mathbf{x}_i \in \mathbb{R}^d$ such as SIFT (Scale-invariant feature transform) from multiple resolutions of the input image and then aggregating these descriptors into a single vector $\mathbf{y}$. Common aggregation techniques include methods based on k-means models of the local descriptor distribution, such as bag-of-words and VLAD (Vector of Locally Aggregated Descriptors) encoding, and Fisher encoding, which is based on a GMM (Gaussian Mixture Model) of the local descriptor distribution.
[35] At step 130, a compact representation is calculated for the feature vector. In many applications, a compact representation of a given data is a point of interest since these representations provide a better understanding of the underlying structures in the data. Compact representation can be any representation that represents original vectors by smaller data. Compact representation can be obtained by linear projection on a subspace resulting in smaller vectors than the original data size, or can be sparse representation, for example, obtained using a synthesis model and an analysis model as described below.
[36] Synthesis Model and Dictionary Learning
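[37] A model that represents a data vector $\mathbf{y}$ through a dictionary $\mathbf{D}$ as
$$\mathbf{y} = \mathbf{D}\mathbf{x} \qquad (1)$$
is called the synthesis model, and vector $\mathbf{x}$ is called the representation of vector $\mathbf{y}$ in dictionary $\mathbf{D}$. This representation is often more useful when $\mathbf{x}$ has only a few non-zero entries, i.e., when $\mathbf{x}$ is sparse. For a given dictionary $\mathbf{D}$, one can obtain a sparse representation of data vector $\mathbf{y}$ by using an encoder function so that $\mathbf{x} = E(\mathbf{y}, \mathbf{D})$. The encoder function $E()$ enforces sparsity on $\mathbf{x}$ while keeping the distance to the original data vector, $d(\mathbf{y}, \mathbf{D}\mathbf{x})$, sufficiently small. A common example of such an encoder function is the lasso regression defined as
$$E(\mathbf{y}, \mathbf{D}) = \arg\min_{\mathbf{x}} \|\mathbf{y} - \mathbf{D}\mathbf{x}\|_2^2 + \lambda \|\mathbf{x}\|_1 \qquad (2)$$
The regression parameter $\lambda$ in Eq. (2) defines the tradeoff between the sparsity and the distance.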
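As a non-limiting illustration, the lasso encoder of Eq. (2) can be approximated by iterative soft thresholding (ISTA). The sketch below is only one possible solver and is not taken from the described embodiments; the function names, the step-size rule and the iteration count are assumptions made for the example.

```python
def soft_threshold(v, t):
    """Shrink v toward zero by t (proximal operator of t * |v|)."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def lasso_encode(y, D, lam, n_iter=200):
    """Approximate argmin_x ||y - D x||_2^2 + lam * ||x||_1 with ISTA.

    D is a non-zero matrix given as a list of rows (N x M); y has length N.
    Illustrative sketch only; names and defaults are assumptions.
    """
    n, m = len(D), len(D[0])
    # 2 * ||D||_F^2 upper-bounds the Lipschitz constant of the gradient
    # of the quadratic term, so 1 / (2 * ||D||_F^2) is a safe step size.
    step = 1.0 / (2.0 * sum(d * d for row in D for d in row))
    x = [0.0] * m
    for _ in range(n_iter):
        # Residual r = D x - y.
        r = [sum(D[i][j] * x[j] for j in range(m)) - y[i] for i in range(n)]
        # Gradient of ||y - D x||_2^2 with respect to x is 2 D^T r.
        grad = [2.0 * sum(D[i][j] * r[i] for i in range(n)) for j in range(m)]
        # Gradient step followed by soft thresholding.
        x = [soft_threshold(x[j] - step * grad[j], step * lam) for j in range(m)]
    return x
```

With an identity dictionary the minimizer is known in closed form (each coefficient of y shrunk by lam / 2), which gives a simple sanity check for the sketch.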
[38] When many data points in a training set are available, $\mathbf{y}_i$, $i = 1, \ldots, T$, one can try to find a common dictionary that can sparsely represent all the points in the set. This procedure is called dictionary learning and is commonly used in machine learning and signal processing. In general, dictionary learning methods can be posed as an optimization problem:
$$\mathbf{x}_1, \ldots, \mathbf{x}_T, \mathbf{D} = \arg\min_{\mathbf{x}_1, \ldots, \mathbf{x}_T, \mathbf{D}} \delta(\mathbf{y}_1, \mathbf{D}\mathbf{x}_1, \ldots, \mathbf{y}_T, \mathbf{D}\mathbf{x}_T) + \phi(\mathbf{x}_1, \ldots, \mathbf{x}_T) + \psi(\mathbf{D}) \qquad (4)$$
in which function $\delta()$ enforces a small distance between each data vector $\mathbf{y}_i$ and its sparse representation $\mathbf{D}\mathbf{x}_i$, the regularization function $\phi()$ enforces sparse representations, and the regularization function $\psi()$ enforces certain structures within the dictionary $\mathbf{D}$ such as unit norm columns.
[39] Analysis Model and Analysis Operator Learning
[40] As an alternative to representing a data vector as a sum of a few essential components as in the synthesis model, one can inspect specific properties of the data vector, which is called the analysis model. A linear analysis operator, $\mathbf{A}$, can be applied to data point $\mathbf{y}$ as in
$$\mathbf{z} = \mathbf{A}\mathbf{y} \qquad (5)$$
so that the output vector $\mathbf{z}$ contains essential information on $\mathbf{y}$. The analysis operators can be very useful if the output, $\mathbf{z}$, is known to be sparse. However, unlike synthesis representations, given the vector $\mathbf{z}$ the original vector $\mathbf{y}$ is often not unique. Hence one can distinguish two ways of utilizing analysis operators and sparsity. The first one is finding a vector close to $\mathbf{y}_s$ that would have a sparse output vector (or sparse code) with $\mathbf{A}$, where $\mathbf{y}_s$ represents a vector for which $\mathbf{A}\mathbf{y}_s$ is sparse and $\mathbf{y}_s$ and $\mathbf{y}$ are as close as possible. A common optimization method for this purpose is
$$\mathbf{y} = \arg\min_{\mathbf{y}} \|\mathbf{y} - \mathbf{y}_s\|_2^2 + \lambda \|\mathbf{A}\mathbf{y}\|_1 \qquad (6)$$
which is similar to the lasso regression used in sparse coding. Similar to dictionary learning, analysis operators can also be learned for large datasets to be used in tasks such as denoising or deconvolution.
[41] The second possible approach is simply finding a sparse code $\mathbf{z}$ close to $\mathbf{A}\mathbf{y}_s$. This is more useful for applications in which $\mathbf{z}$ is more of interest than $\mathbf{y}$. Since $\mathbf{A}\mathbf{y}$ is already known, this operation is often much simpler than solving Eqs. (2) and (6).
[42] Referring back to FIG. 1, distance or similarity measures are calculated between the query image and database images at step 140. The measures can be calculated using the feature vectors of the query image and database images, or using compact representations of the query image and the database images, for example, using a Mahalanobis metric. Based on the distance or similarity measures, images are ranked at step 150. One or more matching images are then output at step 160.
[43] The aim of Mahalanobis metric learning is to learn a task-specific distance metric function
$$D_{\mathbf{M}}(\mathbf{y}_1, \mathbf{y}_2) = (\mathbf{y}_1 - \mathbf{y}_2)^T \mathbf{M}^T \mathbf{M} (\mathbf{y}_1 - \mathbf{y}_2) \qquad (7)$$
for comparing two feature vectors $\mathbf{y}_1$ and $\mathbf{y}_2$, where $\mathbf{M}$ (or $\mathbf{M}^T\mathbf{M}$) is the Mahalanobis metric transformation matrix. Such learned distances have been used extensively to adapt feature vectors to various image search related tasks. For example, the Mahalanobis metric can be used in a face verification problem where the feature vectors $\mathbf{y}_1$ and $\mathbf{y}_2$ are extracted from images of faces and the aim is to learn $\mathbf{M}$ so that a test on the learned distance $D_{\mathbf{M}}(\mathbf{y}_1, \mathbf{y}_2)$ specifies whether the face images are of the same person.
[44] The Mahalanobis metric can also be used in nearest-neighbor-based classification methods. In this case, a set of labeled image feature vectors $\{\mathbf{y}_i, \gamma_i \in \{1, \ldots, C\}\}_i$, each belonging to one of $C$ classes, is used as a classifier. Given a new test feature $\mathbf{y}$, the class label assigned to it is that of the nearest $\mathbf{y}_i$ under the Mahalanobis metric,
$$\gamma_i \ \ \text{s.t.} \ \ i = \arg\min_{j} D_{\mathbf{M}}(\mathbf{y}, \mathbf{y}_j) \qquad (9)$$
Extensions of this approach can also use a voting scheme based on the closest n features.
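As a non-limiting illustration, the distance of Eq. (7) and the nearest-neighbor rule of Eq. (9) can be sketched as follows. The function names and the list-of-rows storage of M are assumptions made for the example; the voting extension is not shown.

```python
def mahalanobis_distance(y1, y2, M):
    """D_M(y1, y2) = (y1 - y2)^T M^T M (y1 - y2) = ||M (y1 - y2)||^2."""
    diff = [a - b for a, b in zip(y1, y2)]
    projected = [sum(m * d for m, d in zip(row, diff)) for row in M]
    return sum(p * p for p in projected)


def nearest_neighbor_label(y, labeled_features, M):
    """Return the label of the stored feature nearest to y under D_M.

    labeled_features is a list of (feature_vector, label) pairs.
    """
    def distance_to(pair):
        return mahalanobis_distance(y, pair[0], M)

    _, label = min(labeled_features, key=distance_to)
    return label
```

In this sketch a matrix M that suppresses an uninformative coordinate changes which stored feature is nearest, which is the effect metric learning aims for.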
[45] Another task that can be addressed by Mahalanobis metric learning is that of image ranking. In this case, we are given a query feature vector $\mathbf{y}_q$ and we wish to sort a database of features $\{\mathbf{y}_i\}_i$ so that
$$D_{\mathbf{M}}(\mathbf{y}_q, \mathbf{y}_{i_j}) \le D_{\mathbf{M}}(\mathbf{y}_q, \mathbf{y}_{i_{j+1}}) \qquad (10)$$
In this case, matrix $\mathbf{M}$ is learned so that the resulting ranking operation corresponds to human visual perception.
[46] In the present application, we consider the problem of efficiently comparing and ranking items taken from a large set. An example of such a setup can be the comparison of images or a set of image features. In the following, we propose several optimization approaches that make use of both the sparse representations and some given constraints to determine a better estimate for the similarity function that is adapted to the tasks of verification and ranking.
[47] FIG. 2 illustrates an exemplary method 200 for determining the operator for the similarity function. A training set is input at step 210, which may be a database with annotations, for example, indicating whether pictures are similar or dissimilar. The database imposes constraints on the similarity function at step 220, for example, if two pictures are indicated as similar in the training database, the learned similarity function should provide a high similarity score between these two pictures. Given the constraints on similarity between pictures, the operator for the similarity function can be learned at step 230. In the following, the similarity constraints and various learning methods are described in further detail.
[48] Constraints
[49] Let $I$ represent one of many items that we would like to compare. Let it also be given that every data item $I$ is associated with a data vector $\mathbf{y} \in \mathbb{R}^N$. For the example of comparing images, the items represent the images to be compared whereas the data vectors are features extracted from each image for easier processing. We denote $S^*(I_1, I_2)$ as the similarity function that tracks the human perception for every item pair $(I_1, I_2)$, and we aim to design a similarity function that is close to $S^*(I_1, I_2)$. From the training images, a set of constraints over $S^*()$ and a training set $I_i$, $i = 1, \ldots, T$ (with corresponding data vectors $\mathbf{y}_i \in \mathcal{T}$) can be obtained. In one embodiment, we consider two types of constraints each related to a different task of interest.
[50] Pair-wise Constraints and Matching:
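[51] In a simpler case, each constraint is defined by a pair of data points and an indicator variable as
$$C_{\text{pair},p} = \{i_p, j_p, \gamma_p\}, \qquad \gamma_p = \begin{cases} 1, & S^*(I_{i_p}, I_{j_p}) \ge s_c \\ -1, & \text{otherwise} \end{cases} \qquad (12)$$
for a constant $s_c$ that tracks human perception, so that the variable $\gamma_p$ is 1 if two data points are sufficiently similar (or in the same cluster) and -1 if not. Such pair-wise constraints are relevant to a task such as matching. Without loss of generality, we define the task of matching as finding a function, $S(\mathbf{v}_1, \mathbf{v}_2)$, given the constraints $\{C_{\text{pair},p}\}_{p=1}^{m}$, such that for any given pair of query items and their corresponding data vectors, $(\mathbf{y}_{q_1}, \mathbf{y}_{q_2})$, the function $S()$ satisfies $S(\mathbf{y}_{q_1}, \mathbf{y}_{q_2}) > 0$ if $S^*(I_{q_1}, I_{q_2}) \ge s_c$ and $S(\mathbf{y}_{q_1}, \mathbf{y}_{q_2}) < 0$ otherwise.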
[52] FIG. 3 shows exemplary pictures from an image training database, where similar images are grouped together, and images from different groups are dissimilar. Particularly, pictures in the same row (310, 320, 330, or 340) are grouped together in FIG. 3. The pair-wise constraints between two images within a group are set to 1, and the pair-wise constraints between two images from different groups are set to -1.
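As a non-limiting illustration, pair-wise constraints of the kind described for FIG. 3 can be derived from group annotations as sketched below. The function name and the representation of the annotations as a list of group identifiers (one per training image) are assumptions made for the example.

```python
from itertools import combinations


def pairwise_constraints(group_ids):
    """Return (i, j, gamma) for every unordered pair of training images.

    gamma is 1 when images i and j carry the same group identifier
    and -1 when they come from different groups.
    """
    return [
        (i, j, 1 if group_ids[i] == group_ids[j] else -1)
        for i, j in combinations(range(len(group_ids)), 2)
    ]
```

For three images where the first two share a group, the sketch yields one constraint with indicator 1 and two with indicator -1.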
[53] The task of matching can be described as determining whether a given query data belongs to a cluster in a dataset. For example, in face recognition systems, the given facial picture of a person is compared to other facial data of the same person within the database to perform verification. It is also possible to perform matching between two given data points even though these points belong to a cluster different from the observed clusters in the database.
[54] Triplet Constraints and Ranking:
[55] A more informative set of constraints is defined by a triplet of data points as
$$\{C_{\text{triplet},p}\}_{p=1}^{m} = \{i_p, j_p, k_p \mid S^*(I_{i_p}, I_{j_p}) > S^*(I_{i_p}, I_{k_p})\} \qquad (13)$$
That is, in one triplet, item $I_{i_p}$ is more similar to item $I_{j_p}$ than to item $I_{k_p}$. The constraints over triplets provide more information on the similarity function and are useful for tasks such as ranking. The task of ranking can be defined as finding a function, $S(\mathbf{v}_1, \mathbf{v}_2)$, given the constraints $\{C_{\text{triplet},p}\}_{p=1}^{m}$, such that for any given triplet of items $(q_1, q_2, q_3)$ obeying $S^*(I_{q_1}, I_{q_2}) > S^*(I_{q_1}, I_{q_3})$, the function $S()$ satisfies the condition $S(\mathbf{y}_{q_1}, \mathbf{y}_{q_2}) > S(\mathbf{y}_{q_1}, \mathbf{y}_{q_3})$. Ranking enables sorting the database items based on the similarity to a query item and it is an essential part of applications such as data search and retrieval. An example of this application is image-based search in a large database of images based on specific similarity criteria.
[56] As discussed above, we can define an analysis operator $\mathbf{A} \in \mathbb{R}^{M \times N}$ such that sparse code $\mathbf{z}$ can be obtained from $\mathbf{A}\mathbf{y}$. FIG. 4 illustrates an exemplary analysis encoding process 400 ($\mathbf{z} = E(\mathbf{A}, \mathbf{y})$) for generating a sparse code $\mathbf{z}$ for vector $\mathbf{y}$, according to an embodiment of the present principles. At step 410, given vector $\mathbf{y}$ and analysis operator $\mathbf{A}$, the analysis encoder computes $\mathbf{A}\mathbf{y}$, for example, using linear projection. At step 420, the analysis encoder generates sparse code $\mathbf{z}$ using a non-linear sparsifying function. The non-linear sparsifying function can be, for example, but not limited to, hard thresholding, soft thresholding, or a function that sets some values to zero and modifies other values. The non-linear sparsifying function can also be a step function or a sigmoid function. The processed vector is then output as sparse code $\mathbf{z}$.
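As a non-limiting illustration, the two steps of encoding process 400 can be sketched as a linear projection followed by an element-wise thresholding function. The function names, the single scalar threshold and the list-of-rows storage of A are assumptions made for the example.

```python
def hard_threshold(v, t):
    """Keep v unchanged when its magnitude exceeds t, otherwise set it to zero."""
    return v if abs(v) > t else 0.0


def soft_threshold(v, t):
    """Set small values to zero and shrink the remaining values toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def analysis_encode(A, y, threshold, sparsify=soft_threshold):
    """Compute a sparse code z = E(A, y) for feature vector y."""
    # Step 410: linear projection A y.
    projected = [sum(a * b for a, b in zip(row, y)) for row in A]
    # Step 420: element-wise non-linear sparsifying function.
    return [sparsify(p, threshold) for p in projected]
```

Passing `sparsify=hard_threshold` switches the non-linearity without changing the projection step.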
[57] A related application, entitled "Method and Apparatus for Image Search Using Sparsifying Analysis Operators" (Attorney Docket No. PF140323), the teachings of which are specifically incorporated herein by reference, describes different approaches of computing analysis operator $\mathbf{A}$.
[58] FIG. 5A shows exemplary sparse codes for images in a database, where FIG. 5B is an expanded view of the upper-right portion of FIG. 5A. In the visualization of the sparse coefficients (vertical axis) of the first 2000 images (horizontal axis), a dark pixel (510) indicates "0" in the vector, and a gray pixel (520) indicates the magnitude of a non-zero coefficient in the vector. We can see that most pixels in FIG. 5 are dark (i.e., coefficients are sparse), and thus, our methods successfully learned sparse representations. Therefore a similarity function can be computed very fast given the sparse representation.
[59] Synthesis-Analysis Similarity
[60] The sparse code z based on analysis operator A can be used with a synthesis operator B to generate a similarity metric. Different from the dictionary D applied to the original vectors as in Eq. (1), the synthesis operator B here applies to sparse vectors.
[61] In the cases where the sparse codes $\mathbf{z}_i$ are readily available or the encoder function $E(\mathbf{A}, \mathbf{y})$ is easy to compute, we propose a symmetric similarity function of the form
$$S_{sm}(\mathbf{z}_i, \mathbf{z}_j) = \mathbf{z}_i^T \mathbf{B}^T \mathbf{B} \mathbf{z}_j \qquad (18)$$
where sparse vectors $\mathbf{z}_i$, $\mathbf{z}_j$ correspond to data vectors $\mathbf{y}_i$, $\mathbf{y}_j$, respectively, and matrix $\mathbf{B}$ is the synthesis operator to be learned for the given constraints and the task. The similarity function in Eq. (18) can be seen as estimating the similarity between $\mathbf{y}_i$ and $\mathbf{y}_j$ through the associated vectors $\mathbf{B}\mathbf{z}_i$ and $\mathbf{B}\mathbf{z}_j$, which are synthesized from the compact representations $\mathbf{z}_i$, $\mathbf{z}_j$ for the particular task.
[62] However, for some of the query items, the corresponding sparse codes may not be available. For such cases, we also propose an asymmetric similarity function of the form
$$S_{asm}(\mathbf{y}_i, \mathbf{z}_j) = \mathbf{y}_i^T \mathbf{B} \mathbf{z}_j \qquad (19)$$
When the sparse code of a query image is needed, it is computed online, while sparse codes of the database images can be computed offline beforehand without affecting the speed of the comparison. Thus, an asymmetric similarity function as described in Eq. (19), which does not require the sparse representation of the query image, can be very useful since skipping the computation of the sparse code can provide significant speed improvement.
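As a non-limiting illustration, the symmetric and asymmetric similarity functions can be evaluated so that only the non-zero entries of a sparse code are visited. In the sketch below, storing a sparse code as a dictionary from index to value and storing B column by column are assumptions made for the example, as are the function names.

```python
def synthesize(B_columns, z_sparse):
    """Compute B z using only the non-zero entries of z.

    B_columns[k] is the k-th column of B; z_sparse maps index -> value.
    """
    out = [0.0] * len(B_columns[0])
    for k, value in z_sparse.items():
        column = B_columns[k]
        for r in range(len(out)):
            out[r] += value * column[r]
    return out


def dot(u, v):
    """Inner product of two dense vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


def symmetric_similarity(z_i, z_j, B_columns):
    """S_sm(z_i, z_j) = z_i^T B^T B z_j = (B z_i) . (B z_j)."""
    return dot(synthesize(B_columns, z_i), synthesize(B_columns, z_j))


def asymmetric_similarity(y_i, z_j, B_columns):
    """S_asm(y_i, z_j) = y_i^T B z_j, with y_i a dense feature vector."""
    return dot(y_i, synthesize(B_columns, z_j))
```

The cost of `synthesize` grows with the number of non-zero coefficients rather than with the full length of the code, which is where the sparsity of the database codes pays off in this sketch.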
[63] In Eqs. (18) and (19), we use an analysis operator to get the sparse representation and then apply a synthesis operator when calculating the similarity; we call such a similarity metric a synthesis-analysis similarity.
[64] The similarity functions in Eqs. (18) and (19) are similar to what is proposed in an article by G. Chechik, V. Sharma, U. Shalit, and S. Bengio, entitled "Large scale online learning of image similarity through ranking," Journal of Machine Learning Research, JMLR, pages 1109-1135, 2010 (hereinafter "Chechik"). The difference, however, is that the sparse codes, $\mathbf{z}_i$, are acquired from the feature vectors by analysis similarity learning. Therefore, unlike existing similarity learning methods that compute similarities between feature vectors, using feature vectors or sparse vectors $\mathbf{z}_i$ obtained by dictionary learning, the approach we propose in Eqs. (18) and (19) computes similarities between the sparse representations of feature vectors which are computed to enhance the performance for the task. Furthermore, the proposed approach does not constrain the sparse codes, for example, by
$$\|\mathbf{y} - \mathbf{D}\mathbf{x}\|_2 < \epsilon \qquad (20)$$
as when dictionary learning is used, which results in more flexibility to improve performance for the task.
[65] Learning Synthesis Operator B
[66] In order to learn $\mathbf{B}$, we can use the pair-wise constraints or triplet constraints as described above. For the pair-wise constraints defined in Eq. (12) given for the verification task, matrix $\mathbf{B}$ can be learned by minimizing the objective function
$$\mathbf{B} = \arg\min_{\mathbf{B}} \sum_{p=1}^{m} \ell\left(\gamma_p S_{i_p j_p}(\mathbf{B})\right) \qquad (21)$$
in which function $S_{ij}(\mathbf{B})$ is set as either $S_{sm}(\mathbf{z}_i, \mathbf{z}_j)$ or $S_{asm}(\mathbf{y}_i, \mathbf{z}_j)$ as defined in Eq. (18) or (19), respectively. The function $\ell()$ in Eq. (21) is a function that penalizes the incorrectly estimated similarities in the training set, i.e., when $\gamma_p S_{i_p j_p}$ is negative.
[67] Two possible examples for the penalty function are the hinge loss functions as shown in FIG. 6A and FIG. 6B. In particular, FIG. 6A shows a parametrized hinge loss function, $\ell(x) = \max(-(x - \alpha), 0)$, and FIG. 6B shows a continuous hinge loss function, $\ell(x) = \log(\exp(-a x) + 1)/a$.
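As a non-limiting illustration, the two penalty functions can be sketched as follows. The sketch assumes the conventional hinge form max(-(x - alpha), 0), which is zero for sufficiently positive inputs and grows linearly otherwise; the parameter names `alpha` and `a` and their default values are assumptions made for the example.

```python
import math


def hinge_loss(x, alpha=0.0):
    """Parametrized hinge: zero when x >= alpha, linear in (alpha - x) below."""
    return max(-(x - alpha), 0.0)


def continuous_hinge_loss(x, a=1.0):
    """Smooth hinge log(exp(-a * x) + 1) / a, evaluated without overflow."""
    t = -a * x
    # log(1 + exp(t)) = max(t, 0) + log(1 + exp(-|t|)) for any real t.
    return (max(t, 0.0) + math.log1p(math.exp(-abs(t)))) / a
```

Larger values of `a` bring the continuous version closer to the hinge with zero offset while keeping it differentiable everywhere.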
[68] FIG. 7A illustrates an exemplary learning process 700A for learning operator $\mathbf{B}$ using pair-wise constraints for a symmetric similarity metric, according to an embodiment of the present principles. The input of the learning process includes many training vector pairs, $\{\mathbf{y}_{1i}, \mathbf{y}_{2i}\}_i$, which are marked as similar or dissimilar ($\gamma_i = 1$ or $-1$). The set of annotations, $\{\gamma_i\}_i$, is also input to the learning process. For a pair of vectors, $\mathbf{y}_{1i}$ and $\mathbf{y}_{2i}$, analysis encoder (710) can generate sparse codes $\mathbf{z}_{1i}$ and $\mathbf{z}_{2i}$, respectively. Subsequently, a similarity function (720) can be used to calculate the similarity between $\mathbf{y}_{1i}$ and $\mathbf{y}_{2i}$, for example, as $S_{sm}(\mathbf{y}_{1i}, \mathbf{y}_{2i}) = \mathbf{z}_{1i}^T \mathbf{B}^T \mathbf{B} \mathbf{z}_{2i}$. Using the estimated similarity metric $S_{sm}(\mathbf{y}_{1i}, \mathbf{y}_{2i})$ and annotated similarity result $\gamma_i$, a penalty function (730), for example, as described in Eq. (21), can be applied. Generally the penalty function sets a large value when the estimated similarity metric does not match the annotated result. The penalty function is accumulated over the training vector pairs, and the synthesis operator that minimizes the penalty function, i.e., the synthesis operator that provides the closest similarity metric to the annotation results, is chosen as the solution $\mathbf{B}_{sm}$.
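As a non-limiting illustration, if the accumulated penalty of process 700A is minimized with a gradient-based solver (the choice of solver is an assumption; no particular solver is prescribed above), the derivative of the symmetric similarity with respect to the entries of the synthesis operator can be worked out as follows. The gradient is arranged here in the same shape as the operator, and $\ell'$ denotes a derivative or subgradient of the penalty function.

```latex
\begin{aligned}
S_{sm}(\mathbf{y}_{1i}, \mathbf{y}_{2i})
  &= \mathbf{z}_{1i}^T \mathbf{B}^T \mathbf{B}\, \mathbf{z}_{2i}
   = \operatorname{tr}\!\left(\mathbf{B}\, \mathbf{z}_{2i} \mathbf{z}_{1i}^T \mathbf{B}^T\right), \\
\nabla_{\mathbf{B}}\, S_{sm}(\mathbf{y}_{1i}, \mathbf{y}_{2i})
  &= \mathbf{B}\left(\mathbf{z}_{1i} \mathbf{z}_{2i}^T + \mathbf{z}_{2i} \mathbf{z}_{1i}^T\right), \\
\nabla_{\mathbf{B}}\, \ell\!\left(\gamma_i S_{sm}(\mathbf{y}_{1i}, \mathbf{y}_{2i})\right)
  &= \ell'\!\left(\gamma_i S_{sm}(\mathbf{y}_{1i}, \mathbf{y}_{2i})\right) \gamma_i\,
     \mathbf{B}\left(\mathbf{z}_{1i} \mathbf{z}_{2i}^T + \mathbf{z}_{2i} \mathbf{z}_{1i}^T\right).
\end{aligned}
```

Because this gradient vanishes when the operator is the zero matrix, such a solver would need a non-zero initialization.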
[69] For the ranking task and the triplet constraints as defined in Eq. (13), an optimum synthesis operator $\mathbf{B}$ that best fits these sets of constraints can be learned with an objective function of the form
$$\mathbf{B} = \arg\min_{\mathbf{B}} \sum_{p=1}^{m} \ell\left(S_{i_p j_p}(\mathbf{B}) - S_{i_p k_p}(\mathbf{B})\right) \qquad (22)$$
in which the functions $S_{i_p j_p}(\mathbf{B})$ and $\ell()$ are defined the same as in Eq. (21).
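As a non-limiting illustration, the triplet objective can be evaluated for a candidate similarity function as sketched below. Deriving the triplets from group annotations (a same-group image being treated as more similar than a different-group image), the conventional hinge form max(-(x - alpha), 0), and all function names are assumptions made for the example.

```python
def triplet_constraints(group_ids):
    """Return (i, j, k) where image i shares a group with j but not with k."""
    n = len(group_ids)
    return [
        (i, j, k)
        for i in range(n)
        for j in range(n)
        for k in range(n)
        if i != j
        and group_ids[i] == group_ids[j]
        and group_ids[i] != group_ids[k]
    ]


def hinge(x, alpha=0.0):
    """Hinge penalty: positive only when x falls below alpha."""
    return max(-(x - alpha), 0.0)


def triplet_objective(triplets, similarity, loss=hinge):
    """Accumulate loss(S(i, j) - S(i, k)) over all triplets.

    similarity(i, j) returns the estimated similarity of items i and j.
    """
    return sum(loss(similarity(i, j) - similarity(i, k)) for i, j, k in triplets)
```

A learning procedure would compare candidate operators by this accumulated value and keep the one with the smallest total.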
[70] FIG. 7B illustrates an exemplary asymmetric learning process 700B for learning operator $\mathbf{B}$ using pair-wise constraints for an asymmetric similarity metric, according to an embodiment of the present principles. Similar to method 700A, the input of the learning process includes many training vector pairs, $\{\mathbf{y}_{1i}, \mathbf{y}_{2i}\}_i$, and the annotation set $\{\gamma_i\}_i$. For a vector $\mathbf{y}_{2i}$, analysis encoder (750) can generate sparse code $\mathbf{z}_{2i}$. Subsequently, a similarity function (760) can be used to calculate the similarity between $\mathbf{y}_{1i}$ and $\mathbf{y}_{2i}$, for example, as $S_{asm}(\mathbf{y}_{1i}, \mathbf{y}_{2i}) = \mathbf{y}_{1i}^T \mathbf{B} \mathbf{z}_{2i}$. Using the estimated similarity metric $S_{asm}(\mathbf{y}_{1i}, \mathbf{y}_{2i})$ and annotated similarity result $\gamma_i$, a penalty function (770), for example, as described in Eq. (21), can be applied. The operator $\mathbf{B}_{asm}$ that minimizes the accumulated penalty function is output as the synthesis operator.
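As a non-limiting illustration, the asymmetric learning process can be sketched with plain subgradient descent on a hinge penalty of the conventional form max(-(x - alpha), 0). The choice of solver, the zero initialization, the step size, the number of epochs and the function names are all assumptions made for the example; the sparse codes are assumed to have been produced beforehand by the analysis encoder.

```python
def asym_similarity(y, B, z):
    """S_asm = y^T B z for dense lists y and z and a list-of-rows matrix B."""
    return sum(y[r] * B[r][c] * z[c] for r in range(len(y)) for c in range(len(z)))


def learn_asymmetric_operator(pairs, step=0.1, n_epochs=50, alpha=1.0):
    """Minimize sum_p max(-(gamma_p * y_p^T B z_p - alpha), 0) over B.

    pairs is a list of (y, z, gamma) with gamma equal to 1 or -1.
    """
    n_rows, n_cols = len(pairs[0][0]), len(pairs[0][1])
    B = [[0.0] * n_cols for _ in range(n_rows)]
    for _ in range(n_epochs):
        for y, z, gamma in pairs:
            if gamma * asym_similarity(y, B, z) < alpha:
                # The penalty is active. Since d(y^T B z)/dB = y z^T,
                # a descent step adds step * gamma * y z^T to B.
                for r in range(n_rows):
                    for c in range(n_cols):
                        B[r][c] += step * gamma * y[r] * z[c]
    return B
```

After training on a similar pair and a dissimilar pair, the learned operator in this sketch assigns a positive score to the former and a negative score to the latter.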
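[71] The learning process when triplet constraints are used for training is similar to process 700A or 700B. However, the input now includes training vector triplets $\{\mathbf{y}_{1i}, \mathbf{y}_{2i}, \mathbf{y}_{3i}\}_i$, where $\mathbf{y}_{1i}$ and $\mathbf{y}_{2i}$ are more similar than $\mathbf{y}_{1i}$ and $\mathbf{y}_{3i}$ are. For the symmetric learning process, the analysis encoder generates sparse codes $\mathbf{z}_{1i}, \mathbf{z}_{2i}, \mathbf{z}_{3i}$ for each training vector triplet $\mathbf{y}_{1i}, \mathbf{y}_{2i}, \mathbf{y}_{3i}$, respectively. The similarity function is applied to $(\mathbf{z}_{1i}, \mathbf{z}_{2i})$ and to $(\mathbf{z}_{1i}, \mathbf{z}_{3i})$ to get $S_{sm}(\mathbf{y}_{1i}, \mathbf{y}_{2i})$ and $S_{sm}(\mathbf{y}_{1i}, \mathbf{y}_{3i})$, respectively. The penalty function takes $S_{sm}(\mathbf{y}_{1i}, \mathbf{y}_{2i})$ and $S_{sm}(\mathbf{y}_{1i}, \mathbf{y}_{3i})$ as input, and penalizes when $S_{sm}(\mathbf{y}_{1i}, \mathbf{y}_{2i})$ indicates less similarity than $S_{sm}(\mathbf{y}_{1i}, \mathbf{y}_{3i})$.
[72] For the asymmetric learning process, the analysis encoder generates sparse codes $\mathbf{z}_{2i}$ and $\mathbf{z}_{3i}$ for training vectors $\mathbf{y}_{2i}$ and $\mathbf{y}_{3i}$, respectively. The similarity function is applied to $(\mathbf{y}_{1i}, \mathbf{z}_{2i})$ and to $(\mathbf{y}_{1i}, \mathbf{z}_{3i})$ to get $S_{asm}(\mathbf{y}_{1i}, \mathbf{y}_{2i})$ and $S_{asm}(\mathbf{y}_{1i}, \mathbf{y}_{3i})$, respectively. The penalty function takes $S_{asm}(\mathbf{y}_{1i}, \mathbf{y}_{2i})$ and $S_{asm}(\mathbf{y}_{1i}, \mathbf{y}_{3i})$ as input, and penalizes when $S_{asm}(\mathbf{y}_{1i}, \mathbf{y}_{2i})$ indicates less similarity than $S_{asm}(\mathbf{y}_{1i}, \mathbf{y}_{3i})$.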
[73] FIG. 8A illustrates an exemplary process 800A for performing image matching, according to an embodiment of the present principles. Two input images are represented by feature vectors $y_1$ and $y_2$, respectively (810). Analysis encoder (820) is used to sparsify vectors $y_1$ and $y_2$ to generate sparse codes $z_1$ and $z_2$, respectively. Then a similarity metric (830) can be calculated based on the sparse codes, for example, using the similarity function as in Eq. (18) with the symmetric operator $B_{sm}$. Based on whether the similarity metric exceeds a threshold, i.e., indicates a high similarity or not, the image matching process decides whether the two input images match. In another embodiment, when an asymmetric operator $B_{asm}$ is used to compute similarity (830), sparse code $z_2$ is generated (820) for feature vector $y_2$, then the similarity metric is calculated (830) using $y_1$ and $z_2$ based on the asymmetric operator $B_{asm}$, for example, as described in Eq. (19).
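The matching decision of process 800A might be sketched as follows. The analysis encoder is assumed here to be a soft thresholding of A y, the symmetric and asymmetric scores are taken as $z_1^T B^T B z_2$ and $y_1^T B z_2$, and the threshold value and all names are hypothetical.

```python
import numpy as np

def encode(A, y, lam):
    # Assumed analysis encoder: soft thresholding of A y with threshold lam.
    a = A @ y
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def is_match_sym(A, B, y1, y2, lam, threshold):
    # Symmetric case: both images are sparsified, then compared via z_1^T B^T B z_2.
    z1, z2 = encode(A, y1, lam), encode(A, y2, lam)
    return float((B @ z1) @ (B @ z2)) > threshold

def is_match_asym(A, B, y1, y2, lam, threshold):
    # Asymmetric case: only the second image is sparsified; the score is y_1^T B z_2.
    z2 = encode(A, y2, lam)
    return float(y1 @ B @ z2) > threshold

# Toy operators and feature vectors (hypothetical values).
A = np.eye(2)
B = np.eye(2)
y_a = np.array([1.0, 0.0])
y_b = np.array([0.9, 0.05])   # close to y_a
y_c = np.array([0.0, 1.0])    # unrelated to y_a
same_sym = is_match_sym(A, B, y_a, y_b, lam=0.1, threshold=0.5)
diff_sym = is_match_sym(A, B, y_a, y_c, lam=0.1, threshold=0.5)
same_asym = is_match_asym(A, B, y_a, y_b, lam=0.1, threshold=0.5)
diff_asym = is_match_asym(A, B, y_a, y_c, lam=0.1, threshold=0.5)
```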
[74] FIG. 8B illustrates an exemplary process 800B for performing image ranking for a query image, according to an embodiment of the present principles. For the query image and the database images, feature vectors $y_q$ and $y_1, \ldots, y_n$ are generated (850), respectively. Analysis encoder (860) is used to sparsify vectors $y_q$ and $y_1, \ldots, y_n$ to generate sparse codes $z_q$ and $z_1, \ldots, z_n$, respectively. Then a similarity metric (870) can be calculated based on the sparse codes for each pair $\{z_q, z_i\}$, $i = 1, \ldots, n$, for example, using the symmetric similarity function as in Eq. (18). Based on the values of the similarity metrics, the order of similarity to the query image among the database images can be decided, and the images are ranked (880). In another embodiment, when an asymmetric operator $B_{asm}$ is used to compute similarity (870), sparse codes $z_1, \ldots, z_n$ are generated (860) for feature vectors $y_1, \ldots, y_n$, respectively, then the similarity metric (870) is calculated for each pair $\{y_q, z_i\}$, $i = 1, \ldots, n$, using the asymmetric operator $B_{asm}$, for example, as described in Eq. (19).
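The ranking flow of process 800B can be sketched in a few lines. As before, the encoder is assumed to be a soft thresholding of A y, and the function names and toy values are hypothetical; the sketch scores every database item against the query and sorts in decreasing order of similarity.

```python
import numpy as np

def soft_threshold(a, lam):
    # Assumed sparsifying function applied entry-wise.
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def rank_database(A, B, y_query, Y_db, lam, asymmetric=False):
    # Y_db holds one database feature vector per row.
    Z_db = soft_threshold(Y_db @ A.T, lam)        # sparse codes z_1..z_n
    if asymmetric:
        scores = Z_db @ B.T @ y_query             # y_q^T B z_i for every i
    else:
        z_q = soft_threshold(A @ y_query, lam)
        scores = (Z_db @ B.T) @ (B @ z_q)         # z_q^T B^T B z_i for every i
    order = np.argsort(-scores, kind="stable")    # most similar first
    return order, scores

# Toy database of three images (hypothetical values).
A = np.eye(2)
B = np.eye(2)
y_q = np.array([1.0, 0.0])
Y_db = np.array([[0.0, 1.0],
                 [1.0, 0.0],
                 [0.6, 0.6]])
order_sym, scores_sym = rank_database(A, B, y_q, Y_db, lam=0.1)
order_asym, scores_asym = rank_database(A, B, y_q, Y_db, lam=0.1, asymmetric=True)
```

In this toy case both metrics place the second database image first, the third next, and the first last.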
[75] A post-processing step can be added to the encoder function that appends an extra entry equal to 1 to vector $z$, which would further improve the flexibility of the proposed encoding algorithms. Similarly, a pre-processing step can be added to the encoder function that appends an extra entry equal to 1 to the vector $y$, which would further improve the flexibility of the proposed matching and ranking algorithms.
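One way to read the added flexibility is that the appended constant lets the operator carry a bias term. The short sketch below, using the asymmetric form $y^T B z$ and hypothetical toy values, shows that appending 1 to $z$ and adding one column $b$ to $B$ gives the same score as $y^T B z + y^T b$.

```python
import numpy as np

def append_one(v):
    # Append a constant entry equal to 1 to a vector.
    return np.append(v, 1.0)

# Toy example (hypothetical values) for the asymmetric similarity y^T B z.
y = np.array([1.0, 2.0])
z = np.array([0.5, 0.0, 1.5])
B = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 0.0]])
b = np.array([0.3, -0.1])            # extra column that multiplies the appended 1

B_aug = np.hstack([B, b[:, None]])   # operator acting on the augmented code
score_aug = float(y @ B_aug @ append_one(z))
score_plus_bias = float(y @ B @ z + y @ b)
```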
[76] It should be noted that for image databases, the process of computing the sparse codes $z_1, \ldots, z_n$ can be performed offline. Thus, for feature vectors $y_1, \ldots, y_n$ representing images from the databases, the corresponding encoding functions can be pre-computed offline and the sparse codes can be stored.
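The offline/online split could look like the sketch below. Storing each code as its nonzero indices and values is an illustrative choice made here, not something the present principles prescribe; the encoder is again assumed to be a soft thresholding of A y, and the asymmetric score $y_q^T B z_i$ is used at query time.

```python
import numpy as np

def soft_threshold(a, lam):
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def build_index(A, Y_db, lam):
    # Offline step: encode every database vector once and keep only the
    # nonzero entries of each sparse code as (indices, values).
    index = []
    for y in Y_db:
        z = soft_threshold(A @ y, lam)
        nz = np.flatnonzero(z)
        index.append((nz, z[nz]))
    return index

def query_scores(B, y_query, index):
    # Online step: project the query once, then each score only touches
    # the stored nonzero entries of the corresponding code.
    proj = y_query @ B
    return np.array([float(proj[nz] @ vals) for nz, vals in index])

# Toy database (hypothetical values).
A = np.eye(3)
B = np.eye(3)
Y_db = np.array([[2.0, 0.1, 0.0],
                 [0.0, 3.0, 0.2]])
index = build_index(A, Y_db, lam=0.5)          # done once, offline
scores = query_scores(B, np.array([1.0, 1.0, 0.0]), index)
```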
[77] Joint Learning of Analysis and Synthesis Operators
[78] Instead of learning analysis operator A and synthesis operator B sequentially as described above, it is also possible to learn these operators jointly on the training set. Furthermore it is also possible to learn other parameters jointly.
[79] In one embodiment of such a system, given two feature vectors $y_i, y_j \in \mathbb{R}^N$ representing two images respectively, we present a similarity metric of the form

$$S(y_i, y_j) = z_i^T B z_j \qquad (23)$$

in which the sparse codes $z_i$, $z_j$ are obtained using the soft thresholding function acting on each entry of a vector with the set of thresholds $\lambda$ as

$$z_i = \mathrm{soft}_{\lambda}(A y_i), \quad z_j = \mathrm{soft}_{\lambda}(A y_j). \qquad (24)$$
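Eqs. (23) and (24) can be illustrated numerically as below. The sketch uses the standard entry-wise soft thresholding, sign(a)·max(|a| - λ, 0), with one threshold per code entry; this specific form, the function names, and the toy values are assumptions made for illustration.

```python
import numpy as np

def soft(a, lam):
    # Entry-wise soft thresholding: shrink each entry toward zero by its
    # threshold and zero out entries whose magnitude is below it.
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def similarity(A, B, lam, y_i, y_j):
    # Eq. (24): sparse codes from the analysis operator A and thresholds lam.
    z_i = soft(A @ y_i, lam)
    z_j = soft(A @ y_j, lam)
    # Eq. (23): bilinear similarity of the two sparse codes.
    return float(z_i @ B @ z_j)

# Toy operators (hypothetical values): 2-dimensional features, 3-dimensional codes.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, -1.0]])
B = np.eye(3)
lam = np.array([0.5, 0.5, 1.0])   # one threshold per code entry
z = soft(A @ np.array([2.0, 0.2]), lam)
s = similarity(A, B, lam, np.array([2.0, 0.2]), np.array([1.0, 0.0]))
```

Here A y = (2, 0.2, 1.8) is mapped to the code (1.5, 0, 0.8): the middle entry falls below its threshold and is set to zero.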
[80] The matrices $A$, $B$ and the parameter vector $\lambda$ are all learned from a training dataset of pairs $(z_{i_p}, z_{j_p})$, $p = 1, \ldots, m$, selected from a training set of size $T$, with each pair labeled as similar or dissimilar ($\gamma_p = 1$ or $-1$), by minimizing an objective function of the form

$$A, B, \lambda = \arg\min_{A, B, \lambda} \sum_{p=1}^{m} \ell\left(\gamma_p z_{i_p}^T B z_{j_p}\right) + \psi(A) + \phi(B) \quad \text{s.t.} \quad z_t = \mathrm{soft}_{\lambda}(A y_t), \; t = 1, \ldots, T \qquad (25)$$

where the penalty function $\ell(\cdot)$ can be selected as the continuous hinge function as shown in FIG. 5B. The functions $\psi(\cdot)$ and $\phi(\cdot)$ are regularization functions for the matrices $A$ and $B$. Examples of regularization functions for the operators include functions enforcing normalized rows, a sparse structure, or a diagonal structure on the operators. Even though the objective function in Eq. (25) is non-linear and non-convex, it can still be minimized using off-the-shelf optimization methods such as stochastic gradient descent.
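To make the structure of Eq. (25) concrete, the sketch below only evaluates the objective for given $A$, $B$ and $\lambda$; an optimizer such as stochastic gradient descent would then adjust all three. The hinge form max(0, 1 - x), the row-norm regularizer, and every name in the code are illustrative assumptions.

```python
import numpy as np

def soft(a, lam):
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def hinge(x):
    # Assumed penalty shape: max(0, 1 - x).
    return np.maximum(0.0, 1.0 - x)

def joint_objective(A, B, lam, Y, pairs, gamma,
                    psi=lambda A: 0.0, phi=lambda B: 0.0):
    # Constraint of Eq. (25): every training vector is encoded with the
    # current A and lam, so the codes change whenever A or lam changes.
    Z = soft(Y @ A.T, lam)
    data_term = 0.0
    for (i, j), g in zip(pairs, gamma):
        data_term += float(hinge(g * (Z[i] @ B @ Z[j])))
    return data_term + psi(A) + phi(B)

def row_norm_penalty(A):
    # Hypothetical regularizer that favors unit-norm rows.
    return float(np.sum((np.linalg.norm(A, axis=1) - 1.0) ** 2))

# Toy training set (hypothetical values): T = 3 vectors, m = 2 labeled pairs.
Y = np.array([[1.0, 0.0],
              [0.9, 0.1],
              [0.0, 1.0]])
pairs = [(0, 1), (0, 2)]
gamma = [1.0, -1.0]
A = np.eye(2)
obj = joint_objective(A, 2.0 * np.eye(2), 0.1, Y, pairs, gamma, psi=row_norm_penalty)
obj_zero = joint_objective(A, np.zeros((2, 2)), 0.1, Y, pairs, gamma)
```

Because the codes are recomputed from $A$ and $\lambda$ inside the objective, a change to either parameter affects every data term, which is what makes the joint problem non-convex.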
[81] Even though the use of a correlation-based similarity function with sparse vectors as in Eq. (23) is proposed in earlier works by Chechik, making use of analysis operators to obtain the sparse codes as in Eq. (24) to be used in this similarity function is new and has multiple advantages. Firstly, as compared to computing a sparse representation in a dictionary, the operation as described in Eq. (24) is computationally much simpler and faster. Secondly, the use of an analysis operator also enables an asymmetric system with a similarity function in the form of

$$S(y_i, y_j) = y_i^T (A^T B z_j) \qquad (26)$$

such that the comparison is even faster for a new item.
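Eq. (26) can be read as replacing the query code $z_i$ in Eq. (23) by the unthresholded product $A y_i$, so that the vector $A^T B z_j$ can be stored per database item and a new item costs a single dot product. The sketch below checks this reading numerically on random toy data; the dimensions, the random seed, and the variable names are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 4, 6                         # feature and code dimensions (toy sizes)
A = rng.standard_normal((K, N))     # analysis operator
B = rng.standard_normal((K, K))     # synthesis operator
y_i = rng.standard_normal(N)        # new (query) item, not sparsified
z_j = rng.standard_normal(K)        # stored code of a database item
z_j[np.abs(z_j) < 0.5] = 0.0        # zero out small entries for illustration

w_j = A.T @ (B @ z_j)               # offline: one N-dimensional vector per database item
score_fast = float(y_i @ w_j)       # online: a single dot product, as in Eq. (26)
score_ref = float((A @ y_i) @ B @ z_j)   # same value written as (A y_i)^T B z_j
```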
[82] FIG. 9 illustrates an exemplary system 900 that has multiple user devices connected to an image search engine according to the present principles. In FIG. 9, one or more user devices (910, 920, and 930) can communicate with image search engine 960 through network 940. The image search engine is connected to multiple users, and each user may communicate with the image search engine through multiple user devices. The user interface devices may be remote controls, smart phones, personal digital assistants, display devices, computers, tablets, computer terminals, digital video recorders, or any other wired or wireless devices that can provide a user interface.
[83] The image search engine 960 may implement various methods as discussed above. Image database 950 contains one or more databases that can be used as a data source for searching images that match a query image or for training the parameters.
[84] In one embodiment, a user device may request, through network 940, a search to be performed by image search engine 960 based on a query image. Upon receiving the request, the image search engine 960 returns one or more matching images and/or their rankings. After the search result is generated, the image database 950 provides the matched image(s) to the requesting user device or another user device (for example, a display device).
[85] When a user device sends a search request to the image search engine, the user device may send the query image directly to the image search engine. Alternatively, the user device may process the query image and send a signal representative of the query image. For example, the user device may perform feature extraction on the query image and send the feature vector to the search engine. The user device may also further apply the sparsifying function and send the sparse representation of the query image to the image search engine. These various embodiments distribute the computations needed for image search between the user device and the image search engine in different manners. The embodiment to use may be decided by the user device's computational resources, the network capacity, and the image search engine's computational resources.
[86] The image search may also be implemented in a user device itself. For example, a user may decide to use a family photo as a query image, and to search other photos in his smartphone with the same family members.
[87] FIG. 10 illustrates a block diagram of an exemplary system 1000 in which various aspects of the exemplary embodiments of the present principles may be implemented. System 1000 may be embodied as a device including the various components described below and is configured to perform the processes described above. Examples of such devices include, but are not limited to, personal computers, laptop computers, smartphones, tablet computers, digital multimedia set top boxes, digital television receivers, personal video recording systems, connected home appliances, and servers. System 1000 may be communicatively coupled to other similar systems, and to a display via a communication channel as shown in FIG. 10 and as known by those skilled in the art to implement the exemplary video system described above.
[88] The system 1000 may include at least one processor 1010 configured to execute instructions loaded therein for implementing the various processes as discussed above. Processor 1010 may include embedded memory, input output interface and various other circuitries as known in the art. The system 1000 may also include at least one memory 1020 (e.g., a volatile memory device, a non-volatile memory device). System 1000 may additionally include a storage device 1040, which may include non-volatile memory, including, but not limited to, EEPROM, ROM, PROM, RAM, DRAM, SRAM, flash, magnetic disk drive, and/or optical disk drive. The storage device 1040 may comprise an internal storage device, an attached storage device and/or a network accessible storage device, as non-limiting examples. System 1000 may also include an image search engine 1030 configured to process data to provide image matching and ranking results.
[89] Image search engine 1030 represents the module(s) that may be included in a device to perform the image search functions. Image search engine 1030 may be implemented as a separate element of system 1000 or may be incorporated within processors 1010 as a combination of hardware and software as known to those skilled in the art.
[90] Program code to be loaded onto processors 1010 to perform the various processes described hereinabove may be stored in storage device 1040 and subsequently loaded onto memory 1020 for execution by processors 1010. In accordance with the exemplary embodiments of the present principles, one or more of the processor(s) 1010, memory 1020, storage device 1040 and image search engine 1030 may store one or more of the various items during the performance of the processes discussed herein above, including, but not limited to a query image, the analysis operator, synthesis operator, sparse codes, equations, formula, matrices, variables, operations, and operational logic.
[91] The system 1000 may also include communication interface 1050 that enables communication with other devices via communication channel 1060. The communication interface 1050 may include, but is not limited to a transceiver configured to transmit and receive data from communication channel 1060. The communication interface may include, but is not limited to, a modem or network card and the communication channel may be implemented within a wired and/or wireless medium. The various components of system 1000 may be connected or communicatively coupled together using various suitable connections, including, but not limited to internal buses, wires, and printed circuit boards.
[92] The exemplary embodiments according to the present principles may be carried out by computer software implemented by the processor 1010 or by hardware, or by a combination of hardware and software. As a non-limiting example, the exemplary embodiments according to the present principles may be implemented by one or more integrated circuits. The memory 1020 may be of any type appropriate to the technical environment and may be implemented using any appropriate data storage technology, such as optical memory devices, magnetic memory devices, semiconductor-based memory devices, fixed memory and removable memory, as non-limiting examples. The processor 1010 may be of any type appropriate to the technical environment, and may encompass one or more of microprocessors, general purpose computers, special purpose computers and processors based on a multi-core architecture, as non-limiting examples.
[93] The implementations described herein may be implemented in, for example, a method or a process, an apparatus, a software program, a data stream, or a signal. Even if only discussed in the context of a single form of implementation (for example, discussed only as a method), the implementation of features discussed may also be implemented in other forms (for example, an apparatus or program). An apparatus may be implemented in, for example, appropriate hardware, software, and firmware. The methods may be implemented in, for example, an apparatus such as, for example, a processor, which refers to processing devices in general, including, for example, a computer, a microprocessor, an integrated circuit, or a programmable logic device. Processors also include communication devices, such as, for example, computers, cell phones, portable/personal digital assistants ("PDAs"), and other devices that facilitate communication of information between end-users.
[94] Reference to "one embodiment" or "an embodiment" or "one implementation" or "an implementation" of the present principles, as well as other variations thereof, means that a particular feature, structure, characteristic, and so forth described in connection with the embodiment is included in at least one embodiment of the present principles. Thus, the appearances of the phrase "in one embodiment" or "in an embodiment" or "in one implementation" or "in an implementation", as well as any other variations, appearing in various places throughout the specification are not necessarily all referring to the same embodiment.
[95] Additionally, this application or its claims may refer to "determining" various pieces of information. Determining the information may include one or more of, for example, estimating the information, calculating the information, predicting the information, or retrieving the information from memory.
[96] Further, this application or its claims may refer to "accessing" various pieces of information. Accessing the information may include one or more of, for example, receiving the information, retrieving the information (for example, from memory), storing the information, processing the information, transmitting the information, moving the information, copying the information, erasing the information, calculating the information, determining the information, predicting the information, or estimating the information.
[97] Additionally, this application or its claims may refer to "receiving" various pieces of information. Receiving is, as with "accessing", intended to be a broad term. Receiving the information may include one or more of, for example, accessing the information, or retrieving the information (for example, from memory). Further, "receiving" is typically involved, in one way or another, during operations such as, for example, storing the information, processing the information, transmitting the information, moving the information, copying the information, erasing the information, calculating the information, determining the information, predicting the information, or estimating the information.
[98] As will be evident to one of skill in the art, implementations may produce a variety of signals formatted to carry information that may be, for example, stored or transmitted. The information may include, for example, instructions for performing a method, or data produced by one of the described implementations. For example, a signal may be formatted to carry the bitstream of a described embodiment. Such a signal may be formatted, for example, as an electromagnetic wave (for example, using a radio frequency portion of spectrum) or as a baseband signal. The formatting may include, for example, encoding a data stream and modulating a carrier with the encoded data stream. The information that the signal carries may be, for example, analog or digital information. The signal may be transmitted over a variety of different wired or wireless links, as is known. The signal may be stored on a processor-readable medium.
Claims
1. A method for performing image search, comprising:
accessing (120, 130) at least one of a first feature vector corresponding to a query image and a first sparse representation of the query image, the first sparse representation being based on an analysis operator and the first feature vector;
determining (140) a similarity metric between the query image and a second image of an image database using a synthesis operator, responsive to a second sparse representation and one of the first feature vector and the first sparse representation, the second sparse representation being based on the second feature vector corresponding to the second image and the analysis operator; and
generating (160) an image search output based on the similarity metric between the query image and the second image.
2. The method according to claim 1, wherein at least one of the first feature vector and the first sparse representation is received from a user device via a communication network, further comprising transmitting the image search output to the user device via the communication network.
3. The method according to any of claims 1-2, wherein the synthesis operator is determined based on a set of pair-wise constraints, wherein each pair-wise constraint indicates whether a corresponding pair of training images are similar or dissimilar.
4. The method according to any of claims 1-3, wherein the synthesis operator is determined based on a set of triplet constraints, wherein each triplet constraint indicates that a first training image of the triplet is more similar to a second training image of the triplet than to a third training image of the triplet.
5. The method according to claim 3 or 4, wherein the synthesis operator is trained such that a similarity metric determined for training images corresponding to a pair-wise constraint or a triplet constraint is consistent with what the pair-wise constraint or the triplet constraint indicates.
6. The method according to any of claims 1-5, wherein the similarity metric is determined as $S_{sm}(z_q, z_2) = z_q^T B^T B z_2$, wherein $S_{sm}()$ is the similarity metric, $z_q$ is the first sparse representation of the query image, $z_2$ is the second sparse representation of the second image in the image database, and $B$ is the synthesis operator.
7. The method according to any of claims 1-6, wherein the similarity metric is determined as $S_{asm}(y_q, z_2) = y_q^T B z_2$, wherein $S_{asm}()$ is the similarity metric, $y_q$ is the first feature vector of the query image, $z_2$ is the second sparse representation of the second image in the image database, and $B$ is the synthesis operator.
8. An apparatus for performing image search, comprising:
an input configured to access at least one of a first feature vector corresponding to a query image and a first sparse representation of the query image, the first sparse representation being based on an analysis operator and the first feature vector; and
one or more processors (1130) configured to:
determine a similarity metric between the query image and a second image of an image database using a synthesis operator, responsive to a second sparse representation and one of the first feature vector and the first sparse representation, the second sparse representation being based on the second feature vector corresponding to the second image and the analysis operator, and
generate an image search output based on the similarity metric between the query image and the second image.
9. The apparatus according to claim 8, further comprising:
a communication interface configured to receive the at least one of the first feature vector and the first sparse representation from a user device via a communication network, and to transmit the image search output to the user device via the communication network.
10. The apparatus according to any of claims 8-9, wherein the synthesis operator is determined based on a set of pair-wise constraints, wherein each pair-wise constraint indicates whether a corresponding pair of training images are similar or dissimilar.
11. The apparatus according to any of claims 8-10, wherein the synthesis operator is determined based on a set of triplet constraints, wherein each triplet constraint indicates that a first training image of the triplet is more similar to a second training image of the triplet than to a third training image of the triplet.
12. The apparatus according to claim 10 or 11, wherein the synthesis operator is trained such that a similarity metric determined for training images corresponding to a pair-wise constraint or a triplet constraint is consistent with what the pair-wise constraint or the triplet constraint indicates.
13. The apparatus according to any of claims 8-11, wherein the similarity metric is determined as $S_{sm}(z_q, z_2) = z_q^T B^T B z_2$, wherein $S_{sm}()$ is the similarity metric, $z_q$ is the first sparse representation of the query image, $z_2$ is the second sparse representation of the second image in the image database, and $B$ is the synthesis operator.
14. The apparatus according to any of claims 8-13, wherein the similarity metric is determined as $S_{asm}(y_q, z_2) = y_q^T B z_2$, wherein $S_{asm}()$ is the similarity metric, $y_q$ is the first feature vector of the query image, $z_2$ is the second sparse representation of the second image in the image database, and $B$ is the synthesis operator.
15. A non-transitory computer readable storage medium having stored thereon instructions for implementing a method according to any of claims 1-7.
Applications Claiming Priority (6)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP15305346.7 | 2015-03-06 | ||
EP15305346 | 2015-03-06 | ||
EP15306494 | 2015-09-25 | ||
EP15306494.4 | 2015-09-25 | ||
EP15306770.7A EP3166021A1 (en) | 2015-11-06 | 2015-11-06 | Method and apparatus for image search using sparsifying analysis and synthesis operators |
EP15306770.7 | 2015-11-06 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2016142293A1 true WO2016142293A1 (en) | 2016-09-15 |
Family
ID=55521683
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/EP2016/054664 WO2016142293A1 (en) | 2015-03-06 | 2016-03-04 | Method and apparatus for image search using sparsifying analysis and synthesis operators |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2016142293A1 (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8429168B1 (en) * | 2009-12-15 | 2013-04-23 | Google Inc. | Learning semantic image similarity |
US8515212B1 (en) * | 2009-07-17 | 2013-08-20 | Google Inc. | Image relevance model |
US20130290222A1 (en) * | 2012-04-27 | 2013-10-31 | Xerox Corporation | Retrieval system and method leveraging category-level labels |
Non-Patent Citations (3)
Title |
---|
G. CHECHIK; V. SHARMA; U. SHALIT; S. BENGIO: "Large scale online learning of image similarity through ranking", JOURNAL OF MACHINE LEARNING RESEARCH, JMLR, 2010, pages 1109 - 1135 |
LI-WEI KANG ET AL: "Feature-Based Sparse Representation for Image Similarity Assessment", IEEE TRANSACTIONS ON MULTIMEDIA, IEEE SERVICE CENTER, PISCATAWAY, NJ, US, vol. 13, no. 5, 1 October 2011 (2011-10-01), pages 1019 - 1030, XP011386926, ISSN: 1520-9210, DOI: 10.1109/TMM.2011.2159197 * |
PABLO SPRECHMANN ET AL: "Efficient Supervised Sparse Analysis and Synthesis Operators", 1 January 2013 (2013-01-01), XP055270405, Retrieved from the Internet <URL:http://papers.nips.cc/paper/5002-supervised-sparse-analysis-and-synthesis-operators.pdf> [retrieved on 20160503] * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111506754A (en) * | 2020-04-13 | 2020-08-07 | 广州视源电子科技股份有限公司 | Picture retrieval method and device, storage medium and processor |
CN111506754B (en) * | 2020-04-13 | 2023-10-24 | 广州视源电子科技股份有限公司 | Picture retrieval method, device, storage medium and processor |
CN112860936A (en) * | 2021-02-19 | 2021-05-28 | 清华大学 | Visual pedestrian re-identification method based on sparse graph similarity migration |
CN112860936B (en) * | 2021-02-19 | 2022-11-29 | 清华大学 | Visual pedestrian re-identification method based on sparse graph similarity migration |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10140549B2 (en) | Scalable image matching | |
US20200193552A1 (en) | Sparse learning for computer vision | |
US9852363B1 (en) | Generating labeled images | |
US10354199B2 (en) | Transductive adaptation of classifiers without source data | |
US9607014B2 (en) | Image tagging | |
WO2016142285A1 (en) | Method and apparatus for image search using sparsifying analysis operators | |
US20160140425A1 (en) | Method and apparatus for image classification with joint feature adaptation and classifier learning | |
US9256617B2 (en) | Apparatus and method for performing visual search | |
US20180341805A1 (en) | Method and Apparatus for Generating Codebooks for Efficient Search | |
JP2017062781A (en) | Similarity-based detection of prominent objects using deep cnn pooling layers as features | |
CN111062871A (en) | Image processing method and device, computer equipment and readable storage medium | |
US20130064444A1 (en) | Document classification using multiple views | |
WO2012100819A1 (en) | Method and system for comparing images | |
US10643063B2 (en) | Feature matching with a subspace spanned by multiple representative feature vectors | |
EP2712453B1 (en) | Image topological coding for visual search | |
CN113434716B (en) | Cross-modal information retrieval method and device | |
CN111914908B (en) | Image recognition model training method, image recognition method and related equipment | |
GB2547760A (en) | Method of image processing | |
US20160307068A1 (en) | Method of clustering digital images, corresponding system, apparatus and computer program product | |
CN104951791A (en) | Data classification method and apparatus | |
US10163000B2 (en) | Method and apparatus for determining type of movement of object in video | |
EP3166022A1 (en) | Method and apparatus for image search using sparsifying analysis operators | |
EP3166021A1 (en) | Method and apparatus for image search using sparsifying analysis and synthesis operators | |
US20230410465A1 (en) | Real time salient object detection in images and videos | |
US20170309004A1 (en) | Image recognition using descriptor pruning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 16708983 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 16708983 Country of ref document: EP Kind code of ref document: A1 |