WO2013129580A1 - Approximate nearest neighbor search device, approximate nearest neighbor search method, and program - Google Patents

Approximate nearest neighbor search device, approximate nearest neighbor search method, and program

Info

Publication number
WO2013129580A1
WO2013129580A1 (PCT/JP2013/055440)
Authority
WO
WIPO (PCT)
Prior art keywords
query
point
distance
search
hash
Application number
PCT/JP2013/055440
Other languages
French (fr)
Japanese (ja)
Inventor
Masakazu Iwamura (雅一 岩村)
Tomokazu Sato (智一 佐藤)
Koichi Kise (浩一 黄瀬)
Original Assignee
Osaka Prefecture University (公立大学法人大阪府立大学)
Application filed by Osaka Prefecture University
Publication of WO2013129580A1 publication Critical patent/WO2013129580A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content

Definitions

  • The present invention relates to an approximate nearest neighbor search apparatus, an approximate nearest neighbor search method, and a program therefor, and more specifically to an approximate nearest neighbor search apparatus, method, and program that use fast distance estimation by hashing with a codebook.
  • Nearest neighbor search is a method of finding, among data (points) expressed as vectors, the data most similar to a query, that is, the point at the shortest distance from the query (the nearest point).
  • Nearest neighbor search is a basic and effective technique for processing large-scale data, and because of its generality it has a wide range of applications. Examples in which the inventors have been involved include object recognition (see, for example, Non-Patent Document 1), document image search (Non-Patent Document 2), character recognition (Non-Patent Document 3), and face recognition (Non-Patent Document 4). All of these have been shown to operate very fast on relatively large amounts of data.
  • The above techniques can be said to be image recognition techniques in a broad sense, but the application field of nearest neighbor search is not limited to them; it extends to statistical classification, coding theory, data compression, recommendation systems, spell checkers, and the like. For practical applications, nearest neighbor search must be fast. Greater speed makes it possible to process a larger amount of data in the same time, and enables uses that were previously abandoned because of processing time constraints.
  • Approximate nearest neighbor search is a technique that significantly reduces processing time by introducing approximation into nearest neighbor search and allowing search errors. Since search errors occur, there is no guarantee that the true nearest neighbor will be found, but the technique is used for applications for which a certain degree of search accuracy suffices. Improving search accuracy and increasing speed are conflicting requirements, and it is generally not easy to satisfy both at the same time.
  • Non-Patent Documents referred to above include: Kise et al., "Memory-based recognition of camera-captured characters," Proceedings of the 9th IAPR International Workshop on Document Analysis Systems (DAS2010), pp. 89-96, June 2010; Keisuke Maekawa, Yuzuko Utsumi, Masakazu Iwamura, and Koichi Kise, "Realization of Matching in 1 Million Face Image Database in 34 ms: Large-scale Fast Face Image Search Using Approximate Nearest Neighbor Search," IEICE Technical Report, vol. 111, no. 353, pp. 95-100, Dec. 2011; and S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu, "An optimal algorithm for approximate nearest neighbor searching in fixed dimensions," Journal of the ACM, vol. 45, no. 6, pp. 891-923, 1998.
  • In approximate nearest neighbor search, indexing is performed by dividing the data space into a number of regions in advance.
  • the present invention particularly relates to a technique using a hash.
  • In this technique, the data space is basically divided linearly along a plurality of axes, and the data are registered in a hash table for each divided area (bin). A high-speed search is then realized by extracting from the hash table the points that fall in the query's area or its neighboring areas, and finding the nearest neighbor among the extracted points.
  • The product area of bins (a bucket) is the unit of division.
  • The approximate nearest neighbor search is composed of two steps.
  • In the first step, a bin or bucket having a high probability of including the nearest point is specified.
  • A point belonging to such a bin or bucket is called a nearest neighbor candidate.
  • In the second step, the distance between the query and each nearest neighbor candidate is calculated, and the point having the smallest distance is determined.
  • The "approximate" element is included only in the first step. This step therefore governs both the search accuracy and the processing time of the approximate nearest neighbor search.
  • The processing time required for the distance calculation is greatly reduced by limiting the distance calculation to the nearest neighbor candidates.
  • This is, however, a double-edged sword: reducing the number of nearest neighbor candidates increases speed but lowers search accuracy, because the probability rises that the true nearest point leaks out of the candidate set and the nearest neighbor search fails. What is required of the first-step processing is therefore that the true nearest point not be leaked even while the number of nearest neighbor candidates is reduced. Moreover, the candidates must be selected at high speed.
  • The policy taken by the inventors to satisfy these requirements is to use, when selecting the nearest neighbor candidates, the same distance measure used in the second step. In this way, even when the number of nearest neighbor candidates is reduced, the probability that the true nearest point is among them is hard to decrease. Such processing, however, usually incurs a large calculation cost. (A minimal sketch of the two-step structure follows.)
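  • As a concrete illustration of the two-step structure, here is a minimal sketch in Python. The grid-based candidate selection and all names are illustrative stand-ins for exposition, not the method of the invention:

```python
import numpy as np

def build_table(points, W):
    """Register each point in a hash table keyed by its grid cell (bin product area)."""
    table = {}
    for idx, p in enumerate(points):
        key = tuple(np.floor(p / W).astype(int))   # bin index along each axis
        table.setdefault(key, []).append(idx)
    return table

def ann_search(query, points, table, W):
    # First step: select nearest neighbor candidates from the query's cell and its neighbors.
    q_key = np.floor(query / W).astype(int)
    candidates = []
    for offset in np.ndindex(*(3,) * len(q_key)):   # the 3^d surrounding cells
        candidates.extend(table.get(tuple(q_key + np.array(offset) - 1), []))
    if not candidates:
        return None   # no candidate found; a real method would widen the search
    # Second step: exact distance calculation restricted to the candidates.
    d = np.linalg.norm(points[candidates] - query, axis=1)
    return candidates[int(np.argmin(d))]
```

  • Shrinking the neighborhood visited in the first step speeds up the search but raises the chance that the true nearest point is missed, which is exactly the trade-off described above.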
  • an approximate nearest neighbor search method capable of executing the above-described processing at high speed is desired.
  • The present invention has been made in view of the above circumstances, and provides a new approximate nearest neighbor search method that achieves both high search accuracy and high speed by appropriately narrowing down the nearest neighbor candidates.
  • From a first viewpoint, the present invention provides an approximate nearest neighbor search apparatus comprising: a database storage unit in which, when a plurality of points represented by vector data are input, a hash index for each point is calculated by applying a hash function that computes the index of a multidimensional hash table, and each point is registered in the multidimensional hash table by projecting it to the area corresponding to its index within a multidimensional space divided into a plurality of areas by the bins of the table;
  • a search range determination unit that, when a query is input, applies the hash function to the query to determine the position of the query in the space, determines an estimated value of the distance between the query and each area in the space, and determines one or more areas as search areas based on the estimated value;
  • and a nearest neighbor point determination unit that calculates the distance between the query and each point in the search areas and determines the point closest to the query as the nearest neighbor point of the query.
  • The search range determination unit obtains the position of the representative point representing each area by referring to the index of the area to which each point belongs; the invention thus provides an approximate nearest neighbor search apparatus characterized as above.
  • From a second viewpoint, the present invention provides an approximate nearest neighbor search method comprising: accessing a database storage unit in which, when a plurality of points represented by vector data are input, a hash index for each point is calculated by applying a hash function that computes the index of a multidimensional hash table, and each point is stored in the multidimensional hash table by projecting it to the area corresponding to its hash index within a multidimensional space divided into a plurality of areas by the bins of the table;
  • and, when a query is input, a search range determination step of applying the hash function to the query to determine the position of the query in the space, determining an estimated value of the distance between the query and each region in the space, and determining at least one region to be searched based on the estimated value.
  • The search range determination step refers to the index of each area to obtain the position of a representative point representing that area, and takes the difference between the position of the query and the position of each representative point as the estimated value.
  • From a third viewpoint, the present invention provides an approximate nearest neighbor search program that causes a computer to execute: processing to access a database storage unit in which, when a plurality of points represented by vector data are input, a hash index for each point is calculated by applying a hash function that computes the index of a multidimensional hash table, and each point is stored in the multidimensional hash table by projecting it to the area corresponding to its hash index within a multidimensional space divided into a plurality of regions by the bins of the table;
  • processing as a search range determination unit that applies the hash function to the query to determine the position of the query in the space, determines an estimated value of the distance between the query and each region in the space, and determines at least one region to be searched based on the estimated value;
  • and processing as a nearest neighbor point determination unit that calculates the distance between the query and each point in the search regions and determines the point closest to the query as the nearest neighbor point of the query.
  • A representative point of each region is obtained by referring to the index of the region, the estimated value is determined based on the distance between the query and each representative point, and the regions to be searched are determined by applying a branch and bound method to exclude regions that cannot be search regions.
  • A corresponding program product is also provided. The inventions from the above three viewpoints are all related to the description of "first improvement: hash-based point-to-bucket distance estimation" in the embodiments described later.
  • In the present invention, the search range determination unit refers to the index of each region to obtain a representative point of that region, and determines the estimated value based on the distance between the query and each representative point. Therefore, without calculating the distance between the query and every point in each region, the estimated value is determined using the index, the regions to be searched (search regions) are determined based on the estimated value, and the points whose distances the nearest neighbor point determination unit must calculate can thus be narrowed down.
  • the method of the present invention that estimates the point-to-bucket distance can improve the accuracy of distance estimation with the same data structure.
  • the branch and bound method is applied to exclude regions that cannot be search regions, so that the search regions can be determined in a short time.
  • individual data registered in the database and data of a query (search question) used for searching the database are expressed as at least one point.
  • Each point has an attribute indicating the characteristics of the data or query, and the attribute is expressed by vector data.
  • Hashing is a well-known method for retrieving data at high speed on a memory.
  • A hash function receives vector data and outputs a scalar value.
  • The scalar value is a discrete value used to refer to a hash table, a kind of data table, and is called a hash value, a hash index, or simply an index. The hash function can be said to divide the output space by the discrete values that the index can take.
  • A hash function can also be said to project the input vector data (points) into a multidimensional output space.
  • each vector data is projected into a multidimensional space.
  • For example, ρ hash functions (ρ is an integer of 2 or more) are used to project the vector data onto a ρ-dimensional vector space.
  • By applying the ρ hash functions, each point is projected into a space that is spanned by ρ bases and divided into a plurality of bins along each basis, any one bin being specified by the indexes of the hash functions. Since the registration of points in bins is represented using hash tables, there are ρ hash tables (one per dimension). However, several dimensions may be combined into a single hash function, in which case there are fewer than ρ hash tables. (A small registration sketch follows.)
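  • A small sketch of this registration scheme, with one Python dict standing in for each of the ρ hash tables (all names and parameter values are illustrative):

```python
import numpy as np

rho, W, dim = 4, 0.5, 16                # hash dimensions, bin width, data dimension (assumed)
rng = np.random.default_rng(0)
bases = rng.standard_normal((rho, dim)) # one basis vector per hash function

def h(i, x):
    """i-th hash function: project x onto basis i and quantize into a bin index."""
    return int(np.floor(bases[i] @ x / W))

tables = [dict() for _ in range(rho)]   # one hash table per dimension

def register(point_id, x):
    for i in range(rho):
        tables[i].setdefault(h(i, x), []).append(point_id)
```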
  • The approximate nearest neighbor search method according to the present invention can be applied to object recognition, document image search, character recognition, face recognition, statistical classification, coding theory, data compression, recommendation systems, spell checkers, and the like.
  • A second graph comparing, against the conventional method, the relationship between accuracy at the optimal parameters and processing time.
  • An explanatory diagram showing an example of the division of the data space.
  • An explanatory diagram showing how the search radius is changed according to the density around the query.
  • A graph showing the accuracy of distance estimation by BDH and k-means BDH of the present invention in comparison with SH of the conventional method, with the code length on the horizontal axis and the correlation coefficient on the vertical axis.
  • A graph showing, for BDH, the average rank of the nearest point in the distance on the hash when the code length is changed, compared with SH of the conventional method.
  • A graph showing the relationship between the search accuracy of k-means BDH of the present invention and processing time.
  • A graph showing the relationship between the parameter c for the number of nearest neighbor candidates and processing time.
  • A diagram showing two algorithms that identify the buckets within the search radius R at high speed.
  • In one embodiment, using M (M is a natural number of 2 or more) hash function groups, each point is registered in each of M multidimensional hash tables, one per hash function group.
  • Each hash function group may be a combination of a plurality of hash functions. If, for example, ρ hash functions (ρ > M) were gathered into a single hash function group without division, then with s bins per hash table, data storage areas would have to be prepared for every combination of bins, so the order of the hash table size is O(s^ρ).
  • When the hash functions are instead divided into M groups, each hash table has size of order O(s^(ρ/M)). Here O() denotes the order of the required amount of computation or storage: a quantity of order O(s^ρ) is bounded, once s is determined, by a·s^ρ + b·s^(ρ−1) + ... + l·s² + m·s + n, where a, b, ..., l, m, and n are constants. This aspect is related to the description of "second improvement: hash table division" in an embodiment described later.
  • the M sets of hash tables may be defined so that the number of dimensions of each hash is substantially equal.
  • The bases at this time may be selected so that the sums of the variances assigned to each table are substantially equal.
  • That is, the sum of the variances of the data in the respective base directions is computed, and the assignment of bases to each of the divided hash tables is determined so that these sums are as equal as possible.
  • The distance in the direction of a base with large variance tends to be large, and the distance in the direction of a base with small variance tends to be small. Equalizing the sums of the variances of the hash tables therefore has the effect of aligning the distances calculated from each hash table.
  • If the search radius R used for selecting distance calculation targets or for censoring the distance calculation (the processing that replaces the distance with a constant value when no distance calculation target exists in the search range), as in the embodiments of this specification, is shared among the tables, then by equalizing the sums of the variances each of the M divided hash tables contributes comparably to the determination of the nearest neighbor candidates. No parameter needs to be set in place of R, and the calculation can be performed easily. This aspect is related to the description of "second improvement: hash table division" in an embodiment described later.
  • the multi-dimensional hash table is a combination of hash tables corresponding to each dimension.
  • Each hash table divides a base corresponding to each dimension into bins, and each bin has a uniform width.
  • In another embodiment, the position error between the points registered in each bin and the representative point representing that bin may be obtained for each bin, and the width of each bin adjusted so that the sum of these errors becomes smaller.
  • The processing time of the distance calculation against the points in the search area varies depending on the number of data registered in the buckets of the search area; this variation can be suppressed.
  • This aspect relates to the description of “third improvement: division of data space suitable for distance estimation” in an embodiment described later.
  • Alternatively, for each hash table, the points may be clustered into a predetermined number n of clusters, a representative point calculated for each cluster, and the width of each bin determined so that the variance of the average distance from the points belonging to each cluster to its representative point is smaller than a predetermined threshold. By determining the widths of the bins so that the variance of the points within a bin is at or below the threshold, the number of points registered in each bin can be leveled within an appropriate range.
  • This aspect relates to the description of “third improvement: division of data space suitable for distance estimation” in an embodiment described later.
  • the search range determination unit may determine a probability density function from the distribution of vector data in each base direction, and determine the estimated value using the probability density function for weighting the distance. In this way, appropriate leveling is possible even for complex distributions by using the probability density function.
  • This aspect relates to the description of “fourth improvement: distance estimation based on probability density function” in an embodiment described later.
  • the search range determination unit may set a region having a representative point within a range of a search radius R determined in advance centering on a query as the search region. In this way, the search area can be determined by setting the search radius R in advance.
  • This aspect is related to the description of “fifth improvement: expansion of search radius considering data density around query” in an embodiment described later.
  • The search range determination unit may take as the search region the regions having representative points within the range of the search radius R around the query, and gradually increase the radius R until the number of points included in the search region reaches a predetermined number.
  • In this way, the number of points included in the search region, that is, the number of nearest neighbor candidates, is controlled, so the processing time required for calculating the distance to each point in the search region can be made substantially constant. This aspect is related to the description of "fifth improvement: expansion of the search radius considering the data density around the query" in an embodiment described later.
  • In another embodiment, the database storage unit projects each point into a ρ-dimensional space whose bases are determined by principal component analysis,
  • and the search range determination unit calculates, as the estimated value, the distance components between the query and each representative point in each base direction.
  • In the course of calculating the distance components, under the constraint that the sum of the distance components be within the determined search radius R, the search areas may be determined by pruning areas that cannot be search areas, applying a branch and bound method that decides whether the representative point of each area lies within the radius R, taking the bases in descending order of eigenvalue in the principal component analysis.
  • In this way, the search areas can be determined in a short time.
  • This aspect is related to the description of "first improvement: hash-based point-to-bucket distance estimation" in an embodiment described later.
  • In another embodiment, the database storage unit may be generated by applying ρ (ρ is an integer of 2 or more) hash functions to each point to calculate ρ indexes, and projecting each point into a ρ-dimensional space that is spanned by ρ bases and divided into bins, each bin being specified by the indexes, the registration of each point in each bin being represented using hash tables. This aspect is related to the description of "first improvement: hash-based point-to-bucket distance estimation" in an embodiment described later.
  • In another embodiment, the database storage unit projects each point into a ρ-dimensional space, and the multidimensional hash table selects M sets of subspaces, each spanned by P orthogonal bases chosen from the ρ orthogonal bases spanning the ρ-dimensional space.
  • Each subspace is divided into regions using the k-means method, a subspace with larger variance being given a larger number of divisions.
  • The estimated value may be determined so that the estimation error of the squared distance between the points existing in each region and the query is minimized in each base direction, according to a probability density function obtained from the distribution of the vector data in that direction.
  • This aspect relates to the description of the "sixth improvement" in the embodiment described later.
  • Preferred embodiments of the present invention include combinations of any of the plurality of embodiments shown here.
  • FIG. 20 is a block diagram showing an image recognition apparatus as an application example of the approximate nearest neighbor search apparatus of the present invention.
  • the approximate nearest neighbor search according to the present invention is realized by the computer processing image data using hardware resources such as a storage device on the image recognition apparatus shown in FIG.
  • The hardware of the image recognition apparatus includes, for example, a CPU, a storage device such as a hard disk storing a program describing the processing procedure executed by the CPU, a RAM providing a work area to the CPU, and an input/output circuit for data input and output. More specifically, the image recognition apparatus may be realized as a personal computer having the above configuration, or configured as a microcomputer with that configuration integrated into a device.
  • the feature point extraction unit 11 is a block that extracts a feature vector from a pattern of an object included in input image data using a known method.
  • a feature vector represents a local feature of an image, and a plurality of feature vectors each representing a plurality of local features are extracted from one image.
  • When registering image data in the image database 15, the CPU attaches an image ID identifying the image, stores the image in the image database 15, and registers the feature vectors extracted from the image in the hash table 15h in association with the image ID.
  • Specifically, the hash function is applied to classify each feature vector into one of a plurality of bins and register it there. That is, an index is calculated by applying the hash function to each feature vector, and an image database 15 in which each feature vector is registered in the bin or bucket corresponding to the index is created.
  • the hash function used at the time of registration is the same as the hash function used by the search range determination unit 13 for calculating the index.
  • The image recognition apparatus selects, from among the images stored in the image database 15, the image closest to the query image, and outputs it as the recognition result.
  • the search for the closest image is realized by comparing feature vectors and searching for the nearest vector for each feature vector of the query. An approximate nearest neighbor search is applied to this nearest neighbor vector search.
  • the CPU as the feature point extraction unit 11 extracts a feature vector from the query.
  • the extracted feature vector is called a query vector
  • the feature vector registered in the hash table 15h is called a reference vector.
  • The CPU obtains an index by applying the hash function described above to each query vector. The CPU then refers to the bin of the hash table 15h specified by the obtained index, and takes the reference vectors registered in that bin as the nearest neighbor candidates for the query vector.
  • the process of applying a hash function to a query vector to refer to a bin and using the reference vector registered in the bin as the nearest neighbor candidate corresponds to the first stage process of the approximate nearest neighbor search.
  • The hash function is determined in advance, with balance in mind, so that the number of reference vectors registered in each bin is small while each bin still contains, with high probability, the reference vector nearest to the queries that fall in it.
  • the CPU sets the image ID associated with the reference vector as a recognition result candidate.
  • the CPU determines the nearest neighbor vector among the reference vectors as the nearest neighbor point determination unit 17. Specifically, the CPU performs distance calculation between the query vector and the reference vector registered in the reference destination bin. Then, the reference vector closest to the query vector is determined as the nearest neighbor vector.
  • the image ID associated with the reference vector is set as a recognition result candidate.
  • the nearest neighbor determination unit 17 corresponds to a configuration that embodies the second step of the approximate nearest neighbor search.
  • the CPU obtains the nearest reference vector for each of a plurality of query vectors extracted from the query, and uses the image ID associated with the reference vector as a recognition result candidate.
  • As the voting unit 19, the CPU votes for the image IDs that are recognition result candidates for each query vector.
  • A voting table 21 is provided that stores the number of votes for each image ID (image 1 to image n) during voting. (A sketch of this voting step follows.)
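  • A sketch of the voting step; nearest_reference is assumed to embody the approximate nearest neighbor search described above, and ref_to_image_id the association stored at registration time (both are hypothetical helpers):

```python
from collections import Counter

def recognize(query_vectors, nearest_reference, ref_to_image_id):
    votes = Counter()                      # plays the role of the voting table 21
    for qv in query_vectors:
        ref = nearest_reference(qv)        # nearest reference vector for this query vector
        if ref is not None:
            votes[ref_to_image_id[ref]] += 1
    image_id, _ = votes.most_common(1)[0]  # the image with the most votes (assumes >= 1 vote)
    return image_id
```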
  • The search range determination unit 13, which applies the hash function to a query vector,
  • the hash table 15h, which systematizes and stores the feature vectors,
  • and the nearest neighbor point determination unit 17, which determines the nearest neighbor vector among the reference vectors registered in the same bin,
  • form the configuration related to approximate nearest neighbor search and can be said to be the elements constituting a nearest neighbor search device. Since the image recognition apparatus in FIG. 20 outputs an image as the recognition result, the nearest neighbor point determination unit 17 outputs, for example, the image ID associated with the nearest neighbor vector.
  • As an approximate nearest neighbor search device, the components determine and output, for an input feature vector, the nearest feature vector among the feature vectors registered in the hash table.
  • A component that stores the image ID associated with each feature vector and a component that outputs the image ID associated with the nearest neighbor vector are not included in the search device itself.
  • the image recognition apparatus in FIG. 20 is an application example of nearest neighbor search, and data handled by the nearest neighbor search apparatus is not limited to feature vectors.
  • ANN (Approximate Nearest Neighbor)
  • ANN is based on binary trees. In constructing the tree, the data space is bisected repeatedly until each leaf contains one point. When a query is given, the tree is traversed and the distance to the data registered in the reached leaf is calculated. If that distance is r, the search area is the set of divided regions whose closest part falls within radius r/(1+ε) of the query.
  • FLANN (Fast Library for Approximate Nearest Neighbors) is a library that provides an approximate nearest neighbor search method suited to a given database, together with its parameter tuning. Besides an exhaustive search, this library adopts the randomized kd-tree (see, for example, Non-Patent Document 12) and hierarchical k-means (see, for example, Non-Patent Document 13), which perform well among the proposed methods.
  • The randomized kd-tree is a technique that selects nearest neighbor candidates from multiple trees.
  • A tree structure is formed by bisecting the space while changing, in turn, the element (dimension) of the data attended to.
  • For high-dimensional data, tree traversal takes time and many useless distance calculations are performed. Therefore, in the randomized kd-tree, principal component analysis is performed and the kd-tree is constructed attending only to the top D-dimensional bases that contribute to the distance calculation.
  • In hierarchical k-means, as the name suggests, the points belonging to each node are clustered by k-means, and the space is divided into clusters at each level of the hierarchy.
  • LSH (Locality Sensitive Hashing) is one of the most representative approximate nearest neighbor search techniques using a hash.
  • Here, LSH usable in vector spaces (see, for example, Non-Patent Document 8), which is relevant to the present invention, is described.
  • LSH uses a plurality of hash functions to select points that are considered to be in the vicinity of the query, and performs distance calculation only for those points.
  • the data space is divided into a plurality of randomly generated bases at equal intervals, thereby dividing the space into regions called buckets for indexing.
  • FIG. 21 is an explanatory diagram of LSH, which is one method of conventional approximate nearest neighbor search.
  • FIG. 21A shows the data space divided equally by the bins of the hash table along the directions of two randomly generated bases a_1 and a_2.
  • Each region divided along the axis a_1 is a bin indexed by the hash function h_{j1} associated with the base a_1,
  • and each region divided along the axis a_2 is a bin indexed by the hash function h_{j2} associated with the base a_2.
  • Each axis shows the index value.
  • A cell-like region where these two types of bins intersect, that is, a product region where the bins of each dimension of the two-dimensional hash table intersect, is a bucket.
  • The numbers in each bucket show the values of the indexes of the hash functions h_{j1} and h_{j2}.
  • FIG. 21B shows the state of the search area obtained by three projections.
  • In the LSH of Non-Patent Document 8, a hash function group of the following form is used:
  • h_i(x) = ⌊(a_i · x + b_i) / W⌋
  • where x is an arbitrary point, a_i is a vector whose elements are drawn independently from a Gaussian distribution, W is the hash width, and b_i is a real number chosen uniformly from the interval [0, W].
  • a hash function that is sensitive to locality is a hash function that has a high probability of taking the same hash value (index) at points close to each other and a low probability of taking the same hash value at points far away from each other.
  • A hash function group H_j is created by combining k hash functions h_{ji}.
  • A bucket is obtained by applying the hash function group H_j to the query.
  • This bucket is the target area for distance calculation associated with the one function group H_j.
  • L such hash function groups H_j (L sets) are created, and the union of the L areas obtained is finally taken as the target area for distance calculation with the query. (A sketch of one such group follows.)
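  • A sketch of one such hash function group in Python (parameter values are illustrative):

```python
import numpy as np

d, k, W = 32, 4, 4.0                      # data dimension, functions per group, hash width
rng = np.random.default_rng(0)
A = rng.standard_normal((k, d))           # a_i: vectors with Gaussian-distributed elements
b = rng.uniform(0.0, W, size=k)           # b_i drawn uniformly from [0, W]

def H(x):
    """One hash function group H_j: k LSH values combined into a bucket key."""
    return tuple(np.floor((A @ x + b) / W).astype(int))
```

  • Preparing L independent copies of (A, b) and taking the union of the L buckets H_1(q), ..., H_L(q) yields the final distance calculation targets.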
  • SH: An outline of spectral hashing (SH), a typical technique among those using binary-code hash functions, is given below.
  • SH is a technique that is said to provide good performance among those using a hash.
  • SH selects several principal components of the data space from the top and projects the data onto a Hamming space. Candidates whose distance in the projected Hamming space (the Hamming distance) is at or below a threshold are taken as the nearest neighbor candidates. That is, attending only to the top principal component bases, each sample is converted into a binary code, and the nearest neighbor candidates are selected according to the Hamming distance from the query.
  • The SH encoding assumes a uniform distribution over the data space, divides the space so that each divided area is as close to a rectangular parallelepiped as possible, and assigns a binary code to each bucket.
  • FIG. 22 is an explanatory diagram of SH, a conventional approximate nearest neighbor search method, showing the data space projected onto a two-dimensional Hamming space composed of two principal component bases pv_1 and pv_2.
  • Each axis carries an index indicated by a binary code.
  • The code of each bucket is the combination of the indexes on the axes pv_1 and pv_2.
  • The query belongs to the bucket denoted by reference numeral 111.
  • The gray area represents the search area for the query when the upper limit of the Hamming distance is 1.
  • The buckets of reference numerals 110, 101, and 011, which differ from 111 in only one code bit, are the search areas. Since SH projects the data space onto the principal component bases, the original distances are comparatively well preserved after projection; however, because distances in the projected space are Hamming distances, errors with respect to the Euclidean distance arise. For example, in FIG. 22 the region 011, which is far from the bucket 111 onto which the query is projected, is nevertheless included in the search area. (A simplified sketch of Hamming-based candidate selection follows.)
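  • The following simplified sketch illustrates candidate selection by Hamming distance over principal component projections. Sign thresholding is used as a stand-in for SH's actual eigenfunction-based encoding, so this shows only the selection mechanism:

```python
import numpy as np

def binary_codes(points, pcs):
    """Encode each point by the signs of its projections onto the top principal components."""
    return (points @ pcs.T > 0).astype(np.uint8)   # one bit per principal component

def hamming_candidates(q_code, codes, radius):
    """Nearest neighbor candidates: all points within the given Hamming radius."""
    dist = np.count_nonzero(codes != q_code, axis=1)
    return np.where(dist <= radius)[0]
```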
  • the nearest neighbor candidate is extracted from the approximate hypersphere region centered on the query by obtaining the distance between the bucket to which the query belongs and the bucket to which each point belongs.
  • FIG. 23 is an explanatory diagram showing a procedure for determining the distance of each bucket in the method of Sato et al., which is a conventional method of approximate nearest neighbor search.
  • the numerical values attached to the vertical axis and the horizontal axis are indexes.
  • Each cell-shaped section is a bucket as a bin product area.
  • The sequence of numbers 11 to 33 assigned to each bucket is the combination of the indexes of the bins that make up the bucket. This sequence can be regarded as a position vector indicating the position of the bucket in the data space.
  • A distance D between buckets is defined from the index sequences, that is, from the position vectors of the buckets.
  • The search for the nearest point then proceeds by referring to the data contained in the buckets in ascending order of bucket distance. In this way, a search over a hypersphere region centered on the query can be realized, and only points with a small approximate distance from the query are targeted for distance calculation.
  • FIG. 23 shows an example in which applying the hash functions to the query yields the bin indexes <2, 2>.
  • First, the distances to the points registered in the bucket with the same indexes as the query are calculated.
  • Next, the buckets at bucket distance 1 are searched:
  • the buckets with indexes <1, 2>, <2, 1>, <2, 3>, and <3, 2>. These buckets are searched in order. If it is determined that a sufficient number of buckets have been searched, the search terminates; if the number is still insufficient, the search range is expanded to farther buckets:
  • the buckets with indexes <1, 1>, <3, 1>, <3, 3>, and <1, 3>. (A sketch of this ordered traversal follows.)
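  • A sketch of this expanding traversal, taking the squared Euclidean distance between index vectors as the bucket distance D (one choice consistent with the search order above). For clarity it sorts all occupied buckets, which the algorithms described later avoid:

```python
def buckets_by_distance(q_index, table, max_candidates):
    """Visit occupied buckets in increasing index distance from the query's bucket."""
    keys = list(table.keys())
    D = [sum((a - b) ** 2 for a, b in zip(key, q_index)) for key in keys]
    candidates = []
    for _, key in sorted(zip(D, keys)):
        candidates.extend(table[key])          # points registered in this bucket
        if len(candidates) >= max_candidates:  # enough buckets searched
            break
    return candidates
```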
  • However, the accuracy of the approximate distance decreases when the dimension of the projection space is smaller than the dimension of the data space.
  • Moreover, since the hash size grows exponentially with the hash dimension number ρ, an enormous hash size is required to maintain accuracy for high-dimensional data. It is difficult to increase the hash dimension number ρ for such data, and consequently sufficient approximate distance accuracy cannot be obtained for high-dimensional data.
  • Furthermore, if the hash size becomes too large, many buckets must be referred to in order to maintain the accuracy of the nearest neighbor search, and extracting the nearest neighbor candidates takes a long time.
  • IVFADC nearest neighbor candidate selection: IVFADC uses simple vector quantization for the coarse quantization (see, for example, H. Jegou, M. Douze, and C. Schmid, "Product quantization for nearest neighbor search," IEEE Trans. PAMI, vol. 33, no. 1, pp. 117-128, 2011), dividing the data space into G regions.
  • At this time, the expected value (estimated distance) of the distance from the query to a point belonging to each region is the distance to the centroid of that region.
  • With G regions, the calculation cost of selecting the nearest neighbor candidates is of order O(G).
  • IMI nearest neighbor candidate selection: IMI has been proposed as an improvement of IVFADC (see, for example, A. Babenko and V. Lempitsky, "The inverted multi-index," Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3069-3076, 2012).
  • In IMI, product quantization is performed by dividing the vector x into two partial vectors U_1(x) and U_2(x).
  • Let the sets of partial centroids obtained be C^1 and C^2, each with g elements.
  • The estimated squared distance F_ij from the query q to a point belonging to the centroid pair (c_i^1, c_j^2) is the sum of the squared distances from the partial query vectors to the respective partial centroids: F_ij = d(U_1(q), c_i^1) + d(U_2(q), c_j^2). The problem of selecting centroids in order of proximity to the query therefore reduces, over the two partial distance lists, to the problem of finding combinations of i and j for which the sum F_ij is small. In IMI, this problem is solved with a combinatorial search method called the Multi-Sequence algorithm.
  • Each time the algorithm outputs the combination of subscripts with the smallest distance sum so far, it adds, as new candidates, the subscript pairs that may yield the next smallest sum, and repeats the selection of the next combination. The search terminates when the number of nearest neighbor candidates obtained reaches L points. (A sketch follows.)
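  • A sketch of the Multi-Sequence algorithm: dist1 and dist2 are the two partial distance lists, each sorted in ascending order, and pairs (i, j) come out in ascending order of the sum F_ij:

```python
import heapq

def multi_sequence(dist1, dist2, L):
    """Yield index pairs (i, j) in ascending order of dist1[i] + dist2[j], up to L pairs."""
    heap = [(dist1[0] + dist2[0], 0, 0)]
    pushed = {(0, 0)}
    out = []
    while heap and len(out) < L:
        f, i, j = heapq.heappop(heap)            # current smallest remaining sum
        out.append((i, j, f))
        for ni, nj in ((i + 1, j), (i, j + 1)):  # pairs that may yield the next smallest sum
            if ni < len(dist1) and nj < len(dist2) and (ni, nj) not in pushed:
                heapq.heappush(heap, (dist1[ni] + dist2[nj], ni, nj))
                pushed.add((ni, nj))
    return out
```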
  • For an equal number of space divisions G, product quantization is less accurate than vector quantization.
  • However, since the Multi-Sequence algorithm can be applied and nearest neighbor candidates can be obtained at low calculation cost, the search can be made faster.
  • For the inverted multi-index, it is reported that solutions are obtained faster than with IVFADC. The representative conventional nearest neighbor search methods have been described above.
  • the distance from the query can be estimated based on the index.
  • However, when the distance in the projected space cannot sufficiently preserve the distance in the original space, the estimated distance cannot sufficiently preserve the ordering of the true distances. In that case, many nearest neighbor candidates must be secured to attain search accuracy, and a solution cannot be obtained at high speed.
  • Among these methods, the inventors pay attention to the method of Sato et al.,
  • in which the approximate distance reflects the ordering of the true distances well.
  • This makes it possible to estimate distances with higher accuracy and at higher speed than the conventional methods, and as a result the overall approximate nearest neighbor search can be accelerated.
  • The present invention pays special attention to the distance estimation method and to an adaptive determination of the search range.
  • Bucket Distance Hashing an approximate nearest neighbor search method to which the first point-to-bucket distance estimation and the second hash table partitioning are applied is referred to as Bucket Distance Hashing or BDH.
  • A method in which BDH is combined with the space division suitable for distance estimation, which is the third improvement, is called k-means BDH.
  • A method in which k-means BDH is combined with the distance estimation based on a probability density function, which is the fourth improvement, is called k-means BDH P.
  • A method in which the adaptive search range, which is the fifth improvement, is combined with k-means BDH P is called k-means BDH PC.
  • The first improvement mainly concerns the accuracy of distance estimation. Sato et al. estimated the distance between the buckets to which the query and the data belong, that is, the bucket-to-bucket distance. Here, a method is proposed that improves the accuracy of distance estimation with the same data structure by estimating the distance from the exact query position to the bucket to which each data point belongs, that is, the point-to-bucket distance.
  • This technique uses the same Equation (5) as the technique of Sato et al.
  • In the method of Sato et al., the representative points are determined from the expected values of the coordinates of the points belonging to the same buckets as p_1 and as p_2.
  • That is, each representative point is determined so that its coordinates are the expected values of the coordinates of the points belonging to the bucket; the representative point is the center of gravity of the bucket.
  • In the proposed method, the Euclidean distance between the exact coordinates of p_1 and the coordinates of the representative point of the bucket to which p_2 belongs is calculated and used as the approximate distance.
  • Assuming a uniform distribution within each bin, the expected value of the Euclidean squared distance in the direction of base i is obtained; from the assumption that the bases are independent, the expected value of the Euclidean squared distance in the full space follows as the sum of these components (Equation (6)).
  • The distance between the point p_1 and the bucket B(p_2) to which p_2 belongs, that is, the point-to-bucket distance, is expressed by Equation (7).
  • Here, (h_i(p_2) + 1/2) represents the coordinate, in the a_i direction, of the center of gravity of the bucket B(p_2), and the distance of Equation (7) is equivalent to the distance between the query and the centroid of the bucket. Since the exact position of the query is used, distance estimation with higher accuracy than Equation (6) is realized.
  • Compared with the method of Sato et al., which estimates the bucket-to-bucket distance, the error variance of the estimated distance is reduced.
  • For convenience, the distance of Equation (7) is multiplied by W and rewritten as the following expression, in which BD_i(p_1, B(p_2)) denotes the distance component, in the direction of the i-th base, between the point p_1 and the center of gravity of the bucket B(p_2). (A sketch of this estimate follows.)
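  • A sketch of this point-to-bucket estimate: the query keeps its exact (projected) coordinates, while the registered point is replaced by the centroid of its bucket, at coordinate (h_i + 1/2)·W on each base:

```python
import numpy as np

def point_to_bucket_sqdist(q_proj, bucket_index, W):
    """Squared distance between the query projected onto the hash bases and the
    centroid of the bucket whose index vector is bucket_index."""
    centroid = (np.asarray(bucket_index) + 0.5) * W   # (h_i + 1/2) * W on each base
    diff = q_proj - centroid                          # the BD_i components
    return float(diff @ diff)
```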
  • FIG. 2 is an explanatory diagram showing an example of distance estimation according to this embodiment, corresponding to FIG. 1, which shows the distance estimation of the conventional method of Sato et al.
  • The star in FIG. 2 represents the query. Taking the distance between bucket centers as 1, in the horizontal direction the query is located 0.6 from the center of the left bucket and 0.4 from the center of the right bucket, and in the vertical direction 0.7 from the center of the upper bucket and 0.3 from the center of the lower bucket.
  • The numbers on the axes (2.56, 0.36, etc.) represent the weights of the bucket distances in the horizontal and vertical directions, and the numbers inside the buckets (3.05, 0.85, etc.) represent the estimated distance weights between the query and each bucket.
  • the weight is the Euclidean square distance.
  • FIG. 3 is an explanatory view showing a different example of distance estimation according to this embodiment.
  • The number of dimensions of the hash is ρ = 2.
  • This is an example of a search space and query different from those in FIG.
  • a circle centered on the query in FIG. 3 represents the size of the search radius, a bucket whose bucket center is within the circle is a search area, and the search area is shown in gray.
  • In BDH, the buckets within a search radius R given as a parameter are referred to. Since the search range is the inside of a circle centered on the query itself rather than on the center of the bucket containing the query (a hypersphere when the hash dimension ρ is extended to an arbitrary natural number),
  • distance estimation can be performed with higher accuracy than the method of Sato et al., which performs bucket-to-bucket distance estimation.
  • the process of selecting a bucket within the search radius R is more complicated than the method of Sato et al.
  • The reason is that the estimated distance is restricted to integers in the method of Sato et al. but is a real number in BDH, so handling the estimated distance exactly takes time.
  • the present invention proposes an algorithm for quickly identifying buckets within the search radius R.
  • FIG. 18 shows two algorithms, "Algorithm 1" and "Algorithm 2," based on the branch and bound method for quickly identifying the buckets within the search radius R in this embodiment.
  • To obtain the distance from the query to the representative point of a bucket, ρ coordinate values are required:
  • H(x) = {h_1(x), h_2(x), ..., h_ρ(x)}.
  • Given these, the distance to a bucket can be obtained with ρ additions.
  • The ρ bases are selected by principal component analysis and arranged in descending order of eigenvalue, so the data have large variance in the directions of the leading bases (those with small subscripts). The most efficient way to search for the buckets within radius R is therefore to evaluate the bases in subscript order, as described below.
  • Algorithm 2 takes two arguments: i, indicating the dimension of the hash value to be determined, and the distance D already accumulated over the first (i − 1) dimensions. In lines 1 to 7, if the last base (the stage at which the ρ-th hash value is determined) has not been reached (line 1), the k_i hash values (indexes) are tried, k_i being the number of divisions of the i-th base, that is, the number of bins making up a bucket in the i-th base direction (line 2).
  • BD_{ij} is the one-dimensional distance component in the i-th base direction when hash value j (the j-th bin) is selected,
  • and D + BD_{ij} + mD_i is a lower bound on the distance to the nearest reachable bucket at the stage where i hash values have been determined; only when it is smaller than the search radius R does the procedure call itself recursively and proceed to the next base (lines 3-5). Lines 9 to 13 are executed only when the last, ρ-th base has been reached: if a bucket lies within the search radius, the hash table is looked up. (A sketch of this recursion follows.)
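  • A sketch of the recursion of Algorithm 2: BD[i][j] is the squared distance component on base i when bin j is chosen, and mD is precomputed as in Algorithm 1 so that mD[i] is the smallest attainable sum over the bases from i onward:

```python
def buckets_within_radius(BD, R):
    rho = len(BD)
    mD = [0.0] * (rho + 1)                 # mD[i]: minimal sum over bases i .. rho-1
    for i in range(rho - 1, -1, -1):
        mD[i] = mD[i + 1] + min(BD[i])
    found = []
    def rec(i, D, index):
        if i == rho:                       # all rho hash values determined
            found.append(tuple(index))     # here the actual hash table would be looked up
            return
        for j, bd in enumerate(BD[i]):
            if D + bd + mD[i + 1] <= R:    # lower bound test: prune the branch otherwise
                rec(i + 1, D + bd, index + [j])
    rec(0, 0.0, [])
    return found
```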
  • Hash table division: The technique of Sato et al. has the problem that high-dimensional data cannot be searched at high speed. This problem is caused by the relationship between the number of dimensions of the hash and the number of buckets contained in the hash table (hereinafter referred to as the hash size).
  • To maintain the accuracy of the estimated distance, the hash dimension number ρ must be increased in accordance with the dimensionality of the data. However, if the hash dimension number ρ is increased, the hash size becomes enormous and no longer fits in memory.
  • Therefore, a method is proposed in which the high-dimensional hash table is divided, and the estimated distance of the high-dimensional hash is obtained by integrating the estimated distances obtained from the low-dimensional hash tables.
  • That is, the estimated distance from a query q to any point p can be expressed as the sum of the estimated distances computed from the M divided hash tables, which is equal to the estimated distance determined by the ρ-dimensional hash.
  • The advantage of dividing the hash table is that, compared with using one hash table, the hash size can be drastically reduced even when a hash of the same total number of dimensions is expressed, as the following sketch illustrates.
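  • A small numeric sketch of the saving (s bins per base, ρ bases, M tables; the values are illustrative):

```python
s, rho, M = 4, 32, 8
single  = s ** rho              # buckets in one rho-dimensional table: about 1.8e19
divided = M * s ** (rho // M)   # total buckets across M (rho/M)-dimensional tables: 2048
print(single, divided)

# The overall estimate is composed from the divided tables:
#   est(q, p) = sum of est_m(q, p) over the M tables
```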
  • FIG. 4 is a graph showing the relationship between accuracy and processing time when the dimension number ρ of the multidimensional hash of the conventional method of Sato et al. is changed.
  • the data used here is 64 dimensions or 128 dimensions
  • FIG. 4A shows the case of 64 dimensions
  • FIG. 4B shows the case of 128 dimensions.
  • The artificial data, 10 million points following a normal distribution, and the 2,000 queries were generated under the same conditions.
  • The computer used had an Opteron(tm) 6174 (2.2 GHz) CPU and 256 GB of memory, and the experiment was performed on a single core.
  • As data, 10 million points each were prepared from artificial data following normal distributions of 64, 128, and 256 dimensions (the variance of each base chosen uniformly from 100 to 400) and from SIFT features (128 dimensions; for SIFT, see D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004) extracted from the frame images of the video distributed by the Instance Search task of TRECVID2010.
  • the query uses 2000 points created under the same conditions as the database, and the average is the result.
  • FIG. 5 shows the relationship between accuracy and processing time for artificial data results.
  • FIG. 5A shows experimental results of 64-dimensional artificial data
  • FIG. 5B shows experimental results of 128-dimensional artificial data
  • FIG. 5C shows experimental results of 256-dimensional artificial data.
  • Table 1 shows the memory usage at this time.
  • The results for the image data are shown in FIG. 6. In FIGS. 5 and 6, the horizontal axis represents accuracy and the vertical axis processing time.
  • In the legends, "single hash" denotes the case where the hash table is not divided, and "divided hash" the case where the hash table is divided according to the present invention.
  • BDH is improved by three additional approaches.
  • The first is a space division suited to distance estimation;
  • the second is distance estimation based on a probability density function;
  • the third is expansion of the search area considering the data density around the query. The details are described below.
  • Let V_i be the unit vector in the direction of the i-th basis.
  • Buckets are defined by representative vectors: the region in which a given vector is the closest is the range of that bucket.
  • Such a vector is called a representative vector.
  • The data space is thereby divided into ∏_i k_i regions, and the distance component between a point p and the representative value C_ij in the i-th base direction, that is, the base distance BD_i, is defined accordingly.
  • Here X denotes the data set, X_n the n-th data point, and BE_i the error in the i-th principal component direction.
  • The representative values {C_ij} on each base are obtained by the k-means method, whose objective function coincides with that of the present invention. The important question is how many representative values to allocate to each base. The simplest answer is to divide all bases into the same number of regions; in real data, however, the variance of the data projected onto the principal component bases differs greatly from basis to basis, and treating them equally is not efficient. Naturally, for the same total number of representative vectors, a smaller estimation error is preferable.
  • FIG. 7 is an explanatory diagram showing a comparative example of equal division and adaptive base division according to this embodiment.
  • The estimated distance BD_i(p, j) in the i-th principal component direction is expressed as the expected value of the squared base distance, as follows. (A sketch of the per-basis k-means follows.)
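  • A sketch of obtaining the representative values {C_ij} on one base with a one-dimensional k-means (values is a NumPy array of the data projected onto that base; k is the number of divisions k_i allocated to it):

```python
import numpy as np

def kmeans_1d(values, k, iters=50):
    """Representative values minimizing the squared quantization error on one basis."""
    centers = np.quantile(values, (np.arange(k) + 0.5) / k)   # spread the initial centers
    for _ in range(iters):
        labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            members = values[labels == j]
            if members.size:
                centers[j] = members.mean()   # move each center to its cluster mean
    return np.sort(centers)
```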
  • the feature of the distance estimation proposed here is that the fitness to a general complex distribution is higher than that of the distance estimation of BDH, and at the same time, the calculation cost is low and it is suitable for high-speed processing.
  • <Fifth improvement: expansion of the search radius considering the data density around the query>
  • In conventional methods, a search area is generally determined according to a search radius R given as a parameter (for the search radius R, see, for example, W. Dong, Z. Wang, W. Josephson, M. Charikar, and K. Li, "Modeling LSH for performance tuning," Proceedings of the 17th ACM Conference on Information and Knowledge Management, pp. 669-678, 2008).
  • FIG. 8 is an explanatory diagram showing how the search radius is changed according to the density around the query as an example of this embodiment.
  • The number of dimensions of the hash is ρ = 2.
  • Some implementations obtain the buckets to refer to by calculating the distance on the hash for all buckets (for example, H. Jegou, M. Douze, and C. Schmid, "Product quantization for nearest neighbor search," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 1, pp. 117-128, 2011), while others have a hash structure that calculates the distance on the hash for all sample points and determine the points whose distance is less than the search radius (SH and its improved methods); in either case, processing time is also required for specifying the nearest neighbor candidates.
  • In Algorithm 1, the distance mD_i, obtained by adding the minimum values attainable by each of the remaining hash functions, is calculated, and the buckets existing within radius R from the query are searched.
  • "Algorithm 3" and "Algorithm 4" additionally use the distance MD_i, obtained by adding the maximum attainable values, together with the condition of lying beyond the radius L from the query.
  • In lines 1 to 4, the minimum value mD_i and the maximum value MD_i over the remaining dimensions are calculated. In lines 5 to 6, the initial values of L and U, the lower and upper limits of the search radius, are set. In lines 7 to 11, the function of Algorithm 4 is called, and the search continues, widening the search radius step by step, until the number of nearest neighbor candidates reaches c or more. (A sketch follows.)
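  • A sketch of the adaptive widening, following the structure of Algorithm 3. enumerate_within is assumed to return the candidates whose estimated distance lies in the shell [L, U), for example the branch-and-bound routine above extended with the lower bound:

```python
def adaptive_candidates(enumerate_within, c, step, max_radius):
    """Widen the search radius until at least c nearest neighbor candidates are found."""
    L, U = 0.0, step                # lower and upper limits of the search radius
    candidates = []
    while len(candidates) < c and U <= max_radius:
        candidates.extend(enumerate_within(L, U))   # only the newly uncovered shell
        L, U = U, U + step          # expand the radius by one step
    return candidates
```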
  • The computer used in the experiment had an Opteron(tm) 6174 (2.2 GHz) CPU and 256 GB of memory, and the experiment was performed on a single core.
  • SIFT features were extracted from images obtained every 10 seconds from the video distributed by the Instance Search task of TRECVID2010, and duplicate vectors were removed.
• the horizontal axis is the code length [bit] and the vertical axis is the correlation coefficient.
• the horizontal axis is the code length [bit] and the vertical axis is the rank of the nearest point. It can be seen that the average rank of the nearest point is better (smaller) for BDH than for SH at every code length.
• compared with SH, the rank of the nearest neighbor is about 1/8 at 20 bits, about 1/8 at 40 bits, and about 1/13 at 80 bits.
• k-means BDH is obtained by applying the space partitioning method using k-means described as the third improvement.
• k-means BDH P applies the distance estimation based on the probability density function, the fourth improvement, to k-means BDH.
• the performance of k-means BDH alone is lower than that of BDH, but an improvement is seen when it is combined with the probability density function.
• k-means BDH P (the method that fixes the search radius R), which uses only the fourth improvement (distance estimation based on the probability density function) and was the fastest in the preceding figure, is compared with k-means BDH PC, which adds the fifth improvement (the method that fixes the number of nearest neighbor candidates c). The result of the comparison is shown in the corresponding figure. By reflecting the data density around the query, the speed was confirmed to roughly double, a sufficient effect.
• the search parameters are ε for ANN, the Hamming distance R for SH, and the number of nearest neighbor candidates c for k-means BDH PC.
  • the comparison results are shown in FIG.
• compared at the same accuracy, the processing time of k-means BDH PC is about 1/10 of ANN and about 1/6 to 1/12 of SH, a substantial speedup.
• the number of buckets G is:
• increasing M increases the number of buckets, but searching the adjacent buckets then takes time. Therefore a large M cannot be used easily, and for high-dimensional vectors the M used for indexing becomes relatively small with respect to the total number of dimensions, so the accuracy of distance estimation cannot be secured and the performance of the approximate nearest neighbor search degrades.
• the number of buckets is set according to the data set size.
• M sets of P orthogonal bases are selected from the orthogonal bases V, and the data space is expressed as a direct product of M P-dimensional subspaces.
• let V_m be the P orthogonal bases that span the m-th subspace.
• a centroid set C_m is obtained for each subspace using the k-means algorithm so as to minimize the quantization error. This is equivalent to product quantization: the vector projected onto the PM-dimensional space spanned by the bases is divided into M P-dimensional partial vectors.
• the hash function H(·) is as follows (a hedged sketch is given below).
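Since the equation did not survive extraction, the following is a hedged sketch of what the text describes: project x onto each of the M P-dimensional subspaces and take the index of the nearest centroid in that subspace, so that H(x) is an M-tuple of centroid indexes.

```python
import numpy as np

def hash_index(x, subspace_bases, subspace_centroids):
    """subspace_bases[m]: (P, D) array whose rows span the m-th subspace V_m.
    subspace_centroids[m]: (k_m, P) centroid set C_m learned by k-means."""
    index = []
    for V_m, C_m in zip(subspace_bases, subspace_centroids):
        y = V_m @ x                                      # P-dimensional projection
        index.append(int(np.argmin(((C_m - y) ** 2).sum(axis=1))))
    return tuple(index)                                  # H(x) = (h_1, ..., h_M)
```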
• Bucket distance: an estimator of the distance that minimizes the estimation error is derived, and the bucket distance is defined with it. Assuming no correlation between the bases (if principal component bases are used, non-correlation up to the second order is guaranteed), the error of the estimated distance in each basis direction can be minimized independently.
• the minimization problem is defined as follows: the data x follows the probability density function P(x), and, letting Z be the event that x exists in a certain region z, the distribution of the data in the region z is expressed as P(x|Z).
• in Equation (24), Var[·] is a function that returns the unbiased variance, with its argument regarded as a sample of the population. Note that the estimator obtained here is not the distance from the center of gravity of the region; the decomposition below makes this explicit. From the above results, the bucket distance F_H whose hash value list is H is defined as follows. In the following Equation (26), u_m^p is the p-th principal component basis of the m-th subspace.
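The remark that the estimator is not the distance from the center of gravity can be seen from the standard bias-variance decomposition of the conditional expected squared distance (a known identity, shown here for one basis direction):

```latex
\mathbb{E}\left[(q - x)^2 \mid Z\right]
  = \bigl(q - \mathbb{E}[x \mid Z]\bigr)^2 + \mathrm{Var}[x \mid Z]
```

The Var[x|Z] term is exactly the amount by which the error-minimizing estimate exceeds the squared distance to the region's centroid.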
• FIG. 24 shows the distribution of points in the data space, where u_1 and u_2 represent the first and second principal component bases. When the bases are quantized independently, as in "Normal" of FIG. 24A, they often retain a correlation, resulting in wasted quantization. Therefore, as in the PCA case of FIG. 24, the data are quantized after being decorrelated by principal component analysis.
• the relationship between accuracy and processing time was verified by first examining it without limiting memory usage, and then by applying the memory reduction and setting an upper limit on the amount of usable memory.
• the problem setting is the K-nearest neighbor search problem, which searches for the K points closest to the query.
  • 11 Feature point extraction unit
  • 13 Search range determination unit
  • 15 Image database
  • 15h Hash table
  • 17 Nearest neighbor determination unit
  • 19 Voting unit
  • 21 Voting table
  • 23 Image selection unit

Abstract

An objective of the present invention is to implement an approximate nearest neighbor search that is rapid and highly precise by appropriately reducing the number of nearest neighbor candidates. An approximate nearest neighbor search device is provided which comprises: a database storage unit which, when a plurality of points represented by vector data is inputted, computes a hash index by applying a hash function to each point, and stores each point in a multi-dimensional hash table by projecting each point into a multi-dimensional space which is segmented into a plurality of regions by the bins of the multi-dimensional hash table; a search range establishment unit which, when a query is inputted, applies the hash function to the query, establishes the location of the query within the space, establishes estimated values of the distance from the query to each region within the space, and establishes the regions to be searched on the basis of the estimates; and a nearest neighbor establishment unit which calculates the distance from each point within the search region to the query, and returns the point nearest to the query as the query's nearest neighbor. The search range establishment unit refers to the index of each region to derive a representative point of the region, establishes the estimate on the basis of the distance between the query and each representative point, and applies a branch-and-bound technique, excluding the regions which cannot be regions to be searched, to establish the regions to be searched.

Description

Approximate nearest neighbor search device, approximate nearest neighbor search method, and program thereof
The present invention relates to an approximate nearest neighbor search device, an approximate nearest neighbor search method, and a program thereof, and more specifically to an approximate nearest neighbor search device, method, and program using fast distance estimation by hashing with a codebook.
Large-scale data processing is indispensable in recent information processing, and its importance increases year by year. One factor is hardware: improved performance and lower cost of computing environments, together with larger storage media, have made it possible to handle large amounts of data and, moreover, to process them in realistic time. Other factors are content and needs: ordinary people now create various content such as photos, videos, and music themselves and upload it to sites such as Flickr (URL: http://www.flickr.com/) and YouTube (URL: http://www.youtube.com/), and the desire has arisen to quickly find, among all of this, the items that match one's own interests. Not limited to such photos and videos, the total amount of data that humans handle continues to increase at a tremendous rate, so the development of large-scale data processing technology is an urgent issue.
One answer to this problem is the nearest neighbor search. Nearest neighbor search is a method of finding, among data (points) expressed as vectors, the data most similar to a query, that is, the point with the smallest distance (the nearest point). It is a basic and effective technique for processing large-scale data, and precisely because it is basic, an excellent method has a wide range of applications. Applications the inventors themselves have been involved in include object recognition (see, for example, Non-Patent Document 1), document image search (Non-Patent Document 2), character recognition (Non-Patent Document 3), and face recognition (Non-Patent Document 4); all have been shown to operate very fast on relatively large amounts of data. These are image recognition techniques in a broad sense, but the application fields of nearest neighbor search are not limited to them, extending to statistical classification, coding theory, data compression, recommendation systems, spell checkers, and so on.
Considering practical applications, nearest neighbor search must be fast. If it becomes faster, larger amounts of data can be processed in the same time, and it becomes usable for applications that were previously abandoned because of processing time constraints.
The approximate nearest neighbor search was created to satisfy this requirement. It introduces approximation into nearest neighbor search and, by tolerating search errors, can greatly reduce processing time. Because search errors occur, there is no guarantee that the true nearest point is found, but the technique is used in applications for which a certain degree of search accuracy suffices. Improving search accuracy and increasing speed are conflicting demands, and it is generally not easy to satisfy both at once.
Typical approximate nearest neighbor search methods proposed so far are as follows. As methods using a tree structure, Approximate Nearest Neighbor (ANN; see, for example, Non-Patent Document 5) and Fast Library for Approximate Nearest Neighbors (FLANN; see Non-Patent Document 6) are known. As methods using hashing, Locality Sensitive Hashing (LSH; see Non-Patent Documents 7, 8, and 9), Spectral Hashing (SH; see Non-Patent Document 10), and the method of Sato et al. (Non-Patent Document 11) are known.
Furthermore, the randomized kd-tree (see Non-Patent Document 12) and hierarchical k-means (see Non-Patent Document 13) are known as parameter tuning techniques adopted in FLANN.
In approximate nearest neighbor search, the data space is divided into many regions in advance for indexing. The present invention relates in particular to techniques that use hashing. Approximate nearest neighbor search using a hash basically divides the data space linearly along a plurality of axes and registers the data in a hash table for each divided region (bin). A fast search is then realized by extracting from the hash table the points that fall in the query's region or in its neighboring regions and finding the nearest point among them. When a plurality of hash tables is used, the product region of bins (a bucket) is the unit of division.
In general, approximate nearest neighbor search consists of two stages. The first stage identifies bins or buckets with a high probability of containing the nearest point; the points belonging to them are called the nearest neighbor candidates. The second stage computes, among the points in the candidate set, the one with the smallest distance from the query. In other words, the "approximate" element enters only in the first stage, so this stage determines the search accuracy and processing time of the approximate nearest neighbor search. A generic sketch of this two-stage flow follows below.
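The following is a generic, minimal sketch of the two-stage flow (not the patented method): stage 1 narrows the candidates through a hash table built by linear division of the axes, and stage 2 computes exact distances only for the candidates. Only the query's own bucket is probed here; practical methods also probe neighboring bins.

```python
import numpy as np
from collections import defaultdict

def build_table(points, bin_width):
    """Register each point in the bin given by floor(x / bin_width)."""
    table = defaultdict(list)
    for i, x in enumerate(points):
        table[tuple(np.floor(x / bin_width).astype(int))].append(i)
    return table

def query_nn(q, points, table, bin_width):
    key = tuple(np.floor(q / bin_width).astype(int))
    candidates = table.get(key, [])                  # stage 1: candidate set
    if not candidates:
        return None                                  # a real method widens the search
    d = ((points[candidates] - q) ** 2).sum(axis=1)  # stage 2: exact distances
    return candidates[int(np.argmin(d))]
```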
In the first-stage identification of nearest neighbor candidates, limiting the distance computations to the candidates greatly reduces processing time. However, this is a double-edged sword: although reducing the number of candidates increases speed, search accuracy drops, because the true nearest point may leak out of the candidate set and the probability of a failed search rises. What the first stage therefore requires is to narrow the number of candidates without losing the true nearest point, and the candidate selection must itself be fast.
The policy the inventors adopted to satisfy these requirements is to use, when selecting the nearest neighbor candidates, the same distance measure used in the second stage. This way, even if the number of candidates is reduced, the probability that the true nearest point is included hardly decreases. However, such processing usually carries a large computational cost, and an approximate nearest neighbor search method that can execute it at high speed is desired.
The present invention has been made in view of the above circumstances and provides a new approximate nearest neighbor search method that combines high search accuracy with high speed by appropriately narrowing down the nearest neighbor candidates.
The invention provides an approximate nearest neighbor search device comprising: a database storage unit in which, when a plurality of points expressed by vector data is input, a hash function for computing the index of a multidimensional hash table is applied to each point to compute its hash index, and each point is stored in the multidimensional hash table by projecting it into the region corresponding to its hash index within a multidimensional space divided into a plurality of regions by the bins of the multidimensional hash table; a search range determination unit that, when a query is input, applies the hash function to the query to determine the query's position in the multidimensional space, determines an estimated value of the distance between the query and each region in the space, and determines at least one region to be searched on the basis of the estimates; and a nearest neighbor determination unit that computes the distance between the query and each point in the region to be searched and outputs the point closest to the query as the query's nearest neighbor, wherein the search range determination unit refers to the index of each region to obtain a representative point of the region, determines the estimated value on the basis of the distance between the query and each representative point, and applies the branch and bound method to exclude regions that cannot be regions to be searched.
In other words, the invention provides an approximate nearest neighbor search device comprising: a database storage unit in which, when a plurality of points represented by vector data is input, a hash function is applied to each point to compute a hash index, and each point is registered in a multidimensional hash table by projecting it into the region corresponding to the index within a multidimensional space divided into a plurality of regions by an orthonormal basis; a search range determination unit that, when a query is input, applies the hash function to the query to determine its position in the space, determines an estimated value of the distance between the query and each region in the space, and determines one or more regions as search regions based on the estimates; and a nearest neighbor determination unit that computes the distance between the query and each point in the search regions and outputs the point closest to the query as the query's nearest neighbor, wherein the search range determination unit refers to the index of the region to which each point belongs to obtain the position of a representative point representing that region and determines the estimated value.
From a different viewpoint, the invention provides an approximate nearest neighbor search method in which a computer: accesses a database storage unit in which, when a plurality of points represented by vector data is input, a hash function for computing the index of a multidimensional hash table is applied to each point to compute its hash index and each point is stored in the multidimensional hash table by projecting it into the region corresponding to the hash index within a multidimensional space divided into a plurality of regions by the bins of the multidimensional hash table; performs a search range determination step that, when a query is input, applies the hash function to the query to determine the query's position in the space, determines an estimated value of the distance between the query and each region in the space, and determines at least one region to be searched on the basis of the estimates; and computes the distance between the query and each point in the region to be searched, outputting the point closest to the query as the query's nearest neighbor, wherein the search range determination step refers to the index of each region to obtain a representative point of the region, determines the estimated value based on the distance between the query and each representative point, and applies the branch and bound method to exclude regions that cannot be regions to be searched.
In other words, there is provided an approximate nearest neighbor search method in which a computer: accesses a database storage unit in which each point is registered in a multidimensional hash table by applying a hash function to each input point to compute a hash index and projecting the point into the region corresponding to the index within a multidimensional space divided into a plurality of regions by an orthonormal basis; performs a search range determination step that applies the hash function to an input query to determine its position in the space, determines an estimated value of the distance between the query and each region, and determines one or more search regions based on the estimates; and computes the distance between the query and each point in the search regions, outputting the closest point as the query's nearest neighbor, wherein the search range determination step refers to the index of each region to obtain the position of its representative point and takes the difference between the query's position and each representative point's position as the estimated value.
From a further viewpoint, the invention provides an approximate nearest neighbor search program (or a program product) that causes a computer to execute: processing to access a database storage unit in which, when a plurality of points represented by vector data is input, a hash function for computing the index of a multidimensional hash table is applied to each point to compute its hash index and each point is stored in the multidimensional hash table by projecting it into the region corresponding to the hash index within a multidimensional space divided into a plurality of regions by the bins of the multidimensional hash table; processing as a search range determination unit that, when a query is input, applies the hash function to the query to determine its position in the space, determines an estimated value of the distance between the query and each region, and determines at least one region to be searched based on the estimates; and processing as a nearest neighbor determination unit that computes the distance between the query and each point in the region to be searched and outputs the point closest to the query as the query's nearest neighbor, wherein the search range determination unit refers to the index of each region to obtain a representative point, determines the estimated value based on the distance between the query and each representative point, and applies the branch and bound method to exclude regions that cannot be regions to be searched.
The inventions from the above three viewpoints all relate to the description of "First improvement: hash-based point-to-bucket distance estimation" in the embodiments described later.
In the approximate nearest neighbor search device of this invention, the search range determination unit refers to the index of each region to obtain its representative point and determines the estimated value based on the distance between the query and each representative point. Therefore, without computing the distance between the query and every region, the estimated values are obtained using the indexes, the regions to be searched (search regions) are determined from them, and the points for which the nearest neighbor determination unit must compute distances are narrowed down. Compared with the method of Sato et al., which estimates bucket-to-bucket distances, the present method, which estimates point-to-bucket distances, improves the accuracy of distance estimation with the same data structure. Furthermore, because the branch and bound method excludes regions that cannot become search regions, the search regions can be determined in a short time.
In this invention, each item of data registered in the database and each query used to search the database are expressed as at least one point. Each point has attributes indicating the features of the data or query, and the attributes are expressed as vector data. Hashing is a well-known technique for retrieving data at high speed in memory. A hash function takes vector data as input and outputs a scalar value. That scalar value is a discrete value used to refer to a hash table, a kind of data table, and is called a hash value, hash index, or simply index. A hash function can be said to divide the output space into as many parts as the discrete values the index can take; it can also be said to project the input vector data (points) into an output multidimensional space. In this invention, each vector datum is projected into a multidimensional space. Since the output of one hash function is one scalar value, ν hash functions are used to project the vector data into, say, a ν-dimensional space (ν being an integer of 2 or more). By applying the ν hash functions, each point is projected into a ν-dimensional space that is composed of ν bases and divided into a plurality of bins along each basis, with one bin identified by the index of each hash function. Since the registration of each point into each bin is represented using a hash table, there are ν hash tables (ν dimensions). However, several dimensions may be grouped together under a single hash function, in which case there are fewer than ν hash tables.
The approximate nearest neighbor search technique of this invention can be applied to object recognition, document image search, character recognition, face recognition, statistical classification, coding theory, data compression, recommendation systems, spell checkers, and so on.
In this specification, each region divided by a single hash function is called a bin, and the product region of the bins generated by a plurality of hash functions, that is, each region divided by the plurality of hash functions, is called a bucket.
The approximate nearest neighbor search of this invention is realized by a computer processing data while using hardware resources such as a storage device.
• An explanatory diagram showing distance estimation in the method of Sato et al., a conventional approximate nearest neighbor search method.
• An explanatory diagram showing an example of point-to-bucket distance estimation in this invention.
• An explanatory diagram showing a different example of point-to-bucket distance estimation in this invention.
• A graph showing the relationship between accuracy and processing time when the number of hash dimensions ν is changed in the conventional method of Sato et al.
• A first graph showing the relationship between accuracy and processing time at the optimal parameters, compared with conventional methods.
• A second graph showing the relationship between accuracy and processing time at the optimal parameters, compared with conventional methods.
• An explanatory diagram showing an example of data space division suitable for distance estimation.
• An explanatory diagram showing how the search radius is changed according to the density around the query.
• A graph showing the accuracy of distance estimation by BDH and k-means BDH of this invention, with the code length on the horizontal axis and the correlation coefficient on the vertical axis, compared with the conventional SH.
• A graph showing the relationship between the estimated distance and the true distance when the code length is 120 bits.
• A graph showing, for BDH and k-means BDH, the average rank of the nearest point at the distance on the hash when the code length is changed, compared with the conventional SH.
• A graph showing the relationship between search accuracy and processing time for each of BDH, k-means BDH, and k-means BDH P when the search radius R is fixed.
• A graph comparing the relationship between search accuracy and processing time of k-means BDH PC with the conventional method of fixing the search radius R.
• A graph comparing the relationship between search accuracy and processing time of k-means BDH PC with the conventional methods ANN and SH.
• A graph showing the relationship between the parameter c for the number of nearest neighbor candidates and processing time.
• A graph showing the relationship between the parameter c for the number of nearest neighbor candidates and accuracy.
• A graph showing the relationship between the parameter c for the number of nearest neighbor candidates and the number of nearest neighbor candidates.
• A diagram showing two algorithms that quickly identify the buckets within the search radius R.
• A diagram showing two algorithms that adaptively change the search radius.
• A block diagram showing an image recognition device as an application example of the approximate nearest neighbor search device of this invention.
• An explanatory diagram of LSH, a conventional approximate nearest neighbor search method.
• An explanatory diagram of SH, a conventional approximate nearest neighbor search method.
• An explanatory diagram showing approximate distance values in the method of Sato et al., a conventional approximate nearest neighbor search method.
• An explanatory diagram showing an example of the distribution of points in the data space.
• A graph showing the results of a comparison experiment on processing time and accuracy (SIFT, 1 million points).
• A graph showing the results of a comparison experiment on processing time and accuracy (SIFT, 10 million points).
• A graph showing the results of a comparison experiment on processing time and accuracy (SIFT, 100 million points).
• A graph showing the results of a comparison experiment on processing time and accuracy (GIST, 100,000 points).
• A graph showing the results of a comparison experiment on processing time and accuracy (GIST, 1 million points).
• A graph showing the results of a comparison experiment on processing time and accuracy (GIST, 10 million points).
Before describing this invention in detail, preferred aspects of the invention are explained.
In the database storage unit, each point may be registered in each of M sets (M being a natural number of 2 or more) of multidimensional hash tables, one per hash function group, where each group is a combination of a plurality of hash functions. If, for example, ν hash functions (ν > M) were used as one undivided group and each hash table has s bins, data storage areas for the s bins would have to be prepared ν times over, so the hash table size would be of order O(s^ν); dividing the hash functions into M groups suppresses the order to O(s^(ν/M)) per table. That is, dividing the hash table saves the storage capacity required for the approximate nearest neighbor search. Here O(s^ν) denotes the approximate amount of computation or storage needed to solve the problem: for a given s it is of the order of s to the ν-th power, that is, bounded by a·s^ν + b·s^(ν−1) + … + l·s² + m·s + n, where a, b, …, l, m, n are constants. A worked example with assumed numbers follows below.
This aspect relates to the description of "Second improvement: division of the hash table" in the embodiments described later.
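A worked example with assumed numbers may make the saving concrete: with s = 4 bins per axis and ν = 32 axes, a single 32-dimensional table would need on the order of 4^32 ≈ 1.8 × 10^19 bins, whereas M = 8 tables of 4 dimensions each need only 8 × 4^4 = 2048.

```python
s, nu, M = 4, 32, 8                    # assumed example values
single_table = s ** nu                 # one nu-dimensional table: 4**32 bins
split_tables = M * s ** (nu // M)      # M tables of nu/M dimensions: 2048 bins
print(single_table, split_tables)
```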
Furthermore, the M sets of hash tables may be defined so that the number of hash dimensions in each is roughly equal. The bases may then be selected so that the sums of the variances of the bases are roughly equal: concretely, for each of the M divided hash tables, the variances of the data along its basis directions are summed, and the assignment of bases to the tables is chosen so that these sums are as equal as possible. In general, distances along basis directions with large variance tend to be large, and distances along directions with small variance tend to be small, so equalizing the variance sums has the effect of aligning the magnitudes of the distances computed per hash table. When the search radius R used for selecting distance computation targets and for truncating distance computations (replacing the distance with a constant value when no target point lies within the search range), as in the embodiments of this specification, is shared across the hash tables, equalizing the variance sums means the M tables contribute about equally to determining the nearest neighbor candidates, so no per-table substitute for R needs to be set and the computation stays simple. A sketch of one such assignment follows below.
This aspect relates to the description of "Second improvement: division of the hash table" in the embodiments described later.
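One simple way to realize such an assignment (a sketch, not necessarily the method of the embodiments) is the greedy heuristic for balanced number partitioning: assign each basis, in decreasing order of variance, to the group whose variance sum is currently smallest. Note that this sketch balances the variance sums only and does not also enforce equal group sizes.

```python
import numpy as np

def split_bases_by_variance(variances, M):
    """Assign each basis to one of M hash-table groups so that the per-group
    variance sums come out roughly equal (greedy heuristic)."""
    groups = [[] for _ in range(M)]
    load = np.zeros(M)
    for i in np.argsort(-np.asarray(variances, dtype=float)):
        g = int(np.argmin(load))       # group with the smallest variance sum
        groups[g].append(int(i))
        load[g] += variances[i]
    return groups
```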
The multidimensional hash table may be a combination of hash tables, one per dimension, each dividing the basis corresponding to its dimension into bins, where the width of each bin is adjusted as follows: for each bin, the positional error between the points that would be registered in it under equal-width bins and the representative point of the bin is computed, and the widths are adjusted so that the sum of these errors becomes smaller. In this way, the error of the estimated distances with respect to the true distances becomes smaller than with equally divided bin widths, and the search regions can be determined with better accuracy. If the numbers of vectors registered in the bins vary, the processing time for computing distances to the points in the search region varies with how many points its buckets hold, but the leveling described later suppresses this variation. A sketch of the boundary adjustment follows below.
This aspect relates to the description of "Third improvement: division of the data space suitable for distance estimation" in the embodiments described later.
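A hedged sketch of this boundary adjustment, starting from equal-width bins and iterating Lloyd-Max style: each bin's representative is set to the mean of the points it holds, and each interior boundary is moved to the midpoint between neighboring representatives, shrinking the summed point-to-representative error.

```python
import numpy as np

def refine_bins(values, n_bins, iters=20):
    values = np.sort(np.asarray(values, dtype=float))
    edges = np.linspace(values[0], values[-1], n_bins + 1)  # equal-width start
    for _ in range(iters):
        which = np.clip(np.searchsorted(edges, values, side="right") - 1,
                        0, n_bins - 1)
        reps = np.array([values[which == b].mean() if np.any(which == b)
                         else 0.5 * (edges[b] + edges[b + 1])
                         for b in range(n_bins)])           # bin representatives
        edges[1:-1] = 0.5 * (reps[:-1] + reps[1:])          # midpoint boundaries
    return edges, reps
```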
Furthermore, each hash table may determine its bin widths by clustering the points into a predetermined number n of clusters, computing a representative point per cluster, and requiring that the variance, which represents the average distance from the points belonging to each cluster to its representative point, be smaller than a predetermined threshold. By determining the bin widths so that the variance of the points in each bin falls below the threshold, the number of points registered per bin can be leveled to an appropriate range.
This aspect relates to the description of "Third improvement: division of the data space suitable for distance estimation" in the embodiments described later.
Furthermore, the search range determination unit may obtain a probability density function from the distribution of the vector data along each basis direction and determine the estimated values using that probability density function to weight the distances. Using a probability density function enables appropriate leveling even for complex distributions.
This aspect relates to the description of "Fourth improvement: distance estimation based on the probability density function" in the embodiments described later.
The search range determination unit may also take as the search regions those regions whose representative points lie within a predetermined search radius R centered on the query; by fixing R in advance, the search regions can be determined.
This aspect relates to the description of "Fifth improvement: expansion of the search radius considering the data density around the query" in the embodiments described later.
Alternatively, the search range determination unit may take as the search regions those regions whose representative points lie within a search radius R centered on the query, gradually enlarging R until the number of points contained in the search regions reaches a predetermined number. In this way, even if the amount of data registered in the query's bucket and its surrounding buckets varies, the number of points in the search regions, that is, the number of nearest neighbor candidates, is brought to a predetermined number, so the accuracy of the approximate nearest neighbor search is stabilized and the processing time for the distance computations within the search regions becomes roughly constant.
This aspect relates to the description of "Fifth improvement: expansion of the search radius considering the data density around the query" in the embodiments described later.
Furthermore, the database storage unit may project each point into a ν-dimensional space whose bases are determined by principal component analysis, and the search range determination unit may compute, as the estimated values, the distance components along each basis direction of the distance between the query and each representative point. In the course of computing the components, the constraint that their sum must lie within the given search radius R is imposed, and a branch and bound method that checks, in order of the bases with the largest eigenvalues of the principal component analysis, whether each region's representative point lies within radius R is used to prune regions that cannot become search regions. Because regions whose representative points lie beyond radius R are pruned starting from the basis directions with large variance, the search regions can be determined in a shorter time than by checking every region exhaustively. A sketch of this pruning test follows below.
This aspect relates to the description of "First improvement: hash-based point-to-bucket distance estimation" in the embodiments described later.
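A minimal sketch of the pruning test behind this branch and bound: per-axis squared distance terms between the query and a bucket's representative point are accumulated in decreasing-eigenvalue order, and the bucket is abandoned as soon as the partial sum exceeds R², so the large-variance axes prune early.

```python
def within_radius(per_axis_sq_dists, R):
    """per_axis_sq_dists: squared distance components between the query and a
    bucket's representative point, ordered by decreasing eigenvalue."""
    budget = R * R
    total = 0.0
    for d2 in per_axis_sq_dists:
        total += d2
        if total > budget:
            return False   # pruned: cannot be a region to be searched
    return True
```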
The database storage unit may be generated by applying ν hash functions (ν being an integer of 2 or more) to each point to compute ν indexes and projecting each point into a ν-dimensional space that is composed of ν bases and divided into a plurality of bins along each basis, with one bin identified by each index; the registration of each point into each bin is represented using a hash table.
This aspect relates to the description of "First improvement: hash-based point-to-bucket distance estimation" in the embodiments described later.
The database storage unit may project each point into a ν-dimensional space, and the multidimensional hash table may be built by selecting M sets of subspaces, each spanned by P of the ν orthogonal bases spanning the space, and dividing each subspace with the k-means method, with more divisions assigned to subspaces whose regions have larger variance; the search range determination unit may then determine the estimated values so that the estimation error of the squared distance between the query and the points existing in each region is minimized in each basis direction according to the probability density function obtained from the distribution of the vector data along that direction.
This aspect relates to the description of the "Sixth improvement" in the embodiments described later.
Preferred aspects of this invention also include combinations of any of the plurality of aspects shown here.
<<Hardware configuration example for implementing the invention>>
Here, an image recognition device is described as one concrete application of the approximate nearest neighbor search.
FIG. 20 is a block diagram showing an image recognition device as an application example of the approximate nearest neighbor search device of this invention. The approximate nearest neighbor search of this invention is realized by a computer on the image recognition device shown in FIG. 20 processing image data while using hardware resources such as a storage device. The hardware of the image recognition device consists, for example, of a CPU, a storage device such as a hard disk holding the program that describes the processing procedure the CPU executes, RAM providing a work area for the CPU, and input/output circuits for data. More concretely, the image recognition device may be realized by a personal computer with this configuration, or by a microcomputer with this configuration embedded in some equipment.
In FIG. 20, data of an image taken, for example, with a digital camera is input to the image recognition device via communication or a storage medium. The feature point extraction unit 11 is a block that extracts feature vectors from the pattern of the object contained in the input image data using a known method. In the configuration of FIG. 20, a feature vector represents a local feature of the image, and a plurality of feature vectors, each representing a local feature at a different location, is extracted from one image. As another mode, there is also a method of representing one image with a single feature vector.
The search range determination unit 13 applies a hash function to each feature vector to compute an index into the hash table 15h described later, and refers to the bins (buckets, when a plurality of hash functions is used) of the hash table 15h.
When registering image data in the image database 15, the CPU attaches an image ID identifying the image, stores it in the image database 15, and registers the feature vectors extracted from the image in the hash table 15h in association with the image ID. Specifically, the hash function is applied to classify each feature vector into one of the plurality of bins, and the vector is registered in that bin. That is, an index is computed by applying the hash function to each feature vector, and the image database 15 is created with each feature vector registered in the bin or bucket corresponding to its index. The hash function used at registration is the same one the search range determination unit 13 uses to compute indexes.
After images have been registered in the image database 15 as described above, when image data is input as a query, the image recognition device searches the images stored in the image database 15 for the one closest to the query image and outputs it as the recognition result. The search for the closest image is realized by comparing feature vectors: for each feature vector of the query, the nearest vector is searched for, and the approximate nearest neighbor search is applied to this nearest-vector search.
When an image is given as a query, the CPU, as the feature point extraction unit 11, extracts feature vectors from it. Below, the extracted feature vectors are called query vectors, and the feature vectors registered in the hash table 15h are called reference vectors.
As the search range determination unit 13, the CPU applies the hash function described above to each query vector to obtain an index, refers to the bin of the hash table 15h identified by that index, and takes the reference vectors registered in that bin as the nearest neighbor candidates for the query vector. Applying the hash function to a query vector, referring to a bin, and taking the reference vectors registered there as candidates corresponds to the first stage of the approximate nearest neighbor search: narrowing down the candidates, that is, the targets of distance computation. The search range determination unit 13 and the hash table 15h can be said to embody this first stage. The hash function is determined in advance with a balance in mind: the referenced bin should contain reference vectors with a high probability of being the nearest neighbor, while the number of reference vectors registered per bin should stay small.
When only one reference vector is registered in the referenced bin, the CPU takes the image ID associated with that reference vector as a recognition result candidate. When a plurality of reference vectors is registered there, the CPU, as the nearest neighbor determination unit 17, determines the nearest vector among them: it computes the distance between the query vector and each reference vector registered in the referenced bin and determines the reference vector closest to the query vector as the nearest vector, taking the image ID associated with it as a recognition result candidate. The nearest neighbor determination unit 17 corresponds to the configuration embodying the second stage of the approximate nearest neighbor search.
 前記CPUは、一つのクエリが与えられると、そのクエリから抽出された複数のクエリベクトルのそれぞれについて最近傍の参照ベクトルを求め、その参照ベクトルに関連付けられた画像IDを認識結果の候補とする。
 投票部19として前記CPUは、各クエリベクトルについて認識結果の候補とされた画像IDの投票を行う。投票の際に各画像ID(画像1から画像n)の得票数を記憶する投票テーブル21が設けられている。このように投票による多数決処理を経て認識結果を得る利点は、いくつかのクエリベクトルが誤った画像IDに対応付けられても、最終的に正しい認識結果が得られる可能性が高いことである。画像の撮影、特徴ベクトルの抽出の各過程は幾何学的歪み、解像度変換、明暗の変化等に伴う誤差要因を含んでいる。また、近似最近傍探索それ自体が処理時間とのトレードオフとして探索誤りを許容するために誤差要因を含む。よって、全てのクエリベクトルが正しい画像IDに対応付けられるとは限らない。高精度の画像認識を実現するうえでこのような多数決処理は有効である。
 画像選択部23として前記CPUは、投票テーブル21を参照して最大得票数を得た画像IDに係る画像を最終的な認識結果とする。
When a single query is given, the CPU obtains the nearest reference vector for each of a plurality of query vectors extracted from the query, and uses the image ID associated with the reference vector as a recognition result candidate.
As the voting unit 19, the CPU performs voting on image IDs that are candidates for recognition results for each query vector. A voting table 21 is provided for storing the number of votes for each image ID (image 1 to image n) at the time of voting. Thus, the advantage of obtaining the recognition result through the majority process by voting is that, even if several query vectors are associated with the wrong image ID, there is a high possibility that a correct recognition result is finally obtained. Each process of image capturing and feature vector extraction includes error factors associated with geometric distortion, resolution conversion, light and dark changes, and the like. Further, the approximate nearest neighbor search itself includes an error factor to allow a search error as a trade-off with the processing time. Therefore, not all query vectors are associated with the correct image ID. Such a majority process is effective in realizing highly accurate image recognition.
The CPU as the image selection unit 23 refers to the voting table 21 and uses the image related to the image ID that has obtained the maximum number of votes as the final recognition result.
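As a concrete illustration of this two-stage flow, a minimal Python sketch is given below. The data layout (a dict mapping a hash index to a list of (reference vector, image ID) pairs) and the use of integer image IDs from 0 to n - 1 are assumptions made for the sketch, not details of the embodiment.

    import numpy as np

    def recognize(query_vectors, hash_func, hash_table, num_images):
        """Two-stage approximate nearest neighbor search with voting.
        hash_table: dict mapping hash index -> list of (ref_vector, image_id)."""
        votes = np.zeros(num_images, dtype=int)       # voting table 21
        for q in query_vectors:
            bin_entries = hash_table.get(hash_func(q), [])
            if not bin_entries:                       # empty bin: no candidate
                continue
            # Stage 1 narrowed the candidates to one bin; stage 2 performs
            # exact distance calculation among the registered reference vectors.
            dists = [np.sum((q - ref) ** 2) for ref, _ in bin_entries]
            _, image_id = bin_entries[int(np.argmin(dists))]
            votes[image_id] += 1                      # vote for the candidate
        return int(np.argmax(votes))                  # image with the most votes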
The above is the configuration of the image recognition apparatus. Among its components, the search range determination unit 13, which applies the hash function to query vectors, the hash table 15h, which stores feature vectors in an organized form, and the nearest neighbor point determination unit 17, which determines the nearest neighbor vector among the reference vectors registered in the same bin, relate to approximate nearest neighbor search and can be regarded as the elements constituting a nearest neighbor search device. Since the image recognition apparatus of FIG. 20 outputs an image as the recognition result, the nearest neighbor point determination unit 17, for example, outputs the image ID associated with the nearest neighbor vector. The portion constituting the approximate nearest neighbor search device, however, is the portion that determines and outputs, for an input feature vector, the nearest feature vector among the feature vectors registered in the hash table; it includes neither the portion that stores image IDs in association with feature vectors nor the portion that outputs the image ID associated with the nearest neighbor vector. Note that the image recognition apparatus of FIG. 20 is one application example of nearest neighbor search, and the data handled by a nearest neighbor search device is not limited to feature vectors.
《Conventional Representative Approximate Nearest Neighbor Search Methods》

Next, conventional representative approximate nearest neighbor search methods are described. This will make the embodiments of the present invention described later easier to understand.

〈1. ANN〉

One of the most representative methods using a tree structure is Approximate Nearest Neighbor (ANN). ANN is based on a binary tree. In constructing the tree, the data space is hierarchically bisected, and the division is repeated until each leaf contains a single point. When a query is given, the tree is traversed and the distance to the data registered in the reached leaf is calculated. If that distance is r, the search region is the set of divided regions whose closest part lies within radius r/(1+ε) of the query. ε is an approximation parameter; if ε = 0, all regions that may contain a point closer than r are searched, so the true nearest neighbor is always obtained.
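For reference, this kind of ε-controlled tree search is available in common libraries; a brief sketch using scipy.spatial.cKDTree follows, where the data sizes and the value eps = 0.5 are arbitrary assumptions.

    import numpy as np
    from scipy.spatial import cKDTree

    rng = np.random.default_rng(0)
    data = rng.normal(size=(100000, 16))   # database points
    tree = cKDTree(data)                   # hierarchical bisection of the space

    query = rng.normal(size=16)
    # eps plays the role of the approximation parameter epsilon: with eps=0 the
    # true nearest neighbor is always returned, while eps>0 prunes regions and
    # trades accuracy for speed.
    dist, idx = tree.query(query, k=1, eps=0.5)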
〈2. FLANN〉

Fast Library for Approximate Nearest Neighbors (FLANN) is a library that selects an approximate nearest neighbor search method suited to a given database and provides its parameter tuning. In addition to exhaustive search, this library adopts two of the best-performing proposed methods: the randomized kd-tree (see, for example, Non-Patent Document 12) and hierarchical k-means (see, for example, Non-Patent Document 13).
The randomized kd-tree is a technique that selects nearest neighbor candidates from multiple trees. In a method using an ordinary kd-tree, such as ANN, the tree structure is built by bisecting the space while changing the data element of interest in turn. In this case, to obtain high accuracy for high-dimensional data, the search range must be extended even to leaves that structurally have a low probability of containing the nearest neighbor, which incurs processing time for traversing the tree and many wasteful distance calculations. The randomized kd-tree therefore performs principal component analysis and builds each kd-tree using only the top D basis vectors that contribute most to the distance calculation. The basis used at each level is selected randomly, so that multiple trees are constructed. Consequently, even if the search range of each individual tree is small, high accuracy can be secured by traversing multiple trees, and higher performance than an ordinary kd-tree is obtained.

Hierarchical k-means, as its name suggests, clusters the points belonging to each node by k-means and divides the space into clusters at each level of the hierarchy.
〈3. LSH〉

Locality Sensitive Hashing (LSH) is one of the most representative approximate nearest neighbor search methods using hashing. Here, among the variants of LSH, the one relevant to the present invention, namely LSH usable in a vector space (see, for example, Non-Patent Document 8), is described. LSH uses multiple hash functions to select points considered to be in the vicinity of the query and performs distance calculations only on those points. That is, the data space is divided at equal intervals along multiple randomly generated bases, thereby partitioning the space into regions called buckets for indexing.
FIG. 21 is an explanatory diagram of LSH, a conventional approximate nearest neighbor search method. FIG. 21(a) shows the data space equally divided into hash table bins along the directions of two randomly generated bases a1 and a2. Each region divided along axis a1 is a bin indexed by the hash function hj1 associated with basis a1, and each region divided along axis a2 is a bin indexed by the hash function hj2 associated with basis a2. The index values are shown along each axis. Each cell-like region where these two kinds of bins intersect, i.e., each product region where the bins of the two dimensions of the two-dimensional hash table intersect, is a bucket. The numbers in each bucket indicate the index values of the hash functions hj1 and hj2.

At search time, the points belonging to the same bucket as the query are taken as nearest neighbor candidates. However, this alone is highly likely to miss the true nearest neighbor, so this process is repeated several times to increase the number of candidates and raise the accuracy. FIG. 21(b) shows the search region obtained by three projections.
In the LSH of Non-Patent Document 8, a family of hash functions of the following form is used:

    h_{ji}(x) = ⌊ ( a_i · x + b_i ) / W ⌋     (1)

    H_j(x) = ( h_{j1}(x), h_{j2}(x), ..., h_{jk}(x) )     (2)

where x is an arbitrary point, a_i is a vector whose elements are drawn independently from a Gaussian distribution, W is the hash width, and b_i is a real number chosen uniformly from the interval [0, W]. The nearest neighbor candidates for a query q are the points p such that H_j(q) = H_j(p) for some j = 1, ..., L. LSH can find approximate nearest neighbors because it uses locality sensitive hash functions: hash functions under which points close to each other have a high probability of taking the same hash value (index) and points far from each other have a low probability of taking the same hash value.
As in equation (2), LSH combines k hash functions h_{ji} to form a hash function group H_j, which corresponds to a k-dimensional hash table. The gray region in FIG. 21(a) is the product region (common region) of the bins obtained by applying the two hash functions h_{j1} and h_{j2} to the query when k = 2 (the projected space is two-dimensional), in other words, the bucket obtained by applying the hash function group H_j to the query. This bucket is the target region of distance calculation for one function group H_j. L such hash function groups H_j (L sets) are created, and the union of the resulting L regions is finally taken as the target region for distance calculation with the query. FIG. 21(b) shows the case L = 3. LSH speeds up processing by reducing the targets of distance calculation in this way.

Since LSH projects onto random oblique bases that do not depend on the data distribution, it is not efficient from the viewpoint of how well the projected space preserves distances in the data space.
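A minimal Python sketch of the hash family of equations (1) and (2) and of candidate collection over the L tables follows; all names and parameter values are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    def make_lsh_table(k, dim, W):
        """One group H_j: k functions h_ji(x) = floor((a_i . x + b_i) / W)."""
        A = rng.normal(size=(k, dim))        # a_i: Gaussian random vectors
        b = rng.uniform(0.0, W, size=k)      # b_i: uniform on [0, W]
        return A, b, {}

    def key(A, b, W, x):
        return tuple(np.floor((A @ x + b) / W).astype(int))   # H_j(x)

    W, L, k, dim = 4.0, 3, 2, 16
    data = rng.normal(size=(1000, dim))
    tables = [make_lsh_table(k, dim, W) for _ in range(L)]
    for A, b, table in tables:
        for i, x in enumerate(data):
            table.setdefault(key(A, b, W, x), []).append(i)

    def candidates(q):
        """Union, over the L tables, of the points sharing a bucket with q."""
        cand = set()
        for A, b, table in tables:
            cand.update(table.get(key(A, b, W, q), []))
        return cand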
〈4. SH〉

An outline of Spectral Hashing (SH), a representative method among those using binary-code hash functions, is given here. SH is said to achieve good performance among hashing-based methods. SH selects several top principal components of the data space and projects the data onto a Hamming space; points whose distance in the projected Hamming space (the Hamming distance) is at most a threshold are taken as nearest neighbor candidates. That is, attention is paid only to the top principal component bases, each sample is converted into a binary code, and nearest neighbor candidates are selected according to the Hamming distance from the query. SH encoding assumes a uniform distribution of the data space, divides the space so that the divided regions are as close to rectangular boxes as possible, and assigns a binary code to each bucket.
FIG. 22 is an explanatory diagram of SH, a conventional approximate nearest neighbor search method, showing the data space projected onto a two-dimensional Hamming space formed by two principal component bases pv1 and pv2. The indices along each axis are shown as binary codes, and the code of each bucket is the combination of the indices on axes pv1 and pv2. The query belongs to the bucket with code 111. The gray region represents the search region for the query when the upper limit of the Hamming distance is 1: in addition to the bucket with code 111, the buckets with codes 110, 101, and 011, which differ from 111 in only one bit, form the search region.

Since SH projects the data space onto principal component bases, the original distances tend to be preserved after projection; however, because distances in the projected space are expressed as Hamming distances, errors with respect to the Euclidean distance arise. For example, in FIG. 22, the region 011, which is far from the bucket 111 onto which the query is projected, becomes a nearest neighbor candidate.
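A minimal Python sketch of this Hamming-radius candidate selection follows; binarizing the centered principal component projections by their sign is a simplification assumed here, not SH's actual eigenfunction-based code construction.

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(size=(1000, 16))

    # Project onto the top principal components and binarize each projection.
    mean = data.mean(axis=0)
    _, _, Vt = np.linalg.svd(data - mean, full_matrices=False)
    P = Vt[:4]                                        # top 4 principal axes
    codes = (((data - mean) @ P.T) > 0).astype(np.uint8)

    def hamming_candidates(q, radius=1):
        qc = (((q - mean) @ P.T) > 0).astype(np.uint8)
        ham = np.count_nonzero(codes != qc, axis=1)   # Hamming distance
        return np.nonzero(ham <= radius)[0]           # candidate indices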
〈5. The Method of Sato et al.〉

The method of Sato et al. (see, for example, Non-Patent Document 11) obtains an approximate value of the distance from the centroid of the bucket containing the query to the centroid of each bucket (the bucket distance), uses it as an estimate of the distance from the query to each point, and searches for points whose distance from the query is small.

In the method of Sato et al., the data space is projected onto a space divided equally with a common division width along arbitrary orthonormal bases, and this is represented by a multidimensional hash. This processing is equivalent to scalar quantization of the data, and distances in the projected space form a data structure that reflects the distances in the original data space (the true distances) well. At search time, the distance between the bucket to which the query belongs and the bucket to which each point belongs is obtained, so that nearest neighbor candidates are extracted from an approximately hyperspherical region centered on the query.
FIG. 23 is an explanatory diagram showing the procedure for determining the distance of each bucket in the method of Sato et al., a conventional approximate nearest neighbor search method. In FIG. 23, the projected space is divided along two orthonormal bases, the vertical axis and the horizontal axis; that is, the number of hash dimensions is ν = 2. The space is divided into bins with a common division width along each of the vertical and horizontal bases. The numbers attached to the vertical and horizontal axes are indices. Each cell-like section is a bucket, the product region of bins. The number sequences 11 to 33 assigned to the buckets are combinations of the indices of the bins that form each bucket. Such a number sequence can be regarded as a position vector indicating the position of the bucket in the data space, and the distance D between buckets is defined from this sequence of indices, i.e., from the bucket position vectors.

Once the bucket containing the query is known, the distance from that bucket to any other bucket can be obtained from the bucket indices. The search for the nearest neighbor therefore only has to refer to the data contained in the buckets in ascending order of bucket distance. In this way, a search over a hypersphere region centered on the query can be realized, and only points whose approximate distance from the query is small need be targets of distance calculation.
Now, let x be an arbitrary point, Ψ_i an orthonormal basis vector, and W the division width (bin width). The ν-dimensional hash function H is then as follows:

    h_i(x) = ⌊ ( Ψ_i · x ) / W ⌋,   H(x) = ( h_1(x), h_2(x), ..., h_ν(x) )     (5)

As the approximate distance between two points, the expected value of the squared Euclidean distance obtained from the hash values of the buckets to which two arbitrary points p_1 and p_2 belong in the projected space is used. Assuming a uniform distribution of the data space and independence of the bases, the expected value of the distance is as follows.
    E[ ‖ p_1 − p_2 ‖² ] = Σ_{i=1}^{ν} W² ( h_i(p_1) − h_i(p_2) )² + ν W²/6

From the above equation, it is sufficient to use the distance in the projected space (the bucket distance) for comparing distances, and the search is performed using it as the approximate distance. Note that the squared distance is one example.
Here, letting B(p) denote the centroid of the bucket to which an arbitrary point p belongs, the bucket distance between the buckets to which two points p_1 and p_2 belong is expressed as follows:

    D( B(p_1), B(p_2) ) = Σ_{i=1}^{ν} ( h_i(p_1) − h_i(p_2) )²     (6)
Search region selection is explained using FIG. 23 as an example. FIG. 23 shows the case where the bin index obtained by applying the hash function to the query is 〈2,2〉. First, distance calculations are performed with the points registered in the bucket with the same index as the query. Next, the buckets at bucket distance 1 are searched; in FIG. 23 these are the buckets with indices 〈1,2〉, 〈2,1〉, 〈2,3〉, and 〈3,2〉, which are searched in order. If it is judged that a sufficient number of buckets have been searched, the search is terminated. If the number of buckets is judged to be still insufficient, the search range is expanded to more distant buckets; in FIG. 23, these are the buckets with indices 〈1,1〉, 〈3,1〉, 〈3,3〉, and 〈1,3〉.
FIG. 1 is an explanatory diagram showing an example of distance estimation in the conventional method of Sato et al. In FIG. 1, the vertical and horizontal axes are two-dimensional orthonormal bases as in FIG. 23, and the star represents the query. Unlike FIG. 23, the numbers along each axis represent the bucket distance weights in the horizontal and vertical directions with the query bin as the reference (origin); in FIG. 1, each weight is the squared Euclidean distance from the bin containing the query. The numbers inside the buckets represent the estimated distances between buckets, each being the sum of the horizontal and vertical weights. To judge whether the number of searched buckets is sufficient, a search radius R is given at search time, and the points p satisfying D(B(q), B(p)) ≤ R are taken as nearest neighbor candidates. If the search radius is R = 2, the nine buckets with estimated distance at most 2 are selected as nearest neighbor candidates. R may also be determined according to the query.
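A minimal Python sketch of this candidate selection follows, using the hash of equation (5) and the bucket distance of equation (6); iterating over the occupied buckets instead of visiting them in ascending order of distance is a simplification assumed for the sketch.

    import numpy as np

    def hash_index(x, bases, W):
        """Equation (5): one bin index per orthonormal basis."""
        return tuple(int(np.floor(np.dot(psi, x) / W)) for psi in bases)

    def candidates_within_R(q, bases, W, buckets, R):
        """Equation (6): bucket distance = squared difference of index vectors.
        buckets: dict mapping index tuple -> list of point ids."""
        hq = np.array(hash_index(q, bases, W))
        cand = []
        for idx, points in buckets.items():             # occupied buckets only
            D = int(np.sum((np.array(idx) - hq) ** 2))  # D(B(q), B(p))
            if D <= R:                                  # within search radius
                cand.extend(points)
        return cand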
In this method, when the number of dimensions of the projected space is small relative to the number of dimensions of the data space, the accuracy of the approximate distance drops. However, since the hash size grows exponentially with the number of hash dimensions ν, an enormous hash size is required to maintain accuracy for high-dimensional data. It is difficult to increase ν for high-dimensional data, and consequently sufficient approximate-distance accuracy cannot be obtained for such data. Moreover, if the hash size becomes too large, many buckets must be referred to in order to maintain the accuracy of the nearest neighbor search, and extracting the nearest neighbor candidates takes much time.

〈6. IVFADC and IMI〉

Inverted File with Asymmetric Distance Calculation (IVFADC) and its improved version, the Inverted Multi-Index (IMI), partition (index) the data space by quantizing it coarsely (coarse quantization) with the k-means method. Let C denote the set of representative values (centroids) of the clusters (divided regions) obtained by the coarse quantization, and let the total number of clusters, i.e., the total number of centroids, be |C| = G. The expected value of the distance from the query to a point belonging to a given region (the estimated distance) is the distance to the centroid of that region. It is therefore efficient to find the centroids close to the query and take the points belonging to their regions as the nearest neighbor candidates.

1. Nearest neighbor candidate selection in IVFADC

IVFADC uses simple vector quantization for the coarse quantization (see, for example, H. Jegou, M. Douze, and C. Schmid, "Product quantization for nearest neighbor search," IEEE Trans. TPAMI, vol. 33, no. 1, pp. 117-128, 2011). To improve the selection accuracy of the nearest neighbor candidates, the space must be divided finely (G must be made large). However, the computational cost of nearest neighbor candidate selection is O(G), so G cannot take a large value. There is therefore the problem that sufficient accuracy of nearest neighbor candidate selection cannot be obtained.

2. Nearest neighbor candidate selection in IMI

IMI was proposed as an improvement of IVFADC (see, for example, A. Babenko and V. Lempitsky, "The inverted multi-index," Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3069-3076, 2012). For the coarse quantization, product quantization is performed by dividing a vector x into two partial vectors U_1(x) and U_2(x). Let the resulting sets of partial centroids be C_1 and C_2, each with |C_1| = |C_2| = g elements. The set of centroids is obtained as C = C_1 × C_2, and its size is G = g². Define the centroid formed by concatenating the i-th element of C_1 and the j-th element of C_2 as c_ij = {c_i^1, c_j^2}, and write the distance from the query to the i-th element of the m-th partial centroid set as

    F^m_i(q) = ‖ U_m(q) − c_i^m ‖²

Then the estimated squared distance F_ij(q) from the query q to the points belonging to c_ij is expressed as

    F_ij(q) = F^1_i(q) + F^2_j(q)

The problem of selecting the centroids in ascending order of distance from the query therefore reduces to the problem of picking one element from each of the two partial distance lists

    { F^1_i(q) } (i = 1, ..., g)   and   { F^2_j(q) } (j = 1, ..., g)

and searching for the combinations of i and j whose sum F_ij is small. IMI solves this problem with a combinatorial search method called the Multi-Sequence algorithm. Each time this algorithm outputs the index combination with the smallest distance at that moment, it adds, as index candidates, the index pairs that may yield the next smallest distance, and repeats the process of selecting the next combination from among them. The search ends when the number of obtained nearest neighbor candidates reaches L.
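As a concrete illustration of the Multi-Sequence traversal just described, a minimal Python sketch follows. The sorted partial distance lists F1 and F2 and the mapping buckets from centroid-id pairs to point ids are inputs assumed for the sketch, not taken from the cited papers.

    import heapq

    def multi_sequence(F1, F2, L, buckets):
        """F1, F2: partial distances sorted ascending, as (dist, centroid_id).
        buckets: dict mapping (id1, id2) -> list of point ids.
        Collects candidates in ascending order of F1[i][0] + F2[j][0]."""
        heap = [(F1[0][0] + F2[0][0], 0, 0)]
        seen = {(0, 0)}
        cand = []
        while heap and len(cand) < L:
            _, i, j = heapq.heappop(heap)
            cand.extend(buckets.get((F1[i][1], F2[j][1]), []))
            # Push the index pairs that may yield the next smallest sum.
            for ni, nj in ((i + 1, j), (i, j + 1)):
                if ni < len(F1) and nj < len(F2) and (ni, nj) not in seen:
                    heapq.heappush(heap, (F1[ni][0] + F2[nj][0], ni, nj))
                    seen.add((ni, nj))
        return cand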
For an equal number of space divisions G, product quantization is less accurate than vector quantization. However, using product quantization makes the Multi-Sequence algorithm applicable, and the nearest neighbor candidates can be obtained at the computational cost of

Figure JPOXMLDOC01-appb-M000008

so the search can be accelerated. The above-cited document "The inverted multi-index" reports that solutions are obtained faster than with IVFADC.

The conventional representative approximate nearest neighbor search methods have been described above.
《Distance Estimation Method According to the Present Invention》

The present invention is described in more detail below with reference to the drawings. Note that the following description is illustrative in all respects and should not be construed as limiting the present invention.

For approximate nearest neighbor search to be both accurate and fast, it is important that the extraction of nearest neighbor candidates reduce the candidates without missing the true nearest neighbor, and that this process itself be fast. If a hypersphere region centered on the query could be searched to obtain the nearest neighbor candidates, the true nearest neighbor would never be missed, which is ideal. When the data space is high-dimensional, however, realizing this without performing distance calculations is not easy. If the approximate distances based on hash indices preserved the distances in the original space well, it might be possible to estimate the distance from the query based on the indices. In many conventional methods, however, distances in the projected space do not sufficiently preserve distances in the original space, and the estimated distances do not sufficiently preserve the ordering of the true distances. In that case, many nearest neighbor candidates must be secured to raise the search accuracy, and a solution cannot be obtained fast.
The inventors therefore focused on the method of Sato et al., whose approximate distances reflect the ordering of the true distances well, and propose a further improved approximate nearest neighbor search method. This enables more accurate and faster distance estimation than the conventional methods, and in turn speeds up the approximate nearest neighbor search process as a whole.

To improve the selection accuracy of nearest neighbor candidates in the first stage of the approximate nearest neighbor search and to execute the candidate selection fast, the present invention pays particular attention to the distance estimation method and to an adaptive method of determining the search range.
The first improvement over the distance estimation in the method of Sato et al. is a technique for estimating point-to-bucket distances based on the hash. The second improvement is a technique using partitioned hash tables. The third improvement is a space division technique suited to distance estimation. The fourth improvement is a distance estimation technique based on probability density functions. The fifth improvement concerns the adaptive search range: an expansion of the search region that takes the data density around the query into account. Each technique can be applied alone, but all or some of them can also be applied in combination. The details of each technique are described below, together with experimental examples showing their effectiveness.
In the following description, the approximate nearest neighbor search method that applies the first improvement, point-to-bucket distance estimation, and the second improvement, hash table partitioning, is called Bucket Distance Hashing (BDH). The method that further combines BDH with the third improvement, space division suited to distance estimation, is called k-means BDH. The method that further combines k-means BDH with the fourth improvement, distance estimation based on probability density functions, is called k-means BDH P. The method that combines k-means BDH P with the fifth improvement, the adaptive search range, is called k-means BDH PC.
〈First Improvement: Point-to-Bucket Distance Estimation Based on the Hash〉

The first improved technique mainly concerns raising the accuracy of distance estimation. The method of Sato et al. estimated the distance between the bucket to which the query belongs and the bucket to which each data point belongs, i.e., the bucket-to-bucket distance. Here, a technique is proposed that improves the accuracy of distance estimation while keeping the same data structure, by estimating the distance from the exact query position to the bucket to which each data point belongs, i.e., the point-to-bucket distance.
This technique uses equation (5), the same hash function as the method of Sato et al. That is, for two points p_1 and p_2, the coordinates of p_1 are used as they are, while a representative point is determined for p_2 by computing the expected value of the coordinates of the points belonging to the same bucket as p_2. Specifically, the representative point is determined so that its coordinates are the expected coordinates of the points belonging to the bucket; that is, the representative point is the centroid of the bucket. The Euclidean distance between the coordinates of p_1 and the coordinates of the representative point of the bucket to which p_2 belongs is then calculated and used as the approximate distance. First, the expected value of the squared Euclidean distance in the direction of basis i is as follows:

    E[ ( Ψ_i · p_1 − Ψ_i · p_2 )² ] = ( Ψ_i · p_1 − W( h_i(p_2) + 1/2 ) )² + W²/12

From the assumption that the bases are independent, the expected value of the squared Euclidean distance in the ν-dimensional space is as follows:

    E[ ‖ p_1 − p_2 ‖² ] = Σ_{i=1}^{ν} { ( Ψ_i · p_1 − W( h_i(p_2) + 1/2 ) )² + W²/12 }

From the above equation it can be seen that the following expression suffices for comparing distances; this is equivalent to calculating the distance between p_1 and the centroid, taken as the representative point, of the bucket to which p_2 belongs, and the search is performed using it as the approximate distance. Note that the squared Euclidean distance is one example, and the essence of the present invention is not limited to it.
The distance between the point p_1 and the bucket B(p_2) to which p_2 belongs, i.e., the point-to-bucket distance, is expressed as follows:

    D( p_1, B(p_2) ) = Σ_{i=1}^{ν} ( Ψ_i · p_1 / W − ( h_i(p_2) + 1/2 ) )²     (7)

In equation (7), (h_i(p_2) + 1/2) represents the Ψ_i-direction coordinate of the centroid of the bucket B(p_2), and the distance of equation (7) is equal to the distance between the query and the bucket centroid. Since the position of the query itself is used, distance estimation more accurate than that of equation (6) is realized. The error variance of the estimated distance becomes 1/W² of that of the method of Sato et al., which estimates the bucket-to-bucket distance.
Here, for later explanation, the distance of equation (7) is multiplied by W and rewritten as follows:

    D( p_1, B(p_2) ) = Σ_{i=1}^{ν} BD_i( p_1, B(p_2) )²,   BD_i( p_1, B(p_2) ) = Ψ_i · p_1 − W( h_i(p_2) + 1/2 )     (8)

Here, BD_i(p_1, B(p_2)) represents the distance between the point p_1 and the centroid of the bucket B(p_2) in the direction of the i-th basis.

Although this rewriting multiplies the estimated distance by W, the selection of the nearest neighbor candidates is based on relative distances, so the result does not change.
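A minimal Python sketch of the point-to-bucket distance of equation (8) follows; the argument layout is an assumption made for the sketch.

    import numpy as np

    def point_to_bucket_distance(q, bucket_index, bases, W):
        """Equation (8): sum over the bases of the squared distance between
        the query's projection and the bucket centroid W * (h_i + 1/2)."""
        total = 0.0
        for psi, h in zip(bases, bucket_index):
            BD_i = np.dot(psi, q) - W * (h + 0.5)   # i-th basis-direction distance
            total += BD_i ** 2
        return total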
FIG. 2 is an explanatory diagram showing an example of distance estimation according to this embodiment, corresponding to FIG. 1, which shows distance estimation in the conventional method of Sato et al. As in FIG. 1, the star in FIG. 2 represents the query. Taking the distance between bucket centers as 1, in the horizontal direction the query lies 0.6 from the center of the left bucket and 0.4 from the center of the right bucket, and in the vertical direction 0.7 from the center of the upper bucket and 0.3 from the center of the lower bucket. The numbers along the axes (2.56, 0.36, etc.) represent the bucket distance weights in the horizontal and vertical directions, and the numbers inside the buckets (3.05, 0.85, etc.) represent the estimated distance weights between the query and each bucket. The weights are squared Euclidean distances.
FIG. 3 is an explanatory diagram showing a different example of distance estimation according to this embodiment, with the number of hash dimensions ν = 2; it represents a search space and query different from those of FIG. 2. The circle centered on the query in FIG. 3 represents the size of the search radius: the buckets whose centers lie within the circle form the search region, shown in gray. At search time, the buckets within the search radius R given as a parameter are referred to. Since the search range is the interior of a circle centered on the query itself (a hypersphere when the number of hash dimensions ν is extended to an arbitrary natural number) rather than on the center of the bucket containing the query, distance estimation is more accurate than in the method of Sato et al., which estimates bucket-to-bucket distances.
The process of selecting the buckets within the search radius R in this way is more complicated than in the method of Sato et al. The reason is that the estimated distances were limited to integers in the method of Sato et al., whereas in BDH they are real numbers, so handling them exactly would take computation time. To avoid this problem, the present invention proposes an algorithm that identifies the buckets within the search radius R at high speed. FIG. 18 shows two such algorithms, Algorithm 1 and Algorithm 2, based on the branch and bound method. In both algorithms, the number of dimensions of a feature vector is d, the query is q = {q_1, q_2, ..., q_d}, and BD_ij = BD_i(q, j) denotes the distance, in the direction of the i-th basis, between the query and the centroid of the j-th bin of that basis.

When the number of hash dimensions is ν, ν coordinate values are needed to obtain the distance from the query to the representative point of a bucket. The coordinate values of the representative point of a bucket are obtained from the ν hash values H(x) = {h_1(x), h_2(x), ..., h_ν(x)}. Once these are determined, the distance to the bucket is obtained by ν additions.

Here, the ν bases are selected by principal component analysis and are arranged in descending order of eigenvalue, so the data has large variance in the directions of the higher-ranked bases (those with smaller subscripts). Consequently, the most efficient way to search for the buckets within radius R is to evaluate the bases in subscript order, as described below.

Suppose that the top i of the ν hash values, h_1(x), h_2(x), ..., h_i(x), have already been determined; the buckets within radius R are then searched by choosing the remaining ν − i hash values. Let D_i be the distance computed from the top i hash values, and let mD_i be the distance obtained by adding up the minimum values attainable by each of the remaining ν − i hash functions. If the sum of D_i and mD_i exceeds the search radius R, the condition of lying within radius R of the query can never be satisfied as long as those top i hash values are used. In this way, the buckets lying within radius R of the query are enumerated in turn.

In lines 1 to 3 of Algorithm 1, for each i, the minimum distance mD_i attainable over the remaining ν − i dimensions, assuming their hash values can be chosen freely once the hash values of the top i dimensions are determined, is computed. In lines 4 to 6, if the distance from the query to the nearest bucket, Σ_{b=1}^{ν} min_j BD_bj, is smaller than the search radius R, the function of Algorithm 2 is called.

Algorithm 2 takes two arguments: i, indicating which dimension's hash value is to be determined next, and the already fixed distance D over the first i − 1 dimensions. In lines 1 to 7, if the last basis (the stage that determines the ν-th hash value) has not been reached (line 1), the k_i hash values (indices) are tried (line 2), where k_i is the number of divisions of the i-th basis, i.e., the number of bins along the i-th basis. Since BD_ij is the one-dimensional distance on the i-th basis (the distance component in the i-th basis direction) when hash value j (the j-th bin) is selected, D + BD_ij + mD_i is the distance to the nearest bucket at the stage when i hash values have been determined; only when this is smaller than the search radius R does the algorithm call itself recursively and proceed to the next basis (lines 3 to 5). Lines 9 to 13 are executed only when the last, ν-th basis has been reached; if a bucket lies within the search radius, it is looked up in the hash table.
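A minimal Python sketch of the pruning logic of Algorithms 1 and 2 follows; the precomputed table BD of per-basis distance terms and the recursive formulation are assumptions made for the sketch rather than a transcription of FIG. 18.

    import numpy as np

    def search_within_R(BD, R):
        """BD[i][j]: the i-th basis' distance term when bin j is chosen.
        Enumerates every bucket index tuple whose accumulated distance
        is at most the search radius R."""
        nu = len(BD)
        mins = [min(row) for row in BD]
        suffix = np.cumsum(mins[::-1])[::-1]   # suffix[i] = sum of minima over bases i..nu-1
        mD = np.append(suffix[1:], 0.0)        # mD[i] = minima over the remaining bases
        found = []

        def descend(i, prefix, D):             # corresponds to Algorithm 2
            if i == nu - 1:                    # last basis: emit the buckets
                for j, bd in enumerate(BD[i]):
                    if D + bd <= R:
                        found.append(prefix + (j,))
                return
            for j, bd in enumerate(BD[i]):
                if D + bd + mD[i] <= R:        # branch-and-bound pruning
                    descend(i + 1, prefix + (j,), D + bd)

        if suffix[0] <= R:                     # corresponds to Algorithm 1
            descend(0, (), 0.0)
        return found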
〈Second Improvement: Hash Table Partitioning〉

The method of Sato et al. has the problem that fast search cannot be performed on high-dimensional data. This problem arises from the relationship between the number of hash dimensions and the number of buckets contained in the hash table (hereinafter, the hash size). When the distance between two points is computed by projecting onto a low-dimensional subspace, the lower the dimensionality of the subspace, the lower the accuracy of the estimated distance. Maintaining a given estimation accuracy therefore requires a subspace with a correspondingly large number of dimensions. The same holds when estimating distances with a hash: to maintain the accuracy of the estimated distances, the number of hash dimensions ν must be increased in accordance with the dimensionality of the data. However, increasing ν makes the hash size enormous, so that the table no longer fits in memory.
For example, if the number of divisions of one basis, i.e., the number of bins, is s, the hash table must provide data storage areas corresponding to s bins for each of the ν bases, so the order of the hash table size (hash size) is O(s^ν). Even if the number of bins s is kept to the minimum of 2, about one billion data storage areas are needed to construct a ν = 30 dimensional hash.
Therefore, in order to suppress the growth of the hash size as the dimensionality of the data increases, a technique is proposed that partitions the high-dimensional hash table and obtains the estimated distance of the high-dimensional hash by integrating the estimated distances obtained from low-dimensional hash tables. When the ν-dimensional hash table is partitioned into M tables, the ν hash functions are divided into M groups as follows:

    H_j(x) = ( h_{j,1}(x), h_{j,2}(x), ..., h_{j,t_j}(x) ),   j = 1, ..., M

where Σ_{j=1}^{M} t_j = ν. That is, H_j(x) is the hash function group of the j-th of the M groups into which the hash functions are divided, and the total number of dimensions of the divided hash functions H_j(x) is ν.
The estimated distance from the query q to an arbitrary point p can then be expressed as

    D(q, p) = Σ_{j=1}^{M} D_j( q, B_j(p) )

which is equal to the estimated distance obtained with the ν-dimensional hash. The targets of distance calculation here are the points that lay within the search radius R of the query in at least one of the hash tables. If a point subject to distance calculation lay within the search radius R of the query in the j-th hash table, its distance in that table is D_j(q, B_j(p)); if it did not, D_j(q, B_j(p)) = R is used. The advantage of partitioning the hash table is that, even when a hash of the same number of dimensions is expressed, the hash size becomes dramatically smaller than when a single hash table is used. Consider the case where each basis direction is divided into s parts. When a ν-dimensional hash is expressed by a single hash table, the hash size is O(s^ν), whereas when it is expressed by M partitioned hash tables, the size of each partitioned table is O(s^(ν/M)), which decreases exponentially with M; the hash as a whole remains at most M times that size. Therefore, by expressing a multidimensional hash with partitioned hash tables, the accuracy of distance estimation can be improved even for high-dimensional data without slowing down the extraction of the nearest neighbor candidates.
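A minimal Python sketch of this distance integration over the M partitioned tables follows; the per-table search results are assumed to be given as dicts from point id to partial distance.

    def integrated_distances(per_table_hits, R):
        """per_table_hits: list of M dicts, one per partitioned hash table,
        mapping point id -> D_j(q, B_j(p)) for the points found within the
        search radius R in that table."""
        candidates = set()
        for hits in per_table_hits:
            candidates.update(hits)            # union over the M tables
        # A point not found within R in table j contributes D_j = R, as above.
        return {p: sum(hits.get(p, R) for hits in per_table_hits)
                for p in candidates}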
〈Experimental Examples for the First and Second Improvements〉

Experiments were conducted to confirm the effectiveness of the first and second improvements described above. The contents and results of the experiments are described below.

〈1. Preliminary Experiment〉

FIG. 4 is a graph showing the relationship between accuracy and processing time when the number of dimensions ν of the multidimensional hash in the conventional method of Sato et al. is varied. The data used here is 64-dimensional or 128-dimensional; FIG. 4(a) shows the 64-dimensional case and FIG. 4(b) the 128-dimensional case. The database consists of 10 million points of artificial data generated from a normal distribution, and the queries are 2,000 points generated under the same conditions. The computer used had an Opteron(tm) 6174 (2.2 GHz) CPU and 256 GB of memory, and the experiments were run on a single core.

For the artificial data, ν = 24 gave the best relationship between accuracy and processing time both for the method of Sato et al. and for BDH (not shown), so ν = 24 is used hereafter in the experiments with artificial data in this section.
〈2. Experiment〉

In this section, to evaluate the performance of the proposed BDH, comparison experiments between the conventional methods introduced in the previous section and the present invention are conducted. The same computer as in the preliminary experiment was used. As the bases used in the method of Sato et al. and in BDH, for the artificial data the ν original basis vectors with the largest variances were chosen, and for the real data the ν principal components with the largest variances obtained by principal component analysis were chosen.
FIGS. 5 and 6 and Table 1 show, for ANN, SH, the method of Sato et al., and the BDH according to the present invention, the relationship between accuracy (the fraction of queries for which the true nearest neighbor was obtained) and processing time (the average time from when a query is given until a solution is obtained) under the optimal parameters, together with the memory usage at that time. Here, "optimal" refers to the setting that minimizes the processing time when compared at the same accuracy. Preliminary experiments showed that the optimal parameters are a bit length of log_2 n for SH, and, for the method of Sato et al. and BDH, a number of dimensions ν = log_2 n × M and a division width W = {max(Ψ_ν · p) − min(Ψ_ν · p)}/2. These parameter values make the hash size comparable to the number of data points n.
Four kinds of data were prepared, 10 million points each: artificial data following normal distributions in 64, 128, and 256 dimensions (the variance of each basis chosen uniformly between 100 and 400), and SIFT features (128-dimensional) extracted from the frame images of the videos distributed for the TRECVID2010 Instance Search task (for SIFT, see, for example, D.G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004). The queries are 2,000 points created under the same conditions as the database, and the averages over them are reported as results.
For the artificial data, the relationship between accuracy and processing time is shown in FIG. 5: FIG. 5(a) gives the results for the 64-dimensional, FIG. 5(b) for the 128-dimensional, and FIG. 5(c) for the 256-dimensional artificial data. The memory usage at that time is shown in Table 1. The results for the image data are shown in FIG. 6. In FIGS. 5 and 6, the horizontal axis is accuracy and the vertical axis is processing time. In the legends, "single hash" denotes the present invention without hash table partitioning, and "partitioned hash" denotes the present invention with hash table partitioning.
As a result of the above experiments, BDH was the fastest on all data sets when compared at the same accuracy. For the artificial data, comparing the single hash with the partitioned hash, the single hash gives slightly better results for low-dimensional data, but as the number of dimensions grows, the effectiveness of the partitioned hash appears. This is because hash table partitioning increases the number of bases taken into consideration in the search, which suppresses the decline in the accuracy of the estimated distances. Hash table partitioning can therefore be said to be effective for high-dimensional data.

Looking at the results for the SIFT features (128-dimensional), the single hash is dominant. The processing times show that, compared at the same accuracy, solutions were obtained faster than for the 64-dimensional artificial data. That is, although SIFT features are nominally 128-dimensional, their effective dimensionality is less than half of that, which is presumably why the single hash became dominant.
[Table 1: memory usage at the optimal parameters]
In the following, BDH is improved by three further approaches. The first is a space partitioning suited to distance estimation; the second is distance estimation based on probability density functions; the third is an expansion of the search region that takes the data density around the query into account. The details follow.
<Third improvement: partitioning the data space for distance estimation>
BDH assumes a uniform distribution of the data, divides each basis into equal intervals, and takes the center of each region as the representative vector of the points belonging to the bucket. Real data, however, are generally not uniform, and under such a scheme the distance-estimation error is large. The inventors therefore obtain representative values for the elements of each basis by the k-means method, as shown in Fig. 7, and minimize the estimation error through an adaptive partitioning matched to the data distribution. This approach improves the accuracy of distance estimation, so nearest-neighbor candidates are obtained more efficiently. In this invention the i-th basis is divided into k_i intervals: k_i suitable representative values are prepared for each basis so as to minimize the error E. Let C_ij denote the j-th representative value of the i-th basis. The hash function is then expressed by the following formula.
h_i(p) = argmin_j | V_i · p − C_ij |
where V_i is the unit vector in the direction of the i-th basis. A vector can then be represented by a combination of the basis representative values C_ij, and a bucket is defined by this vector: the region in which a given such vector is the nearest one is taken as the range of the bucket. This vector is called the representative vector. In this way the data space is divided into ∏_{i=1}^{ν} k_i regions. The distance component between a point p and the representative value C_ij along the i-th basis, i.e., the basis distance BD_i, is
BD_i(p, j) = ( V_i · p − C_ij )²
The error between each sample (vector datum) and its representative vector is then defined by equation (15).
E = Σ_{X_n ∈ X} Σ_i BE_i(X_n),   BE_i(X_n) = min_j ( V_i · X_n − C_ij )²   …(15)
Here X is the set of data, X_n is the n-th datum, and BE_i is the error in the direction of the i-th principal component. In this invention the representative values {C_ij} of each basis are obtained by the k-means method, whose objective function is exactly this error.
The important question is how many representative values to assign to each basis. For such problems every basis is often divided into the same number of intervals, but in real data the variance of the data projected onto each principal-component basis differs greatly from basis to basis, so treating them all equally is inefficient. Naturally, for the same number of representative vectors, the smaller the estimation error the better.
Consider multiplying the number of representative values of one basis by n; the number of representative vectors is then also multiplied by n. Assuming a uniform distribution, the quantization effect makes the estimation error 1/n² of its former value.
Since bases with larger variance are expected to enjoy a larger reduction of error per added representative vector, it is efficient to increase the representative values of the bases in decreasing order of variance and to stop dividing once the error of every basis falls below a given threshold M; a sketch of this loop follows below. This is an adaptive partitioning that accounts for each basis's contribution to the distance computation. Fig. 7 compares equal division with the adaptive basis division of this embodiment.
<Fourth improvement: distance estimation based on probability density functions>
The previous section proposed a space-partitioning structure that relaxes the uniform-distribution constraint. This section instead assumes a probability density function on each interval and proposes a distance estimation method that adapts flexibly to more general distributions. The details of the method follow.
In this invention, for each principal-component direction, the distribution of the samples (vector data) lying in the region satisfying h_i(p) = j (that is, t_{i(j−1)} ≤ V_i · p < t_ij) is expressed as a histogram, normalized, and then converted into a probability density function P_ij(y) by the least-squares method. The boundary t_ij is expressed as follows.
t_ij = ( C_ij + C_{i(j+1)} ) / 2
(the midpoint between adjacent representative values, so that each bucket is the region in which its representative value is the nearest). The estimated distance BD_i(p, j) along the i-th principal component is then expressed as the expected value of the squared basis distance:
BD_i(p, j) = ∫_{t_{i(j−1)}}^{t_ij} P_ij(y) ( V_i · p − y )² dy
Compared with the distance estimation of BDH, the distance estimation proposed here adapts better to general, complex distributions while at the same time having a small computational cost, making it suitable for high-speed processing.
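Numerically, this expectation needs no explicit integral: for any distribution over a bin, E[(v − y)²] = (v − mean)² + variance, which is what integrating against the fitted density approximates. A minimal sketch, assuming the bin's samples are at hand (bin_distance is an illustrative name):

```python
import numpy as np

def bin_distance(v, bin_samples):
    """Expected squared 1-D distance between coordinate v and the
    data in one bin: E[(v - y)^2] = (v - mean)^2 + var."""
    mu = float(np.mean(bin_samples))
    var = float(np.var(bin_samples))
    return (v - mu) ** 2 + var
```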
<Fifth improvement: expanding the search radius according to the data density around the query>
Hash-based methods generally determine the search region from a search radius R given as a parameter. However, as noted in the literature (see, e.g., W. Dong, Z. Wang, W. Josephson, M. Charikar, and K. Li, "Modeling LSH for performance tuning," Proceedings of the 17th ACM Conference on Information and Knowledge Management, pp.669-678, 2008), both the probability of obtaining the true nearest neighbor and the processing time of an approximate nearest-neighbor search depend strongly on the parameters. If all queries are processed with a common R, then to obtain sufficient accuracy a rather large R must be given so that the nearest neighbor is found even for queries lying in sparse regions. As a result, when a query lies in a dense region the search region becomes wider than necessary and the distance computations become excessive. In short, the desirable R varies greatly with the position of the query in the data space.
The inventors therefore propose taking the number of nearest-neighbor candidates c, rather than the search radius R, as the parameter: the search radius is adapted automatically, and the processing time is reduced stably regardless of the query. The search region is expanded step by step, starting from the smallest estimated distances, and the expansion is cut off as soon as the number of nearest-neighbor candidates reaches c. Fig. 8 illustrates, as an example of this embodiment with hash dimensionality ν = 2, how the search radius changes with the density around the query. With this method a wide range is searched when the query's neighborhood is sparse, as in Fig. 8(a), and a narrow range when it is dense, as in Fig. 8(b). Because excessive distance computations for queries falling in dense regions are suppressed, the average performance improves. In Figs. 8(a) and 8(b) the gray regions indicate the referenced buckets; in both cases nine points become the nearest-neighbor candidates.
Next we describe the efficient neighbor-bucket reference algorithm of this embodiment. Conventional methods also spend considerable processing time just identifying the nearest-neighbor candidates: some compute the hash-space distance to every bucket to find the buckets to reference (see, e.g., H. Jegou, M. Douze, and C. Schmid, "Product quantization for nearest neighbor search," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.33, no.1, pp.117-128, 2011), while others hold no hash structure in their implementation and compute the hash-space distance to every sample point to pick out the points within the search radius (SH and its refinements).
We therefore relax the condition that the buckets be referenced strictly in increasing order of estimated distance. That is, the upper bound on the estimated distance is raised in stages, and buckets below the bound are referenced with their ordering ignored. Complex data structures and sorting then become unnecessary, and the buckets to search can be identified quickly. Fig. 19 shows the two algorithms of this embodiment that adapt the search radius, "Algorithm 3" and "Algorithm 4". In both, the query is q and BD_ij = BD_i(q, j); L and U denote the lower and upper bounds of the search radius, and Δ is the difference from the previous radius when the search radius is expanded. This staged processing, driven by the number of nearest-neighbor candidates, enables a more efficient search than the conventional methods above and than "Algorithm 1" and "Algorithm 2".
"Algorithm 1" and "Algorithm 2" computed the distance mD_i obtained by summing the smallest values attainable by each of the remaining ν−i hash functions, and searched for the buckets within radius R of the query. "Algorithm 3" and "Algorithm 4" additionally compute the distance MD_i obtained by summing the largest attainable values, and also use the condition that a bucket lie at distance L or more from the query.
In Algorithm 3, lines 1-4 compute the minimum distance mD_i over the remaining ν−i dimensions and the distance MD_i obtained by summing the maximum attainable values. Lines 5-6 set the initial values of L and U, the lower and upper bounds of the search radius. Lines 7-11 call the function of Algorithm 4 repeatedly, widening the search radius by Δ each time, until the nearest-neighbor candidates number c or more.
The two arguments of Algorithm 4 are the same as in Algorithm 2. In lines 1-7, if the last basis (the stage that fixes the ν-th hash value) has not yet been reached (line 1), the k_i hash values of the i-th basis (its number of divisions) are tried in turn (line 2). Since BD_ij is the one-dimensional distance on the i-th basis when hash value j is chosen, D + BD_ij + mD_i is the distance to the nearest possible bucket once i hash values are fixed, and D + BD_ij + MD_i is the distance to the farthest. Only when the former is smaller than the upper bound U and the latter larger than the lower bound L does the function call itself recursively and proceed to the next basis (lines 3-5). Lines 9-13 run only when the final, ν-th basis is reached: if a bucket lies within the upper bound and beyond the lower bound of the search radius, its hash entry is looked up.
<Experiments on the third to fifth improvements>
Experiments were conducted to confirm the effectiveness of the third to fifth improvements described above. The contents of the experiments and their results follow.
All implementations were written in C++. For ANN we used the ANN Library (see http://www.cs.umd.edu/~mount/ANN/); SH was implemented by the inventors with reference to the author's MATLAB source code (see http://www.cs.huji.ac.il/~yweiss/SpectralHashing/) and to LSH-KIT (see http://lshkit.sourceforge.net/). The machine used for the experiments had an Opteron(tm) 6174 (2.2 GHz) CPU and 256 GB of memory, and the experiments were run on a single core. For the database, SIFT features were extracted from images taken every 10 seconds from the videos distributed for the TRECVID2010 Instance Search task, and duplicate vectors were removed.
<1. Experiment 1>
We measured the correlation coefficient between the estimated distance and the actual Euclidean distance as the code length varies. This result indicates how faithfully the estimated distance reflects the true distance. The hash structures compared were SH, BDH, and k-means BDH (the third improvement: partitioning the data space for distance estimation); 1000 queries were issued against 1 million SIFT features and the correlation coefficients averaged. In this experiment the estimated distance of SH is the Hamming distance between binary codes, and the code length of BDH and k-means BDH is [log2(b)], where b is the number of buckets (hash size). The accuracy of the estimated distances is shown in Fig. 9, with code length [bit] on the horizontal axis and correlation coefficient on the vertical axis. Thanks to the distribution-aware partitioning by the k-means method, a higher correlation coefficient than BDH's is obtained at the same code length. The correlation coefficient of SH fails to grow with the code length because SH partitions the space under a uniform-distribution assumption and because the Hamming distance deviates from the Euclidean distance as a metric.
For reference, Fig. 10 shows the relationship between the estimated distance and the true distance at a code length of 120 bits. As the correlation coefficients suggest, the estimated distance of k-means BDH reflects the true distance well.
<2. Experiment 2>
We measured the average rank, in hash-space distance, of the true nearest neighbor as the code length varies. This indicates the minimum amount of computation needed to find the nearest neighbor, so the smaller the value the better. The experiment corresponds to an evaluation by a code-length-versus-precision curve for k-nearest-neighbor search. Here the code length of BDH and k-means BDH is defined as [log2(c)], where c is the number of buckets (hash size). Again 1000 queries were issued against 1 million SIFT features and the results averaged. Fig. 11 shows the average rank of the nearest neighbor in hash-space distance as the code length changes; the horizontal axis is the code length [bit] and the vertical axis is the rank of the nearest neighbor.
At every code length the true nearest neighbor is ranked nearer the top by BDH than by SH: relative to SH, the average rank falls to about 1/8 at 20 bits, about 1/8 at 40 bits, and about 1/13 at 80 bits.
<3. Experiment 3>
We examined, while varying the search parameters, the relationship between accuracy in the nearest-neighbor search problem (the fraction of queries for which the true nearest neighbor was obtained) and processing time (the average time from issuing a query to obtaining the answer). In this experiment the database was 10 million SIFT features, and the queries were 1000 points generated under the same conditions as the database. The code length n of SH and BDH was set to 24 bits, since experiments showed the optimum to occur when the hash size 2^24 is comparable to the number of data points, 10^7.
We first confirm the effectiveness of the distribution-adapted distance estimation with the search radius R fixed; the comparison is shown in Fig. 12. k-means BDH applies the k-means space-partitioning method of the third improvement, and k-means BDH P further applies the probability-density-based distance estimation of the fourth improvement to k-means BDH. Applying k-means alone performs worse than BDH, but combining it with the probability density functions brings an improvement.
Next we compare k-means BDH P, which was the fastest in Fig. 12 and uses only the fourth improvement (probability-density-based distance estimation) with the search radius R fixed, against k-means BDH PC, which adds the fifth improvement (fixing the number of nearest-neighbor candidates c). The comparison is shown in Fig. 13. Reflecting the data density around the query yields roughly a twofold speed-up, confirming that a substantial effect is obtained.
Finally we compare against the representative existing methods ANN and SH. The respective search parameters are ε for ANN, the Hamming distance R for SH, and the number of nearest-neighbor candidates c for k-means BDH PC. The results are shown in Fig. 14: compared at the same accuracy, the processing time of k-means BDH PC is about 1/10 of ANN's and about 1/6 to 1/12 of SH's, a substantial speed-up.
<4. Experiment 4>
Next we experimented with the parameter tuning of this invention. When running a search, knowing the relationship between the input parameters and the resulting performance is of great practical importance, and with this invention that relationship is easy to obtain. Figs. 15 and 16 show, respectively, the processing time and the accuracy as functions of the input parameter c, the number of nearest-neighbor candidates. Fig. 15 shows that the processing time is almost linear in c and can be written T = t·c. Fig. 16 plots both axes on a log scale; the curve is almost a straight line over the accuracy range of 20% to 90%, showing that A = c^a holds in this range. Below 20% the curve departs from the straight line because, as Fig. 17 shows, when c is small most queries already secure c nearest-neighbor candidates in the first round of search, so the number of candidates actually obtained does not change. Therefore, once a modest set of sample queries has been run, the accuracy and processing time can be predicted in advance from the parameter c, making parameter tuning easy.
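A minimal sketch of such a calibration, assuming measurements (c, T, A) collected from sample queries are at hand; fit_tuning_curves and the fixed 20-90% band are illustrative choices, not from the source.

```python
import numpy as np

def fit_tuning_curves(c_vals, times, accs):
    """Fit T = t*c by least squares through the origin, and A = c**a
    on the 20-90% accuracy band via the slope of a log-log fit."""
    c_vals, times, accs = map(np.asarray, (c_vals, times, accs))
    t = float(np.sum(times * c_vals) / np.sum(c_vals ** 2))
    band = (accs >= 0.2) & (accs <= 0.9)
    a = float(np.polyfit(np.log(c_vals[band]), np.log(accs[band]), 1)[0])
    return t, a
```

Given t and a, the c needed for a target accuracy, and the processing time it will cost, follow directly from the two fitted relations.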
Beyond the embodiments described above, various modifications of this invention are possible. They should not be construed as falling outside the scope of this invention, which includes meanings equivalent to the claims and all modifications within their range.
For example, this invention is applicable whether the distance measure is the Manhattan distance (L1 norm), the Euclidean distance (L2 norm), or any other distance.
<Sixth improvement>
We now describe a sixth improvement, which develops the indexing of the third improvement ("partitioning the data space for distance estimation") and the fourth improvement ("distance estimation based on probability density functions"). Experiments confirm that high search performance is achieved by using this method in combination with the search scheme of the fifth improvement ("expanding the search radius according to the data density around the query").
1. Indexing the data space
In the third improvement, when the data space is partitioned (indexed), the M-dimensional space spanned by the selected bases is coarsely quantized by scalar quantization. In real data, however, even orthogonal bases are not necessarily mutually independent, so scalar quantization has poor quantization efficiency.
Writing h_m(x) for the hash functions and g_m for the number of hash values sufficient to cover the range in which the data exist, the number of buckets G is
G = ∏_{m=1}^{M} g_m
which grows exponentially in M. A larger number of buckets raises the accuracy but makes the search over adjacent buckets take longer. A large M therefore cannot be used casually: when the vectors are high-dimensional, the M used for indexing becomes relatively small compared with the full dimensionality, the accuracy of the distance estimates is lost, and the overall performance of the approximate nearest-neighbor search degrades. Empirically, the best balance is obtained when the number of buckets is about equal to the dataset size n.
In the sixth improvement, therefore, M sets each consisting of P orthogonal bases are selected from the orthogonal bases V, and the data space is expressed as the direct product of M P-dimensional subspaces. Let V_m be the P orthogonal bases spanning the m-th subspace. During indexing, a set of centroids C_m is obtained for each subspace with the k-means algorithm so as to minimize the quantization error. This is equivalent to splitting the vector projected onto the PM-dimensional space spanned by these bases into M P-dimensional subvectors and product-quantizing them. Writing C_i^m for the i-th partial centroid of C_m, the hash function H(·) is as follows:
H(x) = ( h_1(x), …, h_M(x) ),   h_m(x) = argmin_i ‖ proj_{V_m}(x) − C_i^m ‖²
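A minimal sketch of this product-quantized indexing, assuming the data have already been rotated into the chosen basis so that each subspace is a set of columns; train_pq_index, pq_hash, and the plain Lloyd iterations are illustrative stand-ins for a production k-means.

```python
import numpy as np

def train_pq_index(X, subspaces, g, iters=20):
    """X: (n, d) data already rotated to the chosen orthogonal basis.
    subspaces: list of M column-index arrays (P columns each).
    g[m]: centroid count of subspace m."""
    codebooks = []
    for m, cols in enumerate(subspaces):
        sub = X[:, cols]
        rng = np.random.default_rng(m)
        C = sub[rng.choice(len(sub), size=g[m], replace=False)].copy()
        for _ in range(iters):                       # Lloyd iterations
            d2 = ((sub[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
            assign = d2.argmin(axis=1)
            for i in range(g[m]):
                if np.any(assign == i):
                    C[i] = sub[assign == i].mean(axis=0)
        codebooks.append(C)
    return codebooks

def pq_hash(x, subspaces, codebooks):
    """H(x) = (h_1(x), ..., h_M(x)): each h_m picks the nearest
    partial centroid of its subspace."""
    return tuple(
        int(((x[cols] - C) ** 2).sum(axis=1).argmin())
        for cols, C in zip(subspaces, codebooks)
    )
```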
2. Bucket distance
Here we derive the distance estimator that minimizes the estimation error, and use it to define the bucket distance. If the bases are assumed to be uncorrelated (using principal-component bases guarantees uncorrelatedness up to second order), the error of the estimated distance along each basis can be minimized independently, so we consider minimizing the error of a one-dimensional squared distance.
The minimization problem is set up as follows. Let the data x follow the probability density function P(x), and let Z denote the event that x lies in a region z; the distribution of the data inside z is then written P(x | Z).
Putting
P(Z) = ∫_z P(x) dx
we obtain
P(x | Z) = P(x) / P(Z)   (x ∈ z)
Under these conditions, let e be the estimator of the squared distance (q − x)² between the query q and a point x in the region; the error-minimization problem is then posed as follows.
min_e E[ ( (q − x)² − e )² | Z ]   …(23)
Since this expression has the same form as the defining formula of a variance, the e that minimizes (23) is obtained as the expectation of (q − x)²:
e = E[ (q − x)² | Z ] = ( q − E[x | Z] )² + Var[x | Z]   …(24)
In equation (24), Var[·] is the function that returns the unbiased variance, treating its argument as a sample from the population. Note that the estimator obtained here is not the distance to the centroid of the region.
From the above, the bucket distance F_H for a bucket whose hash-value list is H is defined as follows, where in equation (26) u_p^m is the p-th principal-component basis of the m-th subspace:
F_H(q) = Σ_{m=1}^{M} Σ_{p=1}^{P} [ ( u_p^m · q − E[ u_p^m · x | H ] )² + Var[ u_p^m · x | H ] ]   …(26)
3. Number of partial centroids
We next consider the number of centroids g_m to give each subspace. Ordinary product quantization uses a common g_m for every subspace, but when the spread of the data differs across subspaces, as when PCA is used, quantization becomes inefficient; as in Fig. 24, minimizing the quantization error requires adjusting g_m to the spread of each subspace. Concretely, Fig. 24 depicts the distribution of points in the data space, with u_1 and u_2 the first and second principal-component bases. Quantizing the original bases independently, as in "Normal" of Fig. 24(a), is wasteful because those bases are usually correlated. Projecting onto the principal-component directions, as in "PCA" of Fig. 24(b), cancels correlations up to second order; but the variances then differ between bases, so the partitioned regions become elongated in one direction. Since such partitions are ideally close to spherical, this too enlarges the error. The number of divisions should therefore be varied according to the point distribution, as in "PCA + bit coordination" of Fig. 24(c).
Next, consider the criterion for distributing g_m. Recomputing the error of equation (23) with the estimator e of equation (24) gives the following.
E[ ( (q − x)² − e )² | Z ] = Var[ (q − x)² | Z ] ≈ 4 E[q²] · Var[x | Z]
The final approximation holds because x, being confined to the region z, has small variance relative to q, while E[q²] is very large compared with the other terms. During spatial indexing, therefore, it suffices to attend to the variance within each partitioned region, Var[x | Z], which corresponds exactly to the quantization error of each subspace. Accordingly, BDH assigns a larger g_m to basis sets with larger variance, so that the maximum quantization error over the subspaces is minimized.
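A minimal sketch of this allocation follows; it is a greedy "bit coordination" loop, and the assumption that doubling g_m scales that subspace's error by roughly 1/4 (following the 1/n² rule for one-dimensional splits above) is an illustrative simplification, as are the names.

```python
def allocate_centroids(base_errors, budget_bits):
    """Repeatedly double g_m for the subspace with the largest
    current quantization error, so the maximum per-subspace error
    is minimized under a total budget of log2(G) doubling steps."""
    g = [1] * len(base_errors)
    err = [float(e) for e in base_errors]
    for _ in range(budget_bits):            # log2 of the bucket count G
        m = max(range(len(err)), key=err.__getitem__)
        g[m] *= 2
        err[m] /= 4.0                        # assumed error decay per doubling
    return g
```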
<Experiments on the sixth improvement>
We first verified the relationship between accuracy and processing time without limiting memory usage, and then, applying memory reduction, verified it with an upper bound placed on the usable memory. The problem setting is the K-nearest-neighbor search problem of finding the K points closest to the query.
1. Experimental conditions
The experiments used the SIFT features of the BIGANN dataset (see "Datasets for approximate nearest neighbor search," http://corpus-texmex.irisa.fr/) and the GIST features of 80 Million Tiny Images (see "Tiny Images Dataset," http://groups.csail.mit.edu/vision/TinyImages/). Each query set contained 1000 queries; search accuracy is the average recall, and processing time is the average per query vector. The first 1 million points of the dataset were used for SIFT training and 100,000 points for GIST training. The CPU used was an Opteron(tm) 6174 (2.2 GHz), and all experiments were run on a single core.
2. Relationship between processing time and accuracy
This section compares the processing time and accuracy of the proposed method of this embodiment with those of the comparison methods, with K = 1. Since no memory reduction is applied here, every method can always reach 100% accuracy by widening its search region. The databases used were SIFT with 1 million (training data), 10 million, and 100 million points, and GIST with 100,000 (training data), 1 million, and 10 million points. The comparison methods were, among tree-based methods, randomized kd-trees (see C. Silpa-Anan and R. Hartley, "Optimised kd-trees for fast image descriptor matching," Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.1-8, 2008) and hierarchical k-means (see D. Nister and H. Stewenius, "Scalable recognition with a vocabulary tree," Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol.2, pp.2161-2168, 2006), and, among hash-based methods, IVFADC and IMI. The results are shown in Figs. 25-30, and the parameters used in the experiments are listed in Table 2. In each figure the vertical axis is the time required for the search and the horizontal axis the accuracy. In the legends, "BDH" is the proposed method of this embodiment, while "Multi" is IMI, "IVF" is IVFADC, "RKD" is randomized kd-trees, and "HKM" is hierarchical k-means.
[Table 2: parameters used in the experiments]
As the graphs in Figs. 25-30 show, the proposed method obtains the solution faster than the existing methods on every feature type compared. In the high-accuracy band above 90% the differences between the methods are small, but in the low-accuracy band below 50% the proposed method is clearly faster than the others, most conspicuously in Fig. 27, which has the largest number of points.
The cause is the overhead of nearest-neighbor candidate selection. IVFADC performs G distance computations to search for the neighboring centroids, and IMI needs 2·G^{1/2} distance computations plus a sort to generate the partial distance lists {f_i^m}, so their processing time does not shrink even at low accuracy. On the other hand, the quantization accuracy of their spatial indexing is higher than BDH's, so their curves rise more gently with respect to accuracy.
11: feature point extraction unit; 13: search range determination unit; 15: image database; 15h: hash table; 17: nearest neighbor determination unit; 19: voting unit; 21: voting table; 23: image selection unit

Claims (12)

1. An approximate nearest neighbor search device comprising:
a database storage unit in which, when a plurality of points expressed as vector data are input, a hash index is computed for each point by applying hash functions for index computation of a multidimensional hash table, and each point is stored in the multidimensional hash table by projecting it into the region, among the plurality of regions into which the multidimensional space is divided by the bins of the multidimensional hash table, that corresponds to its hash index;
a search range determination unit which, when a query is input, applies the hash functions to the query to determine the position of the query in the multidimensional space, determines an estimate of the distance between the query and each region of the space, and determines at least one region to be searched on the basis of the estimates; and
a nearest neighbor determination unit which computes the distance between the query and each point in the regions to be searched and outputs the point closest to the query as the query's nearest neighbor,
wherein the search range determination unit obtains a representative point of each region by referring to the region's index, determines the estimates on the basis of the distances between the query and the representative points, and determines the regions to be searched while excluding, by the branch-and-bound method, regions that cannot be regions to be searched.
2. The approximate nearest neighbor search device according to claim 1, wherein the database storage unit registers each point in M sets of multidimensional hash tables using M sets (M being a natural number of 2 or more) of hash function groups.
3. The approximate nearest neighbor search device according to claim 2, wherein the M sets of multidimensional hash tables have approximately equal numbers of dimensions.
4. The approximate nearest neighbor search device according to any one of claims 1 to 3, wherein the multidimensional hash table is a combination of hash tables each corresponding to one dimension, each hash table divides the basis corresponding to its dimension into bins, and the width of each bin is adjusted so as to reduce the sum, over the bins, of the positional errors between the points that would be registered in each bin if all bins had equal width and the representative point of that bin.
5. The approximate nearest neighbor search device according to claim 4, wherein, in each hash table, the points are clustered into a predetermined number n of clusters, a representative point is computed for each cluster, and the width of each bin is determined so that the variance expressing the average distance from the points belonging to each cluster to that cluster's representative point is smaller than a predetermined threshold.
6. The approximate nearest neighbor search device according to claim 4 or 5, wherein the search range determination unit obtains a probability density function from the distribution of the vector data along each basis direction and determines the estimates using the probability density function to weight the distances.
7. The approximate nearest neighbor search device according to any one of claims 1 to 6, wherein the search range determination unit takes, as the regions to be searched, the regions whose representative points lie within a predetermined search radius R centered on the query.
8. The approximate nearest neighbor search device according to any one of claims 1 to 6, wherein the search range determination unit takes, as the regions to be searched, the regions whose representative points lie within a search radius R centered on the query, and gradually increases R until the number of points contained in those regions reaches a predetermined number.
9. The approximate nearest neighbor search device according to any one of claims 1 to 8, wherein
the database storage unit projects each point into a ν-dimensional space,
the multidimensional hash table is formed by selecting M sets of subspaces, each spanned by P of the ν orthogonal bases spanning the ν-dimensional space, dividing each subspace by the k-means method, and setting a larger number of divisions for subspaces having larger variance within their regions, and
the search range determination unit determines the estimates so that the estimation error of the squared distance between the query and the points in each region is minimized in each basis direction, according to probability density functions obtained from the distribution of the vector data along each basis direction.
10. The approximate nearest neighbor search device according to any one of claims 1 to 9, wherein
the database storage unit projects each point into a ν-dimensional space whose bases are determined by principal component analysis,
the search range determination unit computes, as the estimates, the distance components along each basis direction of the distance between the query and each representative point, and
the branch-and-bound method takes as its constraint, in the course of computing the distance components, that their sum lie within the defined search radius R, and judges whether the representative point of each region lies within the radius R in decreasing order of the eigenvalues of the bases in the principal component analysis.
11. An approximate nearest neighbor search method in which a computer executes:
a step of accessing a database storage unit in which, when a plurality of points expressed as vector data are input, a hash index is computed for each point by applying hash functions for index computation of a multidimensional hash table, and each point is stored in the multidimensional hash table by projecting it into the region, among the plurality of regions into which the multidimensional space is divided by the bins of the multidimensional hash table, that corresponds to its hash index;
a search range determination step of, when a query is input, applying the hash functions to the query to determine the position of the query in the space, determining an estimate of the distance between the query and each region of the space, and determining at least one region to be searched on the basis of the estimates; and
a step of computing the distance between the query and each point in the regions to be searched and outputting the point closest to the query as the query's nearest neighbor,
wherein the search range determination step obtains a representative point of each region by referring to the region's index, determines the estimates on the basis of the distances between the query and the representative points, and determines the regions to be searched while excluding, by the branch-and-bound method, regions that cannot be regions to be searched.
12. An approximate nearest neighbor search program causing a computer to execute:
processing of accessing a database storage unit in which, when a plurality of points expressed as vector data are input, a hash index is computed for each point by applying hash functions for index computation of a multidimensional hash table, and each point is stored in the multidimensional hash table by projecting it into the region, among the plurality of regions into which the multidimensional space is divided by the bins of the multidimensional hash table, that corresponds to its hash index;
processing as a search range determination unit which, when a query is input, applies the hash functions to the query to determine the position of the query in the space, determines an estimate of the distance between the query and each region of the space, and determines at least one region to be searched on the basis of the estimates; and
processing as a nearest neighbor determination unit which computes the distance between the query and each point in the regions to be searched and outputs the point closest to the query as the query's nearest neighbor,
wherein the search range determination unit obtains a representative point of each region by referring to the region's index, determines the estimates on the basis of the distances between the query and the representative points, and determines the regions to be searched while excluding, by the branch-and-bound method, regions that cannot be regions to be searched.
PCT/JP2013/055440 2012-02-28 2013-02-28 Approximate nearest neighbor search device, approximate nearest neighbor search method, and program WO2013129580A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
JP2012042172 2012-02-28
JP2012-042172 2012-02-28
US201261684911P 2012-08-20 2012-08-20
US61/684,911 2012-08-20

Publications (1)

Publication Number Publication Date
WO2013129580A1 true WO2013129580A1 (en) 2013-09-06

Family

ID=49082771

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2013/055440 WO2013129580A1 (en) 2012-02-28 2013-02-28 Approximate nearest neighbor search device, approximate nearest neighbor search method, and program

Country Status (2)

Country Link
JP (1) JPWO2013129580A1 (en)
WO (1) WO2013129580A1 (en)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113157688A (en) * 2020-01-07 2021-07-23 四川大学 Nearest neighbor point searching method based on spatial index and neighbor point information

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
NAOTO MASUYAMA ET AL.: "Acceleration of the k-Nearest Neighbor Algorithm by Addition of Termination Conditions in Pattern Recognition Problems", THE TRANSACTIONS OF THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS, vol. J84-D-II, no. 3, 1 March 2001 (2001-03-01), pages 439-447 *
TOMOKAZU SATO ET AL.: "Gaisan Kyori no Seido Kojo ni yoru Kinji Saikinbo Tansaku no Kosokuka", IPSJ SIG NOTES, 15 October 2011 (2011-10-15), pages 1 - 6 *

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017065795A1 (en) * 2015-10-16 2017-04-20 Hewlett Packard Enterprise Development Lp Incremental update of a neighbor graph via an orthogonal transform based indexing
US11361195B2 (en) 2015-10-16 2022-06-14 Hewlett Packard Enterprise Development Lp Incremental update of a neighbor graph via an orthogonal transform based indexing
WO2017072890A1 (en) * 2015-10-28 2017-05-04 株式会社東芝 Data management system, data management method, and program
JPWO2017072890A1 (en) * 2015-10-28 2018-05-17 株式会社東芝 Data management system, data management method and program
US11281645B2 (en) 2015-10-28 2022-03-22 Kabushiki Kaisha Toshiba Data management system, data management method, and computer program product
GB2551504A (en) * 2016-06-20 2017-12-27 Snell Advanced Media Ltd Image Processing
KR101721114B1 (en) * 2016-06-27 2017-03-30 서울대학교산학협력단 Method for Determining the Size of Grid for Clustering on Multi-Scale Web Map Services using Location-Based Point Data
CN106897258A (en) * 2017-02-27 2017-06-27 郑州云海信息技术有限公司 The computational methods and device of a kind of text otherness
US11204243B2 (en) 2017-02-28 2021-12-21 Topcon Corporation Point cloud data extraction method and point cloud data extraction device
JP2018141757A (en) * 2017-02-28 2018-09-13 国立研究開発法人理化学研究所 Extraction method of point group data and extraction device of point group data
WO2018159690A1 (en) * 2017-02-28 2018-09-07 国立研究開発法人理化学研究所 Point cloud data extraction method and point cloud data extraction device
EP3531068A4 (en) * 2017-02-28 2020-04-01 Riken Point cloud data extraction method and point cloud data extraction device
CN111033495A (en) * 2017-08-23 2020-04-17 谷歌有限责任公司 Multi-scale quantization for fast similarity search
US11874911B2 (en) 2018-04-12 2024-01-16 Georgia Tech Research Corporation Privacy preserving face-based authentication
US11494476B2 (en) 2018-04-12 2022-11-08 Georgia Tech Research Corporation Privacy preserving face-based authentication
WO2019200264A1 (en) * 2018-04-12 2019-10-17 Georgia Tech Research Corporation Privacy preserving face-based authentication
CN110377681A (en) * 2019-07-11 2019-10-25 拉扎斯网络科技(上海)有限公司 Data query method and device, readable storage medium and electronic equipment
CN110569244A (en) * 2019-08-30 2019-12-13 深圳计算科学研究院 Hamming space approximate query method and storage medium
CN112860758A (en) * 2019-11-27 2021-05-28 阿里巴巴集团控股有限公司 Search method, search device, electronic equipment and computer storage medium
CN111881767A (en) * 2020-07-03 2020-11-03 深圳力维智联技术有限公司 Method, device and equipment for processing high-dimensional features and computer-readable storage medium
CN111881767B (en) * 2020-07-03 2023-11-03 深圳力维智联技术有限公司 Method, device, equipment and computer readable storage medium for processing high-dimensional characteristics
CN111859192B (en) * 2020-07-28 2023-01-17 科大讯飞股份有限公司 Searching method, searching device, electronic equipment and storage medium
CN111859192A (en) * 2020-07-28 2020-10-30 科大讯飞股份有限公司 Searching method, searching device, electronic equipment and storage medium
CN112241475B (en) * 2020-10-16 2022-04-26 中国海洋大学 Data retrieval method based on dimension analysis quantizer hash learning
CN112241475A (en) * 2020-10-16 2021-01-19 中国海洋大学 Data retrieval method based on dimension analysis quantizer hash learning
CN113542750A (en) * 2021-05-27 2021-10-22 绍兴市北大信息技术科创中心 Data coding method for searching by two or more sets of hash tables
CN113868291A (en) * 2021-10-21 2021-12-31 深圳云天励飞技术股份有限公司 Nearest neighbor searching method, device, terminal and storage medium

Also Published As

Publication number Publication date
JPWO2013129580A1 (en) 2015-07-30

Similar Documents

Publication Publication Date Title
WO2013129580A1 (en) Approximate nearest neighbor search device, approximate nearest neighbor search method, and program
CN105912611B Fast image retrieval method based on CNN
He et al. Scalable similarity search with optimized kernel hashing
Zhang et al. Supervised hashing with latent factor models
Wang et al. Order preserving hashing for approximate nearest neighbor search
Wu et al. Online multi-modal distance metric learning with application to image retrieval
Wang et al. Query-specific visual semantic spaces for web image re-ranking
CN109271486B (en) Similarity-preserving cross-modal Hash retrieval method
Rafailidis et al. A unified framework for multimodal retrieval
Tiakas et al. MSIDX: multi-sort indexing for efficient content-based image search and retrieval
CN109145143A Sequence-constrained hashing algorithm for image retrieval
CN113961528A (en) Knowledge graph-based file semantic association storage system and method
CN106570173B (en) Spark-based high-dimensional sparse text data clustering method
Pedronette et al. Exploiting contextual information for image re-ranking and rank aggregation
Dharani et al. Content based image retrieval system using feature classification with modified KNN algorithm
US20140105509A1 (en) Systems and methods for comparing images
Ejaz et al. Video summarization using a network of radial basis functions
Ła̧giewka et al. Distributed image retrieval with colour and keypoint features
JP6017277B2 (en) Program, apparatus, and method for calculating similarity between contents represented by sets of feature vectors
JP5833499B2 (en) Retrieval device and program for accurately retrieving content represented by sets of high-dimensional feature vectors
JP5971722B2 (en) Method for determining transformation matrix of hash function, hash type approximate nearest neighbor search method using the hash function, apparatus and computer program thereof
Zhang et al. FRWCAE: joint faster-RCNN and Wasserstein convolutional auto-encoder for instance retrieval
Yang et al. Submodular reranking with multiple feature modalities for image retrieval
KR102590388B1 (en) Apparatus and method for video content recommendation
Antaris et al. Similarity search over the cloud based on image descriptors' dimensions value cardinalities

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 13755813; Country of ref document: EP; Kind code of ref document: A1)
ENP Entry into the national phase (Ref document number: 2014502374; Country of ref document: JP; Kind code of ref document: A)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 13755813; Country of ref document: EP; Kind code of ref document: A1)