CN110168525B - Fast database search system and method - Google Patents

Info

Publication number
CN110168525B
CN110168525B
Authority
CN
China
Prior art keywords
vector
database
codebook
query
quantization
Prior art date
Legal status
Active
Application number
CN201780063107.2A
Other languages
Chinese (zh)
Other versions
CN110168525A
Inventor
S. Kumar
D. M. Simcha
A. T. Suresh
R. Guo
X. Yu
D. Holtmann-Rice
Current Assignee
Google LLC
Original Assignee
Google LLC
Priority date
Filing date
Publication date
Application filed by Google LLC
Publication of CN110168525A
Application granted
Publication of CN110168525B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2453Query optimisation
    • G06F16/24534Query rewriting; Transformation
    • G06F16/24537Query rewriting; Transformation of operators
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2228Indexing structures
    • G06F16/2237Vectors, bitmaps or matrices
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2457Query processing with adaptation to user needs
    • G06F16/24578Query processing with adaptation to user needs using ranking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • G06F16/285Clustering or classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3347Query execution using vector based model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/231Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133Distances to prototypes
    • G06F18/24137Distances to cluster centroïds

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Algebra (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Implementations provide an efficient system for computing inner products between high-dimensional vectors. An example method includes clustering database entries represented as vectors, selecting a cluster center for each cluster, and storing the cluster centers as entries in a first-layer codebook. The method further includes, for each database entry, calculating a residual based on the cluster center of the cluster to which the database entry is assigned and projecting the residual into subspaces. The method also includes determining an entry in a second-layer codebook for each subspace and storing the entry in the first-layer codebook and the corresponding entry in the second-layer codebook for each subspace as the quantized vector for the database entry. The quantized vectors can be used to classify the item represented by a query vector or to provide database entries in response to the query vector.

Description

Fast database search system and method
Cross reference to related applications
This application is a continuation of, and claims priority to, U.S. application No. 15/290,198, entitled "HIERARCHICAL QUANTIZATION FOR FAST INNER PRODUCT SEARCH," filed on 11/10/2016, the disclosure of which is incorporated herein by reference in its entirety.
Technical Field
The present invention discloses a computer system and a method for fast database search.
Background
Searching very large, high-dimensional databases is a challenging task that can involve a large amount of processing and memory resources. Many search tasks involve computing the inner product of a query vector with a set of database vectors to find the database instances with the largest (maximum) inner products (e.g., the highest similarity). This is the Maximum Inner Product Search (MIPS) problem. However, computing the inner products via a linear scan requires O(nd) time and memory, which is prohibitive when the number (n) and dimension (d) of the database vectors are large.
Disclosure of Invention
Implementations provide fast approximations of inner products that are orders of magnitude faster than brute-force approaches, while maintaining high accuracy and a small memory footprint. The method includes hierarchical quantization of database items, the hierarchy including at least two levels. In some implementations, the first layer is vector quantization (VQ) and the second layer is product quantization (PQ). In some implementations, the system may perform a transform, preferably an orthogonal transform (which, as those skilled in the art will appreciate, preserves the lengths of vectors and the angles between them), on the residuals between the quantization layers. In some implementations, there may be several layers of vector quantization prior to product quantization. Other hierarchical combinations may be used.
In one implementation, the system may quantize database vectors (e.g., database entries represented as dense, high-dimensional points in a vector space) via vector quantization. In other words, the database entries may be clustered (into groups of database entries in the vector space), and a cluster center may be determined for each cluster (e.g., the centroid, the vector closest to the centroid, or another average center position) and entered into a codebook (the VQ codebook, or first-layer codebook). Each database entry is then mapped to its respective cluster via a VQ codeword representing the corresponding entry in the VQ codebook. The system may then determine the residual of the database entry, which is the difference between the cluster center to which the database entry is mapped (i.e., the entry in the VQ codebook to which the entry is mapped) and the database vector. The residuals have a much smaller "diameter" (a measure of variance) than the database entries, which results in a significant reduction in quantization error. The system may then transform the residuals via a learned (orthogonal) transform. The learned transformation reduces the variance within each subspace of the product quantization, which greatly reduces the quantization error and results in higher final recall (a higher proportion of queries for which the correct nearest neighbor of the query vector is ranked within the top t database vectors returned). Finally, the transformed residuals may be submitted to product quantization, where each transformed residual is projected into subspaces, and each subspace (or block) of the residual is assigned an entry in the PQ codebook for that subspace (identified by a PQ codeword). Here, a subspace is a subdivision of the vector space of the residuals. The projection may simply be a division of the vector space of the residuals into blocks, e.g., a division of the elements of the residual vector into sets or blocks of elements, thereby defining the subspaces.
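The following NumPy sketch illustrates this indexing path for a single database vector under simplifying assumptions; the codebook shapes, all variable names, and the requirement that d be divisible by the number of subspaces are illustrative, not details fixed by the disclosure:

```python
import numpy as np

def encode_database_vector(x, vq_codebook, R, pq_codebooks):
    """Hierarchically quantize one database vector x (illustrative sketch).

    vq_codebook : (m, d) array of VQ cluster centers (first-layer codebook).
    R           : (d, d) learned orthogonal transform applied to residuals.
    pq_codebooks: list of K arrays, each (j, d/K), the second-layer codebooks.
    Returns the VQ codeword plus one PQ codeword per subspace.
    """
    # First layer: assign x to its nearest VQ cluster center.
    vq_code = int(np.argmin(np.linalg.norm(vq_codebook - x, axis=1)))
    # Residual between the vector and its cluster center; much smaller diameter.
    residual = x - vq_codebook[vq_code]
    # Learned orthogonal transform (rotation) of the residual.
    rotated = R @ residual
    # Second layer: product quantization of the rotated residual, one block per subspace.
    blocks = np.split(rotated, len(pq_codebooks))   # assumes d divisible by K
    pq_codes = [int(np.argmin(np.linalg.norm(cb - blk, axis=1)))
                for cb, blk in zip(pq_codebooks, blocks)]
    return vq_code, pq_codes
```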
The PQ codebook may also be referred to as a second-layer codebook. In other words, for each subspace, clusters are generated and each transformed residual is mapped to one of the clusters for that subspace. Each subspace has its own PQ codeword (i.e., a different cluster assignment). The final quantized representation of the database entry is a concatenation of the codewords for each layer (e.g., the entry of the VQ codebook) and the PQ codewords for each subspace. In implementations with additional layers (e.g., additional vector quantization layers), additional VQ codewords would be concatenated after the first VQ codeword. The VQ codebook, the PQ codebook, and the transformation can be jointly learned by minimizing the quantization error over the database.
At query time, the system calculates inner products between the query vector and each VQ codeword, and the system selects one or more of the VQ codewords that are most similar to the query based on the results of the inner product calculations. The system may then calculate the residual of the query, e.g., the difference between the query vector and the cluster center for the VQ codeword that is most similar to the query. If more than one VQ codeword is selected, the system may generate a residual query vector for each selected VQ codeword. The system may then transform the residual query vector (or vectors) with the learned transform. In some implementations, the system may submit the transformed residual to another layer (e.g., additional vector quantization). The system may project the transformed residual query vector into subspaces and compare the query vector projection to quantized database entries mapped to the same VQ codeword, one PQ subspace at a time. For example, the system may select any quantized database entry having the same VQ codeword and, for the first block of the query (the first PQ subspace), determine the cluster identifier (i.e., PQ codeword) for the first subspace of each selected quantized database entry and use that identifier to identify the cluster center in the PQ codebook for that subspace. The system may then perform a dot product between the block of the query and the PQ codebook entry for the quantized data item. The result of the dot product is a subspace similarity score, and the similarity between the query and the database entry is the sum of the dot products over all of the subspaces. The system may repeat this operation for any database entry mapped to a VQ codeword selected for the query. Thus, the system performs inner products for only a portion of the database, thereby improving query response time.
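A corresponding query-time sketch, again with assumed names and shapes, following the steps described above (VQ codeword selection by inner product, query residual, learned rotation, per-subspace lookup tables, and summed lookups over candidates sharing a selected VQ codeword):

```python
import numpy as np

def score_candidates(q, vq_codebook, R, pq_codebooks, quantized_db, t=1):
    """Approximate MIPS at query time (illustrative sketch; names are assumptions).

    quantized_db: list of (vq_code, pq_codes) tuples produced at indexing time.
    Returns (database index, approximate similarity score) pairs, best first.
    """
    # Select the t VQ codewords most similar to the query (largest inner product).
    top_vq = np.argsort(vq_codebook @ q)[::-1][:t]
    results = []
    for c in top_vq:
        # Residual of the query with respect to the selected cluster center, then rotate.
        rq = R @ (q - vq_codebook[c])
        rq_blocks = np.split(rq, len(pq_codebooks))     # assumes d divisible by K
        # Per-subspace lookup tables: inner product of each PQ center with the query block.
        luts = [cb @ blk for cb, blk in zip(pq_codebooks, rq_blocks)]
        for i, (vq_code, pq_codes) in enumerate(quantized_db):
            if vq_code != c:
                continue  # only score entries that share the selected VQ codeword
            score = sum(lut[code] for lut, code in zip(luts, pq_codes))
            results.append((i, score))
    return sorted(results, key=lambda r: r[1], reverse=True)
```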
The VQ and PQ codebooks and the transform may be jointly learned using stochastic gradient descent. At each iteration, the gradient of the quantization error is calculated over a small batch of data for a fixed assignment of data points to codewords. After the descent step is performed, the codeword assignments for the data points are recomputed. The transform may be parameterized via the Cayley characterization of orthogonal matrices and initialized by sampling the parameters of the underlying skew-symmetric matrix.
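The following is a rough outline of such an alternating scheme, not the exact training procedure of the disclosure: only the codebooks are updated here, the rotation's Cayley parameters A are held fixed for brevity, and all names, shapes, and the learning rate are assumptions.

```python
import numpy as np

def joint_training_epoch(data, U, A, S, lr=1e-2, batch_size=256):
    """Alternating sketch: fix assignments per mini-batch, take a gradient step on
    the squared quantization error, then reassign on the next batch.

    U: (m, d) VQ codebook; A: (d, d) skew-symmetric Cayley parameters;
    S: list of K PQ codebooks, each (j, d/K). In practice A is updated jointly,
    typically with automatic differentiation.
    """
    d = data.shape[1]
    K = len(S)
    R = np.linalg.solve(np.eye(d) + A, np.eye(d) - A)   # Cayley map: R is orthogonal
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # Fixed codeword assignments for this mini-batch.
        vq_idx = np.argmin(((batch[:, None, :] - U[None]) ** 2).sum(-1), axis=1)
        r = (batch - U[vq_idx]) @ R.T                    # rotated residuals
        blocks = np.split(r, K, axis=1)                  # assumes d % K == 0
        pq_idx = [np.argmin(((b[:, None, :] - S[k][None]) ** 2).sum(-1), axis=1)
                  for k, b in enumerate(blocks)]
        # Reconstruction x_hat = U_c + R^T s under the fixed assignments.
        s_hat = np.concatenate([S[k][pq_idx[k]] for k in range(K)], axis=1)
        err = batch - U[vq_idx] - s_hat @ R
        # Gradient step on the codebooks (d||err||^2/dU_c = -2 err, d/ds = -2 R err).
        for i, c in enumerate(vq_idx):
            U[c] += 2 * lr * err[i] / len(batch)
        err_blocks = np.split(err @ R.T, K, axis=1)
        for k in range(K):
            for i, c in enumerate(pq_idx[k]):
                S[k][c] += 2 * lr * err_blocks[k][i] / len(batch)
    return U, S
```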
According to one general aspect, a computer system, such as for querying a database, includes at least one processor and a memory storing a database of quantized terms. Each quantized term comprises a first entry in a first codebook and a plurality of second entries in a second codebook, wherein each of the plurality of second entries represents a respective subspace of k subspaces. The memory also includes instructions that, when executed by the at least one processor, cause the system to perform operations. The operations may include determining a cluster center from the first codebook that is most similar to the query vector, calculating a residual vector from the cluster center and the query vector (e.g., from the difference between them), transforming the residual vector using a learned transformation, and projecting the transformed residual vector into the k subspaces. For each quantized term having a first entry corresponding to the cluster center determined for the query vector, the operations may further comprise: for each subspace, computing an inner product between the quantized term and the transformed residual vector, and computing a similarity score between the quantized term and the query by summing the inner products. The operations may also include providing the items with the highest similarity scores in response to the query.
In some implementations, the number k of subspaces is much smaller than the dimension d of the data, e.g., 10 or more times smaller; this helps to reduce search complexity. For example, k may be 16. The operations may further include: for each subspace, calculating an inner product between the transformed residual vector and each cluster center in the second codebook; and storing the calculated inner products in a register as an in-register codebook lookup table. This may promote instruction-level parallelism, as described further below. Preferably, each subspace has a corresponding register. A database may have millions of quantized terms.
In some implementations, the transform may be learned jointly with the two codebooks, which is efficient and may provide good results. However, prior to initializing the second codebook and performing joint learning, the first codebook may undergo initialization and a number of training epochs on its own, which may facilitate efficient learning.
According to one aspect, a method, for example, of indexing items in a database for subsequent approximate searching, includes clustering a data store of database items represented as high-dimensional vectors, selecting a cluster center for each cluster, and storing the cluster centers as entries in a first-layer codebook. The method may further comprise, for each database item, calculating a residual based on the cluster center of the cluster to which the database item is assigned, projecting the residual into subspaces, determining an entry in the second-layer codebook for each subspace, and storing the entry in the first-layer codebook and the corresponding entry in the second-layer codebook for each subspace as the quantized vector for the database item.
The quantized vector may be used to determine a responsive database item using a maximum inner product search. The method may further include transforming the residual using a learned rotation before projecting the residual into the subspaces; the learned rotation may be jointly trained with the parameters of the first- and second-layer codebooks.
The method may further comprise: determining the t clusters most similar to the query vector from the first codebook based on inner product operations; calculating a residual of the query vector for each of the t clusters based on the cluster center of the cluster; projecting each residual of the query into subspaces; for each database item assigned to one of the t clusters, determining a maximum inner product score with the query vector, the score being based on a sum, over the subspaces, of inner products calculated between the residual of the database item and the residual of the query for the cluster to which the database item is assigned; and identifying the database item that is most similar to the query vector from among the database items assigned to one of the t clusters based on the maximum inner product scores. The database items most similar to the query vector may be used to classify the item represented by the query vector and/or to provide items in response to the query vector.
According to one aspect, a method, for example, of searching a database, may include dividing the vectors in the database into m partitions using vector quantization, such that each vector has an assigned vector quantization codeword, and calculating, for each vector, a respective residual, which is the difference between the vector and the cluster center corresponding to the vector quantization codeword. The method may also include applying product quantization to each residual, generating a product quantization codeword for each of k subspaces of each residual, storing, for each vector, the assigned vector quantization codeword and the k product quantization codewords for the residual of the vector, and using the vector quantization codewords to select a portion of the database vectors most similar to the query vector. The method may further include, for each database vector in the portion, using the product quantization codewords to determine the database vector from the portion that is most similar to the query vector.
In another aspect, a computer program product embodied on a computer-readable storage device includes instructions that, when executed by at least one processor formed in a substrate, cause the computing device to perform any of the disclosed methods, operations, or processes disclosed herein.
One or more implementations of the subject matter described herein can be implemented to realize one or more of the following advantages. As one example, implementations provide a fast maximum inner product search over large, dense, high-dimensional datasets. Such datasets are often associated with recommendation or classification systems, such as locating images, videos, or products similar to a query image, video, or product. Another example of such a problem is a classification model that uses inner products to compute the probabilities of nearby words given a target word. The search avoids a full scan of the dataset while providing high performance and maintaining high recall on modern CPU architectures. A hierarchical combination of vector quantization and product quantization is implemented that greatly reduces the error of approximating inner products for large, dense, high-dimensional datasets with low latency (e.g., faster processing time). As another example, the codebooks and transform may be jointly trained end-to-end, which results in lower approximation error in representing the dataset, thereby improving recall. Some implementations provide an in-register lookup table to compute the inner products between subspaces of the query and the quantized database entries, which takes advantage of the instruction-level parallelism capabilities of modern processors and provides significant improvements over in-memory lookups. In some implementations, the final complexity of the search depends on k, the number of subspaces, m, the number of vector quantizers (e.g., the number of entries in the VQ codebook), t, the number of VQ codewords selected for the query vector, and n, the number of database entries. Thus, when k is much smaller than the data dimension d and t is much smaller than m, the search is much faster than a brute-force search. Further, the memory footprint of the disclosed implementations is much less than the brute-force memory footprint (i.e., O(nd)).
the details of one or more implementations are set forth in the accompanying drawings and the description below. Other features will be apparent from the description and drawings, and from the claims.
Drawings
FIG. 1 illustrates an example system in accordance with the disclosed subject matter.
FIG. 2 illustrates a block diagram of hierarchical quantization of database vectors representing search terms in accordance with the disclosed subject matter.
FIG. 3 illustrates a flow diagram of an example process for performing hierarchical quantization of database items according to an implementation.
FIG. 4 illustrates a flow diagram of an example query process using hierarchical quantization and maximum inner product search in accordance with an implementation.
FIG. 5 illustrates a flow diagram of an example process for jointly learning orthogonal transforms and codebooks for hierarchical quantization in accordance with the disclosed subject matter.
Fig. 6A through 6D are diagrams illustrating benefits of various implementations.
FIG. 7 illustrates an example of a computer device that can be used to implement the described technology.
FIG. 8 illustrates an example of a distributed computer device that can be used to implement the described techniques.
Like reference symbols in the various drawings indicate like elements.
Detailed Description
Fig. 1 is a block diagram of a scalable inference system in accordance with an example implementation. The system 100 may be used to hierarchically quantize a database of items and compute inner products with query vectors to find relevant database items for use in applications such as recommendation systems, classification in machine learning algorithms, and other systems that use nearest neighbor computations. The system 100 jointly learns codebooks for the hierarchical levels and reduces the processing time required to perform inner product searches while still maintaining high quality results. System 100 is depicted in FIG. 1 as a server-based search system. However, other configurations and applications may be used. For example, some operations may be performed on a client device. Further, although system 100 is described as a search system, the methods and techniques of the disclosed implementations may be used for any task that uses, for example, the maximum inner product, such as classification performed in the last layer of a neural network having a large number (e.g., millions) of output nodes. Thus, the implementations are not limited to a search system, but may be used in any system that addresses the MIPS problem.
Scalable inference system 100 can be one or more computing devices in the form of a number of different devices, such as a standard server, a group of such servers, or a rack server system, such as server 110. Further, system 100 may be implemented in a personal computer (e.g., a laptop computer). Server 110 may be an example of computer device 700 (as depicted in fig. 7) or computer device 800 (as depicted in fig. 8).
Although not shown in fig. 1, server 110 may include one or more processors formed in a substrate configured to execute one or more machine-executable instructions or blocks of software, firmware, or a combination thereof. The processor may be semiconductor-based; that is, the processor may include semiconductor material that can execute digital logic. The processor may also include registers capable of performing data-level parallelism, such as Single Instruction Multiple Data (SIMD) registers. Server 110 may also include an operating system and one or more computer memories, such as a main memory, configured to store one or more pieces of data, either temporarily, permanently, semi-permanently, or a combination thereof. The memory may include any type of storage device that stores information in a format that may be read and/or executed by the one or more processors. The memory may include volatile memory, non-volatile memory, or a combination thereof, and store modules that, when executed by the one or more processors, perform certain operations. In some implementations, the modules may be stored in an external storage device and loaded into the memory of server 110.
The modules may include a quantization engine 126 and a query engine 120. The quantization engine 126 may hierarchically quantize the database of database entries 132 and, in this process, generate a codebook, e.g., VQ codebook 134, PQ codebook 136, for each level in the hierarchy. The result of hierarchical quantization is a quantized database entry 130. In the example of fig. 1, the VQ codebook 134 represents a first layer in a hierarchy, and the PQ codebook 136 represents a second layer. In implementations with more than two layers, the quantization engine 126 may generate additional codebooks, e.g., one for each layer. Thus, for example, if system 100 uses two vector quantization layers followed by one product quantization layer, system 100 may generate a second VQ codebook for additional vector quantization layers. The resulting quantized database entries 130 and codebooks (e.g., VQ codebook 134 and PQ codebook 136) have a smaller memory footprint than the database of database entries 132. The database entry 132 may be a database of vectors. A vector can be thought of as an array of floating point numbers with d dimensions, or in other words, an array of d positions. Queries such as query 182 may also be expressed as vectors of dimension d. When d is large and the number of database entries is large (e.g., tens of thousands or even millions), the computation of the inner product between the query vector and the database vector is slow and processor intensive. Quantization enables the inner product to be approximated, but introduces quantization errors. The larger the error, the less accurate the result.
To achieve faster computation times while maintaining a high level of accuracy, the quantization engine 126 may quantize the database entries 132. Quantization engine 126 may first perform vector quantization on database entries 132 to assign each database entry an entry in VQ codebook 134. Thus, each database entry 132 has a corresponding VQ codeword that identifies an entry in the VQ codebook 134. The quantization engine 126 may then determine a residual vector for each database entry, the residual vector being the difference between the database entry vector and the cluster center (e.g., the vector corresponding to the entry in the VQ codebook 134 to which the database entry maps). The quantization engine 126 may then transform the residual using an orthogonal transform, for example using the learned rotation. The orthogonal transformation (learned rotation) reduces the variance in each subspace of the subsequent product quantization, which results in significantly lower quantization error and higher recall. Quantization engine 126 may then further quantize the rotated residual vectors via product quantization, or in other words, project the transformed residuals into subspaces and map each subspace to an entry in the PQ codebook generated for the subspace. Thus, each subspace of the transformed residual has a corresponding PQ codeword, the PQ codeword for a subspace identifying an entry in the PQ codebook for that subspace. The system may store the VQ codewords and the corresponding PQ codewords for each subspace as quantized database entries in quantized database entries 130.
Fig. 2 illustrates an example database entry 132 and a quantized database entry 130. In the example of fig. 2, database entry 132 includes n entries or n different database entries. Each of the database entries 132 (e.g., database vector (DBV)1, DBV 2, etc.) has d dimensions, or in other words, d elements. Quantization engine 126 may first quantize database entries 132 via vector quantization using VQ codebook 134. In vector quantization, each database vector is assigned a VQ codeword from VQ codebook 134. The VQ codeword represents a specific entry in the VQ codebook 134, and each entry in the VQ codebook 134 is a cluster center, or in other words, a data point that best represents a cluster. Thus, in effect, vector quantization associates each database entry with a cluster. The VQ codebook 134 is learned and generated by the quantization engine 126, and the assignment of database entries may occur concurrently with the generation of the VQ codebook 134. The VQ codebook 134 may have M entries, which may be determined when learning the VQ codebook 134 or may be provided as parameters. Thus, database entry 132 may be divided into M partitions. The quantization engine 126 stores the VQ codeword for each database entry as a first portion of a quantized database entry in the quantized database entries 130. This represents the first layer in hierarchical quantization.
The quantization engine may use the VQ codewords and the VQ codebook 134 to generate a residual for each database entry. The residual may be defined as the difference between the database vector and the cluster center associated with the database vector. The difference may be calculated by subtracting the database vector from the cluster center vector (or vice versa). Thus, the quantization engine 126 may use vector quantization to obtain a residual dataset (e.g., database entry residuals 232) having a much smaller diameter than the original vectors (e.g., database entries 132). In other words, the database entry residuals 232 still have d dimensions, but the variance within the values of the floating point numbers in the vectors is reduced. The quantization engine 126 may only temporarily store these residuals 232, as they are further quantized in another layer of the hierarchy. When the residuals are further quantized with product quantization (the second layer), the smaller diameter results in a significant reduction in quantization error.
In some implementations, the quantization engine 126 can use a learned rotation to rotate the database entry residuals 232. The learned rotation can be used to achieve the best distribution of information across the various subspaces generated by the product quantization. The learned rotation can be optimized to reduce errors using stochastic gradient descent. The rotation may be learned jointly with the VQ codebook 134 and the PQ codebook 136 to minimize quantization error. Thus, the learned rotation provides a smaller quantization error than a random rotation. However, in some implementations, no rotation, or a random rotation, may be performed. If a database entry residual vector 232 undergoes the learned rotation, it is referred to as a transformed, or rotated, residual. In some implementations, the transformed residuals may undergo another round of vector quantization, adding a layer to the hierarchy. After each vector quantization, the system may again compute residuals, which may undergo product quantization.
The quantization engine 126 then projects the database entry residuals 232, which may have been transformed, into K subspaces. A subspace may be defined as a block of elements from each residual database entry vector, where the elements occur at the same vector positions. In some implementations, d may be evenly divisible by K, such that each block includes the same number of elements. Such an implementation is shown in fig. 2, where each subspace is a block of six elements. In some implementations, d may not be evenly divisible by K, so that a direct division results in subspaces in which the number of elements in each subspace is unequal. In some implementations, the division may be based on a random or learned projection of the vector. In some implementations, variable-size subspaces can be generated by assigning an additional dimension to each of the first mod(d, K) subspaces. In such an implementation, the number of elements in each block may not be equal. In the example shown in fig. 2, the projection or division produces K subspaces or blocks, each subspace having n rows of six elements.
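One possible way to realize the variable-size division described above (an illustrative assumption, not the only option) is to give each of the first mod(d, K) blocks one extra element:

```python
import numpy as np

def split_into_subspaces(v, K):
    """Split a d-dimensional vector into K contiguous blocks; when d is not evenly
    divisible by K, each of the first d mod K blocks receives one extra element."""
    d = len(v)
    base, extra = divmod(d, K)
    sizes = [base + 1 if k < extra else base for k in range(K)]
    bounds = np.cumsum(sizes)[:-1]
    return np.split(v, bounds)

# e.g., d = 10, K = 4 gives block sizes [3, 3, 2, 2]
blocks = split_into_subspaces(np.arange(10.0), 4)
```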
Once the transformed residual vectors are projected into the subspaces, the quantization engine 126 may generate a PQ codebook 136. The PQ codebook 136 may include one codebook per subspace. Thus, using the example of FIG. 2, the PQ codebook 136 includes K PQ codebooks. In some implementations, the PQ codebooks for the subspaces may be collectively referred to as the PQ codebook for database entries 132. Each PQ codebook may include an entry for each of J clusters. The number of clusters, J, may be determined when generating the PQ codebook 136, or the number J may be passed as a parameter to the quantization engine 126. For example, the parameters may indicate that quantization engine 126 should generate 16 clusters for each codebook, or 256 clusters for each codebook. In some implementations, the number of clusters (i.e., the value of J) may depend on the size of a register (e.g., a SIMD register). In other words, to improve computation time, the number of clusters may be limited to the number of parallel lookups that registers in the system can perform. When generating the clusters, each cluster will have a cluster center. The cluster center is the entry in the codebook for that cluster. For example, in fig. 2, subspace K (i.e., block (K)) has six elements (transformed residuals) from each of the n vectors. The quantization engine 126 may cluster each of the n six-element vectors into one of J clusters. Of course, a cluster center need not match any database vector subspace exactly; it is simply a six-element vector that represents the cluster. For ease of discussion, the codebook for the k-th subspace may be denoted S^(k). Since each codebook has J entries, the j-th entry in the codebook may be denoted S_j^(k).
In some implementations, the system can jointly learn the cluster centers of the VQ codebook and the PQ codebook together with the transformation (e.g., rotation). The quantization engine 126 may use a conventional clustering algorithm based on Euclidean distance or k-means and use stochastic gradient descent, where at each iteration a gradient of the quantization error is calculated over a small batch of data for a fixed assignment of data points to codewords. After the descent step is performed, the codeword assignments for the database entries are recalculated. In this manner, the assignment of codewords to database entries may be performed concurrently with the learning of the codebooks. To optimize the orthogonal transform of the residuals while maintaining orthogonality, the quantization engine 126 may parameterize the transform via the Cayley characterization of orthogonal matrices, R = (I - A)(I + A)^(-1), where A is a skew-symmetric matrix, i.e., A = -A^T, and I is the identity matrix of size d x d. The skew-symmetric matrix A has d(d-1)/2 parameters, so computing the transformation matrix R may involve inverting a d x d matrix at each iteration. If d is high (e.g., greater than a few thousand), the system may limit the number of parameters of A. This trades off capacity against computational cost.
In some implementations, the quantization engine 126 may initialize the VQ codebook 134 using random samples from the database of database entries 132 and may initialize the PQ codebook using the residuals of an independent set of samples (e.g., after vector quantization). To allow the vector quantization layer an opportunity to partition the space, the quantization engine 126 may optimize the vector quantization error alone for a few epochs before initializing the PQ codebook 136 and performing full joint training. The quantization engine 126 may initialize the parameters of the skew-symmetric matrix A by random sampling.
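A small sketch of the Cayley characterization mentioned above; the initialization scale used here is an assumption, since the disclosure only states that the parameters of A are obtained by sampling:

```python
import numpy as np

def cayley_rotation(params, d):
    """Build an orthogonal matrix R = (I - A)(I + A)^{-1} from the d(d-1)/2 free
    parameters of a skew-symmetric matrix A (A = -A^T)."""
    A = np.zeros((d, d))
    iu = np.triu_indices(d, k=1)
    A[iu] = params
    A -= A.T                                  # enforce skew-symmetry
    I = np.eye(d)
    return np.linalg.solve(I + A, I - A)      # equals (I - A)(I + A)^{-1}

d = 8
rng = np.random.default_rng(0)
params = 1e-3 * rng.standard_normal(d * (d - 1) // 2)   # small random init (assumed scale)
R = cayley_rotation(params, d)
assert np.allclose(R @ R.T, np.eye(d), atol=1e-8)       # R is orthogonal
```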
In some implementations, the system may assign a database vector to an entry in a codebook via a one-hot assignment vector of dimension M (for the VQ codebook) or dimension J (for the PQ codebooks). Other than a one at the position representing the cluster assignment, the assignment vector for vector x (e.g., α_x), or for the k-th subspace of vector x (e.g., α_x^(k)), may be all zeros. In some implementations, the assignment vector for x (e.g., α_x) may be the quantized database entry. In other words, in some implementations, the assignment vector may be the codeword. Thus, the dot product of the assignment vector and the codebook provides the cluster center (e.g., the quantization) for vector x (or for the k-th subspace of vector x). In some implementations, the information in the quantized database entry may be a pointer to a codebook entry. The quantization engine 126 may generate a quantized database entry by concatenating the codewords from the different levels. In the example of fig. 2, there is first a VQ codeword and then a PQ codeword for each of the K subspaces. If additional layers are used by system 100, a codeword (or multiple codewords) from each additional layer may be concatenated in order of quantization. For example, if the system performs another layer of vector quantization before product quantization, the codeword for the second layer of vector quantization would follow the codeword of the first vector quantization layer and precede the codewords for product quantization. The quantization engine 126 may store the quantized database entries 130, the VQ codebook 134, the PQ codebook 136, and the learned transformation matrix R for use by the query engine 120.
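A toy example of the one-hot assignment vector (sizes chosen arbitrarily): its dot product with the codebook recovers the assigned cluster center.

```python
import numpy as np

M, d = 4, 6
vq_codebook = np.arange(M * d, dtype=float).reshape(M, d)  # toy VQ codebook
alpha_x = np.array([0.0, 0.0, 1.0, 0.0])                   # one-hot: x assigned to cluster 2
center = alpha_x @ vq_codebook                             # recovers vq_codebook[2]
assert np.allclose(center, vq_codebook[2])
```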
Once the quantization engine 126 has generated the codebook, learned the rotation, and generated the quantized vector (e.g., quantized database entries 130), the system 100 is ready to respond to queries using the VQ codebook 134, the PQ codebook 136, the learned transform matrix R, and the quantized database entries 130. Thus, the module may include a query engine 120. Query engine 120 may be configured to use the codebook and quantized database entries 130 to identify database entries 132 that are responsive to query 182 and to provide results 184 in response to query 182. The query engine 120 may include a module or engine that creates query vectors from the queries 182 using conventional techniques. The query engine 120 may determine which clusters from the VQ codebook 134 the query is closest to. In some implementations, this may include calculating an inner product between the query vector and each cluster center and selecting the cluster center with the largest inner product. In some implementations, the query engine 120 may select more than one cluster center as "closest," e.g., select the top t clusters with the highest inner products. Thus, the query engine 120 may determine a VQ codeword (or codewords) for the query 182. The query engine 120 may use the VQ codeword (or codewords) to reduce the computation time of the inner product search by limiting the comparison of the query vectors to only those quantized database entries that share the VQ codeword. Thus, rather than comparing the query vector to each database entry, only those database entries that share a VQ codeword are considered. Of course, if the query engine 120 selects multiple VQ codewords, then the quantized database entries corresponding to the additional VQ codewords will also be included in the comparison to the query 182.
The system then calculates the residual of the query 182, for example, by subtracting the query vector from the cluster center corresponding to the VQ codeword. The query engine 120 may also transform the residual query vector if the system 100 has transformed quantized database entries. The system may project the residual query vector into a subspace. The subspace to which the query residual is projected matches the subspace to which the database entry 132 is projected. Thus, the residual query vector may have K subspaces.
In some implementations, the query engine 120 can generate a lookup table 138. The lookup table 138 may store the result of the inner product between each cluster center in each subspace and the corresponding subspace of the query vector. Thus, the system may pre-compute the inner product between each data point in each PQ codebook and the corresponding residual query vector subspace and store the results in the lookup table 138. This may result in a table or database where the result of an inner product may be accessed by knowing the PQ codeword for any particular subspace (e.g., which cluster in which subspace). In some implementations, the lookup table 138 may be stored in a register, for example, a SIMD register. In some implementations, each subspace may have a lookup table 138 stored in a register, giving K lookup tables. The query engine 120 may use the lookup table 138 to greatly speed up the search, even when the table is kept in memory. However, using a table in a register may mean selecting J, e.g., 16 or 32, based on the capacity of the register.
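The lookup-table idea can be sketched in plain NumPy as follows; the table sizes are assumed values, and the gather below is what an in-register implementation would accelerate with SIMD shuffle instructions rather than memory reads:

```python
import numpy as np

J = 16        # one 16-entry table per subspace fits a 128-bit SIMD register
K = 8         # number of subspaces (assumed values for illustration)

# Precomputed tables: one row of J inner products per subspace.
tables = np.random.randn(K, J).astype(np.float32)
# PQ codewords for a batch of candidate database entries (values in [0, J)).
codes = np.random.randint(0, J, size=(1000, K))

# Each row gathers one table entry per subspace and sums them: the approximate
# inner product contribution of the PQ layer for every candidate at once.
scores = tables[np.arange(K), codes].sum(axis=1)
```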
The query engine 120 may then determine the inner product between the query and each quantized database entry associated with the selected VQ codewords. To accomplish this, query engine 120 may determine the PQ codebook assignment (i.e., the PQ codeword) in each subspace for each examined quantized database vector and determine the inner product between the data point represented by that PQ codebook assignment and the corresponding subspace of the residual query vector. In implementations using the lookup table 138, the system may look up the PQ codebook entry and subspace in the table. Thus, rather than performing an inner product operation between the query and the PQ codebook entry for the database entry, the query engine 120 may use a lookup into the lookup table 138. As indicated above, in some implementations, the lookup table 138 may be stored in a register and the system may store the corresponding PQ codewords for the database entries in a register. In this way, the system can perform 16 parallel (or 32 parallel) lookups in one CPU cycle. In other implementations, the system may perform the inner product directly. The query engine 120 may approximate the inner product between a database item and the query as the sum, over the subspaces, of the inner products between the PQ portion (the second portion, or second-layer portion) of the quantized database item and the query. In other words, using the example of fig. 2, the approximate inner product between the query and DBV 1 is the sum of the inner products between quantized block (1) through quantized block (K) of quantized search term 1 and the residual query vector. This can be expressed as the sum over k = 1 to K of S^(k)(α_x^(k)), where S^(k) is the lookup table for the k-th subspace and α_x^(k) is the PQ codeword for the k-th subspace of database entry x.
Once the query engine 120 has determined the database items with the highest inner products using the above approximation, the search has determined the database items that are responsive to the query. In some implementations, the query engine 120 may include a ranking engine that orders the results 184 by similarity score (i.e., highest inner product). The query engine 120 may provide the results 184 for display by a client device, such as the client 170. Of course, the responsive database entries may be used for other purposes, such as classification.
Scalable inference system 100 can communicate with client(s) 170 over network 160. The client 170 may allow a user to provide a query 182 to the query engine 120 and receive results 184, the results 184 including database items found in response to the query based on an approximate inner product with the search query using the quantified database items. Network 160 may be, for example, the internet or network 160 may be a wired or wireless Local Area Network (LAN), Wide Area Network (WAN), etc., implemented using, for example, gateway devices, bridges, switches, etc. Via network 160, scalable inference system 100 can communicate with clients 170 or send data from clients 170. In some implementations, the client 170 may include an application, such as a search application 175 that performs some or all of the functionality of the query engine 120. For example, quantized database items 130 do not take up much memory as compared to database items 132, and may be of a size suitable for storage on a client (such as in data store 180). Data store 180 may include any type of non-volatile memory, such as flash memory, SD, RAM, disk, and the like. Server 110 may transmit quantized database entries 130, VQ codebook 134, and PQ codebook 136 to client 170, and search application 175 may perform the actions described above with respect to query engine 120. In some implementations, the client 170 may be another server or system. The client 170 may be another example of the computing device 800 or computing device 700.
In some implementations, scalable inference system 100 can communicate with or include other computing devices that provide updates to database items 132. Scalable inference system 100 represents one example configuration, and other configurations are possible. Further, the components of system 100 may be combined or distributed in a manner different from what is shown. For example, in some implementations, one or more of the query engine 120 and the quantization engine 126 may be combined into a single module or engine. Further, the components or features of the query engine 120 and the quantization engine 126 may be distributed between two or more modules or engines, or even distributed across multiple computing devices. For example, database entries 132 and/or quantized database entries 130 may be distributed across multiple computing devices.
Fig. 3 illustrates a flow diagram of an example process 300 for preparing a database of items using hierarchical quantization for fast Maximum Inner Product Search (MIPS), according to an implementation. Process 300 may be performed by a scalable inference system (such as system 100 of FIG. 1). Process 300 is an example of hierarchical quantization of a single database entry performed by quantization engine 126 of fig. 1. It should be understood that the system may also perform process 300 on all database vectors simultaneously, and process 300 may be performed simultaneously with generating or learning a codebook for quantization, as described in more detail with respect to fig. 5. Process 300 may be performed periodically by the system such that the generated quantized database entries and codebooks are kept up to date. For example, the system may perform process 300 once a day, once a week, once an hour, etc., depending on how often the database of items is updated with new items.
Process 300 may begin with the scalable inference system assigning an entry in a vector quantization (VQ) codebook to each database entry (i.e., each database vector) (305). In other words, the system may assign a VQ codeword to each database entry. The VQ codeword points to an entry in the VQ codebook that contains (or points to) a cluster center. Therefore, the VQ codeword may also be referred to as a cluster identifier. A cluster center is a vector that has the same dimensions as the database entry vectors and is most representative of the database entries in the cluster. The VQ codebook may be generated via a learning process that may also perform the mapping of database entries to VQ codebook entries. Assigning the respective VQ codeword to a database entry is the first stage in hierarchical quantization, and the VQ codeword for each database entry is the first-stage portion of the quantized database entry. Formally, vector quantization can be expressed as φ_VQ(x) = argmin over U_i of ||x - U_i||, which returns the vector quantization codeword for x, where U is a vector quantization codebook (e.g., codebook 134) with m entries and U_i is the i-th entry.
The system may calculate a residual vector for each database entry (310). The residual vector is the difference between the database entry vector and the cluster center corresponding to the VQ codeword for that database entry. Real world data is often clusterable, with the cluster diameter being much lower than the diameter of the data set as a whole. Thus, the system can use vector quantization to obtain a residual data set with a much smaller diameter, resulting in a significant reduction of quantization error when quantizing with product quantization. Thus, hierarchical quantization utilizes vector quantization that is well suited to approximate low-dimensional components, and product quantization that is well suited to capture high-dimensional data from residuals.
In some implementations, the system may perform a learned rotation or transformation on the residual vectors (315). In some implementations, the rotation is learned with the codebooks. As demonstrated in fig. 6C, the learned rotation provides better recall. The learned rotation may be a matrix R applied to the residuals of the vector quantization. In some implementations, the transformation may be a random, but fixed, permutation. In other words, the permutation is randomly generated, but once generated it is fixed and applied to all database vectors and all query vectors. However, a random transformation does not achieve the same recall as the learned rotation. In some implementations, step 315 is optional and the residual vectors remain unchanged.
The system may project each residual vector into subspaces (320). In some implementations, each subspace may have an equal number of elements from the vector. In some implementations, the subspaces may not have an equal number of elements. The subspaces may also be referred to as blocks. The system may assign, for each subspace, an entry in the product quantization (PQ) codebook for that subspace (325). In some implementations, the assignment may occur as part of generating the PQ codebook via clustering. Thus, the PQ codebook for a particular subspace includes an entry for each cluster, with the cluster center as the entry. The cluster center has the same number of elements as the portion of the residual vector in that subspace.
In other words, each subspace has a PQ codebook, and each codebook has J entries. The value of J may depend on parameters provided to the process that generates the PQ codebook, or the process may determine the value based on the data. In some implementations, the value of J may depend on the capacity of a register (e.g., a SIMD register or other register). For example, the value of J may be 16, such that a single register may hold the entire PQ codebook (e.g., S^(k)) for subspace k. Each database vector subspace may be mapped to, or assigned to, one of the J entries in the codebook for that subspace. The specific entry j in the PQ codebook for subspace k is denoted S_j^(k). In some implementations, the assignment may occur as part of generating the codebook. For example, when clustering is used, each residual may be assigned to one of the clusters for its subspace because the clusters are generated from the residual vectors for that subspace.
The system may generate a quantized vector for each database vector by concatenating the VQ codeword with the PQ codeword for each subspace (330). The VQ codeword may be the codeword for the first level of the hierarchy, and the PQ codewords (one for each subspace) may be for the second level of the hierarchy. Thus, in hierarchical quantization, the database vector x may be approximated by

x ≈ φ_VQ(x) + R^T φ_PQ(r_x), with r_x = R(x - φ_VQ(x)),

where φ_VQ(x) returns the VQ codeword for x, U is the vector quantization codebook with m codewords, the matrix R is the learned rotation applied to the residuals of the vector quantization, and the product quantizer is given by

φ_PQ(r_x) = the concatenation of (φ_PQ^(1)(r_x^(1)), ..., φ_PQ^(K)(r_x^(K))),

obtained by dividing the rotated residual r_x into K subspaces r_x^(1) through r_x^(K) and quantizing each subspace independently by a vector quantizer that minimizes the quantization error:

φ_PQ^(k)(r_x^(k)) = argmin over s in S^(k) of ||r_x^(k) - s||,

where S^(k) is the PQ codebook for the k-th subspace (with j entries). The final quantized representation of x is the concatenation of the index of φ_VQ(x) and the K PQ indices. This representation, i.e., the quantized database entry, has a total bit rate of log2(m) + K*log2(j), where m is the number of entries in the VQ codebook, j is the number of entries in each PQ codebook, and K is the number of subspaces. The system may store each quantized database entry in a data store, database, or other memory. Process 300 then ends, and the resulting structure can be used to approximate the maximum inner product between a query and the database items in an efficient and accurate manner.
Although fig. 3 illustrates a hierarchy having two layers, a system may include more than two layers in the hierarchy. For example, the system may perform one or more additional vector quantization layers or one or more additional product quantization layers as needed and/or supported by the data. Each additional layer will add an additional codeword to the vectorized database entry. Further, each layer receives the residual calculated in the previous layer.
FIG. 4 illustrates a flow diagram of an example process 400 for identifying responsive database items using hierarchical quantization in accordance with the disclosed subject matter. Process 400 may be performed by a scalable inference system, such as system 100 of fig. 1. Process 400 may be performed each time a query is received to determine the database entries having the largest inner products with the query vector. The items with the highest inner products are the most responsive to the query, or in other words, most like the query. The query vector may also represent an item to be classified, for example, as part of the last layer of a neural network having a large number of output nodes.
The process 400 may begin with the system determining the inner product of the query vector and each entry in the VQ codebook (405). This identifies the VQ codeword that is most similar to the query. The system may select the most similar VQ codebook entry (based on the results of the inner product computation) and compute the residual of the query (410), e.g., as the difference between the query vector and the selected VQ codebook entry. In some implementations, the system may select more than one "most similar" VQ codebook entry, e.g., the t most similar entries (e.g., t = 2 or t = 5, etc., with t < m). In such an implementation, the system may generate a residual for each selected VQ codebook entry, such that each selected vector quantization entry has a corresponding residual query vector.
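A minimal sketch of steps 405 through 410, with the rotation of step 415 (described next) folded in, under the same illustrative assumptions as the encoding sketch above; q, C, R, and query_vq_residuals are hypothetical names.

import numpy as np

def query_vq_residuals(q, C, R, t=2):
    scores = C @ q                                # step 405: inner product with each VQ entry
    top_t = np.argsort(-scores)[:t]               # step 410: t most similar VQ codewords
    residuals = [R @ (q - C[i]) for i in top_t]   # steps 410-415: residual, then rotation
    return top_t, residuals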
The system may transform the residual query vector (415). In implementations using a transformation, the transformation or rotation is the same as that used in step 315 of FIG. 3. The scalable inference system can also project the residual query vector (or vectors) into the subspaces (420). The projection of the query vector is done in the same way as the projection of the database entry vectors as part of step 320 of FIG. 3. The system may then optionally generate a look-up table (425). The look-up table may comprise one entry for each entry of each PQ codebook. To generate the table, the system may perform, for each subspace (i.e., each PQ codebook), an inner product between the elements of the residual query vector in that subspace and the elements of each PQ codebook entry in that subspace. Thus, if the codebook has J entries, the look-up table will have J entries for that subspace. As part of step 435 below, the system may use the look-up table to speed up the computation of the inner product with the quantized database entries, but the use of a look-up table is optional. In some implementations, the look-up table is stored in a register, and the value of J is constrained by the characteristics of the register, e.g., to 16 or 32.
The system may then calculate a similarity score for each quantized database entry sharing the VQ codeword (VQ codebook entry) selected in step 410. Thus, the system may select the quantized database entries that share the VQ codeword (430) and compute, for each subspace, the inner product (435) between the residual query elements in that subspace and the quantized database entry elements for that subspace, which are represented by the PQ codebook entry assignment in the subspace (e.g., quantized block 1 or quantized block (K) of FIG. 2). In some implementations, the system can determine the PQ codebook entry for a subspace of the quantized database entry, determine the data point (e.g., cluster center) for that PQ codebook entry, and compute the inner product between the residual query subspace and the data point. In implementations using a look-up table, the system may determine the PQ codebook entry for the quantized database entry subspace and look up the inner product result for that PQ codebook entry in the look-up table. The system may calculate a similarity score for the database item by summing the inner products of each subspace as calculated in step 435 (440). The similarity score is the approximate inner product between the quantized database item and the query. If the query and database entry vectors are projected into K subspaces, the system sums K values, each value representing an inner product calculation for a subspace. The system may repeat steps 430 through 440 (445, yes) until a similarity score has been calculated for each database entry mapped to the same VQ codeword as the residual query vector (445, no).
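The per-subspace look-up table of step 425 and the scoring loop of steps 430 through 440 might look like the following sketch, under the same illustrative assumptions; r_q, db_codes, db_ids, and score_partition are hypothetical names.

import numpy as np

def score_partition(r_q, pq_codebooks, db_codes, db_ids):
    # r_q: rotated residual query vector; db_codes: (n, K) PQ indices of the
    # quantized entries sharing the selected VQ codeword; db_ids: their item ids.
    K = len(pq_codebooks)
    sub_dim = r_q.shape[0] // K
    # Step 425: one look-up table per subspace, holding J inner products each.
    luts = [pq_codebooks[k] @ r_q[k * sub_dim:(k + 1) * sub_dim]
            for k in range(K)]                         # luts[k] has shape (J,)
    # Steps 430-440: similarity score = sum of table lookups over the K subspaces.
    scores = {}
    for entry_id, codes in zip(db_ids, db_codes):
        scores[entry_id] = float(sum(luts[k][codes[k]] for k in range(K)))
    return scores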
In implementations using codebook look-up tables, steps 430 through 440 may also be expressed as computing, for each quantized database entry, the sum over the K subspaces

Σ_{k=1}^{K} v_{j_k}^(k),

where k indexes the K subspaces, j_k is the index (among the J entries in the PQ codebook) of the PQ codeword assigned to the database entry in subspace k, r_q = R(q − φ_VQ(q)) is the rotated residual query vector with r_q^(k) as its k-th sub-vector, and

v_j^(k) = <r_q^(k), S_j^(k)>

is a look-up table entry holding the inner product between the residual query sub-vector and S_j^(k), the j-th entry of the PQ codebook for the k-th subspace. When the codebook look-up table is stored in a register, the system can take advantage of the instruction-level parallelization capabilities of the Central Processing Unit (CPU). For example, the system may use one register (e.g., a SIMD register) to hold the look-up table v^(k) and another register to hold the indices of the PQ codewords for a given quantized database entry. In such an implementation, the system may use register instructions to perform several parallel lookups in one CPU cycle, e.g., 16 parallel lookups (PSHUFB, SSSE3) or 32 parallel lookups (VPSHUFB, AVX2). This represents a significant improvement over an in-memory codebook look-up table, which has a throughput of only one lookup per CPU cycle.
If the system selects multiple VQ codewords as "most similar" to the query vector, the system may repeat steps 425 through 445 for the other VQ codewords. The system may then return the database items with the highest similarity scores, e.g., identifiers that identify the database items or the database vectors themselves (450). As shown in FIG. 4, the VQ codewords are used to reduce the number of database entries to which the query vector is compared, e.g., via inner product operations. This reduces the processing time, thereby improving the responsiveness of the system, while the product quantization of the residuals provides high accuracy. The complexity of the search performed by the system may be expressed as approximately

O(m + (t·n/m)·k),

where k is the number of subspaces, m is the number of vector quantizers (e.g., the number of entries in the VQ codebook), t is the number of VQ codewords selected for the query vector, and n is the number of database entries, assuming partitions of roughly equal size.
In some implementations, the system may use exact dot product calculations to re-score the highest scoring database entries. In other words, the system may calculate an exact dot product for the items with the highest similarity scores and use those exact dot products to determine the database items to present to the query requester. For example, if the system returns N items as search results for the query requester, it may calculate the exact dot product between the query vector and the database vectors of the top 10×N database items determined using the quantized vectors (e.g., the 10×N database items with the highest similarity scores). The system may then use the top N database items with the highest actual dot products. This improves the accuracy of the search results, yet takes far less time than calculating the dot product for all database items. The system may provide search results that include information about those items for display to the user who provided the query. The process 400 then ends, with the most responsive items having been identified.
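A small sketch of this optional re-ranking step, assuming access to the original (unquantized) database vectors; rerank_exact, candidate_scores, db_vectors, and overfetch are illustrative names, not part of the described system.

def rerank_exact(q, candidate_scores, db_vectors, N=10, overfetch=10):
    # candidate_scores: dict mapping item id -> approximate similarity score.
    # db_vectors: mapping from item id -> original database vector (numpy array).
    approx_top = sorted(candidate_scores, key=candidate_scores.get,
                        reverse=True)[:overfetch * N]
    exact = {i: float(db_vectors[i] @ q) for i in approx_top}   # exact dot products
    return sorted(exact, key=exact.get, reverse=True)[:N]       # final top N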
FIG. 5 illustrates a flow diagram of an example process 500 for jointly learning the orthogonal transform and the codebooks for hierarchical quantization in accordance with the disclosed subject matter. Process 500 may be performed by a scalable inference system, such as system 100 of FIG. 1. The process 500 trains and optimizes a task-dependent objective function to predict the clusters in each hierarchical layer, and optionally the learned rotation, starting from random samples from the database. To jointly train the parameters (codebooks and orthogonal transformation), the system uses stochastic gradient descent, where at each iteration the gradient of the quantization error is computed over a small batch of data for a fixed assignment of data points (database entries) to codewords. After the gradient descent step is performed, the codeword assignments for the data points are recomputed. In other words, process 500 uses an iterative process that alternates between solving for the codebook of each layer and assigning database entries to codebook entries. Process 500 may be performed as part of, or concurrently with, process 300 of FIG. 3.
Process 500 may begin with the scalable inference system assigning a random database vector to each VQ codebook entry (505). The system may optimize the vector quantization error over several epochs using stochastic gradient descent over small batches of data (510). This gives the vector quantization an opportunity to partition the space before the PQ codebook entries are initialized and full joint training is performed. The system may initialize the PQ codebook entries by generating residuals from the vector quantization for a set of independent samples, projecting the residuals into the subspaces, and assigning the PQ codebook entries from the respective subspaces of the residuals (515). The system may also initialize the rotation matrix by filling a skew-symmetric matrix with random samples and deriving the rotation from it.
The system may then optimize the vector quantization error, the transform error, and the product quantization error using stochastic gradient descent over a small batch (e.g., 2,000 items) of data (525). This may include finding a set of violated constraints (but not necessarily all violated constraints) and adjusting the codebooks and assignments for the detected violations using gradient descent so that the violations no longer appear, i.e., so that no other entry's approximation scores higher than the database entry having the largest dot product. A violated constraint occurs when the approximate dot product generated using the hierarchical layers (i.e., using the codebooks and transform) indicates that the value between a first quantized database entry and the query is greater than the value between a second quantized database entry and the query, but the second database entry (i.e., the original database entry vector) actually has the highest dot product with the query. In other words, the approximation indicates that the first database item is more similar to the query than the second database item, even though the second item actually has the largest inner product with the query. As an example, the system may use the Adam optimization algorithm described in "Adam: A method for stochastic optimization", CoRR, abs/1412.6980, 2014, to optimize the parameters.
The system may determine whether additional iterations of the above steps are needed (530). If no violations are found in step 525, the iterations may be complete. If the number of iterations reaches a set limit (e.g., 30), the iterations may be complete. If the iterations are not complete (530, no), the system may continue to adjust the parameters by looking for violations, adjusting assignments, and adjusting the codebooks. If the iterations are complete (530, yes), the process 500 ends, having generated the VQ codebook, the PQ codebooks, and the learned transformation. When the system includes additional layers in the hierarchy, the additional codebooks are jointly learned in a similar manner.
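For intuition only, the following highly simplified sketch shows one alternating iteration of the kind described above: assignments are fixed, then a gradient step reduces the reconstruction error ||x − (φ_VQ(x) + R^T φ_PQ(r_x))||². It holds the rotation R fixed, processes one vector at a time rather than mini-batches, and omits the violated-constraint handling, the skew-symmetric parameterization of R, and Adam; all names and the learning rate are illustrative assumptions.

import numpy as np

def train_step(X, C, R, pq_codebooks, lr=0.01):
    K = len(pq_codebooks)
    sub_dim = X.shape[1] // K
    for x in X:
        # Assignment step: nearest VQ codeword, then nearest PQ entry per subspace.
        vq_idx = int(np.argmin(np.linalg.norm(C - x, axis=1)))
        r = R @ (x - C[vq_idx])
        s = np.empty_like(r)
        pq_idx = []
        for k in range(K):
            sub = r[k * sub_dim:(k + 1) * sub_dim]
            j = int(np.argmin(np.linalg.norm(pq_codebooks[k] - sub, axis=1)))
            pq_idx.append(j)
            s[k * sub_dim:(k + 1) * sub_dim] = pq_codebooks[k][j]
        # Gradient step on the selected codewords for 0.5 * ||x - (c + R^T s)||^2.
        err = x - (C[vq_idx] + R.T @ s)
        C[vq_idx] += lr * err                 # d(loss)/dc = -err
        g = R @ err                           # d(loss)/ds = -R err (chain rule)
        for k in range(K):
            pq_codebooks[k][pq_idx[k]] += lr * g[k * sub_dim:(k + 1) * sub_dim]
    return C, pq_codebooks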
FIGS. 6A to 6D illustrate the benefits of an implementation using hierarchical quantization. In the examples of FIGS. 6A to 6D, hierarchical quantization and other quantization processes are evaluated across four benchmark databases. Table 1 lists the characteristics of the four databases:
data set Dimension (d) Size (n)
movielens 150 10,681
netflix 300 17,770
word2vec_text8 200 71,291
word2vec_wiki 500 3,519,681
TABLE 1
FIG. 6A illustrates a comparison of the efficiency of different distance calculations. In particular, FIG. 6A illustrates the time (in μs) each query takes in a linear search of the database. In FIG. 6A, hierarchical quantization using an in-register codebook look-up table (LUT16) is compared to 1) the Hamming distance of binary codes (using XOR and POPCNT instructions) and 2) the asymmetric distance of product quantization codes (PQ, using a look-up table in memory). All three use the same number of bits (64): Hamming uses a 64-bit binary code, PQ uses 8 subspaces with 256 quantizers each, and LUT16 uses 16 subspaces with 16 quantizers each. The timing includes the distance calculation and the top-N selection. In the example of FIG. 6A, the computation is done with a single thread on a 3.5 GHz Haswell machine, with CPU scaling turned off. As shown, hierarchical quantization (LUT16) is significantly faster than PQ with an in-memory look-up table (about 5 times faster on the larger databases) and slightly faster than the Hamming distance calculation.
FIG. 6B illustrates the precision/recall curves when retrieving the top 10 neighbors on all four databases. FIG. 6B compares hierarchical quantization using an in-register codebook look-up table (Hierarchical) with four baseline methods: Signed ALSH, Simple LSH, Composite Quantization, and QUIPS. Signed ALSH is described in Shrivastava et al., "An Improved Scheme for Asymmetric LSH", CoRR, abs/1410.5410, 2014; Simple LSH is described in Neyshabur et al., "A Simpler and Better LSH for Maximum Inner Product Search (MIPS)", arXiv preprint arXiv:1410.5518, 2014; Composite Quantization is described in Du et al., "Inner product similarity search using composite codes", CoRR, abs/1406.4966, 2014; and QUIPS is described in Guo et al., "Quantization based fast inner product search", arXiv preprint arXiv:1509.01469, 2015. In generating the curves, ground-truth MIPS results are generated using a brute-force search and compared, in a fixed bit-rate setting, with the results of the baseline methods and of hierarchical quantization. FIG. 6B illustrates that hierarchical quantization tends to be significantly better than all four baseline methods, with better performance on the larger databases. Hierarchical quantization is highly scalable.
FIG. 6C illustrates recall@N for retrieving the top 10 neighbors, comparing hierarchical quantization with and without the learned rotation of the residuals on the largest database (e.g., word2vec_wiki). FIG. 6C illustrates that the learned rotation significantly improves recall. FIG. 6D illustrates recall@N for retrieving the top 10 neighbors when searching different fractions of the VQ partitions on the largest database. A system using hierarchical quantization divides the database into m partitions using vector quantization and searches only those database entries that share (e.g., at step 410 of FIG. 4) the VQ codewords selected for the query item. Thus, the system searches only a small portion of the database. This speeds up processing time, but can affect recall. FIG. 6D illustrates the recall curves for the largest data set at different search fractions t/m (where t is the number of VQ codewords selected for a query item, and m is the total number of partitions in the VQ codebook). As shown in FIG. 6D, there is virtually no recall loss when the search fraction is 1/4 (25%), and the loss is less than 2% when the search fraction is 1/16 (6.25%). The number of selected partitions (t) may be adjusted to favor speed (lower t) or recall (higher t), but FIG. 6D illustrates that t can be much lower than m while still obtaining accurate results.
FIG. 7 illustrates an example of a general purpose computer device 700, which can be the server 110 and/or client 170 of FIG. 1, that can be used with the techniques described herein. Computing device 700 is intended to represent various example forms of computing devices, such as laptops, desktops, workstations, personal digital assistants, cellular telephones, smartphones, tablets, servers, and other computing devices, including wearable devices. The components shown here, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the inventions described and/or claimed in this document.
Computing device 700 includes a processor 702, memory 704, storage 706, and expansion ports 710 connected via interfaces 708. In some implementations, the computing device 700 may include, among other components, a transceiver 746, a communication interface 744, and a GPS (global positioning system) receiver module 748 connected via the interface 708. Device 700 may communicate wirelessly through communication interface 744, which communication interface 744 may include digital signal processing circuitry as necessary. Each of the components 702, 704, 706, 708, 710, 740, 744, 746, and 748 may be suitably mounted on a common motherboard or in other manners.
The processor 702 can process instructions for execution within the computing device 700, including instructions stored in the memory 704 or on the storage device 706, to display graphical information for a GUI on an external input/output device, such as the display 716. The display 716 may be a monitor or a flat-panel touch screen display. In some implementations, multiple processors and/or multiple buses may be used, as appropriate, along with multiple memories and types of memory. Moreover, multiple computing devices 700 may be connected, with each device providing portions of the necessary operations (e.g., as a server bank, a group of blade servers, or a multi-processor system).
The memory 704 stores information within the computing device 700. In one implementation, the memory 704 is a volatile memory unit or units. In another implementation, the memory 704 is a non-volatile memory unit or units. The memory 704 may also be another form of computer-readable medium, such as a magnetic or optical disk. In some implementations, the memory 704 may include expansion memory provided through an expansion interface.
The storage device 706 is capable of providing mass storage for the computing device 700. In one implementation, the storage device 706 may be or include a computer-readable medium, such as a floppy disk device, a hard disk device, an optical disk device, or a tape device, a flash memory or other similar solid state storage device, or an array of devices, including devices in a storage area network or other configurations. A computer program product may be tangibly embodied in such a computer-readable medium. The computer program product may also include instructions that, when executed, perform one or more methods, such as those described above. The computer-or machine-readable medium is a storage device such as the memory 704, the storage device 706, or memory on the processor 702.
The interface 708 may be a high-speed controller that manages bandwidth-intensive operations for the computing device 700 or a low-speed controller that manages lower bandwidth-intensive operations, or a combination of such controllers. An external interface 740 may be provided to enable near area communication of device 700 with other devices. In some implementations, the controller 708 may be coupled to the storage device 706 and the expansion port 714. An expansion port, which may include various communication ports (e.g., USB, bluetooth, ethernet, wireless ethernet), may be coupled to one or more input/output devices, such as a keyboard, a pointing device, a scanner, or a networking device (such as a switch or router), for example, through a network adapter.
The computing device 700 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a standard server 730, or multiple times in a group of such servers. It may also be implemented as part of a rack server system. Further, it may be implemented in a personal computer such as laptop 722 or smart phone 736. An entire system may be made up of multiple computing devices 700 in communication with each other. Other configurations are also possible.
FIG. 8 illustrates an example of a general purpose computer device 800, which may be the server 110 of FIG. 1, that may be used with the techniques described herein. Computing device 800 is intended to represent various example forms of large-scale data processing equipment, such as servers, blade servers, data centers, mainframes, and other large-scale computing devices. Computing device 800 may be a distributed system having multiple processors interconnected by one or more communication networks, possibly including network-attached storage nodes. The components shown here, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the inventions described and/or claimed in this document.
The distributed computing system 800 may include any number of computing devices 880. Computing devices 880 may include servers or rack servers, mainframes, etc. that communicate over local or wide area networks, dedicated optical links, modems, bridges, routers, switches, wired or wireless networks, etc.
In some implementations, each computing device may include multiple racks. For example, computing device 880a includes multiple racks 858a-858 n. Each rack may include one or more processors, such as processors 852a-852n and 862a-862 n. The processor may include a data processor, a network attached storage device, and other computer controlled devices. In some implementations, one processor may operate as a main processor and control scheduling and data distribution tasks. The processors may be interconnected through one or more rack switches 858, and one or more racks may be connected through a switch 878. The switch 878 can handle communications between multiple connected computing devices 800.
Each rack may include memory, such as memory 854 and memory 864, as well as storage devices, such as 856 and 866. The storage devices 856 and 866 may provide mass storage, and may include volatile or non-volatile storage (such as network-attached disks, floppy disks, hard disks, optical disks, tapes, flash memory or other similar solid state memory devices), or an array of devices, including devices in a storage area network or other configurations. Storage 856 or 866 may be shared among multiple processors, multiple racks, or multiple computing devices, and may include a computer-readable medium storing instructions executable by one or more processors. The memories 854 and 864 may include, for example, one or more volatile memory units, one or more non-volatile memory units, and/or other forms of computer-readable media, such as magnetic or optical disks, flash memory, cache, Random Access Memory (RAM), Read Only Memory (ROM), and combinations thereof. Memory, such as memory 854, may also be shared between the processors 852a-852n. A data structure, such as an index, may be stored, for example, across storage 856 and memory 854. Computing device 800 may include other components not shown, such as controllers, buses, input/output devices, communication modules, and so forth.
An overall system, such as system 100, may be made up of multiple computing devices 800 in communication with each other. For example, device 880a may communicate with devices 880b, 880c, and 880d, and these may be collectively referred to as system 100. As another example, the system 100 of fig. 1 may include one or more computing devices 800. Some of the computing devices may be located geographically close to each other, while other computing devices may be located geographically far away. The layout of system 800 is merely an example, and the system may take on other layouts or configurations.
According to one aspect, a computer system includes at least one processor and a memory storing a database of quantized items. Each quantized term comprises a first entry in a first codebook and a plurality of second entries in a second codebook, wherein each of the plurality of second entries represents a respective subspace of k subspaces. The memory also includes instructions that, when executed by the at least one processor, cause the system to perform operations. The operations may include determining an entry from a first codebook that is most similar to a query vector, computing a residual vector from the entry and the query vector in the first codebook, transforming the residual vector using the learned transformation, and projecting the transformed residual vector into k subspaces. For each quantized entry having a first entry that matches the most similar entry in the first codebook to the query vector, the operations may further comprise: an inner product of the quantized terms and the transformed residual vector is computed for each subspace, and a similarity score between the quantized terms and the query is computed by summing the inner products. The operations may also include providing the item with the highest similarity score in response to the query.
These and other aspects may include one or more of the following features. For example, k may be 16 and the operations may further include: for each subspace, an inner product between the transformed residual vector and each entry in the second codebook is computed and stored in codebook look-up table register storage. Each subspace may have a corresponding register. As another example, the transform may be learned jointly with the first codebook and the second codebook. In some implementations, the first codebook may undergo an initialization and x learning periods before initializing the second codebook and performing joint learning. As another example, the residual may be a difference between an entry in the first codebook and the query vector. The database may be large, e.g., having millions of quantized entries.
According to one aspect, a method includes clustering a data store of database items represented as high-dimensional vectors and selecting a cluster center for each cluster and storing the cluster center as an entry in a first-level codebook. The method may further comprise, for each database entry, calculating a residual based on the cluster center of the cluster to which the database entry is assigned, projecting the residual into a subspace, and determining, for each subspace, an entry in the second-layer codebook for that subspace, and the entries in the first-layer codebook and the corresponding entries in the second-layer codebook for each subspace being stored as quantized vectors of the database entries.
These and other aspects can include one or more of the following features. For example, the quantized vector may be used to determine a response database entry using a maximum inner product search. As another example, the method may further include transforming the residual using the learned rotation before projecting the residual into the subspace. As another example, the method may include: t clusters that are most similar to the query vector are determined from the first codebook based on an inner product operation, and a residual of the query vector is calculated for each of the t clusters based on a cluster center of the cluster. The method may also include projecting each residual of the query into a subspace, and determining a maximum inner product score with the query vector for each database item assigned to one of the t clusters. The maximum inner product score is based on a sum over a subspace of inner products calculated between residuals of database items and residuals of queries assigned to clusters of database items. The method may also include identifying a database item that is most similar to the query vector from among the database items assigned to one of the t clusters based on the maximum inner product score. In some implementations, the database entries that are most similar to the query vector are used to classify the entries represented by the query vector or to provide the database entries in response to the query vector. As another example, the method may include transforming the residual using the learned rotation before projecting the residual into the subspace, wherein the learned rotation is jointly trained with parameters of the first layer codebook and the second layer codebook.
According to one aspect, a method may include partitioning vectors in a database into m partitions using vector quantization, such that each vector has an assigned vector quantization codeword, and computing, for each vector, a respective residual, the residual being a difference between the vector and a cluster center corresponding to the vector quantization codeword. The method may also include applying product quantization to each residual, thereby generating a product quantization codeword for each of k subspaces for each residual, storing an assigned vector quantization codeword and k product quantization codewords for the residuals of the vectors for each vector, and using the vector quantization codewords to select a portion of the database vector that is most similar to the query vector. The method may also include, for each database vector in the portion, determining a database vector from the portion that is most similar to the query vector using a product-quantized codeword.
These and other aspects can include one or more of the following features. For example, the method may further comprise transforming the residual using the learned rotation before applying the product quantization. The learned rotation may be learned jointly with a codebook for vector quantization and a codebook for product quantization. As another example, using the vector quantization codeword to select a portion of the database vector may include performing an inner product between the query vector and each cluster center to produce a similarity value for each cluster center; and selecting the cluster center having the highest similarity value. The portion is a vector in the database having a vector quantized codeword corresponding to one of the t cluster centers. As another example, the respective residuals are first respective residuals, the vector quantized codewords are first vector quantized codewords, and the method may further comprise partitioning the first respective residuals into a plurality of second partitions using second vector quantization, so each vector has an assigned second vector quantized codeword, and calculating, for each of the first respective residuals, a second respective residual, the second residual being a difference between the first respective residual and a cluster center corresponding to the second vector quantized codeword. Product quantization may be applied to the second corresponding residual and the second vector quantized codeword stored with the first vector quantized codeword and the k product quantized codewords.
Various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium," "computer-readable medium" refer to any non-transitory computer program product, apparatus and/or device (e.g., magnetic discs, optical disks, memory (including read-access memory), Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), and the internet.
The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
Many implementations have been described. Nevertheless, various modifications may be made without departing from the spirit and scope of the invention. Moreover, the logic flows depicted in the figures do not require the particular order shown, or sequential order, to achieve desirable results. In addition, other steps may be provided, or steps may be eliminated, from the described flows, and other components may be added to, or removed from, the described systems. Accordingly, other implementations are within the scope of the following claims.

Claims (17)

1. A computer system for fast database searching, comprising:
at least one processor; and
a memory to store:
a database of quantized terms, each quantized term corresponding to a respective database term represented by a database vector in vector space, each of the quantized terms comprising a concatenation of a first index of a first codebook generated using vector quantization followed by a plurality of second indices of a second codebook generated using product quantization, wherein each of the plurality of second indices represents a respective subspace of k subspaces, and
instructions that, when executed by the at least one processor, cause the system to:
determining from the first codebook a cluster center that is most similar to a query vector,
computing a residual vector from the cluster center and the query vector,
transforming the residual vector using the learned transformation,
projecting the transformed residual vectors into the k subspaces,
for each quantized term having a first index corresponding to the cluster center determined for the query vector:
computing an inner product between the quantized term and the transformed residual vector for each subspace, an
Calculating a similarity score between the quantized term and the query vector by summing the inner products, an
Providing the quantized term having the highest similarity score in response to the query vector.
2. The system of claim 1, wherein k is 16.
3. The system of claim 2, wherein the memory further stores instructions that, when executed by the at least one processor, cause the system to:
for each subspace, calculating an inner product between the transformed residual vector and each cluster center in the second codebook; and
storing the calculated inner product in a memory device of a register of a codebook lookup table;
wherein the cluster center is an entry included in the second codebook.
4. The system of claim 3, wherein each subspace has a corresponding register.
5. The system of claim 1, wherein the transformation is learned jointly with the first codebook and the second codebook.
6. The system of claim 5, wherein the first codebook is subject to an initialization and x learning periods prior to initializing the second codebook and performing joint learning.
7. The system of claim 1, wherein the database has millions of quantified terms.
8. A method for fast database searching, comprising:
clustering data stores of database items represented as high-dimensional vectors;
selecting a clustering center for each cluster, and storing the clustering centers as entries in a first-layer codebook;
for each database entry:
calculating a residual between the vector of the database entry and a cluster center of a cluster to which the database entry is assigned,
the residual is projected into a subspace,
for each of the subspaces, determining an entry in a second layer codebook for the subspace, and
storing a concatenation of an entry in the first layer codebook and a corresponding entry of a second layer codebook for each of the subspaces as a quantized vector of the database entry;
determining t clusters most similar to a query vector from the first-layer codebook based on inner product operation;
calculating t query residuals, wherein the query residuals in the t query residuals are calculated through cluster centers of clusters in the t clusters and the query vectors;
projecting each query residual of the t query residuals into the subspace;
for each database item assigned to one of the t clusters, determining a maximum inner product score with the query vector, the maximum inner product score being based on a sum over the subspace of inner products calculated between residuals of the database items and query residuals of clusters assigned to the database items; and
identifying a database item that is most similar to the query vector from among database items assigned to one of the t clusters based on the maximum inner product score.
9. The method of claim 8, wherein the quantized vector is used to determine a response database entry using a maximum inner product search.
10. The method of claim 8, further comprising: transforming the residual using the learned rotation before projecting the residual into a subspace.
11. The method of claim 8, further comprising: the items represented by the query vector are classified using the database items that are most similar to the query vector.
12. The method of claim 8, further comprising: providing database entries in response to the query vector using the database entry that is most similar to the query vector.
13. The method of claim 8, further comprising:
transforming the residual using the learned rotation before projecting the residual into a subspace,
wherein the learned rotations are jointly trained with parameters of the first layer codebook and the second layer codebook.
14. A method for fast database searching, comprising:
partitioning vectors in a database into m partitions using vector quantization such that each vector has an assigned vector quantization codeword;
for each of the vectors, calculating a respective residual that is a difference between the vector and a cluster center corresponding to the vector quantized codeword;
applying a product quantization to each of the respective residuals, thereby generating a product quantization codeword for each of k subspaces for each respective residual;
storing, for each vector, the assigned vector quantization codeword and k product quantization codewords for the respective residuals of the vector;
selecting a portion of the database vector that is most similar to a query vector using the vector quantization codeword by:
performing an inner product between the query vector and each cluster center to produce a similarity value for each cluster center; and
selecting t cluster centers having the highest similarity values, wherein the portion is a vector in the database having a vector quantized codeword corresponding to one of the t cluster centers; and
for each of the database vectors in the portion, determining a database vector from a portion that is most similar to the query vector using the product quantization codeword.
15. The method of claim 14, further comprising: transforming the respective residuals using the learned rotations before applying product quantization.
16. The method of claim 15, wherein the learned rotation is jointly trained using parameters of a codebook used for the vector quantization and a codebook used for the product quantization.
17. The method of claim 14, the respective residual being a first respective residual, the vector quantized codeword being a first vector quantized codeword, and the method further comprising:
partitioning the first respective residuals into a plurality of second partitions using second vector quantization, so each vector has an assigned second vector quantization codeword; and
for each of the first respective residuals, calculating a second respective residual that is a difference between the first respective residual and a cluster center corresponding to the second vector quantized codeword,
wherein the product quantization is applied to the second corresponding residual and the second vector quantization codeword is stored with the first vector quantization codeword and the k product quantization codewords.
CN201780063107.2A 2016-10-11 2017-09-20 Fast database search system and method Active CN110168525B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US15/290,198 US10719509B2 (en) 2016-10-11 2016-10-11 Hierarchical quantization for fast inner product search
US15/290,198 2016-10-11
PCT/US2017/052517 WO2018071148A1 (en) 2016-10-11 2017-09-20 Fast database search systems and methods

Publications (2)

Publication Number Publication Date
CN110168525A CN110168525A (en) 2019-08-23
CN110168525B true CN110168525B (en) 2022-03-01

Family

ID=59997498

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201780063107.2A Active CN110168525B (en) 2016-10-11 2017-09-20 Fast database search system and method

Country Status (4)

Country Link
US (1) US10719509B2 (en)
EP (1) EP3526689A1 (en)
CN (1) CN110168525B (en)
WO (1) WO2018071148A1 (en)

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015151155A1 (en) * 2014-03-31 2015-10-08 株式会社日立国際電気 Personal safety verification system and similarity search method for data encrypted for confidentiality
CN108241745B (en) * 2018-01-08 2020-04-28 阿里巴巴集团控股有限公司 Sample set processing method and device and sample query method and device
US20190244080A1 (en) * 2018-02-02 2019-08-08 Wisconsin Alumni Research Foundation Neural Network Processor with On-Chip Convolution Kernel Storage
US11120070B2 (en) * 2018-05-21 2021-09-14 Microsoft Technology Licensing, Llc System and method for attribute-based visual search over a computer communication network
CN110727769B (en) * 2018-06-29 2024-04-19 阿里巴巴(中国)有限公司 Corpus generation method and device and man-machine interaction processing method and device
CN109543069B (en) * 2018-10-31 2021-07-13 北京达佳互联信息技术有限公司 Video recommendation method and device and computer-readable storage medium
US11354287B2 (en) * 2019-02-07 2022-06-07 Google Llc Local orthogonal decomposition for maximum inner product search
CN111221915B (en) * 2019-04-18 2024-01-09 西安睿德培欣教育科技有限公司 Online learning resource quality analysis method based on CWK-means
US11775589B2 (en) * 2019-08-26 2023-10-03 Google Llc Systems and methods for weighted quantization
US11494734B2 (en) * 2019-09-11 2022-11-08 Ila Design Group Llc Automatically determining inventory items that meet selection criteria in a high-dimensionality inventory dataset
CN114245896A (en) * 2019-10-31 2022-03-25 北京欧珀通信有限公司 Vector query method and device, electronic equipment and storage medium
JP2021149613A (en) * 2020-03-19 2021-09-27 株式会社野村総合研究所 Natural language processing apparatus and program
CN113495965A (en) * 2020-04-08 2021-10-12 百度在线网络技术(北京)有限公司 Multimedia content retrieval method, device, equipment and storage medium
US11914670B2 (en) * 2020-09-08 2024-02-27 Huawei Technologies Co., Ltd. Methods and systems for product quantization-based compression of a matrix
CN112116436B (en) * 2020-10-14 2023-07-25 中国平安人寿保险股份有限公司 Intelligent recommendation method and device, computer equipment and readable storage medium
CN112418298B (en) * 2020-11-19 2021-12-03 北京云从科技有限公司 Data retrieval method, device and computer readable storage medium
CN113159211B (en) * 2021-04-30 2022-11-08 杭州好安供应链管理有限公司 Method, computing device and computer storage medium for similar image retrieval
US11886445B2 (en) * 2021-06-29 2024-01-30 United States Of America As Represented By The Secretary Of The Army Classification engineering using regional locality-sensitive hashing (LSH) searches
CN113656373A (en) * 2021-08-16 2021-11-16 百度在线网络技术(北京)有限公司 Method, device, equipment and storage medium for constructing retrieval database
EP4160434A4 (en) * 2021-08-16 2023-12-13 Baidu Online Network Technology (Beijing) Co., Ltd Method and apparatus for constructing search database, and device and storage medium
CN117609488B (en) * 2024-01-22 2024-03-26 清华大学 Method and device for searching small-weight code words, computer storage medium and terminal

Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5194950A (en) * 1988-02-29 1993-03-16 Mitsubishi Denki Kabushiki Kaisha Vector quantizer
JP3224955B2 (en) * 1994-05-27 2001-11-05 株式会社東芝 Vector quantization apparatus and vector quantization method
JP3387750B2 (en) * 1996-09-02 2003-03-17 株式会社リコー Shading processing equipment
US6404925B1 (en) * 1999-03-11 2002-06-11 Fuji Xerox Co., Ltd. Methods and apparatuses for segmenting an audio-visual recording using image similarity searching and audio speaker recognition
WO2001046858A1 (en) * 1999-12-21 2001-06-28 Matsushita Electric Industrial Co., Ltd. Vector index creating method, similar vector searching method, and devices for them
US7152065B2 (en) 2003-05-01 2006-12-19 Telcordia Technologies, Inc. Information retrieval and text mining using distributed latent semantic indexing
WO2007132313A2 (en) * 2006-05-12 2007-11-22 Nokia Corporation Feedback frame structure for subspace tracking precoding
US8077994B2 (en) * 2008-06-06 2011-12-13 Microsoft Corporation Compression of MQDF classifier using flexible sub-vector grouping
US8737504B2 (en) * 2009-10-05 2014-05-27 Samsung Electronics Co., Ltd. Method and system for feedback of channel information
US9264713B2 (en) * 2012-07-11 2016-02-16 Qualcomm Incorporated Rotation of prediction residual blocks in video coding with transform skipping
US9710493B2 (en) 2013-03-08 2017-07-18 Microsoft Technology Licensing, Llc Approximate K-means via cluster closures
CN104918046B (en) * 2014-03-13 2019-11-05 中兴通讯股份有限公司 A kind of local description compression method and device
JP6443858B2 (en) 2014-11-20 2018-12-26 インターナショナル・ビジネス・マシーンズ・コーポレーションInternational Business Machines Corporation Calculation device, calculation method, learning device, learning method, and program
US9721186B2 (en) * 2015-03-05 2017-08-01 Nant Holdings Ip, Llc Global signatures for large-scale image recognition
US10255323B1 (en) 2015-08-31 2019-04-09 Google Llc Quantization-based fast inner product search
WO2017077076A1 (en) * 2015-11-06 2017-05-11 Thomson Licensing Method and apparatus for generating codebooks for efficient search

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Quantization based fast inner product search;Ruiqi Guo et al;《arXiv》;20150904;1-17页 *

Also Published As

Publication number Publication date
CN110168525A (en) 2019-08-23
EP3526689A1 (en) 2019-08-21
US10719509B2 (en) 2020-07-21
WO2018071148A1 (en) 2018-04-19
US20180101570A1 (en) 2018-04-12

Similar Documents

Publication Publication Date Title
CN110168525B (en) Fast database search system and method
US11392596B2 (en) Efficient inner product operations
He et al. K-means hashing: An affinity-preserving quantization method for learning binary compact codes
Liu et al. Distributed adaptive binary quantization for fast nearest neighbor search
US10255323B1 (en) Quantization-based fast inner product search
US9442929B2 (en) Determining documents that match a query
CN109522435B (en) Image retrieval method and device
CN110188825B (en) Image clustering method, system, device and medium based on discrete multi-view clustering
EP3115908A1 (en) Method and apparatus for multimedia content indexing and retrieval based on product quantization
Zhang et al. Grip: Multi-store capacity-optimized high-performance nearest neighbor search for vector search engine
US20190294455A1 (en) Computer architecture for emulating a correlithm object processing system that places portions of correlithm objects in a distributed node network
Soltaniyeh et al. An accelerator for sparse convolutional neural networks leveraging systolic general matrix-matrix multiplication
Jain et al. Approximate search with quantized sparse representations
EP3115909A1 (en) Method and apparatus for multimedia content indexing and retrieval based on product quantization
Antaris et al. In-memory stream indexing of massive and fast incoming multimedia content
Mourão et al. Balancing search space partitions by sparse coding for distributed redundant media indexing and retrieval
CN111033495A (en) Multi-scale quantization for fast similarity search
Yao et al. A selective review on statistical techniques for big data
Lovagnini et al. CIRCE: Real-time caching for instance recognition on cloud environments and multi-core architectures
US20190294459A1 (en) Computer architecture for emulating a correlithm object processing system that uses portions of correlithm objects in a distributed node network
Ng et al. Two-layer localized sensitive hashing with adaptive re-ranking
US20240143525A1 (en) Transferring non-contiguous blocks of data using instruction-based direct-memory access (dma)
Agarwal et al. Execution-and prediction-based auto-tuning of parallel read and write parameters
Iscen et al. Local orthogonal-group testing
US10915340B2 (en) Computer architecture for emulating a correlithm object processing system that places multiple correlithm objects in a distributed node network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant