US20230090262A1 - Neural hashing for similarity search - Google Patents

Neural hashing for similarity search

Info

Publication number
US20230090262A1
Authority
US
United States
Prior art keywords
pseudo
training
vector
vectors
floating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
US17/795,233
Other versions
US11763136B2
Inventor
Daphna Idelson
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
GSI Technology Inc
Original Assignee
GSI Technology Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by GSI Technology Inc
Priority to US17/795,233
Assigned to GSI TECHNOLOGY INC. Assignors: IDELSON, Daphna
Publication of US20230090262A1
Application granted
Publication of US11763136B2
Legal status: Active
Anticipated expiration

Classifications

    • G06N3/0454
    • G06N3/045 Combinations of networks
    • G06N3/047 Probabilistic or stochastic networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis

Definitions

  • the output similarity scores may be normalized to the range of (−1, +1), using the binary code length n.
  • the normalized pseudo-Hamming similarity S Hi between a pseudo-bipolar encoded vector of sample q, f(q), and a pseudo-bipolar encoded vector of reference point v i , f(v i ), is defined as in equation 5.
  • the input cosine distribution PD C and the output pseudo-bipolar Hamming distribution PD H may be used by probability distribution loss calculator 433 to calculate the probability distribution loss function L KL , using the Kullback-Leibler divergence, D, also called the relative entropy.
  • Kullback-Leibler divergence is a measure of how the output probability distribution is different from the input or “target” probability distribution.
  • loss function L KL for sample q is defined using the Kullback-Leibler divergence, which, as shown in FIG. 4 D , utilizes the input and output probability distributions P Ci and P Hi vis-à-vis query q.
  • This minimization objective for creating binary, locality preserving vectors may be described as an explicit, multi-wise normalized KL divergence loss, where “explicit” refers to using relations between distances rather than implicitly using some space partitioning method, such as labels; “multi-wise” refers to using multiple reference points which preserve relations among more than two or three items; and “normalized” refers to similarity-to-similarity divergence minimization.
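  • By way of illustration only, the per-sample loss described above might be sketched in PyTorch as follows. The function name pdlf_loss, the tensor shapes, the direction of the divergence, and the use of the inner product divided by the code length n as the normalized pseudo-Hamming similarity of equation 5 are assumptions made for this sketch, not a definitive implementation.

```python
import torch
import torch.nn.functional as F

def pdlf_loss(q, v, f_q, f_v):
    """KL divergence between the input cosine distribution PD_C and the output
    pseudo-bipolar Hamming distribution PD_H for one sample.
    q: (d,) input sample; v: (k, d) reference set;
    f_q: (n,) pseudo-bipolar code; f_v: (k, n) pseudo-bipolar reference codes."""
    n = f_q.shape[0]
    s_c = F.cosine_similarity(q.unsqueeze(0), v, dim=1)  # k cosine similarities S_Ci
    s_h = (f_v @ f_q) / n                # normalized pseudo-Hamming similarities, in (-1, +1)
    p_c = F.softmax(s_c, dim=0)          # target distribution PD_C
    log_p_h = F.log_softmax(s_h, dim=0)  # output distribution PD_H, in log form
    # D_KL(PD_C || PD_H); kl_div expects log-probabilities as its first argument.
    return F.kl_div(log_p_h, p_c, reduction="sum")

# Hypothetical usage with random stand-in tensors (d = 128, k = 10, n = 64):
q, v = torch.randn(128), torch.randn(10, 128)
f_q, f_v = torch.tanh(torch.randn(64)), torch.tanh(torch.randn(10, 64))
print(pdlf_loss(q, v, f_q, f_v))
```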
  • Applicant has realized that, to generate a meaningful similarity distribution around any query q, using meaningful points around q, such as its k nearest neighbors, as reference points to q may be most useful. Using the k nearest neighbors to q to form a distribution may exploit relative information that may preserve ranking between data points during the training process, and may avoid the noise of irrelevant, far away points, such as might be present if all vectors are used as reference vectors to each sample vector q.
  • proxy sets may be re-generated at each batch-iteration, to provide full augmentation and varieties of reference sets, for a more general solution.
  • ongoing subsampling of the training set may ensure a good representation of its distribution.
  • sample vector selector 413 may select a batch size b of sample vectors q j from the training set T.
  • KNN vector set generator 414 may determine the set N of k nearest neighbor vectors v i , to each vector q.
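  • A hedged sketch of this per-iteration construction of proxy and reference sets follows; the function name, the proxy size r, the batch size b, the neighbor count k, and the use of cosine similarity for the nearest-neighbor step are illustrative assumptions.

```python
import numpy as np

def make_training_batch(T, r, b, k, rng):
    """Per batch-iteration: subsample a proxy set R from training set T, select
    b sample vectors q_j, and pair each q_j with its k nearest neighbors in R
    (by cosine similarity) as its reference set N_j."""
    R = T[rng.choice(len(T), size=r, replace=False)]   # proxy set, re-drawn each iteration
    Q = T[rng.choice(len(T), size=b, replace=False)]   # batch of sample vectors
    Rn = R / np.linalg.norm(R, axis=1, keepdims=True)
    Qn = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    sims = Qn @ Rn.T                                   # (b, r) cosine similarities
    nn_idx = np.argsort(-sims, axis=1)[:, :k]          # k nearest proxies per sample
    return Q, R[nn_idx]                                # Q: (b, d), N: (b, k, d)

T = np.random.default_rng(0).standard_normal((10000, 128))  # stand-in training set
Q, N = make_training_batch(T, r=2000, b=32, k=10, rng=np.random.default_rng(1))
print(Q.shape, N.shape)   # (32, 128) (32, 10, 128)
```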
  • FIG. 5 A illustrates an exemplary input distribution 50 , and two exemplary output distributions 55 and 59 , where output distribution 55 shows the results of one of the early iterations of neural network under training 42 and output distribution 59 shows the results of one of the later iterations of neural network under training 42 .
  • output distribution 55 shows the results of one of the early iterations of neural network under training 42
  • output distribution 59 shows the results of one of the later iterations of neural network under training 42 .
  • Input vector q and its output vectors f(q) are shown, for clarity, as small circles, and their 5 associated reference vectors, v 1 -v 5 and f(v 1 )-f(v 5 ), are shown as x's.
  • the 5 associated reference vectors, v 1 -v 5 , have different lengths, where reference vector v 1 is shortest (i.e., closest to input vector q) and reference vector v 5 is longest (i.e., furthest away from input vector q).
  • FIG. 5 B illustrates an exemplary distribution of the input vectors in 2-D in training vector set T, each vector illustrated as a dot 61 , that may be stored in training data vector store 411 .
  • a subset of r vectors, forming proxy set R and illustrated with additional x's 62 , represents the vectors selected by proxy vector set generator 412 .
  • Sample vectors q, as selected by sample vector selector 413 , are illustrated as large white dots 65 .
  • k nearest neighbor vectors, v i as selected by vector set generator 414 , are illustrated as x's 66 which are connected to the associated sample vector q j , within a bounding circle 68 . Note that the bounding circles 68 cover only portions of training vector set T and that some of the bounding circles 68 may overlap.
  • the resultant trained neural network may be used in inference mode in production.
  • the neural network encodes input floating-point vectors v into true binary encoded output vectors f(v), as opposed to the pseudo-bipolar encoded output vectors generated during training.
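  • For illustration, one simple way to obtain such true binary codes from the training-time pseudo-bipolar outputs is to threshold them at zero; this sketch is an assumption about the inference-time binarization, not the claimed output layer itself.

```python
import numpy as np

def binarize(pseudo_bipolar):
    # Threshold the beta-scaled tanh outputs (values striving toward -1 or +1)
    # at zero to obtain a true binary code {0, 1} for ANN search.
    return (pseudo_bipolar > 0).astype(np.uint8)

f_v = np.tanh(3.0 * np.random.default_rng(0).standard_normal(64))  # stand-in pseudo-bipolar vector
print(binarize(f_v))
```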
  • FIGS. 7 A and 7 B illustrate recall-codelength curves for benchmark dataset Sift1M.
  • the encoder was trained on the predefined training set of 100k samples and evaluated on the predefined 10K sample set, searched against the 1M database set.
  • the ground-truth is the real 100-nearest neighbors in the original feature space using cosine similarity.
  • LSH Locality-Sensitive-Hashing
  • NPH Neural Proxy Hash
  • Embodiments of the present invention may include apparatus for performing the operations herein.
  • This apparatus may be specially constructed for the desired purposes, or it may comprise a computing device or system typically having at least one processor and at least one memory, selectively activated or reconfigured by a computer program stored in the computer.
  • the resultant apparatus when instructed by software may turn the general purpose computer into inventive elements as discussed herein.
  • the instructions may define the inventive device in operation with the computer platform for which it is desired.
  • Such a computer program may be stored in a computer readable storage medium, such as, but not limited to, any type of disk, including optical disks, magnetic-optical disks, read-only memories (ROMs), volatile and non-volatile memories, random access memories (RAMs), electrically programmable read-only memories (EPROMs), electrically erasable and programmable read only memories (EEPROMs), magnetic or optical cards, Flash memory, disk-on-key or any other type of media suitable for storing electronic instructions and capable of being coupled to a computer system bus.
  • the computer readable storage medium may also be implemented in cloud storage.
  • Some general purpose computers may comprise at least one communication element to enable communication with a data network and/or a mobile communications network.

Abstract

A system for training a neural-network-based floating-point-to-binary feature vector encoder preserves the locality relationships between samples in an input space over to an output space. The system includes a neural network under training and a probability distribution loss function generator. The neural network has floating-point inputs and floating-point pseudo-bipolar outputs. The generator compares an input probability distribution constructed from floating-point cosine similarities of an input space and an output probability distribution constructed from floating-point pseudo-bipolar pseudo-Hamming similarities of an output space. The system includes a proxy vector set generator to take a random sampling of vectors from training data for a proxy set, a sample vector selector to select sample vectors from the training data and a KNN vector set generator to find a set of k nearest neighbors closest to each sample vector from said proxy set for a reference set.

Description

    CROSS REFERENCE TO RELATED APPLICATIONS
  • This application claims priority from U.S. provisional patent application 63/043,215, filed Jun. 24, 2020, which is incorporated herein by reference.
  • FIELD OF THE INVENTION
  • The present invention relates to similarity search generally and to approximate nearest neighbor search in particular.
  • BACKGROUND OF THE INVENTION
  • Large-scale similarity search in general has become a fundamental task in recent information retrieval applications. Similarity search is a general term used for a range of techniques which share the principle of searching, typically, very large object sets, where the only available comparator is the similarity between any pair of objects. More information on similarity search may be found in the Wikipedia article ‘Similarity search’ found at https://en.wikipedia.org/wiki/Similarity_search.
  • Nearest neighbor search, in particular, is a fundamental technique that plays a crucial part in applications such as image retrieval, face recognition, and document and text search. Nearest-Neighbor (NN) search is defined as retrieving, from a database of candidate items, the candidate items close to a given sample item. Distance or proximity is defined by a distance metric, such as a Euclidean distance (for example, L2 distance) or an angular distance (for example, cosine similarity). K-Nearest-Neighbor (KNN) search is the retrieval of the K nearest neighbors to an item and is used either as is (for example, to present results in a web search) or as a prediction algorithm, for classification (using a voting method) or for regression (using an averaging method). More information on nearest neighbor search may be found in the Wikipedia article ‘Nearest neighbor search’ found at https://en.wikipedia.org/wiki/Nearest_neighbor_search.
  • NN is usually not performed on raw data like images or text, as distance between raw data values does not hold much information. Data first needs to be transformed to d-dimensional feature vectors that have a meaningful distance metric in the feature space for the task at hand. These real-valued feature vector representations are often called feature embeddings, and hopefully hold desired semantic information of the input data, so that semantically similar inputs fall close to one another in the embedding space. Such feature vectors in the same embedding space can be compared using a distance (or similarity) metric, such as cosine similarity or L2 distance.
  • Cosine similarity is a similarity metric between two non-zero feature vectors. It is equal to the cosine of the angle between the two vectors. This is also the same as the inner product of the same vectors normalized to both have length 1. The cosine of the two vectors can be derived using the dot product formula shown in equation 1:

  • $A \cdot B = \|A\| \, \|B\| \cos\theta$  (1)
  • Where θ is the angle between the two vectors A and B. More information on cosine similarity may be found in the Wikipedia article ‘Cosine similarity’ found at https://en.wikipedia.org/wiki/Cosine_similarity.
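  • As a brief illustration of equation 1 (the function name and example vectors below are hypothetical), cosine similarity may be computed as follows:

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (A . B) / (||A|| ||B||), rearranged from equation 1
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([0.3, -1.2, 0.7])
b = np.array([0.1, -0.9, 0.4])
print(cosine_similarity(a, b))  # approximately 0.99 for these nearly parallel vectors
```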
  • The process of converting raw input data to feature vectors, or feature embeddings, is known as feature extraction, embedding, or X2Vec (i.e., from some kind of data X to a feature vector) and there are many methods, including deep neural networks (discussed hereinbelow), domain specific feature engineering, and other machine learning methods that are used for this process. As examples, such feature extraction models can be Word2Vec for word embeddings, deep-based FaceNet for face embeddings and SIFT feature detection for pattern matching in images. Hereinafter, this process will be referred to as ‘data encoding’ and it is assumed that the converter is a given and its feature space similarity is desired and required to be preserved.
  • Reference is now made to FIG. 1 which illustrates a KNN search system 10. KNN search system 10 has a data encoder 12 and a KNN searcher 14. Data encoder 12 transforms raw data di into floating-point data vectors fvi, a vector format having the measurable quantities required for KNN search, as mentioned hereinabove. KNN searcher 14 can then perform a KNN similarity search on a set of vectors output by data encoder 12.
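  • For concreteness, the KNN searcher stage may be thought of as a brute-force search over already-encoded vectors, as in the following minimal sketch (names and data are hypothetical; practical systems use more efficient search structures):

```python
import numpy as np

def knn_search(query, database, k):
    # Normalize rows so that a dot product equals the cosine similarity.
    q = query / np.linalg.norm(query)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    similarities = db @ q                 # one cosine similarity per database vector
    return np.argsort(-similarities)[:k]  # indices of the k most similar vectors

fv = np.random.default_rng(0).standard_normal((1000, 128))  # stand-in for encoder outputs fv_i
print(knn_search(fv[0], fv, k=5))        # the query itself is returned first
```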
  • For KNN in large retrieval databases, the search in the feature representation space often requires significant computation and memory resources and imposes a performance bottleneck. As data volumes become increasingly large, containing millions and billions of items, and content search becomes a widely required task, methods for fast Approximate-Nearest-Neighbor (ANN) search, which trades off a slight loss in accuracy for large performance gains, have become the focus of extensive research. There are a number of ANN techniques, including graph-based methods, clustering methods, and Hashing methods, each of which has its own limitations when used with large datasets and different hardware.
  • Hashing methods aim to map data points into low-dimensional representations, or compact bit-codes, for efficient comparison and reduction of memory space. One of the most popular hashing methods is locality sensitive hashing (LSH). LSH maps high-dimensional points to low-dimensional points using a set of random projections. Though theoretically robust and efficient, also on high dimensional vectors and output code-lengths, classic LSH methods are data-independent and are in many cases empirically outperformed by data-dependent methods that exploit specific data structure and distribution.
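  • A minimal sketch of one common variant of classic LSH, random hyperplane (sign) projections, is shown below; the code length and data are illustrative.

```python
import numpy as np

def lsh_hash(vectors, n_bits, seed=0):
    # Project onto n_bits random hyperplanes and keep only the sign of each
    # projection, giving one n_bits-long binary code per input vector.
    rng = np.random.default_rng(seed)
    projections = rng.standard_normal((vectors.shape[1], n_bits))
    return (vectors @ projections > 0).astype(np.uint8)

x = np.random.default_rng(1).standard_normal((4, 64))  # four 64-dimensional feature vectors
print(lsh_hash(x, n_bits=16))                          # four 16-bit codes
```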
  • One main group of data-dependent methods for ANN is based on binary hashing, which maps data points in the original floating-point, feature vector representation space into binary codes in the Hamming space for compact representation and fast search. Similarity search in the Hamming space is measured using Hamming distance or Hamming similarity. The Hamming distance between two binary strings of equal length is the number of positions at which the corresponding symbols are different. In other words, it measures the minimum number of substitutions needed to change one binary string into the other. Hamming similarity is the complement of Hamming distance: the number of positions at which the corresponding symbols are the same. More information on Hamming distance may be found in the Wikipedia article ‘Hamming distance’ found at https://en.wikipedia.org/wiki/Hamming_distance.
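  • As a short illustration of these definitions (the example codes are arbitrary):

```python
import numpy as np

def hamming_distance(a, b):
    # Number of positions at which two equal-length binary codes differ.
    return int(np.count_nonzero(a != b))

def hamming_similarity(a, b):
    # Complement of the Hamming distance: positions at which the codes agree.
    return len(a) - hamming_distance(a, b)

x = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y = np.array([1, 1, 1, 0, 0, 0, 1, 1])
print(hamming_distance(x, y), hamming_similarity(x, y))  # 3 5
```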
  • Reference is now made to FIG. 2 which illustrates an ANN search system 20 which has a data encoder 12, similar to that in KNN search system 10 in FIG. 1 , a floating-point to binary encoder 22, and an ANN searcher 24. Data encoder 12 encodes raw data di into vectors fvi, then floating-point to binary encoder 22 converts floating-point data vectors fvi into binary encoded data vectors bvi. ANN searcher 24 then performs an approximate similarity search on a set of binary vectors.
  • One implementation of floating-point to binary encoder 22 uses a shallow neural network to encode vectors. It should be noted that, in vector to binary conversion, it is essential that the similarity relationships between the binary encoded vectors and the similarity relationships between the original floating-point vectors are preserved as best as possible, while finding the balance between loss in accuracy and resources, such as memory and search-time.
  • A brief explanation of a standard neural network follows, with respect to FIGS. 3A and 3B, which illustrate the components of a feedforward neural network 30. Neural network 30 has an input layer 32, a plurality of hidden layers 33, and an output layer 34. Each layer is connected to the next and previous layers. Layers are made up of nodes 36; the number of layers and the number of nodes varies with the input complexity and the purpose of the neural network. As shown in FIG. 3B, nodes are connected to make a path from input layer 32 through to output layer 34. The connections are called edges 37. Nodes 36 and edges 37 are weighted with weights Wn,h, where n is the node number and h is the layer number. A weight Wn,h adjusts the strength of the signal at that connection.
  • Before a neural network can be used, it needs to be trained to perform a specific feature extraction task. Training is a process by which weights, Wn,h, are adjusted throughout the network, using example data called training data, until the network operates as expected. Data sets available for neural networks are divided into training, validation and test sets, so that the neural network can use different data sets during different phases of the training and evaluation processes. At the start of training, the network weights are randomly initialized, and then adjusted according to a defined error, called the loss or cost function, between a desired output and the actual output. This is an iterative process called gradient descent through back propagation. More detail may be found in the Wikipedia article “Artificial neural network”, stored at https://en.wikipedia.org/wiki/Artificial_neural_network.
  • Such a forward and backward pass performed to adjust the neural network's weights is called an iteration. An iteration may be performed simultaneously on a plurality of samples from the training set, called a batch or mini-batch, and the number of samples is called the batch size. The number of times the entire training set is passed through the network is known as the number of epochs.
  • Once training and verification is completed, the network may be used operationally on unknown inputs, in what is known as ‘inference’ mode. It should be noted that during training, special layers may be present in the network to facilitate training and loss function generation, and these may be removed and other non-parametric layers that are required for inference added prior to operation.
  • SUMMARY OF THE PRESENT INVENTION
  • There is provided, in accordance with a preferred embodiment of the present invention, a method for training a neural-network-based floating-point to binary feature vector encoder to preserve the locality relationships between samples in an input space over to an output space. The method includes having a neural network under training which has floating-point inputs and floating-point pseudo-bipolar outputs, and generating a loss function which compares an input probability distribution constructed from floating-point cosine similarities of the input space and an output probability distribution constructed from floating-point pseudo-bipolar pseudo-Hamming similarities of the output space.
  • Moreover, in accordance with a preferred embodiment of the present invention, the generating includes calculating the output probability distribution between a floating-point pseudo-bipolar encoded sample vector and a set of pseudo-bipolar encoded reference vectors.
  • Further, in accordance with a preferred embodiment of the present invention, the method also includes taking a random sampling of a plurality of vectors from a training vector set, thereby to generate a representative proxy vector set, selecting the sample vector from the training vector set, and finding a set of k nearest neighbor vectors from the proxy vector set, closest to the sample vector, thereby to generate a reference vector set to be encoded by the encoder.
  • Still further, in accordance with a preferred embodiment of the present invention, the method includes repeating the taking for each training iteration.
  • Moreover, in accordance with a preferred embodiment of the present invention, the method includes repeating the selecting multiple times per training iteration thereby generating a plurality of sample vectors for each training iteration.
  • Further, in accordance with a preferred embodiment of the present invention, the method also includes repeating the finding thereby generating a plurality of the reference vector sets, one per sample vector of the plurality of sample vectors, for each training iteration.
  • Still further, in accordance with a preferred embodiment of the present invention, the generating includes calculating the loss function using a Kullback-Leibler divergence from the input probability distribution and the output probability distribution.
  • Additionally, in accordance with a preferred embodiment of the present invention, for training, the neural network includes an output layer which generates the floating-point pseudo-bipolar encoded sample and reference vectors using a beta-scaled tanh layer.
  • Moreover, in accordance with a preferred embodiment of the present invention, the generating includes calculating the pseudo-Hamming similarities using an inner product in the output space.
  • Further, in accordance with a preferred embodiment of the present invention, the generating includes normalizing the cosine similarities and the pseudo-Hamming similarities to be within the same range of values.
  • Still further, in accordance with a preferred embodiment of the present invention, the normalizing includes normalizing the pseudo-Hamming similarities using a binary code length.
  • Moreover, in accordance with a preferred embodiment of the present invention, the normalizing includes converting the cosine similarities and the pseudo-Hamming similarities to probabilities.
  • Additionally, in accordance with a preferred embodiment of the present invention, the method includes, once the neural network is trained, producing an inference neural network from the trained neural network, the inference neural network to output true binary vectors.
  • Further, in accordance with a preferred embodiment of the present invention, the producing includes removing pseudo-bipolar output layers from the trained neural network, and adding at least one binary output layer to the trained neural network, to generate the inference neural network.
  • Additionally, in accordance with a preferred embodiment of the present invention, the true binary vectors are to be used in approximate nearest neighbor searches.
  • There is also provided, in accordance with a preferred embodiment of the present invention, a system for training a neural-network-based floating-point to binary feature vector encoder to preserve the locality relationships between samples in an input space over to an output space. The system includes a neural network under training and a probability distribution loss function generator. The neural network under training has floating-point inputs and floating-point pseudo-bipolar outputs. The probability distribution loss function generator generates a loss function which compares an input probability distribution constructed from floating-point cosine similarities of the input space and an output probability distribution constructed from floating-point pseudo-bipolar pseudo-Hamming similarities of the output space.
  • Moreover, in accordance with a preferred embodiment of the present invention, the probability distribution loss function generator includes a pseudo-bipolar Hamming distribution calculator to calculate the output probability distribution between a floating-point pseudo-bipolar encoded sample vector and a set of pseudo-bipolar encoded reference vectors.
  • Further, in accordance with a preferred embodiment of the present invention, the system includes a training data vector store to store a training vector set, a proxy vector set generator, a sample vector selector and a KNN vector set generator. The proxy vector set generator takes at least a random sampling of a plurality of vectors from the training vector set, thereby to generate a representative proxy vector set. The sample vector selector selects at least a sample vector from the training vector set. The KNN vector set generator finds at least a set of k nearest neighbor vectors from the proxy vector set, closest to the sample vector, thereby to generate at least a reference vector set to be encoded by the encoder.
  • Moreover, in accordance with a preferred embodiment of the present invention, the proxy vector set generator takes a random sampling of a plurality of vectors from the training vector set, thereby to generate a representative proxy vector set for each training iteration.
  • Further, in accordance with a preferred embodiment of the present invention, the sample vector selector selects a plurality of the sample vectors from the training vector set, for each the training iteration.
  • Still further, in accordance with a preferred embodiment of the present invention, the KNN vector set generator finds, per training iteration, a plurality of the set of k nearest neighbor vectors from the proxy vector set, one per sample vector of the plurality of sample vectors for each the training iteration.
  • Moreover, in accordance with a preferred embodiment of the present invention, the loss function is a Kullback-Leibler divergence.
  • Further, in accordance with a preferred embodiment of the present invention, for training, the neural network under training includes a pseudo-bipolar output layer which generates the floating-point pseudo-bipolar encoded sample and reference vectors using a beta-scaled tanh layer.
  • Still further, in accordance with a preferred embodiment of the present invention, the probability distribution loss function generator calculates the pseudo-Hamming similarities using an inner product in the output space.
  • Additionally, in accordance with a preferred embodiment of the present invention, the neural-network-based floating-point to binary encoder is a modified version of the trained neural network under training with at least one binary output layer instead of pseudo-bipolar output layer.
  • Further, in accordance with a preferred embodiment of the present invention, the output of the at least one binary output layer is to be used in approximate nearest neighbor searches.
  • Still further, in accordance with a preferred embodiment of the present invention, the probability distribution loss function generator normalizes the cosine similarities and the pseudo-Hamming similarities to be within the same range of values.
  • Moreover, in accordance with a preferred embodiment of the present invention, the probability distribution loss function generator normalizes the pseudo-Hamming similarities using a binary code length.
  • Finally, in accordance with a preferred embodiment of the present invention, the probability distribution loss function generator converts the cosine similarities and the pseudo-Hamming similarities to probabilities.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The subject matter regarded as the invention is particularly pointed out and distinctly claimed in the concluding portion of the specification. The invention, however, both as to organization and method of operation, together with objects, features, and advantages thereof, may best be understood by reference to the following detailed description when read with the accompanying drawings in which:
  • FIG. 1 is a schematic illustration of a prior art K nearest neighbor search system;
  • FIG. 2 is a schematic illustration of a prior art approximate nearest neighbor search system;
  • FIGS. 3A and 3B are schematic illustrations of a prior art neural network;
  • FIG. 4A is a schematic illustration of a model of a neural proxy hash (NPH) encoder during training, constructed and operative in accordance with a preferred embodiment of the present invention;
  • FIG. 4B is a schematic illustration of a neural network under training, useful in the NPH encoder of FIG. 4A;
  • FIG. 4C is a schematic illustration of a probability distribution loss function (PDLF) generator, useful in the NPH encoder of FIG. 4A;
  • FIG. 4D is a schematic illustration detailing the vectors being operated on by the PDLF generator of FIG. 4C;
  • FIG. 4E is a schematic illustration of a training vector generator, useful in the NPH encoder of FIG. 4A;
  • FIG. 5A is a graphical illustration of an exemplary input distribution and two exemplary output distributions from early and late iterations, respectively, useful in understanding the operations of NPH encoder of FIG. 4A;
  • FIG. 5B is a graphical illustration of the distribution of input vectors in an exemplary training vector set, useful in understanding the operations of NPH encoder of FIG. 4A;
  • FIG. 6 is a schematic illustration of a neural network reconfigured for inference mode; and
  • FIGS. 7A and 7B are graphical illustrations of recall-codelength curves for benchmark dataset Sift1M.
  • It will be appreciated that for simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference numerals may be repeated among the figures to indicate corresponding or analogous elements.
  • DETAILED DESCRIPTION OF THE PRESENT INVENTION
  • In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the invention. However, it will be understood by those skilled in the art that the present invention may be practiced without these specific details. In other instances, well-known methods, procedures, and components have not been described in detail so as not to obscure the present invention.
  • Applicant has realized that the speed and quality of any approximate nearest-neighbor (ANN) similarity search may be increased by improving the preservation of the locality relations between the original floating-point data vectors and the locality relations between the output binary vectors during binary encoding by a neural proxy hash (NPH) encoder.
  • Applicant has realized that the performance of such a NPH encoder may be improved by a novel calculation of the loss function, by a comparison of an output pseudo-bipolar Hamming probability distribution to an input cosine probability distribution.
  • As back propagation of the loss function for weight update uses partial derivatives, it is essential that, during training, the operations used to calculate the loss through the network are differentiable. Applicant has realized that, for optimizing the encoding of binary output vectors at inference time, during training they may be represented as floating-point pseudo-bipolar vectors. The training process trains the NPH encoder so as to achieve output pseudo-bipolar values which strive toward either −1 or +1. As mentioned hereinabove, the system architecture of the NPH encoder during training may be different from that of the NPH encoder used during inference, which in turn will encode to real bipolar or binary vectors.
  • Reference is now made to FIG. 4A which illustrates a model of an NPH encoder during training. Training model 40 comprises a training vector generator 41, a neural network under training 42 and probability distribution loss function (PDLF) generator 43. In the training scenario, training vector generator 41 may generate a sample floating-point vector q and a set N of k floating-point reference vectors vi, which are input into neural network under training 42. Neural network under training 42 may encode input vector q and input vector set N into a pseudo-bipolar encoded sample vector f(q) and a set of pseudo-bipolar encoded reference vectors f(N), respectively. PDLF generator 43 may generate a loss function LKL for neural network under training 42 as a function of the pseudo-bipolar Hamming space probability distribution between pseudo-bipolar encoded output sample vector f(q) and pseudo-bipolar encoded output reference vector set f(N), and the cosine space probability distribution between input sample vector q and input reference vector set N, as explained hereinbelow. As mentioned hereinabove, loss function LKL may then be used to adjust the weights of neural network under training 42 at each iteration of the training process. However, during inference, neural network under training 42 may be reconfigured to output true binary vectors, as explained hereinbelow.
  • It should be noted that during training, training vector generator 41 may output a plurality b of sample vectors qj, and a plurality of associated reference vector sets Nj, for every training iteration. The batch size b is a configurable hyper-parameter commonly used in machine learning. The error, or loss, is calculated for each vector qj against its reference vector set Nj and the total loss function is defined as the mean across all qj in the iteration. The back propagation process including partial derivative calculations and weights update may be performed by the neural network framework being used, for example TensorFlow, PyTorch or MxNet. The gradient descent method of optimization (e.g., SGD, Adam optimizer, etc.) and standard NN training hyperparameters such as learning rate, scheduling and momentum, may be configured via the neural network framework.
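  • As an illustration of one such batch iteration, a PyTorch sketch is given below; PyTorch is used only as an example of the frameworks named above, and the encoder, the per-sample loss function (as sketched earlier in this document) and the optimizer are assumed to be supplied by the caller.

```python
import torch

def training_iteration(encoder, pdlf_loss, optimizer, Q, N):
    """One batch iteration: Q is a (b, d) tensor of sample vectors q_j and
    N a (b, k, d) tensor of their associated reference sets N_j."""
    b, k, d = N.shape
    f_Q = encoder(Q)                                      # (b, n) pseudo-bipolar sample codes
    f_N = encoder(N.reshape(b * k, d)).reshape(b, k, -1)  # (b, k, n) reference codes
    # Total loss is the mean of the per-sample losses across the batch.
    loss = torch.stack([pdlf_loss(Q[j], N[j], f_Q[j], f_N[j]) for j in range(b)]).mean()
    optimizer.zero_grad()
    loss.backward()    # partial derivatives computed by the framework
    optimizer.step()   # gradient-descent weight update (e.g., SGD or Adam)
    return loss.item()
```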
  • Training iterations may continue until all vectors from a training vector set T have been used as a sample vector q. This is known as an epoch. Training set T may be passed through the network under training multiple times until the loss function converges and the network is fully trained. The number of times the entire training vector set T is passed through the network is known as the number of epochs.
  • Reference is now made to FIG. 4B which details neural network under training 42, comprising a configurable number of hidden layers 421, a final embedding layer 422, and a bipolar simulator layer 423. For example, each hidden layer 421 may comprise a dense layer (FC) 4211 with a configurable number of units, a batch normalization (BN) layer 4212, and a ReLU activation layer 4213. Embedding layer 422 may comprise a dense layer (FC) 4221 with n units, n being the final desired code-length of the binary vectors, a BN layer 4222 and an L2-normalizer layer 4223. As mentioned hereinabove, neural network 42 may be designed to output a pseudo-bipolar output vector, which may be used in the training phase. To this end, embedding layer 422 may create an L2-normalized, floating-point representation vector as an output. In order to simulate a bipolar vector {−1, +1}, the output embedding from embedding layer 422 may then undergo a relaxation of the non-differentiable sign (sgn) function, using a β-scaling function 4231 and a hyperbolic tangent (tanh) function 4232, in bipolar simulator layer 423.
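  • A hedged PyTorch sketch of such a network under training follows; the hidden layer widths, the code length n and the β value are illustrative choices, not those of any particular embodiment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NPHEncoderUnderTraining(nn.Module):
    """Hidden layers (FC + BN + ReLU), an embedding layer (FC + BN + L2
    normalization) of width n, and a beta-scaled tanh bipolar simulator."""
    def __init__(self, d_in, hidden, n, beta=3.0):
        super().__init__()
        layers, prev = [], d_in
        for h in hidden:                                  # hidden layers 421
            layers += [nn.Linear(prev, h), nn.BatchNorm1d(h), nn.ReLU()]
            prev = h
        self.hidden = nn.Sequential(*layers)
        self.embed = nn.Sequential(nn.Linear(prev, n), nn.BatchNorm1d(n))  # embedding layer 422
        self.beta = beta

    def forward(self, x):
        e = F.normalize(self.embed(self.hidden(x)), p=2, dim=1)  # L2-normalizer layer
        return torch.tanh(self.beta * e)                         # bipolar simulator layer 423

encoder = NPHEncoderUnderTraining(d_in=128, hidden=[512, 512], n=64)
codes = encoder(torch.randn(32, 128))
print(codes.shape, float(codes.min()), float(codes.max()))  # all values in (-1, +1)
```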
  • It will be appreciated that the chosen code length n is configurable and should take into consideration the tradeoff between a number of factors, such as database size, required accuracy, memory resources and required search time throughput and latency.
  • Reference is now made to FIG. 4C which details PDLF generator 43. Generator 43 comprises a cosine space probability distribution calculator 431, a pseudo-bipolar Hamming space probability distribution calculator 432 and a loss function calculator 433. Generator 43 may operate on all sample vectors qj in a batch. However, to simplify the equations below, the following text will describe the operation for a single sample vector q and its associated reference set N (unless noted otherwise).
  • Reference is also made to FIG. 4D, which details the vectors being operated on by PDLF generator 43. In its left column, FIG. 4D shows the operations on set N of associated reference vectors vi for sample vector q and in its right column, FIG. 4D shows the operations on the pseudo-bipolar encoded output set f(N) of associated reference vectors f(vi) for pseudo-bipolar encoded sample vector f(q).
  • As mentioned hereinabove, in the training scenario, cosine distribution calculator 431 may determine the probability distribution of the cosine similarities SCi between sample floating-point vector q and its k associated floating-point reference vectors vi in set N.
  • The cosine similarity SCi for one reference vector vi is defined as a function of the inner product between vector q and vector vi and their norms, as provided in equation 2.
  • $S_{C_i}(q, v_i) = \dfrac{q \cdot v_i}{\lVert q \rVert \, \lVert v_i \rVert}$   (2)
  • For simplicity, the input vectors may be preprocessed to undergo an L2-normalization prior to entering the network and loss computation so that ∥q∥=1 and ∥vi∥=1.
  • Cosine distribution calculator 431 may then convert the cosine similarities SCi (shown as a vector of different width elements in the second row of FIG. 4D) to a similarity probability distribution PDC for sample q by first defining the probabilities PCi of q over its associated reference set N. As shown in the third row of FIG. 4D, the input probability distribution PDC is a vector of the k per-reference vector probabilities PCi, each determined in a manner similar to a “softmax” function, as shown in equation 3:
  • $P_{C_i} = \dfrac{e^{S_{C_i}(q, v_i)}}{\sum_{m=1}^{k} e^{S_{C_m}(q, v_m)}}$   (3)
  • where the softmax function takes k real numbers and normalizes them into a probability distribution consisting of k probabilities proportional to the exponentials of the real numbers. As a result, the set of probabilities PCi sum to 1.
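  • As a sketch of equations 2 and 3 for a single sample q against its k reference vectors (assuming the inputs have been L2-normalized, so the cosine similarity reduces to an inner product; names are illustrative only):

```python
import torch

def cosine_probability_distribution(q, refs):
    """q: (d,) L2-normalized sample vector; refs: (k, d) L2-normalized reference vectors.
    Returns the k probabilities P_Ci of equation 3, which sum to 1."""
    s_c = refs @ q                     # equation 2 with ||q|| = ||v_i|| = 1
    return torch.softmax(s_c, dim=0)   # equation 3: exponentials normalized to a distribution
```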
  • Applicant has realized that, in order to preserve relational similarities of the encoded vectors in a binary Hamming space, rather than in a cosine space, normalized pseudo-Hamming similarities may be used during training to simulate a real Hamming similarity. Applicant has also realized that the pseudo-Hamming similarities may be calculated from the pseudo-bipolar vectors in a differentiable manner using the inner product, as follows:
  • The pseudo-Hamming similarity Hi, for two encoded output vectors f(q) and f(vi) in the pseudo-bipolar {−1, +1} space, may be defined from the inner product of the pseudo-bipolar vectors f(q) and f(vi), as provided in equation 4.
  • $H_i(f(q), f(v_i)) = \#\ \text{of identical bit positions} = \dfrac{f(q) \cdot f(v_i) + n}{2}$   (4)
  • where n is the binary code length of the encoded vectors, which is the dimension of the output pseudo-bipolar vectors during training, as described hereinabove. The vector of pseudo-Hamming similarities SHi is shown on the right side of the second row of FIG. 4D.
  • Applicant has realized that, to ensure that the similarities of the original input space and the output space are within the same range of values, assuming pseudo-bipolar values, the output similarity scores may be normalized to the range of (−1, +1), using the binary code length n. Hence, the normalized pseudo-Hamming similarity SHi between a pseudo-bipolar encoded vector of sample q, f(q), and a pseudo-bipolar encoded vector of reference point vi, f(vi), is defined as in equation 5.
  • $S_{H_i}(f(q), f(v_i)) = \dfrac{2 \cdot H_i(f(q), f(v_i))}{\alpha n} - 1 = \dfrac{f(q) \cdot f(v_i)}{\alpha n}$   (5)
  • where α is a configurable correction factor that may be fine-tuned during training for optimized accuracy; it depends mainly on the range of values of the input feature vectors within the dataset and on the β-scale in function 4231.
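  • A sketch of equations 4 and 5 for the pseudo-bipolar outputs is given below; the default value of α is an assumption (a scalar near 1 that would be tuned as described above).

```python
import torch

def pseudo_hamming_similarity(f_q, f_refs, alpha=1.0):
    """f_q: (n,) pseudo-bipolar code of the sample; f_refs: (k, n) pseudo-bipolar reference codes.
    Returns the normalized pseudo-Hamming similarities S_Hi of equation 5,
    computed differentiably from the inner product."""
    n = f_q.shape[-1]                     # binary code length
    return (f_refs @ f_q) / (alpha * n)   # inner product scaled to roughly (-1, +1)
```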
  • Hamming distribution calculator 432 may then utilize the pseudo-Hamming similarity SHi to construct a pseudo-Hamming similarity probability distribution for encoded pseudo-bipolar vector f(q) of sample q by first defining the probabilities PHi of q over its associated reference set N. The vector of probabilities PHi defines the pseudo-bipolar Hamming probability distribution PDH and is shown on the right side of the third row of FIG. 4D.
  • Each pseudo-bipolar Hamming probability is determined according to equation 6 as follows:
  • $P_{H_i} = \dfrac{e^{S_{H_i}(f(q), f(v_i))}}{\sum_{m=1}^{k} e^{S_{H_m}(f(q), f(v_m))}}$   (6)
  • As with the cosine similarities, the pseudo-Hamming similarity for each encoded reference vector f(vi) is normalized into a probability. As a result, the set of probabilities PHi sums to 1.
  • The input cosine distribution PDC and the output pseudo-bipolar Hamming distribution PDH may be used by probability distribution loss calculator 433 to calculate the probability distribution loss function LKL, using the Kullback-Leibler divergence D, also called the relative entropy. The Kullback-Leibler divergence is a measure of how the output probability distribution differs from the input, or "target", probability distribution. In equation 7, loss function LKL for sample q is defined using the Kullback-Leibler divergence, which, as shown in FIG. 4D, utilizes the input and output probability distributions PCi and PHi vis-à-vis query q.
  • $L_{KL} = \sum_{i=1}^{k} P_{C_i} \log \dfrac{P_{C_i}}{P_{H_i}}$   (7)
  • This minimization objective for creating binary, locality preserving vectors may be described as an explicit, multi-wise normalized KL divergence loss, where “explicit” refers to using relations between distances rather than implicitly using some space partitioning method, such as labels; “multi-wise” refers to using multiple reference points which preserve relations among more than two or three items; and “normalized” refers to similarity-to-similarity divergence minimization.
  • It will be appreciated that the present invention may attempt to optimize the pseudo-Hamming similarity in the output space in a differentiable, rather than discrete, way, and it does so by using an inner product between f(q) and f(vi).
  • As mentioned hereinabove, in the case where there are multiple samples qj per batch, loss function calculator 433 may calculate equation 7 for each sample qj and may average the multiple losses to form a total mean loss function. This is shown in FIG. 4D as a loop over the next sample q and its reference vectors vi, with the final output being the average of the per-sample losses LKL. In an alternative, preferred embodiment, loss function calculator 433 may generate the loss functions for all samples qj in parallel, after which loss function calculator 433 may determine their average value to form the total mean loss function.
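  • Putting equations 2 through 7 together, a minimal sketch of the probability distribution loss for a batch is shown below (the batched, parallel form of the alternative embodiment). The function name and the use of PyTorch are assumptions; the specification leaves the framework open.

```python
import torch

def pdlf_loss(q, refs, f_q, f_refs, alpha=1.0):
    """q: (b, d) and refs: (b, k, d) -- L2-normalized input vectors.
    f_q: (b, n) and f_refs: (b, k, n) -- pseudo-bipolar encoded outputs.
    Returns the mean Kullback-Leibler loss L_KL of equation 7 over the batch."""
    n = f_q.shape[-1]
    s_c = torch.einsum('bkd,bd->bk', refs, q)                     # equation 2 (normalized inputs)
    s_h = torch.einsum('bkn,bn->bk', f_refs, f_q) / (alpha * n)   # equation 5
    p_c = torch.softmax(s_c, dim=1)                               # equation 3
    log_p_h = torch.log_softmax(s_h, dim=1)                       # log of equation 6
    kl = (p_c * (torch.log(p_c) - log_p_h)).sum(dim=1)            # equation 7, per sample
    return kl.mean()                                              # total mean loss over b samples
```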
  • Applicant has realized that, to generate a meaningful similarity distribution around any query q, using meaningful points around q, such as its k nearest neighbors, as reference points to q may be most useful. Using the k nearest neighbors to q to form a distribution may exploit relative information that may preserve ranking between data points during the training process, and may avoid the noise of irrelevant, far away points, such as might be present if all vectors are used as reference vectors to each sample vector q.
  • On the other hand, Applicant has realized that, using the entire training set as the basis to generate the k-NN reference sets may over-localize and overfit, which may, therefore, fail to reach the best generalized solution, besides being computationally almost infeasible when the training set is very large. Instead, a subset of ‘proxy’ points, enough to form a representation of the data distribution (e.g., around 10%), may be randomly sampled from the training set to create a ‘proxy’ vector set from which to extract nearest-neighbors as reference sets.
  • Applicant has realized that such proxy sets may be re-generated at each batch-iteration, to provide full augmentation and varieties of reference sets, for a more general solution. Moreover, ongoing subsampling of the training set may ensure a good representation of its distribution.
  • Reference is made to FIG. 4E, which details training vector generator 41 comprising a training data vector store 411, a proxy vector set generator 412, a sample vector selector 413 and a KNN vector set generator 414. Vector data store 411 may store a full training vector set T, containing t training vectors. Proxy vector set generator 412 may create a ‘proxy’ vector set R, for each training iteration, by taking a large random sample of r vectors from T, such that R is a compact ‘proxy’ representation of T. ‘r’ is a configurable hyperparameter. For example, r may vary from 1% to 10% of t.
  • For each iteration, sample vector selector 413 may select a batch of b sample vectors qj from training set T. KNN vector set generator 414 may then determine, for each sample vector qj, the set Nj of its k nearest neighbor vectors vi, drawn from proxy set R.
  • It should be noted that only one proxy vector set R may be generated during each iteration and it may be used to generate the reference vector sets N for the samples q of that iteration.
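  • A sketch of such a per-iteration generator is given below (a brute-force cosine k-NN over the proxy set is used purely for illustration; the names, the NumPy implementation and the default proxy fraction are assumptions):

```python
import numpy as np

def make_training_batch(T, b=64, proxy_frac=0.05, k=50, rng=None):
    """T: (t, d) full training vector set, assumed L2-normalized.
    Returns q of shape (b, d) and N of shape (b, k, d), with the reference sets
    drawn from a freshly sampled proxy set R, as in FIG. 4E."""
    rng = rng or np.random.default_rng()
    t = T.shape[0]
    r = max(k, int(proxy_frac * t))
    R = T[rng.choice(t, size=r, replace=False)]     # proxy vector set R for this iteration
    q = T[rng.choice(t, size=b, replace=False)]     # batch of sample vectors q_j
    sims = q @ R.T                                  # cosine similarities of each q_j to R
    idx = np.argsort(-sims, axis=1)[:, :k]          # indices of the k nearest proxies per sample
    N = R[idx]                                      # (b, k, d) reference vector sets
    return q, N
```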
  • Reference is briefly made to FIG. 5A which illustrates an exemplary input distribution 50, and two exemplary output distributions 55 and 59, where output distribution 55 shows the results of one of the early iterations of neural network under training 42 and output distribution 59 shows the results of one of the later iterations of neural network under training 42. For clarity, only two-dimensional spaces are shown and, for visual simplicity, the similarities are shown as Euclidean distances. Input vector q and its output vectors f(q) are shown, for clarity, as small circles, and their 5 associated reference vectors, v1-v5 and f(v1)-f(v5), are shown as x's.
  • As can be seen, in input distribution 50, the 5 associated reference vectors v1-v5 have different lengths, where reference vector v1 is shortest (i.e., closest to input vector q) and reference vector v5 is longest (i.e., furthest away from input vector q).
  • Early iterations generally produce poor results. As a result, the order of output reference vectors f(vi) of output distribution 55 is different than those of reference vectors vi of input distribution 50. Thus, output reference vector f(v3) is now the shortest though output reference vector f(v5) is still the longest. Moreover, while reference vectors vi of input distribution 50 may be the closest neighbors to input vector q, output reference vectors f(vi) of output distribution 55 are not the closest points. Unrelated points 56 of output distribution 55, marked with solid dots, are closer, indicating that neural network 42 is not yet fully trained.
  • However, training improves the results. Thus, in later output distribution 59, the order of output reference vectors f(v1)-f(v5) is similar to that of reference vectors v1 to v5 of input distribution 50. Moreover, the unrelated points, here labeled 58, are further away from output vector f(q) than output reference vectors f(v1)-f(v5).
  • Reference is briefly made to FIG. 5B which illustrates an exemplary distribution of the input vectors in 2-D in training vector set T, each vector illustrated as a dot 61, that may be stored in training data vector store 411. A subset of r vectors, of proxy set R, illustrated with additional x's 62, represent vectors selected by proxy vector set generator 412. Sample vectors q, as selected by sample vector selector 413, are illustrated as large white dots 65, and k nearest neighbor vectors, vi, as selected by vector set generator 414, are illustrated as x's 66 which are connected to the associated sample vector qj, within a bounding circle 68. Note that the bounding circles 68 cover only portions of training vector set T and that some of the bounding circles 68 may overlap.
  • As mentioned hereinabove, when the neural network has been trained (i.e., when its loss value has converged), the resultant trained neural network may be used in inference mode in production. However, as mentioned hereinabove, in inference mode, the neural network encodes input floating-point vectors v into true binary encoded output vectors f(v), as opposed to the pseudo-bipolar encoded output vectors, generated during training.
  • Reference is briefly made to FIG. 6 which illustrates a neural network 42′, similar to neural network 42 in FIG. 4B but reconfigured for inference mode. Accordingly, the layers that were present in neural network 42 solely in order to output a pseudo-bipolar encoded vector have been removed or replaced. L2-normalizer 4223 has been removed from embedding layer 422′. Bipolar simulation layer 423, comprising the β-scaled function 4231 and tanh function 4232 layers, has been replaced with a binary output layer 424 comprising an SGN layer 4241. The resulting binary codes may be packed into bit-representation vectors (i.e., one bit per value) for memory footprint reduction and computation efficiency.
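  • A sketch of this reconfiguration, reusing the trained weights of the hypothetical NPHEncoderTrain sketched after FIG. 4B, is given below; the sign threshold at zero and the use of numpy.packbits for bit packing are assumptions about one reasonable realization.

```python
import numpy as np
import torch

@torch.no_grad()
def encode_binary(encoder_train, x):
    """x: (m, d) floating-point input vectors.
    Returns packed binary codes of shape (m, n/8), one bit per code value."""
    encoder_train.eval()                                      # use running BN statistics
    h = encoder_train.hidden(x)
    e = encoder_train.embed_bn(encoder_train.embed_fc(h))     # no L2-normalizer, no tanh
    bits = (e >= 0).to(torch.uint8).cpu().numpy()             # SGN layer: sign -> {0, 1}
    return np.packbits(bits, axis=-1)                         # bit-representation packing
```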
  • Reference is now made to FIGS. 7A and 7B which illustrate recall-code-length curves for the benchmark dataset Sift1M. The encoder was trained on the predefined training set of 100K samples and evaluated on the predefined 10K sample set, searched against the 1M database set. Recall-K=100@L=1000 is an accuracy metric that can be used when evaluating similarity search. The ground truth is the real 100 nearest neighbors in the original feature space using cosine similarity. Recall is the percentage of the K=100 real nearest neighbors in the original space that reside among the L=1000 located samples with the shortest Hamming distance in the binary space, averaged over all samples. Recall is calculated over several trained output code lengths. FIG. 7A illustrates recall-K=100@L=1000 over vector code lengths from 32 to 128 bits for Locality-Sensitive Hashing (LSH) encoding and the Neural Proxy Hash (NPH) encoding of the present invention. FIG. 7B illustrates recall-K=100@L=1000 over vector code lengths from 256 to 1024 bits. Note that, in both figures, the NPH curve is significantly above the LSH curve, indicating a significant improvement over LSH.
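  • As a sketch of how such a Recall-K@L figure might be computed over packed binary codes (a brute-force evaluation; the function name and the XOR-plus-popcount approach via NumPy are assumptions):

```python
import numpy as np

def recall_k_at_l(db_codes, query_codes, true_knn, L=1000):
    """db_codes: (m, n/8) packed database codes; query_codes: (s, n/8) packed query codes;
    true_knn: (s, K) indices of the real K nearest neighbors in the original cosine space.
    Returns the mean recall-K@L over all queries."""
    recalls = []
    for qc, truth in zip(query_codes, true_knn):
        hamming = np.unpackbits(db_codes ^ qc, axis=1).sum(axis=1)   # XOR + popcount
        top_l = np.argpartition(hamming, L)[:L]                      # L closest codes
        recalls.append(np.intersect1d(truth, top_l).size / truth.size)
    return float(np.mean(recalls))
```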
  • It should be noted that, in testing with four publicly available standard ANN benchmark datasets (Sift1M, Gist, Deep1M, ANN1M), Applicant demonstrated an improvement of between 7% and 17% over other binary hashing methods, at both low (64 bit) and high (1024 bit) code lengths. It should be noted that previous studies in this field report improvements in accuracy only over small code sizes (up to 128 bits). It will be appreciated that this offers increased accuracy when resources are available.
  • Unless specifically stated otherwise, as apparent from the preceding discussions, it is appreciated that, throughout the specification, discussions utilizing terms such as “processing,” “computing,” “calculating,” “determining,” or the like, refer to the action and/or processes of a general purpose computer of any type, such as a client/server system, mobile computing devices, associative processing units, smart appliances, cloud computing units or similar electronic computing devices that manipulate and/or transform data within the computing system's registers and/or memories into other data within the computing system's memories, registers or other such information storage, transmission or display devices.
  • Embodiments of the present invention may include apparatus for performing the operations herein. This apparatus may be specially constructed for the desired purposes, or it may comprise a computing device or system typically having at least one processor and at least one memory, selectively activated or reconfigured by a computer program stored in the computer. The resultant apparatus, when instructed by software, may turn the general purpose computer into inventive elements as discussed herein. The instructions may define the inventive device in operation with the computer platform for which it is desired. Such a computer program may be stored in a computer readable storage medium, such as, but not limited to, any type of disk, including optical disks, magneto-optical disks, read-only memories (ROMs), volatile and non-volatile memories, random access memories (RAMs), electrically programmable read-only memories (EPROMs), electrically erasable and programmable read-only memories (EEPROMs), magnetic or optical cards, Flash memory, disk-on-key or any other type of media suitable for storing electronic instructions and capable of being coupled to a computer system bus. The computer readable storage medium may also be implemented in cloud storage.
  • Some general purpose computers may comprise at least one communication element to enable communication with a data network and/or a mobile communications network.
  • The processes and displays presented herein are not inherently related to any particular computer or other apparatus. Various general-purpose systems may be used with programs in accordance with the teachings herein, or it may prove convenient to construct a more specialized apparatus to perform the desired method. The desired structure for a variety of these systems will appear from the description below. In addition, embodiments of the present invention are not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the invention as described herein.
  • While certain features of the invention have been illustrated and described herein, many modifications, substitutions, changes, and equivalents will now occur to those of ordinary skill in the art. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes as fall within the true spirit of the invention.

Claims (29)

1. A method for training a neural-network-based floating-point to binary feature vector encoder to preserve locality relationships between samples in an input space over to an output space, the method comprising:
having a neural network under training having floating-point inputs and floating-point pseudo-bipolar outputs; and
generating a loss function which compares an input probability distribution constructed from floating-point cosine similarities of said input space and an output probability distribution constructed from floating-point pseudo-bipolar pseudo-Hamming similarities of said output space.
2. The method of claim 1 wherein said generating comprises calculating said output probability distribution between a floating-point pseudo-bipolar encoded sample vector and a set of pseudo-bipolar encoded reference vectors.
3. The method of claim 2 and also comprising:
taking a random sampling of a plurality of vectors from a training vector set,
thereby to generate a representative proxy vector set;
selecting said sample vector from said training vector set; and
finding a set of k nearest neighbor vectors from said proxy vector set, closest to said sample vector, thereby to generate a reference vector set to be encoded by said encoder.
4. The method of claim 3 and also comprising repeating said taking for each training iteration.
5. The method of claim 3 and also comprising repeating said selecting multiple times per training iteration thereby generating a plurality of sample vectors for each training iteration.
6. The method of claim 5 and also comprising repeating said finding thereby generating a plurality of said reference vector sets, one per sample vector of said plurality of sample vectors, for each training iteration.
7. The method of claim 1, wherein said generating comprises calculating said loss function using a Kullback-Leibler divergence from said input probability distribution and said output probability distribution.
8. The method of claim 2, wherein for training, said neural network includes an output layer which generates said floating-point pseudo-bipolar encoded sample and reference vectors using a beta-scaled tan h layer.
9. The method of claim 1, wherein said generating comprises calculating said pseudo-Hamming similarities using an inner product in said output space.
10. The method of claim 1, wherein said generating comprises normalizing said cosine similarities and said pseudo-Hamming similarities to be within the same range of values.
11. The method of claim 10 and wherein said normalizing comprises normalizing said pseudo-Hamming similarities using a binary code length.
12. The method of claim 10 and wherein said normalizing comprises converting said cosine similarities and said pseudo-Hamming similarities to probabilities.
13. The method of claim 1, and also comprising, once said neural network is trained, producing an inference neural network from said trained neural network, said inference neural network to output true binary vectors.
14. The method of claim 13 wherein said producing comprises:
removing pseudo-bipolar output layers from said trained neural network; and
adding at least one binary output layer to said trained neural network, to generate said inference neural network.
15. The method of claim 13 wherein said true binary vectors to be used in approximate nearest neighbor searches.
16. A system for training a neural-network-based floating-point to binary feature vector encoder to preserve locality relationships between samples in an input space over to an output space, the system comprising:
a neural network under training having floating-point inputs and floating-point pseudo-bipolar outputs; and
a probability distribution loss function generator to generate a loss function which compares an input probability distribution constructed from floating-point cosine similarities of said input space and an output probability distribution constructed from floating-point pseudo-bipolar pseudo-Hamming similarities of said output space.
17. The system of claim 16 wherein said probability distribution loss function generator comprises a pseudo-bipolar Hamming distribution calculator to calculate said output probability distribution between a floating-point pseudo-bipolar encoded sample vector and a set of pseudo-bipolar encoded reference vectors.
18. The system of claim 16 further comprising:
a training data vector store to store a training vector set;
a proxy vector set generator to take at least a random sampling of a plurality of vectors from said training vector set, thereby to generate a representative proxy vector set;
a sample vector selector to select at least a sample vector from said training vector set; and
a KNN vector set generator to find at least a set of k nearest neighbor vectors from said proxy vector set, closest to said sample vector, thereby to generate at least a reference vector set to be encoded by said encoder.
19. The system of claim 18 wherein said proxy vector set generator to take a random sampling of a plurality of vectors from said training vector set, thereby to generate a representative proxy vector set for each training iteration.
20. The system of claim 18 wherein said sample vector selector to select a plurality of said sample vectors from said training vector set, for each said training iteration.
21. The system of claim 20 wherein said KNN vector set generator to find, per training iteration, a plurality of said set of k nearest neighbor vectors from said proxy vector set, one per sample vector of said plurality of sample vectors for each said training iteration.
22. The system of claim 16, wherein said loss function is a Kullback-Leibler divergence.
23. The system of claim 17, wherein for training, said neural network under training comprises a pseudo-bipolar output layer which generates said floating-point pseudo-bipolar encoded sample and reference vectors using a beta-scaled tan h layer.
24. The system of claim 16, wherein said probability distribution loss function generator to calculate said pseudo-Hamming similarities using an inner product in said output space.
25. The system of claim 23, wherein said neural-network-based floating-point to binary encoder is a modified version of said trained neural network under training with at least one binary output layer instead of pseudo-bipolar output layer.
26. The system of claim 25, wherein output of said at least one binary output layer to be used in approximate nearest neighbor searches.
27. The system of claim 16, wherein said probability distribution loss function generator to normalize said cosine similarities and said pseudo-Hamming similarities to be within the same range of values.
28. The system of claim 27 and wherein said probability distribution loss function generator to normalize said pseudo-Hamming similarities using a binary code length.
29. The system of claim 27 and wherein said probability distribution loss function generator to convert said cosine similarities and said pseudo-Hamming similarities to probabilities.