CN110309139B - High-dimensional neighbor pair searching method and system - Google Patents

Publication number
CN110309139B
Authority
CN
China
Prior art keywords
sample
neighbor
signature
vector
samples
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810179962.6A
Other languages
Chinese (zh)
Other versions
CN110309139A (en)
Inventor
童毅轩
张佳师
姜珊珊
郑继川
董滨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ricoh Software Research Center Beijing Co Ltd
Original Assignee
Ricoh Software Research Center Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ricoh Software Research Center Beijing Co Ltd
Priority to CN201810179962.6A
Publication of CN110309139A
Application granted
Publication of CN110309139B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2246 Trees, e.g. B+trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries

Abstract

The invention provides a high-dimensional neighbor pair searching method and a system, wherein the high-dimensional neighbor pair searching method comprises the following steps: generating a corresponding sample signature according to the numerical value of the sample vector; generating a neighbor candidate group according to the sample signature; and calculating the distance between any two samples in each neighbor candidate group, and taking a sample pair with the distance meeting the preset requirement as a neighbor search result. Therefore, the efficient search of the high-dimensional neighbor pairs is realized, the search requirement of the user is met, and the method is simple and easy to realize.

Description

High-dimensional neighbor pair searching method and system
Technical Field
The invention relates to the technical field of computers, in particular to a high-dimensional neighbor pair searching method and a high-dimensional neighbor pair searching system.
Background
With the development of science and technology, large-scale search engines must be able to search effectively and quickly. Commonly used search structures include k-d trees, R-trees and the like. However, these data structures and their variants are only suitable for searching data of lower dimensionality. To increase search accuracy, the feature vectors used to characterize the objects to be searched, such as images, are often high-dimensional, possibly on the order of 10^5 dimensions. When the dimension of the data exceeds 100, or even reaches thousands, the search capability of these data structures declines rapidly. Therefore, how to search high-dimensional neighbor pairs efficiently still has high research value.
Disclosure of Invention
The present invention aims to solve at least one of the technical problems in the related art to some extent.
Therefore, a first object of the present invention is to provide a method for searching a high-dimensional neighbor pair, so as to realize efficient searching of the high-dimensional neighbor pair and meet the searching requirement of a user.
A second object of the invention is to propose a non-transitory computer readable storage medium.
A third object of the invention is to propose a computer program product.
A fourth object of the present invention is to provide a high-dimensional neighbor pair search system.
To achieve the above objective, an embodiment of a first aspect of the present invention provides a method for searching a high-dimensional neighbor pair, including the following steps: generating a corresponding sample signature according to the numerical value of the sample vector; generating a neighbor candidate group according to the sample signature; and calculating the distance between any two samples in each neighbor candidate group, and taking a sample pair with the distance meeting the preset requirement as a neighbor search result.
According to the high-dimensional neighbor pair search method provided by the embodiment of the invention, a corresponding sample signature is first generated according to the values of the sample vector, neighbor candidate groups are then generated according to the sample signatures, the distance between any two samples in each neighbor candidate group is calculated, and the sample pairs whose distance meets the preset requirement are taken as the neighbor search result. Efficient search of high-dimensional neighbor pairs is thereby realized, the search requirement of the user is met, and the method is simple and easy to implement.
In addition, the high-dimensional neighbor pair search method according to the above embodiment of the present invention may further have the following additional technical features:
According to one embodiment of the invention, the sample signature is a binary vector.
According to one embodiment of the invention, generating the sample signature according to the values of the sample vector includes: mapping the sample vector from an original vector to a target vector through a k×d projection matrix R, wherein d is the dimension of the original vector, k is the dimension of the target vector, and d is greater than k; if a value of the target vector is not less than zero, assigning 1 to the corresponding position of the sample signature; and if the value is less than zero, assigning 0 to the corresponding position of the sample signature.
According to one embodiment of the invention, the k×d projection matrix R is randomly generated from the Gaussian distribution N(0, 1/k).
According to one embodiment of the present invention, generating the neighbor candidate groups according to the sample signatures includes: S21, constructing a binary tree of depth N and storing the sample signatures in the leaf nodes of the binary tree, wherein the path from the root node of the binary tree to a leaf node corresponds to the values of the first N bits of a sample signature, samples whose signatures share the same first N bits are stored in the same leaf node, and N is smaller than the length M of the sample signature; S22, when the number of sample signatures in a leaf node is larger than a first preset value T, dividing the sample signatures into two different child leaf nodes under that leaf node according to the value of their (N+1)th bit; S23, pruning the tree by removing the leaf nodes of the Nth layer; S24, repeating steps S22 and S23 until no leaf node remains to be split.
According to one embodiment of the present invention, taking the sample pairs whose distance meets the preset requirement as the neighbor search result includes: sorting the calculated distances between the sample pairs within each neighbor candidate group and acquiring the K sample pairs with the smallest distances in that group; and sorting the sample pairs acquired from the different neighbor candidate groups and taking the K sample pairs with the smallest distances as the neighbor search result.
To achieve the above object, a second aspect of the present invention provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described high-dimensional neighbor pair search method.
The non-transitory computer readable storage medium of the embodiment of the invention can realize the effective search of the high-dimensional neighbor pairs by executing the program stored on the non-transitory computer readable storage medium and corresponding to the high-dimensional neighbor pair search method, thereby meeting the search requirement of users.
To achieve the above object, an embodiment of a third aspect of the present invention provides a computer program product, which when executed by a processor, performs the above-mentioned high-dimensional neighbor pair search method.
The computer program product of the embodiment of the invention can realize the effective search of the high-dimensional neighbor pairs and meet the search requirement of users by executing the program corresponding to the high-dimensional neighbor pair search method.
To achieve the above object, a fourth aspect of the present invention provides a high-dimensional neighbor pair search system, including: the first generation module is used for generating a corresponding sample signature according to the numerical value of the sample vector; the second generation module is used for generating a neighbor candidate group according to the sample signature; and the processing module is used for calculating the distance between any two samples in each neighbor candidate group and taking a sample pair with the distance meeting the preset requirement as a neighbor search result.
According to the high-dimensional neighbor pair search system provided by the embodiment of the invention, the corresponding sample signature is firstly generated according to the numerical value of the sample vector through the first generation module, then the neighbor candidate group is generated according to the sample signature through the second generation module, the distance between any two samples in each neighbor candidate group is calculated through the processing module, and the sample pair with the distance meeting the preset requirement is used as a neighbor search result, so that the high-dimensional neighbor pair is effectively searched, the search requirement of a user is met, and the system is simple and easy to realize.
According to one embodiment of the invention, the first generation module is configured to: map the sample vector from an original vector to a target vector through a k×d projection matrix R, assign 1 to the corresponding position of the sample signature when a value of the target vector is not less than zero, and assign 0 to the corresponding position of the sample signature when the value is less than zero, wherein d is the dimension of the original vector, k is the dimension of the target vector, and d > k.
According to one embodiment of the invention, the second generation module performs the following steps: S21, constructing a binary tree of depth N and storing the sample signatures in the leaf nodes of the binary tree, wherein the path from the root node of the binary tree to a leaf node corresponds to the values of the first N bits of a sample signature, samples whose signatures share the same first N bits are stored in the same leaf node, and N is smaller than the length M of the sample signature; S22, when the number of sample signatures in a leaf node is larger than a first preset value T, dividing the sample signatures into two different child leaf nodes under that leaf node according to the value of their (N+1)th bit; S23, pruning the tree by removing the leaf nodes of the Nth layer; S24, repeating steps S22 and S23 until no leaf node remains to be split.
According to one embodiment of the present invention, when taking the sample pairs whose distance meets the preset requirement as the neighbor search result, the processing module is specifically configured to: sort the calculated distances between the sample pairs within each neighbor candidate group and acquire the K sample pairs with the smallest distances in that group; and sort the sample pairs acquired from the different neighbor candidate groups and take the K sample pairs with the smallest distances as the neighbor search result.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a flowchart of a high-dimensional neighbor pair search method according to an embodiment of the present invention;
FIG. 2 is a flowchart of step S2 in a high-dimensional neighbor pair search method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a binary tree according to one example of the present invention;
FIG. 4 is a flowchart of step S3 in a high-dimensional neighbor pair search method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a system for performing the high-dimensional neighbor pair search method of the present invention; and
FIG. 6 is an illustration of a high-dimensional neighbor pair search system according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention.
The high-dimensional neighbor pair search method and system of the embodiment of the invention are described below with reference to the accompanying drawings.
Fig. 1 is a flowchart of a high-dimensional neighbor pair search method according to an embodiment of the present invention. As shown in fig. 1, the high-dimensional neighbor pair search method includes the following steps:
S1, generating a corresponding sample signature according to the values of the sample vector.
In one embodiment of the invention, the sample signature is a binary vector.
Specifically, the sample vector can first be mapped from an original vector to a target vector through a k×d projection matrix R. When a value of the target vector is not less than zero, 1 is assigned to the corresponding position of the sample signature; when the value is less than zero, 0 is assigned to the corresponding position. Here d is the dimension of the original (high-dimensional) vector, k is the dimension of the target (low-dimensional) vector, and d > k.
Optionally, the k×d projection matrix R may be randomly generated from the Gaussian distribution N(0, 1/k).
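Step S1 can be illustrated with a minimal sketch, written here in Python with only the standard library. The function names `make_projection` and `signature`, the fixed seeds, and the reading of N(0, 1/k) as "mean 0, variance 1/k" are assumptions made for illustration and are not taken from the patent text.

```python
import random

def make_projection(k, d, seed=0):
    """Return a k x d matrix (list of rows) with entries drawn from N(0, 1/k).

    Assumption: 1/k is interpreted as the variance, so the standard
    deviation passed to gauss() is sqrt(1/k).
    """
    rng = random.Random(seed)
    sigma = (1.0 / k) ** 0.5
    return [[rng.gauss(0.0, sigma) for _ in range(d)] for _ in range(k)]

def signature(R, x):
    """Project x with R and keep one bit per target dimension:
    '1' if the projected value is >= 0, otherwise '0'."""
    bits = []
    for row in R:
        value = sum(r * v for r, v in zip(row, x))
        bits.append("1" if value >= 0 else "0")
    return "".join(bits)

# Hypothetical example: one 100-dimensional sample mapped to an 8-bit signature.
R = make_projection(k=8, d=100)
rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(100)]
sig = signature(R, x)
```

Because only the sign of each projected value is kept, multiplying a sample by a positive constant leaves its signature unchanged.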
S2, generating neighbor candidate groups according to the sample signatures.
Specifically, the neighbor candidate groups may be generated according to the values at the respective positions of the sample signatures, so that samples whose signatures are close to each other are placed in the same neighbor candidate group. For example, given samples whose signatures are 00001111, 00001010, 00001110, 00010010, 00011100 and 00011101, the samples with signatures 00001111, 00001010 and 00001110 may be placed in one neighbor candidate group, and those with signatures 00010010, 00011100 and 00011101 in another.
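The grouping in this example is consistent with bucketing the signatures by their leading bits. The sketch below assumes a prefix length of 4, which the text does not state; `group_by_prefix` is a hypothetical helper name.

```python
from collections import defaultdict

def group_by_prefix(signatures, n):
    """Bucket signature strings by their first n bits."""
    groups = defaultdict(list)
    for s in signatures:
        groups[s[:n]].append(s)
    return dict(groups)

sigs = ["00001111", "00001010", "00001110",
        "00010010", "00011100", "00011101"]
groups = group_by_prefix(sigs, 4)  # prefix length 4 is an assumption
```

With this choice the first three signatures share the prefix 0000 and the last three share 0001, matching the two groups described above.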
S3, calculating the distance between any two samples in each neighbor candidate group, and taking the sample pairs whose distance meets the preset requirement as the neighbor search result.
Specifically, if a neighbor candidate groups are generated and each contains b samples, the distance (e.g., the Euclidean distance) between any two samples in each group can be calculated and sorted in ascending order, and the first K sample pairs are selected from each of the a groups. The selected pairs (at most a × K in total) are then sorted in ascending order, and the first K pairs of that ordering are taken as the neighbor search result.
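The saving from restricting distance computations to candidate groups can be made concrete with a short count. The comparison below assumes, purely for illustration, that the a groups are disjoint and together hold all n = ab samples; after pruning, the real number of retained samples may be smaller.

```latex
\[
  \underbrace{\binom{n}{2} = \frac{n(n-1)}{2}}_{\text{all pairs}}
  \qquad\text{vs.}\qquad
  \underbrace{a\binom{b}{2} = \frac{a\,b(b-1)}{2}}_{\text{pairs within groups}}
\]
% Illustrative numbers (not from the patent): a = 100, b = 100, n = ab = 10\,000
\[
  \frac{10\,000 \cdot 9\,999}{2} = 49\,995\,000
  \qquad\text{vs.}\qquad
  100 \cdot \frac{100 \cdot 99}{2} = 495\,000
\]
```

Under these assumptions the grouped search evaluates roughly 1/a of the distances that an exhaustive pairwise search would.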
In one embodiment of the present invention, as shown in fig. 2, the step S2 may further include the steps of:
S21, constructing a binary tree of depth N and storing the sample signatures in the leaf nodes of the binary tree, wherein the path from the root node of the binary tree to a leaf node corresponds to the values of the first N bits of a sample signature, samples whose signatures share the same first N bits are stored in the same leaf node, and N is smaller than the length M of the sample signature.
S22, when the number of sample signatures in a leaf node is larger than a first preset value T, dividing the sample signatures into two different child leaf nodes under that leaf node according to the value of their (N+1)th bit.
For example, referring to FIG. 3, suppose leaf node C needs to be split and the current depth N is 2. Signature 110 goes to the left child node because its bit at position N+1 = 3 is 0, and signature 111 goes to the right child node because its third bit is 1. The tree depth then becomes 3, i.e., N+1.
S23, pruning the tree by removing leaf nodes of the Nth layer.
Specifically, during step S22 only some of the leaf nodes may be split. In a leaf node that is not split, the signature prefix shared by its samples is one bit shorter than in the children of a split node; that is, the samples in such a node are not as close to each other as the samples in the newly created leaf nodes. The leaf nodes of the Nth layer that were not split can therefore be pruned, so that only the tighter sample sets are retained.
S24, repeating the steps S22 and S23 until no leaf node to be segmented exists.
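One possible reading of steps S21 to S24 is sketched below. It is an interpretation rather than the patent's implementation: leaves are held in a dictionary keyed by signature prefix instead of an explicit tree, the function name `candidate_groups` is hypothetical, and the loop also stops once the prefix length reaches the signature length M.

```python
from collections import defaultdict

def candidate_groups(signatures, n, t):
    """Return {prefix: [signatures]} for the leaves that survive S21-S24.

    signatures: non-empty list of equal-length bit strings (length M)
    n: initial tree depth N, t: split threshold T
    """
    m = len(signatures[0])
    # S21: leaves of a depth-n tree, keyed by the first n signature bits.
    leaves = defaultdict(list)
    for s in signatures:
        leaves[s[:n]].append(s)
    depth = n
    while depth < m:
        oversized = {p: g for p, g in leaves.items() if len(g) > t}
        if not oversized:
            break  # S24: no leaf left to split
        # S22: split each oversized leaf on its next bit.
        # S23: leaves at the old depth that were not split are dropped.
        leaves = defaultdict(list)
        for group in oversized.values():
            for s in group:
                leaves[s[:depth + 1]].append(s)
        depth += 1
    return dict(leaves)

# Hypothetical data: four signatures share the prefix "11", one has "00".
result = candidate_groups(["1100", "1101", "1110", "1111", "0001"], n=2, t=2)
```

In this small run the leaf for prefix 11 holds four signatures, exceeds T = 2 and is split into 110 and 111, while the unsplit leaf for prefix 00 is pruned.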
In one embodiment of the present invention, as shown in fig. 4, the step S3 includes the following steps:
S31, sorting the calculated distances between the sample pairs within each neighbor candidate group, and acquiring the K sample pairs with the smallest distances in that group.
S32, sorting the sample pairs acquired from the different neighbor candidate groups, and taking the K sample pairs with the smallest distances as the neighbor search result.
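Steps S31 and S32 amount to a two-stage top-K selection, which can be sketched as follows. The function name `top_k_pairs`, the use of Euclidean distance via `math.dist`, and the sample data are illustrative assumptions.

```python
import heapq
import math
from itertools import combinations

def top_k_pairs(groups, k):
    """groups: list of candidate groups, each a list of vectors (tuples).

    Returns the k closest pairs overall as (distance, u, v) tuples,
    in ascending order of distance.
    """
    selected = []
    for group in groups:
        dists = ((math.dist(u, v), u, v) for u, v in combinations(group, 2))
        # S31: keep the k closest pairs inside this group.
        selected.extend(heapq.nsmallest(k, dists))
    # S32: merge the per-group winners and keep the k closest overall.
    return heapq.nsmallest(k, selected)

# Hypothetical 2-D data in two candidate groups.
groups = [
    [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)],
    [(10.0, 10.0), (10.0, 10.5), (20.0, 20.0)],
]
best = top_k_pairs(groups, k=2)
```

Selecting per group first keeps the merged list at no more than K pairs per group, so the final sort never has to handle every pairwise distance at once.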
It should be noted that the length M of the sample signature needs to be greater than the depth of the binary tree when the final split is completed. If the samples are too concentrated in one neighbor candidate group, the algorithm may spend too much time iterating; in that case the sample vectors can be normalized before the neighbor search.
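The text does not say which normalization is meant. One option, sketched below as an assumption, is to standardize each dimension across the data set (zero mean, unit variance). Centering is chosen here because rescaling a single vector by a positive factor does not change the sign of its projections and so cannot spread out its signature bits. The function name `standardize` is hypothetical.

```python
from statistics import fmean, pstdev

def standardize(samples):
    """Center each dimension to mean 0 and scale it to unit variance.

    Dimensions that are constant across all samples are only centered
    (their standard deviation of 0 is replaced by 1 to avoid division by zero).
    """
    columns = list(zip(*samples))
    means = [fmean(c) for c in columns]
    stds = [pstdev(c) or 1.0 for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)]
            for row in samples]

# Hypothetical data: the second dimension is constant.
normalized = standardize([[1.0, 10.0], [3.0, 10.0]])
```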
T is the threshold for deciding whether to split a leaf node. If T is larger, steps S22 and S23 are repeated fewer times and the generated neighbor candidate groups are larger (i.e., each group contains more samples); with limited computing resources, larger candidate groups make the subsequent processing more expensive. If T is smaller, steps S22 and S23 are repeated more times and more time is consumed.
N is the starting depth of the binary tree in the neighbor search. If N is too small, many iterations may be needed to obtain the neighbor candidate groups. If N is too large, the accuracy of the neighbor search may suffer.
Therefore, the values of the parameters M, N and T can be tuned experimentally for different data distributions so that the number of iterations stays near a preset value, for example 9, 10 or 11.
In addition, it should be noted that, according to the Johnson-Lindenstrauss lemma, the above random projection can map a high-dimensional vector to a low-dimensional vector while retaining the position information of the samples. The method relies on the following assumptions:
1) The sample signature is assumed to preserve the position information of the samples and thereby indirectly preserve the distance information between them; that is, for all or at least most samples, the distance between samples is similar to the distance between their signatures. The generated sample signatures should satisfy: the closer two samples are, the more positions in their two signatures hold equal values.
2) If a neighbor candidate group contains more samples, the samples in that group are more likely to be pairwise close.
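Assumption 1) can be related to a standard property of sign random projections, which is not stated in the source and is given here only as background. When each row of the projection matrix has independent zero-mean Gaussian entries, the probability that one signature bit agrees for two samples x and y depends only on the angle between them. The symbols s_i (the i-th signature bit), θ (the angle) and d_H (the Hamming distance) are notation introduced for this note; M is the signature length as in the text.

```latex
\[
  \Pr\bigl[s_i(x) = s_i(y)\bigr] = 1 - \frac{\theta(x, y)}{\pi},
  \qquad
  \mathbb{E}\bigl[d_H\bigl(s(x), s(y)\bigr)\bigr] = \frac{M\,\theta(x, y)}{\pi},
  \qquad
  \theta(x, y) = \arccos\frac{x \cdot y}{\lVert x\rVert\,\lVert y\rVert}
\]
```

The bits therefore track the angle rather than the Euclidean distance directly; for samples of comparable norm, a smaller angle also means a smaller Euclidean distance, which is consistent with the assumption that closer samples agree in more signature positions.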
For example, the high-dimensional neighbor pair search method of the above embodiment may be implemented by the system shown in fig. 5. As shown in fig. 5, the system includes: a network interface for connecting to the internet or other form of communication network to obtain sample vectors; an input device for collecting input signals of a user of the system, including parameters M, T, N, K, etc.; a hard disk for storing information in the form of user logs; the central processing unit is used for running a program, namely executing the program corresponding to the high-dimensional neighbor pair searching method; the storage unit is used for storing temporary variables such as iteration times when the program is executed; and a display for displaying relevant information, namely, neighbor search results, to a user of the system.
In summary, according to the high-dimensional neighbor pair searching method provided by the embodiment of the invention, corresponding sample signatures are generated according to the numerical values of the sample vectors, then neighbor candidate groups are generated according to the sample signatures, the distance between any two samples in each neighbor candidate group is calculated, and the sample pair with the distance meeting the preset requirement is taken as a neighbor searching result, so that the high-dimensional neighbor pair is effectively searched, the searching requirement of a user is met, and the method is simple and easy to implement.
Further, the present invention proposes a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described high-dimensional neighbor pair search method.
The non-transitory computer readable storage medium of the embodiment of the invention can realize the effective search of the high-dimensional neighbor pairs by executing the program stored on the non-transitory computer readable storage medium and corresponding to the high-dimensional neighbor pair search method, thereby meeting the search requirement of users.
Further, the present invention proposes a computer program product which, when executed by a processor, performs the above-described high-dimensional neighbor pair search method.
The computer program product of the embodiment of the invention can realize the effective search of the high-dimensional neighbor pairs and meet the search requirement of users by executing the program corresponding to the high-dimensional neighbor pair search method.
Fig. 6 is a schematic structural diagram of a high-dimensional neighbor pair search system according to an embodiment of the present invention. As shown in fig. 6, the high-dimensional neighbor pair search system 100 includes: a first generation module 110, a second generation module 120, and a processing module 130.
The first generation module 110 is configured to generate a corresponding sample signature according to the value of the sample vector. The second generation module 120 is configured to generate a neighbor candidate set according to the sample signature. The processing module 130 is configured to calculate a distance between any two samples in each neighbor candidate set, and use a pair of samples whose distance meets a preset requirement as a neighbor search result.
In one embodiment of the present invention, the first generation module 110 is configured to map the sample vector from an original vector to a target vector through a k×d projection matrix R, to assign 1 to the corresponding position of the sample signature when a value of the target vector is not less than zero, and to assign 0 to the corresponding position of the sample signature when the value is less than zero, wherein d is the dimension of the original vector, k is the dimension of the target vector, and d > k.
In one embodiment of the present invention, the second generation module 120 performs the steps of:
S21, constructing a binary tree of depth N and storing the sample signatures in the leaf nodes of the binary tree, wherein the path from the root node of the binary tree to a leaf node corresponds to the values of the first N bits of a sample signature, samples whose signatures share the same first N bits are stored in the same leaf node, and N is smaller than the length M of the sample signature;
S22, when the number of sample signatures in a leaf node is larger than a first preset value T, dividing the sample signatures into two different child leaf nodes under that leaf node according to the value of their (N+1)th bit;
S23, pruning the tree by removing the leaf nodes of the Nth layer;
S24, repeating steps S22 and S23 until no leaf node remains to be split.
In one embodiment of the present invention, when taking the sample pairs whose distance meets the preset requirement as the neighbor search result, the processing module 130 is specifically configured to sort the calculated distances between the sample pairs within each neighbor candidate group and acquire the K sample pairs with the smallest distances in that group, and then to sort the sample pairs acquired from the different neighbor candidate groups and take the K sample pairs with the smallest distances as the neighbor search result.
It should be noted that the foregoing explanation of the embodiment of the method for searching a high-dimensional neighbor pair is also applicable to the high-dimensional neighbor pair search system of this embodiment, and will not be repeated herein.
According to the high-dimensional neighbor pair search system provided by the embodiment of the invention, the corresponding sample signature is firstly generated according to the numerical value of the sample vector through the first generation module, then the neighbor candidate group is generated according to the sample signature through the second generation module, the distance between any two samples in each neighbor candidate group is calculated through the processing module, and the sample pair with the distance meeting the preset requirement is used as a neighbor search result, so that the high-dimensional neighbor pair is effectively searched, the search requirement of a user is met, and the system is simple and easy to realize.
In the description of the present specification, a description referring to the terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may join and combine the different embodiments or examples described in this specification, and the features of the different embodiments or examples, provided they do not contradict each other.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present invention, the meaning of "plurality" means at least two, for example, two, three, etc., unless specifically defined otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and additional implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order from that shown or discussed, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the embodiments of the present invention.
Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, they may be implemented using any one or a combination of the following techniques, which are well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.
Those of ordinary skill in the art will appreciate that all or a portion of the steps carried out in the methods of the above-described embodiments may be implemented by a program instructing related hardware, where the program may be stored in a computer-readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.
In addition, the functional units in the embodiments of the present invention may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If implemented in the form of a software functional module and sold or used as a stand-alone product, the integrated module may also be stored in a computer-readable storage medium.
The above-mentioned storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like. While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and are not to be construed as limiting the invention, and that changes, modifications, substitutions and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the invention.

Claims (4)

1. A high-dimensional neighbor pair searching method, characterized by comprising the following steps:
generating a corresponding sample signature according to the values of an image sample vector;
generating neighbor candidate groups according to the sample signatures;
calculating the distance between any two samples in each neighbor candidate group, and taking sample pairs whose distance meets a preset requirement as the neighbor search result; wherein, if a number a of neighbor candidate groups are generated and each neighbor candidate group contains b samples, the distances between any two samples in each neighbor candidate group are calculated and sorted from small to large, the first K sample pairs are selected from each of the a groups, the selected sample pairs are then sorted together from small to large, and the first K sample pairs in that ordering are selected as the neighbor search result;
wherein the sample signature is a binary vector; the sample vector is mapped from an original vector to a target vector by a projection matrix, where d is the dimension of the original vector, k is the dimension of the target vector, and d is greater than k; if a value of the target vector is not smaller than zero, 1 is assigned to the corresponding position of the sample signature; if the value of the target vector is smaller than zero, 0 is assigned to the corresponding position of the sample signature; the projection matrix is randomly generated from the Gaussian distribution N(0, 1/k); and wherein generating the neighbor candidate groups according to the sample signatures comprises:
S21: constructing a binary tree of depth N and storing the sample signatures in leaf nodes of the binary tree, wherein a path from the root node of the binary tree to a leaf node corresponds to the values of the first N dimensions of a sample signature, samples whose first N signature bits are the same are stored in the same leaf node, and N is smaller than the length M of the sample signature;
S22: when the number of sample signatures in a leaf node is larger than a first preset value T, dividing the sample signatures into two different leaf nodes under that leaf node according to the value of their (N+1)th bit;
S23: pruning the tree by pruning the leaf nodes of the Nth layer;
S24: repeating steps S22 and S23 until no leaf node remains to be split.
2. The high-dimensional neighbor pair searching method according to claim 1, wherein taking sample pairs whose distance meets the preset requirement as the neighbor search result comprises:
sorting the calculated distances between sample pairs within each neighbor candidate group, and obtaining the first K sample pairs with the smallest distances in that group;
sorting together the K sample pairs obtained from the different neighbor candidate groups, and taking the first K sample pairs with the smallest distances as the neighbor search result.
3. A high-dimensional neighbor pair search system, comprising:
a first generation module, configured to generate a corresponding sample signature according to the values of an image sample vector;
a second generation module, configured to generate neighbor candidate groups according to the sample signatures;
a processing module, configured to calculate the distance between any two samples in each neighbor candidate group and to take sample pairs whose distance meets a preset requirement as the neighbor search result; wherein, if a number a of neighbor candidate groups are generated and each neighbor candidate group contains b samples, the distances between any two samples in each neighbor candidate group are calculated and sorted from small to large, the first K sample pairs are selected from each of the a groups, the selected sample pairs are then sorted together from small to large, and the first K sample pairs in that ordering are selected as the neighbor search result;
wherein the sample signature is a binary vector; the sample vector is mapped from an original vector to a target vector by a projection matrix, where d is the dimension of the original vector, k is the dimension of the target vector, and d is greater than k; if a value of the target vector is not smaller than zero, 1 is assigned to the corresponding position of the sample signature; if the value of the target vector is smaller than zero, 0 is assigned to the corresponding position of the sample signature; the projection matrix is randomly generated from the Gaussian distribution N(0, 1/k); and wherein generating the neighbor candidate groups according to the sample signatures comprises:
S21: constructing a binary tree of depth N and storing the sample signatures in leaf nodes of the binary tree, wherein a path from the root node of the binary tree to a leaf node corresponds to the values of the first N dimensions of a sample signature, samples whose first N signature bits are the same are stored in the same leaf node, and N is smaller than the length M of the sample signature;
S22: when the number of sample signatures in a leaf node is larger than a first preset value T, dividing the sample signatures into two different leaf nodes under that leaf node according to the value of their (N+1)th bit;
S23: pruning the tree by pruning the leaf nodes of the Nth layer;
S24: repeating steps S22 and S23 until no leaf node remains to be split.
4. The high-dimensional neighbor pair search system according to claim 3, wherein, when taking sample pairs whose distance meets the preset requirement as the neighbor search result, the processing module is configured to:
sort the calculated distances between sample pairs within each neighbor candidate group, and obtain the first K sample pairs with the smallest distances in that group;
sort together the K sample pairs obtained from the different neighbor candidate groups, and take the first K sample pairs with the smallest distances as the neighbor search result.
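
For illustration only, and without limiting the claims, the claimed steps can be sketched in Python as follows. This is a simplified reading under stated assumptions, not the patented implementation. The function names (make_projection, signature, candidate_groups, neighbor_pairs) and the demo parameters are hypothetical. The value 1/k in N(0, 1/k) is read as the variance. The binary tree is represented as a dictionary keyed by signature prefix, so step S23 is approximated by discarding a leaf once it has been split. Euclidean distance is assumed because the claims do not name a distance measure.

```python
import heapq
import itertools
import math
import random


def make_projection(d, k, rng):
    """k x d matrix with entries from N(0, 1/k); 1/k is read as the variance (assumption)."""
    std = math.sqrt(1.0 / k)
    return [[rng.gauss(0.0, std) for _ in range(d)] for _ in range(k)]


def signature(x, proj):
    """Binary signature: 1 where the projected value is >= 0, otherwise 0."""
    return tuple(1 if sum(r * v for r, v in zip(row, x)) >= 0 else 0 for row in proj)


def candidate_groups(sigs, n, t):
    """Group sample indices by signature prefix, splitting leaves that hold more than t samples."""
    m = len(sigs[0])
    leaves = {}
    for i, s in enumerate(sigs):            # S21: leaves keyed by the first n bits
        leaves.setdefault(s[:n], []).append(i)
    pending = [p for p, members in leaves.items() if len(members) > t and len(p) < m]
    while pending:                          # S24: repeat until nothing is left to split
        prefix = pending.pop()
        members = leaves.pop(prefix)        # S23 (as read here): the split leaf is dropped
        for i in members:                   # S22: split on the next signature bit
            leaves.setdefault(sigs[i][:len(prefix) + 1], []).append(i)
        for bit in (0, 1):
            child = prefix + (bit,)
            if child in leaves and len(leaves[child]) > t and len(child) < m:
                pending.append(child)
    return list(leaves.values())


def neighbor_pairs(samples, groups, top_k):
    """Take the top_k closest pairs per group, then the top_k over the merged results."""
    merged = []
    for group in groups:
        pairs = ((math.dist(samples[i], samples[j]), i, j)
                 for i, j in itertools.combinations(sorted(group), 2))
        merged.extend(heapq.nsmallest(top_k, pairs))
    return heapq.nsmallest(top_k, merged)


# Hypothetical demo data: 200 random samples, d = 16, k = 8, N = 2, T = 20, K = 5.
rng = random.Random(0)
samples = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(200)]
proj = make_projection(16, 8, rng)
sigs = [signature(x, proj) for x in samples]
groups = candidate_groups(sigs, n=2, t=20)
result = neighbor_pairs(samples, groups, top_k=5)
```

In this sketch, distances are computed only between samples that share a leaf, so the number of distance computations depends on the group sizes rather than on all pairs in the data set. Each entry of result is a tuple of the form (distance, i, j), where i and j are sample indices.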
CN201810179962.6A 2018-03-05 2018-03-05 High-dimensional neighbor pair searching method and system Active CN110309139B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810179962.6A CN110309139B (en) 2018-03-05 2018-03-05 High-dimensional neighbor pair searching method and system

Publications (2)

Publication Number Publication Date
CN110309139A CN110309139A (en) 2019-10-08
CN110309139B true CN110309139B (en) 2024-02-13

Family

ID=68073598

Country Status (1)

Country Link
CN (1) CN110309139B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112308122B (en) * 2020-10-20 2024-03-01 中国刑事警察学院 High-dimensional vector space sample rapid searching method and device based on double trees

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1477563A (en) * 2003-07-03 2004-02-25 复旦大学 High-dimensional vector data quick similar search method
CN101334786A (en) * 2008-08-01 2008-12-31 浙江大学 Formulae neighborhood based data dimensionality reduction method
CN101556601A (en) * 2009-03-12 2009-10-14 华为技术有限公司 Method and device for searching k neighbor
CN103377237A (en) * 2012-04-27 2013-10-30 常州市图佳网络科技有限公司 High dimensional data neighbor search method and fast approximate image search method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8260583B2 (en) * 2009-03-12 2012-09-04 Siemens Product Lifecycle Management Software Inc. System and method for identifying wall faces in an object model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant