CN110309139B - High-dimensional neighbor pair searching method and system - Google Patents
- Publication number: CN110309139B
- Application number: CN201810179962.6A
- Authority: CN (China)
- Prior art keywords: sample, neighbor, signature, vector, samples
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
- G06F16/2246—Trees, e.g. B+trees
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2462—Approximate or statistical queries
Abstract
The invention provides a high-dimensional neighbor pair searching method and a system, wherein the high-dimensional neighbor pair searching method comprises the following steps: generating a corresponding sample signature according to the numerical value of the sample vector; generating a neighbor candidate group according to the sample signature; and calculating the distance between any two samples in each neighbor candidate group, and taking a sample pair with the distance meeting the preset requirement as a neighbor search result. Therefore, the efficient search of the high-dimensional neighbor pairs is realized, the search requirement of the user is met, and the method is simple and easy to realize.
Description
Technical Field
The invention relates to the technical field of computers, in particular to a high-dimensional neighbor pair searching method and a high-dimensional neighbor pair searching system.
Background
With the development of science and technology, large-scale search engines must be able to search effectively and rapidly. Commonly used search structures currently include k-d trees, R-trees and the like. However, these data structures and their variants are only suitable for searching data of relatively low dimensionality. To increase search accuracy, the feature vectors used to characterize the objects to be searched, such as images, are often high-dimensional, possibly on the order of 10^5 dimensions. When the dimension of the data exceeds 100, or even reaches thousands of dimensions, the search capability of the above data structures declines rapidly. Therefore, how to search high-dimensional neighbor pairs efficiently still has high research value.
Disclosure of Invention
The present invention aims to solve at least one of the technical problems in the related art to some extent.
Therefore, a first object of the present invention is to provide a method for searching a high-dimensional neighbor pair, so as to realize efficient searching of the high-dimensional neighbor pair and meet the searching requirement of a user.
A second object of the invention is to propose a non-transitory computer readable storage medium.
A third object of the invention is to propose a computer program product.
A fourth object of the present invention is to provide a high-dimensional neighbor pair search system.
To achieve the above objective, an embodiment of a first aspect of the present invention provides a method for searching a high-dimensional neighbor pair, including the following steps: generating a corresponding sample signature according to the numerical value of the sample vector; generating a neighbor candidate group according to the sample signature; and calculating the distance between any two samples in each neighbor candidate group, and taking a sample pair with the distance meeting the preset requirement as a neighbor search result.
According to the high-dimensional neighbor pair search method provided by the embodiment of the invention, a corresponding sample signature is first generated according to the values of the sample vector, neighbor candidate groups are then generated according to the sample signatures, the distance between any two samples in each neighbor candidate group is calculated, and the sample pairs whose distance meets the preset requirement are taken as the neighbor search result. Efficient search of high-dimensional neighbor pairs is thereby realized, the search requirement of the user is met, and the method is simple and easy to implement.
In addition, the high-dimensional neighbor pair search method according to the above embodiment of the present invention may further have the following additional technical features:
According to one embodiment of the invention, the sample signature is a binary vector.
According to one embodiment of the invention, generating the sample signature according to the values of the sample vector includes: mapping the sample vector from an original vector to a target vector through a projection matrix R_{k×d}, wherein d is the dimension of the original vector, k is the dimension of the target vector, and d is greater than k; if a value of the target vector is not less than zero, assigning 1 to the corresponding position of the sample signature; if a value of the target vector is less than zero, assigning 0 to the corresponding position of the sample signature.
According to one embodiment of the invention, the projection matrix R_{k×d} is randomly generated from the Gaussian distribution N(0, 1/k).
According to one embodiment of the present invention, generating the neighbor candidate groups according to the sample signatures includes: S21, constructing a binary tree of depth N and storing the sample signatures in the leaf nodes of the binary tree, wherein the path from the root node to a leaf node corresponds to the values of the first N bits of the sample signature, samples whose signatures share the same first N bits are stored in the same leaf node, and N is less than the length M of the sample signature; S22, when the number of sample signatures in a leaf node is greater than a first preset value T, dividing the sample signatures into two different child leaf nodes under that leaf node according to the value of their (N+1)th bit; S23, pruning the tree by pruning the leaf nodes of the Nth layer; S24, repeating steps S22 and S23 until no leaf node remains to be split.
According to one embodiment of the present invention, taking the sample pairs whose distance meets the preset requirement as the neighbor search result includes: sorting the calculated distances between the sample pairs within each neighbor candidate group, and obtaining the K sample pairs with the smallest distances in that group; and sorting the sample pairs obtained from the different neighbor candidate groups, and taking the K sample pairs with the smallest distances as the neighbor search result.
To achieve the above object, a second aspect of the present invention provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described high-dimensional neighbor pair search method.
The non-transitory computer readable storage medium of the embodiment of the invention can realize the effective search of the high-dimensional neighbor pairs by executing the program stored on the non-transitory computer readable storage medium and corresponding to the high-dimensional neighbor pair search method, thereby meeting the search requirement of users.
To achieve the above object, an embodiment of a third aspect of the present invention provides a computer program product, which when executed by a processor, performs the above-mentioned high-dimensional neighbor pair search method.
The computer program product of the embodiment of the invention can realize the effective search of the high-dimensional neighbor pairs and meet the search requirement of users by executing the program corresponding to the high-dimensional neighbor pair search method.
To achieve the above object, a fourth aspect of the present invention provides a high-dimensional neighbor pair search system, including: the first generation module is used for generating a corresponding sample signature according to the numerical value of the sample vector; the second generation module is used for generating a neighbor candidate group according to the sample signature; and the processing module is used for calculating the distance between any two samples in each neighbor candidate group and taking a sample pair with the distance meeting the preset requirement as a neighbor search result.
According to the high-dimensional neighbor pair search system provided by the embodiment of the invention, the corresponding sample signature is firstly generated according to the numerical value of the sample vector through the first generation module, then the neighbor candidate group is generated according to the sample signature through the second generation module, the distance between any two samples in each neighbor candidate group is calculated through the processing module, and the sample pair with the distance meeting the preset requirement is used as a neighbor search result, so that the high-dimensional neighbor pair is effectively searched, the search requirement of a user is met, and the system is simple and easy to realize.
According to one embodiment of the invention, the first generation module is configured to: map the sample vector from an original vector to a target vector through a projection matrix R_{k×d}, assign 1 at the corresponding position of the sample signature when the value of the target vector is not less than zero, and assign 0 at the corresponding position of the sample signature when the value of the target vector is less than zero, wherein d is the dimension of the original vector, k is the dimension of the target vector, and d > k.
According to one embodiment of the invention, the second generation module performs the following steps: S21, constructing a binary tree of depth N and storing the sample signatures in the leaf nodes of the binary tree, wherein the path from the root node to a leaf node corresponds to the values of the first N bits of the sample signature, samples whose signatures share the same first N bits are stored in the same leaf node, and N is less than the length M of the sample signature; S22, when the number of sample signatures in a leaf node is greater than a first preset value T, dividing the sample signatures into two different child leaf nodes under that leaf node according to the value of their (N+1)th bit; S23, pruning the tree by pruning the leaf nodes of the Nth layer; S24, repeating steps S22 and S23 until no leaf node remains to be split.
According to one embodiment of the present invention, when taking the sample pairs whose distance meets the preset requirement as the neighbor search result, the processing module is specifically configured to: sort the calculated distances between the sample pairs within each neighbor candidate group, and obtain the K sample pairs with the smallest distances in that group; and sort the sample pairs obtained from the different neighbor candidate groups, and take the K sample pairs with the smallest distances as the neighbor search result.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a flowchart of a high-dimensional neighbor pair search method according to an embodiment of the present invention;
FIG. 2 is a flowchart of step S2 in a high-dimensional neighbor pair search method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a binary tree according to one example of the present invention;
FIG. 4 is a flowchart of step S3 in a high-dimensional neighbor pair search method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a system for performing the high-dimensional neighbor pair search method of the present invention; and
FIG. 6 is a schematic structural diagram of a high-dimensional neighbor pair search system according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention.
The high-dimensional neighbor pair search method and system of the embodiment of the invention are described below with reference to the accompanying drawings.
Fig. 1 is a flowchart of a high-dimensional neighbor pair search method according to an embodiment of the present invention. As shown in fig. 1, the high-dimensional neighbor pair search method includes the following steps:
S1, generating a corresponding sample signature according to the values of the sample vector.
In one embodiment of the invention, the sample signature is a binary vector.
Specifically, the sample vector can first be mapped from the original vector to the target vector through a projection matrix R_{k×d}; 1 is assigned at the corresponding position of the sample signature when the value of the target vector is not less than zero, and 0 is assigned at the corresponding position of the sample signature when the value of the target vector is less than zero, wherein d is the dimension of the original vector (high dimension), k is the dimension of the target vector (low dimension), and d > k.
Optionally, the projection matrix R_{k×d} may be randomly generated from a Gaussian distribution N(0, 1/k).
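Step S1 can be illustrated with a minimal sketch, assuming that the second parameter of N(0, 1/k) is a variance (so the standard deviation is sqrt(1/k)) and that signatures are held as bit strings. The function names `make_projection` and `signature`, the seeds, and the sizes are hypothetical and chosen only for illustration.

```python
import random

def make_projection(k, d, seed=0):
    """Return a k x d matrix whose entries are drawn from N(0, 1/k).

    Assumption: 1/k is the variance, so the standard deviation is sqrt(1/k).
    """
    rng = random.Random(seed)
    sigma = (1.0 / k) ** 0.5
    return [[rng.gauss(0.0, sigma) for _ in range(d)] for _ in range(k)]

def signature(R, x):
    """Project x with R and keep one bit per target dimension.

    A target value >= 0 gives bit 1, a value < 0 gives bit 0.
    """
    bits = []
    for row in R:
        value = sum(r * xi for r, xi in zip(row, x))
        bits.append("1" if value >= 0 else "0")
    return "".join(bits)

# Illustrative sizes only: d = 100 original dimensions, k = 8 signature bits.
R = make_projection(k=8, d=100, seed=42)
rng = random.Random(1)
x = [rng.uniform(-1.0, 1.0) for _ in range(100)]
sig = signature(R, x)
```

Because only the sign of each projected value is kept, multiplying a sample vector by a positive constant leaves its signature unchanged.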
S2, generating a neighbor candidate group according to the sample signature.
Specifically, neighbor candidate groups may be generated according to the values at the respective positions of the sample signatures, with sample signatures of closer values divided into the same neighbor candidate group. For example, for samples with sample signatures 00001111, 00001010, 00001110, 00010010, 00011100 and 00011101, the samples 00001111, 00001010 and 00001110 may be divided into one neighbor candidate group, and 00010010, 00011100 and 00011101 into another neighbor candidate group.
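The grouping in this example can be reproduced with a short sketch. It assumes that "closer values" means a shared signature prefix, here the first 4 bits, which is consistent with the binary tree of step S21 but is not stated for this particular example; `group_by_prefix` is a hypothetical helper name.

```python
from collections import defaultdict

def group_by_prefix(signatures, n_bits):
    """Put signatures that share the same first n_bits into one group."""
    groups = defaultdict(list)
    for sig in signatures:
        groups[sig[:n_bits]].append(sig)
    return dict(groups)

sigs = ["00001111", "00001010", "00001110",
        "00010010", "00011100", "00011101"]
groups = group_by_prefix(sigs, n_bits=4)
# groups["0000"] holds the first three signatures, groups["0001"] the last three.
```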
S3, calculating the distance between any two samples in each neighbor candidate group, and taking the sample pairs whose distance meets the preset requirement as the neighbor search result.
Specifically, if a neighbor candidate groups are generated and each neighbor candidate group includes b samples, the distance (e.g., the Euclidean distance) between any two samples in each neighbor candidate group can be calculated and sorted in ascending order, and the first K sample pairs in each of the a groups are selected. The sample pairs selected from all a groups are then sorted again in ascending order, and the first K sample pairs of this ordering are taken as the neighbor search result.
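Step S3 can be sketched as follows, assuming Euclidean distance and candidate groups represented as dictionaries from a sample id to its vector. The names `top_pairs_in_group` and `neighbor_pairs` and the toy data are hypothetical.

```python
import heapq
import math
from itertools import combinations

def top_pairs_in_group(group, K):
    """Return the K closest pairs of one group as (distance, id1, id2) tuples.

    group maps a sample id to its vector.
    """
    scored = (
        (math.dist(group[i], group[j]), i, j)
        for i, j in combinations(sorted(group), 2)
    )
    return heapq.nsmallest(K, scored)

def neighbor_pairs(groups, K):
    """Keep the K best pairs per group, then the K best pairs overall."""
    merged = []
    for group in groups:
        merged.extend(top_pairs_in_group(group, K))
    return heapq.nsmallest(K, merged)

# Two toy candidate groups (2-dimensional vectors for readability).
groups = [
    {"a": (0.0, 0.0), "b": (0.0, 1.0), "c": (3.0, 0.0)},
    {"d": (10.0, 10.0), "e": (10.0, 10.5)},
]
result = neighbor_pairs(groups, K=2)
# result == [(0.5, "d", "e"), (1.0, "a", "b")]
```

Distances are only computed inside each group, so a groups of b samples need a·b(b-1)/2 distance computations instead of one for every pair in the whole data set.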
In one embodiment of the present invention, as shown in fig. 2, the step S2 may further include the steps of:
S21, constructing a binary tree of depth N and storing the sample signatures in the leaf nodes of the binary tree, wherein the path from the root node to a leaf node corresponds to the values of the first N bits of the sample signature, samples whose signatures share the same first N bits are stored in the same leaf node, and N is less than the length M of the sample signature.
S22, when the number of sample signatures in a leaf node is greater than a first preset value T, dividing the sample signatures into two different child leaf nodes under that leaf node according to the value of their (N+1)th bit.
For example, referring to FIG. 3, if leaf node C needs to be split and the current depth is 2, the signature 110 is placed in the left child node because its (N+1)th bit, i.e., its 3rd bit, is 0, the signature 111 is placed in the right child node because its 3rd bit is 1, and the depth of the tree becomes 3, i.e., N+1.
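The split of leaf node C in this example can be written as a small sketch. The helper `split_leaf` is hypothetical, and signatures are assumed to be bit strings, so the (N+1)th bit sits at zero-based index N.

```python
def split_leaf(signatures, depth):
    """Split a leaf at the given depth on bit depth + 1.

    With zero-based indexing, the (depth + 1)th bit is signatures[i][depth].
    Returns the left child (bit 0), the right child (bit 1) and the new depth.
    """
    left = [s for s in signatures if s[depth] == "0"]
    right = [s for s in signatures if s[depth] == "1"]
    return left, right, depth + 1

# Leaf C at depth 2 holds signatures that share the prefix "11".
left, right, new_depth = split_leaf(["110", "111"], depth=2)
# left == ["110"], right == ["111"], new_depth == 3
```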
S23, pruning the tree, and pruning leaf nodes of the Nth layer.
Specifically, during the splitting of step S22, only some of the leaf nodes may be split. In the leaf nodes that are not split, the shared signature prefix is 1 bit shorter than in the child nodes produced by splitting, i.e., the samples in these nodes are not as close to each other as the samples in the newly split leaf nodes. At this point, the leaf nodes of the Nth layer that were not split may be pruned, so that the closer sample groups are retained.
S24, repeating the steps S22 and S23 until no leaf node to be segmented exists.
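Steps S21 to S24 can be put together in a minimal sketch. It rests on one reading of the text: whenever at least one leaf of the current layer is split, every unsplit leaf of that layer is pruned together with its samples, and the loop stops once no leaf holds more than T signatures. The function name `build_candidate_groups`, the dictionary representation of the leaves, and the toy data are hypothetical.

```python
from collections import defaultdict

def build_candidate_groups(signatures, N, T):
    """Return (groups, final_depth) for signatures given as id -> bit string.

    groups maps a signature prefix (the path from the root) to sample ids.
    """
    M = min(len(sig) for sig in signatures.values())
    if N >= M:
        raise ValueError("N must be smaller than the signature length M")

    # S21: leaves of a depth-N tree, keyed by the first N bits.
    leaves = defaultdict(list)
    for sample_id, sig in signatures.items():
        leaves[sig[:N]].append(sample_id)

    depth = N
    while True:
        oversized = {p: ids for p, ids in leaves.items() if len(ids) > T}
        if not oversized:          # S24: nothing left to split
            return dict(leaves), depth
        if depth >= M:
            raise ValueError("signatures too short to keep splitting")

        # S22: split each oversized leaf on bit depth + 1.
        children = defaultdict(list)
        for prefix, ids in oversized.items():
            for sample_id in ids:
                children[prefix + signatures[sample_id][depth]].append(sample_id)

        # S23: prune the unsplit leaves of the current layer (assumed reading).
        leaves = children
        depth += 1

signatures = {"a": "0010", "b": "0011", "c": "0100", "d": "1000"}
groups, depth = build_candidate_groups(signatures, N=1, T=2)
# groups == {"00": ["a", "b"], "01": ["c"]}, depth == 2
```

In this toy run the leaf with prefix 1 holds a single sample, is never split, and is pruned, so sample d does not appear in any candidate group.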
In one embodiment of the present invention, as shown in fig. 4, the step S3 includes the following steps:
S31, sorting the calculated distances between the sample pairs within each neighbor candidate group, and obtaining the K sample pairs with the smallest distances in that group.
S32, sorting the sample pairs obtained from the different neighbor candidate groups, and taking the K sample pairs with the smallest distances as the neighbor search result.
It should be noted that the length M of the sample signature needs to be greater than the depth of the binary tree when the final split is completed. If the samples are too concentrated in one neighbor candidate group, the algorithm may spend too much time iterating; in this case the sample vectors can be normalized before the neighbor search.
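The text does not say which normalization is meant. The sketch below assumes per-dimension standardization (zero mean, unit variance), chosen because rescaling a vector by a positive constant does not change the sign bits of its projection, whereas centering does. The helper name `standardize` is hypothetical.

```python
import statistics

def standardize(vectors):
    """Shift each dimension to zero mean and scale it to unit variance.

    Assumption: this is one possible normalization, not necessarily the
    one intended by the method. Constant dimensions are only centered.
    """
    columns = list(zip(*vectors))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return [
        [(v - m) / s for v, m, s in zip(vec, means, stds)]
        for vec in vectors
    ]

normalized = standardize([[1.0, 10.0], [3.0, 10.0]])
# normalized == [[-1.0, 0.0], [1.0, 0.0]]
```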
T is the threshold for deciding whether to split a leaf node. If T is larger, steps S22 and S23 are repeated fewer times and the generated neighbor candidate groups are larger (i.e., each neighbor candidate group contains more samples); because computing resources are limited, larger neighbor candidate groups make the subsequent processing more difficult. If T is smaller, steps S22 and S23 are repeated more times and more time is consumed.
N is the starting depth of the binary tree for the neighbor search. If N is too small, many iterations may be needed to obtain the neighbor candidate groups. If N is too large, the accuracy of the neighbor search may be compromised.
Therefore, the values of the parameters M, N and T can be adjusted experimentally for different data distributions so that the number of iterations stays at a preset value, for example 9, 10 or 11.
In addition, it should be noted that, according to the Johnson-Lindenstrauss lemma, the above random projection can project a high-dimensional vector to a low-dimensional vector while retaining the position information of the samples. The method relies on the following assumptions:
1) It is assumed that the sample signature preserves the position information of the samples, and thereby indirectly preserves the distance information between the samples; that is, for all samples, or at least for most samples, the distance between samples is similar to the distance between their sample signatures. The generated sample signatures should satisfy: the closer two samples are, the more positions of the two sample signatures have equal values.
2) If a neighbor candidate group contains more samples, the samples in this neighbor candidate group are more likely to be pairwise close.
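Assumption 1) can be checked empirically with a small sketch. For sign bits of a Gaussian random projection, the probability that a given bit differs between two vectors equals the angle between them divided by π, so nearby vectors tend to disagree on fewer bits. The sizes, the seed, the noise level and the helper names below are illustrative assumptions, not values from the method.

```python
import random

def sign_signature(R, x):
    """One bit per row of R: 1 if the projected value is >= 0, else 0."""
    return [1 if sum(r * xi for r, xi in zip(row, x)) >= 0 else 0 for row in R]

def hamming(a, b):
    """Number of positions at which two signatures differ."""
    return sum(p != q for p, q in zip(a, b))

rng = random.Random(7)
d, k = 50, 256                      # illustrative sizes only
sigma = (1.0 / k) ** 0.5            # entries from N(0, 1/k), variance assumed
R = [[rng.gauss(0.0, sigma) for _ in range(d)] for _ in range(k)]

x = [rng.uniform(-1.0, 1.0) for _ in range(d)]
near = [xi + rng.gauss(0.0, 0.05) for xi in x]   # small perturbation of x
far = [rng.uniform(-1.0, 1.0) for _ in range(d)] # unrelated vector

h_near = hamming(sign_signature(R, x), sign_signature(R, near))
h_far = hamming(sign_signature(R, x), sign_signature(R, far))
# The perturbed vector is expected to differ from x in far fewer bits
# than the unrelated vector does.
```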
For example, the high-dimensional neighbor pair search method of the above embodiment may be implemented by the system shown in fig. 5. As shown in fig. 5, the system includes: a network interface for connecting to the internet or other form of communication network to obtain sample vectors; an input device for collecting input signals of a user of the system, including parameters M, T, N, K, etc.; a hard disk for storing information in the form of user logs; the central processing unit is used for running a program, namely executing the program corresponding to the high-dimensional neighbor pair searching method; the storage unit is used for storing temporary variables such as iteration times when the program is executed; and a display for displaying relevant information, namely, neighbor search results, to a user of the system.
In summary, according to the high-dimensional neighbor pair searching method provided by the embodiment of the invention, corresponding sample signatures are generated according to the numerical values of the sample vectors, then neighbor candidate groups are generated according to the sample signatures, the distance between any two samples in each neighbor candidate group is calculated, and the sample pair with the distance meeting the preset requirement is taken as a neighbor searching result, so that the high-dimensional neighbor pair is effectively searched, the searching requirement of a user is met, and the method is simple and easy to implement.
Further, the present invention proposes a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described high-dimensional neighbor pair search method.
The non-transitory computer readable storage medium of the embodiment of the invention can realize the effective search of the high-dimensional neighbor pairs by executing the program stored on the non-transitory computer readable storage medium and corresponding to the high-dimensional neighbor pair search method, thereby meeting the search requirement of users.
Further, the present invention proposes a computer program product which, when executed by a processor, performs the above-described high-dimensional neighbor pair search method.
The computer program product of the embodiment of the invention can realize the effective search of the high-dimensional neighbor pairs and meet the search requirement of users by executing the program corresponding to the high-dimensional neighbor pair search method.
Fig. 6 is a schematic structural diagram of a high-dimensional neighbor pair search system according to an embodiment of the present invention. As shown in fig. 6, the high-dimensional neighbor pair search system 100 includes: a first generation module 110, a second generation module 120, and a processing module 130.
The first generation module 110 is configured to generate a corresponding sample signature according to the value of the sample vector. The second generation module 120 is configured to generate a neighbor candidate set according to the sample signature. The processing module 130 is configured to calculate a distance between any two samples in each neighbor candidate set, and use a pair of samples whose distance meets a preset requirement as a neighbor search result.
In one embodiment of the present invention, the first generation module 110 is configured to map the sample vector from the original vector to the target vector through a projection matrix R_{k×d}, assign 1 at the corresponding position of the sample signature when the value of the target vector is not less than zero, and assign 0 at the corresponding position of the sample signature when the value of the target vector is less than zero, wherein d is the dimension of the original vector, k is the dimension of the target vector, and d > k.
In one embodiment of the present invention, the second generation module 120 performs the steps of:
S21, constructing a binary tree of depth N and storing the sample signatures in the leaf nodes of the binary tree, wherein the path from the root node to a leaf node corresponds to the values of the first N bits of the sample signature, samples whose signatures share the same first N bits are stored in the same leaf node, and N is less than the length M of the sample signature;
S22, when the number of sample signatures in a leaf node is greater than a first preset value T, dividing the sample signatures into two different child leaf nodes under that leaf node according to the value of their (N+1)th bit;
S23, pruning the tree by pruning the leaf nodes of the Nth layer;
S24, repeating steps S22 and S23 until no leaf node remains to be split.
In one embodiment of the present invention, when taking the sample pairs whose distance meets the preset requirement as the neighbor search result, the processing module 130 is specifically configured to sort the calculated distances between the sample pairs within each neighbor candidate group and obtain the K sample pairs with the smallest distances in that group; and to sort the sample pairs obtained from the different neighbor candidate groups and take the K sample pairs with the smallest distances as the neighbor search result.
It should be noted that the foregoing explanation of the embodiment of the method for searching a high-dimensional neighbor pair is also applicable to the high-dimensional neighbor pair search system of this embodiment, and will not be repeated herein.
According to the high-dimensional neighbor pair search system provided by the embodiment of the invention, the corresponding sample signature is firstly generated according to the numerical value of the sample vector through the first generation module, then the neighbor candidate group is generated according to the sample signature through the second generation module, the distance between any two samples in each neighbor candidate group is calculated through the processing module, and the sample pair with the distance meeting the preset requirement is used as a neighbor search result, so that the high-dimensional neighbor pair is effectively searched, the search requirement of a user is met, and the system is simple and easy to realize.
In the description of the present specification, a description referring to the terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, those skilled in the art may combine the different embodiments or examples described in this specification, and the features of the different embodiments or examples, provided that they do not contradict each other.
Furthermore, the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "plurality" means at least two, for example, two, three, etc., unless specifically defined otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and additional implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order from that shown or discussed, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the embodiments of the present invention.
Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in other embodiments, the steps or methods may be implemented using any one or a combination of the following techniques, which are well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.
Those of ordinary skill in the art will appreciate that all or a portion of the steps carried out in the method of the above-described embodiments may be implemented by a program that instructs related hardware, where the program may be stored in a computer-readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a computer readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product.
The above-mentioned storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like. While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and are not to be construed as limiting the invention, and that changes, modifications, substitutions and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the invention.
Claims (4)
1. A high-dimensional neighbor pair searching method, characterized by comprising the following steps:
generating a corresponding sample signature according to the value of the image sample vector;
generating a neighbor candidate group according to the sample signature;
calculating the distance between any two samples in each neighbor candidate group, and taking the sample pairs whose distance meets the preset requirement as the neighbor search result; wherein, if a neighbor candidate groups are generated and each neighbor candidate group contains b samples, the distances between any two samples in each neighbor candidate group are calculated and sorted from small to large, the first K sample pairs are selected from each of the a groups, the sample pairs selected from all a groups are then sorted together from small to large, and the first K sample pairs in that ordering are selected as the neighbor search result;
the sample signature is a binary vector; the sample vector is mapped from the original vector to the target vector by a projection matrix, wherein d is the dimension of the original vector, k is the dimension of the target vector, and d is greater than k; if a value of the target vector is not smaller than zero, 1 is assigned to the corresponding position of the sample signature; if the value is smaller than zero, 0 is assigned to the corresponding position of the sample signature; the projection matrix is randomly generated from the Gaussian distribution N(0, 1/k); wherein generating a neighbor candidate group according to the sample signature comprises:
S21: constructing a binary tree of depth N and storing the sample signatures in the leaf nodes of the binary tree, wherein the path from the root node of the binary tree to a leaf node corresponds to the values of the first N dimensions of the sample signature, samples whose signatures share the same first N bits are stored in the same leaf node, and N is smaller than the length M of the sample signature;
S22: when the number of sample signatures in a leaf node is larger than a first preset value T, dividing the sample signatures into two different leaf nodes under that leaf node according to the value of their (N+1)-th bit;
S23: pruning the tree by pruning the leaf nodes of the N-th layer;
S24: repeating steps S22 and S23 until no leaf node remains to be split.
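The signature and grouping steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. All function names (`make_projection`, `signature`, `candidate_groups`) are hypothetical, the second parameter of N(0, 1/k) is assumed to be a variance, and the pruning of S23 is interpreted here as discarding a leaf once its samples have been moved into its two children. The tree is kept implicitly as a mapping from signature prefix (the root-to-leaf path) to sample indices.

```python
import random
from collections import defaultdict


def make_projection(d, k, seed=0):
    """Return a k x d matrix whose entries are drawn from N(0, 1/k).

    Assumption: 1/k is the variance, so the standard deviation is sqrt(1/k).
    """
    rng = random.Random(seed)
    sigma = (1.0 / k) ** 0.5
    return [[rng.gauss(0.0, sigma) for _ in range(d)] for _ in range(k)]


def signature(vec, proj):
    """Bit i is 1 if the i-th projected value is not smaller than zero, else 0."""
    return tuple(
        1 if sum(w * x for w, x in zip(row, vec)) >= 0 else 0
        for row in proj
    )


def candidate_groups(signatures, n, t):
    """Group sample indices by signature prefix (S21 to S24).

    Leaves start at depth n; a leaf holding more than t samples is split
    on its next signature bit until no leaf needs splitting or the
    signature bits are exhausted.
    """
    m = len(signatures[0])
    leaves = defaultdict(list)
    for idx, sig in enumerate(signatures):
        leaves[sig[:n]].append(idx)          # S21: leaf = first n bits

    pending = list(leaves.items())
    groups = []
    while pending:                            # S24: repeat until stable
        prefix, members = pending.pop()
        depth = len(prefix)
        if len(members) > t and depth < m:    # S22: oversized leaf
            children = defaultdict(list)
            for idx in members:
                children[signatures[idx][:depth + 1]].append(idx)
            pending.extend(children.items())  # S23: old leaf is dropped
        else:
            groups.append(sorted(members))
    return groups
```

A typical use would build the projection once with `make_projection(d, k)`, compute `signature(v, proj)` for every sample vector, and pass the resulting list to `candidate_groups` with the chosen N and T.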
2. The high-dimensional neighbor pair searching method according to claim 1, wherein taking a sample pair whose distance meets the preset requirement as a neighbor search result comprises:
sorting the calculated distances between sample pairs within each neighbor candidate group separately, and obtaining the K sample pairs with the smallest distances in each group;
and sorting the K sample pairs obtained from the different neighbor candidate groups together, and taking the K sample pairs with the smallest distances as the neighbor search result.
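The two-stage selection of claim 2 can be sketched as follows. This is an illustration only: the claims say "distance" without fixing a metric, so Euclidean distance is an assumption, and the names `top_k_in_group` and `neighbor_pairs` are hypothetical. Groups are given as lists of sample indices.

```python
import heapq
import math
from itertools import combinations


def top_k_in_group(samples, group, top_k):
    """Return the top_k closest pairs inside one candidate group.

    Each result is (distance, i, j) with i and j indexing into samples.
    Assumption: Euclidean distance.
    """
    pairs = (
        (math.dist(samples[i], samples[j]), i, j)
        for i, j in combinations(group, 2)
    )
    return heapq.nsmallest(top_k, pairs)


def neighbor_pairs(samples, groups, top_k):
    """Merge the per-group results and keep the top_k closest pairs overall."""
    merged = []
    for group in groups:
        merged.extend(top_k_in_group(samples, group, top_k))
    return heapq.nsmallest(top_k, merged)
```

Because only the K best pairs of each group reach the second stage, the final sort handles at most K pairs per group rather than every pair in every group.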
3. A high-dimensional neighbor pair search system, comprising:
the first generation module is used for generating a corresponding sample signature according to the numerical value of the image sample vector;
the second generation module is used for generating a neighbor candidate group according to the sample signature;
the processing module, which is used for calculating the distance between any two samples in each neighbor candidate group, and taking the sample pairs whose distance meets the preset requirement as the neighbor search result; wherein, if a neighbor candidate groups are generated and each neighbor candidate group contains b samples, the distances between any two samples in each neighbor candidate group are calculated and sorted from small to large, the first K sample pairs are selected from each of the a groups, the sample pairs selected from all a groups are then sorted together from small to large, and the first K sample pairs in that ordering are selected as the neighbor search result;
the sample signature is a binary vector; the sample vector is mapped from the original vector to the target vector by a projection matrix, wherein d is the dimension of the original vector, k is the dimension of the target vector, and d is greater than k; if a value of the target vector is not smaller than zero, 1 is assigned to the corresponding position of the sample signature; if the value is smaller than zero, 0 is assigned to the corresponding position of the sample signature; the projection matrix is randomly generated from the Gaussian distribution N(0, 1/k); wherein generating a neighbor candidate group according to the sample signature comprises:
S21: constructing a binary tree of depth N and storing the sample signatures in the leaf nodes of the binary tree, wherein the path from the root node of the binary tree to a leaf node corresponds to the values of the first N dimensions of the sample signature, samples whose signatures share the same first N bits are stored in the same leaf node, and N is smaller than the length M of the sample signature;
S22: when the number of sample signatures in a leaf node is larger than a first preset value T, dividing the sample signatures into two different leaf nodes under that leaf node according to the value of their (N+1)-th bit;
S23: pruning the tree by pruning the leaf nodes of the N-th layer;
S24: repeating steps S22 and S23 until no leaf node remains to be split.
4. The high-dimensional neighbor pair search system of claim 3, wherein the processing module, when taking a sample pair whose distance meets the preset requirement as a neighbor search result, is configured for:
sorting the calculated distances between sample pairs within each neighbor candidate group separately, and obtaining the K sample pairs with the smallest distances in each group;
and sorting the K sample pairs obtained from the different neighbor candidate groups together, and taking the K sample pairs with the smallest distances as the neighbor search result.
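The module structure of claims 3 and 4 can be pictured as three pluggable stages wired in sequence. The sketch below is only one possible arrangement. The class name `NeighborPairSearchSystem`, its field names, and the type aliases are hypothetical, and each stage is supplied by the caller as a plain callable.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Signature = Tuple[int, ...]
Pair = Tuple[float, int, int]  # (distance, index of sample i, index of sample j)


@dataclass
class NeighborPairSearchSystem:
    # First generation module: sample vector -> binary sample signature.
    first_generation: Callable[[Vector], Signature]
    # Second generation module: signatures -> neighbor candidate groups,
    # each group being a list of sample indices.
    second_generation: Callable[[List[Signature]], List[List[int]]]
    # Processing module: samples and groups -> neighbor pairs.
    processing: Callable[[Sequence[Vector], List[List[int]]], List[Pair]]

    def search(self, samples: Sequence[Vector]) -> List[Pair]:
        """Run the three modules in the order given by the claim."""
        signatures = [self.first_generation(v) for v in samples]
        groups = self.second_generation(signatures)
        return self.processing(samples, groups)
```

Keeping the stages as separate callables mirrors the claim language, in which the units may be integrated into one processing module or exist separately.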
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810179962.6A CN110309139B (en) | 2018-03-05 | 2018-03-05 | High-dimensional neighbor pair searching method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110309139A (en) | 2019-10-08 |
CN110309139B (en) | 2024-02-13 |
Family
ID=68073598
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810179962.6A Active CN110309139B (en) | 2018-03-05 | 2018-03-05 | High-dimensional neighbor pair searching method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110309139B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112308122B (en) * | 2020-10-20 | 2024-03-01 | 中国刑事警察学院 | High-dimensional vector space sample rapid searching method and device based on double trees |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1477563A (en) * | 2003-07-03 | 2004-02-25 | 复旦大学 | High-dimensional vector data quick similar search method |
CN101334786A (en) * | 2008-08-01 | 2008-12-31 | 浙江大学 | Formulae neighborhood based data dimensionality reduction method |
CN101556601A (en) * | 2009-03-12 | 2009-10-14 | 华为技术有限公司 | Method and device for searching k neighbor |
CN103377237A (en) * | 2012-04-27 | 2013-10-30 | 常州市图佳网络科技有限公司 | High dimensional data neighbor search method and fast approximate image search method |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8260583B2 (en) * | 2009-03-12 | 2012-09-04 | Siemens Product Lifecycle Management Software Inc. | System and method for identifying wall faces in an object model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||