CN117828114A - K neighbor graph construction method, system, equipment and medium - Google Patents

Info

Publication number
CN117828114A
Authority
CN
China
Legal status
Pending
Application number
CN202410013729.6A
Other languages
Chinese (zh)
Inventor
刘英帆
杨硕
彭延国
夏小芳
崔江涛
董龙翔
宋超伟
Current Assignee
Xidian University
Original Assignee
Xidian University
Application filed by Xidian University
Priority to CN202410013729.6A
Publication of CN117828114A

Abstract

The invention discloses a method, a system, a device and a medium for constructing a K-nearest neighbor graph. The original data are quantized, which reduces the number of bits required to represent each value. While nodes are collected into the reverse neighbor lists rnn_new and rnn_old, unnecessary nodes are not stored. New flag states (new' and old') mark the nodes that the existing NN-Descent method stores in nn_new and nn_old, so no additional data structure is needed to hold those nodes during construction of the K-nearest neighbor graph, which reduces memory consumption. The reverse neighbor lists rnn_new and rnn_old are processed in blocks, which reduces the number of CPU cache misses and speeds up data access. The system, the device and the medium are used to implement the method. While preserving a high recall rate of the K-nearest neighbor graph, the invention improves the efficiency of constructing the graph and reduces the memory required to construct it.

Description

K neighbor graph construction method, system, equipment and medium
Technical Field
The present invention relates to the field of data mining and machine learning technologies, and in particular, to a method, a system, an apparatus, and a medium for constructing a K-nearest neighbor graph.
Background
In recent years, K-nearest neighbor graph construction has been widely applied in fields such as data mining and machine learning. The construction process can be summarized as follows: given a data set D, find for each node u in D its K nearest nodes in D \ {u}. The similarity between data points is measured by a metric such as the Euclidean distance. K-nearest neighbor graphs are commonly used in K-nearest neighbor query, dimension reduction, clustering, outlier detection and other techniques. K-nearest neighbor query is widely used in information retrieval and recommendation systems; searching for products by image on a shopping platform is one application of the technique. Among current K-nearest neighbor query techniques, DPG (Diversified Proximity Graph) and NSG (Navigating Spreading-out Graph), two recent graph-index query techniques, both require a K-nearest neighbor graph to be constructed first. With the development of society, the amount of data generated in production and daily life is growing rapidly and the data tend to be high-dimensional. To inspect such data more intuitively, data in a high-dimensional space usually need to be mapped to a low-dimensional space so that they can be visualized, which requires dimension reduction. However, merely observing the data is often not enough; the information contained in the data must also be discovered. Clustering is an important technique for finding regularities in data and exploring their internal relationships. For example, merchants can segment users by purchasing habits and formulate different marketing strategies for different groups. As with the two K-nearest neighbor query methods above, a K-nearest neighbor graph is also a prerequisite for some dimension reduction and clustering methods.
In addition, the K-nearest neighbor graph is applied to distance-based outlier detection. The K-nearest neighbor graph therefore has wide application in data mining and machine learning, and improving the performance of its construction is of significant practical value.
The existing method for constructing an approximate K-nearest neighbor graph is NN-Descent (Dong W, Moses C, Li K. Efficient K-nearest neighbor graph construction for generic similarity measures [C] // Proceedings of the 20th International Conference on World Wide Web. 2011: 577-586.). It starts from a randomly generated K-nearest neighbor graph and refines it through join and update operations to obtain a K-nearest neighbor graph with a high recall rate. However, the NN-Descent method requires a large amount of memory and constructs the graph slowly.
Disclosure of Invention
In order to overcome the disadvantages of the prior art, the invention aims to provide a method, a system, a device and a medium for constructing a K-nearest neighbor graph. The original data are quantized, which reduces the number of bits required to represent each value. While nodes are collected into the reverse neighbor lists rnn_new and rnn_old, unnecessary nodes are not stored. Compared with the prior-art NN-Descent method, the neighbor lists nn_new and nn_old are no longer used; instead, new flag states (new' and old') mark the nodes that were originally stored in nn_new and nn_old, so no additional data structure is needed to hold those nodes during construction of the K-nearest neighbor graph, which reduces memory consumption. The reverse neighbor lists rnn_new and rnn_old are processed in blocks, which reduces the number of CPU cache misses and speeds up data access. While preserving a high recall rate of the K-nearest neighbor graph, the invention improves the efficiency of constructing the graph and reduces the memory required to construct it.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
the construction method of the K neighbor graph comprises the following steps:
Step 1: read the data set D from the user-specified storage location on the hard disk into memory, and quantize the data set D;
Step 2: initialize the neighbor candidate set pool and the reverse neighbor list rnn_new of each node u in the quantized data set D obtained in step 1;
Step 3, join operation: for each node u, compute the distances between distinct nodes drawn from the neighbor candidate set pool of u and from its reverse neighbor lists rnn_new and rnn_old, and update the neighbor candidate set pools of the corresponding nodes according to the results;
Step 4: determine whether the user-specified recall rate or the user-specified number of join operations has been reached; if so, end the iteration and select the first K nodes u_K from the neighbor candidate set pool of each node u in the last iteration to obtain the K-nearest neighbor graph of the data set D; if not, proceed to the next step. The parameter K controls the out-degree of the K-nearest neighbor graph, i.e. the number of neighbors to be found for each node;
Step 5, update operation: update the flag of each node in the neighbor candidate set pool of each node u, together with the reverse neighbor lists rnn_new and rnn_old; after every node u in the data set D has been traversed, return to step 3.
The specific process of the step 1 is as follows:
Read the data set D from the user-specified storage location on the hard disk into memory. The data set D contains n nodes u in a d-dimensional space, and each node u is represented by a d-dimensional vector. The n vectors of D are read in turn, and the maximum value max and the minimum value min among all n × d values of the data set are recorded. The n vectors are then read a second time and every value is quantized to obtain a new value; the new values form the quantized data set D, which is kept in memory. The quantization formula maps each original value v to a new value nv, where max is the maximum and min is the minimum of the n × d values in the data set D.
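The quantization formula itself is not reproduced in this text. The two-pass procedure can nevertheless be illustrated with a minimal sketch, assuming linear min-max scaling onto 16-bit unsigned integers. The 16-bit target, the rounding rule and the function name `quantize_dataset` are assumptions made for illustration and are not taken from the patent.

```python
from array import array


def quantize_dataset(vectors, levels=65535):
    """Min-max quantize equal-length float vectors to unsigned integer codes.

    Assumption: nv = round((v - min) / (max - min) * levels). The patent's own
    formula is not available here, so this is only one plausible choice.
    """
    # First pass: global minimum and maximum over all n x d values.
    lo = min(min(v) for v in vectors)
    hi = max(max(v) for v in vectors)
    span = hi - lo

    # Second pass: map every value onto the integer range [0, levels].
    quantized = []
    for v in vectors:
        if span == 0:
            codes = [0] * len(v)  # constant data set: every code is 0
        else:
            codes = [round((x - lo) / span * levels) for x in v]
        quantized.append(array("H", codes))  # "H" is an unsigned short
    return quantized, lo, hi


if __name__ == "__main__":
    q, lo, hi = quantize_dataset([[0.0, 1.0], [0.5, 0.25]])
    print(lo, hi, [list(row) for row in q])
```

Keeping `lo` and `hi` alongside the codes allows approximate original values to be recovered later if needed.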
The specific process of the step 2 is as follows:
Randomly select S nodes u_s from the n nodes of the quantized data set D, each selected node u_s being different from the current node u_i. Compute the distance between the current node u_i and each of the S nodes u_s, and sort the S nodes u_s from nearest to farthest from u_i to obtain the initialized neighbor candidate set pool; the flag of every node in the initialized neighbor candidate set pool is new;
Randomly select 2S nodes u_s from the n nodes of the quantized data set D and fill these 2S nodes directly into the reverse neighbor list rnn_new of the current node u_i to obtain the initialized reverse neighbor list rnn_new.
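The initialization of one node can be sketched as follows. This is an illustration under stated assumptions: pool entries are modeled as `[distance, node_id, flag]` lists, the metric is the Euclidean distance, the 2S reverse neighbors are drawn without repetition and exclude the current node, and the names `init_node` and `rng` are hypothetical.

```python
import math
import random


def init_node(i, data, S, rng):
    """Build the initial pool and rnn_new for node i.

    Pool entries are [distance, node_id, flag], sorted nearest first.
    """
    others = [j for j in range(len(data)) if j != i]

    # S random candidates, each different from the current node i.
    sample = rng.sample(others, min(S, len(others)))
    pool = sorted([math.dist(data[i], data[j]), j, "new"] for j in sample)

    # 2S random nodes fill the reverse neighbor list directly (no distances).
    rnn_new = rng.sample(others, min(2 * S, len(others)))
    return pool, rnn_new


if __name__ == "__main__":
    rng = random.Random(0)
    data = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
    pool, rnn_new = init_node(2, data, S=2, rng=rng)
    print(pool, rnn_new)
```

Passing an explicit `random.Random` instance keeps the sketch reproducible when a seed is fixed.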
The specific process of the step 3 is as follows:
If the flag of a node u_i in the neighbor candidate set pool of node u is new or old, skip u_i directly;
If the flag of a node u_i in the neighbor candidate set pool of node u is new', compute the distance between u_i and each node u_j, where u_j is either a node ranked after u_i in the pool whose flag is new' or old', or a node in the reverse neighbor lists rnn_new and rnn_old. According to the distance between u_i and u_j, use u_i to update the neighbor candidate set pool of u_j and use u_j to update the neighbor candidate set pool of u_i. Afterwards, reset the flag of u_i in the neighbor candidate set pool of node u to old;
If the flag of a node u_i in the neighbor candidate set pool of node u is old', compute the distance between u_i and each node u_j, where u_j is either a node ranked after u_i in the pool whose flag is new', or a node in the reverse neighbor list rnn_new. According to the distance between u_i and u_j, use u_i to update the neighbor candidate set pool of u_j and use u_j to update the neighbor candidate set pool of u_i. Afterwards, reset the flag of u_i in the neighbor candidate set pool of node u to old;
At the same time, distances are computed between distinct nodes in the reverse neighbor lists rnn_new and rnn_old of node u. Before the computation, rnn_new and rnn_old are partitioned into sub-blocks; the numbers of sub-blocks are ceil(num(rnn_new)/NIOB) and ceil(num(rnn_old)/NIOB) respectively, where num(rnn_new) and num(rnn_old) are the numbers of nodes in the reverse neighbor lists rnn_new and rnn_old of node u, and NIOB is the number of nodes stored in each sub-block. The sub-blocks of rnn_new are denoted N_1, N_2, ..., N_i and the sub-blocks of rnn_old are denoted O_1, O_2, ..., O_j. The number of nodes NIOB stored in each sub-block is calculated as follows:
NIOB=L2CacheSize/2/(dim*2+sizeof(Neighbor))/2
where L2CacheSize is the size of the L2 cache of the CPU, dim is the dimension of a node vector, and sizeof(Neighbor) is the space occupied by each Neighbor in the neighbor candidate set pool; a Neighbor holds the id of a node, the distance from the current node u to that node, and the flag of that node;
Traverse every pair of sub-blocks of the reverse neighbor lists rnn_new and rnn_old, and compute the distance between all nodes u_i and u_j drawn from the two sub-blocks. According to the distance between u_i and u_j, use u_i to update the neighbor candidate set pool of u_j and use u_j to update the neighbor candidate set pool of u_i;
A pair of sub-blocks is one of the following:
(1) The same sub-block of the reverse neighbor list rnn_new (N_i and N_i): if the sub-block contains m nodes, the number of distance calculations is m(m-1)/2;
(2) Two different sub-blocks of the reverse neighbor list rnn_new (N_i1 and N_i2): if the sub-blocks contain m and n nodes respectively, the number of distance calculations is m × n;
(3) A sub-block of the reverse neighbor list rnn_new (N_i) and a sub-block of rnn_old (O_j): if the sub-blocks contain m and n nodes respectively, the number of distance calculations is m × n.
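The blocking scheme can be sketched as follows. This is an illustration under stated assumptions: the NIOB formula is evaluated with integer division, the same-sub-block case counts each unordered pair once (m(m-1)/2 distance calculations), and the helper names `nodes_per_block`, `split_blocks` and `block_pairs` are hypothetical.

```python
from itertools import combinations, product


def nodes_per_block(l2_cache_size, dim, sizeof_neighbor):
    """NIOB = L2CacheSize / 2 / (dim*2 + sizeof(Neighbor)) / 2 (integer division assumed)."""
    return l2_cache_size // 2 // (dim * 2 + sizeof_neighbor) // 2


def split_blocks(nodes, niob):
    """Cut a list into consecutive sub-blocks of at most niob nodes."""
    if niob < 1:
        raise ValueError("NIOB must be at least 1")
    return [nodes[k:k + niob] for k in range(0, len(nodes), niob)]


def block_pairs(rnn_new, rnn_old, niob):
    """Yield every (u_i, u_j) pair covered by the three sub-block cases."""
    new_blocks = split_blocks(rnn_new, niob)
    old_blocks = split_blocks(rnn_old, niob)
    for a, block in enumerate(new_blocks):
        # Case (1): pairs inside the same sub-block of rnn_new.
        yield from combinations(block, 2)
        # Case (2): pairs across two different sub-blocks of rnn_new.
        for other in new_blocks[a + 1:]:
            yield from product(block, other)
        # Case (3): a sub-block of rnn_new against a sub-block of rnn_old.
        for old in old_blocks:
            yield from product(block, old)


if __name__ == "__main__":
    # Example values only: 1 MiB L2 cache, 128 dimensions, 12-byte Neighbor.
    print(nodes_per_block(1048576, 128, 12))
    print(len(list(block_pairs([1, 2, 3, 4, 5], [6, 7, 8], niob=2))))
```

Because each pair of sub-blocks is finished before the next pair is started, the vectors touched in one inner loop are limited to two sub-blocks, which is what lets them stay resident in the cache.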
In step 3, the process of using node u_j to update the neighbor candidate set pool of node u_i according to the distance between u_i and u_j is the same as the process of using node u_i to update the neighbor candidate set pool of node u_j. The specific process of using u_j to update the neighbor candidate set pool of u_i is as follows:
First, determine whether node u_j is already in the neighbor candidate set pool of node u_i. If it is, skip u_j. If it is not, determine whether the number of nodes in the neighbor candidate set pool of u_i is smaller than L, where L is the maximum capacity of the neighbor candidate set pool;
If the number of nodes in the neighbor candidate set pool of u_i is smaller than L, insert u_j into the pool at the position that keeps the nodes in the pool ordered from nearest to farthest from u_i, and set the flag of u_j to new;
If the number of nodes in the neighbor candidate set pool of u_i is not smaller than L, compare the distance between u_i and u_j with the distance between u_i and the last-ranked node in the neighbor candidate set pool of u_i;
If the distance between u_i and the last-ranked node in its pool is greater than the distance between u_i and u_j, delete the last-ranked node from the pool, then insert u_j into the pool at the position that keeps the nodes ordered from nearest to farthest from u_i, and set the flag of u_j to new;
If the distance between u_i and the last-ranked node in its pool is not greater than the distance between u_i and u_j, skip u_j directly.
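The pool update described above can be sketched as a bounded sorted insert. This is a minimal illustration; the entry layout `[distance, node_id, flag]` and the function name `try_insert` are assumptions, and the linear membership scan is chosen for clarity rather than speed.

```python
import bisect


def try_insert(pool, candidate_id, dist, L):
    """Try to insert a candidate into a nearest-first pool of capacity L.

    Returns True if the candidate was inserted, False if it was skipped.
    """
    # Skip a candidate that is already in the pool.
    if any(entry[1] == candidate_id for entry in pool):
        return False

    if len(pool) >= L:
        # Pool is full: skip unless the candidate beats the last-ranked node.
        if pool[-1][0] <= dist:
            return False
        pool.pop()  # delete the last-ranked (farthest) node

    # Insert at the position that keeps the pool sorted by distance.
    keys = [entry[0] for entry in pool]
    pool.insert(bisect.bisect_right(keys, dist), [dist, candidate_id, "new"])
    return True


if __name__ == "__main__":
    pool = [[1.0, 7, "old"], [3.0, 9, "new"]]
    try_insert(pool, 4, 2.0, L=3)
    print(pool)
```

In the join operation the same routine would be called twice per computed distance, once for each direction.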
The specific process of the step 5 is as follows:
For each node u in the data set D of step 2, clear the reverse neighbor lists rnn_new and rnn_old of node u, and update the variable M according to the parameter S of step 2 and the flag of each node in the neighbor candidate set pool of step 3, so that at most S of the first M nodes of the neighbor candidate set pool are flagged new;
For each of the first M nodes u_M of the neighbor candidate set pool of step 3: if the flag of the current node u_M is new, reset the flag to new' and store node u into the reverse neighbor list rnn_new of u_M; if the flag of the current node u_M is old, reset the flag to old' and store node u into the reverse neighbor list rnn_old of u_M. Continue until the first M nodes u_M in the neighbor candidate set pool of node u have been traversed. After every node u in the data set D has been traversed, return to step 3.
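The flag transitions of the update operation can be sketched as follows. The text does not say exactly how M is derived, so the scan in `choose_m` (advance until S entries flagged new have been seen) is an assumption, as are the function names. The capacity-limited store described next in the text is simplified here to a plain append.

```python
def choose_m(pool, S):
    """Smallest prefix length M whose entries contain at most S 'new' flags.

    Assumption: scan from the front and stop once S 'new' entries were seen.
    """
    m = new_seen = 0
    while m < len(pool) and new_seen < S:
        if pool[m][2] == "new":
            new_seen += 1
        m += 1
    return m


def update_node(u, pool, S, rnn_new, rnn_old):
    """Reflag the first M pool entries of node u and record reverse neighbors.

    rnn_new and rnn_old map a node id to the list of nodes that point at it.
    """
    m = choose_m(pool, S)
    for entry in pool[:m]:
        _, v, flag = entry
        if flag == "new":
            entry[2] = "new'"
            rnn_new.setdefault(v, []).append(u)
        elif flag == "old":
            entry[2] = "old'"
            rnn_old.setdefault(v, []).append(u)
    return m


if __name__ == "__main__":
    pool = [[0.1, 3, "new"], [0.2, 5, "old"], [0.3, 8, "new"], [0.4, 9, "new"]]
    rnn_new, rnn_old = {}, {}
    print(update_node(0, pool, 2, rnn_new, rnn_old), pool, rnn_new, rnn_old)
```

Entries beyond the first M keep their flag and are therefore skipped by the next join operation.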
The specific process of the storing in the step 5 is as follows:
Compare the number of nodes in the reverse neighbor list rnn_new or rnn_old of the current node u_M with the parameter R. If the number of nodes is smaller than R, append node u directly to the end of the reverse neighbor list rnn_new or rnn_old of u_M. If the number of nodes is not smaller than R, obtain a random integer rand between 0 and the number of times a node has been stored into the reverse neighbor list rnn_new or rnn_old of u_M; if rand is smaller than R, replace the rand-th node of that reverse neighbor list with node u. The parameter R is the maximum number of nodes that the reverse neighbor lists rnn_new and rnn_old can each contain.
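This store behaves like reservoir sampling with capacity R. A minimal sketch follows, assuming the random integer is drawn from the half-open range [0, attempts), where attempts counts every store call including the current one. The class name `ReverseList` and its attribute names are hypothetical.

```python
import random


class ReverseList:
    """Reverse neighbor list that holds at most `capacity` (R) nodes."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.nodes = []
        self.attempts = 0  # how many times store() has been called

    def store(self, u, rng=random):
        self.attempts += 1
        if len(self.nodes) < self.capacity:
            # Fewer than R nodes: append directly.
            self.nodes.append(u)
            return
        # List is full: draw rand and replace the rand-th node if rand < R.
        rand = rng.randrange(self.attempts)
        if rand < self.capacity:
            self.nodes[rand] = u


if __name__ == "__main__":
    rl = ReverseList(3)
    rng = random.Random(1)
    for u in range(100):
        rl.store(u, rng)
    print(rl.nodes, rl.attempts)
```

Under this assumption every node offered to the list has the same chance of being retained, and the list never grows beyond R entries.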
The invention also provides a system for constructing the K-nearest neighbor graph, comprising:
a quantization module: for quantizing the data set D read into memory from the user-specified storage location on the hard disk;
an initialization module: for initializing the neighbor candidate set pool and the reverse neighbor list rnn_new of each node u in the quantized data set D;
a join module: for computing, for each node u, the distances between distinct nodes drawn from the neighbor candidate set pool of u and from its reverse neighbor lists rnn_new and rnn_old, and for updating the neighbor candidate set pools of the corresponding nodes according to the results;
a judging module: for determining whether the user-specified recall rate or the user-specified number of join operations has been reached; if so, ending the iteration and selecting the first K nodes u_K from the neighbor candidate set pool of each node u in the last iteration to obtain the K-nearest neighbor graph of the data set D; if not, entering the update module;
an update module: for updating the flag of each node in the neighbor candidate set pool of each node u, together with the reverse neighbor lists rnn_new and rnn_old, and for returning to the join module after every node u in the data set D has been traversed.
The invention also provides a construction device of the K neighbor graph, which comprises:
a memory: a computer-readable device for storing a computer program of the method for constructing the K-nearest neighbor graph;
a processor: for implementing the method for constructing the K-nearest neighbor graph when executing the computer program.
The invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method for constructing the K-nearest neighbor graph.
Compared with the prior art, the invention has the beneficial effects that:
1. The invention quantizes the original data, which reduces the number of bits required to represent each value, so that the memory needed to store the data set is half of that needed by the prior-art NN-Descent method. Because each value needs fewer bits, more data points fit in the cache; the number of CPU cache misses when loading data is therefore reduced and the K-nearest neighbor graph is constructed more efficiently.
2. While nodes are collected into the reverse neighbor lists rnn_new and rnn_old, the number of nodes in each list is checked in real time. When the number of nodes is smaller than R (the parameter R is the maximum number of nodes that rnn_new and rnn_old can each contain), the node is stored normally. When the number of nodes is not smaller than R, a generated random integer decides whether the new node replaces a node already in the reverse neighbor list rnn_new or rnn_old. This largely avoids storing unnecessary nodes and reduces the required memory.
3. Compared with the prior-art NN-Descent method, the invention no longer uses the neighbor lists nn_new and nn_old; new flag states (new' and old') mark the nodes that were originally stored in nn_new and nn_old, so no additional data structure is needed during construction of the K-nearest neighbor graph and memory consumption is reduced.
4. The invention partitions the reverse neighbor lists rnn_new and rnn_old into sub-blocks and processes each sub-block separately, which reduces the number of CPU cache misses and improves the efficiency of constructing the K-nearest neighbor graph.
In summary, the invention reduces memory consumption by quantizing the data, by not storing unnecessary nodes and by avoiding unnecessary data structures. It partitions the reverse neighbor lists rnn_new and rnn_old into blocks, which reduces the number of CPU cache misses and speeds up data access. While preserving a high recall rate of the K-nearest neighbor graph, the invention improves the efficiency of constructing the graph and reduces the memory required to construct it.
Drawings
FIG. 1 is a flow chart of the present invention.
FIG. 2 is a flow chart of Join operation for each node in the present invention.
FIG. 3 is a flowchart of the update operation of each node in the present invention.
FIG. 4 shows the time required by the present invention to construct a K-nearest neighbor graph on the Sift1M data set.
FIG. 5 shows the time required by the present invention to construct a K-nearest neighbor graph on the Gist data set.
FIG. 6 shows the memory required by the present invention to construct a K-nearest neighbor graph on the Sift1M data set.
FIG. 7 shows the memory required by the present invention to construct a K-nearest neighbor graph on the Gist data set.
Detailed Description
The technical scheme of the invention is further described in detail with reference to the accompanying drawings.
As shown in FIG. 1, a rapid K neighbor graph construction method comprises the following steps:
Step 1: read the data set D from the user-specified storage location on the hard disk into memory, and quantize the data set D;
Read the data set D from the user-specified storage location on the hard disk into memory. The data set D contains n nodes u in a d-dimensional space, and each node u is represented by a d-dimensional vector. The n vectors of D are read in turn, and the maximum value max and the minimum value min among all n × d values of the data set are recorded. The n vectors are then read a second time and every value is quantized to obtain a new value; the new values form the quantized data set D. At this point the quantized data set D is kept in memory and the original data set D no longer occupies memory space;
The quantization formula maps each original value v to a new value nv, where max is the maximum and min is the minimum of the n × d values in the data set D;
Step 2: initialize the neighbor candidate set pool and the reverse neighbor list rnn_new of each node u in the quantized data set D obtained in step 1;
Randomly select S nodes u_s from the n nodes of the quantized data set D, each selected node u_s being different from the current node u_i. Compute the distance between the current node u_i and each of the S nodes u_s, and sort the S nodes u_s from nearest to farthest from u_i to obtain the initialized neighbor candidate set pool; the flag of every node in the initialized neighbor candidate set pool is new;
Randomly select 2S nodes u_s from the n nodes of the quantized data set D and fill these 2S nodes directly into the reverse neighbor list rnn_new of the current node u_i to obtain the initialized reverse neighbor list rnn_new. The parameter S balances the construction time of the K-nearest neighbor graph against its recall rate;
Step 3, join operation: for each node u, compute the distances between distinct nodes drawn from the neighbor candidate set pool of u and from its reverse neighbor lists rnn_new and rnn_old, and update the neighbor candidate set pools of the corresponding nodes according to the results. As shown in FIG. 2, the specific process is as follows:
If the flag of a node u_i in the neighbor candidate set pool of node u is new or old, skip u_i directly;
If the flag of a node u_i in the neighbor candidate set pool of node u is new', compute the distance between u_i and each node u_j, where u_j is either a node ranked after u_i in the pool whose flag is new' or old', or a node in the reverse neighbor lists rnn_new and rnn_old. Once the distance between u_i and u_j has been computed, use u_i to update the neighbor candidate set pool of u_j and use u_j to update the neighbor candidate set pool of u_i according to that distance. Afterwards, reset the flag of u_i in the neighbor candidate set pool of node u to old;
If the flag of a node u_i in the neighbor candidate set pool of node u is old', compute the distance between u_i and each node u_j, where u_j is either a node ranked after u_i in the pool whose flag is new', or a node in the reverse neighbor list rnn_new. Once the distance between u_i and u_j has been computed, use u_i to update the neighbor candidate set pool of u_j and use u_j to update the neighbor candidate set pool of u_i according to that distance. Afterwards, reset the flag of u_i in the neighbor candidate set pool of node u to old;
At the same time, distances are computed between distinct nodes in the reverse neighbor lists rnn_new and rnn_old of node u. Before the computation, rnn_new and rnn_old are partitioned into sub-blocks; the numbers of sub-blocks are ceil(num(rnn_new)/NIOB) and ceil(num(rnn_old)/NIOB) respectively, where num(rnn_new) and num(rnn_old) are the numbers of nodes in the reverse neighbor lists rnn_new and rnn_old of node u, and NIOB is the number of nodes stored in each sub-block. The sub-blocks of rnn_new are denoted N_1, N_2, ..., N_i and the sub-blocks of rnn_old are denoted O_1, O_2, ..., O_j. The number of nodes NIOB stored in each sub-block is calculated as follows:
NIOB=L2CacheSize/2/(dim*2+sizeof(Neighbor))/2
where L2CacheSize is the size of the L2 cache of the CPU, dim is the dimension of a node vector, and sizeof(Neighbor) is the space occupied by each Neighbor in the neighbor candidate set pool; a Neighbor holds the id of a node, the distance from the current node u to that node, and the flag of that node;
Traverse every pair of sub-blocks of the reverse neighbor lists rnn_new and rnn_old, and compute the distance between all nodes u_i and u_j drawn from the two sub-blocks. Once the distance between u_i and u_j has been computed, use u_i to update the neighbor candidate set pool of u_j and use u_j to update the neighbor candidate set pool of u_i according to that distance;
A pair of sub-blocks is one of the following:
(1) The same sub-block of the reverse neighbor list rnn_new (N_i and N_i): if the sub-block contains m nodes, the number of distance calculations is m(m-1)/2;
(2) Two different sub-blocks of the reverse neighbor list rnn_new (N_i1 and N_i2): if the sub-blocks contain m and n nodes respectively, the number of distance calculations is m × n;
(3) A sub-block of the reverse neighbor list rnn_new (N_i) and a sub-block of rnn_old (O_j): if the sub-blocks contain m and n nodes respectively, the number of distance calculations is m × n;
The process of using node u_j to update the neighbor candidate set pool of node u_i according to the distance between node u_i and node u_j is the same as the process of using node u_i to update the neighbor candidate set pool of node u_j. The specific process of using node u_j to update the neighbor candidate set pool of node u_i is as follows:
first, determine whether node u_j is already in the neighbor candidate set pool of node u_i. If it is, skip node u_j. If it is not, determine whether the number of nodes in the neighbor candidate set pool of node u_i is less than L, where L is the maximum capacity of the neighbor candidate set pool;
if the number of nodes in the neighbor candidate set pool of node u_i is less than L, insert node u_j into the neighbor candidate set pool of node u_i so that the nodes in the pool remain ordered from nearest to farthest from node u_i, and set the flag of node u_j to new;
if the number of nodes in the neighbor candidate set pool of node u_i is not less than L, compare the distance between node u_i and node u_j with the distance between node u_i and the last-ranked node in the neighbor candidate set pool of node u_i;
if the distance between node u_i and the last-ranked node in its neighbor candidate set pool is greater than the distance between node u_i and node u_j, delete the last-ranked node from the neighbor candidate set pool of node u_i, then insert node u_j into the pool so that its nodes remain ordered from nearest to farthest from node u_i, and set the flag of node u_j to new;
if the distance between node u_i and the last-ranked node in its neighbor candidate set pool is not greater than the distance between node u_i and node u_j, skip node u_j directly.
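The update procedure above can be sketched as a bounded, sorted insertion. This is a minimal sketch under stated assumptions: the pool is modelled as a Python list of (distance, node id, flag) tuples kept in ascending distance order, and the function name is hypothetical.

```python
import bisect


def update_pool(pool, candidate_id, dist, capacity):
    """Try to insert candidate u_j into the pool of u_i.

    pool: list of (distance, node_id, flag) tuples sorted by distance.
    capacity: the maximum pool size L.
    Returns True if the candidate was inserted, False if it was skipped.
    """
    # Skip u_j if it is already in the pool of u_i.
    if any(node_id == candidate_id for _, node_id, _ in pool):
        return False
    if len(pool) >= capacity:
        # Pool is full: skip unless u_j is closer than the last-ranked node.
        if pool[-1][0] <= dist:
            return False
        pool.pop()  # delete the last-ranked (farthest) node
    # Insert while keeping the near-to-far order; new entries are flagged new.
    bisect.insort(pool, (dist, candidate_id, "new"))
    return True
```

The symmetric update (using u_i to update the pool of u_j) would call the same function with the roles of the two nodes exchanged.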
Step 4: determine whether the recall rate set by the user has been reached or the specified number of join operations has been executed. If so, end the iteration and select the first K nodes u_K from the neighbor candidate set pool of each node u in the last iteration to obtain the K neighbor graph of the data set D; if not, proceed to the next step. The parameter K controls the out-degree of the K neighbor graph, that is, the number of neighbors to be found for each node;
Step 5, update operation: update the flag of each node in the neighbor candidate set pool of each node u in the data set D, together with the reverse neighbor lists rnn_new and rnn_old; after traversing every node u in the data set D, return to step 3, as shown in fig. 3;
for each node u in the data set D in step 2, clear the reverse neighbor lists rnn_new and rnn_old of node u, and update the variable M according to the parameter S in step 2 and the flag state of each node in the neighbor candidate set pool in step 3, so that at most S of the first M nodes of the neighbor candidate set pool in step 3 are marked as new; this balances the efficiency and the recall rate of constructing the K neighbor graph;
for each of the first M nodes u_M of the neighbor candidate set pool in step 3: if the flag of the current node u_M is new, reset the flag to new' and store node u of the data set D into the reverse neighbor list rnn_new of the corresponding node u_M; if the flag of the current node u_M is old, reset the flag to old' and store node u of the data set D into the reverse neighbor list rnn_old of the corresponding node u_M; continue until the first M nodes u_M in the neighbor candidate set pool of node u have been traversed. After traversing every node u in the data set D, return to step 3;
the specific process of the storing is as follows:
determine the relation between the number of nodes in the reverse neighbor list rnn_new or rnn_old of the current node u_M and the parameter R. If the number of nodes is less than R, append node u directly to the end of the reverse neighbor list rnn_new or rnn_old of the current node u_M. If the number of nodes is not less than R, obtain a random integer rand between 0 and the number of times a node has been stored into the reverse neighbor list rnn_new or rnn_old of the current node u_M; if rand is less than R, replace the rand-th node of that reverse neighbor list with node u. The parameter R denotes the maximum number of nodes that the reverse neighbor lists rnn_new and rnn_old can contain.
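The storing step reads like a reservoir-sampling rule, and a minimal sketch under that reading follows. The function name, the explicit `times_stored` counter, and the inclusive bounds of the random draw are assumptions made for illustration; the text does not fix these details.

```python
import random


def store_reverse_neighbor(rnn, times_stored, u, R, rng=random):
    """Store node u into one reverse neighbor list (rnn_new or rnn_old).

    rnn: the reverse neighbor list of the current node u_M.
    times_stored: how many store attempts this list has seen so far.
    R: maximum number of nodes the list may contain.
    Returns the updated store counter.
    """
    if len(rnn) < R:
        # Below capacity: append u at the end of the list.
        rnn.append(u)
    else:
        # At capacity: draw rand in [0, times_stored] (inclusive, assumed)
        # and overwrite slot rand only when it falls inside the list.
        rand = rng.randint(0, times_stored)
        if rand < R:
            rnn[rand] = u
    return times_stored + 1
```

Because a replacement becomes less likely as the counter grows, the list stays bounded by R without storing every reverse neighbor that is offered.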
A system for constructing a K neighbor graph, comprising:
a quantization module: for quantizing the data set D read into memory from a user-specified hard disk storage location;
an initialization module: for initializing the neighbor candidate set pool and the reverse neighbor list rnn_new of each node u in the quantized data set D;
a join module: for calculating the distances between different nodes of the neighbor candidate set pool and the reverse neighbor lists rnn_new and rnn_old of each node u, and updating, according to the calculation results, the neighbor candidate set pools of the corresponding nodes in the neighbor candidate set pool and the reverse neighbor lists rnn_new and rnn_old of node u;
a judging module: for determining whether the recall rate set by the user has been reached or the specified number of join operations has been executed; if so, ending the iteration and selecting the first K nodes u_K from the neighbor candidate set pool of each node u in the last iteration to obtain the K neighbor graph of the data set D; if not, entering the update module;
an update module: for updating the flag of each node in the neighbor candidate set pool of each node u and the reverse neighbor lists rnn_new and rnn_old, and returning to the join module after traversing every node u in the data set D.
A device for constructing a K neighbor graph, comprising:
a memory: a computer-readable device for storing a computer program of the method for constructing the K neighbor graph;
a processor: for implementing the method for constructing the K neighbor graph when executing the computer program.
A computer-readable storage medium storing a computer program which, when executed by a processor, implements the method for constructing a K neighbor graph described above.
The data sets D used in this embodiment are the Sift1M data set and the Gist data set, which are commonly used real-world high-dimensional data sets. Sift1M contains one million 128-dimensional vectors and Gist contains one million 960-dimensional vectors; the parameter K is 20. K neighbor graphs constructed by the existing method NN-Descent serve as the control group to further verify the beneficial effects of the construction method of the present invention. In figs. 4 and 5, the horizontal axis represents time and the vertical axis represents the recall rate of the constructed K neighbor graph. In figs. 6 and 7, the horizontal axis represents the number of iterations and the vertical axis represents the memory required for constructing the K neighbor graph.
FIG. 4 shows the time required to construct a K neighbor graph on the Sift1M data set using the existing method NN-Descent and the method of the present invention, respectively. As can be seen from FIG. 4, when constructing a K neighbor graph with the same recall rate, the method of the present invention reduces the construction time on the Sift1M data set by 18% compared with NN-Descent.
FIG. 5 shows the time required to construct a K neighbor graph on the Gist data set using the existing method NN-Descent and the method of the present invention, respectively. As can be seen from FIG. 5, when constructing a K neighbor graph with the same recall rate, the method of the present invention reduces the construction time on the Gist data set by 31% compared with NN-Descent.
FIG. 6 shows the memory required to construct a K neighbor graph on the Sift1M data set using the existing method NN-Descent and the method of the present invention, respectively. As can be seen from FIG. 6, when constructing a K neighbor graph with the same recall rate, the method of the present invention reduces the memory overhead on the Sift1M data set by 29% compared with NN-Descent.
FIG. 7 shows the memory required to construct a K neighbor graph on the Gist data set using the existing method NN-Descent and the method of the present invention, respectively. As can be seen from FIG. 7, when constructing a K neighbor graph with the same recall rate, the method of the present invention reduces the memory overhead on the Gist data set by 42% compared with NN-Descent.
In summary, compared with the existing method NN-Descent, when constructing K neighbor graphs with the same recall rate, the method of the present invention reduces the time cost by 18% and 31% on the Sift1M and Gist data sets respectively, and reduces the memory overhead by 29% and 42% on the Sift1M and Gist data sets respectively; it therefore requires less memory during construction and builds the K neighbor graph faster.

Claims (10)

1. A method for constructing a K neighbor graph, characterized by comprising the following steps:
step 1: reading the data set D from the hard disk storage location specified by the user into memory, and quantizing the data set D;
step 2: initializing the neighbor candidate set pool and the reverse neighbor list rnn_new of each node u in the quantized data set D obtained in step 1;
step 3, join operation: calculating the distances between different nodes of the neighbor candidate set pool and the reverse neighbor lists rnn_new and rnn_old of each node u, and updating, according to the calculation results, the neighbor candidate set pools of the corresponding nodes in the neighbor candidate set pool and the reverse neighbor lists rnn_new and rnn_old of node u;
step 4: determining whether the recall rate set by the user has been reached or the specified number of join operations has been executed; if so, ending the iteration and selecting the first K nodes u_K from the neighbor candidate set pool of each node u in the last iteration to obtain the K neighbor graph of the data set D; if not, proceeding to the next step; the parameter K controls the out-degree of the K neighbor graph, that is, the number of neighbors to be found for each node;
step 5, update operation: updating the flag of each node in the neighbor candidate set pool of each node u and the reverse neighbor lists rnn_new and rnn_old, and returning to step 3 after traversing every node u in the data set D.
2. The method for constructing a K neighbor graph according to claim 1, wherein the specific process of step 1 is as follows:
reading the data set D from the hard disk storage location specified by the user into memory, the data set D comprising n nodes u of D-dimensional space; in the data set D, each node u is represented by a vector of dimension D; according to the number n of nodes u, the vectors in the data set D are read one at a time, n reads in total, and the maximum value max and the minimum value min among the n × D values of the whole data set D are recorded; the vectors in the data set D are then read again, n reads in total, and each value is quantized to obtain a new value; the new values form the quantized data set D, which is stored in memory; the quantization formula is as follows:
wherein v represents an original value and nv represents a new value; max represents the maximum of the n × D values in the data set D, and min represents the minimum of the n × D values in the data set D.
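The two-pass structure of step 1 can be sketched as follows. The quantization formula itself is not reproduced in this text, so the sketch assumes a linear min-max mapping of each value v onto integer codes in [0, levels]; that mapping, the `levels` parameter, and the function name are illustrative assumptions, not the claimed formula.

```python
def quantize_dataset(vectors, levels=255):
    """Two-pass min-max quantization sketch (assumed linear mapping).

    vectors: list of equal-length lists of floats, one per node u.
    levels: largest integer code; a hypothetical parameter.
    Returns (quantized_vectors, min_value, max_value).
    """
    # First pass: record the global minimum and maximum over all values.
    lo = float("inf")
    hi = float("-inf")
    for vec in vectors:
        for v in vec:
            lo = min(lo, v)
            hi = max(hi, v)
    span = hi - lo
    # Second pass: map every original value v to a new value nv.
    quantized = []
    for vec in vectors:
        if span == 0:
            quantized.append([0 for _ in vec])  # constant data set
        else:
            quantized.append([round((v - lo) / span * levels) for v in vec])
    return quantized, lo, hi
```

Storing small integer codes instead of floating-point values is what reduces the number of bits per value; the precision lost depends on the number of levels chosen.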
3. The method for constructing a K neighbor graph according to claim 1, wherein the specific process of step 2 is as follows:
randomly selecting S nodes u_s from the n nodes u of the quantized data set D, each selected node u_s being different from the current node u_i; calculating the distance between the current node u_i and each of the S nodes u_s, and sorting the nodes u_s from nearest to farthest to obtain the initialized neighbor candidate set pool, in which the flag of every node is new;
randomly selecting 2S nodes u_s from the n nodes u of the quantized data set D and filling the 2S nodes u_s directly into the reverse neighbor list rnn_new of the current node u_i to obtain the initialized reverse neighbor list rnn_new.
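The initialization in step 2 can be sketched for a single node as below. The squared Euclidean distance, the function names, and the choice to draw the 2S reverse neighbors without repetition are assumptions for illustration; the claim only states that the nodes are chosen at random.

```python
import random


def squared_l2(a, b):
    """Squared Euclidean distance; the metric is an assumption here."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_node(i, data, S, rng=random):
    """Initialize the pool and rnn_new of node u_i.

    data: list of (quantized) vectors; requires len(data) > 2 * S.
    Returns (pool, rnn_new), where pool holds (distance, node_id, flag)
    tuples sorted from nearest to farthest.
    """
    # Pick S random nodes different from u_i and sort them by distance.
    others = [j for j in range(len(data)) if j != i]
    picks = rng.sample(others, S)
    pool = sorted((squared_l2(data[i], data[j]), j, "new") for j in picks)
    # Fill rnn_new with 2S random nodes, with no distance calculation.
    rnn_new = rng.sample(range(len(data)), 2 * S)
    return pool, rnn_new
```

Only the pool needs distances at this stage; rnn_new is filled directly, which keeps initialization cheap.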
4. The method for constructing a K neighbor graph according to claim 1, wherein the specific process of step 3 is as follows:
if the flag of a node u_i in the neighbor candidate set pool of node u is new or old, skipping the current node u_i directly;
if the flag of a node u_i in the neighbor candidate set pool of node u is new', calculating the distance between node u_i and each node u_j, where the nodes u_j are the nodes ranked after node u_i whose flag is new' or old' and the nodes in the reverse neighbor lists rnn_new and rnn_old; according to the distance between node u_i and node u_j, using node u_i to update the neighbor candidate set pool of node u_j and using node u_j to update the neighbor candidate set pool of node u_i; then resetting the flag of node u_i in the neighbor candidate set pool of node u to old;
if the flag of a node u_i in the neighbor candidate set pool of node u is old', calculating the distance between node u_i and each node u_j, where the nodes u_j are the nodes ranked after node u_i whose flag is new' and the nodes in the reverse neighbor list rnn_new; according to the distance between node u_i and node u_j, using node u_i to update the neighbor candidate set pool of node u_j and using node u_j to update the neighbor candidate set pool of node u_i; then resetting the flag of node u_i in the neighbor candidate set pool of node u to old;
meanwhile, distances are calculated between different nodes in the reverse neighbor lists rnn_new and rnn_old of node u; before the calculation, the reverse neighbor lists rnn_new and rnn_old of node u are partitioned into sub-blocks, the numbers of sub-blocks being $\lceil \mathrm{num}(rnn\_new)/\mathrm{NIOB} \rceil$ and $\lceil \mathrm{num}(rnn\_old)/\mathrm{NIOB} \rceil$ respectively, where num(rnn_new) and num(rnn_old) denote the numbers of nodes in the reverse neighbor lists rnn_new and rnn_old of node u, and NIOB denotes the number of nodes stored in each sub-block; the sub-blocks of the reverse neighbor list rnn_new are denoted N_1, N_2, ..., N_i, and the sub-blocks of the reverse neighbor list rnn_old are denoted O_1, O_2, ..., O_j; the number of nodes NIOB stored in each sub-block is calculated as follows:
NIOB=L2CacheSize/2/(dim*2+sizeof(Neighbor))/2
wherein L2CacheSize represents the size of the L2 cache in the CPU, dim represents the dimension of the node vectors, and sizeof(Neighbor) represents the space occupied by each Neighbor in the neighbor candidate set pool; the Neighbor contains the id of the current node, the distance from the current node u to a node different from the current node u, and the flag of the current node u;
traversing every pair of sub-blocks among all sub-blocks of the reverse neighbor lists rnn_new and rnn_old, and calculating the distance between each node u_i and each node u_j of the two sub-blocks; according to the distance between node u_i and node u_j, using node u_i to update the neighbor candidate set pool of node u_j and using node u_j to update the neighbor candidate set pool of node u_i;
the two sub-blocks fall into the following cases:
(1) The same sub-block of the reverse neighbor list rnn_new (N_i and N_i): if the sub-block contains m nodes, the number of distance calculations is $m(m-1)/2$;
(2) Different sub-blocks of the reverse neighbor list rnn_new (N_i1 and N_i2): if the two sub-blocks contain m and n nodes respectively, the number of distance calculations is m × n;
(3) A sub-block of the reverse neighbor list rnn_new (N_i) and a sub-block of rnn_old (O_j): if sub-block N_i and sub-block O_j contain m and n nodes respectively, the number of distance calculations is m × n.
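The block size and the partitioning in this claim can be sketched as follows. The NIOB expression is evaluated left to right exactly as written; truncating the result to an integer with a floor of 1, the function names, and the example cache and struct sizes are assumptions for illustration.

```python
def nodes_per_block(l2_cache_size, dim, sizeof_neighbor):
    """NIOB = L2CacheSize / 2 / (dim * 2 + sizeof(Neighbor)) / 2.

    All sizes are in bytes. Truncation to an integer (minimum 1) is an
    assumption; the formula itself does not state the rounding.
    """
    niob = l2_cache_size / 2 / (dim * 2 + sizeof_neighbor) / 2
    return max(1, int(niob))


def split_into_blocks(rnn, niob):
    """Partition one reverse neighbor list into sub-blocks of NIOB nodes.

    The last sub-block may be shorter, so the number of sub-blocks is the
    ceiling of len(rnn) / niob.
    """
    return [rnn[k:k + niob] for k in range(0, len(rnn), niob)]
```

For example, with a hypothetical 1 MiB L2 cache, 128-dimensional vectors, and a 12-byte Neighbor, `nodes_per_block(1048576, 128, 12)` gives 978 nodes per sub-block, so the two sub-blocks being joined are sized to fit in part of the cache together.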
5. The method for constructing a K neighbor graph according to claim 4, wherein in step 3, the process of using node u_j to update the neighbor candidate set pool of node u_i according to the distance between node u_i and node u_j is the same as the process of using node u_i to update the neighbor candidate set pool of node u_j; the specific process of using node u_j to update the neighbor candidate set pool of node u_i is as follows:
first, determining whether node u_j is already in the neighbor candidate set pool of node u_i; if it is, skipping node u_j; if it is not, determining whether the number of nodes in the neighbor candidate set pool of node u_i is less than L, where L is the maximum capacity of the neighbor candidate set pool;
if the number of nodes in the neighbor candidate set pool of node u_i is less than L, inserting node u_j into the neighbor candidate set pool of node u_i so that the nodes in the pool remain ordered from nearest to farthest from node u_i, and setting the flag of node u_j to new;
if the number of nodes in the neighbor candidate set pool of node u_i is not less than L, comparing the distance between node u_i and node u_j with the distance between node u_i and the last-ranked node in the neighbor candidate set pool of node u_i;
if the distance between node u_i and the last-ranked node in its neighbor candidate set pool is greater than the distance between node u_i and node u_j, deleting the last-ranked node from the neighbor candidate set pool of node u_i, then inserting node u_j into the pool so that its nodes remain ordered from nearest to farthest from node u_i, and setting the flag of node u_j to new;
if the distance between node u_i and the last-ranked node in its neighbor candidate set pool is not greater than the distance between node u_i and node u_j, skipping node u_j directly.
6. The method for constructing a K neighbor graph according to claim 1, wherein the specific process of step 5 is as follows:
for each node u in the data set D in step 2, clearing the reverse neighbor lists rnn_new and rnn_old of node u, and updating the variable M according to the parameter S in step 2 and the flag state of each node in the neighbor candidate set pool in step 3, so that at most S of the first M nodes of the neighbor candidate set pool in step 3 are marked as new;
for each of the first M nodes u_M of the neighbor candidate set pool in step 3: if the flag of the current node u_M is new, resetting the flag to new' and storing node u of the data set D into the reverse neighbor list rnn_new of the corresponding node u_M; if the flag of the current node u_M is old, resetting the flag to old' and storing node u of the data set D into the reverse neighbor list rnn_old of the corresponding node u_M; continuing until the first M nodes u_M in the neighbor candidate set pool of node u have been traversed; after traversing every node u in the data set D, returning to step 3.
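The update operation of this claim can be sketched in two small functions. The claim only requires that the first M pool entries contain at most S nodes flagged new; choosing M as the shortest prefix that reaches the S-th new entry is an assumption, as are the function names, the [node id, flag] pool entries, and the dictionary-based reverse lists. The capacity limit R on the reverse lists is left out here to keep the sketch short.

```python
def choose_m(flags, S):
    """Pick a prefix length M with at most S entries flagged new.

    Assumed rule: stop at the S-th new entry, or take the whole pool when
    it holds fewer than S new entries.
    """
    if S <= 0:
        return 0
    new_seen = 0
    for idx, flag in enumerate(flags):
        if flag == "new":
            new_seen += 1
            if new_seen == S:
                return idx + 1
    return len(flags)


def promote_and_collect(u, pool, M, rnn_new, rnn_old):
    """Reset flags in the first M entries of u's pool and record u as a
    reverse neighbor of each of those entries.

    pool: list of [node_id, flag] entries (mutable, hypothetical layout).
    rnn_new / rnn_old: dicts mapping node_id to its reverse neighbor list.
    """
    for entry in pool[:M]:
        node_id, flag = entry
        if flag == "new":
            entry[1] = "new'"
            rnn_new.setdefault(node_id, []).append(u)
        elif flag == "old":
            entry[1] = "old'"
            rnn_old.setdefault(node_id, []).append(u)
```

Entries beyond position M keep their flags, so the join operation skips them until a later iteration brings them into the first M.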
7. The method for constructing a K neighbor graph according to claim 6, wherein the specific process of the storing in step 5 is as follows:
determining the relation between the number of nodes in the reverse neighbor list rnn_new or rnn_old of the current node u_M and the parameter R; if the number of nodes is less than R, appending node u directly to the end of the reverse neighbor list rnn_new or rnn_old of the current node u_M; if the number of nodes is not less than R, obtaining a random integer rand between 0 and the number of times a node has been stored into the reverse neighbor list rnn_new or rnn_old of the current node u_M; if rand is less than R, replacing the rand-th node of that reverse neighbor list with node u; the parameter R denotes the maximum number of nodes that the reverse neighbor lists rnn_new and rnn_old can contain.
8. A system for constructing a K neighbor graph, comprising:
a quantization module: for quantizing the data set D read into memory from a user-specified hard disk storage location;
an initialization module: for initializing the neighbor candidate set pool and the reverse neighbor list rnn_new of each node u in the quantized data set D;
a join module: for calculating the distances between different nodes of the neighbor candidate set pool and the reverse neighbor lists rnn_new and rnn_old of each node u, and updating, according to the calculation results, the neighbor candidate set pools of the corresponding nodes in the neighbor candidate set pool and the reverse neighbor lists rnn_new and rnn_old of node u;
a judging module: for determining whether the recall rate set by the user has been reached or the specified number of join operations has been executed; if so, ending the iteration and selecting the first K nodes u_K from the neighbor candidate set pool of each node u in the last iteration to obtain the K neighbor graph of the data set D; if not, entering the update module;
an update module: for updating the flag of each node in the neighbor candidate set pool of each node u and the reverse neighbor lists rnn_new and rnn_old, and returning to the join module after traversing every node u in the data set D.
9. A device for constructing a K neighbor graph, comprising:
a memory: a computer-readable device for storing a computer program of the method for constructing a K neighbor graph according to any one of claims 1 to 7;
a processor: for implementing the method for constructing a K neighbor graph according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, characterized in that:
the computer-readable storage medium stores a computer program which, when executed by a processor, implements the method for constructing a K neighbor graph according to any one of claims 1 to 7.
CN202410013729.6A 2024-01-04 2024-01-04 K neighbor graph construction method, system, equipment and medium Pending CN117828114A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410013729.6A CN117828114A (en) 2024-01-04 2024-01-04 K neighbor graph construction method, system, equipment and medium

Publications (1)

Publication Number Publication Date
CN117828114A true CN117828114A (en) 2024-04-05

Family

ID=90511228

Country Status (1)

Country Link
CN (1) CN117828114A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination