CN114329094A

CN114329094A - Spark-based large-scale high-dimensional data approximate neighbor query system and method

Info

Publication number: CN114329094A
Application number: CN202111672312.3A
Authority: CN
Inventors: 徐姚亨; 姚斌; 张鹏程; 唐飞龙; 沈耀; 郑文立
Original assignee: Shanghai Jiaotong University
Current assignee: Shanghai Jiaotong University
Priority date: 2021-12-31
Filing date: 2021-12-31
Publication date: 2022-04-12
Anticipated expiration: 2041-12-31

Abstract

The invention provides a Spark-based large-scale high-dimensional data approximate neighbor query system and method. First, the vectors are clustered into partitions according to their similarity, and each cluster partition corresponds to one partition of a Spark resilient distributed dataset (RDD). The data of each partition is sampled proportionally and labeled with its partition. A global index is built on the master node from the sampled data, and a partition index is built on each corresponding partition. At query time, the partitions that need to be queried are found through the global index, and the results from those partitions are merged and sorted to obtain the final result. The technical solution provides a highly scalable distributed approximate neighbor query scheme on the Spark system while achieving low latency and high throughput.

Description

Spark-based large-scale high-dimensional data approximate neighbor query system and method

Technical Field

The invention relates to the technical field of computer data management, in particular to a large-scale high-dimensional data rapid retrieval method and a large-scale high-dimensional data rapid retrieval system.

Background

Neighbor search is an important operation in many applications, such as image retrieval, recommendation systems, and data mining. With the rapid development of artificial intelligence, machine learning algorithms have made major breakthroughs in computer vision, speech recognition, natural language processing, and other fields. Large amounts of unstructured data (pictures, speech, and text) can be converted into vectors, a more efficient data representation, which has produced massive volumes of vector data. To process such data efficiently, a vector neighbor search algorithm must offer scalability, high throughput, and low latency. For example, in Taobao's recommendation service, the number of pictures in the database exceeds the billion level. On the one hand, single-machine algorithms are not applicable to data at this scale, so retrieval in a distributed environment must be considered to make the system scalable. On the other hand, the retrieval service must respond within a few milliseconds to meet users' real-time requirements. Exact neighbor search cannot meet practical requirements on efficiency and cost, whereas approximate neighbor search can improve throughput and meet those requirements by slightly relaxing the accuracy constraint.

Apache Spark is an open-source distributed processing system for big data, based on the MapReduce paradigm. Compared with Hadoop, which is disk-oriented and batch-oriented, Spark can store data and compute in distributed memory, and therefore offers low query latency and high throughput. The Spark architecture is shown in Fig. 1, where the driver node is the master node and the worker nodes are the slave nodes.

Before introducing the related art, some necessary definitions are given:

X is a data set comprising n vectors, $X = \{x_1, x_2, \dots, x_n\}$; the dimension of each vector is d; $\mathrm{dist}(a, b) = \lVert a - b \rVert_2$ denotes the Euclidean distance between any two vectors a and b; and q is a given query vector.

Definition 1, KNN (k-nearest neighbors): KNN(q) is the query result, a set V of k vectors belonging to X, i.e., KNN(q) = V, such that the distance from q to any vector of X outside V is no less than the distance from q to any vector in V.

Definition 2, AKNN (approximate k-nearest neighbors):

the result returned by AKNN is an approximate result, denoted AKNN(q); the exact result is denoted KNN(q). Recall is used as the accuracy measure, namely $\mathrm{recall} = \lvert \mathrm{AKNN}(q) \cap \mathrm{KNN}(q) \rvert / \lvert \mathrm{KNN}(q) \rvert$.
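
As an illustration of the definitions above, the following minimal Python sketch computes the Euclidean distance, an exact KNN result by brute force, and the recall of an approximate result. The function names and the toy data are assumptions made for this example and do not come from the patent.

```python
import math


def dist(a, b):
    """Euclidean distance ||a - b||_2 between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn(X, q, k):
    """Exact k nearest neighbors of q in X by brute force (indices into X)."""
    order = sorted(range(len(X)), key=lambda i: dist(X[i], q))
    return order[:k]


def recall(approx_ids, exact_ids):
    """Fraction of the exact result that the approximate result recovered."""
    exact = set(exact_ids)
    return len(exact & set(approx_ids)) / len(exact)


# Toy data (hypothetical): four 2-dimensional vectors and one query.
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
q = (0.1, 0.1)
exact = knn(X, q, 2)   # indices of the two closest vectors
approx = [0, 3]        # a hypothetical approximate answer
score = recall(approx, exact)
```

Here the exact 2-NN result consists of the first two vectors, and the hypothetical approximate answer recovers one of them, so the recall is 0.5.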

GeoSpark is an in-memory cluster computing framework for processing large-scale spatial data. It supports spatial objects of different geometries, such as points, polygons, and rectangles, and implements various query processing algorithms, including kNN. GeoSpark uses a grid structure as the global index over its partitions. However, this does not suit non-spatial vector data: a kNN query searches all partitions and then heap-sorts the results, so the computational cost is very high.

LocationSpark uses grids and quadtrees as global indexes and supports spatial query analysis such as range queries and kNN. It also provides a memory management algorithm that dynamically keeps frequently accessed data in memory and moves infrequently accessed data to disk. STARK (Hagedorn et al.) uses an R-tree for spatial indexing and lets the user choose whether to build an index. STARK also provides kNN queries and clustering operations.

The Spark-based work above has achieved good results in its own domain, but it has three shortcomings when applied to neighbor queries:

1) It is designed specifically for the characteristics of spatial data and does not consider the requirements of non-spatial high-dimensional vectors, so it cannot support the large amount of existing non-spatial high-dimensional vector data. The relationships among vectors need to be mined and a more effective query method designed.

2) A query often traverses all partitions, collects the results from each partition, and produces the final result after centralized processing, so the workload is large and the query takes a long time.

3) All queries are exact, so the accuracy is higher than actually required, while performance metrics such as throughput and latency fall short of practical requirements.

Disclosure of Invention

The technical problems solved by the invention are as follows:

1) The relationships among vectors are fully mined, and an index structure suited to the characteristics of non-spatial high-dimensional vectors is provided, so that large amounts of non-spatial high-dimensional vector data are supported and effective partition queries can be performed.

2) Before a query, the relevant partitions are screened out using information gathered about each partition; during the query, only the relevant partitions are searched, which effectively improves the query throughput of the system.

3) AKNN approximate neighbor queries are implemented in the Spark architecture. Additional speedup is obtained by sacrificing accuracy to a degree acceptable to the application, which greatly improves query throughput.

In a first aspect, an embodiment of the present invention provides a Spark-based large-scale high-dimensional data approximate neighbor query system. The system includes a vector acquisition module, an index construction module, and a query module.

The vector acquisition module is used for acquiring the vectors to be processed by the system, namely the data set to be processed, which includes vectors converted from unstructured data; each vector to be processed can be regarded as a point in the system.

The index construction module includes a clustering partitioning module, a global index construction module, and a partition index construction module.

The clustering partitioning module is used for calculating m partition centroids of the data set and dividing the data set into m different partitions, so that the vectors to be processed in each partition are homogeneous, that is, the points in each partition are close to one another; m ≥ 2, and m ∈ N.

The global index construction module is used for building a global index on the master node of the system. It includes a data sampling unit, a data marking unit, and an index establishing unit. The data sampling unit uniformly samples n vectors to be processed (points) from each partition, according to the data set and the resources of the master node, to obtain sampled data that represents the distribution of the vectors in that partition. The data marking unit marks the sampled data with the partition label of its partition. The index establishing unit builds the global index on the master node from the sampled data and stores it in memory. The index structure of the global index is HNSW (Hierarchical Navigable Small World graph index).

The partition index construction module is used for creating a partition index for each partition and storing the partition index in memory; the index structure of the partition index is also HNSW.

The query module includes a query initiating module, a global index query module, a partition query module, and a sorting module.

The query initiating module is used for initiating a query Q. The query Q specifies a query vector, the number s of partitions to be queried, and the number k of result vectors. The query vector is set by a user; 1 ≤ s ≤ m, and s ∈ N; k ≥ 1, and k ∈ N.

The global index query module is used for searching the global index for the p sampled vectors closest to the point represented by the query vector, which gives the preliminary result vectors; counting, by partition label, the number of preliminary result vectors contained in each partition; sorting the partitions in descending order of that count; and selecting the first s partitions that contain a nonzero number of preliminary result vectors as the partitions to be queried; p ≥ s, and p ∈ N.

When the number of partitions selected by the global index query module is less than s, either s is reassigned to the actual number of partitions to be queried and those partitions are queried, or the number of preliminary result vectors, i.e., the value of p, is increased until s partitions to be queried are selected.
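
The partition selection described above, including the fallback of increasing p, can be sketched in Python as follows. This is a minimal sketch under stated assumptions: `search_global` is a hypothetical callable that returns the partition labels of the p sampled vectors nearest to the query (it stands in for a search of the HNSW global index), and `step` and `p_max` are illustrative parameters controlling the growth of p. None of these names appear in the patent.

```python
from collections import Counter


def select_partitions(nearest_labels, s):
    """Pick up to s partition labels, most frequent among the hits first."""
    counts = Counter(nearest_labels)
    # Descending count; ties broken by the smaller label for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [label for label, _ in ranked[:s]]


def select_with_growth(search_global, s, p, step, p_max):
    """Grow p until s distinct partitions appear or p_max is reached.

    If p_max is reached first, fewer than s partitions are returned,
    which corresponds to reassigning s to the actual number found.
    """
    while True:
        chosen = select_partitions(search_global(p), s)
        if len(chosen) >= s or p >= p_max:
            return chosen, p
        p = min(p + step, p_max)


# Hypothetical labels of the sampled vectors, ordered by distance to a query.
ranked_labels = [2, 2, 2, 1, 2, 1, 0, 3]
chosen, final_p = select_with_growth(
    lambda p: ranked_labels[:p], s=3, p=3, step=2, p_max=8
)
```

With these toy labels, p = 3 and p = 5 yield only one and two distinct partitions, so p grows to 7 before three partitions are available.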

The partition query module is used for querying the points in the s partitions to be queried to obtain, in each such partition, the k points closest to the query vector, namely the partition result vectors.

The sorting module is used for sorting the s × k partition result vectors obtained after the partition query module has queried the s partitions, and selecting the k points closest to the point represented by the query vector as the result vectors.
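
The final sorting step can be sketched in Python as a merge of per-partition results. This is only an illustrative sketch: the representation of each partition result as a list of `(distance, id)` pairs and the function name `merge_top_k` are assumptions, not details given in the patent.

```python
import heapq


def merge_top_k(partition_results, k):
    """Merge per-partition (distance, id) lists into the k closest overall."""
    candidates = (hit for hits in partition_results for hit in hits)
    return heapq.nsmallest(k, candidates)


# Hypothetical results from s = 3 partitions with k = 2 hits each.
results = [
    [(0.10, "a"), (0.40, "b")],  # partition 0
    [(0.05, "c"), (0.90, "d")],  # partition 1
    [(0.30, "e"), (0.35, "f")],  # partition 2
]
top = merge_top_k(results, 2)
```

Selecting the k smallest of the s × k candidates avoids fully sorting them, which matters little at this size but keeps the step cheap when s and k grow.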

In some embodiments of the Spark-based large-scale high-dimensional data approximate neighbor query system provided by the invention, the number of sampled vectors, i.e., the value of n, may be set by the user.

In some embodiments of the system, the number of preliminary result vectors, i.e., the value of p, may be set by the user; the amount by which the number of preliminary result vectors is increased may also be set by the user.

In some embodiments of the system, the number of partitions to be queried, i.e., the value of s, may be set by the user.

In some embodiments of the system, the number of partitions, i.e., the value of m, may be set by the user.

In a second aspect, an embodiment of the present invention provides a Spark-based large-scale high-dimensional data approximate neighbor query method. The method includes:

vector acquisition, index construction, and query;

the vector acquisition means acquiring the vectors to be processed by the system, namely the data set to be processed, which includes vectors converted from unstructured data; each vector to be processed can be regarded as a point in the system;

the index construction includes:

clustering partitioning, global index construction, and partition index construction;

the clustering partitioning means calculating m partition centroids of the data set and dividing the data set into m different partitions, so that the vectors to be processed in each partition are homogeneous, that is, the points in each partition are close to one another; m ≥ 2, and m ∈ N;

the global index construction means building a global index on the master node of the system, and includes data sampling, data marking, and index establishment; the data sampling uniformly samples n vectors to be processed (points) from each partition, according to the data set and the resources of the master node, to obtain sampled data that represents the distribution of the vectors in that partition; the data marking marks the sampled data with the partition label of its partition; the index establishment builds the global index on the master node from the sampled data and stores it in memory; the index structure of the global index is HNSW;

the partition index construction means creating a partition index for each partition and storing the partition index in memory; the index structure of the partition index is also HNSW;

the query includes:

query initiation, global index query, partition query, and sorting;

the query initiation means initiating a query Q; the query Q specifies a query vector, the number s of partitions to be queried, and the number k of result vectors; the query vector is set by a user; 1 ≤ s ≤ m, and s ∈ N; k ≥ 1, and k ∈ N;

the global index query means searching the global index for the p sampled vectors closest to the point represented by the query vector, which gives the preliminary result vectors; counting, by partition label, the number of preliminary result vectors contained in each partition; sorting the partitions in descending order of that count; and selecting the first s partitions that contain a nonzero number of preliminary result vectors as the partitions to be queried; p ≥ s, and p ∈ N;

when the number of partitions selected by the global index query is less than s,

either s is reassigned to the actual number of partitions to be queried, and those partitions are queried;

or the number of preliminary result vectors, i.e., the value of p, is increased until s partitions to be queried are selected;

the partition query means querying the points in the s partitions to be queried to obtain, in each such partition, the k points closest to the query vector, namely the partition result vectors;

the sorting means sorting the s × k partition result vectors obtained after the partition query has covered the s partitions, and selecting the k points closest to the point represented by the query vector as the result vectors.

In some embodiments of the Spark-based large-scale high-dimensional data approximate neighbor query method provided by the invention, the number of sampled vectors, i.e., the value of n, may be set by the user.

In some embodiments of the method, the number of preliminary result vectors, i.e., the value of p, may be set by the user; the amount by which the number of preliminary result vectors is increased may also be set by the user.

In some embodiments of the method, the number of partitions to be queried, i.e., the value of s, may be set by the user.

In some embodiments of the method, the number of partitions, i.e., the value of m, may be set by the user.

The beneficial effects of the invention are as follows. Existing Spark-based frameworks can perform only exact neighbor queries, not approximate ones; they do not consider the requirements of non-spatial high-dimensional vectors and cannot support the large amount of existing non-spatial high-dimensional vector data. To address these problems, the invention fully mines the relationships among vectors and provides a Spark-based index structure suited to the characteristics of non-spatial high-dimensional vectors, which supports large amounts of non-spatial high-dimensional vector data and is suitable for approximate neighbor search, together with a corresponding query mechanism. To address the heavy workload and long query time caused by traversing all partitions, collecting the results of every partition, and producing the final result after centralized processing, the invention provides a two-level index architecture consisting of a global index and partition indexes: before a query, the relevant partitions are screened out using information about each partition, and during the query only the relevant partitions are searched, which effectively improves the query throughput of the system. With the technical solution provided by the invention, high-throughput, low-latency, and scalable approximate neighbor queries can be achieved; additional speedup is obtained by sacrificing accuracy to a degree acceptable to the application, and query throughput is greatly improved.

Spark is a mainstream distributed in-memory computing engine with strong scalability, the most active open-source community, and a very large user base. Since its release, Spark has become a general-purpose processing engine that plays an important role in machine learning, graph processing, and stream data processing. A Spark-based large-scale high-dimensional data approximate neighbor query system and method therefore has wide applicability and great significance.

Drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without inventive effort.

Fig. 1 is a block diagram of the Spark system.

Fig. 2 is a schematic diagram of the main structures held in memory in an embodiment of the Spark-based large-scale high-dimensional data approximate neighbor query system of the present invention.

Fig. 3 is a schematic diagram of HNSW.

Fig. 4 is a structural diagram of an index building module of an embodiment of the Spark-based large-scale high-dimensional data approximate neighbor query system of the present invention.

FIG. 5 is a block diagram of a query module according to an embodiment of the present invention.

Fig. 6 is a schematic diagram of the overall structure of an embodiment of the Spark-based large-scale high-dimensional data approximate neighbor query system of the present invention.

FIG. 7 is a comparison graph of query delays for two query methods with different data amounts.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It should be understood by those skilled in the art that the embodiments described are a part of the embodiments of the present invention, and not all embodiments. Based on the embodiments in this application, a person skilled in the art can make any suitable modification or variation to obtain all other embodiments.

In this embodiment, the system includes a vector acquisition module, an index construction module, and a query module. Fig. 2 is a schematic diagram of the main structures of the system held in memory. The vector acquisition module acquires the vectors to be processed by the system, namely the data set to be processed. The data in this data set is vector data, including massive vector data converted from unstructured data such as pictures, speech, and text.

HNSW is a graph-based approximate neighbor search method. HNSW introduces a hierarchical structure on top of NSW; the idea resembles a skip list in one-dimensional space: the edges are layered, and the average degree of each vertex within a layer is constant. The specific structure of HNSW is shown in Fig. 3. In HNSW, the bottom-layer graph contains all of the vector data, the amount of vector data in a graph decreases as the layer number increases, and the data of every upper-layer graph is contained in the bottom-layer graph. During a query, the nearest neighbor found in one layer is used as the starting point for the search in the layer below, down to the bottom layer. In Fig. 3, point A is the query vector, and point B is the result vector found at layer 2; the search continues with point B as the starting point at layer 1 and finds point C as the layer-1 result vector; with point C as the starting point at layer 0, the search continues and finds point D as the layer-0 result vector, which is the result among all vectors contained in the index. Building the partition indexes with HNSW makes approximate neighbor queries possible in the Spark system.
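
The layer-by-layer descent described above can be sketched in Python as follows. This is a deliberately simplified illustration, not the patent's implementation and not a complete HNSW: each layer is given as a hand-built adjacency dictionary, the search keeps a single current vertex, and graph construction is omitted. Practical HNSW implementations typically keep a list of several candidates at the bottom layer rather than one. The names `greedy_layer_search` and `layered_search` and the toy graph are assumptions made for this example.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def greedy_layer_search(points, graph, entry, q):
    """Walk one layer: move to a closer neighbor until none is closer."""
    current = entry
    improved = True
    while improved:
        improved = False
        for nb in graph[current]:
            if dist(points[nb], q) < dist(points[current], q):
                current, improved = nb, True
    return current


def layered_search(points, layers, entry, q):
    """layers[0] is the top (sparsest) layer, layers[-1] the bottom layer.

    The result of each layer is the starting point for the layer below.
    """
    current = entry
    for graph in layers:
        current = greedy_layer_search(points, graph, current, q)
    return current


# Toy index (hypothetical): eight points on a line, three layers.
points = [(float(i),) for i in range(8)]
layers = [
    {0: [4], 4: [0]},                                    # layer 2
    {0: [2], 2: [0, 4], 4: [2, 6], 6: [4]},              # layer 1
    {i: [j for j in (i - 1, i + 1) if 0 <= j <= 7]       # layer 0 (all points)
     for i in range(8)},
]
nearest = layered_search(points, layers, entry=0, q=(5.2,))
```

In this toy run the search moves from vertex 0 to vertex 4 at layer 2, to vertex 6 at layer 1, and finally to vertex 5 at layer 0, mirroring the A, B, C, D walk described for Fig. 3.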

The index construction is divided into three steps: clustering partition, partition index construction, and global index construction, as shown in fig. 4.

The clustering partition stage aims to partition the data set into m different partitions so that the vectors to be processed in each partition are homogeneous, for example so that, measured by Euclidean distance, the points in each partition are close to one another. In the clustering partition process, the k-means method is used to calculate m partition centroids of the data set, and the data set is divided into the corresponding partitions, so that each resulting partition is homogeneous. The partition index construction stage creates a local partition index for each partition and stores it in memory, using HNSW as the index structure of the partition index.
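
The clustering partition step can be illustrated with a small single-machine k-means sketch in Python. It is an assumption-laden illustration only: the patent names the k-means method but does not give an implementation, a Spark deployment would more likely rely on a distributed library implementation, and the initial centroids here are chosen by hand so that the example is deterministic.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_centroid(v, centroids):
    """Index of the centroid closest to v."""
    return min(range(len(centroids)), key=lambda j: dist(v, centroids[j]))


def kmeans(vectors, centroids, iterations=10):
    """Lloyd's algorithm starting from the given initial centroids."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for v in vectors:
            groups[nearest_centroid(v, centroids)].append(v)
        # Recompute each centroid as the mean of its group; keep it if empty.
        centroids = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    # The label of a vector is the partition it would be assigned to.
    labels = [nearest_centroid(v, centroids) for v in vectors]
    return centroids, labels


# Toy data (hypothetical): two well-separated groups, m = 2 partitions.
vectors = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
           (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centroids, labels = kmeans(vectors, [(0.0, 0.0), (10.0, 10.0)])
```

The returned labels play the role of partition labels: vectors with the same label would be placed in the same partition.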

The global index construction, as shown in fig. 5, refers to building a global index on the master node of the Spark cluster. It comprises three steps: data sampling, data marking, and index establishment. According to the data volume and the resources of the master node, a certain proportion of the data is sampled from each partition, and the sampled data is then marked with the label of the partition it came from. For example, if the number of partitions is N, the labels are numbered 0 to N-1. Because the sampling is random, the sampled data can largely represent information about the partition such as its data distribution and data volume. A global index is then built from the sampled data.
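
The sampling and marking steps can be sketched in Python as follows. This is a minimal sketch under assumptions: partitions are represented as plain lists indexed by their label, the sampling proportion is passed as `fraction`, and at least one vector is taken from every non-empty partition. The function name and the fixed seed are illustrative choices, not details from the patent.

```python
import random


def sample_and_label(partitions, fraction, seed=0):
    """Uniformly sample a fraction of each partition and tag it with its label."""
    rng = random.Random(seed)
    labeled = []
    for label, vectors in enumerate(partitions):
        # At least one sample per non-empty partition, never more than its size.
        n = min(len(vectors), max(1, round(len(vectors) * fraction)))
        for v in rng.sample(vectors, n):
            labeled.append((v, label))
    return labeled


# Toy data (hypothetical): two partitions with labels 0 and 1.
partitions = [
    [(float(i), 0.0) for i in range(10)],
    [(float(i), 1.0) for i in range(20)],
]
labeled = sample_and_label(partitions, fraction=0.2)
```

The resulting `(vector, label)` pairs are what the global index would be built from, so that every hit in the global index can be traced back to its partition.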

The global index is built to prune irrelevant partitions as effectively as possible and to reduce the number of partitions to be queried, thereby saving CPU resources, reducing network cost, and improving query throughput. On the other hand, it introduces additional space and time overhead. To balance the two, the global index should be space-efficient, fast to search, and strong at pruning. Partition pruning is a major advantage of clustering-based partitioning, but selecting the partitions to access based only on the distance between the query vector and the cluster centers is not entirely sound. Although the vectors in each partition are similar after clustering, the cluster center cannot fully represent the distribution of the vectors in the partition. A cluster center may be relatively far from the query vector while its partition still contains several results similar to the query vector, in which case partition pruning fails. Because the data has already been partitioned by clustering when the global index is built, a small amount of data can be sampled uniformly from each partition, and the sampled data effectively represents the distribution of the partition's vectors. The sampled vectors are marked with their partition labels, and the HNSW global index is then built on the master node. If too much data is sampled, the global index becomes too large and its efficiency suffers. If too little data is sampled, the samples cannot effectively represent the distribution of the partition's vectors, and building the global index becomes meaningless. A balance must therefore be struck so that the sample is just large enough to represent the distribution of the partition's vectors.
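
The failure case of centroid-only pruning described above can be illustrated with a hand-made toy example in Python. The data is constructed purely for illustration and is not taken from the patent; for simplicity every point of each partition is used as a labeled sample, and brute-force search stands in for the HNSW global index.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    return tuple(sum(col) / len(vectors) for col in zip(*vectors))


# Toy partitions (hypothetical): partition 0 is elongated, partition 1 is compact.
partitions = {
    0: [(0.0, 0.0), (2.0, 0.0), (8.0, 0.0), (10.0, 0.0)],
    1: [(8.0, 4.0), (9.0, 4.0), (10.0, 4.0)],
}
q = (9.8, 0.0)

# Centroid-only pruning: pick the partition whose centroid is closest to q.
by_centroid = min(partitions, key=lambda lab: dist(centroid(partitions[lab]), q))

# Sample-based selection: pick the partition of the closest labeled sample.
samples = [(v, lab) for lab, vs in partitions.items() for v in vs]
by_sample = min(samples, key=lambda sv: dist(sv[0], q))[1]
```

Here the centroid of partition 1 is closer to the query than the centroid of partition 0, yet the true nearest neighbor (10.0, 0.0) lies in partition 0, which only the sample-based selection finds.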

Accordingly, the index building module includes a cluster partitioning module, a partitioned index building module, and a global index building module, as shown in fig. 4.

The query module includes the query initiating module, the global index query module, the partition query module, and the sorting module, as shown in Fig. 5. The query initiating module initiates a query Q after receiving a query vector set by a user. The query vector may be, for example, a vector converted from text entered by a user when searching in an app (application program) such as Taobao, or a vector converted from customer preferences that the recommendation program of such an app infers from collected data. The user here is a user of the system provided by the invention and may be an end user or a client program; in short, it is any designated person, program, or device from which the system receives a query vector.

During a query, several vectors adjacent to the query vector are first searched for in the global index. Because the global index is also built with HNSW, the HNSW search described above is used: the nearest neighbor found in one layer is used as the starting point for the search in the layer below, down to the bottom layer, which yields the preliminary result vectors. Since clustering has already placed similar vectors in the same partition, the top-ranked similar vectors usually lie in the same few partitions, so the labels of the partitions most likely to contain the nearest neighbor vectors can be obtained. The number s of partitions to be queried and the number k of result vectors may be specified by the user or set in the query initiating module; if set in the module, each may be a default value or a value chosen by selection logic. When the query range covers all partitions, the query takes the longest time and gives the most accurate result, and throughput decreases accordingly. If the query range is restricted according to some rule, query efficiency improves, and the accuracy of the query result depends on that rule.

The partition query module searches the partition indexes, which are built with HNSW, using the same HNSW search: the nearest neighbor found in one layer is used as the starting point for the search in the layer below, down to the bottom layer, which yields the partition result vectors. The sorting module sorts the s × k partition result vectors obtained after the partition query module has queried the s partitions and selects the k points closest to the point represented by the query vector as the result vectors. Fig. 6 is a schematic diagram of the overall structure of the system.

The technical solution of this embodiment executes the approximate neighbor query mainly in memory. First, the vectors are clustered into partitions according to their similarity, and each cluster partition corresponds to one partition of a Spark resilient distributed dataset (RDD). The data of each partition is sampled proportionally and labeled with its partition. The global index is built on the master node from the sampled data, and a partition index is built on each corresponding partition. At query time, the partitions that need to be queried are found through the global index, and the results from those partitions are merged and sorted to obtain the final result. The technical solution provides a highly scalable distributed approximate neighbor query scheme on the Spark system while achieving low latency and high throughput.
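
The overall two-level query flow summarized above can be sketched end to end in Python. This is a single-process illustration under strong assumptions: brute-force search stands in for both the HNSW global index and the HNSW partition indexes, partitions are held in a plain dictionary rather than a Spark RDD, and the function name `query` and all toy data are hypothetical.

```python
import heapq
import math
from collections import Counter


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def query(q, global_samples, partitions, s, k, p):
    """Two-level search; brute force stands in for both HNSW indexes."""
    # Global step: the p labeled samples closest to q.
    nearest = heapq.nsmallest(p, global_samples, key=lambda sv: dist(sv[0], q))
    counts = Counter(label for _, label in nearest)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    chosen = [label for label, _ in ranked[:s]]
    # Partition step: the k closest points inside each chosen partition.
    hits = []
    for label in chosen:
        hits += heapq.nsmallest(k, ((dist(v, q), v) for v in partitions[label]))
    # Sorting step: the best k of the (at most) s * k candidates.
    return heapq.nsmallest(k, hits)


# Toy data (hypothetical): three partitions and four labeled samples.
partitions = {
    0: [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)],
    1: [(5.0, 5.0), (5.0, 6.0), (6.0, 5.0)],
    2: [(10.0, 0.0), (10.0, 1.0), (11.0, 0.0)],
}
global_samples = [((0.0, 0.0), 0), ((1.0, 1.0), 0),
                  ((5.0, 5.0), 1), ((10.0, 0.0), 2)]
result = query((0.9, 0.8), global_samples, partitions, s=1, k=2, p=2)
```

With s = 1 only partition 0 is searched, so two of the three partitions are skipped entirely; raising s trades throughput for a higher chance of finding neighbors that lie in other partitions.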

In this embodiment, the number of sampled vectors, i.e., the value of n, may be set by the user. Because users are likely to know the volume and distribution of their own data best, exposing this setting gives them greater freedom to customize queries.

In this embodiment, the number of preliminary result vectors, i.e., the value of p, may be set by the user, as may the amount by which that number is increased, for the same reason.

In this embodiment, the number of partitions to be queried, i.e., the value of s, may be set by the user, for the same reason.

In this embodiment, the number of partitions, i.e., the value of m, may be set by the user, for the same reason.

vector acquisition, index construction and query;

the index construction comprises the following steps:

the query, comprising:

query initiation, global index query, partition query and sorting;

the query initiation is to initiate a query Q; the query Q specifies a query vector, the number s of partitions to be queried and the number k of result vectors, and the total number of partitions is m; the query vector is set by a user; 1 ≤ s ≤ m and s ∈ N; k ≥ 1 and k ∈ N;

Global index construction means that a global index is built on the master node of the Spark cluster. It comprises three steps: data sampling, data marking and index establishment. Depending on the data volume and the resources available on the master node, a certain proportion of the data is sampled from each partition, and each sampled item is then marked with the label of the partition it came from. For example, if the number of partitions is m, the labels are numbered 0 to m-1. Because the sampling is random, the sampled data largely reflects the data distribution, data volume and similar properties of each partition. The global index is then built from the sampled data.
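The data sampling and data marking steps can be sketched as follows. This is a minimal plain-Python sketch under stated assumptions: the function name `sample_and_label`, the `fraction` parameter and the fixed `seed` are hypothetical, and the index establishment step (building an HNSW index over the labeled samples) is deliberately left out.

```python
import random


def sample_and_label(partitions, fraction, seed=0):
    """Randomly sample a fraction of each partition and tag every sample
    with the label (0 to m-1) of the partition it came from.

    Returns a list of (vector, partition_label) pairs.
    """
    rng = random.Random(seed)  # fixed seed only to keep the sketch repeatable
    labeled = []
    for label, points in enumerate(partitions):
        # Take at least one sample from every non-empty partition.
        n = min(len(points), max(1, int(len(points) * fraction)))
        for p in rng.sample(points, n):
            labeled.append((p, label))
    return labeled
```

The resulting pairs are what the global index would be built on: a search hit in the global index carries its partition label with it.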

The query module includes the query initiation module, the global index query module, the partition query module and the sorting module, as shown in Fig. 5. The query initiation module initiates a query Q after receiving a query vector set by a user. The query vector may be, for example, a vector converted from text that a user enters when searching in an app such as Taobao, or a vector converted from customer preferences that the app's recommendation program infers from collected data. The user here is a user of the system provided by the invention: it may be an end user or a client program; in short, it is whatever designated person, program or device the system receives the query vector from. During a query, several vectors neighboring the query vector are first searched for in the global index. Because the global index is also built with HNSW, the HNSW search method described above is used here as well: the nearest neighbor found in one layer serves as the starting point for the search in the next layer down, until the bottom layer is reached and the preliminary result vectors are obtained. Since clustering has already placed similar vectors in the same partition, the most similar vectors usually lie in the same partitions, so the partition labels of the partitions most likely to contain the nearest neighbors can be obtained. The number s of partitions to be queried and the number k of result vectors may be specified by the user or set in the query initiation module; if set in the module, each may be a default value or a value chosen by selection logic. When the query range is all partitions, the query takes the longest time and the result is the most accurate, but throughput falls accordingly.
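How partition labels are derived from a global index lookup can be illustrated with a short sketch. A brute-force scan stands in for the HNSW search here, which is a deliberate simplification; the names `squared_distance` and `select_partitions` are hypothetical, and p denotes the number of nearest samples retrieved from the global index.

```python
import heapq
from collections import Counter


def squared_distance(a, b):
    # Assumed metric: squared Euclidean distance.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def select_partitions(query, labeled_samples, p, s):
    """Pick up to s partitions that hold the most of the p nearest samples.

    labeled_samples is a list of (vector, partition_label) pairs.
    A linear scan replaces the HNSW global index in this sketch.
    """
    nearest = heapq.nsmallest(
        p, labeled_samples, key=lambda item: squared_distance(query, item[0])
    )
    counts = Counter(label for _, label in nearest)
    # most_common sorts partitions by descending hit count; partitions with
    # zero hits never appear, so fewer than s labels may be returned.
    return [label for label, _ in counts.most_common(s)]
```

If fewer than s partitions contain any of the p nearest samples, the function simply returns the shorter list, and only those partitions would be queried.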

Experimental results obtained with the technical solution provided by the invention show that it achieves low latency and high throughput at the same time.

The software used in the experiment was Spark 2.3.0, Scala 2.11.8, OpenJDK 64-Bit Server VM, Java 1.8.0_232, and Ubuntu 18.04.5 LTS.

The machine used for the experiment was configured as follows: AMD Ryzen 7 5800H (8 cores), 16 GB, Spark local mode.

The experimental data set was the SIFT data set, which provides a set of 1 billion vectors for evaluating approximate neighbor search methods.

Two methods were compared. Query I is a brute-force search query: the data is partitioned randomly, no index is built in the partitions, heap sorting is applied in each partition at query time to obtain the most similar vectors, and the results of all partitions are collected and sorted to find the final result. Query II is the query with clustering partitions, a global index and partition indexes: the data is distributed to the partitions after clustering, an HNSW partition index is built on each partition, the global index is built on the master node, the partitions that need to be queried are found through the global index at query time, similar results are retrieved using the HNSW index of each such partition, and the results of the partitions are collected and sorted to obtain the final result.

The parameters used for the experiment were: 10 clustering partitions, 10 partitions to be queried for Query I, and 3 partitions to be queried for Query II.

Fig. 7 compares the two methods on completing 500 query vectors, with the recall rate held at 90% and with 100,000, 200,000 and 400,000 base vectors respectively. Fig. 7 shows that Query II greatly improves the throughput of the system compared with Query I. With the technical solution provided by the invention, high-throughput, low-latency and scalable approximate neighbor query can be achieved: additional speedup is obtained by giving up an amount of accuracy that the application can tolerate, and query throughput is greatly improved, which demonstrates the effectiveness of the technical solution.
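For reference, the recall rate mentioned above is commonly computed as the share of true nearest neighbors that the approximate query returns. The sketch below uses that common definition as an assumption, since the measurement procedure is not spelled out here; the function name `recall_at_k` is hypothetical, and results are represented by vector identifiers.

```python
def recall_at_k(approx_results, exact_results):
    """Fraction of exact nearest neighbors recovered by the approximate query.

    Both arguments are lists with one entry per query, each entry being
    the list of vector identifiers returned for that query.
    """
    hits = sum(
        len(set(approx) & set(exact))
        for approx, exact in zip(approx_results, exact_results)
    )
    total = sum(len(exact) for exact in exact_results)
    return hits / total
```

Under this definition, a recall of 90% with k = 10 means that on average nine of the ten true nearest neighbors appear in each approximate result.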

Claims

1. A Spark-based large-scale high-dimensional data approximate neighbor query system, comprising:

a vector acquisition module, an index construction module and a query module;

the vector acquisition module is used for acquiring the vectors to be processed by the system, i.e., the data set to be processed, which comprises the vectors to be processed converted from unstructured data to be processed; each vector to be processed can be regarded as a point in the system;

the index construction module comprises:

a clustering partitioning module, a global index construction module and a partition index construction module;

the clustering partitioning module is used for calculating m partition centroids of the data set and dividing the data set into m different partitions, so that the vectors to be processed in each partition are homogeneous, i.e., the points in each partition are close to each other; m ≥ 2 and m ∈ N;

the global index construction module is used for constructing a global index on a master node of the system; the global index construction module comprises a data sampling unit, a data marking unit and an index establishing unit; the data sampling unit is configured to uniformly sample n vectors to be processed, i.e., the points, from each partition as sampled data according to the data set and the resource condition of the master node, the sampled data representing the distribution of the vectors to be processed in that partition; the data marking unit is used for marking the sampled data with the partition label of the partition from which it was sampled; the index establishing unit is used for establishing the global index in the master node using the sampled data and storing the global index in memory; the index structure of the global index is HNSW;

the partition index constructing module is used for creating a partition index for each partition and storing the partition index in the memory; the index structure of the partition index adopts the HNSW;

the query module comprises:

a query initiation module, a global index query module, a partition query module and a sorting module;

the query initiation module is used for initiating a query Q; the query Q specifies a query vector, the number s of partitions to be queried, and the number k of result vectors; the query vector is set by a user; 1 ≤ s ≤ m and s ∈ N; k ≥ 1 and k ∈ N;

the global index query module is configured to search the global index for the p sampled data closest to the point represented by the query vector to obtain preliminary result vectors, count the number of preliminary result vectors contained in each partition according to the partition labels, sort the partitions in descending order of that number, and select the first s partitions whose number of preliminary result vectors is nonzero as the partitions to be queried; p ≥ s and p ∈ N;

when the number of partitions selected by the global index query module is less than s,

s is reassigned to the actual number of partitions to be queried, and those partitions are queried;

the partition query module is configured to query the points in the s partitions to be queried to obtain, in each partition to be queried, the k points closest to the query vector, i.e., the partition result vectors;

2. The system of claim 1, wherein the number of the sampled data, i.e., the value of n, is settable by the user.

3. The system according to claim 1, wherein the number of said preliminary result vectors, i.e. the value of said p, can be set by a user; the magnitude of the increase in the number of preliminary result vectors may also be set by the user.

4. A system according to any of claims 1-3, characterized in that the number of partitions to be queried, i.e. the value of s, can be set by the user.

5. A system according to any of claims 1-3, characterized in that the number of partitions, i.e. the value of m, can be set by the user.

6. A Spark-based large-scale high-dimensional data approximate neighbor query method is characterized by comprising the following steps:

vector acquisition, index construction and query;

the vector acquisition refers to acquiring the vectors to be processed by the system, i.e., the data set to be processed, which comprises the vectors to be processed converted from unstructured data to be processed; each vector to be processed can be regarded as a point in the system;

the index construction comprises the following steps:

the clustering partition refers to calculating m partition centroids of the data set and dividing the data set into m different partitions, so that the vectors to be processed in each partition are homogeneous, i.e., the points in each partition are close to each other; m ≥ 2 and m ∈ N;

the global index construction means that a global index is constructed on a master node of the system; the global index construction comprises data sampling, data marking and index establishment; the data sampling is to uniformly sample n vectors to be processed, i.e., the points, from each partition as sampled data according to the data set and the resource condition of the master node, the sampled data representing the distribution of the vectors to be processed in that partition; the data marking is to mark the sampled data with the partition label of the partition from which it was sampled; the index establishment is to establish the global index in the master node using the sampled data and store the global index in memory; the index structure of the global index is HNSW;

the partition index construction means that a partition index is created for each partition, and the partition index is stored in the memory; the index structure of the partition index adopts the HNSW;

the query includes:

query initiation, global index query, partition query and sorting;

the global index query is to search the global index for the p sampled data closest to the point represented by the query vector to obtain preliminary result vectors, count the number of preliminary result vectors contained in each partition according to the partition labels, sort the partitions in descending order of that number, and select the first s partitions whose number of preliminary result vectors is nonzero as the partitions to be queried; p ≥ s and p ∈ N;

and the sorting is to sort the s × k partition result vectors obtained after the partition query has queried the s partitions to be queried, and to select the k points closest to the point represented by the query vector as the result vectors.

7. The method of claim 6, wherein the number of the sampled data, i.e. the value of n, can be set by the user.

8. The method according to claim 6, wherein the number of preliminary result vectors, i.e. the value of p, can be set by a user; the magnitude of the increase in the number of preliminary result vectors may also be set by the user.

9. Method according to any of claims 6-8, wherein the number of partitions to be queried, i.e. the value of s, can be set by the user.

10. Method according to any of claims 6-8, wherein the number of partitions, i.e. the value of m, can be set by the user.