CN106055674A

CN106055674A - Top-k dominating query method based on metric space in a distributed environment

Info

Publication number: CN106055674A
Application number: CN201610393610.1A
Authority: CN
Inventors: 何洁月; 罗浩
Original assignee: Southeast University
Current assignee: Southeast University
Priority date: 2016-06-03
Filing date: 2016-06-03
Publication date: 2016-10-26
Anticipated expiration: 2036-06-03
Also published as: CN106055674B

Abstract

The invention discloses a top-k dominating query method based on metric space in a distributed environment, comprising the following steps in order. Step 1: given a query input set Q and a distance formula d() in the metric space, the distance formula is used to measure the distance between each data object and the query input set Q. Step 2: on the basis of Step 1, a parallel algorithm based on set ANN and k-skyband is proposed. By making full use of parallel computation across the nodes of a distributed environment, and through pruning and sorting, the method greatly improves the performance of metric-space top-k dominating queries on large data sets, speeds up queries, and supports users' decision-making.

Description

A Top-k Dominating Query Method Based on Metric Space in a Distributed Environment

Technical field

The invention relates to a query method, and in particular to a parallel top-k dominating query method based on metric space for massive data sets in a distributed environment.

Background

As an important type of complex query, the metric-space top-k dominating query is attracting increasing attention. It returns, from a massive multi-dimensional data set, a subset of the data that meets the user's needs. Queries of this type support user decision-making and are widely used in fields such as web search, multimedia retrieval, and e-commerce. The query does not require the user to supply a scoring function, and the size of the result set is controllable: it computes the dominance score of each object and returns the k objects with the highest dominance scores.

The metric-space top-k dominating query is defined as follows. Let O = {o₁, o₂, …, o_n} denote the set of all data objects, where o_i is the i-th data object; each data object has D dimensions and is a point in the space. For a top-k dominating query in a metric space, Q denotes the query input set and d() denotes the distance formula in the metric space. The distance formula can be user-defined, for example the shortest path in a graph, the maximum flow in a network, or the Manhattan distance. k means that the k results with the highest dominance scores are returned. Dominance is defined as follows: for o_i ∈ O and o_i' ∈ O, the symbol < denotes the dominance relation between two objects; if o_i < o_i', then:

Given a data object o_i ∈ O, the dominance score dscore of o_i is the number of objects in the entire data set that it dominates:

dscore(o_i) = |{o_j ∈ O | o_i < o_j}|

The metric-space top-k dominating query only needs to obtain the k objects with the highest dominance scores, and is a dynamic top-k dominating query. Tiakas E. et al. first proposed the concept, but studied it only in the traditional single-machine setting. As data sets grow rapidly, traditional single-machine algorithms run into performance bottlenecks, and the M-tree index structure used by Tiakas E. et al. is unsuitable for large data sets because it leads to a large amount of data redundancy. Research on parallel metric-space top-k dominating algorithms is therefore urgently needed.

Summary of the invention

Purpose of the invention: the purpose of the invention is to remedy the deficiencies of the prior art by providing a parallel top-k dominating query method based on metric space in a distributed environment.

Technical solution: the parallel top-k dominating query method based on metric space in a distributed environment according to the invention comprises the following steps, performed in order:

(1) Given a query input data object set Q and a distance formula d() in the metric space, the distance formula d() is used to measure the distance between each data object in O and the query input data object set Q;

(2) On the basis of step (1), a parallel algorithm based on set ANN and k-skyband is proposed. The parallel algorithm is as follows:

(21) Pruning with ANN(Q,k):

Using the distance metric function d() and the query input Q, compute the distances between all data objects and the query input objects, Deal_Data_RDD, and store them in the partitions. Each partition then computes its own ANN(Q,k) independently and in parallel. Finally, the per-partition ANN(Q,k) results are screened through the reduce interface to obtain the global ANN(Q,k). The global ANN(Q,k) is broadcast to every node and used to filter the original data set, which yields the candidate set KANN(Q,k)_RDD. KANN(Q,k)_RDD is guaranteed to contain the final top-k dominating result set D. The filtering rule is that an object is kept if it is not dominated by the objects in ANN(Q,k);

(22) Pruning with k-skyband:

Because KANN(Q,k)_RDD may be very large, directly computing the dominance scores of all its objects would also be very time-consuming. The k-skyband idea is therefore used: the k-skyband of KANN(Q,k)_RDD is computed as a further pruning step, which yields the final candidate set GlobalCandidate(k-skyband);

(23) Obtaining the top-k dominating result:

Compute the dominance scores of all objects in GlobalCandidate(k-skyband), then select the k objects with the highest dominance scores and return them as the top-k dominating result.

Further, in step (21), because the ANN(Q,k) of a partition is not necessarily the global ANN(Q,k), the ANN(Q,k) results of the partitions must be compared with one another by distance to obtain the global ANN(Q,k).

Further, step (23) is carried out as follows: the Cartesian product of the candidate set obtained in step (22) and the original data set is computed, and the reduceByKey API provided by Spark is then used to obtain the dominance score of each candidate.

Beneficial effects: the invention provides a metric-space top-k dominating query in a distributed environment and proposes three distributed algorithms for solving it. By making full use of parallel computation across the nodes of a distributed environment, and through pruning and sorting, it greatly improves the performance of metric-space top-k dominating queries on large data sets, speeds up queries, and supports users' decision-making. Specifically, it has the following advantages:

(1) A parallel skyline computation method is proposed in which every partition computes its skyline at the same time, so that the skyline, and from it the top-k dominating result set, can be obtained quickly;

(2) A parallel k-skyband computation method is proposed in which each partition computes its k-skyband independently, without affecting the others; owing to the properties of the k-skyband, the result is obtained without looping;

(3) A method is proposed that first prunes with set ANN and then computes the k-skyband in parallel. Effective pruning reduces the number of comparisons between data objects and thereby speeds up the query.

Description of drawings

Fig. 1 is a flowchart of the DAKDA algorithm of the invention;

Fig. 2 is a schematic diagram of the effect of k on the query in the embodiment;

Fig. 3 is a schematic diagram of the effect of m on the query in the embodiment;

Fig. 4 is a schematic diagram of the effect of c on the query in the embodiment;

Fig. 5 is a comparison of the scalability of the algorithms of the invention;

Fig. 6 is a distributed processing diagram of the invention;

Fig. 7 is an example diagram of the invention.

Detailed description

The technical solution of the invention is described in detail below, but the scope of protection of the invention is not limited to the described embodiments.

The symbols and parameters used below are defined in Table 1:

Table 1. Symbol description

Definition 1 (KNN(o,k)): Given a data set O, a metric function d(), and an object o ∈ O, the k-nearest neighbors of o are denoted KNN(o,k), the k objects closest to o.

Definition 2 (ANN(Q,k)): Given a data set O, a metric function d(), and a set of query input objects Q = {q₁, q₂, …, q_m}, ANN(Q,k) denotes the k objects closest to Q. The choice of set distance function affects the query; common set distance functions are the minimum, the maximum, and the average.
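Definition 2 can be illustrated with a small single-machine sketch in Python; the patent's own implementation is written in Scala on Spark and is not reproduced in the text. The names `ann`, `agg`, and `set_distance` and the sample points are hypothetical. `agg` stands in for the set distance function (minimum, maximum, or average), and the Manhattan distance is used only as one possible d().

```python
import heapq
from statistics import fmean

def manhattan(a, b):
    """One possible metric d(); any other metric could be passed instead."""
    return sum(abs(x - y) for x, y in zip(a, b))

def ann(objects, queries, k, d=manhattan, agg=fmean):
    """Return the k objects with the smallest aggregate distance to the query set.

    agg is the set distance function: min, max, or fmean (average).
    """
    def set_distance(o):
        return agg([d(o, q) for q in queries])
    return heapq.nsmallest(k, objects, key=set_distance)

# Hypothetical sample data.
objects = [(0, 0), (5, 5), (1, 1), (10, 10)]
queries = [(0, 0), (2, 2)]
nearest_by_max = ann(objects, queries, k=2, agg=max)
```

Swapping `agg` changes which objects count as nearest to Q, which is why the choice of set distance function affects the query.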

Definition 3 (dominance in a metric space): Let (O, d()) be a metric space and Q = {q₁, q₂, …, q_m} a set of query input objects. For an object o ∈ O, the set of its distances to all objects in Q is:

adist(o,Q) = {d(o,q₁), d(o,q₂), …, d(o,q_m)}

For an object p ∈ O, if o < p, then:

This dominance is measured by the magnitude of the distances.
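The dominance condition itself is not reproduced in this text. The sketch below assumes the usual Pareto-style reading, in which o dominates p when o is at least as close as p to every query object and strictly closer to at least one. That reading is an assumption, although it is consistent with the Fig. 7 example, where o₂ escapes domination by o₁ because it is closer to q₁. The function names and sample points are hypothetical, and the Euclidean distance (`math.dist`) is only one possible d().

```python
from math import dist  # Euclidean distance, used here as an example metric

def adist(o, queries, d=dist):
    """Distances from object o to every query object in Q, in query order."""
    return [d(o, q) for q in queries]

def dominates(o, p, queries, d=dist):
    """Assumed condition: o is no farther than p from every q, and closer to some q."""
    do, dp = adist(o, queries, d), adist(p, queries, d)
    no_farther = all(a <= b for a, b in zip(do, dp))
    strictly_closer = any(a < b for a, b in zip(do, dp))
    return no_farther and strictly_closer

# Hypothetical sample data.
queries = [(0, 0), (4, 0)]
o, p, r = (2, 0), (2, 3), (0, 0)
```

Here `o` dominates `p`, while `o` and `r` are incomparable because each is closer to a different query object.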

Definition 4 (metric-based top-k dominance): Given a set of query inputs Q and a distance metric function d(), and following the dominance relation in the metric space, the dominance score of a data object o_i ∈ O is:

dscore(o_i) = |{p ∈ O | o_i < p}|. The k objects with the highest dominance scores form the result set of the metric-space top-k dominating query.
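As a reference point for Definition 4, a brute-force single-machine sketch in Python is given here. It assumes the Pareto-style dominance condition on distance vectors (the condition is not reproduced in the text) and uses the Euclidean distance as an example d(). The function names and sample data are hypothetical, and the quadratic cost is exactly what the parallel algorithms aim to avoid.

```python
import heapq
from math import dist  # example metric d()

def dominates(o, p, queries, d=dist):
    """Assumed condition: o no farther than p from every q, and not identical distances."""
    do = [d(o, q) for q in queries]
    dp = [d(p, q) for q in queries]
    return all(a <= b for a, b in zip(do, dp)) and do != dp

def dscore(o, objects, queries, d=dist):
    """Number of objects in the data set that o dominates."""
    return sum(dominates(o, p, queries, d) for p in objects)

def top_k_dominating(objects, queries, k, d=dist):
    """The k objects with the highest dominance scores (brute force, O(n^2))."""
    return heapq.nlargest(k, objects, key=lambda o: dscore(o, objects, queries, d))

# Hypothetical sample data.
objects = [(1, 0), (2, 0), (3, 0), (0, 5)]
queries = [(0, 0), (1, 0)]
result = top_k_dominating(objects, queries, k=2)
```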

As shown in Fig. 7, in the metric-space top-k dominating query of this embodiment the query input is Q = {q₁, q₂} and the distance metric function d() is the Euclidean distance. The top-1 dominating result is o₁, because the distances from o₁ to q₁ and q₂ are both smaller than those of all points outside (or on) the circle. Only the object o₂ is not dominated by o₁, because the distance from o₂ to q₁ is smaller than the distance from o₁ to q₁. If the space contains n data objects, the dominance score of o₁ is dscore(o₁) = n−1, whereas o₂ fails to dominate at least the objects o₁ and o₃, so dscore(o₂) ≤ n−2. Hence dscore(o₁) > dscore(o₂), and the top-1 dominating object is o₁.

Definition 5 (k-skyband): Over the entire data space, the set of all objects o that are dominated by at most k−1 objects is the k-skyband.
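Definition 5 translates directly into a short Python sketch. For brevity, each object is represented by its distance vector adist(o,Q), and dominance is assumed to be the Pareto-style comparison of those vectors; both choices, and the sample vectors, are assumptions made for illustration.

```python
def dominates(a, b):
    """Assumed condition on distance vectors: a <= b componentwise and a != b."""
    return all(x <= y for x, y in zip(a, b)) and a != b

def k_skyband(vectors, k):
    """Objects dominated by at most k-1 other objects."""
    return [v for v in vectors
            if sum(dominates(u, v) for u in vectors) <= k - 1]

# Hypothetical distance vectors (d(o, q1), d(o, q2)) for six objects.
vectors = [(1, 1), (5, 5), (2, 6), (2, 2), (6, 1), (9, 9)]
skyline = k_skyband(vectors, 1)   # the 1-skyband is the ordinary skyline
band_2 = k_skyband(vectors, 2)
```

With k = 1 the definition reduces to the skyline, which is why the k-skyband can be seen as a generalization of it.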

Theorem 1: The top-k dominating result set D ⊆ k-skyband.

Proof. By contradiction. Suppose there is an object o₁ ∈ D that is dominated by more than k−1 objects. Then there are at least k objects whose dominance score satisfies dscore ≥ dscore(o₁) + 1, which contradicts o₁ ∈ D. Hence the top-k dominating result set D ⊆ k-skyband.

Theorem 2: Given the query input set Q, let {o₁, o₂, …, o_k} ⊆ O be the k objects of ANN(Q,k). The objects of O that are not dominated by these objects form the set KANN(Q,k), which includes the objects of ANN(Q,k) themselves. Then the top-k dominating result set D ⊆ KANN(Q,k).

Proof. Let o be the 1-nearest neighbor of the query set Q, i.e. ANN(Q,1). Because all objects in O − 1ANN(Q,1) are dominated by o, the top-1 dominating object must be in 1ANN(Q,1). If the top-1 dominating object is not o, it follows that the object with the second-highest dominance score must also be in 1ANN(Q,1); if the top-1 dominating object is o, the object with the second-highest dominance score must be in 2ANN(Q,2). Continuing in this way, the top-k dominating result set D ⊆ KANN(Q,k).

All of the following algorithms are implemented on the Spark platform:

(1) Skyline-based top-k dominating algorithm (DSDA)

In the existing DSDA, the data set is first randomly distributed across the nodes. The skyline algorithm is then implemented inside Spark's mapPartitions interface, which yields the skyline of each partition. Finally, the per-partition skylines are compared pairwise to obtain the global skyline, and the object with the highest dominance score in the skyline is returned as an element of the top-k dominating result set. Repeating this loop k times yields the final result set.
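The k-round loop of DSDA can be sketched on a single machine as follows. The text does not spell out what changes between rounds; this sketch assumes that the object selected in each round is removed before the next skyline is computed, while dominance scores are always counted against the full data set. Objects are represented by hypothetical distance vectors, and the per-partition skyline step is omitted.

```python
def dominates(a, b):
    """Assumed condition on distance vectors: a <= b componentwise and a != b."""
    return all(x <= y for x, y in zip(a, b)) and a != b

def skyline(vectors):
    """Objects that no other object dominates."""
    return [v for v in vectors if not any(dominates(u, v) for u in vectors)]

def dsda(vectors, k):
    """k rounds: pick the best-scoring skyline object, remove it, repeat."""
    remaining, result = list(vectors), []
    while remaining and len(result) < k:
        best = max(skyline(remaining),
                   key=lambda v: sum(dominates(v, u) for u in vectors))
        result.append(best)
        remaining.remove(best)  # assumed: the winner leaves the pool
    return result

# Hypothetical distance vectors for six objects.
vectors = [(1, 1), (5, 5), (2, 6), (2, 2), (6, 1), (9, 9)]
top_2 = dsda(vectors, 2)
```

Each round recomputes a skyline, which is why the query time of this approach grows with k.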

(2) k-skyband-based top-k dominating algorithm (DKDA)

The existing DKDA is parallelized on the Spark cluster; the idea of the parallel algorithm is similar to that of the skyline algorithm. By Theorem 1, the top-k dominating result set D ⊆ k-skyband, so the k-skyband is computed first, and the k objects with the highest dominance scores in the k-skyband are then returned as the top-k dominating result set.

The data set is first randomly distributed across the nodes. The k-skyband algorithm is then implemented inside Spark's mapPartitions interface, which yields the k-skyband of each partition. Finally, the per-partition k-skybands are compared pairwise to obtain the global k-skyband, and the k objects with the highest dominance scores in the k-skyband are returned as the top-k dominating result set. Compared with the skyline method, this method has the advantage of not requiring k loops, but computing the k-skyband of the original data set is very time-consuming.

(3) Parallel top-k dominating algorithm based on set ANN pruning and k-skyband (DAKDA)

Algorithm 1 requires k loops, so its query time grows with k, while Algorithm 2 spends a great deal of time computing the k-skyband of the original data set. The invention therefore applies pruning.

In the invention, by Theorem 1 the top-k dominating result set D ⊆ k-skyband, and by Theorem 2 D ⊆ KANN(Q,k). Because computing the k-skyband is more time-consuming than computing KANN(Q,k), set ANN is first used to discard data that cannot be candidates, which yields the candidate set KANN(Q,k). The k-skyband of KANN(Q,k) is then computed, and finally the k results with the highest dominance scores in the k-skyband are returned as the top-k dominating result. The steps are shown in Fig. 1:

Step 1: Pruning with ANN(Q,k)

As shown in stage 1 of Fig. 1, the data are processed using the distance metric function d() and the query input Q to obtain the distances between each object and the query objects, Deal_Data_RDD, which are stored in the partitions. The ANN(Q,k) of each partition is then computed, and finally the global ANN(Q,k) is obtained. The global ANN(Q,k) is used to filter the original data set, giving the candidate set KANN(Q,k)_RDD. By Theorem 2, KANN(Q,k)_RDD is guaranteed to contain the final top-k dominating result set D.
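Step 1 can be sketched in plain Python, with lists of lists standing in for Spark partitions and ordinary function calls standing in for the map, reduce, broadcast, and filter operations. Several points are assumptions: objects are represented by their distance vectors (as Deal_Data_RDD stores distances), the set distance is the sum of the distances, and the filter drops an object only when every one of the k ANN objects dominates it. The exact filter formula is not reproduced in the text; this reading is chosen because such an object has at least k dominators and therefore cannot be in the top-k.

```python
import heapq

def dominates(a, b):
    """Assumed condition on distance vectors: a <= b componentwise and a != b."""
    return all(x <= y for x, y in zip(a, b)) and a != b

def local_ann(partition, k, agg=sum):
    """ANN(Q,k) of one partition: k smallest aggregate distances."""
    return heapq.nsmallest(k, partition, key=agg)

def global_ann(partitions, k, agg=sum):
    """Merge the per-partition results; the global k nearest are among them."""
    merged = [v for part in partitions for v in local_ann(part, k, agg)]
    return heapq.nsmallest(k, merged, key=agg)

def kann_filter(partitions, ann_objects):
    """Keep every object that is not dominated by all of the ANN objects."""
    return [v for part in partitions for v in part
            if not all(dominates(a, v) for a in ann_objects)]

# Hypothetical distance vectors split over two partitions.
partitions = [[(1, 1), (5, 5), (2, 6)], [(2, 2), (6, 1), (9, 9)]]
ann_objects = global_ann(partitions, k=2)
candidates = kann_filter(partitions, ann_objects)
```

On this sample, three of the six objects are pruned before any k-skyband computation takes place.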

Step 2: Pruning with k-skyband

As shown in stage 2 of Fig. 1, because KANN(Q,k)_RDD may be very large, directly computing the dominance scores of all its objects would also be very time-consuming. The k-skyband idea is therefore used: the k-skyband of KANN(Q,k)_RDD is computed as a further pruning step, which yields the final candidate set GlobalCandidate(k-skyband). By Theorem 1, GlobalCandidate(k-skyband) is guaranteed to contain the final top-k dominating result set D.
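A per-partition k-skyband followed by a global merge can be sketched as follows, again in plain Python with lists standing in for partitions. The merge shown here recomputes the k-skyband over the union of the local results; this is one way to realize the global comparison and is an assumption, as are the function names and the sample distance vectors. It relies on dominance being transitive: an object with k or more dominators always has at least k dominators that survive the local step.

```python
def dominates(a, b):
    """Assumed condition on distance vectors: a <= b componentwise and a != b."""
    return all(x <= y for x, y in zip(a, b)) and a != b

def k_skyband(vectors, k):
    """Objects dominated by fewer than k other objects."""
    return [v for v in vectors if sum(dominates(u, v) for u in vectors) < k]

def global_k_skyband(partitions, k):
    """Local k-skyband per partition, then a k-skyband over the union."""
    local_union = [v for part in partitions for v in k_skyband(part, k)]
    return k_skyband(local_union, k)

# Hypothetical candidate distance vectors split over two partitions.
partitions = [[(1, 1), (5, 5), (2, 6)], [(2, 2), (6, 1), (9, 9)]]
global_candidates = global_k_skyband(partitions, k=2)
all_vectors = [v for part in partitions for v in part]
```

The result matches the k-skyband computed directly over all objects, while each partition only compares objects it holds locally in the first pass.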

Step 3: Obtaining the top-k dominating result set

As shown in stage 3 of Fig. 1, the Cartesian product of the candidate set and the original data set is computed to form <key, value> pairs, where the key is a candidate; the value is 1 if the candidate dominates the paired object of the original data set and 0 otherwise. Finally, the reduceByKey API is used to obtain the dominance scores of all objects in GlobalCandidate(k-skyband), and the k objects with the highest dominance scores are selected.
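The scoring stage can be emulated without Spark: `itertools.product` plays the role of the Cartesian product, and summing values per key in a dictionary plays the role of reduceByKey with addition. This is a single-machine stand-in rather than the Spark job itself; the function names, the Pareto-style dominance test on distance vectors, and the sample data are assumptions.

```python
from collections import defaultdict
from itertools import product

def dominates(a, b):
    """Assumed condition on distance vectors: a <= b componentwise and a != b."""
    return all(x <= y for x, y in zip(a, b)) and a != b

def dominance_scores(candidates, dataset):
    """Emit (candidate, 0 or 1) for every pair, then sum the values per key."""
    pairs = ((c, 1 if dominates(c, o) else 0)
             for c, o in product(candidates, dataset))
    scores = defaultdict(int)
    for key, value in pairs:      # stand-in for reduceByKey(_ + _)
        scores[key] += value
    return dict(scores)

def top_k(candidates, dataset, k):
    """The k candidates with the highest dominance scores."""
    scores = dominance_scores(candidates, dataset)
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical data: three surviving candidates scored against six objects.
dataset = [(1, 1), (5, 5), (2, 6), (2, 2), (6, 1), (9, 9)]
candidates = [(1, 1), (2, 2), (6, 1)]
scores = dominance_scores(candidates, dataset)
winners = top_k(candidates, dataset, k=2)
```

Only the candidates are scored, so the number of pairs is |candidates| × |dataset| rather than |dataset|², which is where the two pruning stages pay off.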

Example 1:

This embodiment was carried out on a 7-node Spark distributed cluster. Spark is deployed on Hadoop and uses Hadoop's YARN resource manager and the HDFS file storage system. Of the 7 nodes, the master node serves as both the driver node and a worker node, and the remaining 6 nodes are worker nodes. All algorithms are written in Scala. The basic configuration is shown in Table 2:

Table 2. Experimental environment configuration

As shown in Figs. 2 to 5, the experiments evaluate the three algorithms DSDA, DKDA, and DAKDA in the following respects: the effect of the number of partitions num on query time (to select a reasonable number of partitions), the effect of the number of returned results k on the query, the effect of the size of the query input set Q on query time, a comparison of the candidate sets of the algorithms, and the scalability of the algorithms. The default parameter settings are shown in Table 3, where the coverage c = (radius of the smallest circle covering the query input Q) / (radius of the smallest circle covering the entire data set).

Table 3. Default experimental parameter configuration

A large real data set is analyzed first: the ZILLOW data set. The original data set has 2,245,109 records; because some records have missing attribute values, those records are deleted, leaving 1,771,107 records with 5 attributes in total. The Manhattan distance is used as the distance formula in the metric space. The overall procedure is shown in Fig. 1. As shown in Fig. 6, the data set is distributed evenly across the slave nodes, each node then runs the proposed algorithm independently to obtain a candidate set, and the results are finally aggregated to obtain the top-k dominating result set.

With m = 5, Experiment 1 evaluates the performance of each algorithm as the number of returned results k varies. As shown in Fig. 2, the query time of DSDA changes markedly with k, whereas that of DAKDA changes little, which indicates that DSDA is more sensitive to k.

With k = 10, Experiment 2 evaluates the performance of each algorithm as the size m of the query set Q varies. Fig. 3 shows that the query time of DKDA increases sharply as m increases.

With k = 10 and m = 5, Experiment 3 evaluates the performance of each algorithm as the coverage c of the query set Q varies. As shown in Fig. 4, DSDA is the slowest when the coverage is large. The scalability of the method of the invention is shown in Fig. 5.

As Example 1 shows, for a given data set, and according to the user's query input and a given distance formula in the metric space, the invention proposes a parallel top-k dominating scheme suitable for large data sets. It exploits the property that the k-skyband contains the top-k dominating result set: set k-nearest-neighbor pruning is first used to obtain a candidate set, the k-skyband of the candidate set is then computed, and finally the top-k dominating result is obtained.

Compared with the traditional approach of solving top-k dominance with the skyline, and with the approach that uses the k-skyband alone, this method based on k-skyband and set ANN filters the data, reduces the number of comparisons between data objects, and speeds up the query. The invention is implemented in parallel on the Spark platform. Existing research on metric-space top-k dominating queries covers only single-machine algorithms, whereas the invention proposes a parallel algorithm that is much faster than the single-machine one, a conclusion that the results of Example 1 support. The invention also parallelizes the traditional skyline-based and k-skyband-based methods; its queries are faster, and it is applicable to large input sets and massive data sets.

Claims

1. A top-k dominating query method based on metric space in a distributed environment, characterized in that it comprises the following steps, performed in order:

(1) Given a query input data object set Q and a distance formula d() in the metric space, the distance formula d() is used to measure the distance between each data object in O and the query input data object set Q;

(2) On the basis of step (1), a parallel algorithm based on set ANN and k-skyband is proposed, the parallel algorithm being as follows:

(21) Pruning with ANN(Q,k):

Using the distance metric function d() and the query input Q, compute the distances between all data objects and the query input objects, Deal_Data_RDD, and store them in the partitions; each partition then computes its own ANN(Q,k) independently and in parallel; finally, the per-partition ANN(Q,k) results are screened through the reduce interface to obtain the global ANN(Q,k); the global ANN(Q,k) is broadcast to every node and used to filter the original data set, which yields the candidate set KANN(Q,k)_RDD, and KANN(Q,k)_RDD is guaranteed to contain the final top-k dominating result set D; wherein ANN(Q,k) refers to the k-nearest neighbors of the query set Q, and the filtering rule is that an object is kept if it is not dominated by the objects in ANN(Q,k);

(22) Pruning with k-skyband: using the k-skyband idea, the k-skyband of KANN(Q,k)_RDD is found as a further pruning step, which yields the final candidate set GlobalCandidate(k-skyband);

(23) Obtaining the top-k dominating result set:

Compute the dominance scores of all objects in GlobalCandidate(k-skyband), then select the k objects with the highest dominance scores and return them as the top-k dominating result.

2. The top-k dominating query method based on metric space in a distributed environment according to claim 1, characterized in that: in step (21), because the ANN(Q,k) of a partition is not necessarily the global ANN(Q,k), the ANN(Q,k) results of the partitions are compared with one another by distance to obtain the global ANN(Q,k).

3. The top-k dominating query method based on metric space in a distributed environment according to claim 1, characterized in that step (23) is carried out as follows: the Cartesian product of the candidate set obtained in step (22) and the original data set is computed, and the reduceByKey API provided by Spark is then used to obtain the dominance score of each candidate.