CN106055674B

CN106055674B - A metric-space-based top-k dominating query method in a distributed environment

Info

Publication number: CN106055674B
Application number: CN201610393610.1A
Authority: CN
Inventors: 何洁月; 罗浩
Original assignee: Southeast University
Current assignee: Southeast University
Priority date: 2016-06-03
Filing date: 2016-06-03
Publication date: 2019-05-31
Anticipated expiration: 2036-06-03
Also published as: CN106055674A

Abstract

The present invention discloses a metric-space-based top-k dominating query method in a distributed environment, comprising the following steps in sequence. Step 1: a query input set Q and a distance formula d() in the metric space are given, where the distance formula measures the distance between the data objects and the query input set Q. Step 2: based on Step 1, a parallel algorithm based on the set ANN and the k-skyband is proposed. By making full use of the parallel computation available across the nodes of a distributed environment, and by means of pruning and sorting, the method greatly improves the performance of metric-space-based top-k dominating queries on large data sets, speeds up queries, and supports user decision-making.

Description

A metric-space-based top-k dominating query method in a distributed environment

Technical field

The present invention relates to a query method, and in particular to a parallel metric-space-based top-k dominating query method for massive data sets in a distributed environment.

Background art

As an important kind of complex query, the metric-space-based top-k dominating query is attracting more and more attention. It returns, from a massive multidimensional data set, the part of the data that meets the user's needs. Such queries support user decision-making and are widely used in fields such as web search, multimedia retrieval, and e-commerce. The query does not require the user to specify an evaluation function, and the size of the result set is controllable: the dominating score of each object is computed, and the k objects with the highest dominating scores are returned.

The metric-space-based top-k dominating query is defined as follows. Let O = {o₁, o₂, …, o_n} denote the set of all data objects, where o_i is the i-th data object; each data object has D dimensions and is a point in the space. For a top-k dominating query in a metric space, Q denotes the query input set and d() denotes the distance formula of the metric space. This distance formula can be user-defined, for example the shortest path in a graph, the maximum flow in a network, or the Manhattan distance. k denotes the number of results with the highest dominating scores to return. Dominance is defined as follows: for o_i ∈ O and o_i' ∈ O, the symbol ≺ denotes the dominance relation between two objects; if o_i ≺ o_i', then:

∀q ∈ Q: d(o_i, q) ≤ d(o_i', q), and ∃q ∈ Q: d(o_i, q) < d(o_i', q)

Given a data object o_i ∈ O, the dominating score dscore of o_i is the number of objects in the entire data set that it dominates:

dscore(o_i) = |{o_j ∈ O | o_i ≺ o_j}|

The metric-space-based top-k dominating query only needs to return the k objects with the highest dominating scores, and is a kind of dynamic top-k dominating query. Tiakas E. et al. first proposed the concept, but studied it only in the traditional single-machine setting. As data sets grow sharply, traditional single-machine algorithms hit a performance bottleneck, and the M-tree index structure used by Tiakas E. et al. is not suitable for large data sets because it leads to a large amount of data redundancy. Research on parallel metric-space-based top-k dominating algorithms is therefore urgently needed.

Summary of the invention

Object of the invention: the object of the present invention is to overcome the deficiencies of the prior art by providing a parallel metric-space-based top-k dominating query method in a distributed environment.

Technical solution: the parallel metric-space-based top-k dominating query method in a distributed environment according to the present invention comprises the following steps, executed in sequence:

(1) A query input data object set Q and a distance formula d() in the metric space are given; the distance formula d() measures the distance between the data objects in O and the query input data object set Q;

(2) Based on step (1), a parallel algorithm based on the set ANN and the k-skyband is proposed, the details of which are as follows:

(21) Pruning with ANN(Q, k):

The distances between all data objects and the query input objects are computed according to the distance metric function d() and the query input Q, yielding Deal_Data_RDD, which is stored in the partitions. Each partition then independently computes its own ANN(Q, k) in parallel, and finally the ANN(Q, k) results of the partitions are screened through the reduce interface to obtain the global ANN(Q, k). The global ANN(Q, k) is broadcast to every node and used to filter the original data set, giving the candidate set KANN(Q, k)_RDD, which is guaranteed to contain the final top-k dominating result set D. The filtering rule keeps the objects that are not dominated by the objects in ANN(Q, k);

(22) Pruning with the k-skyband:

Because the resulting KANN(Q, k)_RDD may be very large, directly computing the dominating scores of all its objects would still be very time-consuming. The k-skyband idea is therefore used: the k-skyband of KANN(Q, k)_RDD is computed as a further pruning step, giving the final candidate set GlobalCandidate(k-skyband);

(23) Obtaining the top-k dominating result:

The dominating scores of all objects in GlobalCandidate(k-skyband) are computed, and the k objects with the highest dominating scores are returned as the top-k dominating result.

Further, in step (21), because the ANN(Q, k) of an individual partition is not necessarily the global ANN(Q, k), the distances in the ANN(Q, k) results of the partitions must be compared one by one to obtain the global ANN(Q, k).
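The merge of per-partition results into a global ANN(Q, k) can be illustrated with a minimal single-machine sketch. It is not the Spark implementation of the invention: partitions are modelled as plain Python lists, each entry is an assumed (name, aggregate distance) pair, and the function names `local_ann` and `global_ann` are hypothetical.

```python
import heapq

def local_ann(partition, k):
    """Return the k entries of one partition with the smallest aggregate distance."""
    return heapq.nsmallest(k, partition, key=lambda entry: entry[1])

def global_ann(partitions, k):
    """Merge the per-partition candidates and keep the k globally nearest entries."""
    merged = [entry for part in partitions for entry in local_ann(part, k)]
    return heapq.nsmallest(k, merged, key=lambda entry: entry[1])

# Illustrative data: (object name, aggregate distance to Q), split over 3 partitions.
partitions = [
    [("a", 4.0), ("b", 1.5), ("c", 9.0)],
    [("d", 0.5), ("e", 7.0)],
    [("f", 2.0), ("g", 3.0), ("h", 8.0)],
]
result = global_ann(partitions, 3)
print(result)
```

Each partition contributes at most k entries, so the final comparison runs over at most k times the number of partitions rather than over the whole data set.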

Further, step (23) is carried out as follows: a Cartesian product of the candidate set obtained in step (22) and the original data set is computed, and the reduceByKey API provided by Spark is then used to obtain the dominating score of each candidate.
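The scoring step described here can be sketched without Spark. The sketch below stands in for the Cartesian product with `itertools.product` and for the reduce-by-key sum with a `Counter`; the data layout (a dict from object name to its vector of distances to the query objects) and the helper names are assumptions made for illustration only.

```python
from collections import Counter
from itertools import product

def dominates(u, v):
    """u dominates v: no distance is larger and at least one is strictly smaller."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))

def dscores(candidates, dataset):
    """Sum, per candidate key, a 1 for every data object the candidate dominates."""
    scores = Counter({name: 0 for name in candidates})
    for (c_name, c_vec), (_, o_vec) in product(candidates.items(), dataset.items()):
        scores[c_name] += 1 if dominates(c_vec, o_vec) else 0
    return dict(scores)

# Illustrative distance vectors (one distance per query object).
dataset = {"o1": (1.0, 1.0), "o2": (0.5, 3.0), "o3": (2.0, 2.0), "o4": (3.0, 4.0)}
candidates = {name: dataset[name] for name in ("o1", "o2")}
scores = dscores(candidates, dataset)
print(scores)
```

Only the candidates act as keys, so the number of pairs is the candidate count times the data set size, which is why the earlier pruning steps matter.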

Beneficial effects: the present invention provides a metric-space-based top-k dominating query in a distributed environment and proposes three distributed algorithms for solving it. By making full use of the parallel computation available across the nodes of a distributed environment, and by means of pruning and sorting, it greatly improves the performance of metric-space-based top-k dominating queries on large data sets, speeds up queries, and supports user decision-making. Specifically, it has the following advantages:

(1) A parallel skyline computation method is proposed, in which all partitions compute their skylines at the same time, so that the skyline can be solved quickly and the top-k dominating result set obtained from it;

(2) A parallel k-skyband computation method is proposed, in which each partition solves its k-skyband independently, without affecting the others; thanks to the properties of the k-skyband, the result is obtained without iteration;

(3) A method is proposed that first prunes with the set ANN and then computes the k-skyband in parallel. The effective pruning reduces the number of comparisons between data objects and thereby speeds up queries.

Brief description of the drawings

Fig. 1 is a flow chart of the DAKDA algorithm of the present invention;

Fig. 2 is a schematic diagram of the influence of k on the query in the embodiment;

Fig. 3 is a schematic diagram of the influence of m on the query in the embodiment;

Fig. 4 is a schematic diagram of the influence of c on the query in the embodiment;

Fig. 5 is a scalability comparison of the algorithms of the present invention;

Fig. 6 is a diagram of the distributed processing of the present invention;

Fig. 7 is an example diagram of the present invention.

Detailed description of the embodiments

The technical solution of the present invention is described in detail below, but the scope of protection of the present invention is not limited to the embodiment.

The symbols and parameters used below are defined in Table 1:

Table 1 Symbol description

Definition 1 (KNN(o, k)): Given a data set O and a metric function d(), for an object o ∈ O, the k-nearest neighbors of o are KNN(o, k), that is, the k objects nearest to o.

Definition 2 (ANN(Q, k)): Given a data set O and a metric function d(), let Q = {q₁, q₂, …, q_m} denote a set of query input objects. ANN(Q, k) denotes the k objects nearest to Q. The choice of aggregate distance function affects the query; common aggregate distance functions are the minimum, the maximum, and the average.
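To make the role of the aggregate distance function concrete, the following minimal sketch computes ANN(Q, k) under different aggregates. The Manhattan distance, the sample points, and the function names are illustrative assumptions; the sum is used in place of the average because both give the same ordering for a fixed Q.

```python
import heapq

def manhattan(p, q):
    """Manhattan distance between two points given as tuples."""
    return sum(abs(a - b) for a, b in zip(p, q))

def ann(objects, queries, k, dist=manhattan, agg=sum):
    """Names of the k objects with the smallest aggregate distance to the query set."""
    def aggregate_distance(item):
        _, point = item
        return agg(dist(point, q) for q in queries)
    nearest = heapq.nsmallest(k, objects.items(), key=aggregate_distance)
    return [name for name, _ in nearest]

objects = {"o1": (1, 1), "o2": (5, 5), "o3": (4, 0), "o4": (9, 9), "o5": (3, 2)}
queries = [(0, 0), (2, 2)]

print(ann(objects, queries, 2, agg=sum))  # ordering by total distance
print(ann(objects, queries, 2, agg=min))  # ordering by distance to the closest query object
print(ann(objects, queries, 2, agg=max))  # ordering by distance to the farthest query object
```

On this data the three aggregates return different object lists, which illustrates why the choice of aggregate affects the query.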

Definition 3 (dominance in a metric space): Let (O, d()) be a metric space and let Q = {q₁, q₂, …, q_m} denote a set of query input objects. For an object o ∈ O, the set of its distances to all objects in Q is:

Adist(o, Q) = {d(o, q₁), d(o, q₂), …, d(o, q_m)}

For an object p ∈ O, if o ≺ p, then:

∀q_j ∈ Q: d(o, q_j) ≤ d(p, q_j), and ∃q_j ∈ Q: d(o, q_j) < d(p, q_j)

Dominance is thus determined by comparing distances.
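A small sketch of the dominance test may help. It assumes the usual reading of metric-space dominance (o is no farther than p from every query object and strictly closer to at least one) and uses the Euclidean distance from the standard library; the names `adist` and `dominates` and the sample points are hypothetical.

```python
import math

def adist(o, queries, dist=math.dist):
    """Distances from object o to every query object, in query order."""
    return tuple(dist(o, q) for q in queries)

def dominates(o, p, queries, dist=math.dist):
    """True if o dominates p with respect to the query set."""
    do, dp = adist(o, queries, dist), adist(p, queries, dist)
    no_worse = all(a <= b for a, b in zip(do, dp))
    strictly_better = any(a < b for a, b in zip(do, dp))
    return no_worse and strictly_better

queries = [(0, 0), (4, 0)]
o, p, r = (2, 0), (2, 3), (1, 0)

print(dominates(o, p, queries))  # o is closer than p to both query objects
print(dominates(o, r, queries))  # r is closer to (0, 0), so o does not dominate r
print(dominates(r, o, queries))  # o is closer to (4, 0), so r does not dominate o
```

The last two calls show that two objects can be incomparable: neither dominates the other.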

Definition 4 (metric-based top-k dominating query): A set of query inputs Q and a distance metric function d() are given. According to the dominance relation in the metric space, for a data object o_i ∈ O, the dominating score of o_i is:

dscore(o_i) = |{p ∈ O | o_i ≺ p}|, where ≺ is the dominance relation of Definition 3. The k objects with the highest dominating scores are returned; they form the result set of the metric-space-based top-k dominating query.
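As a reference point, the definition can be evaluated directly by brute force. The sketch below is a quadratic single-machine baseline, not one of the distributed algorithms of the invention; the Euclidean distance, the sample points, and the tie-breaking by input order are assumptions.

```python
import math

def dominates(u, v):
    """u, v are distance vectors; u dominates v if never larger and once smaller."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))

def top_k_dominating(points, queries, k, dist=math.dist):
    """Return the k (point, dscore) pairs with the highest dominating scores."""
    vectors = [tuple(dist(p, q) for q in queries) for p in points]
    scored = [
        (sum(dominates(vectors[i], v) for v in vectors), i)
        for i in range(len(points))
    ]
    scored.sort(key=lambda s: (-s[0], s[1]))  # highest score first, then input order
    return [(points[i], score) for score, i in scored[:k]]

queries = [(0, 0), (4, 0)]
points = [(2, 0), (1, 0), (2, 3), (6, 0), (2, 5)]
result = top_k_dominating(points, queries, 2)
print(result)
```

Every object is compared with every other object, which is the cost that the pruning steps of the invention aim to avoid.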

As shown in Fig. 7, consider the metric-space-based top-k dominating query of this embodiment. The query input is Q = {q₁, q₂} and the distance metric function d() is the Euclidean distance. The top-1 dominating result is o₁, because the distances from o₁ to q₁ and q₂ are smaller than those of all points outside (or on) the circles, and only o₂ is not dominated by o₁ (because the distance from o₂ to q₁ is smaller than the distance from o₁ to q₁). If the space contains n data objects, o₁ dominates every object except itself and o₂, so dscore(o₁) = n-2, whereas o₂ dominates neither itself nor o₁ nor o₃, so dscore(o₂) ≤ n-3. Hence dscore(o₁) > dscore(o₂), and the top-1 dominating result is o₁.

Definition 5 (k-skyband): Over the entire data space, consider each object o ∈ O that is dominated by at most k-1 objects; the set of all such objects o is the k-skyband.
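The definition translates into a short filter. The following sketch counts the dominators of each object and keeps those with at most k-1; the input layout (a dict from name to distance vector) and the sample vectors are illustrative assumptions.

```python
def dominates(u, v):
    """u dominates v: no component larger and at least one strictly smaller."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))

def k_skyband(vectors, k):
    """Names of the objects dominated by at most k-1 other objects."""
    band = []
    for name, v in vectors.items():
        dominators = sum(dominates(u, v) for u in vectors.values())
        if dominators <= k - 1:
            band.append(name)
    return band

# Distance vectors to two query objects (illustrative values).
vectors = {"a": (2, 2), "b": (1, 3), "c": (4, 4), "d": (6, 2), "e": (5, 5)}

print(k_skyband(vectors, 1))  # the 1-skyband is the skyline
print(k_skyband(vectors, 2))
print(k_skyband(vectors, 3))
```

The bands are nested: each k-skyband contains the (k-1)-skyband, and the 1-skyband is the ordinary skyline.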

Theorem 1: The top-k dominating result set D ⊆ k-skyband.

Proof: By contradiction. Suppose there is an object o₁ ∈ D that is dominated by more than k-1 objects. Then there are at least k objects whose dominating score satisfies dscore ≥ o₁.dscore + 1, so o₁ ∉ D, a contradiction. Therefore the top-k dominating result set D ⊆ k-skyband. Q.E.D.

Theorem 2: Given the query input set Q, let {o₁, o₂, …, o_k} ⊆ O be the k objects of ANN(Q, k). The objects of O that these objects do not dominate (⊀ denotes "does not dominate") form the set KANN(Q, k), which includes the objects of ANN(Q, k) themselves. The top-k dominating result set D ⊆ KANN(Q, k).

Proof: Let o be the 1-nearest-neighbor object ANN(Q, 1) of the query set Q. Because all objects outside 1ANN(Q, 1) are dominated by o, the top-1 dominating object must lie in 1ANN(Q, 1). If the top-1 dominating object is not o, then by the above the object with the second-highest dominating score must also lie in 1ANN(Q, 1); if the top-1 dominating object is o, then the object with the second-highest dominating score must lie in 2ANN(Q, 2). Continuing in the same way, the top-k dominating result set D ⊆ KANN(Q, k). Q.E.D.

All algorithms below are implemented on the Spark platform:

(1) Skyline-based top-k dominating algorithm (DSDA)

In the existing DSDA, the data set is first randomly assigned to the nodes. The skyline algorithm is then implemented inside Spark's mapPartitions interface, so that the skyline of each partition is obtained. Finally, the skylines of the partitions are compared pairwise to obtain the global skyline, and the object with the highest dominating score in the skyline is returned as a top-k dominating result. Repeating this loop k times gives the final result set.
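The k-round loop of the skyline approach can be sketched on a single machine as follows. This is a simplified illustration under stated assumptions, not the Spark code of DSDA: partitioning is omitted, scores are computed against the full data set, and the selected object is removed before the next round; all names and values are hypothetical.

```python
def dominates(u, v):
    """u dominates v: no component larger and at least one strictly smaller."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))

def skyline(names, vectors):
    """Objects among `names` that no other object in `names` dominates."""
    return [n for n in names
            if not any(dominates(vectors[m], vectors[n]) for m in names)]

def dsda_sketch(vectors, k):
    """Pick the best-scoring skyline object, remove it, and repeat k times."""
    def score(name):
        return sum(dominates(vectors[name], v) for v in vectors.values())
    remaining = list(vectors)
    result = []
    for _ in range(min(k, len(remaining))):
        best = max(skyline(remaining, vectors), key=score)
        result.append((best, score(best)))
        remaining.remove(best)
    return result

vectors = {"a": (2, 2), "b": (1, 3), "c": (4, 4), "d": (6, 2), "e": (5, 5)}
result = dsda_sketch(vectors, 3)
print(result)
```

The skyline is recomputed in every round, which is why the running time of this approach grows with k.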

(2) k-skyband-based top-k dominating algorithm (DKDA)

The existing DKDA parallelizes the algorithm on a Spark cluster; the idea of the parallel algorithm is similar to that of the skyline approach. By Theorem 1, the top-k dominating result set D ⊆ k-skyband, so the k-skyband is computed first, and the k objects with the highest dominating scores in the k-skyband are then returned as the top-k dominating result set.

The data set is first randomly assigned to the nodes. The k-skyband algorithm is then implemented inside Spark's mapPartitions interface, so that the k-skyband of each partition is obtained. Finally, the k-skybands of the partitions are compared pairwise to obtain the global k-skyband, and the objects with the highest dominating scores in the k-skyband are returned as the top-k dominating result set. Compared with the skyline method, this method has the advantage of not needing k iterations, but computing the k-skyband of the original data set is very time-consuming.
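The two-level computation (local k-skyband per partition, then a k-skyband over the merged local results) can be checked on a small example. In the sketch below partitions are plain lists of (name, distance vector) pairs and the function names are hypothetical; it models the partition step with an ordinary loop rather than Spark's mapPartitions.

```python
def dominates(u, v):
    """u dominates v: no component larger and at least one strictly smaller."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))

def k_skyband(items, k):
    """Keep the (name, vector) items dominated by at most k-1 items of the list."""
    return [(n, v) for n, v in items
            if sum(dominates(u, v) for _, u in items) <= k - 1]

def partitioned_k_skyband(partitions, k):
    """Local k-skyband in each partition, then a k-skyband over the merged survivors."""
    merged = [item for part in partitions for item in k_skyband(part, k)]
    return k_skyband(merged, k)

partitions = [
    [("a", (2, 2)), ("c", (4, 4)), ("e", (5, 5))],
    [("b", (1, 3)), ("d", (6, 2))],
]
everything = [item for part in partitions for item in part]

band = partitioned_k_skyband(partitions, 2)
print(sorted(name for name, _ in band))
print(sorted(name for name, _ in k_skyband(everything, 2)))
```

On this data both calls give the same set: object c survives its own partition but is removed in the merge step, once its second dominator b from the other partition is seen.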

(3) Parallel top-k dominating algorithm based on set-ANN pruning and the k-skyband (DAKDA)

Algorithm 1 needs k iterations, so its query time grows with k, and algorithm 2 spends a great deal of time computing the k-skyband of the original data set; the present invention therefore prunes the data first.

In the present invention, by Theorem 1 the top-k dominating result set D ⊆ k-skyband, and by Theorem 2 the top-k dominating result set D ⊆ KANN(Q, k). Because computing the k-skyband is more time-consuming than computing KANN(Q, k), the set ANN is first used to remove data that cannot be candidates, giving the candidate set KANN(Q, k); the k-skyband of KANN(Q, k) is then computed; and finally the k results with the highest dominating scores in the k-skyband are returned as the top-k dominating result. The steps are shown in Fig. 1:

Step 1: Pruning with ANN(Q, k)

As shown in stage one of Fig. 1, the data are processed according to the distance metric function d() and the query input Q to obtain the distances between each object and the query objects, Deal_Data_RDD, which is stored in the partitions. The ANN(Q, k) of each partition is then computed, and finally the global ANN(Q, k) is obtained. The global ANN(Q, k) is used to filter the original data set, giving the candidate set KANN(Q, k)_RDD. By Theorem 2, KANN(Q, k)_RDD is guaranteed to contain the final top-k dominating result set D.

Step 2: Pruning with the k-skyband

As shown in stage two of Fig. 1, because the resulting KANN(Q, k)_RDD may be very large, directly computing the dominating scores of all its objects would still be very time-consuming. The k-skyband idea is therefore used: the k-skyband of KANN(Q, k)_RDD is computed as a further pruning step, giving the final candidate set GlobalCandidate(k-skyband). By Theorem 1, GlobalCandidate(k-skyband) is guaranteed to contain the final top-k dominating result set D.

Step 3: Obtaining the top-k dominating result set

As shown in stage three of Fig. 1, a Cartesian product of the candidate set and the original data set is computed, forming pairs of the form <key, value>, where key is a candidate object and value is 1 if the candidate dominates the paired object of the original data set and 0 otherwise. Finally, the reduceByKey API is used to obtain the dominating scores of all objects in GlobalCandidate(k-skyband), and the k objects with the highest dominating scores are selected.
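The three stages can be put together in a compact single-machine sketch. It is only an illustration under explicit assumptions: Spark partitions, broadcast, and reduceByKey are replaced by ordinary Python loops; the aggregate distance is the sum; and the ANN filter uses a conservative reading of the filtering rule, dropping an object only when all k objects of ANN(Q, k) dominate it (such an object has at least k dominators and so cannot be in the result). The function name `dakda_sketch` and the data are hypothetical.

```python
import heapq

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def dominates(u, v):
    """u dominates v: no component larger and at least one strictly smaller."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))

def dakda_sketch(points, queries, k, dist=manhattan, agg=sum):
    # Distance vector of every object to the query objects.
    vectors = {n: tuple(dist(p, q) for q in queries) for n, p in points.items()}
    # Stage 1: ANN(Q, k) by aggregate distance, then filter the data set with it.
    # Assumption: an object is dropped only if every ANN object dominates it.
    ann = heapq.nsmallest(k, vectors, key=lambda n: agg(vectors[n]))
    kann = [n for n in vectors
            if not all(dominates(vectors[a], vectors[n]) for a in ann)]
    # Stage 2: k-skyband of the candidate set.
    band = [n for n in kann
            if sum(dominates(vectors[m], vectors[n]) for m in kann) <= k - 1]
    # Stage 3: dominating score of each candidate against the full data set.
    scores = {n: sum(dominates(vectors[n], v) for v in vectors.values())
              for n in band}
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:k]

queries = [(0, 0), (4, 0)]
points = {"p1": (2, 0), "p2": (1, 0), "p3": (2, 2), "p4": (6, 0), "p5": (2, 3)}
result = dakda_sketch(points, queries, 2)
print(result)
```

On this data stage 1 removes p3 and p5, so only three of the five objects are scored against the data set in stage 3.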

Embodiment 1:

This embodiment was carried out on a Spark distributed cluster of 7 nodes. Spark is built on Hadoop and uses Hadoop's YARN resource manager and the HDFS file storage system. Of the 7 nodes, the master node serves both as the driver node and as a worker node, and the remaining 6 nodes are worker nodes. All algorithms are written in Scala. The basic configuration is given in Table 2:

Table 2 Experimental environment configuration

As shown in Figs. 2 to 5, the experiments evaluate the three algorithms DSDA, DKDA, and DAKDA in the following respects: the influence of the number of partitions num on query time (to choose a reasonable number of partitions), the influence of the number of returned results k on the query, the influence of the size of the query input set Q on query time, a comparison of the candidate sets of the algorithms, and the scalability of the algorithms. The default parameter settings of the experiments are given in Table 3, where the coverage c = (radius of the smallest circle covering the query input Q) / (radius of the smallest circle covering the entire data set).

Table 3 Default experimental parameter configuration

A large real-world data set is analyzed first: the ZILLOW data set. The original data set has 2245109 records; because some records have missing attribute values, those records are deleted, leaving 1771107 records with 5 attributes in total. The Manhattan distance is used as the distance formula of the metric space. The detailed process is shown in Fig. 1. As shown in Fig. 6, the data set is evenly distributed to the slave nodes; each node then independently executes the algorithm described above to obtain a candidate set; and the candidate sets are finally aggregated to obtain the top-k dominating result set.

With m = 5, Experiment 1 evaluates how the performance of each algorithm changes with the number of returned results k. As shown in Fig. 2, the DSDA algorithm changes markedly with k while the DAKDA algorithm changes little, which shows that DSDA is more sensitive to k.

With k = 10, Experiment 2 evaluates how the performance of each algorithm changes with the size m of the query set Q. Fig. 3 shows that the query time of the DKDA algorithm increases sharply as m increases.

With k = 10 and m = 5, Experiment 3 evaluates how the performance of each algorithm changes with the coverage c of the query set Q. As shown in Fig. 4, the DSDA algorithm is the slowest when the coverage is large. The scalability of the method of the present invention is shown in Fig. 5.

Embodiment 1 above shows that, for a given data set, the present invention proposes a parallel top-k dominating scheme suitable for large data sets, based on the user's query input and the distance formula of the given metric space. Using the property that the k-skyband result set contains the top-k dominating result set, it first prunes with the set k-nearest neighbors to obtain a candidate set, then computes the k-skyband of the candidate set, and finally solves for the top-k dominating result.

Compared with the traditional approach of solving the top-k dominating query with the skyline, and with the approach of solving it with the k-skyband alone, this method based on the k-skyband and the set ANN screens the data, reduces the number of comparisons between data objects, and speeds up queries. The present invention is implemented in parallel on the Spark platform. Existing research on metric-space-based top-k dominating queries is limited to single-machine algorithms, whereas the present invention proposes a parallel algorithm that is far faster than a single machine, as the results of Embodiment 1 confirm. The present invention thus parallelizes the traditional skyline-based and k-skyband-based methods, gives faster queries, and is suitable for both large input sets and massive data sets.

Claims

1. A metric-space-based top-k dominating query method in a distributed environment, characterized in that it comprises the following steps, executed in sequence:

(1) a query input data object set Q and a distance formula d() in the metric space are given, the distance formula d() being used to measure the distance between the data objects in O and the query input data object set Q;

(2) based on step (1), a parallel algorithm based on the set ANN and the k-skyband is proposed, the details of which are as follows:

(21) pruning with ANN(Q, k):

the distances between all data objects and the query input objects are computed according to the distance metric function d() and the query input Q, yielding Deal_Data_RDD, which is stored in the partitions; each partition then independently computes its own ANN(Q, k) in parallel, and finally the ANN(Q, k) results of the partitions are screened through the reduce interface to obtain the global ANN(Q, k); the global ANN(Q, k) is broadcast to every node and used to filter the original data set, giving the candidate set KANN(Q, k)_RDD, which is guaranteed to contain the final top-k dominating result set D;

(22) pruning with the k-skyband: using the k-skyband idea, the k-skyband of KANN(Q, k)_RDD is computed as a further pruning step, giving the final candidate set GlobalCandidate(k-skyband);

(23) obtaining the top-k dominating result set:

the dominating scores of all objects in GlobalCandidate(k-skyband) are computed, and the k objects with the highest dominating scores are returned as the top-k dominating result;

wherein KNN(q, k) denotes the k-nearest neighbors of a data object q, that is, the k objects nearest to q; and ANN(Q, k) denotes the k-nearest neighbors of the query set Q, that is, the k objects nearest to Q.

2. The metric-space-based top-k dominating query method in a distributed environment according to claim 1, characterized in that: in step (21), because the ANN(Q, k) of an individual partition is not necessarily the global ANN(Q, k), the distances in the ANN(Q, k) results of the partitions must be compared one by one to obtain the global ANN(Q, k).

3. The metric-space-based top-k dominating query method in a distributed environment according to claim 1, characterized in that step (23) is carried out as follows: a Cartesian product of the candidate set obtained in step (22) and the original data set is computed, and the reduceByKey API provided by Spark is then used to obtain the dominating score of each candidate.