CN106055674A - Top-k dominating query method based on metric space in a distributed environment - Google Patents


Info

Publication number
CN106055674A
Authority
CN
China
Prior art keywords
ann
skyband
domination
distance
metric space
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610393610.1A
Other languages
Chinese (zh)
Other versions
CN106055674B (en)
Inventor
何洁月
罗浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2016-06-03
Filing date
2016-06-03
Publication date
2016-10-26
Application filed by Southeast University
Priority to CN201610393610.1A
Publication of CN106055674A
Application granted
Publication of CN106055674B

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G06F16/24532 Query optimisation of parallel queries

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a top-k dominating query method based on a metric space in a distributed environment. The method comprises the following steps in order: (1) a query input set Q and a distance function d() in the metric space are given, where d() measures the distance between the data objects of the data set and the query objects in Q; and (2) based on step (1), a parallel algorithm built on the set ANN(Q, k) and the k-skyband is provided. The method exploits parallel computation across the nodes of the distributed environment; pruning and sorting substantially improve the performance of metric-space top-k dominating queries on large data sets, speed up queries, and support user decision-making.

Description

Top-k dominating query method based on metric space in a distributed environment
Technical field
The present invention relates to a query method, and in particular to a parallel top-k dominating query method based on metric space in a distributed environment with massive data sets.
Background
The top-k dominating query based on metric space is an important type of complex query that has attracted increasing attention. It returns, from a massive multidimensional data set, a subset of the data that meets the user's needs. Such queries support user decision-making and are widely used in fields such as web search, multimedia retrieval, and e-commerce. The query does not require the user to supply a scoring function to control the result set: it computes the domination score of each object and returns the k objects with the highest domination scores.
The top-k dominating query based on metric space is defined as follows. Let $O = \{o_1, o_2, \dots, o_n\}$ denote the set of all data objects, where $o_i$ is the i-th data object; each data object has D dimensions and is a point in the space. For a metric-space top-k dominating query, Q denotes the query input set and d() denotes the distance function in the metric space. The distance function can be user-defined, for example the shortest path in a graph, the maximum flow in a network, or the Manhattan distance. k is the number of highest-scoring results to return. Dominance is defined as follows: for $o_i \in O$ and $o_{i'} \in O$, the symbol $\prec$ denotes the dominance relation between two objects, and $o_i \prec o_{i'}$ holds when:
$$\forall q \in Q:\ d(o_i, q) \le d(o_{i'}, q) \quad \text{and} \quad \exists q \in Q:\ d(o_i, q) < d(o_{i'}, q)$$
Given a data object $o_i \in O$, its domination score dscore is the number of objects in the whole data set that it dominates:
$$dscore(o_i) = |\{\, o_j \in O \mid o_i \prec o_j \,\}|$$
A metric-space top-k dominating query only needs to return the k elements with the highest domination scores; it is a dynamic top-k dominating query. Tiakas E. et al. first proposed this concept, but their work addresses only the traditional single-machine setting. As data sets grow sharply, traditional single-machine algorithms hit performance bottlenecks, and the M-tree index structure used by Tiakas E. et al. is poorly suited to large data sets because it causes substantial data redundancy. Research on parallel top-k dominating algorithms based on metric space is therefore urgently needed.
Summary of the invention
Object of the invention: the object of the invention is to overcome the deficiencies of the prior art by providing a parallel top-k dominating query method based on metric space in a distributed environment.
Technical solution: the parallel top-k dominating query method based on metric space in a distributed environment according to the present invention comprises the following steps, performed in order:
(1) A query input data object set Q and a distance function d() in the metric space are given; d() measures the distance between the objects of the data set O and the query input data object set Q;
(2) Based on step (1), a parallel algorithm built on the set ANN and the k-skyband is provided. The parallel algorithm comprises:
(21) Pruning with ANN(Q, k):
Using the distance function d() and the query input Q, compute the distances between all data objects and the query input objects, and store them in each partition as Deal_Data_RDD. Each partition then independently computes, in parallel, the ANN(Q, k) of that partition. The per-partition ANN(Q, k) results are screened through the reduce interface to obtain the global ANN(Q, k). The global ANN(Q, k) is broadcast to every node and used to filter the original data set, which yields the candidate set KANN(Q, k)_RDD. KANN(Q, k)_RDD necessarily contains the final top-k dominating result set D. The filter rule keeps the objects that are not dominated by the objects in ANN(Q, k);
(22) Pruning with the k-skyband:
Because KANN(Q, k)_RDD may be very large, directly computing the domination scores of all its objects would still be time-consuming. The k-skyband idea is therefore applied: the k-skyband of KANN(Q, k)_RDD is computed as a further pruning step, which yields the final candidate set GlobalCandidate(k-skyband);
(23) Obtaining the top-k dominating result:
Compute the domination scores of all objects in GlobalCandidate(k-skyband), then find the k objects with the highest domination scores and return them as the top-k dominating result.
Further, in step (21), because the ANN(Q, k) of a partition is not necessarily the global ANN(Q, k), the distances in the ANN(Q, k) results of the partitions are compared one by one to obtain the global ANN(Q, k).
Further, step (23) is carried out as follows: the candidate set obtained in step (22) and the original data set are combined by a Cartesian product, and the reduceByKey API provided by Spark is then used to obtain the domination score of each candidate.
Beneficial effects: the present invention provides metric-space top-k dominating queries in a distributed environment and proposes three distributed algorithms for solving them. By making full use of parallel computation across the nodes of the distributed environment, and through pruning and sorting, it substantially improves the performance of metric-space top-k dominating queries on large data sets, speeds up queries, and supports user decision-making. Specifically, it has the following advantages:
(1) A parallel skyline computation method is proposed in which all partitions compute their skylines simultaneously, so the skyline, and from it the top-k dominating result set, can be obtained quickly;
(2) A parallel k-skyband computation method is proposed in which each partition computes its k-skyband independently, without interfering with the others; thanks to the properties of the k-skyband, the result is obtained without iteration;
(3) A method is proposed that first prunes with the set ANN and then computes the k-skyband in parallel. The effective pruning reduces the number of comparisons between data objects and thus speeds up queries.
Brief description of the drawings
Fig. 1 is a flow chart of the DAKDA algorithm of the present invention;
Fig. 2 is a schematic diagram of the effect of k on the query in the embodiment;
Fig. 3 is a schematic diagram of the effect of m on the query in the embodiment;
Fig. 4 is a schematic diagram of the effect of c on the query in the embodiment;
Fig. 5 is a comparison of the scalability of the algorithms in the present invention;
Fig. 6 is a diagram of the distributed processing of the present invention;
Fig. 7 is an example diagram of the present invention.
Detailed description of the embodiments
The technical solution of the present invention is described in detail below, but the scope of protection of the present invention is not limited to the described embodiments.
The symbols and parameters used in the present invention are defined in Table 1:
Table 1: Description of symbols
Definition 1 (KNN(o, k)): given a data set O, a metric function d(), and an object o ∈ O, the k nearest neighbors of o are denoted KNN(o, k), that is, the k objects closest to o.
Definition 2 (ANN(Q, k)): given a data set O, a metric function d(), and a set of query input objects $Q = \{q_1, q_2, \dots, q_m\}$, ANN(Q, k) denotes the k objects closest to Q. The choice of aggregate distance function affects the query; common aggregate distance functions include the minimum, the maximum, and the mean.
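As a hedged illustration of Definition 2, the following plain-Python sketch ranks objects by an aggregate of their distances to the query objects and keeps the k smallest. The names `aggregate_distance` and `ann`, the sample points, and the use of Euclidean distance are assumptions made for this example; the text does not fix which aggregate function is used.

```python
import heapq
import math
from statistics import fmean

def aggregate_distance(o, Q, d=math.dist, agg=fmean):
    """Collapse the distances from o to every query object into one value."""
    return agg([d(o, q) for q in Q])

def ann(O, Q, k, d=math.dist, agg=fmean):
    """Return the k objects of O with the smallest aggregate distance to Q."""
    return heapq.nsmallest(k, O, key=lambda o: aggregate_distance(o, Q, d, agg))

# Hypothetical sample data: five 2-D points and two query objects.
O = [(0, 0), (1, 1), (5, 5), (2, 0), (9, 9)]
Q = [(0, 0), (2, 2)]

nearest_by_mean = ann(O, Q, k=2)           # mean of the distances
nearest_by_max = ann(O, Q, k=2, agg=max)   # largest of the distances
```

On this data the two aggregates select different neighbors, which is consistent with the remark that the choice of aggregate distance function affects the query.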
Definition 3 (dominance in metric space): let (O, d()) be a metric space and let $Q = \{q_1, q_2, \dots, q_m\}$ be a set of query input objects. For an object o ∈ O, the set of its distances to all query objects in Q is:
$$adist(o, Q) = \{\, d(o, q_1), d(o, q_2), \dots, d(o, q_m) \,\}$$
For an object p ∈ O, $o \prec p$ holds when:
$$\forall q \in Q:\ d(o, q) \le d(p, q) \quad \text{and} \quad \exists q \in Q:\ d(o, q) < d(p, q)$$
That is, dominance is determined by comparing distances.
Definition 4 (metric-based top-k dominating query): given a set of query inputs Q and a distance metric function d(), and using the dominance relation in metric space, the domination score of a data object o ∈ O is:
$$dscore(o) = |\{\, p \in O \mid o \prec p \,\}|$$
where $\prec$ is the dominance relation of Definition 3. The k objects with the highest domination scores form the result set of the top-k dominating query based on metric space.
As shown in Fig. 7, consider the metric-space top-k dominating query of this embodiment. The query input is $Q = \{q_1, q_2\}$ and the distance metric function d() is the Euclidean distance. The top-1 dominating result is $o_1$: the distances from $o_1$ to $q_1$ and $q_2$ are smaller than those of all points outside the circle (including points on the circle), and only $o_2$ is not dominated by $o_1$ (because the distance from $o_2$ to $q_1$ is smaller than the distance from $o_1$ to $q_1$). If the space contains n data objects, the domination score of $o_1$ is $dscore(o_1) = n - 1$. Since $o_2$ dominates neither $o_1$ nor $o_3$, its domination score satisfies $dscore(o_2) \le n - 2$. Hence $dscore(o_1) > dscore(o_2)$, and the top-1 dominating object is $o_1$.
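Definitions 3 and 4 can be sketched in a few lines of plain Python. This is a brute-force illustration, not the patented parallel method. It assumes the usual reading of dominance (o is never farther than p from any query object and is strictly closer to at least one); the function names and the sample points are hypothetical and do not reproduce Fig. 7.

```python
import math

def dominates(o, p, Q, d=math.dist):
    """True if o is never farther than p from a query object and strictly closer to one."""
    pairs = [(d(o, q), d(p, q)) for q in Q]
    return all(a <= b for a, b in pairs) and any(a < b for a, b in pairs)

def dscore(o, O, Q, d=math.dist):
    """Domination score: how many objects of O are dominated by o."""
    return sum(1 for p in O if dominates(o, p, Q, d))

def top_k_dominating(O, Q, k, d=math.dist):
    """Brute-force baseline: score every object and keep the k best."""
    return sorted(O, key=lambda o: dscore(o, O, Q, d), reverse=True)[:k]

# Hypothetical data: two query objects and five data objects.
Q = [(0, 0), (4, 0)]
O = [(2, 0), (1, 0), (2, 3), (6, 0), (2, -5)]

scores = {o: dscore(o, O, Q) for o in O}
best = top_k_dominating(O, Q, k=1)
```

Here `(2, 0)` does not dominate `(1, 0)`, because `(1, 0)` is closer to the first query object, which mirrors the relationship between $o_1$ and $o_2$ in the example above.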
Definition 5 (k-skyband): the k-skyband of the data set is the set of all objects o ∈ O that are dominated by at most k-1 other objects.
Theorem 1: the top-k dominating result set D satisfies $D \subseteq$ k-skyband.
Proof. By contradiction: suppose there is an object $o_1 \in D$ such that the number of objects dominating $o_1$ is greater than k-1. Then there are at least k objects whose domination score satisfies $dscore \ge dscore(o_1) + 1$, so $o_1 \notin D$, a contradiction. Hence $D \subseteq$ k-skyband. Q.E.D.
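To make Definition 5 and Theorem 1 concrete, the sketch below computes a k-skyband by counting dominators and checks, on a small hypothetical data set, that the brute-force top-2 result lies inside the 2-skyband. The helper names and the data are assumptions of this example, and the quadratic counting loop is chosen for clarity rather than speed.

```python
import math

def dominates(o, p, Q, d=math.dist):
    """o dominates p: never farther from any query object, strictly closer to one."""
    pairs = [(d(o, q), d(p, q)) for q in Q]
    return all(a <= b for a, b in pairs) and any(a < b for a, b in pairs)

def k_skyband(O, Q, k, d=math.dist):
    """Objects of O that are dominated by at most k - 1 other objects."""
    return [o for o in O
            if sum(dominates(p, o, Q, d) for p in O) <= k - 1]

def dscore(o, O, Q, d=math.dist):
    """Number of objects of O dominated by o."""
    return sum(dominates(o, p, Q, d) for p in O)

Q = [(0, 0), (4, 0)]
O = [(2, 0), (1, 0), (2, 3), (6, 0), (2, -5)]

skyline = k_skyband(O, Q, k=1)   # the 1-skyband is the skyline
band2 = k_skyband(O, Q, k=2)
top2 = sorted(O, key=lambda o: dscore(o, O, Q), reverse=True)[:2]
```

The object `(6, 0)` has a domination score of 0 but is dominated by only one object, so it stays in the 2-skyband: the k-skyband is a superset of the top-k result, not the result itself.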
Theorem 2: given the query input set Q and the k objects $\{o_1, o_2, \dots, o_k\} \subseteq O$ of ANN(Q, k), the objects that are not dominated by them (where $\nprec$ denotes "does not dominate") form the set KANN(Q, k), which also contains the objects of ANN(Q, k) themselves. The top-k dominating result set D satisfies $D \subseteq KANN(Q, k)$.
Proof. Let o be the nearest neighbor ANN(Q, 1) of the query set Q. All objects in O - 1ANN(Q, 1) are dominated by o, so the top-1 dominating object must lie in 1ANN(Q, 1). If the top-1 dominating object is not o, then by the above the object with the second-highest domination score must also lie in 1ANN(Q, 1); if the top-1 dominating object is o, then the object with the second-highest domination score must lie in 2ANN(Q, 2). Continuing in this way, the top-k dominating result set D satisfies $D \subseteq KANN(Q, k)$. Q.E.D.
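The filter behind Theorem 2 can be sketched as follows. The text does not spell out whether an object is discarded when it is dominated by any object of ANN(Q, k) or only when it is dominated by all of them; this sketch assumes the conservative reading and discards an object only when every one of the k nearest objects dominates it. The function name `kann`, the max aggregate, and the data are likewise assumptions of this example.

```python
import heapq
import math

def dominates(o, p, Q, d=math.dist):
    """o dominates p: never farther from any query object, strictly closer to one."""
    pairs = [(d(o, q), d(p, q)) for q in Q]
    return all(a <= b for a, b in pairs) and any(a < b for a, b in pairs)

def kann(O, Q, k, d=math.dist, agg=max):
    """Return (ANN(Q, k), candidates kept after filtering with ANN(Q, k))."""
    nearest = heapq.nsmallest(k, O, key=lambda o: agg(d(o, q) for q in Q))
    # Assumed rule: drop an object only if all k nearest objects dominate it.
    kept = [o for o in O
            if o in nearest or not all(dominates(a, o, Q, d) for a in nearest)]
    return nearest, kept

Q = [(0, 0), (4, 0)]
O = [(2, 0), (1, 0), (2, 3), (6, 0), (2, -5)]

nearest, candidates = kann(O, Q, k=2)
```

Under this reading a discarded object has at least k dominators, each with a strictly higher domination score, so it cannot belong to the top-k result.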
All of the following algorithms are implemented on the Spark platform:
(1) Top-k dominating algorithm based on skyline (DSDA)
In the existing DSDA, the data set is first randomly distributed across the nodes. The skyline algorithm is then run inside Spark's mapPartitions interface, which yields the skyline of each partition. Finally, the per-partition skylines are compared pairwise to obtain the global skyline, and the object with the highest domination score in the skyline is returned as a top-k dominating result. Repeating this for k rounds yields the final result set.
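The DSDA loop can be imitated on one machine by treating a list of lists as the partitions. The description leaves open how the k rounds proceed; this sketch assumes that each round removes the object it has just reported and recomputes the skyline of what remains, and that all objects are distinct. The function names and data are hypothetical, and the list comprehensions merely stand in for Spark's mapPartitions and the pairwise merge.

```python
import math

def dominates(o, p, Q, d=math.dist):
    """o dominates p: never farther from any query object, strictly closer to one."""
    pairs = [(d(o, q), d(p, q)) for q in Q]
    return all(a <= b for a, b in pairs) and any(a < b for a, b in pairs)

def skyline(objs, Q):
    """Objects of objs that no other object in objs dominates."""
    return [o for o in objs if not any(dominates(p, o, Q) for p in objs)]

def dscore(o, data, Q):
    return sum(dominates(o, p, Q) for p in data)

def dsda(partitions, Q, k):
    data = [o for part in partitions for o in part]
    remaining = [list(part) for part in partitions]
    result = []
    for _ in range(k):                                    # one round per result
        local = [skyline(part, Q) for part in remaining]  # per-partition skylines
        merged = skyline([o for s in local for o in s], Q)  # global skyline
        if not merged:
            break
        best = max(merged, key=lambda o: dscore(o, data, Q))
        result.append(best)
        remaining = [[o for o in part if o != best] for part in remaining]
    return result

Q = [(0, 0), (4, 0)]
partitions = [[(2, 0), (2, 3)], [(1, 0), (6, 0), (2, -5)]]
top2 = dsda(partitions, Q, k=2)
```

The loop makes the cost visible: every additional result requires another full skyline pass, which is consistent with the observation that DSDA is sensitive to k.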
(2) Top-k dominating algorithm based on k-skyband (DKDA)
The existing DKDA parallelizes the k-skyband algorithm on a Spark cluster; the idea of the parallel algorithm is similar to that of the skyline version. By Theorem 1, the top-k dominating result set satisfies $D \subseteq$ k-skyband, so the k-skyband is computed first, and the k objects with the highest domination scores in the k-skyband are returned as the top-k dominating result set.
First, the data set is randomly distributed across the nodes. The k-skyband algorithm is then run inside Spark's mapPartitions interface, which yields the k-skyband of each partition. Finally, the per-partition k-skybands are compared pairwise to obtain the global k-skyband, and the objects with the highest domination scores in the k-skyband are returned as the top-k dominating result set. Compared with the skyline method, this method has the advantage that no k rounds of iteration are needed, but computing the k-skyband of the original data set is very time-consuming.
(3) Parallel top-k dominating algorithm based on set ANN pruning and k-skyband (DAKDA)
Because algorithm 1 requires k rounds of iteration, its query time grows with k, and because algorithm 2 spends a long time computing the k-skyband of the original data set, the present invention prunes with the set ANN before computing the k-skyband.
In the present invention, Theorem 1 gives $D \subseteq$ k-skyband and Theorem 2 gives $D \subseteq KANN(Q, k)$. Because computing the k-skyband is more time-consuming than computing KANN(Q, k), the set ANN is used first to remove data that cannot be candidates, which yields the candidate set KANN(Q, k). The k-skyband of KANN(Q, k) is then computed, and finally the k objects with the highest domination scores in the k-skyband are returned as the top-k dominating result. The steps are shown in Fig. 1:
Step 1: pruning with ANN(Q, k)
As shown in stage one of Fig. 1, the data are first processed with the distance function d() and the query input Q: the distances between each object and the query objects are stored in each partition as Deal_Data_RDD. The ANN(Q, k) of each partition is then computed, and the global ANN(Q, k) is obtained from them. The global ANN(Q, k) is used to filter the original data set, which yields KANN(Q, k)_RDD. By Theorem 2, the candidate set KANN(Q, k)_RDD necessarily contains the final top-k dominating result set D.
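The partition-then-reduce pattern of stage one can be sketched without Spark: each partition reports its own k nearest objects, and a reduce step merges two partial lists while keeping only the k smallest entries. The max aggregate, the function names, and the two-partition data are assumptions of this example; Python's `functools.reduce` only stands in for Spark's reduce interface.

```python
import heapq
import math
from functools import reduce

def agg_dist(o, Q):
    """Aggregate distance from o to the query set (max is an assumed choice)."""
    return max(math.dist(o, q) for q in Q)

def local_ann(partition, Q, k):
    """Per-partition step: the k smallest (distance, object) pairs of one partition."""
    return heapq.nsmallest(k, ((agg_dist(o, Q), o) for o in partition))

def merge_ann(a, b, k):
    """Reduce step: merge two partial results and keep the k smallest."""
    return heapq.nsmallest(k, a + b)

def global_ann(partitions, Q, k):
    partial = [local_ann(part, Q, k) for part in partitions]
    return reduce(lambda a, b: merge_ann(a, b, k), partial)

Q = [(0, 0), (4, 0)]
partitions = [[(2, 0), (2, 3)], [(1, 0), (6, 0), (2, -5)]]

result = [o for _, o in global_ann(partitions, Q, k=2)]
```

Because every global neighbor is also among the k nearest of its own partition, merging the partial lists loses nothing, and only k entries per partition need to travel over the network before the result is broadcast.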
Step 2: pruning with the k-skyband
As shown in stage two of Fig. 1, KANN(Q, k)_RDD may be very large, so directly computing the domination scores of all its objects would still be time-consuming. The k-skyband idea is therefore applied: the k-skyband of KANN(Q, k)_RDD is computed as a further pruning step, which yields the final candidate set GlobalCandidate(k-skyband). By Theorem 1, GlobalCandidate(k-skyband) necessarily contains the final top-k dominating result set D.
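One way to compute a k-skyband over partitions is sketched below: each partition computes its local k-skyband, the survivors are concatenated, and the k-skyband of the merged list is taken. The text only says that per-partition results are compared to obtain the global one, so the exact merge used here is an assumption, as are the function names and data.

```python
import math

def dominates(o, p, Q, d=math.dist):
    """o dominates p: never farther from any query object, strictly closer to one."""
    pairs = [(d(o, q), d(p, q)) for q in Q]
    return all(a <= b for a, b in pairs) and any(a < b for a, b in pairs)

def k_skyband(objs, Q, k):
    """Objects dominated by at most k - 1 objects of objs."""
    return [o for o in objs
            if sum(dominates(p, o, Q) for p in objs) <= k - 1]

def partitioned_k_skyband(partitions, Q, k):
    local = [k_skyband(part, Q, k) for part in partitions]  # per-partition pruning
    merged = [o for band in local for o in band]            # collect survivors
    return k_skyband(merged, Q, k)                          # global pruning

Q = [(0, 0), (4, 0)]
partitions = [[(2, 0), (2, 3)], [(1, 0), (6, 0), (2, -5)]]

band1 = partitioned_k_skyband(partitions, Q, k=1)
band2 = partitioned_k_skyband(partitions, Q, k=2)
```

An object with at most k-1 dominators overall also has at most k-1 dominators inside its own partition, so local pruning never discards a member of the global k-skyband.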
Step 3: obtaining the top-k dominating result set
As shown in stage three of Fig. 1, the candidate set and the original data set are combined by a Cartesian product into <key, value> pairs, where key is a candidate object and value is 1 if the candidate dominates the paired object of the original data set and 0 otherwise. The reduceByKey API is then used to obtain the domination scores of all objects in GlobalCandidate(k-skyband), and the k objects with the highest domination scores are selected.
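The scoring stage maps naturally onto key-value pairs. In the plain-Python sketch below, `itertools.product` plays the role of the Cartesian product and a dictionary that sums values per key plays the role of Spark's reduceByKey; the function names, the candidate list, and the data are assumptions of this example.

```python
import heapq
import math
from collections import defaultdict
from itertools import product

def dominates(o, p, Q, d=math.dist):
    """o dominates p: never farther from any query object, strictly closer to one."""
    pairs = [(d(o, q), d(p, q)) for q in Q]
    return all(a <= b for a, b in pairs) and any(a < b for a, b in pairs)

def score_candidates(candidates, data, Q):
    # Cartesian product -> (key, value): key is the candidate, value is 1 or 0.
    pairs = ((c, 1 if dominates(c, o, Q) else 0)
             for c, o in product(candidates, data))
    totals = defaultdict(int)
    for key, value in pairs:          # sum the values of each key
        totals[key] += value
    return dict(totals)

def top_k(scores, k):
    """The k (object, score) pairs with the highest scores."""
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

Q = [(0, 0), (4, 0)]
data = [(2, 0), (1, 0), (2, 3), (6, 0), (2, -5)]
candidates = [(2, 0), (1, 0), (6, 0)]   # e.g. a k-skyband from the previous stage

scores = score_candidates(candidates, data, Q)
winners = top_k(scores, k=2)
```

The number of dominance tests is the candidate count times the data set size, which is why the two pruning stages that shrink the candidate set matter for the final cost.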
Embodiment 1:
This embodiment was carried out on a Spark cluster of 7 nodes. Spark is deployed on Hadoop and uses Hadoop's YARN resource manager and the HDFS file storage system. Of the 7 nodes, the master node serves as both the driver node and a worker node, and the remaining 6 nodes are worker nodes. All algorithms are written in Scala. The basic configuration is shown in Table 2:
Table 2: Experimental environment configuration
As shown in Figs. 2 to 5, the experiments evaluate the three algorithms DSDA, DKDA, and DAKDA in the following respects: the effect of the number of partitions num on query time (used to choose a reasonable number of partitions), the effect of the number of returned results k on the query, the effect of the size of the query input set Q on query time, a comparison of the candidate sets of the algorithms, and the scalability of the algorithms. The default parameter settings of the experiments are shown in Table 3, where the coverage ratio c is the radius of the minimum circle covering the query input Q divided by the radius of the minimum circle covering the whole data set.
Table 3: Default experimental parameter configuration
A real, relatively large data set is analyzed first: the ZILLOW data set. The original data set has 2245109 records; because some records have missing attribute values, 1771107 records remain after those are deleted. The data set has 5 attributes in total, and the Manhattan distance is used as the distance function of the metric space. The detailed flow is shown in Fig. 1. As shown in Fig. 6, the data set is evenly distributed to the slave nodes; each node then independently runs the algorithms described above to obtain a candidate set, and the results are finally gathered into the top-k dominating result set.
With m = 5, experiment 1 evaluates how the performance of each algorithm varies with the number of returned results k. As shown in Fig. 2, the running time of DSDA changes markedly with k, while that of DAKDA changes little, which shows that DSDA is more sensitive to k.
With k = 10, experiment 2 evaluates how the performance of each algorithm varies with the size m of the query set Q. Fig. 3 shows that the running time of DKDA rises sharply as m increases.
With k = 10 and m = 5, experiment 3 evaluates how the performance of each algorithm varies with the coverage ratio c of the query set Q. As shown in Fig. 4, DSDA is the slowest when the coverage ratio is large. The scalability of the method of the invention is shown in Fig. 5.
Embodiment 1 shows that, for a given data set, a user's query input, and a given distance function in the metric space, the present invention provides a parallel top-k dominating scheme suitable for large data sets. It exploits the property that the k-skyband contains the top-k dominating result set: candidates are first obtained by pruning with the set k-nearest neighbors, the k-skyband of the candidate set is then computed, and the top-k dominating result is finally obtained.
Compared with the traditional approach of solving top-k domination with the skyline, and with using the k-skyband alone, the method based on the k-skyband and the set ANN screens the data and reduces the number of comparisons between data objects, which speeds up queries. The present invention is implemented in parallel on the Spark platform. Existing research on metric-space top-k dominating queries consists of single-machine algorithms, whereas the present invention proposes parallel algorithms that are much faster than single-machine ones, according to the results of Embodiment 1. The present invention thus parallelizes the traditional skyline-based and k-skyband-based methods, achieves faster queries, and is suitable for larger input sets and massive data sets.
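The three stages summarized above can be strung together and compared against a brute-force scan. The single-machine sketch below is an illustration under stated assumptions, not the Spark implementation: it uses the max aggregate for ANN(Q, k), discards an object in stage one only when all k nearest objects dominate it, and runs on hypothetical random points. The name `dakda` is borrowed from the text for readability.

```python
import heapq
import math
import random

def dominates(o, p, Q, d=math.dist):
    """o dominates p: never farther from any query object, strictly closer to one."""
    pairs = [(d(o, q), d(p, q)) for q in Q]
    return all(a <= b for a, b in pairs) and any(a < b for a, b in pairs)

def dscore(o, data, Q):
    return sum(dominates(o, p, Q) for p in data)

def dakda(data, Q, k):
    # Stage 1: ANN(Q, k) under the max aggregate, then the assumed filter rule.
    nearest = heapq.nsmallest(k, data, key=lambda o: max(math.dist(o, q) for q in Q))
    kann = [o for o in data
            if o in nearest or not all(dominates(a, o, Q) for a in nearest)]
    # Stage 2: k-skyband of the candidates.
    band = [o for o in kann
            if sum(dominates(p, o, Q) for p in kann) <= k - 1]
    # Stage 3: score the survivors against the full data set.
    return heapq.nlargest(k, band, key=lambda o: dscore(o, data, Q))

def brute_force(data, Q, k):
    return heapq.nlargest(k, data, key=lambda o: dscore(o, data, Q))

random.seed(7)  # fixed seed so the run is repeatable
data = [(random.random(), random.random()) for _ in range(200)]
Q = [(0.4, 0.4), (0.6, 0.5), (0.5, 0.7)]

fast = dakda(data, Q, k=5)
slow = brute_force(data, Q, k=5)
fast_scores = [dscore(o, data, Q) for o in fast]
slow_scores = [dscore(o, data, Q) for o in slow]
```

Comparing `fast_scores` with `slow_scores` checks that pruning did not change the answer; the score lists are compared rather than the objects because objects with equal scores may be returned in a different order.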

Claims (3)

1. A top-k dominating query method based on metric space in a distributed environment, characterized in that it comprises the following steps, performed in order:
(1) A query input data object set Q and a distance function d() in the metric space are given; d() measures the distance between the objects of the data set O and the query input data object set Q;
(2) Based on step (1), a parallel algorithm built on the set ANN and the k-skyband is provided, the parallel algorithm comprising:
(21) Pruning with ANN(Q, k):
Using the distance function d() and the query input Q, the distances between all data objects and the query input objects are computed and stored in each partition as Deal_Data_RDD. Each partition then independently computes, in parallel, the ANN(Q, k) of that partition. The per-partition ANN(Q, k) results are screened through the reduce interface to obtain the global ANN(Q, k). The global ANN(Q, k) is broadcast to every node and used to filter the original data set, which yields the candidate set KANN(Q, k)_RDD; KANN(Q, k)_RDD necessarily contains the final top-k dominating result set D. Here ANN(Q, k) refers to the k nearest neighbors of the query set Q, and the filter rule keeps the objects that are not dominated by the objects in ANN(Q, k);
(22) Pruning with the k-skyband: using the k-skyband idea, the k-skyband of KANN(Q, k)_RDD is computed as a further pruning step, which yields the final candidate set GlobalCandidate(k-skyband);
(23) Obtaining the top-k dominating result set:
The domination scores of all objects in GlobalCandidate(k-skyband) are computed, and the k objects with the highest domination scores are found and returned as the top-k dominating result.
2. The top-k dominating query method based on metric space in a distributed environment according to claim 1, characterized in that: in step (21), because the ANN(Q, k) of a partition is not necessarily the global ANN(Q, k), the distances in the ANN(Q, k) results of the partitions are compared one by one to obtain the global ANN(Q, k).
3. The top-k dominating query method based on metric space in a distributed environment according to claim 1, characterized in that: step (23) is carried out as follows: the candidate set obtained in step (22) and the original data set are combined by a Cartesian product, and the reduceByKey API provided by Spark is then used to obtain the domination score of each candidate.
CN201610393610.1A 2016-06-03 2016-06-03 Top-k dominating query method based on metric space in a distributed environment Active CN106055674B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610393610.1A CN106055674B (en) 2016-06-03 2016-06-03 Top-k dominating query method based on metric space in a distributed environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610393610.1A CN106055674B (en) 2016-06-03 2016-06-03 Top-k dominating query method based on metric space in a distributed environment

Publications (2)

Publication Number Publication Date
CN106055674A true CN106055674A (en) 2016-10-26
CN106055674B CN106055674B (en) 2019-05-31

Family

ID=57170263

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610393610.1A Active CN106055674B (en) 2016-06-03 2016-06-03 Top-k dominating query method based on metric space in a distributed environment

Country Status (1)

Country Link
CN (1) CN106055674B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107273464A (en) * 2017-06-02 2017-10-20 浙江大学 A kind of similar inquiry processing method of non-distributive measure based on publish/subscribe pattern
CN110245022A (en) * 2019-06-21 2019-09-17 齐鲁工业大学 Parallel Skyline processing method and system under mass data
CN113065036A (en) * 2021-04-14 2021-07-02 深圳大学 Method and device for measuring performance of space supporting point and related components

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102799681A (en) * 2012-07-24 2012-11-28 河海大学 Top-k query method oriented to any data segment
CN103970871A (en) * 2014-05-12 2014-08-06 华中科技大学 Method and system for inquiring file metadata in storage system based on provenance information

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DAICHI AMAGATA等: "Efficient processing of top-k dominating queries", 《WORLD WIDE WEB》 *
TIAKAS E等: "Processing Top-k Dominating Queries in Metric Spaces", 《ACM TRANSACTIONS ON DATABASE SYSTEMS(TODS)》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107273464A (en) * 2017-06-02 2017-10-20 浙江大学 A kind of similar inquiry processing method of non-distributive measure based on publish/subscribe pattern
CN107273464B (en) * 2017-06-02 2020-05-12 浙江大学 Distributed measurement similarity query processing method based on publish/subscribe mode
CN110245022A (en) * 2019-06-21 2019-09-17 齐鲁工业大学 Parallel Skyline processing method and system under mass data
CN110245022B (en) * 2019-06-21 2021-11-12 齐鲁工业大学 Parallel Skyline processing method and system under mass data
CN113065036A (en) * 2021-04-14 2021-07-02 深圳大学 Method and device for measuring performance of space supporting point and related components

Also Published As

Publication number Publication date
CN106055674B (en) 2019-05-31

Similar Documents

Publication Publication Date Title
Tao et al. Efficient and accurate nearest neighbor and closest pair search in high-dimensional space
Ren et al. Exploiting vertex relationships in speeding up subgraph isomorphism over large graphs
CN103345514B (en) Streaming data processing method under big data environment
CN103116639B (en) Based on article recommend method and the system of user-article bipartite graph model
Tao et al. Approximate MaxRS in spatial databases
Cao et al. Efficient and accurate strategies for differentially-private sliding window queries
CN102722531B (en) Query method based on regional bitmap indexes in cloud environment
CN102750328B (en) A kind of construction and storage method of data structure
Yun et al. Fastraq: A fast approach to range-aggregate queries in big data environments
CN110222029A (en) A kind of big data multidimensional analysis computational efficiency method for improving and system
CN109308303B (en) Multi-table connection online aggregation method based on Markov chain
CN112925821B (en) MapReduce-based parallel frequent item set incremental data mining method
Koide et al. Fast subtrajectory similarity search in road networks under weighted edit distance constraints
Tang et al. An intermediate data partition algorithm for skew mitigation in spark computing environment
CN106055674A (en) top-k arrangement query method based on metric space in distributed environment
Adamu et al. A survey on big data indexing strategies
CN105930531A (en) Method for optimizing cloud dimensions of agricultural domain ontological knowledge on basis of hybrid models
CN106649731A (en) Node similarity searching method based on large-scale attribute network
CN103761298B (en) Distributed-architecture-based entity matching method
CN107506388A (en) A kind of iterative data balancing optimization method towards Spark parallel computation frames
CN104794237B (en) web information processing method and device
Yu Entity resolution with recursive blocking
Xu et al. Efficient similarity join based on Earth mover’s Distance using Mapreduce
He et al. Efficient and robust data augmentation for trajectory analytics: A similarity-based approach
Kang et al. EMP: Max-P regionalization with enriched constraints

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant