CN106909639A - A Spark-based chain multi-way spatial join query processing algorithm - Google Patents

Info

Publication number
CN106909639A
Authority
CN
China
Prior art keywords
data
new
rdd
space
rddresult
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710083816.9A
Other languages
Chinese (zh)
Other versions
CN106909639B (en)
Inventor
乔百友
王秋杰
韩东红
王国仁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University China
Original Assignee
Northeastern University China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University China
Priority to CN201710083816.9A
Publication of CN106909639A
Application granted
Publication of CN106909639B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2457 Query processing with adaptation to user needs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2264 Multidimensional index structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9537 Spatial or temporal dependent retrieval, e.g. spatiotemporal queries

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a Spark-based chain multi-way spatial join query processing algorithm comprising the following steps. Step 1: divide the whole data space into a number of equal-sized grid cells and encode each grid cell using a Z-order space-filling curve. Step 2: project each spatial object in the m-way spatial join datasets onto the corresponding grid cells according to its position in the data space. Step 3: if the condition i < m holds, perform the spatial join operation Overlap on the two datasets RDDresult_new and RDD_i. Step 4: set i = i + 1 and repeat Step 3 until the condition i < m no longer holds. Step 5: perform the final spatial join operation Overlap. The algorithm significantly improves processing efficiency and reduces computation cost.

Description

A Spark-based chain multi-way spatial join query processing algorithm
Technical field
The present invention relates to the technical field of spatial data query processing, and in particular to a Spark-based chain multi-way spatial join query processing algorithm.
Background art
Spatial join queries are an important type of spatial data query that is widely used in spatial data management, and spatial join query processing has long been a research focus in the field of spatial database management. The multi-way spatial join is a common spatial join operation: it retrieves, from multiple spatial datasets, all spatial objects that satisfy a given spatial predicate (such as intersection or containment). It is one of the most time-consuming spatial operations, and its complexity and importance make it one of the key factors that determine the overall performance of a spatial data management system. Improving the efficiency of multi-way spatial join query processing has therefore been a persistent research topic in academia. In recent years in particular, the rapid development and wide application of Internet of Things technology, Earth observation technology, and location-based services have caused the volume of spatial data to grow sharply, and spatial data has become an important class of big data. How to process multi-way spatial join queries efficiently over such spatial big data has become a major challenge for spatial data management. Traditional processing techniques based on spatial databases scale poorly and therefore struggle to meet the fast-query requirements of spatial big data. Spark, a newer distributed parallel processing platform for very large datasets, has attracted wide attention and is a key technology in current big data processing. Studying efficient multi-way spatial join query processing methods for spatial big data on top of the large-scale data processing capability of the Spark distributed parallel platform has therefore become an important way to address this challenge.
In multi-way spatial join query processing, existing methods mainly have the following problems: (1) traditional multi-way join methods based on spatial databases mainly use centralized processing, scale poorly, and struggle to meet the fast-query requirements of spatial big data; (2) some popular existing algorithms, such as dynamic programming algorithms and hybrid join algorithms, mostly build their indexes centrally and are relatively inefficient for join queries over massive data; (3) existing distributed methods are mainly based on the Hadoop platform and focus on optimizing general multi-way join query processing; they replicate too much data and filter weakly, which hurts query efficiency; (4) the most recent distributed multi-way spatial join algorithms are the two MapReduce-based algorithms proposed by Gupta et al., Controlled-Replicate and ε-Controlled-Replicate. Controlled-Replicate partitions the spatial objects of each join dataset and replicates them to all grid cells in the fourth quadrant before performing the multi-way join. This clearly replicates a large number of spatial objects and lowers join efficiency. The authors therefore also proposed an improved algorithm, ε-Controlled-Replicate, which reduces data replication and improves query efficiency to some extent but still replicates too much data. On the Spark platform, excessive replication makes the volume of data loaded into memory at one time too large, so Spark's in-memory computing advantage cannot be fully exploited and query efficiency suffers.
Once these problems have been studied in depth and corresponding solutions proposed, the results can be applied to spatial big data join query processing and related application fields. The present invention therefore proposes a multi-way spatial join query processing algorithm for the Spark platform. The algorithm targets chain multi-way spatial join queries. It partitions the data space with a grid, uses Z-order encoding to partition and encode the data, and projects and replicates data according to the spatial position of each data object. During the join, the algorithm uses boundary filtering to discard join data that cannot contribute, which reduces unnecessary computation in subsequent joins as well as redundant projection and replication of join objects. It also uses a duplicate avoidance strategy to reduce the output of duplicate results, which lowers the overall cost of subsequent join computation and improves the efficiency of multi-way join query processing.
Summary of the invention
In view of the shortcomings of the prior art, the object of the present invention is to provide a Spark-based chain multi-way spatial join query processing algorithm. The algorithm focuses on chain multi-way spatial join query processing, with emphasis on reducing spatial data replication and computation in the filter stage, so as to lower the cost of subsequent join computation and improve query processing efficiency. The algorithm also has good adaptability and scalability.
To achieve this object, the technical solution of the present invention is as follows:
A Spark-based chain multi-way spatial join query processing algorithm, comprising the following steps:
Step 1: Using a grid partitioning method, divide the whole data space into a number of equal-sized grid cells, and encode each grid cell using a Z-order space-filling curve;
Step 2: Project each spatial object in the m-way (m > 2) spatial join datasets R_1, R_2, ..., R_m onto the corresponding grid cells according to its position in the data space, forming a series of key-value pairs; store the projection results in the resilient distributed datasets RDD_1, RDD_2, ..., RDD_m respectively; set the loop variable i = 2 and the intermediate result dataset RDDresult_new = RDD_1;
Step 3: If the condition i < m holds, perform the spatial join operation Overlap(RDDresult_new, RDD_i) on the two datasets RDDresult_new and RDD_i. During this computation, data aggregation, boundary filtering, spatial join computation, duplicate avoidance, and data replication are carried out in turn, producing the intermediate result dataset RDDresult_new, that is, RDDresult_new = Overlap(RDDresult_new, RDD_i);
Step 4: Set i = i + 1 and repeat Step 3 until the condition i < m no longer holds;
Step 5: Perform the final spatial join operation Overlap(RDDresult_new, RDD_m). During this computation, data aggregation, boundary filtering, and spatial join computation are carried out in turn, and the results are output directly to form the final spatial join result set, which is saved to the HDFS file system.
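The loop control in Steps 2 to 5 can be sketched as a left-deep fold over the datasets. The following is a minimal illustration in plain Python, not the patented implementation: the names `chain_join`, `overlap`, and `final_overlap` are hypothetical, and the two callables stand in for the per-step Overlap operations, which are passed in rather than implemented.

```python
from typing import Callable, List, Sequence, TypeVar

D = TypeVar("D")  # stands in for a projected dataset (an RDD in the text)


def chain_join(datasets: Sequence[D],
               overlap: Callable[[D, D], D],
               final_overlap: Callable[[D, D], D]) -> D:
    """Evaluate Overlap(...Overlap(Overlap(R1, R2), R3)..., Rm) left to right."""
    m = len(datasets)
    if m <= 2:
        raise ValueError("the text assumes m > 2 datasets")
    result = datasets[0]                 # RDDresult_new = RDD_1
    i = 2                                # loop variable set in Step 2
    while i < m:                         # Steps 3 and 4
        result = overlap(result, datasets[i - 1])   # datasets is 0-based
        i += 1
    return final_overlap(result, datasets[m - 1])   # Step 5


# Toy stand-ins that only record the order in which joins are performed.
trace: List[str] = []


def step(a: str, b: str) -> str:
    trace.append(f"join({a},{b})")
    return f"({a}*{b})"


def last(a: str, b: str) -> str:
    trace.append(f"final({a},{b})")
    return f"({a}*{b})"


out = chain_join(["R1", "R2", "R3", "R4"], step, last)
```

With four datasets, the intermediate step runs twice (for i = 2 and i = 3) and the final step runs once against the last dataset, which matches the stopping condition i < m.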
Further, the data partitioning and encoding method is as follows: using a grid-based partitioning method, the whole data space is divided into n equal-sized grid cells; the grid cells are encoded using a Z-order space-filling curve; each spatial data object is projected onto the grid cells according to its position; and all grid cells are mapped to multiple Executor execution units by hashing, so that the whole processing task is divided into multiple parallel processing tasks.
Further, the spatial object projection is as follows: each spatial data object is mapped to the corresponding grid cells according to its position. Let C = (c_1, c_2, …, c_n) denote a partition of the data space, where each c_i denotes a grid cell and is the Z-order code of that cell, and let R be a set of spatial objects to be joined. For a spatial object u ∈ R, if its MBR (minimum bounding rectangle) overlaps grid cell c_i, then u is mapped to c_i and the corresponding key-value pair (c_i, u) is generated. If a spatial object overlaps several grid cells, a key-value pair is generated for each of them.
Further, Step 3 specifically comprises the following steps:
Step 3-1: Compute Overlap(RDDresult_new, RDD_i): perform a Cogroup operation on RDDresult_new and RDD_i by Key value, bringing together the data in RDDresult_new and RDD_i that share a Key value to obtain RDD_new;
Step 3-2: Filter RDD_new using the filtering strategy to remove data pairs that cannot produce a result, and then perform the actual spatial join operation;
Step 3-3: Apply the duplicate avoidance strategy to form the intermediate join result, and perform the data replication operation on it to form the new intermediate join result dataset RDDresult_new.
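The Cogroup step groups the values of two key-value datasets under their shared keys. The sketch below mimics that grouping in plain Python so it can run without a cluster; the function `cogroup` here is a local stand-in written for illustration (on a real cluster the pair-RDD `cogroup` operation would play this role), and the sample pairs reuse the cell numbers of the Fig. 3 example described in Embodiment 1.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Tuple

Pair = Tuple[Hashable, object]


def cogroup(left: Iterable[Pair],
            right: Iterable[Pair]) -> Dict[Hashable, Tuple[List[object], List[object]]]:
    """Group two key-value collections by key: key -> (left values, right values)."""
    groups = defaultdict(lambda: ([], []))
    for key, value in left:
        groups[key][0].append(value)
    for key, value in right:
        groups[key][1].append(value)
    return dict(groups)


# Projected pairs (grid cell code, object id) for two datasets.
rdd_result = [(6, "r1"), (12, "r1")]
rdd_i = [(9, "r2"), (12, "r2")]

grouped = cogroup(rdd_result, rdd_i)

# Only cells that hold objects from both sides can produce join candidates.
candidates = {cell: sides for cell, sides in grouped.items() if sides[0] and sides[1]}
```

Here only grid cell 12 holds objects from both datasets, so it is the only cell in which the actual spatial join test needs to run.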
Further, Step 5 comprises the following steps:
Step 5-1: Compute Overlap(RDDresult_new, RDD_i): perform a Cogroup operation on RDDresult_new and RDD_i by Key value, bringing together the data in RDDresult_new and RDD_i that share a Key value to obtain RDD_new;
Step 5-2: Filter RDD_new using the filtering strategy to remove data pairs that cannot produce a result, and then perform the actual spatial join operation;
Step 5-3: Apply the duplicate avoidance strategy to form the final join result dataset RDDresult_new, which consists of tuples, and save it to the HDFS file system.
Further, the data replication operation is as follows: for any tuple t in the intermediate join result set T generated by the previous spatial join operation on the current grid cell, let t.s be the spatial object involved in the next spatial join operation; if t.s overlaps some grid cell c_i, then t is replicated to c_i and the corresponding key-value pair (c_i, t) is generated.
Further, the filtering strategy is as follows: while the join operations are executed in parallel, the filtering strategy removes tuples that cannot produce a join result, and only tuples that may produce a join result are replicated.
Further, the filtering strategy comprises two parts:
Boundary filtering: before a join operation is performed, first compute the bounding MBR of the spatial objects in the completed intermediate join results that are involved in the subsequent spatial join, and use this MBR to filter out the spatial objects in the dataset to be joined next that do not intersect it, thereby reducing the cost of subsequent join computation;
Replication-stage filtering: during multi-way join query processing, the intermediate results of the preceding joins must be replicated to the other grid cells that may produce join results, so that no join result is lost; when replicating intermediate join results, only intermediate results that contain objects spanning multiple grid cells are replicated.
Further, the duplicate avoidance strategy is as follows: when two spatial objects that each span multiple grid cells are joined, only the grid cell containing the lower-left corner of the region where the two objects overlap is responsible for outputting the result.
Compared with the prior art, the beneficial effects of the present invention are as follows: the invention is a Spark-based chain multi-way spatial join query processing algorithm that partitions the data space with a grid and projects and replicates data according to the positions of the spatial objects. During computation it filters out useless join objects by boundary filtering and reduces data replication by narrowing the replication range. It significantly improves processing efficiency and reduces computation cost, and it has good adaptability and scalability.
Brief description of the drawings
Fig. 1 is an example of Z-order curve encoding in a specific embodiment of the invention;
Fig. 2 is a schematic diagram of data partitioning and task mapping in a specific embodiment of the invention;
Fig. 3 is an example of the data projection and replication operations in a specific embodiment of the invention;
Fig. 4 is an example of the boundary filtering described in a specific embodiment of the invention;
Fig. 5 is an example of the duplicate avoidance described in a specific embodiment of the invention;
Fig. 6 is a flow diagram of the Spark-based chain multi-way spatial join query processing algorithm in a specific embodiment of the invention.
Detailed description of the embodiments
To make the purpose, technical solution, and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings.
The present invention proposes a Spark-based chain multi-way spatial join query processing algorithm. The algorithm focuses on chain multi-way spatial join query processing, with emphasis on reducing spatial data replication and computation in the filter stage, so as to lower the cost of subsequent join computation and improve query processing efficiency. The algorithm also has good adaptability and scalability.
Embodiment 1:
A Spark-based chain multi-way spatial join query processing algorithm, whose key techniques mainly comprise the following parts:
1) Data space partitioning and encoding: using a grid partitioning method, the whole space is divided into n equal-sized grid cells, and the grid cells are encoded using a Z-order space-filling curve. Each spatial data object is projected onto the grid cells according to its position, and all grid cells are mapped to multiple Executor execution units by hashing, so that the whole processing task is divided into multiple parallel computing tasks and the performance of the whole algorithm is improved through parallelism.
As shown in Fig. 1, to preserve the spatial relationships between spatial objects, the grid cells are encoded using a space-filling curve; Fig. 1 shows the Z-order curves for 1 and 2 bits. The Z-order curve is one kind of space-filling curve. The Z-order (z-ordering) technique uses bits to represent the attribute information of spatial objects and decomposes the data space iteratively; each resulting subspace receives a number, called the z-order value of that subspace, which serves as the Key value of the data objects in that subspace.
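A common way to compute a Z-order value for a grid cell is to interleave the bits of its column and row indices. The sketch below is an illustration under assumptions, since Fig. 1 is not reproduced here: the function name `z_order`, the choice to place column bits in the even positions, and the origin at index (0, 0) are all assumptions, and the numbering used in the patent's figures may follow a different orientation.

```python
def z_order(col: int, row: int, bits: int) -> int:
    """Interleave the low `bits` bits of col and row into one Z-order code."""
    code = 0
    for b in range(bits):
        code |= ((col >> b) & 1) << (2 * b)        # column bit -> even position
        code |= ((row >> b) & 1) << (2 * b + 1)    # row bit -> odd position
    return code


# Codes for a 2 x 2 grid (1 bit per axis) and a 4 x 4 grid (2 bits per axis),
# listed row by row starting from row 0.
grid_1bit = [[z_order(c, r, 1) for c in range(2)] for r in range(2)]
grid_2bit = [[z_order(c, r, 2) for c in range(4)] for r in range(4)]
```

Cells that are close in space tend to receive nearby codes, which is why the code is a convenient Key for grouping objects by location.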
Fig. 2 shows an example of task partitioning and mapping. As the figure shows, each grid cell produced by the partitioning, together with the data assigned to it, is distributed by hash mapping to the n Executor execution units of the Spark platform for parallel execution.
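The hash mapping of grid cells to execution units can be pictured with a simple modulo hash. This is only a sketch: the function `assign_cells` and the use of `code % num_executors` as the hash are assumptions made for illustration, and on a real Spark cluster the placement of partitions on executors is decided by the platform.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def assign_cells(cell_codes: Iterable[int], num_executors: int) -> Dict[int, List[int]]:
    """Group Z-order cell codes by the execution unit index they hash to."""
    assignment: Dict[int, List[int]] = defaultdict(list)
    for code in cell_codes:
        assignment[code % num_executors].append(code)   # modulo as the hash
    return dict(assignment)


# 16 grid cells spread over 4 hypothetical execution units.
plan = assign_cells(range(16), 4)
```

Each unit then processes the cells assigned to it independently of the others, which is what allows the join to run in parallel.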
2) Spatial object projection: each spatial data object is mapped to the corresponding grid cells according to its position. Let C = (c_1, c_2, …, c_n) denote a partition of the data space, where each c_i denotes a grid cell and is the Z-order code of that cell, and let R be a set of spatial objects to be joined. For a spatial object u ∈ R, if its MBR overlaps grid cell c_i, then u is mapped to c_i and the corresponding key-value pair (c_i, u) is generated. If a spatial object overlaps several grid cells, a key-value pair is formed for each of them. The projection operation can be expressed as Project(u, C) = {(c_i, u) | c_i ∈ C and the MBR of u overlaps c_i}.
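The projection of one object onto a uniform grid can be sketched as follows. The code is illustrative only: the helper names, the rectangle layout `(xmin, ymin, xmax, ymax)`, the square grid of `2**bits` cells per side, the bit-interleaving convention, and the decision to count an MBR that merely touches a cell boundary as overlapping that cell are all assumptions.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]   # (xmin, ymin, xmax, ymax)


def z_order(col: int, row: int, bits: int) -> int:
    """Bit-interleaved cell code (column bits in even positions; an assumption)."""
    code = 0
    for b in range(bits):
        code |= ((col >> b) & 1) << (2 * b)
        code |= ((row >> b) & 1) << (2 * b + 1)
    return code


def project(obj_id: str, mbr: Rect, cell_size: float, bits: int) -> List[Tuple[int, str]]:
    """Return one (cell code, object id) pair per grid cell the MBR overlaps."""
    xmin, ymin, xmax, ymax = mbr
    last = (1 << bits) - 1                       # highest valid cell index
    c0 = max(int(xmin // cell_size), 0)
    c1 = min(int(xmax // cell_size), last)
    r0 = max(int(ymin // cell_size), 0)
    r1 = min(int(ymax // cell_size), last)
    return [(z_order(c, r, bits), obj_id)
            for r in range(r0, r1 + 1)
            for c in range(c0, c1 + 1)]


# A 4 x 4 grid of unit cells; object "v" straddles four cells.
pairs_u = project("u", (0.5, 0.2, 1.5, 0.8), 1.0, 2)
pairs_v = project("v", (1.2, 1.2, 2.5, 2.5), 1.0, 2)
```

An object that lies inside a single cell yields one pair, while an object that straddles cell boundaries yields one pair per overlapped cell, as the text describes.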
3) Data replication: in multi-way join query processing, several join operations must be performed across the datasets. Data replication copies the intermediate result of the previous spatial join on the current grid cell to the other relevant grid cells so that the subsequent join can be performed; like projection, it generates a series of key-value pairs. Let t ∈ T be a tuple in the intermediate join result and t.s be the object that takes part in the subsequent spatial join; if t.s overlaps some grid cell c_i, then t is replicated to c_i and the corresponding key-value pair (c_i, t) is generated. The replication operation therefore yields the set of all key-value pairs (c_i, t) such that c_i ∈ C and t.s overlaps c_i.
Fig. 3 is an example of the data projection and replication operations. It shows that each spatial object is projected onto the grid cells it overlaps. Object r_1 is projected onto grid cells 6 and 12, r_2 onto cells 9 and 12, and r_3 onto cells 9 and 11, that is, Project(r_1, C) = {(6, r_1), (12, r_1)}, Project(r_2, C) = {(9, r_2), (12, r_2)}, and Project(r_3, C) = {(9, r_3), (11, r_3)}. When r_1, r_2, and r_3 are joined in turn in a multi-way join, r_2 also overlaps grid cell 9, so the intermediate join result (r_1, r_2) of r_1 and r_2 in grid cell 12 must be replicated to grid cell 9, forming the key-value pair (9, (r_1, r_2)). This allows the subsequent join with spatial object r_3 in grid cell 9 and avoids losing a join result.
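The replication in the Fig. 3 example can be traced with a few lines of Python. The cell lists below are the ones stated in the text; the function name `replicate` and the dictionary `cells_of` are hypothetical helpers introduced for this sketch.

```python
from typing import Dict, List, Tuple

# Grid cells overlapped by each object, as stated for the Fig. 3 example.
cells_of: Dict[str, List[int]] = {"r1": [6, 12], "r2": [9, 12], "r3": [9, 11]}


def replicate(t: Tuple[str, ...], next_obj: str) -> List[Tuple[int, Tuple[str, ...]]]:
    """Emit (cell, t) for every cell overlapped by t.s, the object joined next."""
    return [(cell, t) for cell in cells_of[next_obj]]


# (r1, r2) was produced in cell 12; r2 is the object that joins with r3 next.
t = ("r1", "r2")
pairs = replicate(t, "r2")

# Cells in which the replicated tuple can meet r3.
meeting_cells = [cell for cell, _ in pairs if cell in cells_of["r3"]]
```

The tuple reaches cell 9, where r3 was projected, so the follow-up join between r2 and r3 can be evaluated there.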
4) Filtering strategy: while the join operations are executed in parallel, the filtering strategy removes tuples that cannot produce a join result, and only tuples that may produce a result are replicated, which greatly reduces storage and subsequent computation cost. It comprises the following two filtering strategies:
A: Boundary filtering: first compute the bounding MBR of the relevant join objects in the join results completed so far, and use this MBR to filter out the spatial objects in the dataset to be joined next that do not intersect it, thereby reducing the cost of subsequent join computation.
Fig. 4 is an example of boundary filtering. In the figure, three datasets R, S, and T are joined in turn in a three-way join, and the spatial objects projected into grid cell 3 are as illustrated. The results of joining R and S are (r_1, s_1), (r_1, s_2), and (r_1, s_3), so the objects of set S that appear in the previous join result are s_1, s_2, and s_3; their bounding MBR is shown by the dashed line in the figure. When the join with the objects of dataset T is performed, the spatial objects t_1, t_4, and t_5 projected into grid cell 3 that do not intersect this MBR can be filtered out directly, which avoids joining them with s_1, s_2, and s_3 and thus greatly reduces the cost of subsequent computation.
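Boundary filtering amounts to computing one bounding rectangle over the objects that will take part in the next join and discarding candidates that do not intersect it. The sketch below illustrates this; the coordinates are made up for the example (Fig. 4 is not reproduced here) and were chosen so that t1, t4, and t5 are rejected as in the text, and the helper names are hypothetical.

```python
from typing import Dict, Iterable, Tuple

Rect = Tuple[float, float, float, float]   # (xmin, ymin, xmax, ymax)


def bounding_mbr(rects: Iterable[Rect]) -> Rect:
    """Smallest rectangle that encloses all given rectangles."""
    xmins, ymins, xmaxs, ymaxs = zip(*rects)
    return (min(xmins), min(ymins), max(xmaxs), max(ymaxs))


def intersects(a: Rect, b: Rect) -> bool:
    """True if the two rectangles share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def boundary_filter(joined: Dict[str, Rect], candidates: Dict[str, Rect]) -> Dict[str, Rect]:
    """Keep only candidates whose MBR intersects the bounding MBR of `joined`."""
    bound = bounding_mbr(joined.values())
    return {name: r for name, r in candidates.items() if intersects(r, bound)}


# Hypothetical MBRs: s1..s3 come from the previous join result, t1..t5 from T.
s_objects = {"s1": (1, 1, 2, 2), "s2": (2, 3, 3, 4), "s3": (3, 1, 4, 2)}
t_objects = {"t1": (5, 5, 6, 6), "t2": (3.5, 3.5, 4.5, 4.5), "t3": (0, 0, 1.5, 1.5),
             "t4": (0, 5, 0.5, 6), "t5": (6, 0, 7, 1)}

kept = boundary_filter(s_objects, t_objects)
```

The surviving candidates still need the exact pairwise join test, since intersecting the bounding MBR does not guarantee intersecting any individual object; the filter only removes candidates that cannot match.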
B: Replication-stage filtering: during multi-way join query processing, the intermediate results of the preceding joins must be replicated to the other grid cells that may produce join results, so that the subsequent join can be performed there and no join result is lost. When replicating intermediate join results, only intermediate results that involve objects spanning multiple grid cells are replicated, which avoids unnecessary replication.
5) Duplicate avoidance strategy: when two spatial objects that each span multiple grid cells are joined, only the grid cell containing the lower-left corner of the region where the two objects overlap is responsible for outputting the result. In other words, exactly one grid cell outputs each result, which avoids duplicate output and reduces processing cost.
Fig. 5 shows an example of duplicate avoidance. Object s_1 in set S is projected onto the grid cells it overlaps, namely 2, 3, 6, 8, 9, and 12; object r_1 in set R is projected onto grid cells 3, 6, 9, and 12; and object r_2 is projected onto the four grid cells 8, 9, 10, and 11. Without duplicate avoidance, grid cells 3, 6, 9, and 12 would all output the same join result (r_1, s_1) during join processing, and grid cells 8 and 9 would both output the same join result (r_2, s_1), so duplicates clearly occur. Under the proposed duplicate avoidance strategy, as shown in Fig. 5, the grid cell containing the lower-left corner of the region formed by the overlap of the two objects (points P and Q in the figure) is responsible for outputting the result: grid cell 3 outputs the join result (r_1, s_1) of r_1 and s_1, and grid cell 8 outputs the join result (r_2, s_1) of r_2 and s_1. This strategy avoids duplicate processing and duplicate output of results and reduces subsequent processing cost.
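The rule that only the cell holding the lower-left corner of the overlap region reports a result can be sketched as below. For brevity the sketch identifies cells by (column, row) indices rather than Z-order codes, and the coordinates, cell size, and function names are assumptions made for illustration rather than values from Fig. 5.

```python
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]   # (xmin, ymin, xmax, ymax)
Cell = Tuple[int, int]                     # (column, row) index of a grid cell


def intersection(a: Rect, b: Rect) -> Optional[Rect]:
    """Overlap region of two rectangles, or None if they do not intersect."""
    xmin, ymin = max(a[0], b[0]), max(a[1], b[1])
    xmax, ymax = min(a[2], b[2]), min(a[3], b[3])
    if xmin > xmax or ymin > ymax:
        return None
    return (xmin, ymin, xmax, ymax)


def reference_cell(a: Rect, b: Rect, cell_size: float) -> Optional[Cell]:
    """Cell containing the lower-left corner of the overlap region."""
    inter = intersection(a, b)
    if inter is None:
        return None
    return (int(inter[0] // cell_size), int(inter[1] // cell_size))


def should_output(current: Cell, a: Rect, b: Rect, cell_size: float) -> bool:
    """A cell reports the pair only if it is the reference cell."""
    return reference_cell(a, b, cell_size) == current


# Two hypothetical objects that both overlap cells (0,0), (1,0), (0,1), (1,1).
r: Rect = (0.5, 0.5, 1.5, 1.5)
s: Rect = (0.8, 0.2, 2.5, 1.2)
shared_cells: List[Cell] = [(0, 0), (1, 0), (0, 1), (1, 1)]

owners = [cell for cell in shared_cells if should_output(cell, r, s, 1.0)]
```

Because the check uses only the two MBRs and the grid geometry, each cell can decide locally whether to emit the pair, without coordinating with the other cells.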
A chain multi-way spatial join query Q_m = Overlap(R_1, R_2, R_3, ..., R_m) can, by definition, be expressed as Q_m = Overlap(...Overlap(Overlap(R_1, R_2), R_3), …, R_m). The processing flow of the Spark-based chain multi-way spatial join query processing algorithm proposed by the present invention is shown in Fig. 6 and mainly comprises the following steps:
A: Project the multi-way join datasets R_1, R_2, R_3, …, R_m according to the grid partitioning and encoding method, using the code value as the Key and attribute information such as the identifier and MBR of each spatial object as the Value, to form a series of key-value pairs; place the projection results of R_1, R_2, R_3, …, R_m in the resilient distributed datasets RDD_1, RDD_2, RDD_3, …, RDD_m respectively;
B: Compute Overlap(R_1, R_2): perform a Cogroup operation on RDD_1 and RDD_2, bringing together the data in RDD_1 and RDD_2 by Key value to obtain RDD_new; filter RDD_new using the boundary filtering strategy to remove data objects that cannot produce a result; then perform the actual spatial join operation, apply the duplicate avoidance strategy, and form the intermediate join result; perform the data replication operation on the intermediate join result to form the intermediate result dataset RDDresult_new;
C: Using the same computation as in step B, compute the join between RDDresult_new and RDD_3 to obtain the latest intermediate join result RDDresult_new of R_1, R_2, R_3. Using the same method, compute in turn the joins of RDDresult_new with RDD_4, RDD_5, ..., RDD_{m-1}, finally obtaining the intermediate join result dataset RDDresult_new of datasets R_1, R_2, R_3, …, R_{m-1};
D: Perform a Cogroup operation on RDDresult_new and RDD_m to generate a new RDD_new; on this basis carry out boundary filtering and join processing, and output the results directly, forming the final spatial join dataset RDDresult_new of datasets R_1, R_2, R_3, …, R_m; save the results to the HDFS file system. Because this is the last spatial join operation, no replication is needed.
Embodiment 2:
A Spark-based chain multi-way spatial join query processing algorithm, comprising the following steps. Step 1: divide the whole data space into a number of equal-sized grid cells, and encode each grid cell using a Z-order space-filling curve. Step 2: project each spatial object in the m-way (m > 2) spatial join datasets R_1, R_2, ..., R_m onto the corresponding grid cells according to its position in the data space, and store the projection results in the resilient distributed datasets RDD_1, RDD_2, ..., RDD_m; set the loop variable i = 2 and the intermediate result dataset RDDresult_new = RDD_1. Step 3: if the condition i < m holds, perform the spatial join operation Overlap(RDDresult_new, RDD_i) on the two datasets RDDresult_new and RDD_i; during this computation, data aggregation, boundary filtering, spatial join computation, duplicate avoidance, and data replication are carried out in turn, producing the new intermediate result dataset RDDresult_new = Overlap(RDDresult_new, RDD_i). Step 4: set i = i + 1 and repeat Step 3 until the condition i < m no longer holds. Step 5: perform the final spatial join operation Overlap(RDDresult_new, RDD_m); during this computation, data aggregation, boundary filtering, and spatial join computation are carried out in turn, and the results are output directly to form the final spatial join result set, which is saved to the HDFS file system. The invention is a Spark-based chain multi-way spatial join query processing algorithm that significantly improves processing efficiency and reduces computation cost, and it has good adaptability and scalability.
Embodiment 3:
A Spark-based chain multi-way spatial join query processing algorithm, comprising the following steps:
Step 1: Using a grid partitioning method, divide the whole data space into a number of equal-sized grid cells, and encode each grid cell using a Z-order space-filling curve;
Step 2: Project each spatial object in the m-way (m > 2) spatial join datasets R_1, R_2, ..., R_m onto the corresponding grid cells according to its position in the data space, forming a series of key-value pairs; store the projection results in the resilient distributed datasets RDD_1, RDD_2, ..., RDD_m respectively; set the loop variable i = 2 and the intermediate result dataset RDDresult_new = RDD_1;
Step 3: If the condition i < m holds, perform the spatial join operation Overlap(RDDresult_new, RDD_i) on the two datasets RDDresult_new and RDD_i. During this computation, data aggregation, boundary filtering, spatial join computation, duplicate avoidance, and data replication are carried out in turn, producing the intermediate result dataset RDDresult_new, that is, RDDresult_new = Overlap(RDDresult_new, RDD_i);
Step 4: Set i = i + 1 and repeat Step 3 until the condition i < m no longer holds;
Step 5: Perform the final spatial join operation Overlap(RDDresult_new, RDD_m). During this computation, data aggregation, boundary filtering, and spatial join computation are carried out in turn, and the results are output directly to form the final spatial join result set, which is saved to the HDFS file system.
Further, the data partitioning and encoding method is as follows: using a grid-based partitioning method, the whole data space is divided into n equal-sized grid cells; the grid cells are encoded using a Z-order space-filling curve; each spatial data object is projected onto the grid cells according to its position; and all grid cells are mapped to multiple Executor execution units by hashing, so that the whole processing task is divided into multiple parallel processing tasks.
Further, the spatial object projection is as follows: each spatial data object is mapped to the corresponding grid cells according to its position. Let C = (c_1, c_2, …, c_n) denote a partition of the data space, where each c_i denotes a grid cell and is the Z-order code of that cell, and let R be a set of spatial objects to be joined. For a spatial object u ∈ R, if its MBR (minimum bounding rectangle) overlaps grid cell c_i, then u is mapped to c_i and the corresponding key-value pair (c_i, u) is generated. If a spatial object overlaps several grid cells, a key-value pair is generated for each of them.
Further, Step 3 specifically comprises the following steps:
Step 3-1: Compute Overlap(RDDresultnew, RDDi): perform a Cogroup operation on RDDresultnew and RDDi by Key value, i.e. bring together the data in RDDresultnew and RDDi that share the same Key value, obtaining RDDnew;
Step 3-2: Filter RDDnew using the filtering strategy to remove data pairs that cannot produce results, and then perform the actual spatial join operation;
Step 3-3: Apply the duplicate avoidance strategy to form the intermediate join result, then perform the data replication operation on that intermediate result, finally forming the new intermediate join result dataset RDDresultnew.
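The grouping and per-cell join of Steps 3-1 and 3-2 can be illustrated without a cluster. In the sketch below, lists of (key, value) pairs stand in for RDDs and a small `cogroup` helper imitates the grouping behaviour of a cogroup-style operation. Filtering, duplicate avoidance and replication are deliberately left out, and all function names are assumptions introduced for illustration.

```python
from collections import defaultdict


def cogroup(left, right):
    """Group two lists of (key, value) pairs by key."""
    groups = defaultdict(lambda: ([], []))
    for key, value in left:
        groups[key][0].append(value)
    for key, value in right:
        groups[key][1].append(value)
    return dict(groups)


def intersects(a, b):
    """True if two MBRs (xmin, ymin, xmax, ymax) overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def join_cells(left, right):
    """Join objects cell by cell; values are (object id, MBR) pairs."""
    out = []
    for cell, (ls, rs) in cogroup(left, right).items():
        for lid, lmbr in ls:
            for rid, rmbr in rs:
                if intersects(lmbr, rmbr):
                    out.append((cell, (lid, rid)))
    return out
```

Because an object may have been projected onto several cells, this naive version can report the same pair more than once, which is the problem the duplicate avoidance strategy addresses.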
Further, Step 5 comprises the following steps:
Step 5-1: Compute Overlap(RDDresultnew, RDDi), where i=m at this point: perform a Cogroup operation on RDDresultnew and RDDi by Key value, i.e. bring together the data in RDDresultnew and RDDi that share the same Key value, obtaining RDDnew;
Step 5-2: Filter RDDnew using the filtering strategy to remove data pairs that cannot produce results, and then perform the actual spatial join operation;
Step 5-3: Apply the duplicate avoidance strategy to form the final join result dataset RDDresultnew, which consists of tuples, and save it to the HDFS file system.
Further, the data replication operation is as follows: for any tuple t in the intermediate join result set T generated by the previous spatial join operation on the current grid cell, let t.s be the spatial object involved in the next spatial join operation. If t.s overlaps some grid cell ci, then tuple t is copied to grid cell ci and the corresponding key-value pair (ci, t) is generated.
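A sketch of the replication step under simplifying assumptions: the grid is uniform with origin (0, 0) and non-negative coordinates, cells are keyed by (column, row) rather than by Z-order code for brevity, and an intermediate tuple carries only the ids joined so far plus the MBR of t.s. The `JoinTuple` class and the helper names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

MBR = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)
Cell = Tuple[int, int]                   # (column, row)


@dataclass(frozen=True)
class JoinTuple:
    ids: Tuple[str, ...]  # ids of the objects joined so far
    s: MBR                # MBR of the object involved in the next join


def cells_overlapping(mbr: MBR, cell_size: float) -> List[Cell]:
    """List every grid cell that the MBR overlaps."""
    xmin, ymin, xmax, ymax = mbr
    return [(col, row)
            for col in range(int(xmin // cell_size), int(xmax // cell_size) + 1)
            for row in range(int(ymin // cell_size), int(ymax // cell_size) + 1)]


def replicate(t: JoinTuple, cell_size: float) -> List[Tuple[Cell, JoinTuple]]:
    """Copy t to every cell that t.s overlaps, as (cell, t) pairs."""
    return [(cell, t) for cell in cells_overlapping(t.s, cell_size)]
```

Keying the copies by the cells of t.s, rather than by the cell that produced t, is what lets the next join round find t next to every candidate partner.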
Further, the filtering strategy is as follows: while the join operations execute in parallel, the corresponding filtering strategy removes tuples that cannot produce join results, and only tuples that may produce join results are replicated.
Further, the filtering strategy comprises two parts:
Edge filtering: before a join operation is performed, the bounding MBR of the spatial objects in the completed intermediate join result that are involved in the subsequent spatial join is computed first. This MBR is then used to filter out the spatial objects in the subsequent dataset to be joined that do not intersect it, thereby reducing the cost of the subsequent join computation;
Replication-stage filtering: during multi-way join query processing, the intermediate results of the preceding joins must be replicated. To avoid losing join results, they are copied only to those other grid cells that may produce join results, and only intermediate results that contain a join object spanning grid cells are replicated.
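Edge filtering can be illustrated with a short sketch, again using MBRs written as (xmin, ymin, xmax, ymax). The function names are assumptions, and the text does not say whether the bounding MBR is computed globally or per partition. This version computes a single global bound.

```python
def bounding_mbr(mbrs):
    """Smallest rectangle enclosing all the given MBRs."""
    xmins, ymins, xmaxs, ymaxs = zip(*mbrs)
    return (min(xmins), min(ymins), max(xmaxs), max(ymaxs))


def intersects(a, b):
    """True if two MBRs overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def edge_filter(intermediate_mbrs, next_dataset):
    """Keep only the objects of the next dataset that can still join.

    `intermediate_mbrs` holds the MBRs of the objects in the intermediate
    result that take part in the next join. `next_dataset` is a list of
    (object id, MBR) pairs.
    """
    if not intermediate_mbrs:
        return []  # nothing to join against, so nothing can match
    bound = bounding_mbr(intermediate_mbrs)
    return [(oid, mbr) for oid, mbr in next_dataset if intersects(bound, mbr)]
```

The test is conservative: an object that passes the filter may still fail the exact join, but an object that fails the filter cannot intersect any object inside the bound.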
Further, the duplicate avoidance strategy is as follows: when two spatial objects that each span multiple grid cells are joined, only the grid cell containing the lower-left corner of the new object formed by their overlap (their intersection region) is responsible for outputting the result.
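This rule can be sketched as a reference-point test. The example assumes a uniform grid with origin (0, 0), cells keyed by (column, row), and MBRs written as (xmin, ymin, xmax, ymax). The names `intersection` and `owns_result` are introduced here for illustration.

```python
def intersection(a, b):
    """Intersection rectangle of two MBRs, or None if they are disjoint."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    return (x0, y0, x1, y1) if x0 <= x1 and y0 <= y1 else None


def owns_result(cell, a, b, cell_size):
    """True only for the cell holding the intersection's lower-left corner."""
    inter = intersection(a, b)
    if inter is None:
        return False
    ref_cell = (int(inter[0] // cell_size), int(inter[1] // cell_size))
    return ref_cell == cell
```

Every cell that sees both objects evaluates the same test, so exactly one of them reports the pair and no global deduplication pass is needed.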
Although specific embodiments of the invention have been described above, those skilled in the art should understand that they are merely illustrative. The invention is a chained multi-way spatial join query processing algorithm based on Spark, so the examples serve only to illustrate the core ideas of the filtering strategy, the duplicate avoidance strategy, the join processing flow and so on. Larger-scale experiments may be carried out subsequently and the related algorithms further improved to raise the efficiency of data projection, replication and filtering; index techniques may also be considered to further improve the performance of the algorithm, without departing from the principle and essence of the invention. The scope of the invention is limited only by the appended claims.

Claims (9)

1. A chained multi-way spatial join query processing algorithm based on Spark, characterized by comprising the following steps:
Step 1: Using a grid partitioning method, divide the whole data space into many grid cells of equal size, and encode each grid cell using the Z-order space-filling curve technique;
Step 2: Project each spatial object in the m (m>2) datasets R1, R2, ..., Rm to be joined onto the grid cells corresponding to its position in the data space, forming a series of key-value pairs; store the projection results in the resilient distributed datasets RDD1, RDD2, ..., RDDm respectively, set the loop variable i=2, and initialize the intermediate result dataset RDDresultnew=RDD1;
Step 3: If the condition i<m holds, perform the spatial join operation Overlap(RDDresultnew, RDDi) on the two datasets RDDresultnew and RDDi; during the computation, data aggregation, edge filtering, spatial join computation, duplicate avoidance and data replication are carried out in turn, finally forming the intermediate result dataset RDDresultnew, i.e. RDDresultnew=Overlap(RDDresultnew, RDDi);
Step 4: Set i=i+1 and repeat Step 3 until the condition i<m no longer holds;
Step 5: Perform the final spatial join operation Overlap(RDDresultnew, RDDm); during the computation, data aggregation, edge filtering and spatial join computation are carried out in turn, and the results are output directly to form the final spatial join result set, which is saved to the HDFS file system.
2. The chained multi-way spatial join query processing algorithm based on Spark according to claim 1, characterized in that the data partitioning and encoding method is as follows: a grid-based partitioning method divides the whole data space into n equal-sized grid cells, and the grid cells are encoded using a Z-order space-filling curve; each spatial data object is projected onto grid cells according to its position, and all grid cells are mapped to multiple Executor execution units by hashing, so that the overall processing task is split into multiple parallel processing tasks.
3. The chained multi-way spatial join query processing algorithm based on Spark according to claim 1, characterized in that the spatial object projection is as follows: each spatial data object is mapped into the corresponding grid cells according to its position; let C=(c1,c2,…,cn) denote a partition of the data space, where each ci denotes a grid cell identified by its Z-order code, and let R be a set of spatial objects to be joined; for a spatial object u∈R, if its MBR overlaps grid cell ci, then u is mapped to ci and the corresponding key-value pair (ci, u) is generated; if a spatial object overlaps multiple grid cells, a key-value pair is generated for each of them.
4. The chained multi-way spatial join query processing algorithm based on Spark according to claim 1, characterized in that Step 3 specifically comprises the following steps:
Step 3-1: Compute Overlap(RDDresultnew, RDDi): perform a Cogroup operation on RDDresultnew and RDDi by Key value, i.e. bring together the data in RDDresultnew and RDDi that share the same Key value, obtaining RDDnew;
Step 3-2: Filter RDDnew using the filtering strategy to remove data pairs that cannot produce results, and then perform the actual spatial join operation;
Step 3-3: Apply the duplicate avoidance strategy to form the intermediate join result, then perform the data replication operation on that intermediate result, finally forming the new intermediate join result dataset RDDresultnew.
5. The chained multi-way spatial join query processing algorithm based on Spark according to claim 1, characterized in that Step 5 comprises the following steps:
Step 5-1: Compute Overlap(RDDresultnew, RDDi), where i=m at this point: perform a Cogroup operation on RDDresultnew and RDDi by Key value, i.e. bring together the data in RDDresultnew and RDDi that share the same Key value, obtaining RDDnew;
Step 5-2: Filter RDDnew using the filtering strategy to remove data pairs that cannot produce results, and then perform the actual spatial join operation;
Step 5-3: Apply the duplicate avoidance strategy to form the final join result dataset RDDresultnew, which consists of tuples, and save it to the HDFS file system.
6. The chained multi-way spatial join query processing algorithm based on Spark according to claim 1, characterized in that the data replication operation is as follows: for any tuple t in the intermediate join result set T generated by the previous spatial join operation on the current grid cell, let t.s be the spatial object involved in the next spatial join operation; if t.s overlaps some grid cell ci, then tuple t is copied to grid cell ci and the corresponding key-value pair (ci, t) is generated.
7. The chained multi-way spatial join query processing algorithm based on Spark according to claim 4, characterized in that the filtering strategy is as follows: while the join operations execute in parallel, the corresponding filtering strategy removes tuples that cannot produce join results, and only tuples that may produce join results are replicated.
8. The chained multi-way spatial join query processing algorithm based on Spark according to claim 7, characterized in that the filtering strategy comprises two parts:
Edge filtering: before a join operation is performed, the bounding MBR of the spatial objects in the completed intermediate join result that are involved in the subsequent spatial join is computed first; this MBR is then used to filter out the spatial objects in the subsequent dataset to be joined that do not intersect it, thereby reducing the cost of the subsequent join computation;
Replication-stage filtering: during multi-way join query processing, the intermediate results of the preceding joins must be replicated; to avoid losing join results, they are copied only to those other grid cells that may produce join results, and only intermediate results that contain a join object spanning grid cells are replicated.
9. The chained multi-way spatial join query processing algorithm based on Spark according to claim 4, characterized in that the duplicate avoidance strategy is as follows: when two spatial objects that each span multiple grid cells are joined, only the grid cell containing the lower-left corner of the new object formed by their overlap is responsible for outputting the result.
CN201710083816.9A 2017-02-16 2017-02-16 Chained multi-path space connection query processing method based on Spark Active CN106909639B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710083816.9A CN106909639B (en) 2017-02-16 2017-02-16 Chained multi-path space connection query processing method based on Spark

Publications (2)

Publication Number Publication Date
CN106909639A true CN106909639A (en) 2017-06-30
CN106909639B CN106909639B (en) 2020-09-29

Family

ID=59209302

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710083816.9A Active CN106909639B (en) 2017-02-16 2017-02-16 Chained multi-path space connection query processing method based on Spark

Country Status (1)

Country Link
CN (1) CN106909639B (en)

Also Published As

Publication number Publication date
CN106909639B (en) 2020-09-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant