CN106909639A - A chain multi-way spatial join query processing algorithm based on Spark
- Publication number
- CN106909639A (application number CN201710083816.9A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2457—Query processing with adaptation to user needs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
- G06F16/2264—Multidimensional index structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9537—Spatial or temporal dependent retrieval, e.g. spatiotemporal queries
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a chain multi-way spatial join query processing algorithm based on Spark, comprising the following steps. Step 1: divide the whole data space into many grid cells of equal size, and encode each grid cell using the Z-order space-filling curve. Step 2: project each spatial object in the m-way spatial join datasets into the corresponding grid cells according to its position in the data space. Step 3: if the condition i < m holds, perform the spatial join operation Overlap on the two datasets RDDresultnew and RDDi. Step 4: set i = i + 1 and repeat step 3 until the condition i < m no longer holds. Step 5: perform the final spatial join operation Overlap. The algorithm significantly improves processing efficiency and reduces computation cost.
Description
Technical Field
The present invention relates to the technical field of spatial data query processing, and in particular to a chain multi-way spatial join query processing algorithm based on Spark.
Background Art
The spatial join query is an important type of spatial data query that is widespread in spatial data management, and spatial join query processing has long been a research focus in spatial database management. The multi-way spatial join query is a common spatial join operation: it retrieves from multiple spatial datasets all spatial objects that satisfy a given spatial predicate (such as intersection or containment). It is one of the most time-consuming spatial operations, and its complexity and importance make it one of the key factors determining the overall performance of a spatial data management system, so improving the efficiency of multi-way spatial join query processing has remained an active research problem in academia. In recent years in particular, the rapid development and wide application of Internet of Things, earth observation, and location-based service technologies have caused the scale of spatial data to grow sharply, making it an important class of big data. How to perform efficient multi-way spatial join query processing on such spatial big data has become a major challenge in spatial data management. Traditional processing techniques based on spatial databases scale poorly and therefore struggle to meet the fast-query requirements of spatial big data. Spark, a distributed parallel processing platform for very large data, has attracted wide attention and is a key technology in current big data processing. Using the large-scale data processing capability of the Spark platform to study efficient multi-way spatial join query processing methods for spatial big data has therefore become an important way to address this challenge.
Existing methods for multi-way spatial join query processing have the following main problems. (1) Traditional multi-way join methods based on spatial databases mainly use centralized processing, scale poorly, and struggle to meet the fast-query requirements of spatial big data. (2) Some popular existing algorithms, such as the dynamic programming algorithm and the hybrid join algorithm, mostly build centralized indexes and are inefficient for join queries over massive data. (3) Existing distributed methods are mainly based on the Hadoop platform and focus on optimizing general multi-way join queries; they suffer from excessive data replication and weak filtering, which hurts query efficiency. (4) The most recent distributed multi-way spatial join algorithms are the two MapReduce-based algorithms proposed by Gupta et al., Controlled-Replicate and ε-Controlled-Replicate. Controlled-Replicate partitions the spatial objects of every join dataset and copies them to all grid cells in the fourth quadrant before performing the multi-way join. This clearly replicates a large number of spatial objects and reduces join efficiency. The authors therefore also proposed the improved algorithm ε-Controlled-Replicate, which reduces data replication and improves query efficiency to some extent but still replicates too much data. On the Spark platform, excessive replication makes the amount of data loaded into memory at one time too large, which prevents Spark's in-memory computation from being used to full advantage and lowers query efficiency.
A solution to the above problems could be applied to spatial big data join query processing and related application fields. The present invention therefore proposes a multi-way spatial join query processing algorithm for the Spark platform. The algorithm targets chain multi-way spatial join queries. It uses a grid-based partitioning of the data space, combined with Z-order encoding, to partition and encode the data, and it projects and replicates data objects according to their spatial positions. During the join, the algorithm uses edge filtering to discard useless join data, which reduces unnecessary computation in subsequent joins as well as redundant projection and replication of join objects. It also uses a duplicate avoidance strategy to reduce the output of duplicate results, thereby reducing the overall cost of subsequent join computation and improving the efficiency of multi-way join query processing.
Summary of the Invention
In view of the defects of the prior art, the object of the present invention is to provide a chain multi-way spatial join query processing algorithm based on Spark. The algorithm focuses on chain multi-way spatial join query processing, with emphasis on reducing spatial data replication and computation in the filtering stage, so as to reduce the cost of subsequent join computation and improve query processing efficiency. The algorithm also has good adaptability and scalability.
To achieve the above object, the technical solution of the present invention is as follows:
A chain multi-way spatial join query processing algorithm based on Spark, comprising the following steps:
Step 1: Using a grid partitioning method, divide the whole data space into many grid cells of equal size, and encode each grid cell using the Z-order space-filling curve;
Step 2: Project each spatial object in the m-way (m > 2) spatial join datasets R1, R2, ..., Rm into the corresponding grid cells according to its position in the data space, forming a series of key-value pairs, and store the projection results in the resilient distributed datasets RDD1, RDD2, ..., RDDm respectively; set the loop variable i = 2 and the intermediate result dataset RDDresultnew = RDD1;
Step 3: If the condition i < m holds, perform the spatial join operation Overlap(RDDresultnew, RDDi) on the two datasets RDDresultnew and RDDi. The computation performs, in order, data aggregation, edge filtering, spatial join computation, duplicate avoidance, and data replication, and produces the new intermediate result dataset RDDresultnew, i.e. RDDresultnew = Overlap(RDDresultnew, RDDi);
Step 4: Set i = i + 1 and repeat step 3 until the condition i < m no longer holds;
Step 5: Perform the final spatial join operation Overlap(RDDresultnew, RDDm). The computation performs, in order, data aggregation, edge filtering, and spatial join computation, outputs the results directly, and forms the final spatial join result set, which is saved to the HDFS file system.
Further, the data partitioning and encoding method is: using a grid-based partitioning method, the whole data space is divided into n grid cells of equal size, the grid cells are encoded using the Z-order space-filling curve, each spatial data object is projected into the grid cells according to its position, and all grid cells are mapped to multiple Executor execution units by hashing, so that the whole processing task is split into multiple parallel processing tasks.
Further, the spatial object projection is: each spatial data object is mapped into the corresponding grid cells according to its position. Let C = (c1, c2, ..., cn) denote a partition of the data space, where ci denotes a grid cell and also stands for that cell's Z-order code; let R be a set of spatial objects to be joined. If the MBR (minimum bounding rectangle) of a spatial object u ∈ R overlaps grid cell ci, then u is mapped into ci and the key-value pair (ci, u) is generated. If a spatial object overlaps multiple grid cells, one key-value pair is generated for each of them.
Further, step 3 specifically comprises the following steps:
Step 3-1: Compute Overlap(RDDresultnew, RDDi): perform a Cogroup operation on RDDresultnew and RDDi by Key value, which gathers the data of RDDresultnew and RDDi that share a Key value and yields RDDnew;
Step 3-2: Filter RDDnew using the filtering strategy to remove data pairs that cannot produce a result, then perform the actual spatial join operation;
Step 3-3: Apply the duplicate avoidance strategy to form the intermediate join result, then perform the data replication operation on the intermediate join result to form the new intermediate join result dataset RDDresultnew.
Further, step 5 comprises the following steps:
Step 5-1: Compute Overlap(RDDresultnew, RDDi), where i = m at this point: perform a Cogroup operation on RDDresultnew and RDDi by Key value, which gathers the data of RDDresultnew and RDDi that share a Key value and yields RDDnew;
Step 5-2: Filter RDDnew using the filtering strategy to remove data pairs that cannot produce a result, then perform the actual spatial join operation;
Step 5-3: Apply the duplicate avoidance strategy to form the final join result dataset RDDresultnew, which consists of tuples, and save it to the HDFS file system.
Further, the data replication operation is: for any tuple t in the intermediate join result set T generated by the previous spatial join operation on the current grid cell, let t.s be the spatial object involved in the next spatial join operation; if t.s overlaps some grid cell ci, then tuple t is copied to grid cell ci and the key-value pair (ci, t) is generated.
Further, the filtering strategy is: while the join operations are executed in parallel, the filtering strategy removes tuples that cannot produce a join result, and only tuples that may produce a join result are replicated.
Further, the filtering strategy comprises two parts:
Edge filtering: before a join operation is performed, first compute the bounding MBR of the spatial objects in the completed intermediate join results that are involved in the subsequent spatial join, then use this MBR to filter out the spatial objects in the dataset to be joined next that do not intersect it, thereby reducing the cost of subsequent join computation;
Replication-stage filtering: during multi-way join query processing, the intermediate results of the preceding joins must be replicated; they are copied only to other grid cells that may produce join results, which avoids losing join results, and only intermediate results that contain join objects spanning grid cells are replicated.
Further, the duplicate avoidance strategy is: when two spatial objects that each span multiple grid cells are joined, only the grid cell containing the lower-left corner of the region where the two objects overlap is responsible for outputting the result.
Compared with the prior art, the beneficial effects of the present invention are as follows: the chain multi-way spatial join query processing algorithm based on Spark partitions the data space using a grid partitioning method, projects and replicates data according to the positions of the spatial objects, filters out useless join objects during computation using edge filtering, and reduces data replication by narrowing the replication range. It significantly improves processing efficiency and reduces computation cost, and it has good adaptability and scalability.
Brief Description of the Drawings
Fig. 1 is an example of Z-order curve encoding in a specific embodiment of the invention;
Fig. 2 is a schematic diagram of data partitioning and task mapping in a specific embodiment of the invention;
Fig. 3 is an example of the data projection and replication operations in a specific embodiment of the invention;
Fig. 4 is an example of the edge filtering described in a specific embodiment of the invention;
Fig. 5 is an example of the duplicate avoidance described in a specific embodiment of the invention;
Fig. 6 is a schematic flowchart of the chain multi-way spatial join query processing algorithm based on Spark in a specific embodiment of the invention.
Detailed Description
To make the object, technical solution, and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings.
The present invention proposes a chain multi-way spatial join query processing algorithm based on Spark. The algorithm focuses on chain multi-way spatial join query processing, with emphasis on reducing spatial data replication and computation in the filtering stage, so as to reduce the cost of subsequent join computation and improve query processing efficiency. The algorithm also has good adaptability and scalability.
Embodiment 1:
The key techniques of the chain multi-way spatial join query processing algorithm based on Spark mainly comprise the following parts:
1) Data space partitioning and encoding: using a grid partitioning method, the whole space is divided into n grid cells of equal size, the grid cells are encoded using the Z-order space-filling curve, each spatial data object is projected into the grid cells according to its position, and all grid cells are mapped to multiple Executor execution units by hashing. The whole processing task is thus split into multiple parallel computation tasks, which improves the execution performance of the whole algorithm.
As shown in Fig. 1, grid cells are encoded using a space-filling curve in order to preserve the spatial relationships between spatial objects. Fig. 1 shows the Z-order curve for 1 and 2 bits. The Z-order curve is a kind of space-filling curve. The Z-order (z-ordering) technique represents the attribute information of a spatial object using bits: the data space is repeatedly subdivided, and each resulting subspace is assigned a number, called the z-order value of that subspace, which serves as the Key value of the data objects in that subspace.
Fig. 2 shows an example of task partitioning and mapping. As the figure shows, each grid cell produced by the partitioning, together with the data assigned to it, is distributed by hash mapping to the n Executor execution units of the Spark platform for parallel execution.
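As a rough illustration of the hash mapping, the sketch below assigns grid cell codes to execution units with a modulo hash. The function `assign_cells` and the modulo rule are assumptions made for illustration; the text only states that a hash mapping is used, and on a real cluster the assignment would be handled by Spark's partitioning.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def assign_cells(cell_codes: Iterable[int], num_executors: int) -> Dict[int, List[int]]:
    """Map each grid cell code to an executor index (hypothetical modulo hash)."""
    tasks: Dict[int, List[int]] = defaultdict(list)
    for code in cell_codes:
        tasks[code % num_executors].append(code)
    return dict(tasks)


# 16 cells of a 4x4 grid spread over 4 execution units.
assignment = assign_cells(range(16), 4)
```

Each execution unit then processes only the objects whose key is one of its cells, which is what makes the per-cell joins independent parallel tasks.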
2) Spatial object projection: each spatial data object is mapped into the corresponding grid cells according to its position. Let C = (c1, c2, ..., cn) denote a partition of the data space, where ci denotes a grid cell and also stands for that cell's Z-order code; let R be a set of spatial objects to be joined. If the MBR (minimum bounding rectangle) of a spatial object u ∈ R overlaps grid cell ci, then u is mapped into ci and the key-value pair (ci, u) is generated. If a spatial object overlaps multiple grid cells, one key-value pair is formed for each of them. The projection operation can be expressed as: Project(u, C) = {(ci, u) | ci ∈ C and the MBR of u overlaps ci}.
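The projection operation can be sketched as follows. This is a minimal sketch under stated assumptions: square cells of side `cell` anchored at the origin, a grid of 2^bits cells per axis, and the hypothetical helpers `Rect`, `z_order`, and `project`, none of which are named in the original.

```python
import math
from typing import List, NamedTuple, Tuple


class Rect(NamedTuple):
    """Minimum bounding rectangle (MBR) of a spatial object."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def z_order(col: int, row: int, bits: int) -> int:
    """Z-order code of cell (col, row); the bit order is an assumption."""
    code = 0
    for b in range(bits):
        code |= ((col >> b) & 1) << (2 * b)
        code |= ((row >> b) & 1) << (2 * b + 1)
    return code


def project(obj_id: str, mbr: Rect, bits: int, cell: float = 1.0) -> List[Tuple[int, str]]:
    """Return one (cell code, object id) pair per grid cell the MBR overlaps."""
    side = 1 << bits  # number of cells per axis

    def index(v: float) -> int:
        # Cell index of a coordinate, clamped to the grid.
        return max(0, min(side - 1, math.floor(v / cell)))

    cols = range(index(mbr.xmin), index(mbr.xmax) + 1)
    rows = range(index(mbr.ymin), index(mbr.ymax) + 1)
    return sorted((z_order(c, r, bits), obj_id) for c in cols for r in rows)


# An object spanning four cells of a 4x4 grid yields four key-value pairs.
pairs = project("u", Rect(1.5, 1.5, 2.5, 2.5), bits=2)
# pairs == [(3, 'u'), (6, 'u'), (9, 'u'), (12, 'u')]
```

An object that lies inside a single cell yields exactly one pair, so replication cost grows only with the number of cells an object actually crosses.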
3) Data replication: multi-way join query processing requires multiple join operations between multiple datasets. Data replication copies the intermediate results of the previous spatial join on the current grid cell to the other relevant grid cells so that subsequent join operations can be performed; like projection, it generates a series of key-value pairs. Let t ∈ T be a tuple in the intermediate join results and t.s the object that takes part in the subsequent spatial join; if t.s overlaps some grid cell ci, then tuple t is copied to grid cell ci and the key-value pair (ci, t) is generated. That is, the replication operation yields the set {(ci, t) | ci ∈ C and t.s overlaps ci}.
Fig. 3 is an example of the data projection and replication operations. It shows that each spatial object is projected into the grid cells it overlaps: object r1 is projected into grid cells 6 and 12, r2 into cells 9 and 12, and r3 into cells 9 and 11, i.e. Project(r1, C) = {(6, r1), (12, r1)}, Project(r2, C) = {(9, r2), (12, r2)}, and Project(r3, C) = {(9, r3), (11, r3)}. When r1, r2, and r3 are joined in sequence, r2 also overlaps grid cell 9, so the intermediate join result (r1, r2) of r1 and r2 in grid cell 12 must be copied to grid cell 9, forming the key-value pair (9, (r1, r2)). This allows the subsequent join with spatial object r3 in grid cell 9 and avoids losing join results.
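The replication step can be sketched in the same style as projection. The sketch below is illustrative only: cells are keyed by (column, row) rather than by Z-order code to keep it short, the grid is treated as unbounded, and `cells_overlapping` and `replicate` are hypothetical helper names.

```python
import math
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)
Cell = Tuple[int, int]                    # (col, row); Z-order encoding omitted


def cells_overlapping(mbr: Rect, cell: float = 1.0) -> List[Cell]:
    """All grid cells that the rectangle overlaps."""
    c0, c1 = math.floor(mbr[0] / cell), math.floor(mbr[2] / cell)
    r0, r1 = math.floor(mbr[1] / cell), math.floor(mbr[3] / cell)
    return [(c, r) for c in range(c0, c1 + 1) for r in range(r0, r1 + 1)]


def replicate(t: Tuple[str, ...], s_mbr: Rect, cell: float = 1.0) -> List[Tuple[Cell, Tuple[str, ...]]]:
    """Emit (ci, t) for every cell ci overlapped by t.s, whose MBR is s_mbr."""
    return [(ci, t) for ci in cells_overlapping(s_mbr, cell)]


# The intermediate tuple (r1, r2) follows r2, which spans two cells here.
pairs = replicate(("r1", "r2"), (0.5, 2.2, 1.5, 2.8))
# pairs == [((0, 2), ('r1', 'r2')), ((1, 2), ('r1', 'r2'))]
```

The coordinates are made up for the example and do not reproduce the geometry of Fig. 3. A tuple whose t.s lies inside one cell produces a single pair, so nothing is copied to other cells in that case.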
4) Filtering strategy: while the join operations are executed in parallel, the filtering strategy removes tuples that cannot produce a join result and replicates only tuples that may produce one, which greatly reduces storage and subsequent computation cost. It comprises the following two filtering strategies:
A: Edge filtering: first compute the bounding MBR of the relevant join objects in the results of the preceding joins, then use this MBR to filter out the spatial objects in the dataset to be joined next that do not intersect it, thereby reducing the cost of subsequent join computation.
Fig. 4 is an example of edge filtering, in which three datasets R, S, and T are joined in sequence by a three-way join (R with S, then the result with T). For the spatial objects projected into grid cell 3 as illustrated, the results of joining R and S are (r1, s1), (r1, s2), and (r1, s3), so the objects of set S in the previous join result set are s1, s2, and s3, and their bounding MBR is shown by the dashed line in the figure. When the join with the objects of dataset T is performed, the spatial objects t1, t4, and t5 projected into grid cell 3 that do not intersect this MBR can be filtered out directly, which avoids joining them with s1, s2, and s3 and greatly reduces the cost of subsequent computation.
B: Replication-stage filtering: during multi-way join query processing, the intermediate results of the preceding joins must be replicated, i.e. copied to the other grid cells that may produce join results, so that subsequent join operations can be performed without losing join results. When intermediate join results are replicated, only those that involve join objects spanning grid cells are copied, which avoids unnecessary replication.
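The edge filtering strategy (A) can be sketched as a bounding-rectangle test. The names `bounding_mbr`, `intersects`, and `edge_filter` are hypothetical, and the coordinates are invented for illustration; they do not reproduce Fig. 4.

```python
from typing import Dict, List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def bounding_mbr(rects: List[Rect]) -> Rect:
    """Smallest rectangle that encloses all the given rectangles."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


def intersects(a: Rect, b: Rect) -> bool:
    """True if the two rectangles share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def edge_filter(joined: List[Rect], candidates: Dict[str, Rect]) -> Dict[str, Rect]:
    """Keep only candidates that intersect the bounding MBR of the joined objects."""
    if not joined:
        return {}  # nothing to join against in this cell
    bound = bounding_mbr(joined)
    return {oid: r for oid, r in candidates.items() if intersects(r, bound)}


# MBRs of s1, s2, s3 from the previous join result in one grid cell.
s_objects = [(1.0, 1.0, 2.0, 2.0), (2.0, 2.0, 3.0, 3.0), (1.5, 2.5, 2.5, 3.5)]
# Objects of the next dataset projected into the same cell.
t_objects = {
    "t1": (0.0, 0.0, 0.5, 0.5),
    "t2": (2.5, 0.5, 3.5, 1.5),
    "t3": (0.5, 3.0, 1.2, 4.0),
    "t4": (3.5, 3.6, 4.0, 4.0),
    "t5": (0.0, 3.8, 0.8, 4.5),
}
kept = edge_filter(s_objects, t_objects)  # t1, t4 and t5 are discarded
```

The filter is conservative: in this example t3 intersects the bounding MBR but none of s1, s2, s3 individually, so it survives the filter and is rejected later by the exact join test. The benefit is that one rectangle comparison per candidate replaces a comparison against every joined object.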
5) Duplicate avoidance strategy: when two spatial objects that each span multiple grid cells are joined, only the grid cell containing the lower-left corner of the region where the two objects overlap is responsible for outputting the result. Because exactly one grid cell outputs each result, duplicate output is avoided and processing cost is reduced.
Fig. 5 shows an example of duplicate avoidance. Object s1 in set S is projected into the grid cells it overlaps, namely 2, 3, 6, 8, 9, and 12; object r1 in set R is projected into grid cells 3, 6, 9, and 12; and object r2 is projected into the four grid cells 8, 9, 10, and 11. Without duplicate avoidance, grid cells 3, 6, 9, and 12 would all output the same join result (r1, s1) during join processing, and grid cells 8 and 9 would both output the same join result (r2, s1), which is clearly duplication. Under the proposed duplicate avoidance strategy, as shown in Fig. 5, only the grid cell containing the lower-left corner of the region formed by the overlap of the two objects (points P and Q in the figure) outputs the result: grid cell 3 outputs the join result (r1, s1) of r1 and s1, and grid cell 8 outputs the join result (r2, s1) of r2 and s1. This strategy avoids repeated processing and duplicate output of results and reduces subsequent processing cost.
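The duplicate avoidance rule can be sketched as a reference-point test: a cell outputs a pair only if it contains the lower-left corner of the two objects' overlap region. The helper names `reference_cell` and `should_output`, the (column, row) cell keys, and the coordinates are assumptions for illustration and do not reproduce Fig. 5.

```python
import math
from typing import Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)
Cell = Tuple[int, int]                    # (col, row)


def intersects(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def reference_cell(a: Rect, b: Rect, cell: float = 1.0) -> Cell:
    """Cell containing the lower-left corner of the overlap region of a and b."""
    x, y = max(a[0], b[0]), max(a[1], b[1])
    return (math.floor(x / cell), math.floor(y / cell))


def should_output(current: Cell, a: Rect, b: Rect, cell: float = 1.0) -> bool:
    """Only the reference cell reports the joined pair (a, b)."""
    return intersects(a, b) and reference_cell(a, b, cell) == current


r1 = (0.5, 0.5, 2.5, 1.5)   # overlaps cells in columns 0..2, rows 0..1
s1 = (1.2, 0.2, 1.8, 2.2)   # overlaps cells in column 1, rows 0..2
shared_cells = [(1, 0), (1, 1)]  # cells where both objects are present
reporters = [c for c in shared_cells if should_output(c, r1, s1)]
# reporters == [(1, 0)]
```

Both shared cells find the pair during their local join, but only one of them emits it, so no later deduplication pass over the results is needed.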
The chain multi-way spatial join query Qm = Overlap(R1, R2, R3, ..., Rm) can, by definition, be expressed as Qm = Overlap(...Overlap(Overlap(R1, R2), R3), ..., Rm). The processing flow of the proposed chain multi-way spatial join query processing algorithm based on Spark is shown in Fig. 6 and mainly comprises the following steps:
A: Project the multi-way join datasets R1, R2, R3, ..., Rm according to the grid partitioning and encoding method. Use the cell code as the Key value and attribute information such as the identifier and MBR of each spatial object as the Value, forming a series of key-value pairs, and put the projection results of datasets R1, R2, R3, ..., Rm into the resilient distributed datasets RDD1, RDD2, RDD3, ..., RDDm respectively;
B: Compute Overlap(R1, R2): perform a Cogroup operation on RDD1 and RDD2, which gathers their data by Key value and yields RDDnew; filter RDDnew using the edge filtering strategy to remove data objects that cannot produce a result; perform the actual spatial join operation; apply the duplicate avoidance strategy to form the intermediate join result; and perform the data replication operation on the intermediate join result to form the intermediate result dataset RDDresultnew;
C: Using the same computation as step B, compute the join between RDDresultnew and RDD3 to obtain the latest intermediate join result RDDresultnew of R1, R2, R3. Repeat the same computation to join RDDresultnew with RDD4, RDD5, ..., RDDm-1 in turn, finally obtaining the intermediate join result dataset RDDresultnew of datasets R1, R2, R3, ..., Rm-1;
D: Perform a Cogroup operation on RDDresultnew and RDDm to generate a new RDDnew, then perform edge filtering and join processing on it and output the results directly, forming the final spatial join dataset RDDresultnew of datasets R1, R2, R3, ..., Rm, which is saved to the HDFS file system. Because this is the last spatial join operation, no replication is needed.
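Steps A to D can be sketched end to end in plain Python, with dictionaries keyed by grid cell standing in for the RDDs and a key intersection standing in for Cogroup. This is a single-process sketch under stated assumptions, not the patented Spark implementation: cells are unit squares keyed by (column, row), the grid is unbounded, results are returned instead of being written to HDFS, and all function names are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Rect = Tuple[float, float, float, float]       # (xmin, ymin, xmax, ymax)
Cell = Tuple[int, int]
Entry = Tuple[Tuple[str, ...], Rect]           # (tuple of object ids, MBR of t.s)


def cells(r: Rect) -> List[Cell]:
    """Unit grid cells overlapped by rectangle r."""
    return [(c, w) for c in range(math.floor(r[0]), math.floor(r[2]) + 1)
            for w in range(math.floor(r[1]), math.floor(r[3]) + 1)]


def intersects(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def project(dataset: Dict[str, Rect]) -> Dict[Cell, List[Entry]]:
    """Step A: one key-value pair per cell each object overlaps."""
    out: Dict[Cell, List[Entry]] = defaultdict(list)
    for oid, r in dataset.items():
        for c in cells(r):
            out[c].append(((oid,), r))
    return out


def overlap(left, right, last: bool) -> Dict[Cell, List[Entry]]:
    """Steps B to D for one pair of datasets."""
    result: Dict[Cell, List[Entry]] = defaultdict(list)
    for key in left.keys() & right.keys():                 # cogroup on the cell key
        lefts, rights = left[key], right[key]
        bound = (min(r[0] for _, r in lefts), min(r[1] for _, r in lefts),
                 max(r[2] for _, r in lefts), max(r[3] for _, r in lefts))
        for rid, rr in rights:
            if not intersects(rr, bound):                  # edge filtering
                continue
            for lid, lr in lefts:
                if not intersects(lr, rr):                 # actual spatial join test
                    continue
                ref = (math.floor(max(lr[0], rr[0])), math.floor(max(lr[1], rr[1])))
                if ref != key:                             # duplicate avoidance
                    continue
                targets = [key] if last else cells(rr)     # replicate unless final join
                for c in targets:
                    result[c].append((lid + rid, rr))
    return result


def chain_join(datasets: List[Dict[str, Rect]]) -> List[Tuple[str, ...]]:
    rdds = [project(d) for d in datasets]
    acc = rdds[0]                                          # RDDresultnew = RDD1
    for i in range(1, len(rdds)):
        acc = overlap(acc, rdds[i], last=(i == len(rdds) - 1))
    return sorted(t for entries in acc.values() for t, _ in entries)


R1 = {"a": (0.5, 0.5, 2.5, 1.5)}
R2 = {"b": (1.2, 0.2, 1.8, 2.2)}
R3 = {"c": (1.5, 2.0, 3.0, 2.5), "d": (5.0, 5.0, 6.0, 6.0)}
results = chain_join([R1, R2, R3])
# results == [('a', 'b', 'c')]
```

In this example a and b share two cells but the pair is emitted once, the tuple (a, b) is then replicated to every cell that b overlaps, and the final join finds c in one of those cells. Object a does not overlap c, which is allowed in a chain join because only consecutive datasets are compared.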
Embodiment 2:
A chain multi-way spatial join query processing algorithm based on Spark, comprising the following steps. Step 1: divide the whole data space into many grid cells of equal size, and encode each grid cell using the Z-order space-filling curve. Step 2: project each spatial object in the m-way (m > 2) spatial join datasets R1, R2, ..., Rm into the corresponding grid cells according to its position in the data space, and store the projection results in the resilient distributed datasets RDD1, RDD2, ..., RDDm; set the loop variable i = 2 and the intermediate result dataset RDDresultnew = RDD1. Step 3: if the condition i < m holds, perform the spatial join operation Overlap(RDDresultnew, RDDi) on the two datasets RDDresultnew and RDDi; the computation performs, in order, data aggregation, edge filtering, spatial join computation, duplicate avoidance, and data replication, and produces the new intermediate result dataset RDDresultnew, i.e. RDDresultnew = Overlap(RDDresultnew, RDDi). Step 4: set i = i + 1 and repeat step 3 until the condition i < m no longer holds. Step 5: perform the final spatial join operation Overlap(RDDresultnew, RDDm); the computation performs, in order, data aggregation, edge filtering, and spatial join computation, outputs the results directly, and forms the final spatial join result set, which is saved to the HDFS file system. The algorithm significantly improves processing efficiency and reduces computation cost, and it has good adaptability and scalability.
Embodiment 3:
A kind of chain type Multi way spatial join Query Processing Algorithm based on Spark, comprises the following steps:
Step 1:Using Meshing Method, whole data space is divided into many size identical grid cells, and
Each grid cell is encoded using Z-order space filling curves technology;
Step 2:By m (m>2) road space connection data set R1, R2..., RmIn each spatial object according to it in data
Position in space projects to corresponding grid cell, and forms a series of key-value pairs, and projection result is stored in into elasticity respectively
Distributed data collection RDD1, RDD2..., RDDmIn, set cyclic variable i=2, intermediate result data collection RDDresultnew=
RDD1;
Step 3:If meeting condition i<M, then to two datasets RDDresultnew, RDDiPerform space concatenation operation
Overlap(RDDresultnew,RDDi).In calculating process, carry out successively data aggregation, edge filtering, space connection calculate,
Repetition avoids being operated with data duplication etc., ultimately forms intermediate result data collection RDDresultnew, i.e. RDDresultnew=
Overlap(RDDresultnew,RDDi);
Step 4:I=i+1, performs step 3 until condition i<Untill m is unsatisfactory for;
Step 5:Perform last time space concatenation operation Overlap (RDDresultnew,RDDm), in calculating process, according to
It is secondary to carry out data aggregation, edge filtering, space connection calculating, and result is directly exported, form final space connection result collection
Close, and be saved in HDFS file system.
Further, the data are divided and coding method is:It is using the division methods based on grid that whole data are empty
Between be divided into n equal-sized grid cells, grid cell is encoded using Z-order space filling curves, spatial data
Object is projected to each grid cell according to its position, and all grid cells are mapped into multiple using Hash modes
Executor execution units so that whole process task is divided into multiple parallel process tasks.
Further, the spatial object is projected as:Spatial data object is mapped to accordingly according to its position
In grid cell, if C=(c1,c2,…,cn) represent a data space division, ciRepresent each grid cell;Let R be one
The spatial object set of class treatment to be connected, if spatial object u a ∈ R, its MBR and grid cell ciHave overlapping, the ciFor
The Z-order codings of grid cell, then be mapped to grid cell c by object uiIn, and generate corresponding key-value pair (ci, u), if
One spatial object has overlapping with multiple grid cells, then can generate multiple key-value pairs accordingly.
Further, step 3 specifically includes following steps:
Step 3-1:Calculate Overlap (RDDresultnew,RDDi), i.e., to RDDresultnew, RDDiHeld according to Key values
Row Cogroup is operated, will RDDresultnewAnd RDDiIn data be brought together according to Key values and obtain RDDnew;
Step 3-2:Using filtering policy to RDDnewFiltered, removed impossible resultful data pair, then carried out
Real space concatenation operation;
Step 3-3:Perform and repeat avoidance strategy, form connection intermediate result, and data are performed to connection intermediate result and answer
System operation, ultimately forms new middle connection result data set RDDresultnew。
Further, step 5 comprises the following steps:
Step 5-1: Compute Overlap(RDDresult_new, RDD_i), that is, perform a Cogroup operation on RDDresult_new and RDD_i
by key, so that the data in RDDresult_new and RDD_i with the same key are gathered together, yielding RDD_new;
Step 5-2: Apply the filtering strategy to RDD_new to remove data pairs that cannot produce a result, then perform
the actual spatial join operation;
Step 5-3: Apply the duplicate-avoidance strategy, forming the final join result data set RDDresult_new, which consists of
tuples, and save it to the HDFS file system.
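Steps 3 and 5 are the intermediate and final rounds of one chained loop over the m data sets. The following sketch shows only that control flow, under simplifying assumptions: objects are one-dimensional intervals `(id, (lo, hi))` rather than MBRs, `overlap` is a plain nested-loop join with no grid, filtering, or copying, and `save` is a stand-in for writing to HDFS. All names are hypothetical.

```python
def overlaps_1d(a, b):
    """True if two closed intervals (lo, hi) intersect."""
    return a[0] <= b[1] and b[0] <= a[1]


def overlap(result, dataset):
    """Extend each result tuple with every object that overlaps its last element."""
    return [t + (s,) for t in result for s in dataset
            if overlaps_1d(t[-1][1], s[1])]


def chain_join(datasets, save):
    """Fold m data sets (m > 2) left to right, mirroring the i < m loop."""
    m = len(datasets)
    result = [(o,) for o in datasets[0]]          # intermediate result starts as the first set
    i = 2
    while i < m:                                  # intermediate rounds (step 3)
        result = overlap(result, datasets[i - 1])
        i += 1
    result = overlap(result, datasets[m - 1])     # last round (step 5)
    save(result)                                  # stand-in for saving to HDFS
    return result


r1 = [("a", (0, 5))]
r2 = [("b", (4, 9)), ("c", (20, 30))]
r3 = [("d", (8, 12))]

saved = []
out = chain_join([r1, r2, r3], saved.append)
```

In this chained form only the most recently joined object of each tuple takes part in the next round, which is the role the text assigns to t.s.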
Further, the data copy operation is as follows: for any tuple t in the intermediate join result set T generated by
the previous spatial join operation on the current grid cell, let t.s be the spatial object
involved in the next spatial join operation. If t.s overlaps some grid cell c_i, then tuple t is copied to grid cell c_i and the corresponding
key-value pair (c_i, t) is generated.
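A minimal sketch of this copy operation, assuming that a tuple is a Python tuple of `(id, mbr)` objects whose last element plays the role of t.s, and that the grid is `grid_dim` x `grid_dim` cells of side `cell_size` anchored at the origin. The function names and the Z-order bit convention are assumptions of the example.

```python
def z_order(col, row, bits=16):
    """Z-order code of grid cell (col, row); column bits fill even positions."""
    code = 0
    for b in range(bits):
        code |= ((col >> b) & 1) << (2 * b)
        code |= ((row >> b) & 1) << (2 * b + 1)
    return code


def cells_overlapping(mbr, cell_size, grid_dim):
    """Codes of all grid cells overlapped by an MBR (xmin, ymin, xmax, ymax)."""
    last = grid_dim - 1
    c0, c1 = (min(max(int(v // cell_size), 0), last) for v in (mbr[0], mbr[2]))
    r0, r1 = (min(max(int(v // cell_size), 0), last) for v in (mbr[1], mbr[3]))
    return [z_order(c, r) for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)]


def copy_tuples(tuples, cell_size, grid_dim):
    """For each tuple t, emit (c_i, t) for every cell c_i that t.s overlaps."""
    pairs = []
    for t in tuples:
        s_mbr = t[-1][1]                 # t.s: the object used by the next join
        for cell in cells_overlapping(s_mbr, cell_size, grid_dim):
            pairs.append((cell, t))
    return pairs


t = (("a", (0, 0, 5, 5)), ("b", (8, 8, 12, 12)))
copied = copy_tuples([t], cell_size=10.0, grid_dim=4)
```

Note that only the extent of t.s decides where the tuple is sent; the earlier objects in the tuple travel along unchanged.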
Further, the filtering strategy is as follows: during the parallel execution of the join operations,
tuples that cannot produce a join result are removed, and only tuples that may produce a join result are copied.
Further, the filtering strategy comprises two parts:
Edge filtering: before a join operation is performed, the bounding MBR of the spatial objects in the
completed intermediate join results that are involved in the subsequent spatial join is computed first. This MBR is then used to filter out
the spatial objects in the data set to be joined next that do not intersect it, thereby reducing the cost of the subsequent join computation;
Copy-stage filtering: during multi-way join query processing, the intermediate results of the
preceding joins must undergo the data copy operation. They are copied only to the other grid cells in which they may produce join results,
which avoids losing join results. When intermediate join results are copied, only the intermediate results
that contain join objects spanning grid cells are copied.
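Both parts of the filtering can be sketched in a few lines. The example assumes that an intermediate result is a tuple of `(id, mbr)` objects whose last element is the object involved in the next join, and that MBRs are `(xmin, ymin, xmax, ymax)` tuples. The function names are hypothetical.

```python
def overlaps(a, b):
    """True if two MBRs (xmin, ymin, xmax, ymax) intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def bounding_mbr(mbrs):
    """Smallest rectangle enclosing all of the given MBRs."""
    xmins, ymins, xmaxs, ymaxs = zip(*mbrs)
    return (min(xmins), min(ymins), max(xmaxs), max(ymaxs))


def edge_filter(intermediate, next_dataset):
    """Keep only objects of the next data set that intersect the bounding MBR."""
    if not intermediate:
        return []                        # nothing to join against
    bound = bounding_mbr([t[-1][1] for t in intermediate])
    return [o for o in next_dataset if overlaps(o[1], bound)]


def spans_multiple_cells(mbr, cell_size):
    """True if the MBR crosses a grid line, so its tuple may need copying."""
    return (mbr[0] // cell_size != mbr[2] // cell_size
            or mbr[1] // cell_size != mbr[3] // cell_size)


intermediate = [
    (("a", (0, 0, 5, 5)), ("b", (4, 4, 9, 9))),
    (("c", (18, 18, 21, 21)), ("d", (20, 20, 25, 25))),
]
next_dataset = [("x", (8, 8, 10, 10)), ("y", (30, 30, 35, 35)), ("z", (0, 0, 2, 2))]

kept = edge_filter(intermediate, next_dataset)
```

The bounding rectangle is a coarse test: it can keep objects that join with nothing, but it never discards an object that could join, so the result stays complete.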
Further, the duplicate-avoidance strategy is as follows: when two spatial objects that each span multiple grid cells are joined,
only the grid cell containing the lower-left corner of the region formed by their mutual overlap is responsible for outputting the result.
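This rule can be sketched as a reference-point test. The example assumes MBRs given as `(xmin, ymin, xmax, ymax)`, grid cells addressed by `(col, row)` with side `cell_size`, and two objects that are already known to overlap. The function names are hypothetical.

```python
def reference_cell(a, b, cell_size):
    """Grid cell (col, row) holding the lower-left corner of the overlap of a and b."""
    x = max(a[0], b[0])                  # left edge of the overlap region
    y = max(a[1], b[1])                  # bottom edge of the overlap region
    return (int(x // cell_size), int(y // cell_size))


def should_output(a, b, current_cell, cell_size):
    """Only the cell that owns the reference point reports the pair."""
    return reference_cell(a, b, cell_size) == current_cell


# Both objects span the same six cells (columns 0..2, rows 0..1).
a = (8, 8, 22, 12)
b = (9, 9, 25, 11)

reporting = [(c, r) for c in range(3) for r in range(2)
             if should_output(a, b, (c, r), cell_size=10)]
```

Every cell can evaluate the test locally, so the duplicates are suppressed without any communication between cells or a later deduplication pass.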
Although specific embodiments of the present invention have been described above, those skilled in the art should understand that
they are merely illustrative. The present invention is a Spark-based chained multi-way spatial join query processing algorithm, and the examples
serve only to illustrate the core ideas of the filtering strategy, the duplicate-avoidance strategy, the join processing flow, and so on.
Larger-scale experiments may be carried out later and the algorithm further refined to improve the efficiency of data projection, copying, and filtering;
index techniques may also be considered to further improve the performance of the algorithm, without departing from the principle and essence of the present invention.
The scope of the present invention is limited only by the appended claims.
Claims (9)
1. A Spark-based chained multi-way spatial join query processing algorithm, characterized by comprising the following steps:
Step 1: Using a grid partitioning method, divide the whole data space into a number of equal-sized grid cells, and encode
each grid cell using the Z-order space-filling curve technique;
Step 2: Project each spatial object in the m-way (m > 2) spatial join data sets R_1, R_2, …, R_m onto the corresponding
grid cells according to its position in the data space, forming a series of key-value pairs; store the projection results in the resilient distributed
datasets RDD_1, RDD_2, …, RDD_m respectively; set the loop variable i = 2 and the intermediate result data set RDDresult_new = RDD_1;
Step 3: If the condition i < m holds, perform the spatial join operation Overlap(RDDresult_new, RDD_i)
on the two data sets RDDresult_new and RDD_i. During the computation, data aggregation, edge filtering, spatial join computation,
duplicate avoidance, and data copying are carried out in turn, finally forming the intermediate result data set RDDresult_new, i.e. RDDresult_new =
Overlap(RDDresult_new, RDD_i);
Step 4: Set i = i + 1 and repeat step 3 until the condition i < m no longer holds;
Step 5: Perform the last spatial join operation Overlap(RDDresult_new, RDD_m). During the computation,
data aggregation, edge filtering, and spatial join computation are carried out in turn; the results are output directly, forming the final spatial join result set, which is
saved to the HDFS file system.
2. The Spark-based chained multi-way spatial join query processing algorithm according to claim 1, characterized in that:
the data partitioning and encoding method is as follows: using a grid-based partitioning method, the whole data space is divided into n equal-sized
grid cells, and the grid cells are encoded using a Z-order space-filling curve; each spatial data object is projected
onto the grid cells according to its position, and all grid cells are mapped to multiple Executor execution units by hashing, so that
the whole processing task is divided into multiple parallel processing tasks.
3. The Spark-based chained multi-way spatial join query processing algorithm according to claim 1, characterized in that:
the spatial object projection is as follows: each spatial data object is mapped to the corresponding grid cells according to its position. Let C =
(c_1, c_2, …, c_n) denote a partition of the data space, where each c_i is a grid cell, identified by its Z-order code. Let R be a
set of spatial objects of one class to be joined. For a spatial object u ∈ R, if its minimum bounding rectangle (MBR) overlaps grid cell c_i,
then u is mapped to grid cell c_i and the corresponding key-value pair (c_i, u) is generated. If a spatial object
overlaps multiple grid cells, one key-value pair is generated for each of those cells.
4. The Spark-based chained multi-way spatial join query processing algorithm according to claim 1, characterized in that:
step 3 specifically comprises the following steps:
Step 3-1: Compute Overlap(RDDresult_new, RDD_i), that is, perform a Cogroup operation on RDDresult_new and RDD_i by key,
so that the data in RDDresult_new and RDD_i with the same key are gathered together, yielding RDD_new;
Step 3-2: Apply the filtering strategy to RDD_new to remove data pairs that cannot produce a result, then perform the actual
spatial join operation;
Step 3-3: Apply the duplicate-avoidance strategy to form the intermediate join result, then perform the data copy operation
on the intermediate join result, finally forming the new intermediate join result data set RDDresult_new.
5. The Spark-based chained multi-way spatial join query processing algorithm according to claim 1, characterized in that:
step 5 comprises the following steps:
Step 5-1: Compute Overlap(RDDresult_new, RDD_i), that is, perform a Cogroup operation on RDDresult_new and RDD_i by key,
so that the data in RDDresult_new and RDD_i with the same key are gathered together, yielding RDD_new;
Step 5-2: Apply the filtering strategy to RDD_new to remove data pairs that cannot produce a result, then perform the actual
spatial join operation;
Step 5-3: Apply the duplicate-avoidance strategy, forming the final join result data set RDDresult_new, which consists of tuples,
and save it to the HDFS file system.
6. The Spark-based chained multi-way spatial join query processing algorithm according to claim 1, characterized in that:
the data copy operation is as follows: for any tuple t in the intermediate join result set T generated by the previous spatial join operation
on the current grid cell, let t.s be the spatial object involved in the next spatial join operation; if t.s overlaps some
grid cell c_i, then tuple t is copied to grid cell c_i and the corresponding key-value pair (c_i, t) is generated.
7. The Spark-based chained multi-way spatial join query processing algorithm according to claim 4, characterized in that:
the filtering strategy is as follows: during the parallel execution of the join operations, tuples that cannot produce a join
result are removed, and only tuples that may produce a join result are copied.
8. The Spark-based chained multi-way spatial join query processing algorithm according to claim 7, characterized in that:
the filtering strategy comprises two parts:
Edge filtering: before a join operation is performed, the bounding MBR of the spatial objects in the completed intermediate join results
that are involved in the subsequent spatial join is computed first; this MBR is then used to filter out the spatial objects in the data set to be joined next
that do not intersect it, thereby reducing the cost of the subsequent join computation;
Copy-stage filtering: during multi-way join query processing, the intermediate results of the preceding joins
must undergo the data copy operation; they are copied only to the other grid cells in which they may produce join results,
which avoids losing join results; when intermediate join results are copied, only the intermediate results that contain join objects
spanning grid cells are copied.
9. The Spark-based chained multi-way spatial join query processing algorithm according to claim 4, characterized in that:
the duplicate-avoidance strategy is as follows: when two spatial objects that each span multiple grid cells are joined, only the grid cell containing
the lower-left corner of the region formed by their mutual overlap is responsible for outputting the result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710083816.9A CN106909639B (en) | 2017-02-16 | 2017-02-16 | Chained multi-path space connection query processing method based on Spark |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710083816.9A CN106909639B (en) | 2017-02-16 | 2017-02-16 | Chained multi-path space connection query processing method based on Spark |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106909639A (en) | 2017-06-30 |
CN106909639B (en) | 2020-09-29 |
Family
ID=59209302
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710083816.9A Active CN106909639B (en) | 2017-02-16 | 2017-02-16 | Chained multi-path space connection query processing method based on Spark |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106909639B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101692230A (en) * | 2009-07-28 | 2010-04-07 | 武汉大学 | Three-dimensional R tree spacial index method considering levels of detail |
US20140297585A1 (en) * | 2013-03-29 | 2014-10-02 | International Business Machines Corporation | Processing Spatial Joins Using a Mapreduce Framework |
CN104391679A (en) * | 2014-11-18 | 2015-03-04 | 浪潮电子信息产业股份有限公司 | GPU (graphics processing unit) processing method for high-dimensional data stream in irregular stream |
US20160055207A1 (en) * | 2014-08-19 | 2016-02-25 | International Business Machines Corporation | Processing Multi-Way Theta Join Queries Involving Arithmetic Operators on Mapreduce |
CN106055563A (en) * | 2016-05-19 | 2016-10-26 | 福建农林大学 | Method for parallel space query based on grid division and system of same |
CN106209989A (en) * | 2016-06-29 | 2016-12-07 | 山东大学 | Spatial data concurrent computational system based on spark platform and method thereof |
Non-Patent Citations (1)
Title |
---|
BAIYOU QIAO et al.: "A Boundary Filtering Based Spatial Join Query Processing Optimization Algorithm", 《2015 12TH INTERNATIONAL CONFERENCE ON FUZZY SYSTEMS AND KNOWLEDGE DISCOVERY(FSKD)》 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113722314A (en) * | 2020-12-31 | 2021-11-30 | 京东城市(北京)数字科技有限公司 | Space connection query method and device, electronic equipment and storage medium |
CN113722314B (en) * | 2020-12-31 | 2024-04-16 | 京东城市(北京)数字科技有限公司 | Space connection query method and device, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN106909639B (en) | 2020-09-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112132287B (en) | Distributed quantum computing simulation method and device | |
US10762087B2 (en) | Database search | |
CN108228724A (en) | Power grid GIS topology analyzing method and storage medium based on chart database | |
CN112784968A (en) | Hybrid pipeline parallel method for accelerating distributed deep neural network training | |
CN103116625A (en) | Volume radio direction finde (RDF) data distribution type query processing method based on Hadoop | |
CN104462351B (en) | A kind of data query model and method towards MapReduce patterns | |
CN106209989A (en) | Spatial data concurrent computational system based on spark platform and method thereof | |
CN104504154A (en) | Method and device for data aggregate query | |
CN105204920B (en) | A kind of implementation method and device of the distributed computing operation based on mapping polymerization | |
CN116644804B (en) | Distributed training system, neural network model training method, device and medium | |
CN104951442B (en) | A kind of method and apparatus of definitive result vector | |
CN111831354B (en) | Data precision configuration method, device, chip array, equipment and medium | |
CN113971367B (en) | Automatic convolutional neural network framework design method based on shuffled frog-leaping algorithm | |
CN103164495A (en) | Half-connection inquiry optimizing method based on periphery searching and system thereof | |
CN107870949A (en) | Data analysis job dependence relation generation method and system | |
CN104301212B (en) | Functional chain combination method | |
CN111461284A (en) | Data discretization method, device, equipment and medium | |
CN106909639A (en) | A kind of chain type Multi way spatial join Query Processing Algorithm based on Spark | |
CN110750560A (en) | System and method for optimizing network multi-connection | |
CN109767002A (en) | A kind of neural network accelerated method based on muti-piece FPGA collaboration processing | |
CN117708169A (en) | Database query optimization method and device, electronic equipment and storage medium | |
CN116383247A (en) | Large-scale graph data efficient query method | |
CN106780747A (en) | A kind of method that Fast Segmentation CFD calculates grid | |
CN115001978A (en) | Cloud tenant virtual network intelligent mapping method based on reinforcement learning model | |
CN110083609B (en) | Real-time query method for graph structure data in rail transit network passenger flow data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||