CN106777093B - Skyline inquiry system based on space time sequence data flow application - Google Patents

Skyline inquiry system based on space time sequence data flow application

Info

Publication number
CN106777093B
Authority
CN
China
Prior art keywords
skyline
time
data
query
grid
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611150565.3A
Other languages
Chinese (zh)
Other versions
CN106777093A (en
Inventor
季长清
王宝凤
谢雨婧
李媛媛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University
Original Assignee
Dalian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University
Priority to CN201611150565.3A
Publication of CN106777093A
Application granted
Publication of CN106777093B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2457 Query processing with adaptation to user needs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2264 Multidimensional index structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9537 Spatial or temporal dependent retrieval, e.g. spatiotemporal queries
    • G06F19/32

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Mobile Radio Communication Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A Skyline query system based on spatial time-series data stream applications belongs to the field of dynamic Skyline queries over data streams and addresses the problem of real-time query processing of massive data. The system comprises a cloud center service system, which includes: a partitioning module that divides a continuous time series into several time slices, using the time window as the unit; an inverted grid index generation module that builds an inverted grid index for each time slice; and a computing module that maps the query point at a given moment to its corresponding Skyline grid, obtains the global Skyline grids as a candidate set using a global Skyline grid computation method, and then performs a time-ordered dynamic Skyline query on the network node data in the candidate set to compute the valid global Skyline result. Effect: the result query is evaluated at the moment at which execution ends, so the result is more accurate and closer to the actual situation.

Description

Skyline inquiry system based on space time sequence data flow application
Technical Field
The invention relates to the field of dynamic Skyline queries over data streams, and in particular to a Skyline query system based on spatial time-series data stream applications. It involves large-scale data analysis, the processing of massive spatial time-series data, and global Skyline computation.
Background
With the rapid development of the internet and the Internet of Things, and the wide adoption of technologies such as social networks and cloud computing, massive-data technologies have advanced rapidly. Massive data is collected, recorded, and used for research and analysis in science, engineering, commerce, and other fields. Recent studies show that data sources such as the global internet, the mobile internet, and GPS networks generate more than $2.5 \times 10^{18}$ bytes of data per day, and that these data come from a wide range of sources. The volume of data on the internet doubles every two years, and massive data is continually added by the Internet of Things, the mobile internet, the Internet of Vehicles, and various sensor networks. This explosive growth makes traditional single-machine data analysis and processing techniques increasingly unsuitable for today's data-intensive analysis and processing requirements. To save cost and to provide a distributed processing framework for storing and computing over large-scale data, related technologies such as cloud computing, big data, cloud storage, MapReduce, and BigTable have been proposed.
According to a Cisco forecast, in 2016 79% of the world's data centers host cloud computing platforms. Massive data is stored on these platforms, and because the data volume is so large, massive-data processing places very high demands on software and hardware, consumes a great deal of system resources, and leads to low algorithm efficiency. Researchers have proposed many new, efficient massive-data processing algorithms that build on cloud computing platforms. The Skyline algorithm is one such efficient data query and extraction method: it extracts key information from massive data quickly, greatly reduces the data volume, lowers the software and hardware requirements of massive-data processing, and improves processing efficiency. As a data extraction and processing method, the Skyline algorithm is mainly concerned with finding the information that people are most interested in within a huge data set. It is widely used in massive-data analysis and processing, for example in multi-objective decision making, store site selection, environmental monitoring, image retrieval, personalized recommendation, and data mining. A Skyline query gives the user a multi-attribute evaluation principle during decision making, and the evaluation function can use different measures (such as Euclidean distance or spatial distance) depending on the application, which improves the user's quality of experience. Skyline computation can help market analysts set prices and market strategies from massive business transaction records. In environmental monitoring, potential natural disasters and risks can be analyzed and assessed from the massive data accumulated by sensor networks.
The Skyline algorithm has many variants, and these variants cover a wider range of application scenarios. Each variant has its own characteristics and its own problems. Most existing MapReduce-based Skyline algorithms are static Skyline algorithms and cannot solve the problems of the Skyline variants in general, so MapReduce-based Skyline algorithms need further research and extension. Beyond the need for MapReduce support, the variant algorithms still face problems of their own. For example, the subspace Skyline handles the heavy computation caused by high-dimensional data well, but the returned result set is too large and most of the results are not what the user needs, which does not suit the current trend toward queries from mobile internet terminals. In the dynamic Skyline, the attribute values of the queried objects change as the query point changes, so the computation is heavy and the demands on real-time performance, response time, and user experience are high; the partitioning and indexing schemes currently used by MapReduce-based Skyline algorithms cannot meet these demands. The metric-space Skyline suffers from the problem of modeling the metric space and from high query complexity, which affect query precision and increase the amount of computation. Because all attribute values in the dynamic Skyline change with the query point, processing massive data runs into both heavy computation and strict real-time requirements. For example, dynamic Skyline queries issued by mobile phone users have very high real-time requirements, and in the big data era the data generated by mobile terminals has become a main source of data growth.
Given this trend, dynamic Skyline algorithms for a centralized environment cannot cope with massive data, and the partitioning schemes commonly used by existing MapReduce-based Skyline algorithms do not meet the requirement either. The parallel dynamic reverse Skyline query implemented with MapReduce that has been proposed in the literature relies on quadtree (rsky-quadtree) partitioning. Its disadvantage is that, for each query point q, an extra step is required to convert the coordinates p of each data point to p', after which the quadtree must be rebuilt. With large data, the coordinate transformation and the rebuilding of the quadtree incur a heavy overhead. To solve these problems, this patent gives definitions of the Skyline grid and the global Skyline grid and, on that basis, proposes a dynamic Skyline query algorithm based on spatial time-series data stream applications. The main idea is to divide the dynamically changing data space, time window by time window, into non-uniform Skyline grids with timestamps, that is, to build an inverted grid index structure ordered by time. When a query point arrives, the current query time is determined and the query ending time is predicted (the prediction can be based on the average system execution time or on sampling, and is represented by the lower limit of the execution time window). The dominance relations among the Skyline grids in the four quadrants around the query point at the ending time are then computed by polling, the global Skyline grids are obtained by comparing these dominance relations, and the data in the global Skyline grids form the candidate set for the subsequent dynamic Skyline computation.
The method prunes effectively in real time and saves a large amount of unnecessary computation, and it also adjusts dynamically as time changes, which speeds up the dynamic Skyline query and makes the result more accurate. To verify the algorithm proposed in this patent, a system prototype was designed and applied to the detection of abnormal conditions in network monitoring.
Existing MapReduce-based Skyline algorithms offer little support for time-based subspace Skyline queries or for time-ordered dynamic Skyline queries in a parallel environment. For example, some MapReduce-based Skyline algorithms modify the Hadoop framework but still suffer from poor scalability and poor generality. The MapReduce-based dynamic Skyline query methods designed in the prior art can only process non-real-time data in offline batches and are not well suited to real-time data queries. These methods are no longer adequate for today's explosively growing data queries, and the invention was designed and implemented with this as its starting point.
Disclosure of Invention
In view of the defects and shortcomings described in the background, the invention provides a Skyline query system based on spatial time-series data stream applications in a cloud computing environment, in order to remedy the defects of existing dynamic Skyline query methods for data streams and to improve accuracy, processing efficiency, and user experience.
A Skyline query system based on spatial time-series data stream applications comprises a cloud center service system, which includes: a partitioning module that divides a continuous time series into several time slices, using the time window as the unit; an inverted grid index generation module that builds an inverted grid index for each time slice; and a computing module that maps the query point at a given moment to its corresponding Skyline grid, obtains the global Skyline grids as a candidate set using the global Skyline grid computation method, and then performs a time-ordered dynamic Skyline query on the network node data in the candidate set to compute the valid global Skyline result.
The partitioning module performs the spatial time-series partitioning as follows: given a set of objects $P$, for each data point $p_k$ a uniform partition $\{t_0, \ldots, t_B\}$ is constructed over the bounded interval $[T_{min}, T_{max}]$, where $t_i$ is defined by $t_i = T_{min} + l \times i$, $l = (T_{max} - T_{min})/B$, $i = 0, \ldots, B$.
This forms a set of time slices $\{b_0, \ldots, b_{B-1}\}$, each time slice $b_i = [t_i, t_{i+1})$ having fixed length $l$, where $B$ is the number of slices into which the bounded interval is uniformly divided. Each point with time attribute value $t$ is mapped to a time slice $b_{s(t)} \in \{b_0, \ldots, b_{B-1}\}$, where $s(t)$ is defined as follows:
Figure BDA0001179664940000041
In the inverted grid index generation module, the inverted grid index for each time slice is generated as follows: given a set of d-dimensional spatial objects $P = \{p_1, \ldots, p_n\}$, each data point $p_k \in P$ has d-dimensional attributes $\{p_k.x_1, \ldots, p_k.x_d\}$. The d-dimensional data space is divided into grid cells of equal width, each unit cell having one width per dimension (dimensions $1, \ldots, d$). The cell width is chosen according to the values in each dimension so that the mapped data points are distributed uniformly over the cells. All points within the same time slice are scanned,
Figure BDA0001179664940000042
and point $p_k$ is mapped to grid coordinates
Figure BDA0001179664940000043
The coordinate mapping is as follows:
Figure BDA0001179664940000044
In the computing module, the global Skyline grid computation method is as follows: the query point $q$ is mapped to its corresponding grid cell $c_q$, and the whole grid area is divided into an influence region and a dominated region. The influence region comprises the non-empty cells surrounding $c_q$ and the cells on the same horizontal or vertical line as $c_q$; the dominated region is the region dominated by the influence region. To search the influence region, a quadrant polling method is used: the dominance relations among the non-empty Skyline grids in each quadrant around the query point are computed by gradual expansion, and the global Skyline grids and the data points within them are obtained by comparing these dominance relations.
The Skyline grid dominance relation is as follows: given a query point $q$ in d-dimensional space and any two non-empty Skyline grids $c_i, c_j$ in the Skyline grid set $C$,
Figure BDA0001179664940000049
when the following conditions are satisfied simultaneously:
Figure BDA0001179664940000045
$(c_i(t) - q(t))(c_j(t) - q(t)) > 0$;
Figure BDA0001179664940000046
$|c_i(t) - q(t)| \le |c_j(t) - q(t)|$;
Figure BDA0001179664940000047
$|c_i(t) - q(t)| < |c_j(t) - q(t)|$,
Skyline grid $c_i$ dominates Skyline grid $c_j$ with respect to $q$.
Global Skyline grid: given a grid set $C$, the global Skyline grid of $C$ is the set of grids that are not globally dominated by any other grid, defined as:
Figure BDA0001179664940000048
When the index is built, a MapReduce processing flow is used: several Map tasks are started to read the streaming data simultaneously, each Map reading a different HDFS data fragment and generating <key, value> data pairs, where the key is a spatio-temporal index and the value is a hashmap data structure that stores the corresponding data points obtained from the partitioning. The intermediate data produced by each Map is a sub-index over part of the data and is sorted by key; a Reduce is then called to merge the sub-indexes into the final index.
During the spatial time-series partitioning, a monitoring time range is set and a threshold is set accordingly. If the query time exceeds the specified time range, the query must span several time windows; the extent of the time windows to be spanned is evaluated, and if it exceeds the time threshold, the query fails directly.
Beneficial effects: using the related techniques, the spatial time-series data stream system can process large amounts of information accurately and efficiently according to user requirements, upload the information to a cloud server for analysis, and feed the final conclusion back to the user.
Drawings
FIG. 1 shows the time-series-based partitioning;
FIG. 2 shows the time-series-based inverted index structure;
FIG. 3 shows the grid-based inverted index creation process;
FIG. 4 shows an example of an index generated with MapReduce;
FIG. 5 shows the global Skyline grid.
Detailed Description
Example 1:
This embodiment describes the Skyline query method based on spatial time-series data stream applications. The method comprises the following steps:
S1, spatial time-series partitioning:
A continuous time series is divided into several time slices, using the time window as the unit. As shown in FIG. 1, the method is as follows: given a set of objects $P$, for each data point $p_k$ a uniform partition $\{t_0, \ldots, t_B\}$ is constructed over the bounded interval $[T_{min}, T_{max}]$, where $t_i$ is defined as:
$t_i = T_{min} + l \times i$, $l = (T_{max} - T_{min})/B$, $i = 0, \ldots, B$;
This forms a set of time slices $\{b_0, \ldots, b_{B-1}\}$, each time slice $b_i = [t_i, t_{i+1})$ having fixed length $l$. Each point with time attribute value $t$ is mapped to a time slice $b_{s(t)} \in \{b_0, \ldots, b_{B-1}\}$, where $s(t)$ is defined as
Figure BDA0001179664940000051
and $B$ is the number of slices into which the bounded interval is uniformly divided.
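The uniform partition above can be sketched in a few lines of Python. The formula for $s(t)$ is only available as an image in this text, so the sketch assumes $s(t) = \lfloor (t - T_{min}) / l \rfloor$, which is the form implied by the half-open slices $b_i = [t_i, t_{i+1})$; clamping $t = T_{max}$ into the last slice is a further assumption, and the function name `make_time_slicer` is hypothetical.

```python
import math


def make_time_slicer(t_min: float, t_max: float, num_slices: int):
    """Build the boundaries t_0..t_B and the slice-index function s(t)."""
    if num_slices <= 0 or t_max <= t_min:
        raise ValueError("need t_max > t_min and at least one slice")
    length = (t_max - t_min) / num_slices  # l = (Tmax - Tmin) / B
    boundaries = [t_min + length * i for i in range(num_slices + 1)]

    def s(t: float) -> int:
        if not (t_min <= t <= t_max):
            raise ValueError("timestamp outside the bounded interval")
        # Assumed form of s(t): floor((t - Tmin) / l).
        # Assumption: t == Tmax is clamped into the last slice b_{B-1}.
        return min(math.floor((t - t_min) / length), num_slices - 1)

    return boundaries, s


boundaries, s = make_time_slicer(0.0, 60.0, 6)
print(boundaries)                      # [0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
print(s(0.0), s(9.99), s(10.0), s(60.0))  # 0 0 1 5
```

A timestamp exactly on a boundary such as `10.0` falls into the slice that starts there, matching the half-open interval definition.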
The fixed slice length $l$ is chosen at different granularities according to the practical application. To reduce computation, a monitoring time range is set along with a threshold: if a query exceeds the specified time range, it must span several time windows; the extent of the windows to be spanned is evaluated, and if it exceeds the threshold, the query fails directly. Because a time window is introduced, the monitoring range must be delimited further. If the time window is small and little data accumulates, a batch-stream caching method is used: the data stream is cached and then sent periodically in batches. If the time window is large and the data volume is large, the data stream is split by window, with the splitting granularity determined by the actual application scenario. The monitoring range therefore has an upper and a lower limit, and a query that exceeds it is handled as a failed query. This treatment also matches real query applications: for example, a vehicle travelling fast may leave a given application area, after which the query need not continue. Experimental tests show that the results are relatively good when the computation follows the sampled distribution probability.
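The admission rule in this paragraph can be illustrated with a small Python sketch. The text does not say how the "size of the time windows to be spanned" is measured, so the sketch assumes it is the number of fixed-length windows between the query start and the predicted ending time; the names `windows_spanned` and `admit_query` and all parameters are hypothetical.

```python
def windows_spanned(start: float, predicted_end: float, window_length: float) -> int:
    """Count the fixed-length time windows touched between start and predicted end."""
    first = int(start // window_length)
    last = int(predicted_end // window_length)
    return last - first + 1


def admit_query(start, predicted_end, window_length, monitor_range, max_windows) -> bool:
    """Return False (query fails directly) when the query leaves the
    monitoring range or spans more windows than the threshold allows."""
    low, high = monitor_range
    if start < low or predicted_end > high:
        return False  # outside the upper/lower limit of the monitoring range
    return windows_spanned(start, predicted_end, window_length) <= max_windows


# Windows of length 10, monitoring range [0, 60], at most 3 windows per query.
print(admit_query(5, 18, 10, (0, 60), 3))   # True: spans windows 0 and 1
print(admit_query(5, 47, 10, (0, 60), 3))   # False: spans 5 windows
print(admit_query(55, 65, 10, (0, 60), 3))  # False: ends outside the range
```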
S2, building an inverted grid index for each time slice:
In this step, a time-series-based inverted grid index data structure is designed, as shown in FIG. 2. For each time slice, the current time is determined and the ending time (i.e. the lower limit of the execution time window) is estimated; the grid is then given an inverted index, and the index generation process is shown in FIG. 3. Given a set of d-dimensional spatial objects $P = \{p_1, \ldots, p_n\}$, each data point $p_k \in P$ has d-dimensional attributes $\{p_k.x_1, \ldots, p_k.x_d\}$. The d-dimensional data space is divided into grid cells of equal width, each unit cell having one width per dimension (dimensions $1, \ldots, d$). The cell width is chosen according to the values in each dimension so that the mapped data points are distributed as uniformly as possible. All points within the same time slice are scanned,
Figure BDA0001179664940000061
and point $p_k$ is mapped to grid coordinates
Figure BDA0001179664940000062
The coordinate mapping is as follows:
Figure BDA0001179664940000063
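As a rough illustration of step S2, the following Python sketch builds an inverted grid index per time slice. The coordinate-mapping formula survives only as an image here, so the sketch assumes the common choice of flooring each coordinate divided by the cell width in that dimension; `cell_of`, `build_inverted_grid_index`, and the sample records are hypothetical.

```python
import math
from collections import defaultdict


def cell_of(point, widths):
    """Map a d-dimensional point to integer grid coordinates.
    Assumption: coordinate k is floor(x_k / width_k)."""
    return tuple(math.floor(x / w) for x, w in zip(point, widths))


def build_inverted_grid_index(records, widths, slice_of):
    """records: iterable of (timestamp, point).
    Returns {slice_id: {cell: [points]}}; only non-empty cells are stored."""
    index = defaultdict(lambda: defaultdict(list))
    for t, point in records:
        index[slice_of(t)][cell_of(point, widths)].append(point)
    return {slice_id: dict(cells) for slice_id, cells in index.items()}


records = [(1.0, (3, 40)), (2.0, (14, 44)), (12.0, (31, 7))]
index = build_inverted_grid_index(
    records,
    widths=(15, 15),                   # cell width 15 in both dimensions
    slice_of=lambda t: int(t // 10),   # time slices of length 10
)
print(index)
# {0: {(0, 2): [(3, 40), (14, 44)]}, 1: {(2, 0): [(31, 7)]}}
```

Because the index is keyed by cell rather than by point, a query can look up the non-empty cells of a time slice without scanning the raw data.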
In steps S1 and S2, the two processes of time-series partitioning and grid index generation use a MapReduce processing flow: several Map tasks are started to read the streaming data simultaneously, each Map reading a different HDFS data fragment and generating <key, value> data pairs, where the key is a spatio-temporal index and the value is a hashmap data structure that stores the corresponding data points obtained from the partitioning. The intermediate data produced by each Map, i.e. a sub-index over part of the data, is sorted automatically by key. To ensure data integrity and consistency, a Reduce is finally called to merge the sub-indexes into the index. Generating the time-series-based inverted index is a preprocessing step: the index is generated in advance for use by subsequent queries and does not consume query time, which makes it an effective way to manage the data. The ability of MapReduce to process big data in parallel is well suited to this work.
Using the Spark streaming system, several Map tasks are started simultaneously to read the time-stream data, each Map reading a different HDFS data fragment and generating <key, value> data pairs, where the key is a spatio-temporal index and the value is a hashmap data structure that stores the corresponding data points obtained from the partitioning. FIG. 4 shows the intermediate data obtained by each Map, with the number of time slices B set to n and the grid width set to 15; each piece of intermediate data is a sub-index over part of the data, and sorting by key is done automatically.
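The map and merge steps described above can be mimicked in plain Python, without Hadoop or Spark, to show the shape of the <key, value> pairs. This is only a single-process sketch under stated assumptions: the key is taken to be a (time slice, grid cell) tuple, point identifiers are invented for the example, and `map_fragment` and `reduce_sub_indexes` are hypothetical names rather than framework APIs.

```python
from collections import defaultdict


def map_fragment(fragment, slice_of, cell_of):
    """One 'Map' task: build a sub-index for a single data fragment.
    Emits (key, value) pairs sorted by key, where key = (slice, cell)
    and value is a hash map {point_id: point}."""
    sub_index = defaultdict(dict)
    for point_id, t, point in fragment:
        sub_index[(slice_of(t), cell_of(point))][point_id] = point
    return sorted(sub_index.items(), key=lambda kv: kv[0])


def reduce_sub_indexes(sub_indexes):
    """The single 'Reduce' call: merge all sub-indexes into one index."""
    merged = defaultdict(dict)
    for pairs in sub_indexes:
        for key, value in pairs:
            merged[key].update(value)
    return dict(sorted(merged.items(), key=lambda kv: kv[0]))


slice_of = lambda t: int(t // 10)
cell_of = lambda p: tuple(int(x // 15) for x in p)

fragments = [
    [("a", 1.0, (3, 40)), ("b", 12.0, (31, 7))],  # fragment read by Map 1
    [("c", 2.0, (14, 44))],                        # fragment read by Map 2
]
index = reduce_sub_indexes(map_fragment(f, slice_of, cell_of) for f in fragments)
print(index)
# {(0, (0, 2)): {'a': (3, 40), 'c': (14, 44)}, (1, (2, 0)): {'b': (31, 7)}}
```

Points "a" and "c" arrive through different Map tasks but share a key, so the merge step places them in the same hash map.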
Compared with prior work, the method has two optimizations. The first is that the result query is evaluated at the moment execution ends, which is more representative. For example, if a vehicle travelling fast on a highway issues a query request, the query result should be filtered according to the time at which the query ends, so that the result is more accurate and matches the actual situation. The second is that a Spark stream processing system is used: the results of the Map computation are distributed and cached as a stream rather than written to HDFS, which greatly speeds up the computation.
S3, computing the global Skyline grids:
When facing massive data, a coarse-grained global Skyline grid computation method is proposed to reduce the amount of computation, and the data in the global Skyline grids obtained by polling is used as the candidate set. Compared with the original data set, the candidate set contains far less data, which reduces the number of dominance comparisons in the subsequent dynamic Skyline computation; the process is similar to pruning. The definitions of the Skyline grid dominance relation and of the global Skyline grid are given below.
Definition (Skyline grid dominance): given a query point $q$ in d-dimensional space and any two non-empty Skyline grids $c_i, c_j$ in the Skyline grid set $C$, Skyline grid $c_i$ dominates Skyline grid $c_j$ with respect to $q$, namely
Figure BDA0001179664940000075
if the following conditions are satisfied simultaneously:
Figure BDA0001179664940000071
$(c_i(t) - q(t))(c_j(t) - q(t)) > 0$;
Figure BDA0001179664940000072
$|c_i(t) - q(t)| \le |c_j(t) - q(t)|$;
Figure BDA0001179664940000073
$|c_i(t) - q(t)| < |c_j(t) - q(t)|$.
Definition (global Skyline grid): given a grid set $C$, the global Skyline grid (GSC) of $C$ is the set of all grids that are not globally dominated by any other grid:
Figure BDA0001179664940000074
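The two definitions can be made concrete with a short Python sketch. The quantifiers in front of the three conditions are only present as images in this text; the sketch assumes the usual reading, namely that the first two conditions hold for every dimension $t$ and the third holds for at least one dimension. The function names and the sample cells are hypothetical.

```python
def dominates(ci, cj, q):
    """True if grid cell ci dominates grid cell cj with respect to q.
    Assumed quantifiers: conditions 1 and 2 for all dimensions,
    condition 3 for at least one dimension."""
    dims = list(zip(ci, cj, q))
    same_side = all((a - c) * (b - c) > 0 for a, b, c in dims)      # condition 1
    not_farther = all(abs(a - c) <= abs(b - c) for a, b, c in dims)  # condition 2
    closer_once = any(abs(a - c) < abs(b - c) for a, b, c in dims)   # condition 3
    return same_side and not_farther and closer_once


def global_skyline_cells(cells, q):
    """Cells of the set that no other cell globally dominates."""
    return [c for c in cells
            if not any(dominates(other, c, q) for other in cells if other != c)]


q = (5, 5)  # grid coordinates of the query cell
cells = [(6, 6), (8, 8), (6, 9), (3, 7), (2, 9), (5, 8)]
print(global_skyline_cells(cells, q))  # [(6, 6), (3, 7), (5, 8)]
```

Cell `(5, 8)` shares a coordinate with the query cell, so the product in the first condition is zero and it can neither dominate nor be dominated; under this reading such cells always remain in the result.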
The overhead of a dynamic Skyline query is directly related to the size of the data set. In particular, judging dominance relations among massive data in real time is expensive, and every query has to be recomputed. The global Skyline grid concept enables effective coarse-grained pruning, and the candidate set obtained in this way is the basis for the subsequent dynamic Skyline query computation. The coarse-grained pruning process is described in detail below.
As shown in FIG. 5, the query point $q$ is mapped to its corresponding grid cell $c_q$, and the entire grid area is divided into an influence region and a dominated region. The influence region comprises the non-empty cells $c_1, c_2, c_3, \ldots, c_8$ surrounding $c_q$ and the cells on the same horizontal or vertical line as $c_q$, e.g. cell $c_9$; the dominated region is the region dominated by the influence region, e.g. cell $c_{10}$ in the second quadrant. To search the influence region, a $2^d$-quadrant polling method is used (where $d$ is the dimensionality of the data set): the dominance relations among the non-empty Skyline grids in each quadrant around the query point are computed by gradual expansion, and the global Skyline grids and the data points within them are obtained by comparing these dominance relations, so that the data points in the influence region are obtained without traversing all the data. Traversing a very small number of Skyline grids significantly reduces the computational overhead compared with a full traversal of the raw data.
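A possible shape for the quadrant polling step is sketched below in Python. The text does not specify the expansion order, so the sketch assumes cells in each quadrant are visited in order of increasing summed offset from $c_q$ (a dominating cell always has a strictly smaller sum than the cell it dominates), and it keeps every cell that shares a coordinate with $c_q$. It performs no early termination, and `quadrant_of` and `poll_quadrants` are hypothetical names.

```python
from collections import defaultdict


def quadrant_of(cell, cq):
    """Sign pattern of cell - cq, or None if the cell shares a coordinate with cq."""
    signs = tuple((x > c) - (x < c) for x, c in zip(cell, cq))
    return None if 0 in signs else signs


def poll_quadrants(non_empty_cells, cq):
    """Return the global Skyline cells by processing each quadrant separately."""
    def offsets(cell):
        return tuple(abs(x - c) for x, c in zip(cell, cq))

    by_quadrant = defaultdict(list)
    kept = []
    for cell in set(non_empty_cells):
        quad = quadrant_of(cell, cq)
        if quad is None:
            kept.append(cell)  # on the same line as cq: cannot be dominated
        else:
            by_quadrant[quad].append(cell)

    for cells in by_quadrant.values():
        survivors = []
        # Assumed expansion order: nearest cells first.
        for cell in sorted(cells, key=lambda c: sum(offsets(c))):
            off = offsets(cell)
            # Inside one quadrant, dominance reduces to a component-wise
            # comparison of the offsets from cq.
            if not any(all(a <= b for a, b in zip(offsets(s), off)) for s in survivors):
                survivors.append(cell)
        kept.extend(survivors)
    return sorted(kept)


cq = (5, 5)
cells = [(6, 6), (8, 8), (6, 9), (3, 7), (2, 9), (5, 8), (7, 6)]
print(poll_quadrants(cells, cq))  # [(3, 7), (5, 8), (6, 6)]
```

Each cell is compared only against the surviving cells of its own quadrant, which is what keeps the number of dominance checks small.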
In this step, the global Skyline grid is applied to network monitoring data: the query point q at the given moment is mapped to its corresponding Skyline grid, the global Skyline grids are then obtained with the global Skyline grid computation method and used as the candidate set, a time-ordered dynamic Skyline query is then performed on the network node data in the candidate set, and finally the valid global Skyline result is computed, namely the nodes close to the query threshold in network monitoring.
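The final refinement over the candidate set can be sketched as follows. The text does not restate the dynamic Skyline definition, so this Python sketch assumes the commonly used one: a point dominates another with respect to $q$ if it is no farther from $q$ in every dimension and strictly closer in at least one. The name `dynamic_skyline` and the sample values are hypothetical.

```python
def dynamic_skyline(points, q):
    """Points of the candidate set that no other point dynamically dominates
    with respect to q (assumed textbook definition, see lead-in)."""
    def dist(p):
        return tuple(abs(x - c) for x, c in zip(p, q))

    def dominates(a, b):
        da, db = dist(a), dist(b)
        return all(x <= y for x, y in zip(da, db)) and da != db

    return [p for p in points if not any(dominates(o, p) for o in points)]


q = (50, 50)  # query thresholds, one per attribute
candidates = [(52, 53), (55, 51), (60, 60), (47, 58), (49, 62)]
print(dynamic_skyline(candidates, q))  # [(52, 53), (55, 51), (49, 62)]
```

Only the points drawn from the global Skyline grids need to be passed in as `candidates`, which is where the coarse-grained pruning pays off.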
The corresponding system or device obtained by the above method is as follows:
a Skyline inquiry system based on space time sequence data flow application comprises a cloud center service system, wherein the cloud center service system comprises: a dividing module for performing a spatial time sequence based division of a continuous time sequence into a plurality of time segments in time windows; the inverted grid index generating module is used for generating a grid inverted index for each time segment; and the computing module is used for mapping the moment query points to corresponding Skyline grids, then obtaining global Skyline grids as a candidate set by using a global Skyline grid computing method, and then performing dynamic Skyline query on network node data in the candidate set according to a time sequence to obtain an effective global Skyline result by computing.
The specific steps of the space time sequence division by the division module are as follows: given a set of objects P, each data point PkIn a bounded interval [ T ]min,Tmax]Constructing a uniform partition t0,...,tB},tiDefinition of (t)i=Tmin+l×i,l=(Tmax-Tmin)/B,i=0,...,B
Form a set of time slices b0,...,bB-1Each time slice bi=[ti,ti+1) The fixed length is l, and B is the number of the bounded intervals which are uniformly divided; the time attribute value of each point is t and is mapped to a time slice bs(t)∈{b0,...,bB-1Where s (t) is defined as follows:
Figure BDA0001179664940000091
in the inverted grid index generation module, for each time slice, the generation process of the grid inverted index is as follows: let a given set of d-dimensional spatial objects P ═ P1,...,pnP for each data point PkI.e. pkAll e.p have d-dimensional attribute Pk.x1,...,pk.xdD-dimensional data space is divided into grids with equal width, and the width of each unit grid is (1,...,d) (ii) a The width of the cell is determined according to the value of each dimension, so that the mapped data points can be uniformly distributed in the cell, all the points in the same time slice are scanned,
Figure BDA0001179664940000092
point pkMapping to grid coordinates
Figure BDA0001179664940000093
Coordinate mapping such as
Figure BDA0001179664940000094
In the calculation module, the global Skyline grid calculation method comprises the following steps: query points q are mapped to corresponding grid cells cqIn the middle, the whole grid area is divided into an influence area and a dominated area, the influence area comprises cqPeripheral non-empty cells and grid cqA grid on the same horizontal or vertical line; the dominated region is a region dominated by the affected region, for the search of the affected region, a quadrant polling method is adopted, the domination relationship of non-empty Skyline lattices in each quadrant around the query point is calculated through gradual expansion, and data points in global Skyline lattices and lattices are obtained through comparison according to the domination relationship.
The Skyline format governing method is as follows: given any two non-empty Skyline grids C in the Skyline grid set C on the q, d dimension space of the query pointi,cj
Figure BDA0001179664940000099
Simultaneously, the following conditions are satisfied:
Figure BDA0001179664940000095
(ci(t)-q(t))(cj(t)-q(t))>0;
Figure BDA0001179664940000096
|ci(t)-q(t)|≤|cj(t)-q(t)|;
Figure BDA0001179664940000097
|ci(t)-q(t)|<|cj(t)-q(t)|。
skyline lattice ciSkyline lattice c dominated by qj
The global Skyline lattice is given a lattice set C, the global Skyline lattice of C is a lattice set which is not globally dominated by other lattices, and the global Skyline lattice is defined as:
Figure BDA0001179664940000098
when the index is established, a MapReduce processing flow is used, a plurality of maps are started to read the streaming data at the same time, each Map reads different HDFS data fragments, a < key, value > data pair is generated, the key is a space-time index, the value is a hashmap data structure, and corresponding data points obtained according to division are stored in the hash Map data structure; and the intermediate data obtained by each Map is a sub-index of partial data, sorting is completed according to key, and merging generation of the index is completed by calling a Reduce.
When the spatial time series is partitioned, a monitoring time range is set and a threshold is set accordingly. If the query time exceeds the specified time range, the query must span multiple time windows; the size of the time windows to be spanned is evaluated, and if it exceeds the time threshold, the query fails immediately.
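A minimal sketch of this check follows. The text does not say exactly how the size of the spanned windows is measured, so the example assumes it is the total length of all time slices touched by the query interval; `slices_for_query` and `max_span` are hypothetical names.

```python
import math
from typing import List, Optional


def slices_for_query(t_start: float, t_end: float, t_min: float,
                     slice_len: float, max_span: float) -> Optional[List[int]]:
    """Return the time-slice indices a query must scan, or None if it fails.

    Assumption: the spanned size is the combined length of every slice
    touched by the interval [t_start, t_end].
    """
    first = math.floor((t_start - t_min) / slice_len)
    last = math.floor((t_end - t_min) / slice_len)
    span = (last - first + 1) * slice_len
    if span > max_span:
        return None  # threshold exceeded: the query fails directly
    return list(range(first, last + 1))


print(slices_for_query(5, 27, t_min=0, slice_len=10, max_span=30))  # [0, 1, 2]
print(slices_for_query(5, 27, t_min=0, slice_len=10, max_span=20))  # None
```

Rejecting the query before any index lookup is what keeps the cost of oversized time ranges bounded.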
Example 2:
This embodiment relates to a specific application of the Skyline query method based on spatial time series data stream application described in Example 1:
The Skyline query system based on spatial time series data stream application is used for mobile medical calling. The cloud center service system provides a spatial grid pruning strategy and continuous monitoring of networked medical data in order to execute the dynamic Skyline and global Skyline algorithms; thresholds for each attribute are input, and query results are sent as of the moment the execution time ends, so that the attributes of a hospital can be improved. That is, the system executes the following steps:
S1. The cloud center service system provides the module index data structure through the distributed dynamic Skyline and global Skyline algorithms. At the same time, a Spark stream processing system is used to start multiple Map tasks to read the time-stream data; each Map task reads a different HDFS data split and generates <key, value> pairs, where the key is a spatio-temporal index and the value is a HashMap data structure that stores the corresponding data points obtained from the partitioning, in order to screen large-scale medical institution data.
S2. The smart mobile client first locates the terminal device by GPS and determines the user's location and personalized requirements. It then runs the medical calling program, communicates through the cloud server, sends a query instruction, and, with the user's participation, exchanges information with the spatial filtering results and continuous spatial monitoring data fed back by the cloud center service system.
Example 3:
The Skyline query method based on spatial time series data stream application of Example 1 is used for epidemic detection. First, the time series for epidemic monitoring is divided into multiple time slices according to time windows, and a static Skyline query is then performed on the data of each time slice. Given a set $P$ of time-stamped epidemic objects with data points $p_k$, a uniform partition $\{t_0, \dots, t_B\}$ is constructed over the bounded interval $[T_{min}, T_{max}]$, where $t_i = T_{min} + l \times i$, $l = (T_{max} - T_{min})/B$, $i = 0, \dots, B$. This forms a set of time slices $\{b_0, \dots, b_{B-1}\}$, where each time slice $b_i = [t_i, t_{i+1})$ has fixed length $l$. A point whose time attribute value is $t$ is mapped to the time slice $b_{s(t)} \in \{b_0, \dots, b_{B-1}\}$, where $s(t)$ is the index of the slice containing $t$:
$s(t) = \lfloor (t - T_{min}) / l \rfloor$
The interval length $l$ is chosen at a granularity determined by the actual monitoring time. Meanwhile, to reduce the amount of computation, a time range for epidemic monitoring and a threshold are set. If a query exceeds the specified time range, it must span multiple time windows; the size of the time windows to be spanned is evaluated, and if it exceeds the threshold, the query fails immediately. The state of the network nodes is monitored dynamically in real time through network monitoring, and each node continuously reports to the server the moment at which its execution time ends, so that the result is more accurate and reflects the actual situation.
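The mapping from a timestamp to its time slice can be sketched as follows, assuming $s(t)$ is the index of the half-open slice $[t_i, t_{i+1})$ that contains $t$. Assigning $t = T_{max}$ to the last slice is an additional assumption made here so that every point in the closed interval gets a slice; the function name `slice_index` is hypothetical.

```python
def slice_index(t: float, t_min: float, t_max: float, num_slices: int) -> int:
    """Map a timestamp t in [t_min, t_max] to the index of its time slice."""
    if not t_min <= t <= t_max:
        raise ValueError("timestamp outside the monitored interval")
    slice_len = (t_max - t_min) / num_slices      # l = (T_max - T_min) / B
    idx = int((t - t_min) // slice_len)           # floor((t - T_min) / l)
    return min(idx, num_slices - 1)               # assumption: t_max joins the last slice


print(slice_index(37, 0, 100, 10))   # 3
print(slice_index(0, 0, 100, 10))    # 0
print(slice_index(100, 0, 100, 10))  # 9
```

A smaller slice length gives finer time granularity at the cost of more slices to index and, for long queries, more windows to span.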
Example 4:
The Skyline query method based on spatial time series data stream application of Example 1 is used for analyzing medical history data. When the medical history data set is given, the static Skyline result is determined. If real-time medical data is continuously added and a query request is specified, the dominance relationship between the objects in the queried data set and the query request point must be considered, and the Skyline result is no longer fixed: for a dynamic Skyline query, the result differs depending on the query reference object. If the user's query may change, the medical history data retrieved changes as well. A multi-factor query in which such a dominance relationship exists is a Skyline query. When the accumulated medical history data, especially multi-dimensional information such as condition, etiology, time of onset, and treatment, is too large to be processed by a single computing node, cloud computing technology is used for parallel processing. Dynamic Skyline query: given a $d$-dimensional data space $S = \{s_1, s_2, \dots, s_d\}$, a data set $P = \{p_1, p_2, \dots, p_n\}$ on $S$, and a query object ref, dynamic dominance is computed on the vectors over time according to the dynamic dominance relation, and the Skyline result set is obtained. Data object $b$ dynamically dominates $a$ if and only if $b$ is no farther from ref than $a$ on every attribute and is strictly closer than $a$ on at least one dimension. If the query point changes dynamically over time, the indexing and query operations must also be processed dynamically in time-stream order. One advantage of this embodiment is that the result query is carried out at the moment the execution time ends, so it is more accurate and reflects the actual situation.
Another advantage is that a Spark stream processing system is used to distribute and cache the results of the Map computation in stream form, which can greatly accelerate computation. The invention is applicable to monitoring for mobile disease early warning, mobile medical calling, retrieval of medical history data, and the like. To address the difficulty of incrementally maintaining query results as data grows, the method is applied to dynamically varying global Skyline queries in network monitoring, with a focus on discovering abnormal conditions.
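The dynamic dominance relation of this example can be sketched as below. It is a naive pairwise version meant only to show how the result changes with the reference object; the function names and the sample records are hypothetical, and no grid pruning or parallelism is modeled.

```python
def dynamically_dominates(b, a, ref):
    """True if b is no farther from ref than a on every attribute
    and strictly closer on at least one."""
    db = [abs(x - r) for x, r in zip(b, ref)]
    da = [abs(x - r) for x, r in zip(a, ref)]
    return (all(x <= y for x, y in zip(db, da))
            and any(x < y for x, y in zip(db, da)))


def dynamic_skyline(points, ref):
    """Return the points that no other point dynamically dominates."""
    return [p for p in points
            if not any(dynamically_dominates(o, p, ref) for o in points)]


points = [(1, 9), (4, 6), (6, 4), (9, 9), (5, 1)]  # hypothetical 2-attribute records
print(dynamic_skyline(points, ref=(5, 5)))  # [(4, 6), (6, 4), (5, 1)]
print(dynamic_skyline(points, ref=(1, 9)))  # [(1, 9)]
```

Moving the reference object from (5, 5) to (1, 9) changes the result set completely, which is why the index and query must be re-evaluated in time-stream order when the query point moves.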
The above describes only preferred embodiments of the present invention, but the scope of protection of the present invention is not limited thereto; changes or modifications that a person skilled in the art could make to the technical solution and the inventive concept of the present invention, within the technical scope of the present invention, are included.

Claims (6)

1. A Skyline query system based on spatial time series data stream application, characterized in that
it comprises a cloud center service system, the cloud center service system comprising:
a partitioning module for partitioning the spatial time series, dividing a continuous time series into multiple time slices according to time windows;
an inverted grid index generation module for generating an inverted grid index for each time slice;
a computation module for mapping the query point at a given moment to the corresponding Skyline grid, then obtaining the global Skyline grids as a candidate set using the global Skyline grid calculation method, and then performing a dynamic Skyline query, in time order, on the network node data in the candidate set to compute the valid global Skyline result; cloud computing technology is used for parallel processing: given a $d$-dimensional data space $S = \{s_1, s_2, \dots, s_d\}$, a data set $P = \{p_1, p_2, \dots, p_n\}$ on the data space $S$, and a query object ref, dynamic dominance is computed on the vectors over time according to the dynamic dominance relation, and the Skyline result set is obtained; data object $b$ dynamically dominates $a$ if and only if $b$ is no farther from ref than $a$ on every attribute and is strictly closer than $a$ on at least one dimension; if the query point changes dynamically over time, the indexing and query operations are also processed dynamically in time-stream order;
the partitioning module partitions the spatial time series by the following specific steps: given a set of objects $P$ with data points $p_k$, a uniform partition $\{t_0, \dots, t_B\}$ is constructed over the bounded interval $[T_{min}, T_{max}]$, where $t_i = T_{min} + l \times i$, $l = (T_{max} - T_{min})/B$, $i = 0, \dots, B$;
this forms a set of time slices $\{b_0, \dots, b_{B-1}\}$, where each time slice $b_i = [t_i, t_{i+1})$ has fixed length $l$ and $B$ is the number of uniform divisions of the bounded interval; a point whose time attribute value is $t$ is mapped to the time slice $b_{s(t)} \in \{b_0, \dots, b_{B-1}\}$, where $s(t)$ is the index of the slice containing $t$, defined as follows:
$s(t) = \lfloor (t - T_{min}) / l \rfloor$
a monitoring time range and a threshold are set; if a query exceeds the specified time range, it must span multiple time windows, the size of the time windows to be spanned is evaluated, and if it exceeds the threshold the query fails immediately; if the time window is too small and the data volume has not accumulated, a batch stream caching method is used to cache the data stream and then send it periodically in batches; and if the time window is large and the data volume is large, the data stream is split by window, with the splitting granularity determined by the actual application scenario.
2. The Skyline query system based on spatial time series data stream application of claim 1, wherein, in the inverted grid index generation module, the inverted grid index for each time slice is generated as follows: given a set of $d$-dimensional spatial objects $P = \{p_1, \dots, p_n\}$, each data point $p_k \in P$ has $d$-dimensional attributes $p_k.x_1, \dots, p_k.x_d$; the $d$-dimensional data space is divided into a grid of equal-width cells, with one cell width for each of the dimensions $1, \dots, d$; the cell width is determined according to the values in each dimension so that the mapped data points are distributed evenly among the cells; all the points in the same time slice are scanned,
Figure FDA0002478184280000021
and point $p_k$ is mapped into the grid at grid coordinate
Figure FDA0002478184280000022
where the coordinate mapping is
Figure FDA0002478184280000023
3. The Skyline query system based on spatial time series data stream application of claim 1, wherein, in the computation module, the global Skyline grid calculation method is as follows: the query point $q$ is mapped to its corresponding grid cell $c_q$, and the whole grid area is divided into an influence region and a dominated region; the influence region comprises the non-empty cells surrounding $c_q$ and the cells lying on the same horizontal or vertical line as $c_q$; the dominated region is the region dominated by the influence region; the influence region is searched by quadrant polling, in which the dominance relationships among the non-empty Skyline grids in each quadrant around the query point are computed by gradual outward expansion, and the global Skyline grids and the data points within them are obtained by comparison according to the dominance relationship.
4. The Skyline query system based on spatial time series data stream application of claim 3, wherein the Skyline grid dominance relation is defined as follows: given a query point $q$ in $d$-dimensional space and any two non-empty Skyline grids $c_i, c_j$ in the Skyline grid set $C$, $c_i \prec_q c_j$ holds when the following conditions are satisfied simultaneously:
$\forall t \in [1, d]:\ (c_i(t) - q(t))(c_j(t) - q(t)) > 0$;
$\forall t \in [1, d]:\ |c_i(t) - q(t)| \le |c_j(t) - q(t)|$;
$\exists t \in [1, d]:\ |c_i(t) - q(t)| < |c_j(t) - q(t)|$;
in which case Skyline grid $c_i$ globally dominates Skyline grid $c_j$ with respect to $q$.
5. The Skyline query system based on spatial time series data stream application of claim 3, wherein, given a grid set $C$, the global Skyline grids of $C$ are the set of all grids that are not globally dominated by any other grid, defined as:
$\{\, c_i \in C \mid \nexists\, c_j \in C :\ c_j \prec_q c_i \,\}$
6. The Skyline query system based on spatial time series data stream application of claim 1, wherein, when the index is built, a MapReduce processing flow is used: multiple Map tasks are started to read the streaming data simultaneously, and each Map task reads a different HDFS data split and generates <key, value> pairs, where the key is a spatio-temporal index and the value is a HashMap data structure that stores the corresponding data points obtained from the partitioning; the intermediate data produced by each Map task is a sub-index over part of the data, sorted by key, and the final index is generated by calling a Reduce task to merge the sub-indexes.
CN201611150565.3A 2016-12-14 2016-12-14 Skyline inquiry system based on space time sequence data flow application Active CN106777093B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611150565.3A CN106777093B (en) 2016-12-14 2016-12-14 Skyline inquiry system based on space time sequence data flow application

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611150565.3A CN106777093B (en) 2016-12-14 2016-12-14 Skyline inquiry system based on space time sequence data flow application

Publications (2)

Publication Number Publication Date
CN106777093A CN106777093A (en) 2017-05-31
CN106777093B true CN106777093B (en) 2021-01-01

Family

ID=58876969

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611150565.3A Active CN106777093B (en) 2016-12-14 2016-12-14 Skyline inquiry system based on space time sequence data flow application

Country Status (1)

Country Link
CN (1) CN106777093B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108959521B (en) * 2018-06-28 2021-07-16 中国人民解放军国防科技大学 Uncertain contour query parallel processing method and system based on N-of-N flow model
CN110119408B (en) * 2019-03-22 2022-12-06 西安电子科技大学 Continuous query method for moving object under geospatial real-time streaming data
CN109947904B (en) * 2019-03-22 2021-07-30 东北大学 Preference space Skyline query processing method based on Spark environment
CN110457316A (en) * 2019-06-27 2019-11-15 四川工商学院 A kind of Skyline inquiry method and its system of large-scale dataset
CN110334252B (en) * 2019-07-10 2022-04-12 大连海事大学 Skyline query method on partial order domain
CN110750565B (en) * 2019-08-16 2022-02-22 安徽工业大学 Real-time interval query method based on Internet of things data flow sliding window model
CN110516119A (en) * 2019-08-27 2019-11-29 西南交通大学 A kind of organizational scheduling method, device and the storage medium of natural resources contextual data
CN113449208B (en) * 2020-03-26 2022-09-02 阿里巴巴集团控股有限公司 Space query method, device, system and storage medium
CN111694839B (en) * 2020-04-28 2023-07-14 平安科技(深圳)有限公司 Time sequence index construction method and device based on big data and computer equipment
CN113806353A (en) * 2020-06-12 2021-12-17 第四范式(北京)技术有限公司 Method and device for realizing time sequence feature extraction
CN112925789B (en) * 2021-02-24 2022-12-20 东北林业大学 Spark-based space vector data memory storage query method and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103150326A (en) * 2012-12-21 2013-06-12 北京大学软件与微电子学院无锡产学研合作教育基地 Skyline query method orienting to probability data flow
CN103778195A (en) * 2014-01-07 2014-05-07 浙江大学 Sorting reverse skyline query method in spatial database
CN105607943A (en) * 2015-12-18 2016-05-25 浪潮集团有限公司 Dynamic deployment mechanism of virtual machine in cloud environment

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102323957B (en) * 2011-10-26 2012-10-03 中国人民解放军国防科学技术大学 Distributed parallel Skyline query method based on vertical dividing mode
KR101344649B1 (en) * 2012-05-22 2013-12-26 인하대학교 산학협력단 Hash-based skyline query processing method and apparatus thereof
CN105608206A (en) * 2015-12-25 2016-05-25 天津理工大学 Data-broadcasting-oriented location correlation skyline query processing method
CN105761037A (en) * 2016-02-05 2016-07-13 大连大学 Logistics scheduling method based on space reverse neighbor search under cloud computing environment
CN105760470A (en) * 2016-02-05 2016-07-13 大连大学 Medical calling system based on spatial reverse nearest neighbor query in cloud computing environment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Discovering Approximate Time Series Motif Based on MP_C Method with the Support of Skyline Index;Nguyen Thanh Son;《2012 Fourth International Conference on Knowledge and Systems Engineering》;20120819;1-8 *
基于维度偏好的Skyline查询结果精简算法;王雪菲;《中国优秀硕士学位论文全文数据库 信息科技辑》;20160315(第3期);I138-5374 *

Also Published As

Publication number Publication date
CN106777093A (en) 2017-05-31

Similar Documents

Publication Publication Date Title
CN106777093B (en) Skyline inquiry system based on space time sequence data flow application
CN106708989B (en) Skyline query method based on space time sequence data stream application
CN106528773B (en) Map computing system and method based on Spark platform supporting spatial data management
CN105589951B (en) A kind of mass remote sensing image meta-data distribution formula storage method and parallel query method
CN105512297A (en) Distributed stream-oriented computation based spatial data processing method and system
CN109408501B (en) Position data processing method and device, server and storage medium
CN106599190A (en) Dynamic Skyline query method based on cloud computing
CN111586091A (en) Edge computing gateway system for realizing computing power assembly
CN106599189A (en) Dynamic Skyline inquiry device based on cloud computing
CN109145225B (en) Data processing method and device
Havers et al. DRIVEN: A framework for efficient Data Retrieval and clustering in Vehicular Networks
CN114328780A (en) Hexagonal lattice-based smart city geographic information updating method, device and medium
Stojanovic et al. High–performance computing in GIS: Techniques and applications
CN116796083B (en) Space data partitioning method and system
CN109800231B (en) Real-time co-movement motion mode detection method of track based on Flink
CN108717444A (en) A kind of big data clustering method and device based on distributed frame
Heiler et al. Comparing implementation variants of distributed spatial join on spark
CN106055669A (en) Data discretization method and system
Yu et al. Efficient Spatio-Temporal-Data-Oriented Range Query Processing for Air Traffic Flow Statistics
CN112257955A (en) Clustering algorithm-based shared bicycle optimization allocation method, control device, electronic equipment and storage medium thereof
Maguerra et al. A survey on solutions for big spatio-temporal data processing and analytics
CN111813542A (en) Load balancing method and device for parallel processing of large-scale graph analysis tasks
Bojkovic et al. Mobile cloud analytics in Big data era
Pravinbhai et al. Big Data An Analytic architecture and prediction using spark for E-agriculture
Wu et al. Towards Adaptive Continuous Trajectory Clustering Over a Distributed Web Data Stream

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant