WO2020192225A1 - Spark-oriented remote sensing data indexing method, system and electronic device - Google Patents

Spark-oriented remote sensing data indexing method, system and electronic device

Info

Publication number
WO2020192225A1
Authority
WO
WIPO (PCT)
Prior art keywords
index
remote sensing
sensing data
spark
indexing
Prior art date
Application number
PCT/CN2019/130566
Other languages
English (en)
French (fr)
Inventor
熊景盼
王洋
须成忠
Original Assignee
深圳先进技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳先进技术研究院
Publication of WO2020192225A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2246 Trees, e.g. B+trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2291 User-Defined Types; Storage management thereof

Definitions

  • This application belongs to the technical field of big data applications, and in particular relates to a Spark-oriented remote sensing data indexing method, system and electronic device.
  • Remote sensing data is image information captured by satellites in space.
  • The volume of remote sensing data keeps growing as images are captured and collected over time, which creates substantial storage and computation problems.
  • Storage and computation methods for remote sensing data are mostly designed around a spatial database management system (SDBMS).
  • The storage capacity of an SDBMS depends largely on the performance of the underlying DBMS.
  • In terms of scalability, an SDBMS generally scales vertically, increasing its processing capability by upgrading hardware such as CPUs, large-capacity memory and high-speed disks. For technical and cost reasons, vertical scaling is not sustainable, and its capacity and scale are limited. In terms of availability, the inherent performance bottleneck and single point of failure of a stand-alone SDBMS also make it hard to adapt to large-scale concurrent access.
  • The most widely used storage methods are latitude/longitude storage, quadtree storage and R-tree storage.
  • The main idea of latitude/longitude index storage is to build an index table from latitude and longitude. The size of each index block is defined by the user, or the size of each data block is limited by the data block size in HDFS.
  • HDFS: Hadoop Distributed File System
  • This method suits computation by big data frameworks such as Spark, because it is based on the HDFS file system and splits files appropriately, which fits the interfaces of big data computing frameworks.
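The latitude/longitude index table described above can be pictured with a small sketch. It is only an illustration under stated assumptions: the fixed cell size `cell_deg`, the function names and the sample file paths are hypothetical and do not come from the application, which lets the user or the HDFS block size determine the block size.

```python
import math


def grid_cell(lat, lon, cell_deg=1.0):
    """Return the (row, col) of the fixed-size grid cell that contains a point."""
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        raise ValueError("coordinates out of range")
    row = math.floor((lat + 90.0) / cell_deg)
    col = math.floor((lon + 180.0) / cell_deg)
    # Points on the upper edges (lat = 90 or lon = 180) belong to the last cell.
    max_row = math.ceil(180.0 / cell_deg) - 1
    max_col = math.ceil(360.0 / cell_deg) - 1
    return min(row, max_row), min(col, max_col)


def build_index_table(records, cell_deg=1.0):
    """Map each grid cell to the file paths whose reference point falls in it."""
    table = {}
    for path, lat, lon in records:
        table.setdefault(grid_cell(lat, lon, cell_deg), []).append(path)
    return table


# Hypothetical scenes: (path, latitude, longitude of a reference point).
records = [
    ("/rs/scene_a.tif", 22.54, 114.06),
    ("/rs/scene_b.tif", 22.90, 114.70),
    ("/rs/scene_c.tif", 39.90, 116.40),
]
table = build_index_table(records, cell_deg=1.0)
for cell, paths in sorted(table.items()):
    print(cell, paths)
```

A query for a region then only needs to enumerate the cells it covers and read the paths stored under those keys.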
  • A quadtree is an index system built by dividing a spatial region into four sub-regions and repeating the division downward until the final indivisible child nodes are obtained.
  • An R-tree is an index system built by clustering: nearby nodes are combined into a tree, and these are nested within one another to form the whole tree.
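The recursive four-way division behind a quadtree can be sketched as follows. This is a minimal illustration, not the application's implementation: the quadrant labels, the fixed `depth` and the function name are assumptions made for the example.

```python
def quadtree_path(x, y, bounds, depth):
    """Descend `depth` levels toward point (x, y).

    Returns the list of quadrant labels visited and the bounds of the final cell.
    """
    min_x, min_y, max_x, max_y = bounds
    if not (min_x <= x <= max_x and min_y <= y <= max_y):
        raise ValueError("point lies outside the root bounds")
    path = []
    for _ in range(depth):
        mid_x = (min_x + max_x) / 2
        mid_y = (min_y + max_y) / 2
        east = x >= mid_x
        north = y >= mid_y
        path.append(("N" if north else "S") + ("E" if east else "W"))
        # Shrink the bounds to the chosen quadrant before the next level.
        min_x, max_x = (mid_x, max_x) if east else (min_x, mid_x)
        min_y, max_y = (mid_y, max_y) if north else (min_y, mid_y)
    return path, (min_x, min_y, max_x, max_y)


WORLD = (-180.0, -90.0, 180.0, 90.0)  # (min_lon, min_lat, max_lon, max_lat)
path, cell = quadtree_path(114.06, 22.54, WORLD, depth=3)
print(path, cell)
```

Each extra level quarters the cell, so the path acts as a hierarchical key for the region that contains the point.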
  • Index storage systems built on the above methods usually adopt a two-level index: a macro-level global index classification at the first level, followed by a detailed index classification at the second level.
  • This classification adapts to different storage types and can effectively define the size of data blocks, which simplifies management and storage.
  • SpatialHadoop [Eldawy A, Mokbel M F. Spatialhadoop: A mapreduce framework for spatial data[C]. Data Engineering (ICDE), 2015 IEEE 31st International Conference on. IEEE, 2015: 1352-1363.] is an R-tree-based index system. It partitions the data into blocks and then uses Hadoop to carry out the related database indexing work.
  • GeoSpark [Yu J, Wu J, Sarwat M. A demonstration of GeoSpark: A cluster computing framework for processing big spatial data[C]. 2016 IEEE 32nd International Conference on Data Engineering (ICDE). IEEE, 2016: 1410-1413.] is a typical remote sensing data indexing system with a user-defined two-level index. Users define the data blocks, and the index is then built from latitude and longitude together with an R-tree or a quadtree.
  • SHAHED [Eldawy A, Mokbel M F, Alharthi S, et al. Shahed: A mapreduce-based system for querying and visualizing spatio-temporal satellite data[C]. 2015 IEEE 31st International Conference on Data Engineering (ICDE). IEEE, 2015: 1585-1596.] is currently the most mature indexing system in the industry. It uses a two-level index: the first level combines multiple dimensions with latitude and longitude, and the second level builds its index with a competitive quadtree.
  • The above index storage methods are all research by scholars on remote sensing big data storage. Each has its own advantages.
  • Their two-level global and local storage meets the need for fast index lookup.
  • However, some databases are weak in areas such as extensibility, while other index methods handle database extension very well but ignore the space consumed by the whole database.
  • This research focuses only on how to build a good single index system so that information can be found or retrieved faster. A single index system cannot serve many different scenarios efficiently, so indexing files wastes considerable time and resources and lowers the efficiency of the whole processing system.
  • To address this, this application proposes an index strategy that adapts to Spark and supports efficient computation in different scenarios, so that data can be found faster while Spark's computation is accelerated, making better use of resources and time.
  • This application provides a method, system and electronic device for remote sensing data indexing oriented to Spark, aiming to solve one of the above technical problems in the prior art at least to a certain extent.
  • A Spark-oriented remote sensing data indexing method includes the following steps:
  • Step a: Establish quadtree, GeoHash and R-tree index systems for the remote sensing data in the PostgreSQL database, and store the three index systems separately, to obtain a multi-index storage system in which multiple indexes coexist for Spark;
  • Step b: Select an index strategy selector and establish the connection between Spark and the multi-index storage system;
  • Step c: Based on the index strategy selector, assign the corresponding index method according to the calculation scenario, and search for remote sensing data in the multi-index storage system according to that index method.
  • The technical solution adopted in the embodiment of the application further includes: step a further includes acquiring remote sensing data and storing the remote sensing data in the HDFS file system.
  • The technical solution adopted in the embodiment of the application further includes: step a further includes establishing an index system in the HDFS file system and storing the index system in a PostgreSQL database.
  • The technical solution adopted in the embodiment of the application further includes: step b further includes establishing a hot zone memory file system for accessed data on the index strategy selector; using machine learning to find query and calculation characteristics and to determine the location of the hot zone memory file system, thereby completing its establishment; and analyzing the characteristics of different computing scenarios to obtain an index strategy selector suited to those scenarios.
  • The technical solution adopted in the embodiment of the application further includes: in step c, assigning the corresponding index method according to the calculation scenario and searching for remote sensing data in the multi-index storage system according to that index method specifically includes:
  • Step c1: Obtain the calculation parameters;
  • Step c2: Determine the index method suited to the calculation parameters;
  • Step c3: Select that index method, search for remote sensing data according to it, and pass the remote sensing data to the calculation function;
  • Step c4: Drive the Spark calculation;
  • Step c5: Return the calculation result, and store the calculation result and the calculation record;
  • Step c6: Publish the calculation result.
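Steps c1 and c2 amount to mapping a description of the calculation scenario to one of the three index methods. The application obtains its selector through learning and experiments; the sketch below swaps in a few hand-written rules purely for illustration. The `CalcParams` fields, the `frequent_updates` flag and the quadtree fallback are assumptions, while the point versus line/area preference follows the trade-offs stated elsewhere in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CalcParams:
    """Hypothetical description of a calculation scenario (step c1)."""
    geometry: str                   # "point", "line" or "area"
    frequent_updates: bool = False  # hypothetical flag, not from the source


def choose_index(params: CalcParams) -> str:
    """Step c2: pick an index method using illustrative, hand-written rules."""
    if params.geometry == "point":
        return "geohash"  # GeoHash suits point data
    if params.geometry in ("line", "area"):
        # The R-tree suits line and area data, but its updates are costly,
        # so this sketch falls back to the quadtree when data change often.
        return "quadtree" if params.frequent_updates else "rtree"
    raise ValueError(f"unknown geometry type: {params.geometry!r}")


for params in (CalcParams("point"), CalcParams("area"), CalcParams("line", True)):
    print(params, "->", choose_index(params))
```

A learned selector would replace `choose_index` while keeping the same contract: calculation parameters in, index method name out.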
  • A Spark-oriented remote sensing data indexing system, including:
  • Multi-index storage system establishment module: used to establish quadtree, GeoHash and R-tree index systems for the remote sensing data in the PostgreSQL database, and to store the three index systems separately, to obtain a multi-index storage system in which multiple indexes coexist for Spark;
  • Index docking module: used to select an index strategy selector and establish the docking between Spark and the multi-index storage system;
  • Data index module: used to assign, based on the index strategy selector, the corresponding index method according to the calculation scenario, and to search for remote sensing data in the multi-index storage system according to that index method.
  • Data acquisition module: used to acquire remote sensing data;
  • Data storage module: used to store the remote sensing data in the HDFS file system.
  • Index system establishment module: used to establish a layer of index system in the HDFS file system;
  • Index system storage module: used to store the index system in a PostgreSQL database.
  • The technical solution adopted in the embodiment of the present application further includes: the index docking module is also used to establish a hot zone memory file system for accessed data on the index strategy selector; to use machine learning to find query and calculation characteristics and to determine the location of the hot zone memory file system, thereby completing its establishment; and to analyze the characteristics of different computing scenarios to obtain an index strategy selector suited to those scenarios.
  • The technical solution adopted in the embodiment of the present application further includes: the data index module assigning the corresponding index method according to the calculation scenario and searching for remote sensing data in the multi-index storage system according to that index method specifically includes: obtaining the calculation parameters; determining the index method suited to the calculation parameters; selecting that index method, searching for remote sensing data according to it, and passing the remote sensing data to the calculation function; driving the Spark calculation; returning the calculation result and storing the calculation result and the calculation record; and publishing the calculation result.
  • An electronic device, including:
  • At least one processor; and
  • a memory communicatively connected with the at least one processor; wherein,
  • the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the following operations of the aforementioned Spark-oriented remote sensing data indexing method:
  • Step a: Establish quadtree, GeoHash and R-tree index systems for the remote sensing data in the PostgreSQL database, and store the three index systems separately, to obtain a multi-index storage system in which multiple indexes coexist for Spark;
  • Step b: Select an index strategy selector and establish the connection between Spark and the multi-index storage system;
  • Step c: Based on the index strategy selector, assign the corresponding index method according to the calculation scenario, and search for remote sensing data in the multi-index storage system according to that index method.
  • The beneficial effects of the embodiments of this application are as follows. The Spark-oriented remote sensing data indexing method, system and electronic device integrate and drive multiple indexing methods and assign an indexing method according to the computing scenario, which greatly reduces indexing time compared with a single indexing method. This gives strong support to platforms that compute on remote sensing big data. The approach also adapts to Spark computing tasks and locates the required files quickly and efficiently, achieving efficient computation on remote sensing data.
  • In addition, combining a distributed storage system with indexes uses the storage performance of the machines more efficiently, which increases utilization.
  • FIG. 1 is a flowchart of a method for establishing a multi-index storage system according to an embodiment of the present application;
  • FIG. 2 is a schematic diagram of the structure of a multi-index system according to an embodiment of the present application;
  • FIG. 3 is a flowchart of a remote sensing data indexing method based on a multi-index storage system;
  • FIG. 4 is a schematic structural diagram of a remote sensing data indexing system for Spark according to an embodiment of the present application;
  • FIG. 5 is a schematic diagram of the hardware device structure of the Spark-oriented remote sensing data indexing method provided by an embodiment of the present application.
  • Building on several common spatial data index structures, this application establishes, for remote sensing data, a multi-index storage system in which multiple indexes coexist for Spark. The system integrates and drives multiple indexing methods, adapts to Spark computing tasks, and quickly and efficiently locates the required files and passes them to the calculation functions to achieve efficient computation.
  • Different calculation scenarios correspond to different indexing methods, so as to give full play to the characteristics of each indexing system and adapt to Spark efficient calculations.
  • FIG. 1 is a flowchart of a method for establishing a multi-index storage system according to an embodiment of the present application.
  • the method for establishing a multi-index storage system in the embodiment of the present application includes the following steps:
  • Step 100: Obtain remote sensing data;
  • Step 110: Store the remote sensing data in the HDFS file system;
  • In step 110, the HDFS file system generates two additional copies while storing the remote sensing data, so that the original file can be restored from other machines after a single node fails, which avoids data loss.
  • The multiple copies also support Spark's parallel computing.
  • Step 120: Establish a layer of index system in the HDFS file system;
  • Step 130: Store the index system in the PostgreSQL database;
  • Step 140: Establish the quadtree, GeoHash and R-tree index systems for the remote sensing data in the PostgreSQL database, and store the three index systems in three different databases in PostgreSQL, to obtain a multi-index storage system in which multiple indexes coexist for Spark;
  • In step 140, each additional index system would ordinarily add three file copies to the storage system.
  • Here, one layer of index system is established for each file copy, so the multi-index storage system is built without increasing data redundancy.
  • Refer to FIG. 2, which is a schematic structural diagram of a multi-index system according to an embodiment of this application.
  • Because the quadtree is a tree structure, the index can become an unbalanced tree as data is added, which lowers search efficiency and therefore computational efficiency.
  • GeoHash is only one way of spatial indexing. It is especially suitable for point data, whereas an R-tree index is more advantageous for line and area data.
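To make the remark about point data concrete, the following sketch encodes a latitude/longitude pair with the standard public GeoHash scheme (alternating longitude and latitude bisection, five bits per base-32 character). The function name and default precision are choices made for this example; the application does not specify how its GeoHash index is implemented.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # GeoHash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, precision=8):
    """Encode a point as a GeoHash string of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


print(geohash_encode(57.64911, 10.40744, precision=3))
print(geohash_encode(0.0, 0.0, precision=5))
```

Because a shorter hash is always a prefix of a longer hash for the same point, nearby points tend to share prefixes, which is what makes the string usable as a key in an ordinary ordered index.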
  • the R-tree directly stores the position information of the object, but because the position of the continuously moving object will constantly change, it will be updated frequently.
  • The MBRs (minimum bounding rectangles) of R-tree nodes are allowed to overlap, so multiple paths need to be traversed when searching for old index entries.
  • The R-tree requires MBR boundaries to be as compact as possible, which leads to high update costs: objects on a boundary can easily move in and out of the MBR frequently, and each delete or insert operation can trigger merge and split operations.
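The two R-tree properties just described, overlapping MBRs and the cost of keeping them compact, can be seen with a few rectangle helpers. This is a sketch of the geometry only, not an R-tree implementation; rectangles are `(min_x, min_y, max_x, max_y)` tuples and all names are illustrative.

```python
def mbr_of(points):
    """Minimum bounding rectangle of a non-empty list of (x, y) points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


def intersects(a, b):
    """True if rectangles a and b overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def union(a, b):
    """Smallest rectangle that covers both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def area(m):
    return (m[2] - m[0]) * (m[3] - m[1])


def enlargement(node, obj):
    """Extra area a node's MBR needs in order to absorb obj."""
    return area(union(node, obj)) - area(node)


node_a = (0.0, 0.0, 4.0, 4.0)
node_b = (3.0, 3.0, 8.0, 8.0)
query = (3.5, 3.5, 3.6, 3.6)

# The query falls in the overlap, so a search must descend into both nodes.
print([name for name, node in (("a", node_a), ("b", node_b)) if intersects(node, query)])

# An object just outside node_a forces a large growth of its MBR.
print(enlargement(node_a, (5.0, 5.0, 6.0, 6.0)))
```

Insertion heuristics in R-trees typically choose the child with the smallest enlargement, which is why objects that keep crossing a boundary make updates expensive.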
  • this application integrates the three indexing methods of quadtree, Geohash, and R-tree, so that the above three indexing methods do not interfere with each other during work, and provide data support for Spark.
  • Because the remote sensing data is stored in the HDFS file system, a specific storage path must be provided to obtain the remote sensing data that a Spark calculation needs. An index system therefore has to be established in the HDFS file system so that the required data can be found during the calculation.
  • Spark works with different data ranges (time, space) when processing remote sensing big data, which puts excessive pressure on a simple indexing system, so the required data cannot be guaranteed to be provided efficiently.
  • Establishing an indexing system oriented to the coexistence of Spark's multiple indexes and realizing the docking between Spark and the indexing system can not only meet the access requirements of Spark, but also effectively improve the computing power of Spark.
  • Step 150 Obtain a reasonable index strategy selector through learning, establish a connection between Spark and a multi-index storage system, and establish a hot zone memory file system for frequently accessed data on the index strategy selector;
  • In step 150, each index system has its own advantages and disadvantages, which is what makes an index strategy selector feasible.
  • The required data volume and data format affect the efficiency of an index.
  • A reasonable index strategy selector is therefore obtained through learning and experiments, and indexing methods are assigned according to the scenario, so that indexing time is greatly reduced compared with a single indexing method.
  • The selection of the index strategy selector includes: (1) building a Spark big data processing framework on the cluster and testing whether its functions are complete and whether it runs normally; (2) connecting Spark to the index strategy selector, testing whether the interface is available and adjusting it so that it can serve Spark; (3) once the connection is in place, completing calculation tests in different scenarios, testing the performance of each single index without the index strategy selector, comparing the test results, and optimizing the index strategy selector.
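The hot zone memory file system for frequently accessed data is only outlined in the text, and its placement is said to be learned. As a rough stand-in, the sketch below keeps recently used files in memory with a least-recently-used policy. The class name, the `loader` callback and the capacity are hypothetical, and LRU is a substitute for the learned policy rather than the application's method.

```python
from collections import OrderedDict


class HotZoneCache:
    """In-memory cache of recently accessed files with LRU eviction."""

    def __init__(self, capacity, loader):
        self._capacity = capacity
        self._loader = loader        # called on a miss, e.g. to read from HDFS
        self._items = OrderedDict()  # path -> data, oldest entry first
        self.hits = 0
        self.misses = 0

    def get(self, path):
        if path in self._items:
            self._items.move_to_end(path)  # mark as most recently used
            self.hits += 1
            return self._items[path]
        self.misses += 1
        data = self._loader(path)
        self._items[path] = data
        if len(self._items) > self._capacity:
            self._items.popitem(last=False)  # evict the least recently used
        return data


loads = []


def fake_loader(path):
    """Stand-in for a slow read from distributed storage."""
    loads.append(path)
    return f"contents of {path}"


cache = HotZoneCache(capacity=2, loader=fake_loader)
for path in ("a", "b", "a", "c", "b"):
    cache.get(path)
print(loads, cache.hits, cache.misses)
```

In this run the second request for "a" is served from memory, while "b" has to be reloaded because it was evicted when "c" arrived.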
  • Step 160: Based on the index strategy selector, assign index methods according to the computing scenario, search for remote sensing data in the multi-index storage system according to the assigned index method, and perform the Spark calculation;
  • For step 160, please also refer to FIG. 3, which is a flowchart of a remote sensing data indexing method based on a multi-index storage system. It specifically includes the following steps:
  • Step 161: Obtain the calculation parameters;
  • Step 162: Determine the index method suited to the parameters;
  • Step 163: Select the index method, search for remote sensing data according to it, and pass the remote sensing data to the calculation function;
  • Step 164: Drive the Spark calculation;
  • Step 165: Return the calculation result, and store the calculation result and the calculation record;
  • Step 166: Publish the calculation result.
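Steps 161 to 166 form a small control flow that can be sketched without a cluster. In the sketch below the selector, the per-index search functions and the compute function are injected as plain Python callables; they are hypothetical stand-ins for the learned selector, the PostgreSQL-backed indexes and the Spark job, none of which are specified in code in the application.

```python
def run_indexed_job(params, selector, indexes, compute, history):
    """Run one calculation following steps 161 to 166."""
    method = selector(params)        # step 162: choose the index method
    paths = indexes[method](params)  # step 163: look up the data to read
    result = compute(paths)          # step 164: stand-in for the Spark job
    history.append(                  # step 165: store result and record
        {"params": params, "index": method, "result": result}
    )
    return result                    # step 166: publish the result


def toy_selector(params):
    return "geohash" if params["geometry"] == "point" else "rtree"


def search_geohash(params):
    return ["/rs/points_part0.tif"]


def search_rtree(params):
    return ["/rs/area_part0.tif", "/rs/area_part1.tif"]


indexes = {"geohash": search_geohash, "rtree": search_rtree}
history = []

# Step 161: the calculation parameters arrive with the request.
result = run_indexed_job({"geometry": "area"}, toy_selector, indexes, len, history)
print(result, history[0]["index"])
```

Keeping the selector and the index lookups behind plain callables means a different selection policy can be swapped in without touching the driver.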
  • FIG. 4 is a schematic structural diagram of a Spark-oriented remote sensing data indexing system according to an embodiment of the present application.
  • The Spark-oriented remote sensing data indexing system of the embodiment of the application includes:
  • Data acquisition module: used to acquire remote sensing data;
  • Data storage module: used to store the remote sensing data in the HDFS file system. During storage, the HDFS file system generates two additional copies, so that the original file can be restored from other machines if a single node fails, which avoids data loss.
  • The multiple copies also support Spark's parallel computing.
  • Index system establishment module: used to establish a layer of index system in the HDFS file system;
  • Index system storage module: used to store the index system in the PostgreSQL database;
  • Multi-index storage system establishment module: used to establish the quadtree, GeoHash and R-tree index systems for the remote sensing data in the PostgreSQL database, and to store the three index systems in three different databases in PostgreSQL, to obtain a multi-index storage system in which multiple indexes coexist for Spark. Each additional index system would ordinarily add three file copies to the storage system.
  • Here, one layer of index system is established for each file copy, so the multi-index storage system is built without increasing data redundancy.
  • Because the quadtree is a tree structure, the index can become an unbalanced tree as data is added, which lowers search efficiency and therefore computational efficiency.
  • GeoHash is only one way of spatial indexing. It is especially suitable for point data, whereas an R-tree index is more advantageous for line and area data.
  • the R-tree directly stores the position information of the object, but because the position of the continuously moving object will constantly change, it will be updated frequently.
  • The MBRs (minimum bounding rectangles) of R-tree nodes are allowed to overlap, so multiple paths need to be traversed when searching for old index entries.
  • The R-tree requires MBR boundaries to be as compact as possible, which leads to high update costs: objects on a boundary can easily move in and out of the MBR frequently, and each delete or insert operation can trigger merge and split operations.
  • this application integrates the three indexing methods of quadtree, Geohash, and R-tree, so that the above three indexing methods do not interfere with each other during work, and provide data support for Spark.
  • Because the remote sensing data is stored in the HDFS file system, a specific storage path must be provided to obtain the remote sensing data that a Spark calculation needs. An index system therefore has to be established in the HDFS file system so that the required data can be found during the calculation.
  • Spark works with different data ranges (time, space) when processing remote sensing big data, which puts excessive pressure on a simple indexing system, so the required data cannot be guaranteed to be provided efficiently.
  • Establishing an indexing system oriented to the coexistence of Spark's multiple indexes and realizing the docking between Spark and the indexing system can not only meet the access requirements of Spark, but also effectively improve the computing power of Spark.
  • Index docking module: used to obtain a reasonable index strategy selector through learning, establish the docking between Spark and the multi-index storage system, and establish a hot zone memory file system for frequently accessed data on the index strategy selector.
  • Each index system has its own advantages and disadvantages, which is what makes an index strategy selector feasible.
  • This application obtains a reasonable index strategy selector through learning and experiments and assigns indexing methods according to the scenario, so that indexing time is greatly reduced compared with a single indexing method.
  • The selection of the index strategy selector includes: (1) building a Spark big data processing framework on the cluster and testing whether its functions are complete and whether it runs normally; (2) connecting Spark to the index strategy selector, testing whether the interface is available and adjusting it so that it can serve Spark; (3) once the connection is in place, completing calculation tests in different scenarios, testing the performance of each single index without the index strategy selector, comparing the test results, and optimizing the index strategy selector.
  • Data index module: used to assign the index method according to the computing scenario, search for remote sensing data in the multi-index storage system according to that index method, and perform the Spark calculation; the data indexing procedure is the same as steps 161 to 166 of the method embodiment.
  • FIG. 5 is a schematic diagram of the hardware device structure of the Spark-oriented remote sensing data indexing method provided by an embodiment of the present application.
  • The device includes one or more processors and a memory; one processor is taken as an example. The device may also include an input system and an output system.
  • The processor, the memory, the input system, and the output system may be connected by a bus or by other means.
  • Connection by a bus is taken as an example.
  • the memory can be used to store non-transitory software programs, non-transitory computer executable programs, and modules.
  • the processor executes various functional applications and data processing of the electronic device by running non-transitory software programs, instructions, and modules stored in the memory, that is, realizing the processing methods of the foregoing method embodiments.
  • the memory may include a program storage area and a data storage area, where the program storage area can store an operating system and an application program required by at least one function; the data storage area can store data and the like.
  • the memory may include a high-speed random access memory, and may also include a non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or other non-transitory solid state storage devices.
  • The memory may optionally include memory located remotely from the processor, and such remote memory may be connected to the processing system through a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
  • The input system can receive input numeric or character information and generate signal input.
  • the output system may include display devices such as a display screen.
  • the one or more modules are stored in the memory, and when executed by the one or more processors, the following operations of any of the foregoing method embodiments are performed:
  • Step a: Establish quadtree, GeoHash and R-tree index systems for the remote sensing data in the PostgreSQL database, and store the three index systems separately, to obtain a multi-index storage system in which multiple indexes coexist for Spark;
  • Step b: Select an index strategy selector and establish the connection between Spark and the multi-index storage system;
  • Step c: Based on the index strategy selector, assign the corresponding index method according to the calculation scenario, and search for remote sensing data in the multi-index storage system according to that index method.
  • the embodiments of the present application provide a non-transitory (non-volatile) computer storage medium, the computer storage medium stores computer executable instructions, and the computer executable instructions can perform the following operations:
  • Step a: Establish quadtree, GeoHash and R-tree index systems for the remote sensing data in the PostgreSQL database, and store the three index systems separately, to obtain a multi-index storage system in which multiple indexes coexist for Spark;
  • Step b: Select an index strategy selector and establish the connection between Spark and the multi-index storage system;
  • Step c: Based on the index strategy selector, assign the corresponding index method according to the calculation scenario, and search for remote sensing data in the multi-index storage system according to that index method.
  • The embodiment of the present application provides a computer program product. The computer program product includes a computer program stored on a non-transitory computer-readable storage medium, and the computer program includes program instructions that, when executed by a computer, cause the computer to perform the following operations:
  • Step a: Establish quadtree, GeoHash and R-tree index systems for the remote sensing data in the PostgreSQL database, and store the three index systems separately, to obtain a multi-index storage system in which multiple indexes coexist for Spark;
  • Step b: Select an index strategy selector and establish the connection between Spark and the multi-index storage system;
  • Step c: Based on the index strategy selector, assign the corresponding index method according to the calculation scenario, and search for remote sensing data in the multi-index storage system according to that index method.
  • The Spark-oriented remote sensing data indexing method, system, and electronic device of the embodiments of the present application integrate and drive multiple indexing methods and assign an indexing method according to the computing scenario, so that indexing time is greatly reduced compared with a single indexing method.
  • This gives strong support to platforms that compute on remote sensing big data. The approach also adapts to Spark computing tasks and locates the required files quickly and efficiently, achieving efficient computation on remote sensing data.
  • In addition, combining a distributed storage system with indexes uses the storage performance of the machines more efficiently, which increases utilization.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

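A Spark-oriented remote sensing data indexing method, system and electronic device. The method includes: establishing quadtree, GeoHash and R-tree index systems for remote sensing data in a PostgreSQL database and storing the three index systems separately, to obtain a multi-index storage system in which multiple indexes coexist for Spark; selecting an index strategy selector and establishing the connection between Spark and the multi-index storage system; and, based on the index strategy selector, assigning the corresponding index method according to the calculation scenario and searching for remote sensing data in the multi-index storage system according to that index method. The method can assign index methods according to different calculation scenarios, which greatly reduces indexing time compared with a single index method. It also adapts to Spark computing tasks and locates the required files quickly and efficiently, achieving efficient computation on remote sensing data.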

Description

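A Spark-oriented remote sensing data indexing method, system, and electronic device
Technical Field
This application belongs to the technical field of big data applications, and in particular relates to a Spark-oriented remote sensing data indexing method, system, and electronic device.
Background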
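With the development of big data and cloud computing, the volume of data in every field keeps growing. Remote sensing data is image information captured by satellites operating in space. As these images are captured and collected day after day, the volume of remote sensing data has grown ever larger, which creates substantial storage and computation problems.
Traditionally, different processing frameworks place different requirements on the underlying storage system in terms of real-time behavior, availability, batch-processing performance, and so on. Storage and computation methods for remote sensing data are mostly designed around a spatial database management system (SDBMS), whose storage capability depends heavily on the performance of the underlying DBMS. Large-scale agricultural remote sensing data and highly concurrent access pose a serious challenge to SDBMS-based storage and computation methods for agricultural remote sensing data. In terms of scalability, an SDBMS generally scales vertically, increasing its processing capacity by upgrading hardware such as the CPU, large-capacity memory, and high-speed disks. For technical and cost reasons, vertical scaling is not sustainable, and it is limited in both capacity and scale. In terms of availability, the inherent performance bottleneck and single point of failure of a single-machine SDBMS also make it hard to handle large-scale concurrent access.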
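Many indexing techniques already serve remote sensing data storage. The most widely used storage approaches are latitude/longitude-based storage, quad-tree storage, and R-tree storage. The main idea of latitude/longitude index storage is to build an index table from latitude and longitude; the size of each index block is either defined by the user or bounded by the data block size in HDFS (the Hadoop Distributed File System). This approach suits computation in big data frameworks such as Spark, because it is based on the HDFS file system and splits files appropriately, which matches the interfaces of big data computing frameworks. A quad-tree is an index system built by dividing a spatial region into four subregions and repeating the division downward until the final, indivisible child nodes are reached. An R-tree is an index system built by clustering: nearby nodes are grouped together into a tree, and the groups are nested within one another to form the whole tree.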
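The quad-tree subdivision described above can be sketched in a few lines. This is a minimal illustration rather than the application's implementation: the function name `quadtree_path`, the quadrant numbering, and the choice of the full longitude/latitude range as the root cell are all assumptions made for the example.

```python
def quadtree_path(lon, lat, depth):
    """Return the quadrant digits (0-3) from the root to the cell holding the point.

    Quadrant numbering (an assumption for this sketch):
    0 = south-west, 1 = south-east, 2 = north-west, 3 = north-east.
    """
    west, south, east, north = -180.0, -90.0, 180.0, 90.0
    path = []
    for _ in range(depth):
        mid_lon = (west + east) / 2
        mid_lat = (south + north) / 2
        quadrant = 0
        if lon >= mid_lon:      # point lies in the eastern half
            quadrant += 1
            west = mid_lon
        else:
            east = mid_lon
        if lat >= mid_lat:      # point lies in the northern half
            quadrant += 2
            south = mid_lat
        else:
            north = mid_lat
        path.append(quadrant)
    return path


print(quadtree_path(10.0, 10.0, 1))
print(quadtree_path(-10.0, -10.0, 2))
```

Each extra level of depth splits the current cell into four, so two points that share a long path prefix lie in the same small cell.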
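Index storage systems built on the above methods usually use a two-level index: a macroscopic global index at the first level, followed by a detailed index at the second level. Such a scheme adapts to different storage types and makes it possible to define the data block size effectively, which simplifies management and storage. Many systems at home and abroad have been built on these index storage systems. Hadoop-GIS [Aji A, Wang F, Vo H, et al. Hadoop GIS: a high performance spatial data warehousing system over MapReduce[J]. Proceedings of the VLDB Endowment, 2013, 6(11): 1009-1020.] builds its index on a global index, and users can define the size of each data block to simplify storage and indexing. SpatialHadoop [Eldawy A, Mokbel M F. SpatialHadoop: A MapReduce framework for spatial data[C]. 2015 IEEE 31st International Conference on Data Engineering (ICDE). IEEE, 2015: 1352-1363.] is an index system built on the R-tree; it partitions the data into blocks and then uses Hadoop to carry out the related database indexing work. GeoSpark [Yu J, Wu J, Sarwat M. A demonstration of GeoSpark: A cluster computing framework for processing big spatial data[C]. 2016 IEEE 32nd International Conference on Data Engineering (ICDE). IEEE, 2016: 1410-1413.] is a typical user-defined two-level remote sensing data index system: users define the data blocks, and the index is then built from latitude/longitude together with an R-tree or quad-tree. SHAHED [Eldawy A, Mokbel M F, Alharthi S, et al. SHAHED: A MapReduce-based system for querying and visualizing spatio-temporal satellite data[C]. 2015 IEEE 31st International Conference on Data Engineering (ICDE). IEEE, 2015: 1585-1596.] is currently the most mature index system in the industry. It uses a two-level index: the first level combines multiple dimensions with latitude/longitude, and the second level builds the index with an aggregate quad-tree.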
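The index storage methods above all come from researchers' work on storing large volumes of remote sensing data. Each has its strengths, and the two-level global and local storage scheme meets the need for fast index lookup. However, some of these databases are weak in areas such as scalability, while others provide very thorough indexing support for database scaling but neglect the space consumed by the database as a whole. The research focuses only on how to build a good single index system and then look up or fetch information faster. A single index system cannot index efficiently across many different scenarios, so a great deal of time and resources is wasted when indexing files, which lowers the efficiency of the whole processing system.
Spark, a recent cloud computing technology, is well suited to computation on large datasets, but existing storage and indexing schemes cannot provide effective data support for Spark computation. This application therefore aims to provide an index strategy that lets Spark compute efficiently in different scenarios, so that data is found quickly, Spark computation is accelerated, and resources and time are used more efficiently.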
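Summary of the Invention
This application provides a Spark-oriented remote sensing data indexing method, system, and electronic device, intended to solve, at least to some extent, one of the above technical problems in the prior art.
To solve the above problems, this application provides the following technical solutions:
A Spark-oriented remote sensing data indexing method, including the following steps:
Step a: build quad-tree, GeoHash, and R-tree index systems for remote sensing data in a PostgreSQL database, and store the quad-tree, GeoHash, and R-tree index systems separately, to obtain a multi-index storage system in which multiple indexes coexist for Spark;
Step b: select an index strategy selector and connect Spark to the multi-index storage system;
Step c: based on the index strategy selector, assign the corresponding index mode according to the computing scenario, and look up remote sensing data in the multi-index storage system according to the index mode.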
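The technical solution adopted in the embodiments of this application further includes: step a further includes obtaining remote sensing data and storing the remote sensing data in the HDFS file system.
The technical solution adopted in the embodiments of this application further includes: step a further includes building a one-layer index system in the HDFS file system and storing that index system in the PostgreSQL database.
The technical solution adopted in the embodiments of this application further includes: step b further includes building a hot-zone in-memory file system for frequently accessed data on the index strategy selector; using machine learning to find the characteristics of queries and computations; deriving the location of the hot-zone in-memory file system from that analysis and completing its construction; and analyzing the characteristics of different computing scenarios to obtain an index strategy selector adapted to different computing scenarios.
The technical solution adopted in the embodiments of this application further includes: in step c, assigning the corresponding index mode according to the computing scenario and looking up remote sensing data in the multi-index storage system according to the index mode specifically includes: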
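Step c1: obtain the computing parameters;
Step c2: determine which index mode suits the computing parameters;
Step c3: select the index mode, look up the remote sensing data according to the index mode, and pass the remote sensing data to the computing function;
Step c4: drive the Spark computation;
Step c5: return the computation result, and store the computation result and the computation record;
Step c6: publish the computation result.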
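Step c2 is the decision point of the method. The sketch below shows one way such a decision could look if it were written as fixed rules. It is only an illustration: the application obtains its selector through learning and experiments, whereas the rules here are hand-written, and the function name `choose_index_mode`, the `geometry` parameter, and the quad-tree fallback are assumptions. The point and line/polygon rules follow the description's remark that GeoHash suits point data and the R-tree suits line and polygon data.

```python
def choose_index_mode(params):
    """Pick an index mode from the computing parameters (hypothetical rules)."""
    geometry = params.get("geometry")
    if geometry == "point":
        # GeoHash is described as particularly suitable for point data.
        return "geohash"
    if geometry in ("line", "polygon"):
        # The R-tree is described as better for line and polygon data.
        return "rtree"
    # Assumption for this sketch: fall back to the quad-tree otherwise.
    return "quadtree"


for request in ({"geometry": "point"}, {"geometry": "polygon"}, {"geometry": "raster"}):
    print(request, "->", choose_index_mode(request))
```

A learned selector would replace the body of `choose_index_mode` while keeping the same contract: computing parameters in, index mode out.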
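Another technical solution adopted in the embodiments of this application is a Spark-oriented remote sensing data indexing system, including:
Multi-index storage system building module: configured to build quad-tree, GeoHash, and R-tree index systems for remote sensing data in a PostgreSQL database, and to store the quad-tree, GeoHash, and R-tree index systems separately, to obtain a multi-index storage system in which multiple indexes coexist for Spark;
Index connection module: configured to select an index strategy selector and connect Spark to the multi-index storage system;
Data indexing module: configured to assign, based on the index strategy selector, the corresponding index mode according to the computing scenario, and to look up remote sensing data in the multi-index storage system according to the index mode.
The technical solution adopted in the embodiments of this application further includes:
Data acquisition module: configured to obtain remote sensing data;
Data storage module: configured to store the remote sensing data in the HDFS file system.
The technical solution adopted in the embodiments of this application further includes:
Index system building module: configured to build a one-layer index system in the HDFS file system;
Index system storage module: configured to store the index system in the PostgreSQL database.
The technical solution adopted in the embodiments of this application further includes: the index connection module is further configured to build a hot-zone in-memory file system for frequently accessed data on the index strategy selector, use machine learning to find the characteristics of queries and computations, derive the location of the hot-zone in-memory file system from that analysis and complete its construction, and analyze the characteristics of different computing scenarios to obtain an index strategy selector adapted to different computing scenarios.
The technical solution adopted in the embodiments of this application further includes: the data indexing module assigning the corresponding index mode according to the computing scenario and looking up remote sensing data in the multi-index storage system according to the index mode specifically includes: obtaining the computing parameters; determining which index mode suits the computing parameters; selecting the index mode, looking up the remote sensing data according to the index mode, and passing the remote sensing data to the computing function; driving the Spark computation; returning the computation result, and storing the computation result and the computation record; and publishing the computation result.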
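Yet another technical solution adopted in the embodiments of this application is an electronic device, including:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the following operations of the above Spark-oriented remote sensing data indexing method:
Step a: build quad-tree, GeoHash, and R-tree index systems for remote sensing data in a PostgreSQL database, and store the quad-tree, GeoHash, and R-tree index systems separately, to obtain a multi-index storage system in which multiple indexes coexist for Spark;
Step b: select an index strategy selector and connect Spark to the multi-index storage system;
Step c: based on the index strategy selector, assign the corresponding index mode according to the computing scenario, and look up remote sensing data in the multi-index storage system according to the index mode.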
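Compared with the prior art, the embodiments of this application have the following beneficial effects: the Spark-oriented remote sensing data indexing method, system, and electronic device of the embodiments integrate and drive multiple index modes and assign an index mode according to the computing scenario, so that indexing time drops substantially compared with a single index mode. This gives strong support to platforms that compute on large volumes of remote sensing data, fits Spark computing tasks, and locates the required files quickly and efficiently, enabling efficient computation on remote sensing data. By combining a distributed storage system with indexes, this application makes more efficient use of the machines' storage performance and raises utilization.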
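Brief Description of the Drawings
FIG. 1 is a flowchart of the method for building the multi-index storage system according to an embodiment of this application;
FIG. 2 is a schematic diagram of the multi-index system structure according to an embodiment of this application;
FIG. 3 is a flowchart of the remote sensing data indexing method based on the multi-index storage system;
FIG. 4 is a schematic structural diagram of the Spark-oriented remote sensing data indexing system according to an embodiment of this application;
FIG. 5 is a schematic structural diagram of the hardware device for the Spark-oriented remote sensing data indexing method provided by an embodiment of this application.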
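Detailed Description
To make the purpose, technical solutions, and advantages of this application clearer, the application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain this application and do not limit it.
To address the shortcomings of the prior art, this application builds on several common spatial data index structures to create, for remote sensing data, a multi-index storage system in which multiple indexes coexist for Spark. The system integrates and drives multiple index modes and fits Spark computing tasks: it quickly and efficiently locates the required files and passes them to the computing function for efficient computation. Different computing scenarios map to different index methods, so the strengths of each index system are fully used in support of efficient Spark computation.
Specifically, see FIG. 1, a flowchart of the method for building the multi-index storage system according to an embodiment of this application. The method includes the following steps: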
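Step 100: obtain remote sensing data;
Step 110: store the remote sensing data in the HDFS file system;
In step 110, while storing the remote sensing data, the HDFS file system creates two additional replicas, so that if a single node fails the original file can be recovered from other machines and data loss is avoided. Having multiple replicas also supports Spark's parallel computation.
Step 120: build a one-layer index system in the HDFS file system;
Step 130: store the index system in the PostgreSQL database;
Step 140: build quad-tree, GeoHash, and R-tree index systems for the remote sensing data in the PostgreSQL database, and store the quad-tree, GeoHash, and R-tree index systems in three different databases in PostgreSQL, to obtain a multi-index storage system in which multiple indexes coexist for Spark;
In step 140, when index systems are built, each additional index system would add three file replicas to the storage system. In the embodiments of this application, a one-layer index system is instead built for each file replica, so the multi-index storage system is built without adding data redundancy. See FIG. 2, a schematic diagram of the multi-index system structure according to an embodiment of this application.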
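Research shows that the quad-tree, GeoHash, and R-tree index modes are suitable for building a multi-index storage system, but each has its limitations. Because a quad-tree is a tree structure, the index can become unbalanced after data is added, which lowers lookup efficiency and in turn computing efficiency. GeoHash is only one form of spatial index; it is particularly well suited to point data, whereas an R-tree index works better for line and polygon data. An R-tree stores object positions directly, but the positions of continuously moving objects keep changing, which leads to frequent updates. The MBRs (minimum bounding rectangles) of R-tree nodes are allowed to overlap, so finding an old index entry may require traversing multiple paths. To improve query performance, an R-tree requires MBR boundaries to be as tight as possible, which makes updates expensive: objects on a boundary easily move in and out of the MBR, and every deletion or insertion can trigger merge and split operations.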
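To make the remark about GeoHash and point data concrete, here is a small sketch of the standard GeoHash encoding, which turns a single latitude/longitude point into a short string. It is offered as background illustration, not as code from the application; the function name `geohash_encode` and the default precision are choices made for this example.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # GeoHash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, precision=6):
    """Encode one point as a GeoHash string of `precision` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    bits = []
    use_lon = True  # bits alternate between longitude and latitude, longitude first
    while len(bits) < precision * 5:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits.append(1)
            rng[0] = mid   # keep the upper half of the interval
        else:
            bits.append(0)
            rng[1] = mid   # keep the lower half of the interval
        use_lon = not use_lon
    chars = []
    for i in range(0, len(bits), 5):   # every 5 bits become one base-32 character
        index = 0
        for bit in bits[i:i + 5]:
            index = index * 2 + bit
        chars.append(BASE32[index])
    return "".join(chars)


print(geohash_encode(57.64911, 10.40744, 3))
```

Because every character narrows the cell around one point, a line or polygon that spans many cells has no single natural hash, which is consistent with the text's preference for the R-tree on such data.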
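This application therefore integrates the quad-tree, GeoHash, and R-tree index modes so that the three work without interfering with one another while providing data support for Spark. Because the remote sensing data is stored in the HDFS file system, a concrete storage path must be supplied before Spark can obtain the remote sensing data it needs. A one-layer index system is therefore built in the HDFS file system so that the required data is easy to find during computation. When processing large volumes of remote sensing data, Spark uses different data ranges (in time and space), which puts too much load on a single index system, so that system cannot guarantee efficient delivery of the required data. Building an index system in which multiple indexes coexist for Spark, and connecting Spark to it, both meets Spark's access needs and effectively improves Spark's computing capability.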
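The role of the one-layer index, mapping a spatial and temporal range to concrete HDFS paths, can be illustrated with a toy lookup. Everything in the sketch is hypothetical: the record layout, the `hdfs:///remote_sensing/...` paths, and the function name `find_paths` are invented for the example, and an in-memory list stands in for the PostgreSQL table that the application uses to store the index.

```python
# Hypothetical index records: one per stored scene.
FIRST_LEVEL_INDEX = [
    {"path": "hdfs:///remote_sensing/scene_001.tif",
     "bbox": (110.0, 20.0, 115.0, 25.0), "date": "2019-01-15"},
    {"path": "hdfs:///remote_sensing/scene_002.tif",
     "bbox": (113.0, 22.0, 118.0, 27.0), "date": "2019-02-10"},
    {"path": "hdfs:///remote_sensing/scene_003.tif",
     "bbox": (80.0, 30.0, 85.0, 35.0), "date": "2019-01-20"},
]


def boxes_intersect(a, b):
    """Boxes are (min_lon, min_lat, max_lon, max_lat)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def find_paths(index, bbox, start_date, end_date):
    """Return the storage paths whose scene overlaps the box and date range.

    Dates are ISO strings, so plain string comparison orders them correctly.
    """
    return [
        record["path"]
        for record in index
        if boxes_intersect(record["bbox"], bbox)
        and start_date <= record["date"] <= end_date
    ]


print(find_paths(FIRST_LEVEL_INDEX, (114.0, 23.0, 116.0, 24.0), "2019-01-01", "2019-01-31"))
```

The returned paths are what a Spark job would then read; widening the date range in the call above would also return the second scene.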
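Step 150: obtain a reasonable index strategy selector through learning, connect Spark to the multi-index storage system, and build on the index strategy selector a hot-zone in-memory file system for frequently accessed data;
In step 150, the index strategy selector is feasible because each index system has its own strengths and weaknesses. When computing different numerical indices, or computing over different ranges, the required data volume and data format both affect indexing efficiency. This application therefore obtains a reasonable index strategy selector through learning and experiments and assigns index modes according to the scenario, so that indexing time drops substantially compared with a single index mode.
Specifically, selecting the index strategy selector includes: (1) setting up the Spark big data processing framework on the cluster and testing that it is fully functional and runs normally; (2) connecting Spark to the index strategy selector, testing whether the interface is usable, and adjusting it until it can serve Spark; (3) once the connection is in place, running computation tests under different scenarios. The behavior of each single index without the index strategy selector is also tested, and the test results are compared to optimize the index strategy selector.
At the same time, a hot-zone in-memory file system is built on the index strategy selector. Machine learning is used to find the characteristics of queries and computations; analysis and learning yield the location of the hot-zone in-memory file system, completing its construction. The characteristics of different computing scenarios are analyzed, and an index strategy selector adapted to different scenarios is integrated to provide index services to Spark. This effectively reduces repeated searches and queries for the same data, which not only reduces memory overhead but also improves computing efficiency.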
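The hot-zone idea, keeping frequently requested data in memory so that repeated lookups are not repeated against the index, can be sketched with a simple frequency counter. This is a stand-in only: the application locates hot zones through machine learning, while the class below merely counts accesses, and the names `HotZoneCache`, `get`, and `loader` are assumptions made for the example.

```python
from collections import Counter


class HotZoneCache:
    """Keep the most frequently requested items in memory (frequency-based sketch)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.hits = Counter()   # how often each key has been requested
        self.store = {}         # keys currently held in memory

    def get(self, key, loader):
        """Return the value for key, calling loader(key) only on a cache miss."""
        self.hits[key] += 1
        if key in self.store:
            return self.store[key]
        value = loader(key)
        self._admit(key, value)
        return value

    def _admit(self, key, value):
        if len(self.store) < self.capacity:
            self.store[key] = value
            return
        # Replace the least requested cached key only if the new key is hotter.
        coldest = min(self.store, key=lambda k: self.hits[k])
        if self.hits[key] > self.hits[coldest]:
            del self.store[coldest]
            self.store[key] = value


lookups = []

def slow_lookup(key):
    lookups.append(key)          # record each real index lookup
    return "data for " + key

cache = HotZoneCache(capacity=1)
cache.get("tile_a", slow_lookup)
cache.get("tile_a", slow_lookup)   # served from memory, no second lookup
cache.get("tile_b", slow_lookup)   # looked up, but tile_a stays cached
print(lookups)
```

In this run the index is consulted once for `tile_a` and once for `tile_b`; the second request for `tile_a` never reaches it.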
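Step 160: based on the index strategy selector, assign index modes according to the computing scenario, look up remote sensing data in the multi-index storage system according to the index mode, and run the Spark computation;
For step 160, see also FIG. 3, a flowchart of the remote sensing data indexing method based on the multi-index storage system. It includes the following steps:
Step 161: obtain the computing parameters;
Step 162: determine which index mode suits the parameters;
Step 163: select the index mode, look up the remote sensing data according to the index mode, and pass the remote sensing data to the computing function;
Step 164: drive the Spark computation;
Step 165: return the computation result, and store the computation result and the computation record;
Step 166: publish the computation result.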
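Steps 161 to 166 form a small pipeline, which the following sketch wires together with plain Python callables. It is a schematic under stated assumptions, not the application's code: `run_indexed_job`, the toy selector, the per-mode lookup functions, and the example paths are all hypothetical, and the `compute` argument stands in for whatever function would launch the actual Spark job.

```python
def run_indexed_job(params, selector, indexes, compute, history):
    """Run one job; `params` are the computing parameters of step 161."""
    mode = selector(params)              # step 162: decide which index mode fits
    paths = indexes[mode](params)        # step 163: look up data with that mode
    result = compute(paths)              # step 164: stand-in for the Spark computation
    history.append({"params": params, "mode": mode, "result": result})  # step 165
    return result                        # step 166: hand the result to the caller


def toy_selector(params):
    # Hypothetical rule: points use GeoHash, everything else the R-tree.
    return "geohash" if params["geometry"] == "point" else "rtree"


# Hypothetical lookups, one per index mode, each returning storage paths.
toy_indexes = {
    "geohash": lambda params: ["hdfs:///rs/points_001"],
    "rtree": lambda params: ["hdfs:///rs/poly_001", "hdfs:///rs/poly_002"],
    "quadtree": lambda params: ["hdfs:///rs/tile_001"],
}

job_history = []
job_result = run_indexed_job({"geometry": "polygon"}, toy_selector, toy_indexes, len, job_history)
print(job_result, job_history[0]["mode"])
```

Keeping the selector, the lookups, and the computation as separate arguments mirrors the separation in the text: the index mode can change per request without touching the computation itself.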
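See FIG. 4, a schematic structural diagram of the Spark-oriented remote sensing data indexing system according to an embodiment of this application. The system includes:
Data acquisition module: configured to obtain remote sensing data;
Data storage module: configured to store the remote sensing data in the HDFS file system. While storing the remote sensing data, the HDFS file system creates two additional replicas, so that if a single node fails the original file can be recovered from other machines and data loss is avoided. Having multiple replicas also supports Spark's parallel computation.
Index system building module: configured to build a one-layer index system in the HDFS file system;
Index system storage module: configured to store the index system in the PostgreSQL database;
Multi-index storage system building module: configured to build quad-tree, GeoHash, and R-tree index systems for the remote sensing data in the PostgreSQL database, and to store the quad-tree, GeoHash, and R-tree index systems in three different databases in PostgreSQL, to obtain a multi-index storage system in which multiple indexes coexist for Spark. When index systems are built, each additional index system would add three file replicas to the storage system. In the embodiments of this application, a one-layer index system is instead built for each file replica, so the multi-index storage system is built without adding data redundancy.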
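Research shows that the quad-tree, GeoHash, and R-tree index modes are suitable for building a multi-index storage system, but each has its limitations. Because a quad-tree is a tree structure, the index can become unbalanced after data is added, which lowers lookup efficiency and in turn computing efficiency. GeoHash is only one form of spatial index; it is particularly well suited to point data, whereas an R-tree index works better for line and polygon data. An R-tree stores object positions directly, but the positions of continuously moving objects keep changing, which leads to frequent updates. The MBRs (minimum bounding rectangles) of R-tree nodes are allowed to overlap, so finding an old index entry may require traversing multiple paths. To improve query performance, an R-tree requires MBR boundaries to be as tight as possible, which makes updates expensive: objects on a boundary easily move in and out of the MBR, and every deletion or insertion can trigger merge and split operations.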
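The cost of overlapping R-tree bounding rectangles can be seen in a tiny example: when two node rectangles overlap, a query that falls in the shared area has to descend into both. The sketch is illustrative only; the node names, coordinates, and the helper `subtrees_to_visit` are invented for this example.

```python
def rectangles_intersect(a, b):
    """Rectangles are (xmin, ymin, xmax, ymax)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def subtrees_to_visit(node_rectangles, query):
    """Return the child nodes whose bounding rectangle intersects the query."""
    return [name for name, rect in node_rectangles.items()
            if rectangles_intersect(rect, query)]


# Bounding rectangles of three sibling nodes; A and B overlap in [4, 6] x [4, 6].
children = {
    "A": (0, 0, 6, 6),
    "B": (4, 4, 10, 10),
    "C": (20, 20, 30, 30),
}

print(subtrees_to_visit(children, (5, 5, 5.5, 5.5)))    # inside the overlap of A and B
print(subtrees_to_visit(children, (25, 25, 26, 26)))    # inside C only
```

The first query must follow two paths and the second only one, which is why tight, low-overlap rectangles matter for lookup speed.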
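This application therefore integrates the quad-tree, GeoHash, and R-tree index modes so that the three work without interfering with one another while providing data support for Spark. Because the remote sensing data is stored in the HDFS file system, a concrete storage path must be supplied before Spark can obtain the remote sensing data it needs. A one-layer index system is therefore built in the HDFS file system so that the required data is easy to find during computation. When processing large volumes of remote sensing data, Spark uses different data ranges (in time and space), which puts too much load on a single index system, so that system cannot guarantee efficient delivery of the required data. Building an index system in which multiple indexes coexist for Spark, and connecting Spark to it, both meets Spark's access needs and effectively improves Spark's computing capability.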
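Index connection module: configured to obtain a reasonable index strategy selector through learning, connect Spark to the multi-index storage system, and build on the index strategy selector a hot-zone in-memory file system for frequently accessed data. The index strategy selector is feasible because each index system has its own strengths and weaknesses. When computing different numerical indices, or computing over different ranges, the required data volume and data format both affect indexing efficiency. This application therefore obtains a reasonable index strategy selector through learning and experiments and assigns index modes according to the scenario, so that indexing time drops substantially compared with a single index mode.
Specifically, selecting the index strategy selector includes: (1) setting up the Spark big data processing framework on the cluster and testing that it is fully functional and runs normally; (2) connecting Spark to the index strategy selector, testing whether the interface is usable, and adjusting it until it can serve Spark; (3) once the connection is in place, running computation tests under different scenarios. The behavior of each single index without the index strategy selector is also tested, and the test results are compared to optimize the index strategy selector.
At the same time, a hot-zone in-memory file system is built on the index strategy selector. Machine learning is used to find the characteristics of queries and computations; analysis and learning yield the location of the hot-zone in-memory file system, completing its construction. The characteristics of different computing scenarios are analyzed, and an index strategy selector adapted to different scenarios is integrated to provide index services to Spark. This effectively reduces repeated searches and queries for the same data, which not only reduces memory overhead but also improves computing efficiency.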
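Data indexing module: configured to, based on the index strategy selector, assign index modes according to the computing scenario, look up remote sensing data in the multi-index storage system according to the index mode, and run the Spark computation. Specifically, the data indexing procedure includes:
1: obtain the computing parameters;
2: determine which index mode suits the parameters;
3: select the index mode, look up the remote sensing data according to the index mode, and pass the remote sensing data to the computing function;
4: drive the Spark computation;
5: return the computation result, and store the computation result and the computation record;
6: publish the computation result.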
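FIG. 5 is a schematic structural diagram of the hardware device for the Spark-oriented remote sensing data indexing method provided by an embodiment of this application. As shown in FIG. 5, the device includes one or more processors and a memory. Taking one processor as an example, the device may further include an input system and an output system.
The processor, memory, input system, and output system may be connected by a bus or by other means; FIG. 5 takes a bus connection as an example.
As a non-transitory computer-readable storage medium, the memory can store non-transitory software programs, non-transitory computer-executable programs, and modules. By running the non-transitory software programs, instructions, and modules stored in the memory, the processor performs the various functional applications and data processing of the electronic device, that is, it implements the processing method of the above method embodiments.
The memory may include a program storage area and a data storage area. The program storage area may store the operating system and the application programs required by at least one function; the data storage area may store data and the like. In addition, the memory may include high-speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and such remote memory may be connected to the processing system over a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
The input system can receive numeric or character input and generate signal input. The output system may include a display device such as a display screen.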
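The one or more modules are stored in the memory and, when executed by the one or more processors, perform the following operations of any of the above method embodiments:
Step a: build quad-tree, GeoHash, and R-tree index systems for remote sensing data in a PostgreSQL database, and store the quad-tree, GeoHash, and R-tree index systems separately, to obtain a multi-index storage system in which multiple indexes coexist for Spark;
Step b: select an index strategy selector and connect Spark to the multi-index storage system;
Step c: based on the index strategy selector, assign the corresponding index mode according to the computing scenario, and look up remote sensing data in the multi-index storage system according to the index mode.
The above product can execute the method provided by the embodiments of this application and has the functional modules and beneficial effects corresponding to executing the method. For technical details not described in full in this embodiment, see the method provided by the embodiments of this application.
An embodiment of this application provides a non-transitory (non-volatile) computer storage medium that stores computer-executable instructions, which can perform the following operations:
Step a: build quad-tree, GeoHash, and R-tree index systems for remote sensing data in a PostgreSQL database, and store the quad-tree, GeoHash, and R-tree index systems separately, to obtain a multi-index storage system in which multiple indexes coexist for Spark;
Step b: select an index strategy selector and connect Spark to the multi-index storage system;
Step c: based on the index strategy selector, assign the corresponding index mode according to the computing scenario, and look up remote sensing data in the multi-index storage system according to the index mode.
An embodiment of this application provides a computer program product, which includes a computer program stored on a non-transitory computer-readable storage medium. The computer program includes program instructions that, when executed by a computer, cause the computer to perform the following operations:
Step a: build quad-tree, GeoHash, and R-tree index systems for remote sensing data in a PostgreSQL database, and store the quad-tree, GeoHash, and R-tree index systems separately, to obtain a multi-index storage system in which multiple indexes coexist for Spark;
Step b: select an index strategy selector and connect Spark to the multi-index storage system;
Step c: based on the index strategy selector, assign the corresponding index mode according to the computing scenario, and look up remote sensing data in the multi-index storage system according to the index mode.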
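To verify the feasibility and effectiveness of this application, a 300 GB dataset was stored and tested in experiments. The whole index system ran normally in the experiments, provided correct indexes for Spark computation, and achieved fast lookups. Different index modes can be used for different computing scenarios, which in turn speeds up computation.
The Spark-oriented remote sensing data indexing method, system, and electronic device of the embodiments of this application integrate and drive multiple index modes and assign an index mode according to the computing scenario, so that indexing time drops substantially compared with a single index mode. This gives strong support to platforms that compute on large volumes of remote sensing data, fits Spark computing tasks, and locates the required files quickly and efficiently, enabling efficient computation on remote sensing data. By combining a distributed storage system with indexes, this application makes more efficient use of the machines' storage performance and raises utilization.
The above description of the disclosed embodiments enables those skilled in the art to implement or use this application. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined in this application may be implemented in other embodiments without departing from the spirit or scope of this application. Therefore, this application is not limited to the embodiments shown here but is to be accorded the widest scope consistent with the principles and novel features disclosed in this application.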

Claims (11)

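  1. A Spark-oriented remote sensing data indexing method, characterized by including the following steps:
    Step a: build quad-tree, GeoHash, and R-tree index systems for remote sensing data in a PostgreSQL database, and store the quad-tree, GeoHash, and R-tree index systems separately, to obtain a multi-index storage system in which multiple indexes coexist for Spark;
    Step b: select an index strategy selector and connect Spark to the multi-index storage system;
    Step c: based on the index strategy selector, assign the corresponding index mode according to the computing scenario, and look up remote sensing data in the multi-index storage system according to the index mode.
  2. The Spark-oriented remote sensing data indexing method according to claim 1, characterized in that step a further includes: obtaining remote sensing data and storing the remote sensing data in the HDFS file system.
  3. The Spark-oriented remote sensing data indexing method according to claim 2, characterized in that step a further includes: building a one-layer index system in the HDFS file system and storing the index system in the PostgreSQL database.
  4. The Spark-oriented remote sensing data indexing method according to claim 1, characterized in that step b further includes: building a hot-zone in-memory file system for frequently accessed data on the index strategy selector, using machine learning to find the characteristics of queries and computations, deriving the location of the hot-zone in-memory file system from that analysis and completing its construction, and analyzing the characteristics of different computing scenarios to obtain an index strategy selector adapted to different computing scenarios.
  5. The Spark-oriented remote sensing data indexing method according to claim 4, characterized in that, in step c, assigning the corresponding index mode according to the computing scenario and looking up remote sensing data in the multi-index storage system according to the index mode specifically includes:
    Step c1: obtain the computing parameters;
    Step c2: determine which index mode suits the computing parameters;
    Step c3: select the index mode, look up the remote sensing data according to the index mode, and pass the remote sensing data to the computing function;
    Step c4: drive the Spark computation;
    Step c5: return the computation result, and store the computation result and the computation record;
    Step c6: publish the computation result.
  6. A Spark-oriented remote sensing data indexing system, characterized by including:
    Multi-index storage system building module: configured to build quad-tree, GeoHash, and R-tree index systems for remote sensing data in a PostgreSQL database, and to store the quad-tree, GeoHash, and R-tree index systems separately, to obtain a multi-index storage system in which multiple indexes coexist for Spark;
    Index connection module: configured to select an index strategy selector and connect Spark to the multi-index storage system;
    Data indexing module: configured to assign, based on the index strategy selector, the corresponding index mode according to the computing scenario, and to look up remote sensing data in the multi-index storage system according to the index mode.
  7. The Spark-oriented remote sensing data indexing system according to claim 6, characterized by further including:
    Data acquisition module: configured to obtain remote sensing data;
    Data storage module: configured to store the remote sensing data in the HDFS file system.
  8. The Spark-oriented remote sensing data indexing system according to claim 7, characterized by further including:
    Index system building module: configured to build a one-layer index system in the HDFS file system;
    Index system storage module: configured to store the index system in the PostgreSQL database.
  9. The Spark-oriented remote sensing data indexing system according to claim 6, characterized in that the index connection module is further configured to build a hot-zone in-memory file system for frequently accessed data on the index strategy selector, use machine learning to find the characteristics of queries and computations, derive the location of the hot-zone in-memory file system from that analysis and complete its construction, and analyze the characteristics of different computing scenarios to obtain an index strategy selector adapted to different computing scenarios.
  10. The Spark-oriented remote sensing data indexing system according to claim 9, characterized in that the data indexing module assigning the corresponding index mode according to the computing scenario and looking up remote sensing data in the multi-index storage system according to the index mode specifically includes: obtaining the computing parameters; determining which index mode suits the computing parameters; selecting the index mode, looking up the remote sensing data according to the index mode, and passing the remote sensing data to the computing function; driving the Spark computation; returning the computation result, and storing the computation result and the computation record; and publishing the computation result.
  11. An electronic device, including:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the following operations of the Spark-oriented remote sensing data indexing method according to any one of claims 1 to 5:
    Step a: build quad-tree, GeoHash, and R-tree index systems for remote sensing data in a PostgreSQL database, and store the quad-tree, GeoHash, and R-tree index systems separately, to obtain a multi-index storage system in which multiple indexes coexist for Spark;
    Step b: select an index strategy selector and connect Spark to the multi-index storage system;
    Step c: based on the index strategy selector, assign the corresponding index mode according to the computing scenario, and look up remote sensing data in the multi-index storage system according to the index mode.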
PCT/CN2019/130566 2019-03-22 2019-12-31 Spark-oriented remote sensing data indexing method, system, and electronic device WO2020192225A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910223461.8A CN110083598B (zh) 2019-03-22 2019-03-22 Spark-oriented remote sensing data indexing method, system, and electronic device
CN201910223461.8 2019-03-22

Publications (1)

Publication Number Publication Date
WO2020192225A1 true WO2020192225A1 (zh) 2020-10-01

Family

ID=67413479

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/130566 WO2020192225A1 (zh) 2019-03-22 2019-12-31 Spark-oriented remote sensing data indexing method, system, and electronic device

Country Status (2)

Country Link
CN (1) CN110083598B (zh)
WO (1) WO2020192225A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110083598B (zh) * 2019-03-22 2021-05-25 深圳先进技术研究院 Spark-oriented remote sensing data indexing method, system, and electronic device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103049554A (zh) * 2012-12-31 2013-04-17 吴立新 Vector QR-tree parallel indexing technique
US20130275454A1 (en) * 2012-04-12 2013-10-17 Martin Pfeifle Full Text Search Using R-Trees
CN105630919A (zh) * 2015-12-22 2016-06-01 曙光信息产业(北京)有限公司 Storage method and system
CN106780667A (zh) * 2016-12-12 2017-05-31 湖北金拓维信息技术有限公司 Multi-layer hybrid indexing method
CN108804602A (zh) * 2018-05-25 2018-11-13 武汉大学 Spark-based distributed spatial data storage and computing method
CN110083598A (zh) * 2019-03-22 2019-08-02 深圳先进技术研究院 Spark-oriented remote sensing data indexing method, system, and electronic device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
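CN103491185B (zh) * 2013-09-25 2016-05-18 浙江大学 Remote sensing data cloud storage method based on image block organization
CN105589951B (zh) * 2015-12-18 2019-03-26 中国科学院计算机网络信息中心 Distributed storage method and parallel query method for massive remote sensing image metadata
KR101852597B1 (ko) * 2017-09-14 2018-04-27 주식회사 포스웨이브 Moving object big data information storage system, and moving object big data storage and indexing method using the same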

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130275454A1 (en) * 2012-04-12 2013-10-17 Martin Pfeifle Full Text Search Using R-Trees
CN103049554A (zh) * 2012-12-31 2013-04-17 吴立新 Vector QR-tree parallel indexing technique
CN105630919A (zh) * 2015-12-22 2016-06-01 曙光信息产业(北京)有限公司 Storage method and system
CN106780667A (zh) * 2016-12-12 2017-05-31 湖北金拓维信息技术有限公司 Multi-layer hybrid indexing method
CN108804602A (zh) * 2018-05-25 2018-11-13 武汉大学 Spark-based distributed spatial data storage and computing method
CN110083598A (zh) * 2019-03-22 2019-08-02 深圳先进技术研究院 Spark-oriented remote sensing data indexing method, system, and electronic device

Also Published As

Publication number Publication date
CN110083598B (zh) 2021-05-25
CN110083598A (zh) 2019-08-02

Similar Documents

Publication Publication Date Title
Xie et al. Simba: Efficient in-memory spatial analytics
You et al. Large-scale spatial join query processing in cloud
Padhy Big data processing with Hadoop-MapReduce in cloud systems
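CN110990726A (zh) Spatio-temporal big data intelligent service system
Xie et al. Elite: an elastic infrastructure for big spatiotemporal trajectories
CN108073696B (zh) GIS application method based on a distributed in-memory database
CN106569896B (zh) Data distribution and parallel processing method and system
CN104239377A (zh) Cross-platform data retrieval method and device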
Cheng et al. Scale-out processing of large RDF datasets
Wang et al. Parallel trajectory search based on distributed index
García-García et al. Efficient distance join query processing in distributed spatial data management systems
Nidzwetzki et al. Distributed secondo: an extensible and scalable database management system
Tian et al. Joins for Hybrid Warehouses: Exploiting Massive Parallelism in Hadoop and Enterprise Data Warehouses.
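CN103226608A (zh) Parallel file search method based on a directory-level scalable Bloom Filter bitmap table
Pertesis et al. Efficient skyline query processing in spatialhadoop
CN111125248A (zh) Big data storage, parsing, and query system
WO2020192225A1 (zh) Spark-oriented remote sensing data indexing method, system, and electronic device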
Wang et al. Sparkarray: An array-based scientific data management system built on apache spark
Doulkeridis et al. On saying" enough already!" in mapreduce
García-García et al. Efficient distributed algorithms for distance join queries in spark-based spatial analytics systems
Li et al. An improved distributed query for large-scale RDF data
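CN109918410B (zh) Distributed big data functional dependency discovery method based on the Spark platform
Sangat et al. Distributed ATrie Group Join: Towards Zero Network Cost
CN110569310A (zh) Method for managing relational big data in a cloud computing environment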
Bhattu et al. Generalized communication cost efficient multi-way spatial join: revisiting the curse of the last reducer

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19921670

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19921670

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 180322)
