CN117909408A - Distributed storage method and system for geographic raster data - Google Patents
Distributed storage method and system for geographic raster data
- Publication number
- CN117909408A
- Authority
- CN
- China
- Prior art keywords
- query
- data
- distributed storage
- raster data
- distributed
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/27—Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2453—Query optimisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2453—Query optimisation
- G06F16/24532—Query optimisation of parallel queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2471—Distributed queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/29—Geographical information databases
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Fuzzy Systems (AREA)
- Mathematical Physics (AREA)
- Probability & Statistics with Applications (AREA)
- Remote Sensing (AREA)
- Computing Systems (AREA)
- Processing Or Creating Images (AREA)
Abstract
The application relates to the technical field of cloud computing and discloses a distributed storage method and system for geographic raster data. The method segments the geographic raster data into a plurality of slices, stores each slice in a distributed storage system, creates an index structure according to the query requirements, and inserts each slice into the index structure to improve query efficiency. Each slice in the distributed storage system is processed with a preset distributed computing framework, which makes full use of cluster computing resources, improves parallel processing performance, and scales horizontally to large data volumes. A performance optimization strategy is determined to further improve the overall performance of the system. The corresponding index structure is loaded according to the query requirement, the query operation is executed, and the query result is obtained. The method adapts to different types of geographic raster data and query requirements, which helps in building a more flexible and customizable geographic data processing flow.
Description
Technical Field
The application relates to the technical field of cloud computing, in particular to a distributed storage method and system for geographic raster data.
Background
Geographic raster data is geographic information data represented in raster form, typically used to describe surface features, remote sensing images, meteorological data, and the like. With the development of remote sensing technology and geographic information systems, there is an increasing need to process and analyze large-scale geographic raster data. Storing and querying geographic raster data in a distributed environment involves many technical challenges, including data slicing, distributed computing, indexing, and performance optimization.
Existing solutions are generally faced with the following problems:
Conventional geographic information systems and desktop GIS tools may hit performance bottlenecks when processing large-scale geographic raster data. Serial processing slows data processing, making it difficult to meet real-time or large-scale processing requirements.
Conventional geographic information systems and open source geographic data processing frameworks may lack direct support for distributed computing, making it difficult to fully exploit a computing cluster in large-scale data processing.
Traditional storage and query schemes perform poorly on large data sets, which can lead to long query response times and large storage overhead.
Some solutions lack the flexibility to accommodate different data structures and query requirements, which limits their use with diverse geographic raster data.
Disclosure of Invention
The embodiment of the application provides a distributed storage method for geographic raster data that aims to solve the following problems in the prior art: the requirements of real-time or large-scale data processing are difficult to meet; the performance of a computing cluster is difficult to fully utilize during large-scale data processing; storage and query schemes perform poorly on large data sets, with potentially long query response times and large storage overhead; and some solutions lack the flexibility to adapt to different data structures and query requirements.
Correspondingly, the embodiment of the application also provides a distributed storage system of the geographic raster data and electronic equipment, which are used for ensuring the realization and the application of the method.
In order to solve the technical problems, an embodiment of the application discloses a distributed storage method of geographic raster data, which comprises the following steps:
slicing the geographic raster data into a plurality of slices, each slice including a particular geographic region or grid;
storing each slice in a distributed storage system in a distributed manner;
creating an index structure according to the query requirements and the geographic raster data, and inserting each slice into the index structure;
processing each slice in the distributed storage system by using a preset distributed computing framework, and returning a processing result to the distributed storage system;
determining a performance optimization strategy, wherein the performance optimization strategy comprises data storage optimization, parallel computing optimization and query index optimization;
and loading the corresponding index structure according to the query requirement, executing the query operation, and obtaining the query result from the distributed storage system.
The embodiment of the application also discloses a distributed storage system of the geographic raster data, which comprises:
The data segmentation module is used for segmenting the geographic raster data into a plurality of slices, and each slice comprises a specific geographic area or grid;
The data storage module is used for storing each slice in a distributed storage system in a distributed mode;
The index construction module is used for creating an index structure according to the query requirement and the geographic raster data and inserting each slice into the index structure;
The distributed processing module is used for processing each slice in the distributed storage system by utilizing a preset distributed computing framework and returning a processing result to the distributed storage system;
The performance optimization module is used for determining a performance optimization strategy, wherein the performance optimization strategy comprises data storage optimization, parallel computing optimization and query index optimization;
and the data query module is used for loading the corresponding index structure according to the query requirement, executing the query operation and obtaining the query result from the distributed storage system.
The embodiment of the application also discloses an electronic device which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes one or more of the methods in the embodiment of the application when executing the program.
In an embodiment of the application, the geographic raster data is segmented into a plurality of slices for storage and processing in a distributed environment. Each slice is stored in a distributed storage system, an index structure is created according to the query requirements, and each slice is inserted into the index structure to improve query efficiency. Each slice in the distributed storage system is processed with a preset distributed computing framework and the processing result is returned to the distributed storage system. This makes full use of cluster computing resources, achieves high-performance parallel processing of geographic raster data, and scales horizontally to large data volumes. Performance optimization strategies, including but not limited to data storage optimization, parallel computing optimization and query index optimization, are determined to better address the storage and query requirements of large-scale geographic raster data and to improve the overall performance of the system. The corresponding index structure is loaded according to the query requirement, the query operation is executed, and the query result is obtained from the distributed storage system. The method adapts to different types of geographic raster data and query requirements, so that a more flexible and customizable geographic data processing flow can be built.
Additional aspects and advantages of embodiments of the application will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application.
Drawings
The foregoing and/or additional aspects and advantages of the application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a flow chart of a method for distributed storage of geographic raster data according to an embodiment of the present application;
FIG. 2 is an overall flow chart of distributed storage of geographic raster data provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of a distributed storage system for geographic raster data according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs unless defined otherwise. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The scheme provided by the embodiment of the application can be executed by any electronic device, such as a terminal device or a server. The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing services. The terminal may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the present application is not limited herein. For the technical problems in the prior art, the application provides a distributed storage method and a distributed storage system for geographic raster data, which aim to solve at least one of the technical problems in the prior art.
The following describes the technical scheme of the present application and how the technical scheme of the present application solves the above technical problems in detail with specific embodiments. The following embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
The embodiment of the application provides a possible implementation. Fig. 1 shows a flowchart of a method for distributed storage of geographic raster data. The method can be executed by any electronic device, optionally at a server or at a terminal device.
As shown in fig. 1, the method may include the steps of:
Step 101, the geographic raster data is sliced into a plurality of slices, each slice including a particular geographic region or grid.
The large-scale geographic raster data is sliced into multiple slices for parallel processing and distributed storage.
Step 102, each slice is stored in a distributed storage system in a distributed manner.
The distributed storage system may be the Hadoop Distributed File System (HDFS). In the embodiment of the present application, GeoTrellis, a high-performance geographic data processing engine, may be integrated with Apache Spark, a fast general-purpose computing engine designed for large-scale data processing, to implement parallel processing and distributed storage of the slices.
Step 103, creating an index structure according to the query requirement and the geographic raster data.
The index structure is designed according to the query requirements so that geospatial queries are efficient. For example, if a query needs to locate a specific grid cell directly, the raster coordinates of the geographic raster data may be used as part of the index: an index table is built from the raster coordinates, mapping each coordinate to the corresponding data slice. This approach gives fast queries and locates raster data precisely.
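The index table described above can be sketched as a plain dictionary. This is a minimal illustration, not the patent's implementation: the 256-cell slice edge, the slice names and the helper names `slice_key` and `locate` are all assumptions introduced here.

```python
from typing import Dict, Optional, Tuple

TILE_SIZE = 256  # cells per slice edge (assumed for illustration)

# Index table: slice key (slice column, slice row) -> slice identifier.
# The identifiers are hypothetical storage names.
index_table: Dict[Tuple[int, int], str] = {
    (0, 0): "slice_0_0",
    (1, 0): "slice_1_0",
    (0, 1): "slice_0_1",
    (1, 1): "slice_1_1",
}


def slice_key(col: int, row: int, tile_size: int = TILE_SIZE) -> Tuple[int, int]:
    """Map a raster cell (col, row) to the key of the slice that contains it."""
    return (col // tile_size, row // tile_size)


def locate(col: int, row: int) -> Optional[str]:
    """Return the identifier of the slice holding the cell, or None if absent."""
    return index_table.get(slice_key(col, row))
```

A lookup is a single integer division plus a dictionary access, so its cost does not grow with the number of slices.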
Step 104, processing each slice in the distributed storage system by using a preset distributed computing framework, and returning the processing result to the distributed storage system.
Parallel computing may be performed using Apache Spark or another distributed computing framework in embodiments of the application. GeoTrellis provides integration with Apache Spark, which facilitates processing of geographic raster data in a distributed environment. A distributed computing framework makes full use of cluster computing resources and achieves high-performance parallel processing of geographic raster data. Compared with a traditional serial computing mode, it scales horizontally and can process large-scale data.
Compared with a closed commercial solution, GeoTrellis and Apache Spark are open source technologies, so the embodiment of the application offers greater freedom, flexibility and extensibility. The method can be customized and extended according to specific requirements while drawing on the resources of the open source community.
Step 105, determining performance optimization strategies including, but not limited to, data storage optimization, parallel computing optimization, query index optimization.
Considering data storage optimization, parallel computing optimization, query index optimization and similar aspects improves data access and processing performance.
Step 106, loading a corresponding index structure according to the query requirement, executing the query operation, and obtaining a query result from the distributed storage system.
The query requirements include basic geospatial queries such as range queries and proximity queries. The corresponding query operation is selected according to the query requirement, and the index structure is loaded to improve query efficiency and to ensure that the query operation makes full use of the parallelism of the distributed computing framework.
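To make the two query types concrete, the sketch below evaluates a range query and a proximity query against per-slice extents. It is a hedged illustration only: the `(xmin, ymin, xmax, ymax)` extent layout, the function names and the linear scan over metadata are assumptions, and a real deployment would consult an index structure rather than scan every slice.

```python
import math
from typing import Dict, List, Tuple

Extent = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax), assumed layout


def range_query(extents: Dict[str, Extent], query: Extent) -> List[str]:
    """Return ids of slices whose extent intersects the query rectangle."""
    qxmin, qymin, qxmax, qymax = query
    return sorted(
        sid
        for sid, (xmin, ymin, xmax, ymax) in extents.items()
        if xmin <= qxmax and xmax >= qxmin and ymin <= qymax and ymax >= qymin
    )


def distance_to_extent(x: float, y: float, extent: Extent) -> float:
    """Euclidean distance from a point to a rectangle (0 if the point is inside)."""
    xmin, ymin, xmax, ymax = extent
    dx = max(xmin - x, 0.0, x - xmax)
    dy = max(ymin - y, 0.0, y - ymax)
    return math.hypot(dx, dy)


def proximity_query(extents: Dict[str, Extent], x: float, y: float,
                    radius: float) -> List[str]:
    """Return ids of slices lying within `radius` of the point (x, y)."""
    return sorted(
        sid for sid, ext in extents.items()
        if distance_to_extent(x, y, ext) <= radius
    )
```

Both functions return slice identifiers only; the matching slices would then be fetched from the distributed storage system.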
In an alternative embodiment, slicing the geography raster data into a plurality of slices includes:
Determining a slicing strategy for the geographic raster data. Slicing strategies include partitioning the data into regular grids and tiling. The choice of slicing strategy depends on the nature of the data and the application requirements.
Preprocessing the geographic raster data, wherein the preprocessing includes projection transformation, data cleaning and resampling. Preprocessing brings the geographic raster data into the form the slicing strategy requires.
Dividing the preprocessed geographic raster data into a plurality of slices according to the slicing strategy, and assigning a unique identifier or address to each slice. In the embodiment of the application, the slicing strategy is applied to the preprocessed geographic raster data to split the large-scale data into small blocks, each block being one slice that contains a specific geographic area or grid, so that the blocks can be processed in parallel. Each slice is then assigned a unique identifier or address by which it is referenced during distributed storage and processing.
Recording metadata for each slice, such as the geographic coordinate range and time stamp. Effective management of metadata facilitates subsequent queries and analysis.
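Of the preprocessing operations named above, resampling is the easiest to show in isolation. The following is a minimal nearest-neighbour sketch on a list-of-lists raster, offered as one possible resampling method under stated assumptions; the patent does not specify which method is used, and the function name is hypothetical.

```python
from typing import List

Grid = List[List[float]]


def resample_nearest(grid: Grid, new_rows: int, new_cols: int) -> Grid:
    """Resample a raster to new_rows x new_cols using nearest-neighbour lookup."""
    rows, cols = len(grid), len(grid[0])
    out: Grid = []
    for r in range(new_rows):
        # Map the centre of the output cell back to a source row index.
        src_r = min(rows - 1, int((r + 0.5) * rows / new_rows))
        out.append([
            grid[src_r][min(cols - 1, int((c + 0.5) * cols / new_cols))]
            for c in range(new_cols)
        ])
    return out
```

Nearest-neighbour lookup keeps original cell values unchanged, which suits categorical rasters; continuous data would more often use bilinear or cubic interpolation.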
As a first example, the raster layer processing framework of GeoTrellis is imported;
the slice size is set to 256×256, and the geographic range is defined as a given coordinate range (xmin, ymin, xmax, ymax); the layout of the slices is defined, with 512 slices in each direction;
preprocessing operations such as projection transformation, data cleaning and resampling are performed on the geographic raster data. The preprocessing depends on the requirements of the application and prepares the data for subsequent processing;
the original geographic raster data is sliced using the previously defined slicing strategy;
each slice is assigned a unique identifier, which may be composed of information such as the position of the slice and a time stamp to ensure uniqueness;
metadata for each slice, such as the geographic coordinate range and time stamp, is recorded to provide key information for managing and querying the slices in distributed computing.
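The layout, identifier and metadata steps of this example can be sketched without GeoTrellis. The code below is an assumption-laden illustration: the extent is taken to be `(xmin, ymin, xmax, ymax)`, the identifier format `timestamp_col_row` is one possible choice, and `SliceMeta` and `plan_slices` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Tuple

Extent = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax), assumed


@dataclass(frozen=True)
class SliceMeta:
    slice_id: str     # unique identifier built from time stamp and position
    col: int          # slice column in the layout
    row: int          # slice row in the layout (row 0 is the northernmost)
    extent: Extent    # geographic coordinate range covered by the slice
    timestamp: str


def plan_slices(extent: Extent, layout_cols: int, layout_rows: int,
                timestamp: str) -> List[SliceMeta]:
    """Divide an extent into a regular layout and record metadata per slice."""
    xmin, ymin, xmax, ymax = extent
    width = (xmax - xmin) / layout_cols
    height = (ymax - ymin) / layout_rows
    metas: List[SliceMeta] = []
    for row in range(layout_rows):
        for col in range(layout_cols):
            tile_extent = (
                xmin + col * width,
                ymax - (row + 1) * height,
                xmin + (col + 1) * width,
                ymax - row * height,
            )
            metas.append(SliceMeta(f"{timestamp}_{col}_{row}", col, row,
                                   tile_extent, timestamp))
    return metas
```

With the example's values, `plan_slices(extent, 512, 512, timestamp)` would yield 512 × 512 metadata records, each describing one 256×256 slice.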
In an alternative embodiment, each slice is stored in a distributed storage system in a distributed manner, comprising:
Determining a distributed storage system according to relevant factors, which include, but are not limited to, data size, access pattern and fault tolerance. In the embodiment of the application, a distributed storage system suited to the practical application, such as HDFS, Amazon S3 or Google Cloud Storage, can be selected.
Storing each slice in the distributed storage system. This may be implemented using the application programming interface (API) or the file system operations of the distributed storage system.
Applying a preset replication policy and redundancy policy in the distributed storage system. Replicating data and keeping redundant copies in a distributed environment increases the reliability and fault tolerance of the data. The distributed storage system itself may provide this; HDFS, for example, improves fault tolerance by replicating data blocks.
Setting rights and access controls for each slice in the distributed storage system. Rights and access controls ensure that only authorized users or systems can access or modify the data.
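The idea behind a replication policy can be shown with a small placement function. This is a simplified sketch, not how HDFS places blocks: the hash-based node choice, the node names and the function name are assumptions, and the default of three copies mirrors the usual HDFS default replication factor.

```python
import hashlib
from typing import List


def place_replicas(slice_id: str, nodes: List[str], replication: int = 3) -> List[str]:
    """Choose `replication` distinct storage nodes for a slice, deterministically."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds the number of nodes")
    # A stable hash keeps the placement identical across processes and runs.
    digest = hashlib.sha256(slice_id.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(nodes)
    # Consecutive nodes from the starting position are guaranteed distinct.
    return [nodes[(start + i) % len(nodes)] for i in range(replication)]
```

Because every copy lands on a different node, the slice stays readable after the loss of up to `replication - 1` nodes.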
As a second example, data is stored into HDFS using Python and GeoTrellis:
The TiledRasterLayer class is imported from the GeoPySpark library. TiledRasterLayer represents distributed slices of geographic raster data and provides methods for processing such data;
A SparkContext is initialized to start the Spark application;
The GeoTrellis and Hadoop related libraries are imported so that Python can call the relevant functions of the Hadoop file system and GeoTrellis. The imported classes can then be used to read and write each slice and to interact with the Hadoop distributed storage system;
An HDFS storage path is selected. HDFS is a common distributed storage system for large-scale data in GeoTrellis and other environments integrated with the Hadoop ecosystem. Once defined, the HDFS storage path is used in subsequent operations to read or write each slice.
Each existing slice is loaded into a TiledRasterLayer for subsequent processing and analysis in a distributed environment. In practice, the geographic raster data may come from sources such as remote sensing images and satellite data and needs to be loaded and processed according to specific requirements.
Each slice is stored to HDFS, and the SparkContext is closed on completion. This completes the full workflow of loading the data, processing it, storing it and releasing resources.
In an alternative embodiment, creating an index structure from the query requirements and the geographic raster data, and inserting each slice into the index structure, includes:
An index type is determined based on the query requirements and the geographic raster data. The index of geographic raster data may be based on raster coordinates, geographic coordinates, or other methods.
An index structure is created based on the index type; the index structure may be a raster index, a spatial index or a grid index. GeoTrellis itself provides some basic index structures.
Each slice is inserted into the index structure. Ensuring that each slice is uniquely identified by the index improves slice retrieval efficiency, so that the required data is located more quickly at query time.
In an alternative embodiment, determining the performance optimization strategy includes:
Determining a query index optimization strategy, wherein the query index optimization strategy is to determine an index structure according to the query requirement.
For geographic raster data, an index structure may be created based on the raster coordinates and geographic coordinates so that a query can use the index and avoid a full scan. With the index structure, the slices containing relevant data are located quickly from the geographic scope of the query. A spatial index (e.g., a geographic coordinate index or an R-tree index) narrows the search and increases query efficiency. The slice size may also be adjusted dynamically according to the nature of the query: for a broad range query, larger slices reduce the number of slices touched, while for a precise query, smaller slices improve query accuracy.
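When the slices follow a regular layout, the slices that intersect a query rectangle can be computed arithmetically instead of by scanning. The sketch below assumes extents given as `(xmin, ymin, xmax, ymax)` and row 0 at the northern edge; the function name is hypothetical and the code stands in for, rather than reproduces, any R-tree or GeoTrellis index.

```python
import math
from typing import List, Tuple

Extent = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax), assumed


def tiles_for_bbox(query: Extent, layer_extent: Extent,
                   layout_cols: int, layout_rows: int) -> List[Tuple[int, int]]:
    """Return (col, row) keys of the slices that a query rectangle touches."""
    qxmin, qymin, qxmax, qymax = query
    xmin, ymin, xmax, ymax = layer_extent
    tile_w = (xmax - xmin) / layout_cols
    tile_h = (ymax - ymin) / layout_rows
    # Clamp to the layout so queries that overhang the layer stay in range.
    col_min = max(0, math.floor((qxmin - xmin) / tile_w))
    col_max = min(layout_cols - 1, math.floor((qxmax - xmin) / tile_w))
    row_min = max(0, math.floor((ymax - qymax) / tile_h))
    row_max = min(layout_rows - 1, math.floor((ymax - qymin) / tile_h))
    return [(c, r)
            for r in range(row_min, row_max + 1)
            for c in range(col_min, col_max + 1)]
```

The work done is proportional to the number of matching slices, not to the total number of slices in the layer.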
As a third example, the index structure in GeoTrellis is used:
Each existing slice is represented as a TiledRasterLayer object with the specified type, layout, and extent. This object may be further processed and analyzed in a distributed computing environment.
A Z-curve index structure is created. A Z-curve index structure can improve the retrieval efficiency of geographic raster data, especially for range queries or proximity queries. The Z-curve index is widely used in GeoTrellis to accelerate query operations on spatial data. After the index is created, the geographic raster data may be sliced according to the rules of the Z-curve index, and the index is used to speed up data retrieval at query time.
Each slice is indexed according to the specified index structure to improve query efficiency in a distributed environment.
The index structure, particularly the Z-curve index, is used to perform range queries efficiently. Using the index reduces the amount of data to be retrieved and improves query performance.
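A Z-curve (Morton order) key can be sketched in a few lines of plain Python. The sketch below is illustrative only and does not reproduce the GeoTrellis implementation; the function names `z_index`, `build_index`, and `range_query` are hypothetical. It assumes non-negative integer tile coordinates.

```python
import bisect


def z_index(col: int, row: int, bits: int = 16) -> int:
    """Interleave the bits of col and row into a single Z-curve key."""
    code = 0
    for i in range(bits):
        code |= ((col >> i) & 1) << (2 * i)        # col bits go to even positions
        code |= ((row >> i) & 1) << (2 * i + 1)    # row bits go to odd positions
    return code


def build_index(cells):
    """Return (sorted keys, cells in the same order) for a set of (col, row) tiles."""
    ordered = sorted(cells, key=lambda cr: z_index(*cr))
    return [z_index(*cr) for cr in ordered], ordered


def range_query(keys, ordered, col_min, row_min, col_max, row_max):
    """Return the tiles inside the rectangle, scanning only one key interval."""
    # All keys of the rectangle lie between the keys of its two corners.
    lo = z_index(col_min, row_min)
    hi = z_index(col_max, row_max)
    start = bisect.bisect_left(keys, lo)
    end = bisect.bisect_right(keys, hi)
    # The interval may contain tiles outside the rectangle, so filter them out.
    return [
        (c, r)
        for c, r in ordered[start:end]
        if col_min <= c <= col_max and row_min <= r <= row_max
    ]


keys, ordered = build_index([(c, r) for c in range(4) for r in range(4)])
hits = range_query(keys, ordered, 1, 1, 2, 2)
```

The key interval between the two corners is a superset of the rectangle, which is why the final coordinate filter is needed. Production indexes usually split the rectangle into several tighter key ranges to reduce the number of false candidates.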
In an alternative embodiment, processing each slice in the distributed storage system using a preset distributed computing framework and returning the processing results to the distributed storage system includes:
A distributed computing framework is configured. After configuration is complete, it is ensured that the distributed computing framework can access the slices in the distributed storage system. Apache Spark may be used as the distributed computing framework in embodiments of the present application.
Each slice is loaded from the distributed storage system. In Apache Spark, an interface provided by GeoTrellis or a data read API of Apache Spark may be used.
The slices are processed using the parallel computing capability of the distributed computing framework to obtain a processing result. The processing includes, but is not limited to, geographic operations, statistical analysis, and image processing. GeoTrellis provides functionality integrated with Apache Spark, which makes it easier to perform geographic calculations on Apache Spark clusters.
The processing result is stored in the distributed storage system. The processing result returned to the distributed storage system may be a new geographic raster data set or another data format, depending on the purpose of the calculation.
In an alternative embodiment, determining the performance optimization strategy further comprises:
A resolution pyramid is created to support queries of different resolutions. Each level of the resolution pyramid can be created by downsampling the raw data, for example by taking the average, maximum, or minimum value. Each level represents a scaled-down version of the original data. The bottom level of the resolution pyramid is typically the original-resolution data, while each upper level is a lower-resolution version of the level beneath it. This structure allows the system to select the appropriate resolution level as needed at query time. When a query is executed, the system can select the most suitable pyramid level according to the region and the resolution requirement of the query, which can significantly improve query efficiency, particularly when a large region needs to be covered.
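The pyramid construction can be sketched as follows, assuming a raster held as a list of rows and a fixed 2x2 downsampling window. The function names and the level-selection rule are assumptions made for illustration; the application does not specify them.

```python
from statistics import mean


def downsample(grid, reducer=mean):
    """Halve the resolution by reducing each 2x2 block to one value.

    A trailing odd row or column is dropped in this simplified sketch.
    """
    rows, cols = len(grid), len(grid[0])
    return [
        [
            reducer([grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1]])
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


def build_pyramid(grid, reducer=mean):
    """Level 0 is the original data; each further level is half the resolution."""
    levels = [grid]
    while len(levels[-1]) >= 2 and len(levels[-1][0]) >= 2:
        levels.append(downsample(levels[-1], reducer))
    return levels


def select_level(num_levels, base_cell_size, requested_cell_size):
    """Pick the coarsest level whose cell size does not exceed the request."""
    level = 0
    while (level + 1 < num_levels
           and base_cell_size * 2 ** (level + 1) <= requested_cell_size):
        level += 1
    return level


raster = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
pyramid = build_pyramid(raster)
```

Passing `max` or `min` as `reducer` gives the maximum and minimum downsampling variants mentioned above.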
As a fourth example, distributed computation is performed using GeoTrellis and Apache Spark:
SparkContext is initialized. Before an Apache Spark application runs, a SparkContext object needs to be created. In practical applications, this step only needs to be performed once.
Each existing slice is represented as a TiledRasterLayer object with the specified type, layout, and extent. This object may be further processed and analyzed in a distributed computing environment.
Geographic operations are performed slice by slice in the distributed computing framework, and the entire process may be performed in parallel on an Apache Spark cluster.
The slices produced by the distributed computation (the processing result) are stored back into the distributed storage system, which makes it easier to save the processing results, back up the data, and transfer the data to other applications. In the embodiment of the application the processing results are stored in HDFS; the specific distributed storage system can be chosen according to requirements.
A resolution pyramid is created to support queries of different resolutions.
SparkContext is closed, ending the current Apache Spark application.
In an alternative embodiment, determining the performance optimization strategy includes:
Determining a data storage optimization strategy, wherein the data storage optimization strategy is to store the slices into the distributed storage system using a preset compression algorithm.
The compression algorithm reduces the storage space required. In addition, the partitioning and layout of the data can be optimized in the embodiment of the application so that the data is distributed more evenly in the distributed storage system, which facilitates parallel access. The compression algorithm in the embodiment of the present application includes, but is not limited to, Run-Length Encoding (RLE), a basic compression algorithm that reduces the amount of data by recording the number of times the same value or symbol appears in succession. In remote sensing image processing, RLE can effectively reduce storage space.
In connection with the second example, each slice is written to a designated HDFS storage path using the Snappy compression algorithm, providing a more efficient form of storage for subsequent queries and analysis.
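The run-length encoding mentioned in this embodiment can be illustrated with a minimal sketch. The function names `rle_encode` and `rle_decode` are hypothetical, and the sketch operates on a flat list of cell values rather than on any particular raster file format.

```python
def rle_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([v, 1])       # start a new run
    return [(v, n) for v, n in runs]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    out = []
    for v, n in runs:
        out.extend([v] * n)
    return out


row = [0, 0, 0, 5, 5, 0]
encoded = rle_encode(row)
```

RLE pays off when a raster contains long runs of identical cells, such as no-data regions or classified land-cover maps; on noisy continuous data the encoded form can be larger than the input.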
In an alternative embodiment, determining the performance optimization strategy further comprises:
Determining a parallel computing optimization strategy, wherein the parallel computing optimization strategy is to set the parallelism of tasks in the distributed computing framework according to the number of nodes in the cluster and the processing capacity of each node.
By exploiting the parallelism of the distributed computing framework and setting the parallelism of the tasks, computing tasks can be distributed and executed effectively across the cluster, and unnecessary data movement is avoided.
In combination with the fourth example, the TiledRasterLayer object is repartitioned to optimize parallel computing. Setting an appropriate number of partitions improves the efficiency of parallel computation. Each partition may be processed in parallel on a different node of the cluster, thereby accelerating operations on the geographic raster data.
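One way to derive a partition count from the cluster description is sketched below. The formula (total cores multiplied by a small number of tasks per core) is a common tuning rule of thumb and is an assumption here, not a rule stated in the application; `suggest_partitions` and `assign_slices` are hypothetical helper names.

```python
def suggest_partitions(cores_per_node, tasks_per_core=2):
    """Suggest a partition count from the cores available on each node.

    cores_per_node: one entry per cluster node, e.g. [8, 8, 4].
    tasks_per_core: assumed oversubscription factor (illustrative default).
    """
    if tasks_per_core < 1:
        raise ValueError("tasks_per_core must be at least 1")
    total_cores = sum(cores_per_node)
    if total_cores <= 0:
        raise ValueError("the cluster must have at least one core")
    return total_cores * tasks_per_core


def assign_slices(slice_ids, num_partitions):
    """Distribute slices round-robin so partition sizes differ by at most one."""
    partitions = [[] for _ in range(num_partitions)]
    for i, slice_id in enumerate(slice_ids):
        partitions[i % num_partitions].append(slice_id)
    return partitions


num_partitions = suggest_partitions([8, 8, 4])
layout = assign_slices(list(range(10)), 4)
```

In a Spark application the resulting number would typically be passed to a repartitioning call on the layer or RDD. The round-robin assignment here only illustrates why an even spread keeps every node busy.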
In an alternative embodiment, determining the performance optimization strategy further comprises:
A cache mechanism is used to keep frequently used data in memory, reducing repeated access to the distributed storage system and speeding up queries and computation.
In combination with the fourth example, the cache mechanism of Apache Spark is enabled, and the data of the TiledRasterLayer object is cached in memory to improve performance when the geographic raster data is accessed repeatedly. By default, the data is cached in memory. The caching policy may also be adjusted by passing other parameters to the persist method, such as caching to disk or a different storage level.
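The effect of caching can be shown with a small in-process stand-in. This is not Spark's persist mechanism; it is a least-recently-used cache written for illustration, and `SliceCache` and its `loader` callback are hypothetical names.

```python
from collections import OrderedDict


class SliceCache:
    """Keeps the most recently used slices in memory (illustrative LRU cache)."""

    def __init__(self, loader, capacity):
        self._loader = loader          # called on a miss to fetch from storage
        self._capacity = capacity
        self._entries = OrderedDict()
        self.misses = 0

    def get(self, slice_id):
        if slice_id in self._entries:
            # Hit: mark the slice as most recently used.
            self._entries.move_to_end(slice_id)
            return self._entries[slice_id]
        # Miss: read from the (slow) storage system and remember the result.
        self.misses += 1
        value = self._loader(slice_id)
        self._entries[slice_id] = value
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)   # evict the least recently used
        return value


storage_reads = []


def load_from_storage(slice_id):
    """Stand-in for a read from the distributed storage system."""
    storage_reads.append(slice_id)
    return f"data-for-{slice_id}"


cache = SliceCache(load_from_storage, capacity=2)
for requested in ["a", "b", "a", "c", "b"]:
    cache.get(requested)
```

With a capacity of two, the second request for "a" is served from memory, while "b" has to be read again because it was evicted when "c" arrived.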
In an alternative embodiment, determining the performance optimization strategy further comprises:
Determining a compression transmission optimization strategy, wherein the compression transmission optimization strategy is to use compression during data transmission to reduce the overhead of data transmission on the network.
In combination with the fourth example, a local addition operation is performed on the geographic raster data to be transmitted, the result is stored in HDFS, and the Snappy compression algorithm is used for transmission to improve the efficiency of network transmission.
In an alternative embodiment, loading the corresponding index structure according to the query requirement, and executing the query operation to obtain the query result from the distributed storage system, includes:
The corresponding index structure is loaded according to the query requirement. Query requirements may include, but are not limited to, range queries, proximity queries, and statistical analysis, and different query requirements may call for different performance optimization strategies. In the embodiment of the application, the API provided by GeoTrellis can be used to load the appropriate index structure to support fast queries.
The query is executed, and data is processed in parallel in the distributed environment using a distributed computing framework (e.g., Apache Spark), so that the query can take full advantage of parallelism to improve performance.
The query result is obtained and returned. The query result may be geographic raster data, statistical information, or another form of data, depending on the type of query.
As a fifth example, a data query is performed with GeoTrellis and Apache Spark:
SparkContext is initialized to launch an Apache Spark application.
Geographic raster data is loaded from a resilient distributed dataset (RDD) of NumPy arrays (numPyRDD), and a GeoTrellis TiledRasterLayer object is created. Loading the geographic raster data into a TiledRasterLayer object in Spark provides the basis for subsequent distributed computation and geospatial analysis.
A query operation, such as a range query, is selected. Specifically, a query_extension object is created and used as a parameter of the query operation to acquire geographic raster data within a specified range. The range query obtains data for a particular region for further processing and analysis.
The query is executed to obtain a new TiledRasterLayer object containing the geographic raster data that falls within the specified query range, for further analysis or visualization.
The query result is obtained and converted into a NumPy array, which can be used directly in Python.
SparkContext is closed, ending the Spark application.
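The range-query step of this example reduces to an extent-intersection test, which can be sketched without Spark. The `Extent` type and `range_query` function below are hypothetical stand-ins written for illustration and are not the GeoTrellis API. Extents that only touch along an edge are treated as non-intersecting.

```python
from typing import NamedTuple


class Extent(NamedTuple):
    """Axis-aligned bounding box in map coordinates."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def intersects(self, other: "Extent") -> bool:
        # Two boxes overlap when they overlap on both axes.
        return (self.xmin < other.xmax and other.xmin < self.xmax
                and self.ymin < other.ymax and other.ymin < self.ymax)


def range_query(tiles, query_extent):
    """Return the ids of all tiles whose extent overlaps the query extent."""
    return sorted(
        tile_id for tile_id, extent in tiles.items()
        if extent.intersects(query_extent)
    )


tiles = {
    "t0": Extent(0, 0, 10, 10),
    "t1": Extent(10, 0, 20, 10),
    "t2": Extent(0, 10, 10, 20),
}
result = range_query(tiles, Extent(5, 2, 12, 8))
```

In the distributed setting the same predicate would be evaluated against index keys or tile metadata, so that only the matching tiles are read from storage.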
As a sixth example, the embodiment of the present application provides an overall flow for performing distributed storage of geographic raster data by using the method in the embodiment of the present application, specifically as follows:
As shown in Fig. 2, slicing the geographic raster data (raster data) specifically includes determining the entity data, extracting metadata from the original raster data, calculating a multi-layer Layout from the metadata, reading the entity data in blocks according to the Layout, and calculating the task allocation. Distributed computation is then carried out: tile data and indexes are produced by parallel computation and stored in a column database to realize data storage. In this process, the extracted metadata and the calculated multi-layer Layout are stored in a relational database for subsequent retrieval.
In addition, the cloud-optimized raw raster data is stored to an object storage server.
When a client issues a query, the system receives the client's query instruction and performs map algebra calculation on the data in the storage server and the tile data in the column database. At the same time, the corresponding data is looked up in the relational database. Layer rendering rules and data tags can be defined for the data: the data tags can be used for later data retrieval, and the layer rendering rules can be used for subsequent rendering of the tile data. In this way, multiple sets of rendering rules can be defined for the same tile data. The rendered data is then returned to the client.
The feasibility of the distributed storage method for geographic raster data in the embodiment of the application was verified by constructing a geographic information database system for the Central Asia region.
Based on the same principle as the method provided by the embodiment of the present application, the embodiment of the present application further provides a distributed storage system for geographic raster data, as shown in fig. 3, where the system includes:
A data slicing module 301, configured to slice the geographic raster data into a plurality of slices, each slice including a specific geographic area or grid;
A data storage module 302 for storing each slice in a distributed storage system in a distributed manner;
An index building module 303, configured to create an index structure according to the query requirement and the geographic raster data, and insert each slice into the index structure;
A distributed processing module 304, configured to process each slice in the distributed storage system by using a preset distributed computing framework, and return a processing result to the distributed storage system;
A performance optimization module 305, configured to determine a performance optimization policy, where the performance optimization policy includes data storage optimization, parallel computing optimization, and query index optimization;
The data query module 306 is configured to load a corresponding index structure according to a query requirement, perform a query operation, and obtain a query result from the distributed storage system.
In the embodiments of the application, the geographic raster data is segmented into a plurality of slices so that it can be stored and processed in a distributed environment. Each slice is stored in the distributed storage system in a distributed manner, an index structure is created according to the query requirements, and each slice is inserted into the index structure to improve query efficiency. Each slice in the distributed storage system is processed using a preset distributed computing framework, and the processing result is returned to the distributed storage system, so that cluster computing resources can be fully utilized, high-performance parallel processing of the geographic raster data is achieved, and the method has good horizontal scalability and can process large-scale data. Determining performance optimization strategies, including but not limited to data storage optimization, parallel computing optimization, and query index optimization, better addresses the storage and query requirements of large-scale geographic raster data and helps improve the overall performance of the system. Loading the corresponding index structure according to the query requirements, executing the query operation, and obtaining the query result from the distributed storage system allows the method to adapt flexibly to different types of geographic raster data and query requirements, and to build a more flexible and customizable geographic data processing flow.
The distributed storage system for geographic raster data provided by the embodiment of the present application can implement each process implemented in the method embodiments of fig. 1 to 2, and in order to avoid repetition, a detailed description is omitted here.
The distributed storage system for geographic raster data according to the embodiments of the present application may implement the distributed storage method for geographic raster data according to the embodiments of the present application, and the implementation principle is similar, and actions performed by each module and unit in the distributed storage system for geographic raster data according to each embodiment of the present application correspond to steps in the distributed storage method for geographic raster data according to each embodiment of the present application, and detailed functional descriptions of each module in the distributed storage system for geographic raster data may be referred to the descriptions in the corresponding distributed storage method for geographic raster data shown in the foregoing, which are not repeated herein.
Based on the same principles as the methods shown in the embodiments of the present application, the embodiments of the present application also provide an electronic device that may include, but is not limited to, a processor and a memory, the memory being used for storing a computer program, and the processor being used for executing, by invoking the computer program, the distributed storage method for geographic raster data shown in any of the alternative embodiments of the present application. Compared with the prior art, the distributed storage method for geographic raster data provided by the application divides the geographic raster data into a plurality of slices so that it can be stored and processed in a distributed environment. Each slice is stored in the distributed storage system in a distributed manner, an index structure is created according to the query requirements, and each slice is inserted into the index structure to improve query efficiency. Each slice in the distributed storage system is processed using a preset distributed computing framework, and the processing result is returned to the distributed storage system, so that cluster computing resources can be fully utilized, high-performance parallel processing of the geographic raster data is achieved, and the method has good horizontal scalability and can process large-scale data. Determining performance optimization strategies, including but not limited to data storage optimization, parallel computing optimization, and query index optimization, better addresses the storage and query requirements of large-scale geographic raster data and helps improve the overall performance of the system.
Loading the corresponding index structure according to the query requirements, executing the query operation, and obtaining the query result from the distributed storage system allows the method to adapt flexibly to different types of geographic raster data and query requirements, and to build a more flexible and customizable geographic data processing flow.
In an alternative embodiment, there is also provided an electronic device, as shown in Fig. 4. The electronic device 400 shown in Fig. 4 may be a server and includes a processor 401 and a memory 403. The processor 401 is connected to the memory 403, for example via a bus 402. Optionally, the electronic device 400 may also include a transceiver 404. It should be noted that, in practical applications, the transceiver 404 is not limited to one, and the structure of the electronic device 400 does not constitute a limitation on the embodiments of the present application.
The processor 401 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or perform the various exemplary logic blocks, modules, and circuits described in connection with this disclosure. The processor 401 may also be a combination that implements computing functionality, such as a combination comprising one or more microprocessors, or a combination of a DSP and a microprocessor.
The bus 402 may include a path to transfer information between the components. The bus 402 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 402 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in Fig. 4, but this does not mean that there is only one bus or one type of bus.
The memory 403 may be, but is not limited to, a ROM (Read-Only Memory) or other type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read-Only Memory), a CD-ROM (Compact Disc Read-Only Memory) or other optical disc storage (including compact disc, laser disc, optical disc, digital versatile disc, Blu-ray disc, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
The memory 403 is used for storing the application program code that implements the solution of the present application, and execution is controlled by the processor 401. The processor 401 is configured to execute the application program code stored in the memory 403 to implement what is shown in the foregoing method embodiments.
Among them, electronic devices include, but are not limited to: mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and stationary terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 4 is only an example and should not be construed as limiting the functionality and scope of use of the embodiments of the application.
The server provided by the application may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. The terminal may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc. The terminal and the server may be connected directly or indirectly through wired or wireless communication, which is not limited in the present application.
The above description is only illustrative of the preferred embodiments of the present application and of the principles of the technology employed. It will be appreciated by persons skilled in the art that the scope of the disclosure referred to in the present application is not limited to the specific combinations of technical features described above, but also covers other technical solutions formed by any combination of the technical features described above or their equivalents without departing from the spirit of the disclosure, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the present application.
Claims (10)
1. A method of distributed storage of geographic raster data, the method comprising:
slicing the geographic raster data into a plurality of slices, each slice including a particular geographic region or grid;
storing each slice in a distributed storage system in a distributed manner;
creating an index structure according to the query requirement and the geographic raster data, and inserting each slice into the index structure;
processing each slice in the distributed storage system by using a preset distributed computing framework, and returning a processing result to the distributed storage system;
determining a performance optimization strategy, wherein the performance optimization strategy comprises data storage optimization, parallel computing optimization and query index optimization;
loading the corresponding index structure according to the query requirement, executing a query operation, and obtaining a query result from the distributed storage system.
2. The method of distributed storage of geographic raster data of claim 1, wherein the slicing the geographic raster data into a plurality of slices comprises:
determining a slicing strategy of the geographic raster data; the slicing strategy includes dividing data into a regular grid and tiling;
preprocessing the geographic raster data, wherein the preprocessing comprises projection transformation, data cleaning and resampling;
dividing the preprocessed geographic raster data into a plurality of slices according to the slicing strategy, and assigning a unique identifier or address to each slice;
recording metadata for each of the slices.
3. The method of distributed storage of geographic raster data of claim 1, wherein said storing each slice in a distributed storage system in a distributed manner comprises:
determining a distributed storage system according to relevant factors; the relevant factors comprise data size, access mode and fault tolerance;
storing each slice into the distributed storage system;
using a preset replication policy and redundancy policy in the distributed storage system;
setting permissions and access controls for each slice in the distributed storage system.
4. The method of distributed storage of geographic raster data of claim 1, wherein the creating an index structure according to the query requirement and the geographic raster data and inserting each of the slices into the index structure comprises:
determining an index type based on the query requirement and the geographic raster data;
creating the index structure based on the index type, the index structure comprising a grid index and a spatial index;
inserting each of the slices into the index structure.
5. The method of distributed storage of geographic raster data of claim 1, wherein said determining a performance optimization strategy includes:
determining a query index optimization strategy, wherein the query index optimization strategy is to determine an index structure according to the query requirement.
6. The method of distributed storage of geographic raster data of claim 1, wherein said determining a performance optimization strategy includes:
determining a data storage optimization strategy, wherein the data storage optimization strategy is to store the slices to the distributed storage system by adopting a preset compression algorithm.
7. The method of distributed storage of geographic raster data of claim 1, wherein said determining a performance optimization strategy further comprises:
determining a parallel computing optimization strategy, wherein the parallel computing optimization strategy is to set the parallelism of tasks in the distributed computing framework according to the number of nodes in the cluster and the processing capacity of each node.
8. The method of distributed storage of geographic raster data of claim 1, wherein loading the corresponding index structure according to the query requirement, executing a query operation, and obtaining a query result from the distributed storage system includes:
loading the corresponding index structure according to the query requirement;
executing the query, and processing data in parallel by using the distributed computing framework;
obtaining a query result and returning the query result.
9. A distributed storage system for geographic raster data, the system comprising:
a data slicing module for slicing the geographic raster data into a plurality of slices, each slice comprising a specific geographic area or grid;
a data storage module for storing each slice in a distributed storage system in a distributed manner;
an index construction module for creating an index structure according to the query requirement and the geographic raster data, and inserting each slice into the index structure;
a distributed processing module for processing each slice in the distributed storage system by using a preset distributed computing framework and returning a processing result to the distributed storage system;
a performance optimization module for determining a performance optimization strategy, wherein the performance optimization strategy comprises data storage optimization, parallel computing optimization and query index optimization;
a data query module for loading the corresponding index structure according to the query requirement, executing a query operation, and obtaining a query result from the distributed storage system.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any one of claims 1 to 8 when the program is executed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311729590.7A CN117909408A (en) | 2023-12-15 | 2023-12-15 | Distributed storage method and system for geographic raster data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117909408A true CN117909408A (en) | 2024-04-19 |
Family
ID=90684423
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311729590.7A Pending CN117909408A (en) | 2023-12-15 | 2023-12-15 | Distributed storage method and system for geographic raster data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117909408A (en) |
- 2023-12-15: CN, application CN202311729590.7A (patent/CN117909408A/en), active, Pending
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||