CN114338718A - Distributed storage method, device and medium for massive remote sensing data


Info

Publication number
CN114338718A
CN114338718A (application CN202111573294.3A)
Authority
CN
China
Prior art keywords
remote sensing
data
sensing data
space
distributed storage
Prior art date
Legal status
Granted
Application number
CN202111573294.3A
Other languages
Chinese (zh)
Other versions
CN114338718B (en)
Inventor
杜震洪
孙逍遥
郝亚楠
陈奕君
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CN202111573294.3A
Publication of CN114338718A
Application granted
Publication of CN114338718B
Legal status: Active
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a distributed storage method, device and medium for massive remote sensing data, belonging to the field of distributed storage. The invention optimizes the data partitioning rule for the metadata of massive remote sensing data and uses a data block containing multiple spatial elements as the sharding unit, so that the computational load imposed by the remote sensing data on different shard nodes stays relatively balanced, accommodating both the distributed storage of massive remote sensing data and the heavy computing demands of subsequent remote sensing processing. The invention improves the stability and load balance of the distributed engine and is of great significance for the application of global comprehensive observation results.

Description

Distributed storage method, device and medium for massive remote sensing data
Technical Field
The invention belongs to the technical field of distributed storage, and particularly relates to a distributed storage method, a distributed storage device and a distributed storage medium for massive remote sensing data.
Background
Remote sensing data are generally stored in raster or vector formats. The raster data model is simple but the data volume is large, so most existing research on raster data focuses on its storage and management. A great deal of work on the storage of massive remote sensing data has been carried out by experts, researchers and commercial companies at home and abroad. Existing remote sensing data storage approaches fall into five main types: storage based on a centralized file system, database-based storage, network storage, file storage based on a distributed file system, and storage based on a distributed database.
Storage based on a centralized file system means that remote sensing data are stored centrally on a single server, with the file bodies stored and managed by the file system. The capacity and performance of centralized file storage depend almost entirely on that single server and are strongly constrained by its hardware, so it is difficult to provide high read speeds or to operate on the data at scale.
Database-based storage, which combines a relational database with geoscience data middleware, can support the storage of large remote sensing data. However, storing remote sensing data in a database has problems: relational databases offer limited support for single large files, impose many restrictions on unstructured data, and suffer a sharp performance drop as the data volume grows.
Network storage addresses the capacity limit of a single storage device by adding nodes to expand the capacity of the whole storage cluster. Individual nodes are linked horizontally through the network, so that many nodes of relatively small capacity can be integrated into a large, centrally managed data storage server. However, because the conditions of individual nodes differ, it is difficult to manage and maintain all of them centrally.
A distributed file system associates a large number of physically dispersed computing or storage nodes into a logically unified whole through a network architecture, forming a storage system that is physically distributed, logically unified, and whose data flow over the network. However, data retrieval is not tailored to the characteristics of remote sensing data, and data access latency grows as the stored volume increases, reducing the overall operation rate.
A distributed database architecture combines relational database technology with spatial data middleware into a spatial data storage form that provides a unified external access interface. However, because all service requests must be processed and forwarded by the network server, the proxy server is easily congested by too many network requests and becomes the performance bottleneck of the whole system.
In addition, traditional distributed storage technology considers only storage and access requirements, so its sharding mechanism is mainly based on load balancing by file size. Remote sensing data, however, often involve a large amount of computation, and the computational load of remote sensing data is not related to file size; as a result, computing resources become unbalanced during computation over distributed remote sensing data, creating an additional performance bottleneck.
The invention therefore provides a distributed, efficient storage method for massive remote sensing data, built around both the organization of massive remote sensing data and its computational requirements.
Disclosure of Invention
The invention aims to solve the problems in the prior art and provides a distributed storage method, a distributed storage device and a distributed storage medium for massive remote sensing data.
The invention adopts the following specific technical scheme:
in a first aspect, the present invention provides a distributed storage method for massive remote sensing data, which includes:
s1, after receiving a remote sensing data write request submitted by a client, a routing node in the distributed database cluster reads all spatial elements involved in the write request to form a spatial element set;
s2, counting the total ground feature area of all spatial elements in the spatial element set, and, according to a preset total number of data blocks, calculating the ratio of the total ground feature area to the total number of data blocks as the allocated area threshold of each data block;
s3, converting the geometric center point coordinates of all spatial elements in the spatial element set into spatial encoding values through a space-filling curve, and sorting all spatial elements by their spatial encoding values to form a spatial element sequence;
s4, assigning unique identifiers to all empty data blocks and arranging them in order, taking the spatial elements out of the spatial element sequence one by one in order and filling them into the first data block that is not yet full; a data block is deemed full once the total ground feature area of the spatial elements in it exceeds the allocated area threshold, and filling then continues with the next empty data block until all spatial elements in the spatial element sequence have been taken out and filled into data blocks;
and s5, the routing node converts the unique identifier of each data block into a hash value through a hash algorithm, establishes a correspondence between each data block and a shard node according to the hash value, executes the write operation according to the correspondence, and stores each data block on the physical machine hard disk corresponding to its shard node.
As a preference of the first aspect, the space-filling curve is a Hilbert curve.
Preferably, in the first aspect, the spatial element types in the spatial element set include image data and vector data, the image data including whole images and image tiles, and the vector data including point elements, line elements, and plane elements.
Preferably, in the first aspect, a geometric center point of the point element is the point element itself.
Preferably, in the first aspect, the preset total number of data blocks is a positive integer multiple of the total number of shard nodes in the distributed database cluster.
Preferably, in the first aspect, the hash algorithm is a consistent hash algorithm.
Preferably, in the first aspect, each shard node in the distributed database cluster stores a mapping table between shard nodes and physical machine hard disks, each physical machine hard disk carries a plurality of shard nodes, and the mapping table is adjusted in real time as the number of hard disks changes.
As a preference of the first aspect, each database is stored as a primary copy and a standby copy based on a multi-copy consistency protocol, so as to implement read-write separation.
In a second aspect, the present invention provides a distributed storage apparatus for massive remote sensing data, comprising a memory and a processor;
the memory for storing a computer program;
the processor is configured to, when executing the computer program, implement the distributed storage method for massive remote sensing data according to any one of the above first aspects.
In a third aspect, the present invention provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the distributed storage method for massive remote sensing data as described in any aspect of the first aspect can be implemented.
Compared with the prior art, the invention has the following beneficial effects:
according to the invention, the data partitioning rule of the metadata of the massive remote sensing data is optimized, and the data block comprising a plurality of space elements is used as a unit of the fragment, so that the relative balance of the calculated amount required by the remote sensing data in different fragment nodes is realized. The distributed storage of massive remote sensing data and the mass calculation requirement of subsequent remote sensing data are considered, the uniform storage of heterogeneous remote sensing data is met, and the problems of hot spot data and unbalanced load caused by the fact that multiple services access the same set of system concurrently under mass users are solved. The invention can improve the stability and load balance of the distributed engine and has great significance for the application of global comprehensive observation results.
Drawings
FIG. 1 is a flow chart of steps of a distributed storage method for massive remote sensing data;
FIG. 2 is a schematic diagram of a hash algorithm executing sharded storage;
FIG. 3 is a schematic diagram of sub-database storage.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. The technical characteristics in the embodiments of the present invention can be combined correspondingly without mutual conflict.
In a preferred embodiment of the present invention, a distributed storage method for massive remote sensing data is provided. It is used for the distributed storage of massive remote sensing data and, in particular, can be used for the distributed storage of global comprehensive observation results. Global comprehensive observation results have typical multi-source heterogeneous characteristics and include remote sensing data in various forms such as images and vectors. For example, on the image side the data sources include satellites such as GF1, GF2, HJ and ZY; on the storage-structure side, different acquisition devices adopt different metadata schemas, but their structural descriptions usually contain some spatial location information. Therefore, the basic design idea of the distributed storage model for global comprehensive observation results is, on the basis of a distributed database storage architecture, to take into account the multi-source heterogeneity, spatial relationships and temporal ordering of the observation data and to design a distributed storage method that supports multiple structures, scales well and is easy to implement, providing basic distributed storage, efficient parallel query and computing services for the remote sensing data of globally massive comprehensive observation results.
It should be noted that "massive" in the present invention merely indicates that the volume of remote sensing data is large, potentially reaching the hundred-PB level, and does not limit the specific data size.
The distributed database cluster comprises routing nodes and shard nodes, and the routing nodes are responsible for distributing data to the shard nodes according to a routing mechanism. Unlike traditional storage, this routing mechanism needs no centralized metadata management, so the metadata service is no longer a performance bottleneck of the system, and this approach effectively solves the problem of storage scalability.
As shown in fig. 1, the following describes the steps of the distributed storage method for massive remote sensing data in detail:
S1, after receiving a remote sensing data write request submitted by a client, the routing node in the distributed database cluster reads all spatial elements involved in the write request to form a spatial element set.
In this embodiment, each remote sensing data write request specifies the remote sensing data to be stored, and those data may be composed of spatial elements of different types. The spatial element types include image data and vector data: the image data may be whole images in raster form or image tiles, depending on what type of image data the request specifies to be written; the vector data include three types, namely point elements, line elements, and plane elements. All spatial elements specified in the write request form the spatial element set, which the routing node subsequently stores in shards according to steps S2 to S5.
S2, counting the total ground feature area T of all spatial elements in the spatial element set, and, given the preset total number M of data blocks, calculating the ratio V = T/M as the allocated area threshold of each data block.
It should be noted that, viewed from the ground, each spatial element in the remote sensing data corresponds to some type of ground feature, and the ground feature area of each spatial element is stored in a field carried by the original data. Therefore, in this embodiment the ground feature area can be read directly from the attribute field of each spatial element, and the ground feature areas of all spatial elements in the spatial element set are summed to obtain the total ground feature area T. Note in particular that the ground feature area of a spatial element may be an actual area value or a scaled relative value; when relative values are used, it is sufficient to ensure that all spatial elements use the same scaling ratio.
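For illustration only, the area-threshold computation of S2 can be sketched in Python as follows (a minimal sketch; the attribute name feature_area is a hypothetical stand-in for whichever field of the source metadata carries the ground feature area):

    def allocation_area_threshold(spatial_elements, num_blocks):
        # T: sum of the ground feature areas carried in the elements' attribute fields
        total_area = sum(e['feature_area'] for e in spatial_elements)
        # V = T / M: the allocated area threshold of each data block
        return total_area / num_blocks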
S3, determining the geometric center point coordinates of each spatial element in the spatial element set, converting the geometric center point coordinates of all spatial elements into spatial encoding values through a space-filling curve, and sorting all spatial elements by their spatial encoding values to form a spatial element sequence O.
In this embodiment, since the spatial elements comprise two broad kinds, remote sensing images and vector data, the geometric center point must be determined per type. For image data carrying bounding-extent (four-boundary) information, the center point of that bounding extent is used as the geometric center point; for point elements, such as global GPS data, each point itself serves as its geometric center point; for line elements or plane elements, the center point of the element is taken as the geometric center point. After the geometric center point of a spatial element is determined, the spatial encoding can be implemented with a space-filling curve such as the Peano or Hilbert curve, preferably the Hilbert curve.
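As an illustration of S3, the following Python sketch converts geometric-center coordinates into Hilbert codes using the classic iterative xy-to-distance algorithm and sorts the elements by that code. It assumes the centers have already been quantized to integer grid coordinates on a 2^order by 2^order grid, and the dictionary key center is a hypothetical field holding those (gx, gy) coordinates:

    def hilbert_code(x, y, order=16):
        # Map grid coordinates (x, y) in [0, 2**order) to a 1-D Hilbert distance.
        n = 1 << order
        d = 0
        s = n >> 1
        while s > 0:
            rx = 1 if (x & s) > 0 else 0
            ry = 1 if (y & s) > 0 else 0
            d += s * s * ((3 * rx) ^ ry)
            # Rotate/flip the quadrant so lower-order bits keep a consistent orientation.
            if ry == 0:
                if rx == 1:
                    x = n - 1 - x
                    y = n - 1 - y
                x, y = y, x
            s >>= 1
        return d

    def order_by_hilbert(spatial_elements, order=16):
        # Sort the spatial element set by the Hilbert code of each geometric center,
        # producing the spatial element sequence O.
        return sorted(spatial_elements,
                      key=lambda e: hilbert_code(e['center'][0], e['center'][1], order))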
S4, initializing M empty data blocks in advance, assigning unique identifiers to all empty data blocks and arranging them in order, taking the spatial elements out of the spatial element sequence O one by one in order and filling them into the first data block that is not yet full; a data block is deemed full once the total ground feature area of the spatial elements in it exceeds the allocated area threshold V, and filling then continues with the next empty data block until all spatial elements in the spatial element sequence O have been taken out and filled into the M data blocks.
To facilitate balanced allocation, the preset total number M of data blocks is recommended to be a positive integer multiple of the total number of shard nodes in the distributed database cluster; for example, with 10 shard nodes in total, M may be set to three times the number of shard nodes, i.e. M = 30.
It should be noted that the aforementioned allocated area threshold V is not the maximum element area a data block can hold, but an estimated average. When this step is executed, the M data blocks can be initialized with the total ground feature area Si of the spatial elements in each data block set to Si = 0; as spatial elements are filled into a data block one by one, Si increases accordingly, and once Si exceeds V, filling of that data block stops. The final ground feature area sum Si of a data block is therefore slightly greater than V.
In addition, it should be noted that the spatial elements in the spatial element sequence O are taken out one by one in order and filled into the M data blocks. Specifically, in the spatial element sequence O, if the sum of the ground feature areas of the first n1 spatial elements is greater than V but the sum of the first n1-1 spatial elements is not greater than V, the first n1 spatial elements are taken out of O and filled into the 1st data block; in the sequence remaining after those n1 elements are removed, if the sum of the ground feature areas of the first n2 spatial elements is greater than V but the sum of the first n2-1 spatial elements is not, the first n2 spatial elements are taken out and filled into the 2nd data block; and so on, until all spatial elements have been taken out. The sum of the ground feature areas in the last data block may not exceed V.
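The filling rule of S4 can be sketched as follows (a minimal sketch assuming the elements arrive already sorted as the sequence O and carry the hypothetical feature_area field used above; block identifiers are simply their indices here):

    def fill_data_blocks(ordered_elements, num_blocks):
        # V = T / M, the allocated area threshold of each data block.
        total_area = sum(e['feature_area'] for e in ordered_elements)
        threshold = total_area / num_blocks
        blocks = [[] for _ in range(num_blocks)]   # M empty data blocks, in order
        area_sum = [0.0] * num_blocks              # Si for each data block
        i = 0
        for element in ordered_elements:
            # Once the current block's area sum exceeds V it is deemed full;
            # move on to the next empty block (the last block takes any remainder).
            if area_sum[i] > threshold and i < num_blocks - 1:
                i += 1
            blocks[i].append(element)
            area_sum[i] += element['feature_area']
        return blocks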
In the invention, the ground feature area is used as the basis for dividing data blocks mainly to account for the computational requirements of remote sensing data. Since remote sensing data actually correspond to ground features in Earth observation, the amount of computation on remote sensing data is usually related to the area of the corresponding ground features: the larger the feature area, the more computing resources are required. Under the division rule of S4 above, all spatial elements to be written are divided into data blocks whose ground feature area sums are roughly equal, which ensures that the computing resources consumed by each node remain relatively balanced after distributed storage.
S5, after the division into data blocks is finished, the routing node converts the unique identifier of each data block into a hash value through a hash algorithm, establishes a correspondence between each data block and a shard node according to the hash value, executes the write operation according to the correspondence, and stores each data block on the physical machine hard disk corresponding to its shard node.
In the distributed storage of the invention, the hash algorithm may be an ordinary hash algorithm or a consistent hash algorithm. In this embodiment, to make dynamic adjustment of the distributed storage nodes easier, the consistent hash algorithm is recommended. The consistent hash algorithm achieves a balanced distribution of data blocks over the shard nodes, and its specific procedure belongs to the prior art. For ease of understanding, the consistent hashing procedure is briefly described as follows: first, the hash value of each shard node is computed and placed on a hash ring; then the hash value of the primary key of each data block is computed in the same way and mapped onto the same hash ring; finally, starting from the position a data block maps to, the ring is searched clockwise and the data block is stored on the first shard node found. This establishes the correspondence between each data block and the shard node that will store it, and the data are stored through the mapping between shard nodes and physical hard disks, so that data Value i is written to a hard disk according to primary key Key i, as shown in FIG. 2.
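For reference, a minimal consistent-hash ring in Python (a sketch without virtual nodes, using MD5; the node and block names are hypothetical) that maps each data block identifier to the first shard node found clockwise on the ring:

    import bisect
    import hashlib

    class ConsistentHashRing:
        def __init__(self, shard_nodes):
            # Place every shard node on the ring by hashing its name.
            self._ring = sorted((self._hash(n), n) for n in shard_nodes)
            self._points = [h for h, _ in self._ring]

        @staticmethod
        def _hash(value):
            return int(hashlib.md5(str(value).encode('utf-8')).hexdigest(), 16)

        def node_for(self, block_id):
            # Hash the block's unique identifier and walk clockwise to the next node.
            h = self._hash(block_id)
            idx = bisect.bisect_right(self._points, h) % len(self._points)
            return self._ring[idx][1]

    # Example: ring = ConsistentHashRing(['shard-1', 'shard-2', 'shard-3'])
    #          ring.node_for('block-0007')  -> one of the shard nodes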
In addition, the storage of massive remote sensing data must also solve the scalability problem of the storage layer. The storage layer can adopt a distributed hash table (DHT) routing algorithm, in which each storage node is responsible for storing a small part of the data, and addressing and storage of the whole system's data are realized on the basis of the DHT. In a concrete implementation, the storage layer can divide the hash space into N equal partitions, each being one partition node (Partition), and distribute the N partitions evenly across the hard disks. For example, the system default is N = 3600; assuming the current system has 50 hard disks, each hard disk carries 72 partitions. When the system is initialized, the mapping table between shard nodes and physical machine hard disks is established, and this mapping is adjusted in real time as the number of hard disks in the system changes. The mapping table is small, and a copy is stored on every shard node in the distributed database cluster for fast routing. Unlike traditional storage, this routing mechanism needs no centralized metadata management, so the metadata service is no longer a performance bottleneck of the system, and this approach effectively solves the problem of storage scalability.
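The even division of the hash space described above can be sketched as follows (illustrative only; with num_partitions = 3600 and 50 disk identifiers, each disk ends up carrying 72 partitions, and the table is simply rebuilt whenever the set of disks changes):

    def build_partition_map(num_partitions, disk_ids):
        # Round-robin assignment spreads the partitions evenly over the hard disks.
        return {p: disk_ids[p % len(disk_ids)] for p in range(num_partitions)}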
In addition, remote sensing data typically exhibit multi-source heterogeneity, so a flexibly structured data storage and organization scheme is needed to meet the requirement of unified storage of heterogeneous metadata. Considering that current unified metadata exchange mostly adopts extensible data description formats such as XML and JSON, the metadata of the global comprehensive observation results are likewise described with such a flexibly structured data model in the distributed metadata database and physically stored as a binary stream.
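A minimal sketch of this flexible description plus binary physical storage, assuming JSON is the chosen description format:

    import json

    def encode_metadata(metadata: dict) -> bytes:
        # Schema-free description of heterogeneous metadata, stored as a binary stream.
        return json.dumps(metadata, ensure_ascii=False).encode('utf-8')

    def decode_metadata(blob: bytes) -> dict:
        return json.loads(blob.decode('utf-8'))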
In addition, to guarantee the reliability of distributed storage of remote sensing data, the organizational structure of the distributed database must be designed accordingly. As a preferred implementation of the invention, each database is kept as a primary copy and a secondary copy based on a multi-copy consistency protocol, stored in different sub-databases. As shown in fig. 3, the master database (M) provides the metadata write service and the slave database (S) mainly provides the metadata read service, so that read-write separation is achieved during data access and the concurrent service capability of the system is improved.
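A minimal sketch of this read-write separation, where master and slave are hypothetical client objects exposing put/get, and replication between them is assumed to be handled by the multi-copy consistency protocol:

    class SubDatabase:
        def __init__(self, master, slave):
            self.master = master   # master database (M): metadata write service
            self.slave = slave     # slave database (S): metadata read service

        def write(self, key, value):
            # All writes go to the master; the consistency protocol propagates them.
            self.master.put(key, value)

        def read(self, key):
            # Reads are served by the slave, separating read and write traffic.
            return self.slave.get(key)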
In this way, the distributed storage method for massive remote sensing data described in S1-S5 performs balanced and reliable distributed storage of multi-source heterogeneous remote sensing data, while also accommodating the heavy computing demands of subsequent remote sensing processing so that the computing resources required by the data stored on different shard nodes remain balanced. When a client needs to access data, the shard node holding the query result is computed from the query conditions and the routing rules.
It should be noted that, according to the embodiments of the present disclosure, the steps above can be implemented by writing a computer software program containing program code for executing the corresponding method. Accordingly, corresponding to the distributed storage method described above, the present invention can also provide a distributed storage apparatus and a medium storing the program code.
In another embodiment of the present invention, there is also provided a distributed storage apparatus for mass remote sensing data, comprising a memory and a processor;
the memory for storing a computer program;
the processor is configured to, when executing the computer program, implement the distributed storage method for mass remote sensing data as described in S1-S5.
In another embodiment of the present invention, a computer-readable storage medium is further provided, wherein the storage medium stores thereon a computer program, and when the computer program is executed by a processor, the computer program can implement the distributed storage method for massive remote sensing data as described in the above S1-S5.
It should be noted that the memory may include a random access memory (RAM) or a non-volatile memory (NVM), such as at least one disk memory. The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. Of course, the device should also have the components necessary for the program to run, such as a power supply and a communication bus.
The above-described embodiments are merely preferred embodiments of the present invention, which should not be construed as limiting the invention. Various changes and modifications may be made by one of ordinary skill in the pertinent art without departing from the spirit and scope of the present invention. Therefore, the technical scheme obtained by adopting the mode of equivalent replacement or equivalent transformation is within the protection scope of the invention.

Claims (10)

1. A distributed storage method for massive remote sensing data is characterized by comprising the following steps:
s1, after receiving a remote sensing data write request submitted by a client, a routing node in the distributed database cluster reads all spatial elements involved in the write request to form a spatial element set;
s2, counting the total ground feature area of all spatial elements in the spatial element set, and, according to a preset total number of data blocks, calculating the ratio of the total ground feature area to the total number of data blocks as the allocated area threshold of each data block;
s3, converting the geometric center point coordinates of all spatial elements in the spatial element set into spatial encoding values through a space-filling curve, and sorting all spatial elements by their spatial encoding values to form a spatial element sequence;
s4, assigning unique identifiers to all empty data blocks and arranging them in order, taking the spatial elements out of the spatial element sequence one by one in order and filling them into the first data block that is not yet full; a data block is deemed full once the total ground feature area of the spatial elements in it exceeds the allocated area threshold, and filling then continues with the next empty data block until all spatial elements in the spatial element sequence have been taken out and filled into data blocks;
and s5, the routing node converts the unique identifier of each data block into a hash value through a hash algorithm, establishes a correspondence between each data block and a shard node according to the hash value, executes the write operation according to the correspondence, and stores each data block on the physical machine hard disk corresponding to its shard node.
2. The distributed storage method for massive remote sensing data according to claim 1, wherein the space-filling curve is a Hilbert curve.
3. The distributed storage method for massive remote sensing data according to claim 1, wherein the spatial element types in the spatial element set include image data and vector data, the image data including whole images and image tiles, and the vector data including point elements, line elements and plane elements.
4. The distributed storage method for massive remote sensing data according to claim 1, wherein the geometric center point of a point element is the point element itself.
5. The distributed storage method for massive remote sensing data according to claim 1, wherein the preset total number of data blocks is a positive integer multiple of the total number of shard nodes in the distributed database cluster.
6. The distributed storage method for massive remote sensing data according to claim 1, wherein the hash algorithm is a consistent hash algorithm.
7. The distributed storage method for massive remote sensing data according to claim 1, wherein each shard node in the distributed database cluster stores a mapping table between shard nodes and physical machine hard disks, each physical machine hard disk carries a plurality of shard nodes, and the mapping table is adjusted in real time as the number of hard disks changes.
8. The distributed storage method for massive remote sensing data according to claim 1, wherein each database is stored as a primary copy and a standby copy based on a multi-copy consistency protocol, so as to implement read-write separation.
9. A distributed storage device facing massive remote sensing data is characterized by comprising a memory and a processor;
the memory for storing a computer program;
the processor is configured to implement, when executing the computer program, the distributed storage method for massive remote sensing data according to any one of claims 1 to 8.
10. A computer-readable storage medium, characterized in that the storage medium has stored thereon a computer program which, when being executed by a processor, is capable of implementing a distributed storage method for mass remote sensing data according to any one of claims 1 to 8.
CN202111573294.3A 2021-12-21 2021-12-21 Distributed storage method, device and medium for massive remote sensing data Active CN114338718B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111573294.3A CN114338718B (en) 2021-12-21 2021-12-21 Distributed storage method, device and medium for massive remote sensing data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111573294.3A CN114338718B (en) 2021-12-21 2021-12-21 Distributed storage method, device and medium for massive remote sensing data

Publications (2)

Publication Number Publication Date
CN114338718A 2022-04-12
CN114338718B CN114338718B (en) 2022-09-20

Family

ID=81055317

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111573294.3A Active CN114338718B (en) 2021-12-21 2021-12-21 Distributed storage method, device and medium for massive remote sensing data

Country Status (1)

Country Link
CN (1) CN114338718B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116644202A (en) * 2023-06-06 2023-08-25 北京捷泰云际信息技术有限公司 Method and device for storing large-data-volume remote sensing image data

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101963996A (en) * 2010-10-29 2011-02-02 中国科学院地理科学与资源研究所 Global raster data storing method
CN103475905A (en) * 2013-09-25 2013-12-25 浙江大学 Image data block efficient transmission method based on network connection multiplexing under cloud environment
CN103491185A (en) * 2013-09-25 2014-01-01 浙江大学 Remote sensing data cloud storage method based on image block organization
CN109635068A (en) * 2018-12-20 2019-04-16 中国地质大学(武汉) Mass remote sensing data high-efficiency tissue and method for quickly retrieving under cloud computing environment
CN110442444A (en) * 2019-06-18 2019-11-12 中国科学院计算机网络信息中心 A kind of parallel data access method and system towards mass remote sensing image
US20200210421A1 (en) * 2018-12-29 2020-07-02 Wuhan University Method of storing remote sensing big data in hbase database
CN112559481A (en) * 2020-11-27 2021-03-26 西安中科星图空间数据技术有限公司 Data storage method and device based on distributed system and relational database
CN112966134A (en) * 2021-03-26 2021-06-15 北京亿耘科技有限公司 Remote sensing image data processing method, system and storage medium
CN113641765A (en) * 2021-10-13 2021-11-12 浙江大学 Unified logic model organization method and device for massive multi-source remote sensing data


Also Published As

Publication number Publication date
CN114338718B (en) 2022-09-20


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant