CN114338718B - Distributed storage method, device and medium for massive remote sensing data

Info

Publication number: CN114338718B
Application number: CN202111573294.3A
Authority: CN
Inventors: 杜震洪; 孙逍遥; 郝亚楠; 陈奕君
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2021-12-21
Filing date: 2021-12-21
Publication date: 2022-09-20
Anticipated expiration: 2041-12-21
Also published as: CN114338718A

Abstract

The invention discloses a distributed storage method, device, and medium for massive remote sensing data, belonging to the field of distributed storage. The invention optimizes the data partitioning rules for the metadata of massive remote sensing data and uses data blocks that each contain multiple spatial elements as the unit of sharding. This keeps the amount of computation required by the remote sensing data roughly balanced across shard nodes, and serves both the distributed storage of massive remote sensing data and the heavy computing demands later placed on that data. The invention can improve the stability and load balance of the distributed engine, and is of great significance to the application of global comprehensive observation results.

Description

Distributed storage method, device and medium for massive remote sensing data

Technical field

The invention belongs to the technical field of distributed storage, and in particular relates to a distributed storage method, device, and medium for massive remote sensing data.

Background

Remote sensing data is usually stored in raster, vector, and other formats. The raster data model is relatively simple, but the data volume is large, so most existing research on raster data focuses on its storage and management. Experts, research institutions, and commercial companies in China and abroad have done a great deal of work on storing massive remote sensing data. Existing remote sensing data storage approaches fall into five main categories: storage based on a centralized file system, storage based on a database, network storage, file storage based on a distributed file system, and storage based on a distributed database.

Storage based on a centralized file system keeps the remote sensing data on a single server and uses the file system to store and manage the data files themselves. The capacity and performance of centralized file storage depend almost entirely on that single server, so the approach is strongly dependent on hardware performance: it is difficult to provide high read speeds, and difficult to use the hardware for data computation.

Combining a relational database with geoscience data middleware makes it possible to store large remote sensing datasets. However, database-based storage also has problems: relational databases offer limited support for single large files, place many restrictions on the storage of unstructured data, and suffer a significant performance drop as the data volume grows.

To overcome the capacity limit of a single storage device, the overall capacity of a storage cluster can be increased by adding nodes. Connecting individual nodes horizontally over a network integrates nodes of relatively small capacity into a large, centrally managed data storage server. However, because the individual nodes differ from one another, it is difficult to manage and maintain all of them centrally.

A distributed file system uses a network architecture to link a large number of spatially and physically dispersed computing or storage nodes through logical relationships, forming a storage system that is physically dispersed and logically unified, with data flowing over the network. However, its data retrieval is not tailored to the characteristics of remote sensing data, and data access latency increases as the stored volume grows, which lowers the overall operating speed.

A distributed database architecture is a form of spatial data storage that exposes a unified access interface by combining relational database technology with spatial data middleware. However, because every service request must be processed and then forwarded by a network server, the proxy server is easily congested by excessive network requests and becomes the performance bottleneck of the whole system.

In addition, traditional distributed storage technology considers only storage and access requirements, so its sharding mechanism mainly balances load by file size. Remote sensing data, however, often carries heavy computing requirements, and the amount of computation is not correlated with file size. As a result, remote sensing data stored this way consumes computing resources unevenly across nodes during computation, creating an additional performance bottleneck.

Therefore, the present invention proposes an efficient distributed storage method for massive remote sensing data that meets both the organization and the computing requirements of such data.

Summary of the invention

The purpose of the present invention is to solve the problems existing in the prior art and to provide a distributed storage method, device, and medium for massive remote sensing data.

The specific technical solution adopted by the present invention is as follows:

In a first aspect, the present invention provides a distributed storage method for massive remote sensing data, comprising:

S1. After receiving a remote sensing data write request submitted by a client, a routing node in the distributed database cluster reads all spatial elements involved in the write request to form a spatial element set;

S2. Sum the ground-object areas of all spatial elements in the spatial element set to obtain the total ground-object area and, given a preset total number of data blocks, compute the ratio of the total ground-object area to the total number of data blocks as the allocation area threshold of each data block;

S3. Convert the geometric center point coordinates of all spatial elements in the spatial element set into spatial code values by means of a space-filling curve, and sort all spatial elements by spatial code value to form a spatial element sequence;

S4. Assign a unique identifier to every empty data block and arrange the blocks in order; take the spatial elements out of the spatial element sequence one by one, in order, and put each into the first data block that is not yet full; a data block is regarded as full once the total ground-object area of the spatial elements in it exceeds the allocation area threshold, after which filling continues with the next empty data block, until all spatial elements in the spatial element sequence have been taken out and placed in data blocks;

S5. The routing node converts the unique identifier of each data block into a hash value using a hash algorithm, establishes the correspondence between data blocks and shard nodes according to the hash values, and performs the write operation according to that correspondence, storing each data block on the physical machine hard disk corresponding to its shard node.

As a preferred option of the first aspect, the space-filling curve is a Hilbert curve.

As a preferred option of the first aspect, the spatial element types in the spatial element set include image data and vector data; the image data includes whole images and image tiles, and the vector data includes point elements, line elements, and polygon elements.

As a preferred option of the first aspect, the geometric center point of a point element is the point element itself.

As a preferred option of the first aspect, the preset total number of data blocks is a positive integer multiple of the total number of shard nodes in the distributed database cluster.

As a preferred option of the first aspect, the hash algorithm is a consistent hashing algorithm.

As a preferred option of the first aspect, each shard node in the distributed database cluster stores a mapping table between shard nodes and physical machine hard disks; each physical machine hard disk hosts multiple shard nodes, and the mapping table is adjusted in real time as the number of hard disks changes.

As a preferred option of the first aspect, each database is stored as two copies, a primary and a backup, under a multi-replica consistency protocol, so as to separate reads from writes.

In a second aspect, the present invention provides a distributed storage device for massive remote sensing data, comprising a memory and a processor;

the memory is configured to store a computer program;

the processor is configured to implement, when executing the computer program, the distributed storage method for massive remote sensing data according to any solution of the first aspect.

In a third aspect, the present invention provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, it implements the distributed storage method for massive remote sensing data according to any solution of the first aspect.

Compared with the prior art, the present invention has the following beneficial effects:

By optimizing the data partitioning rules for the metadata of massive remote sensing data and using data blocks that contain multiple spatial elements as the unit of sharding, the invention keeps the amount of computation required by the remote sensing data roughly balanced across shard nodes. The invention serves both the distributed storage of massive remote sensing data and the heavy computing demands later placed on that data; it supports unified storage of heterogeneous remote sensing data, and also addresses the hot-spot data and load imbalance that arise when many users and multiple services access the same system concurrently. The invention can improve the stability and load balance of the distributed engine, and is of great significance to the application of global comprehensive observation results.

Description of the drawings

Fig. 1 is a flow chart of the steps of the distributed storage method for massive remote sensing data;

Fig. 2 is a schematic diagram of sharded storage performed with the hash algorithm;

Fig. 3 is a schematic diagram of sub-database storage.

Detailed description

To make the above objects, features, and advantages of the present invention clearer and easier to understand, specific embodiments of the invention are described in detail below with reference to the accompanying drawings. Many specific details are set forth in the following description to provide a thorough understanding of the invention. However, the invention can be implemented in many ways other than those described here, and those skilled in the art can make similar improvements without departing from its substance, so the invention is not limited by the specific embodiments disclosed below. The technical features of the various embodiments can be combined with one another provided they do not conflict.

In a preferred embodiment of the present invention, a distributed storage method for massive remote sensing data is provided for the distributed storage of very large volumes of remote sensing data; specifically, the method can be used for distributed storage of global comprehensive observation results. Global comprehensive observation results are typically multi-source and heterogeneous, comprising remote sensing data in many different forms such as images and vectors. For imagery, for example, the data sources include satellites such as GF1, GF2, HJ, and ZY; in terms of storage structure, different acquisition devices use different metadata schemas, but their structure descriptions mostly contain some spatial location information. The basic design idea of the distributed storage model for global comprehensive observation results is therefore to build on a distributed database storage architecture, take into account the multi-source heterogeneity, spatial relationships, time series, and other characteristics of the observation data, and design a distributed storage method that supports multiple structures, scales well, and is easy to implement, providing basic distributed storage and efficient parallel query and computing services for the remote sensing data of massive global comprehensive observation results.

It should be noted that the term "massive" in the present invention only describes the fact that the volume of remote sensing data is large, possibly reaching the hundred-petabyte level; it does not limit the specific data size.

The distributed database cluster of the present invention contains a routing node and shard nodes; the routing node is responsible for distributing data to the remaining nodes, the shard nodes, according to the routing mechanism. Unlike traditional storage, this routing mechanism has no centralized metadata management, so the metadata service is no longer a performance bottleneck of the system, which effectively addresses the problem of storage scalability.

As shown in Fig. 1, the steps of the distributed storage method for massive remote sensing data are described in detail below:

S1. After receiving a remote sensing data write request submitted by a client, the routing node in the distributed database cluster reads all spatial elements involved in the write request to form a spatial element set.

In this embodiment, each remote sensing data write request specifies the remote sensing data to be stored, which may consist of different types of spatial elements. The spatial element types include image data and vector data. Image data may be either a whole image in raster form or an image tile, depending on the type of image data specified for writing; vector data comprises three kinds: point elements, line elements, and polygon elements. All spatial elements specified in the write request form the spatial element set, which the routing node then shards and stores according to steps S2 to S5.

S2. Sum the ground-object areas of all spatial elements in the spatial element set to obtain the total ground-object area T and, given the preset total number of data blocks M, compute the ratio V = T/M of the total ground-object area T to the total number of data blocks M as the allocation area threshold of each data block.

It should be noted that each spatial element in remote sensing data corresponds to a certain kind of ground object seen during Earth observation, and the fields carried by the original data also store the ground-object area of each spatial element. Therefore, in this embodiment, the ground-object area can be read directly from the corresponding attribute field of each spatial element, and summing the ground-object areas of all spatial elements in the set yields the total ground-object area T. Note in particular that the ground-object area of a spatial element may be either the actual area or a relative value converted by a scale factor; when relative values are used, the scale factor must be the same for all spatial elements.
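The threshold computation in S2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the `SpatialElement` class, its `area` field, and the function name `allocation_threshold` are hypothetical names chosen here, and all areas are assumed to share one unit or one scale factor, as the paragraph above requires.

```python
from dataclasses import dataclass


@dataclass
class SpatialElement:
    """Hypothetical record for one spatial element."""
    element_id: str
    area: float  # ground-object area read from the element's attribute field


def allocation_threshold(elements, num_blocks):
    """Return V = T / M, the allocation area threshold per data block."""
    if num_blocks <= 0:
        raise ValueError("num_blocks (M) must be a positive integer")
    total_area = sum(e.area for e in elements)  # T
    return total_area / num_blocks              # V


elements = [
    SpatialElement("a", 40.0),
    SpatialElement("b", 25.0),
    SpatialElement("c", 35.0),
]
V = allocation_threshold(elements, num_blocks=4)  # T = 100.0, M = 4, V = 25.0
```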

S3. For each spatial element in the spatial element set, determine the coordinates of its geometric center point; then convert the geometric center point coordinates of all spatial elements into spatial code values by means of a space-filling curve, and sort all spatial elements by spatial code value to form the spatial element sequence O.

In this embodiment, because spatial elements comprise both remote sensing images and vector data, the geometric center point must be determined according to the element type. For image data that carries bounding extent information (its west, east, south, and north limits), the center of that extent is used as the geometric center point; for point elements, such as global GPS data, each point serves as its own geometric center point; for line or polygon elements, the center point of the element is used as the geometric center point. Once the geometric center points have been determined, spatial coding can use a space-filling curve such as the Peano curve or the Hilbert curve, preferably the Hilbert curve.
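One way to realize the coding and sorting in S3 is sketched below. The source does not specify the grid resolution or the coordinate system, so the following are assumptions made for illustration: center points are longitude and latitude in degrees, they are quantized onto a 2^order by 2^order grid, and the names `hilbert_index`, `center_of_extent`, and `spatial_code` are hypothetical. The index computation is the widely used iterative conversion from grid cell to distance along a Hilbert curve.

```python
def hilbert_index(order, x, y):
    """Distance along a Hilbert curve of the cell (x, y) on a 2**order grid."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or flip the quadrant so the sub-curve has a standard orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def center_of_extent(west, south, east, north):
    """Geometric center of an image's bounding extent."""
    return ((west + east) / 2.0, (south + north) / 2.0)


def spatial_code(lon, lat, order=16):
    """Quantize a lon/lat center point to the grid and return its Hilbert code."""
    n = 1 << order
    x = min(n - 1, int((lon + 180.0) / 360.0 * n))
    y = min(n - 1, int((lat + 90.0) / 180.0 * n))
    return hilbert_index(order, x, y)


# Hypothetical center points for three elements of different types.
centers = {
    "image-tile": center_of_extent(119.0, 29.0, 121.0, 31.0),
    "gps-point": (-73.9, 40.7),
    "polygon": (120.2, 30.3),
}
# The spatial element sequence O: element names sorted by spatial code value.
sequence_O = sorted(centers, key=lambda name: spatial_code(*centers[name]))
```

Sorting by Hilbert code tends to place elements that are close on the ground close together in the sequence, which is why consecutive elements end up in the same data block in S4.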

S4. The M data blocks are initialized as empty in advance, every empty data block is given a unique identifier, and the blocks are arranged in order. The spatial elements in the spatial element sequence O are taken out one by one, in order, and each is put into the first data block that is not yet full. A data block is regarded as full once the total ground-object area of the spatial elements in it exceeds the allocation area threshold V, after which filling continues with the next empty data block, until all spatial elements in the sequence O have been taken out and placed in the M data blocks.

To make balanced allocation easier, the preset total number of data blocks M is recommended to be a positive integer multiple of the total number of shard nodes in the distributed database cluster. For example, with 10 shard nodes in total, M can be set to three times the number of shard nodes, that is, M = 30.

Note that the allocation area threshold V is not the maximum area of elements that a data block can hold, but an estimated average. During this step, each of the M data blocks can be initialized with the total ground-object area of its spatial elements, S_i, set to 0. S_i increases as spatial elements are placed in the block one by one, and filling of the block stops once S_i exceeds V. The final total ground-object area S_i of a filled data block is therefore slightly larger than V.
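The size of that overshoot, and the fact that M blocks are always enough, can be made explicit with a short derivation. The symbols a_j (area of element j), a_max (largest single element area), and k (number of blocks that reach the "full" state) are notation introduced here for illustration, and T > 0 is assumed.

```latex
\begin{aligned}
V &= \frac{T}{M}, \qquad T = \sum_{j} a_j, \\
V &< S_i \le V + a_{\max} \quad \text{for every filled block } i, \\
kV &< \sum_{i=1}^{k} S_i \le T = MV \;\Longrightarrow\; k < M .
\end{aligned}
```

The second line holds because a block holds at most V before its last element is added. The third line shows that at most M-1 blocks can become full, so any elements left over fit in one further block and no more than M blocks are used.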

Note also that the spatial elements in the sequence O are taken out one by one, in order, to fill the M data blocks. In other words, if the sum of the ground-object areas of the first n₁ spatial elements in O is greater than V while the sum for the first n₁-1 spatial elements is not greater than V, then the first n₁ spatial elements are taken out of O and placed in the 1st data block. In the sequence O that remains after those n₁ elements have been removed, if the sum of the ground-object areas of the first n₂ spatial elements is greater than V while the sum for the first n₂-1 spatial elements is not greater than V, then those n₂ spatial elements are taken out of O and placed in the 2nd data block; and so on, until all spatial elements have been taken out. The sum of the ground-object areas in the last data block may not exceed V.
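The filling rule of S4 amounts to a single greedy pass over the sorted sequence, which can be sketched as follows. The function name `fill_blocks` and the sample areas are hypothetical; the input is assumed to be the list of ground-object areas already arranged in the order of sequence O, and each block is represented by the list of element positions it receives.

```python
def fill_blocks(areas, num_blocks):
    """Greedily pack elements, in sequence order, into area-balanced blocks.

    areas      -- ground-object areas of the elements, in the order of sequence O
    num_blocks -- preset total number of data blocks M
    Returns (V, blocks), where blocks[i] lists the positions placed in block i.
    """
    threshold = sum(areas) / num_blocks   # V = T / M
    blocks = [[]]
    block_area = 0.0                      # S_i of the block being filled
    for index, area in enumerate(areas):
        if block_area > threshold:        # current block is full: open the next one
            blocks.append([])
            block_area = 0.0
        blocks[-1].append(index)
        block_area += area
    return threshold, blocks


areas_in_O = [5, 12, 9, 7, 3, 14, 6, 4]   # T = 60
V, blocks = fill_blocks(areas_in_O, num_blocks=3)
block_sums = [sum(areas_in_O[i] for i in block) for block in blocks]
# V == 20.0
# blocks == [[0, 1, 2], [3, 4, 5], [6, 7]]
# block_sums == [26, 24, 10]
```

In this small example the first two blocks end slightly above V and the last block stays below it, matching the behavior described in the text.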

The present invention uses ground-object area as the basis for dividing data into blocks mainly to account for the computing requirements of remote sensing data. Because remote sensing data corresponds to ground objects observed on the Earth, the amount of computation on the data is often related to the area of the corresponding ground objects: the larger the area, the more computing resources are needed. Under the partitioning rule of S4, all spatial elements to be written are divided into data blocks whose total ground-object areas are roughly uniform, which keeps the computing resource consumption of each node relatively balanced after distributed storage.

S5. Once the data blocks have been formed, the routing node converts the unique identifier of each data block into a hash value using a hash algorithm, establishes the correspondence between data blocks and shard nodes according to the hash values, and performs the write operation according to that correspondence, storing each data block on the physical machine hard disk corresponding to its shard node.

In the distributed storage of the present invention, the hash algorithm may be either a hard hash algorithm or a consistent hashing algorithm. In this embodiment, consistent hashing is recommended because it makes dynamic adjustment of the distributed storage nodes easier. Consistent hashing distributes data blocks evenly across shard nodes, and its detailed procedure belongs to the prior art. For ease of understanding, the procedure is briefly as follows: first, compute the hash value of each shard node and place it on the hash ring; then compute, in the same way, the hash value of the primary key of each data block and map it onto the same ring; finally, search clockwise from the position to which the data block maps and store the data block on the first shard node found. This establishes the correspondence between each data block and the shard node that will store it; storage then proceeds through the mapping between shard nodes and physical hard disks, so that the data Value i is written to disk according to its primary key Key i, as shown in Fig. 2.
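The three steps of consistent hashing described above can be sketched in a few lines. The source does not name a hash function or an identifier format, so the choices here are assumptions for illustration: SHA-256 truncated to 64 bits as the ring hash, node names such as `shard-0`, and block identifiers such as `block-0`. Virtual nodes, which practical deployments often add to even out the distribution, are left out to keep the sketch short.

```python
import bisect
import hashlib


def ring_hash(key):
    """Map a string onto a 64-bit position on the hash ring."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Minimal consistent-hash ring without virtual nodes."""

    def __init__(self, nodes):
        # Step 1: hash every shard node and place it on the ring.
        self._ring = sorted((ring_hash(node), node) for node in nodes)
        self._positions = [position for position, _ in self._ring]

    def node_for(self, block_id):
        # Step 2: hash the data block's key onto the same ring.
        position = ring_hash(block_id)
        # Step 3: walk clockwise to the first node at or after that position,
        # wrapping around to the start of the ring if necessary.
        i = bisect.bisect_left(self._positions, position)
        return self._ring[i % len(self._ring)][1]


nodes = ["shard-0", "shard-1", "shard-2", "shard-3"]
block_ids = [f"block-{i}" for i in range(12)]
ring = HashRing(nodes)
placement = {block_id: ring.node_for(block_id) for block_id in block_ids}
```

The property that makes this attractive for dynamic adjustment is that removing a shard node only moves the blocks that were stored on that node; every other block keeps its placement.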

In addition, storing massive remote sensing data requires a scalable storage layer. The storage layer can use a distributed hash table (DHT) routing algorithm, in which each storage node is responsible for storing a small portion of the data, and addressing and storage of data across the whole system are implemented on the basis of the DHT. In a specific implementation, the storage layer can divide the hash space into N equal parts, each part being one partition, and distribute these N parts evenly over the hard disks. For example, N defaults to 3600; if the system currently has 50 hard disks, each hard disk hosts 72 partitions. The mapping table between shard nodes and physical machine hard disks described above is built when the system is initialized, and the mapping is subsequently adjusted in real time as the number of hard disks in the system changes. The mapping table takes very little space, and each shard node in the distributed database cluster stores a copy of it for fast routing. Unlike traditional storage, this routing mechanism has no centralized metadata management, so the metadata service is no longer a performance bottleneck of the system, which effectively addresses the problem of storage scalability.
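The partition arithmetic in this paragraph can be checked with a small sketch. The source only says that the N partitions are divided evenly among the hard disks; the round-robin assignment, the SHA-256 based hash, and the names `partition_of` and `build_partition_table` are assumptions made here for illustration.

```python
import hashlib
from collections import Counter


def partition_of(key, num_partitions=3600):
    """Map a key to one of N equal slices of a 64-bit hash space."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    h = int.from_bytes(digest[:8], "big")   # 0 <= h < 2**64
    return (h * num_partitions) >> 64       # index of the slice containing h


def build_partition_table(num_partitions, disks):
    """Assign partitions to disks round-robin so that the shares are equal."""
    return {p: disks[p % len(disks)] for p in range(num_partitions)}


disks = [f"disk-{i:02d}" for i in range(50)]
table = build_partition_table(3600, disks)
per_disk = Counter(table.values())          # 3600 / 50 = 72 partitions per disk

# Routing a data block: key -> partition -> disk, using only the small table.
disk_for_block = table[partition_of("block-0007")]
```

When the number of disks changes, the table is simply rebuilt or adjusted, for example `build_partition_table(3600, disks + new_disks)`, which is the kind of real-time adjustment the paragraph describes. A production system would move as few partitions as possible rather than reassign all of them.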

In addition, remote sensing data is typically multi-source and heterogeneous, so unified storage of heterogeneous metadata requires a data storage and organization structure that is flexible. Because extensible data description formats such as XML and JSON are now widely used for unified metadata exchange, the distributed metadata database likewise uses such a structurally flexible data model to describe the metadata of global comprehensive observation results, and stores it physically as a binary stream.
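A flexible metadata record stored as a binary stream might look like the sketch below. JSON encoded as UTF-8 bytes is one possible choice among those the paragraph allows; the field names (`type`, `satellite`, `extent`, `geometry`, `center`, `area`) and the values are hypothetical and do not come from the source.

```python
import json


def encode_metadata(record):
    """Serialize a metadata record to a byte stream (JSON encoded as UTF-8)."""
    return json.dumps(record, ensure_ascii=False, sort_keys=True).encode("utf-8")


def decode_metadata(blob):
    """Restore a metadata record from its byte stream."""
    return json.loads(blob.decode("utf-8"))


# Two heterogeneous records with different fields share one storage path.
image_meta = {
    "type": "image",
    "satellite": "GF1",
    "extent": [119.0, 29.0, 121.0, 31.0],
    "area": 4.0,
}
vector_meta = {
    "type": "vector",
    "geometry": "polygon",
    "center": [120.2, 30.3],
    "area": 0.5,
}
blobs = [encode_metadata(m) for m in (image_meta, vector_meta)]
restored = [decode_metadata(b) for b in blobs]
```

Because each record carries its own field names, records from different sensors can be stored side by side without a shared fixed schema.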

In addition, to ensure the reliability of distributed remote sensing data storage, the organizational architecture of the distributed database must also be designed. As a preferred implementation of the present invention, each database is stored as two copies, a primary and a backup, under a multi-replica consistency protocol, and the copies are kept in different sub-databases. As shown in Fig. 3, the master database (M) provides the metadata write service, while the slave database (S) mainly provides the metadata read service. This separates reads from writes during data access and improves the concurrent service capability of the system.
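The read-write separation can be illustrated with a toy model. The source does not describe the multi-replica consistency protocol, so this sketch replaces it with a simple synchronous copy and uses in-memory dictionaries in place of real databases; the class name `ReplicatedDatabase` and the sample key are hypothetical.

```python
class ReplicatedDatabase:
    """Toy master/slave pair: writes go to the master, reads go to the slave."""

    def __init__(self):
        self.master = {}  # master database (M): serves metadata writes
        self.slave = {}   # slave database (S): serves metadata reads

    def write(self, key, value):
        self.master[key] = value
        # Stand-in for the replication protocol: copy synchronously to the slave.
        self.slave[key] = value

    def read(self, key):
        return self.slave[key]


db = ReplicatedDatabase()
db.write("block-0007", b'{"type": "image"}')
value = db.read("block-0007")
```

Routing reads away from the master leaves it free to absorb writes, which is the source of the concurrency gain mentioned above. A real deployment would also have to handle replication lag and failover, which this sketch ignores.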

由此，上述S1～S5所述的面向巨量遥感数据的分布式存储方法，可以将多源异构的遥感数据进行均衡、可靠地分布式存储，而且其能够兼顾后续遥感数据的海量计算需求，平衡不同分片节点上存储的数据所需的计算资源。当客户端需要访问数据时，依据查询条件及路由规则，计算查询结果所在的分片节点即可。Therefore, the distributed storage method for massive remote sensing data described in the above S1 to S5 can perform balanced and reliable distributed storage of multi-source and heterogeneous remote sensing data, and it can also take into account the massive computing requirements of subsequent remote sensing data. , balancing the computing resources required for data stored on different shard nodes. When the client needs to access data, it can calculate the shard node where the query result is located according to the query conditions and routing rules.

It should be noted that, according to the embodiments disclosed in the present invention, the functions that implement each of the above steps in the memory can be realized by a written computer software program, which contains program code for executing the corresponding method. Accordingly, in correspondence with the aforementioned distributed storage method, the present invention can further provide a corresponding distributed storage device and a medium that stores the aforementioned program code.

In another embodiment of the present invention, a distributed storage device for massive remote sensing data is also provided, which comprises a memory and a processor;

the memory is configured to store a computer program;

the processor is configured to, when executing the computer program, implement the distributed storage method for massive remote sensing data described in S1 to S5 above.

In another embodiment of the present invention, a computer-readable storage medium is also provided. A computer program is stored on the storage medium, and when the computer program is executed by a processor, the distributed storage method for massive remote sensing data described in S1 to S5 above is implemented.

It should be noted that the above memory may include random access memory (RAM) and may also include non-volatile memory (NVM), for example at least one disk memory. The above processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. Of course, the device should also have the components necessary for running the program, such as a power supply and a communication bus.

The embodiment described above is only a preferred solution of the present invention and is not intended to limit the present invention. Those of ordinary skill in the relevant technical field can make various changes and modifications without departing from the spirit and scope of the present invention. Therefore, all technical solutions obtained by equivalent replacement or equivalent transformation fall within the protection scope of the present invention.

Claims

1. A distributed storage method for massive remote sensing data, characterized by comprising:

S1. After receiving the remote sensing data writing request submitted by the client, the routing node in the distributed database cluster reads all the spatial elements involved in the writing request to form a spatial element set;

S2. Compute the total ground object area of all the spatial elements in the spatial element set, and, according to the preset total number of data blocks, calculate the ratio of the total ground object area to the total number of data blocks as the allocated area threshold of each data block;

S3. Convert the geometric center point coordinates of all spatial elements in the spatial element set into spatial code values through a space-filling curve, and sort all spatial elements by spatial code value to form a spatial element sequence;

S4. Assign a unique identifier to every empty data block and arrange the data blocks in order; take the spatial elements out of the spatial element sequence one by one in sequence and fill them into the first data block that is not yet full; when the total ground object area of the spatial elements in a data block exceeds the allocated area threshold, the data block is considered full and filling continues with the next empty data block, until all the spatial elements in the spatial element sequence have been taken out and filled into data blocks;

S5. The routing node converts the unique identifier of each data block into a hash value through a hash algorithm, establishes a correspondence between each data block and a sharding node according to the hash value, and executes the write operation according to the correspondence, so as to store each data block on the hard disk of the physical machine corresponding to the sharding node.

2. The distributed storage method for massive remote sensing data according to claim 1, wherein the space-filling curve is a Hilbert curve.
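For illustration only, converting a center point into a Hilbert code value can be sketched with the widely used iterative grid-to-distance conversion. The grid order, the longitude/latitude gridding in `center_to_cell`, and both function names are assumptions made for this example; the claim does not specify them.

```python
def hilbert_index(order: int, x: int, y: int) -> int:
    """Distance along the Hilbert curve of cell (x, y) on a 2**order grid."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def center_to_cell(lon: float, lat: float, order: int):
    """Map a lon/lat center point to a grid cell (hypothetical gridding)."""
    n = 1 << order
    x = min(n - 1, int((lon + 180.0) / 360.0 * n))
    y = min(n - 1, int((lat + 90.0) / 180.0 * n))
    return x, y


# Spatial code value of an element centered at (120.1 E, 30.3 N) on a 2**16 grid.
code = hilbert_index(16, *center_to_cell(120.1, 30.3, 16))
```

Cells that are consecutive along the curve are always adjacent on the grid, so sorting by this code value tends to keep spatially close elements in the same data block.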

3. The distributed storage method for massive remote sensing data according to claim 1, wherein the spatial element types in the spatial element set include image data and vector data, the image data includes whole-scene images and image tiles, and the vector data includes point features, line features, and polygon features.

4. The distributed storage method for massive remote sensing data according to claim 3, wherein the geometric center point of a point feature is the point feature itself.

5. The distributed storage method for massive remote sensing data according to claim 1, wherein the preset total number of data blocks is a positive integer multiple of the total number of sharding nodes in the distributed database cluster.

6. The distributed storage method for massive remote sensing data according to claim 1, wherein the hash algorithm is a consistent hash algorithm.
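As a hedged illustration of mapping data block identifiers to sharding nodes with consistent hashing, the sketch below builds a hash ring with virtual nodes. The use of MD5, the count of 64 virtual nodes per sharding node, and the names `ConsistentHashRing` and `node_for` are assumptions for this example; the claim only requires that the hash algorithm be a consistent hash algorithm.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Hash a string to an integer position on the ring (MD5 is an assumption)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class ConsistentHashRing:
    def __init__(self, nodes, vnodes: int = 64) -> None:
        if not nodes:
            raise ValueError("at least one sharding node is required")
        # Each sharding node is placed on the ring at several virtual positions.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._keys = [position for position, _ in self._ring]

    def node_for(self, block_id: str) -> str:
        """Return the first node clockwise from the block's hash position."""
        idx = bisect.bisect_right(self._keys, _hash(block_id)) % len(self._keys)
        return self._ring[idx][1]


ring = ConsistentHashRing(["shard-0", "shard-1", "shard-2"])
target = ring.node_for("block-42")
```

When a sharding node is added, a block either keeps its previous node or moves to the new one, which limits data migration when the cluster is resized.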

7. The distributed storage method for massive remote sensing data according to claim 1, wherein each sharding node in the distributed database cluster stores a mapping table between sharding nodes and physical machine hard disks, each physical machine hard disk hosts multiple sharding nodes, and the mapping table is adjusted in real time as the number of hard disks changes.

8. The distributed storage method for massive remote sensing data according to claim 1, wherein each database stores two copies, a primary and a backup, at the same time based on a multi-replica consistency protocol, so as to realize read-write separation.

9. A distributed storage device for massive remote sensing data, comprising a memory and a processor;

the memory is configured to store a computer program;

the processor is configured to, when executing the computer program, implement the distributed storage method for massive remote sensing data according to any one of claims 1 to 8.

10. A computer-readable storage medium, characterized in that a computer program is stored on the storage medium, and when the computer program is executed by a processor, the distributed storage method for massive remote sensing data according to any one of claims 1 to 8 is implemented.