CN115827907B - Cross-cloud multi-source data cube discovery and integration method based on distributed memory

Info

Publication number: CN115827907B
Application number: CN202310148563.4A
Authority: CN
Inventors: 马艳; 宋婕; 周憶芯
Original assignee: Aerospace Information Research Institute of CAS
Current assignee: Aerospace Information Research Institute of CAS
Priority date: 2023-02-22
Filing date: 2023-02-22
Publication date: 2023-04-28
Anticipated expiration: 2043-02-22
Also published as: CN115827907A

Abstract

The invention provides a cross-cloud multi-source data cube discovery and integration method based on distributed memory, comprising the following steps: constructing a cross-cloud distributed memory access layer and mounting the massive multi-source remote sensing data of different data source sites; discovering a target remote sensing data set from the different data source sites using a data discovery strategy of large-station priority and space-time matching; and dynamically creating a cross-cloud remote sensing data cube using a virtual data cube storage model, and indexing and storing the discovered target remote sensing data set so as to realize virtual data integration of the remote sensing data cube. For data source sites spread across multiple cloud platforms, the invention provides cross-cloud access to the target remote sensing data set at near-local, memory-level access speed, and effectively addresses the integration, large-scale retrieval, and efficient access of massive cross-cloud multi-source remote sensing data.

Description

Cross-cloud multi-source data cube discovery and integration method based on distributed memory

Technical Field

The invention relates to the technical field of remote sensing data integration and processing, in particular to a cross-cloud multi-source data cube discovery and integration method based on a distributed memory.

Background

With the continuous development of earth observation technology, the number and variety of earth observation satellites have increased dramatically, and, owing to the free and open data policies of many national governments and space agencies, several PB-scale collections of analysis-ready earth observation data have been made freely available on the cloud. However, current large-scale remote sensing data processing at regional or even global scale, such as global environmental change and resource monitoring, is mostly poorly automated and time-consuming. This is because it is bound to the long conventional processing chain of "download-preprocessing-storage-analysis", in which downloading large-scale data consumes a great deal of time and resources. Online processing and analysis of large-scale multi-source remote sensing data often faces intense data throughput pressure, which becomes a bottleneck for remote sensing big data processing and analysis applications. In addition, remote sensing big data is logically a multi-dimensional array, whereas the storage structure and processing mode of traditional remote sensing image files split the data across space and time, which makes space-time data analysis extremely complex and time-consuming. In summary, the unprecedented availability of large-scale space-time remote sensing data poses a significant challenge to conventional remote sensing data integration, storage, processing, analysis, management, and sharing.

The advent of the Earth Observation Data Cube (EODC) paradigm has overturned the traditional way of storing and managing earth observation (EO) data. It uses a multi-dimensional array structure to store and represent space-time remote sensing data with a high-dimensional complex structure, shifting the unit of processing from the remote sensing image file to the pixel. However, EODC is an emerging paradigm that is not yet well defined, and diverse data cube schemes continue to emerge. As the demand for processing and analyzing large-scale space-time remote sensing data grows, combining the multi-source space-time remote sensing data of cloud platforms and data cube platforms for efficient data integration and management remains very challenging.

Disclosure of Invention

The invention provides a cross-cloud multi-source data cube discovery and integration method based on a distributed memory, which aims to solve the problems in the prior art.

The invention provides a cross-cloud multi-source data cube discovery and integration method based on a distributed memory, which comprises the following steps:

constructing a cross-cloud distributed memory access layer, wherein the distributed memory access layer mounts the massive multi-source remote sensing data of different data source sites and provides a globally unified virtual memory access view;

discovering a target remote sensing data set from the different data source sites using a data discovery strategy of large-station priority and space-time matching;

and dynamically creating a cross-cloud remote sensing data cube by adopting a virtual data cube storage model, wherein the remote sensing data cube is used for indexing and storing the found target remote sensing data set so as to realize virtual data integration of the remote sensing data cube.

According to the cross-cloud multi-source data cube discovery and integration method based on distributed memory provided by the invention, the distributed memory access layer is constructed by deploying a distributed cluster system based on a distributed memory file system, and the distributed cluster system can perform single-point access to a plurality of data source sites simultaneously through the globally unified virtual memory access view.

According to the cross-cloud multi-source data cube discovery and integration method based on distributed memory provided by the invention, discovering the target remote sensing data set from the different data source sites using the data discovery strategy of large-station priority and space-time matching comprises the following steps:

constructing a global data discovery task based on a cache, and determining the type of the global data discovery task;

obtaining, based on a distributed crawler, the virtual mount directories of all data source sites from the distributed memory access layer, wherein the virtual mount directories form a tree-structured directory;

traversing all data source sites in the virtual mount directory based on the global data discovery task, calculating a site priority value for each data source site, constructing a new site data discovery task for each data source site, and adding the new site data discovery tasks to a directory task scheduling queue;

creating a plurality of data search threads, the data search threads selecting the site data discovery task with the largest site priority value from the directory task scheduling queue, recursively traversing the directory tree of the selected data source site to calculate the directory priority values of all subdirectories, and generating corresponding directory data discovery tasks that are added to the directory task scheduling queue;

the data search threads selecting the directory data discovery task with the largest directory priority value from the directory task scheduling queue and recursively searching for data until scene data is discovered, so as to realize the large-station-priority data discovery strategy;

obtaining the time and spatial range of the current scene data, and comparing them with the target parameters of the data discovery task to obtain a time matching degree and a space matching degree;

calculating the space-time matching degree of the scene data from the time matching degree and the space matching degree, so as to realize the space-time-matching data discovery strategy;

creating scene data analysis tasks for the remaining scene data in the directory containing the scene data, and adding the scene data analysis tasks to a scene data analysis task scheduling queue;

and the data search threads selecting the scene data analysis task with the largest space-time matching degree from the scene data analysis task scheduling queue, and storing the directory path of the scene data that meets the data discovery requirement, so as to complete discovery of the target remote sensing data set.

According to the cross-cloud multi-source data cube discovery and integration method based on distributed memory provided by the invention, constructing the global data discovery task based on the cache and determining the type of the global data discovery task comprises the following steps:

acquiring the target parameters of the global data discovery task, wherein the target parameters comprise a space-time region;

checking, in a distributed memory database and based on the target parameters, whether a global data discovery task whose space-time region intersects the target space-time region has already been cached;

if such a task exists in the cache, creating an incremental global data discovery task from the existing global data discovery task, and performing data discovery only on the data updated within the space-time region;

if no such task exists in the cache, creating a new global data discovery task, and performing data discovery on all data within the space-time region.
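The cache check described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `DiscoveryTask` class, the bounding-box representation of a space-time region, and the plain list standing in for the distributed memory database are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Tuple

BBox = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y), an assumed layout


@dataclass(frozen=True)
class DiscoveryTask:
    """Hypothetical record of a global data discovery task."""
    bbox: BBox
    start: date
    end: date
    kind: str = "new"  # "new" or "incremental"


def regions_intersect(a: DiscoveryTask, b: DiscoveryTask) -> bool:
    """True when both the spatial boxes and the time ranges overlap."""
    space = (a.bbox[0] <= b.bbox[2] and b.bbox[0] <= a.bbox[2]
             and a.bbox[1] <= b.bbox[3] and b.bbox[1] <= a.bbox[3])
    time = a.start <= b.end and b.start <= a.end
    return space and time


def build_global_task(bbox: BBox, start: date, end: date,
                      cached: Iterable[DiscoveryTask]) -> DiscoveryTask:
    """Return an incremental task if a cached task intersects, else a new one."""
    request = DiscoveryTask(bbox, start, end)
    for old in cached:  # stand-in for a lookup in the distributed memory database
        if regions_intersect(request, old):
            return DiscoveryTask(bbox, start, end, kind="incremental")
    return request
```

An incremental task would then restrict discovery to data updated since the cached task ran; how "updated" is detected is not specified in the text and is left out of the sketch.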

According to the cross-cloud multi-source data cube discovery and integration method based on distributed memory provided by the invention, dynamically creating the cross-cloud remote sensing data cube using the virtual data cube storage model, the remote sensing data cube indexing and storing the discovered target remote sensing data set so as to realize virtual data integration of the remote sensing data cube, comprises the following steps:

mapping the directory path in which the scene data meeting the data discovery requirement is located into a virtual access path of the distributed cluster system by virtual path mapping;

dynamically creating a cross-cloud remote sensing data cube based on a virtual data cube storage model, wherein the remote sensing data cube performs cross-cloud access on the target remote sensing data set according to the virtual access path;

creating a data cube index to the target remote sensing dataset and storing the data cube index into a database.

According to the cross-cloud multi-source data cube discovery and integration method based on distributed memory provided by the invention, dynamically creating the cross-cloud remote sensing data cube based on the virtual data cube storage model, the remote sensing data cube performing cross-cloud access to the target remote sensing data set according to the virtual access path, comprises the following steps:

constructing a virtual data cube storage model for uniformly representing, in the form of a multi-dimensional data cube model, the target remote sensing data sets with the same or overlapping space-time ranges;

and constructing a cross-cloud remote sensing data cube based on the virtual data cube storage model, the remote sensing data cube prefetching and caching the target remote sensing data set according to the virtual access path so as to provide memory-level data access.

According to the cross-cloud multi-source data cube discovery and integration method based on distributed memory provided by the invention, after dynamically creating the cross-cloud remote sensing data cube using the virtual data cube storage model and indexing and storing the discovered target remote sensing data set so as to realize virtual data integration of the remote sensing data cube, the method further comprises screening and optimizing the target remote sensing data set using a quality-priority data screening method, which specifically comprises the following steps:

partitioning the target query region to obtain a plurality of target subspace regions with equal areas, and carrying out data screening on the plurality of target subspace regions in parallel;

performing cross-cloud remote sensing data cube retrieval on each target subspace region separately to obtain a data list meeting the target requirements;

generating corresponding polygons according to the boundaries of the spatial range of each remote sensing data cube in the data list, and sorting all data in the data list according to a preset scoring function to form a candidate list;

while the candidate list is not empty and the target subspace region is not fully covered, selecting the highest-scoring scene data from the candidate list each time and judging whether its polygon intersects the target subspace region; if it does, adding the scene data to the result set and subtracting the area covered by the scene data from the target subspace region; if it does not, selecting the next highest-scoring scene data from the candidate list for judgment, so as to realize data screening and data optimization of the target remote sensing data set.
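The greedy selection for a single target subspace region can be sketched as below. The sketch makes simplifying assumptions that are not in the text: regions and footprints are approximated by sets of unit grid cells rather than true polygons, and each candidate is a hypothetical dictionary with `id`, `score`, and `footprint` keys whose score has already been computed by the preset scoring function.

```python
from typing import Dict, List, Set, Tuple

Cell = Tuple[int, int]


def cells(x0: int, y0: int, x1: int, y1: int) -> Set[Cell]:
    """Unit grid cells covered by the half-open rectangle [x0, x1) x [y0, y1)."""
    return {(x, y) for x in range(x0, x1) for y in range(y0, y1)}


def screen(target: Set[Cell], candidates: List[Dict]) -> List[str]:
    """Greedy quality-first selection for one target subspace region."""
    remaining = set(target)  # part of the target not yet covered
    ranked = sorted(candidates, key=lambda c: c["score"], reverse=True)
    result: List[str] = []
    for cand in ranked:
        if not remaining:  # target fully covered, stop early
            break
        if cand["footprint"] & remaining:  # footprint intersects uncovered area
            result.append(cand["id"])
            remaining -= cand["footprint"]  # subtract the covered part
    return result
```

Because the subspace regions are independent, `screen` could be run for each of them in parallel, as the first step of the screening method suggests.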

The invention also provides a cross-cloud multi-source data cube discovery and integration device based on the distributed memory, which comprises:

the distributed memory access layer construction module is used for constructing a cross-cloud distributed memory access layer, and the distributed memory access layer carries out data mounting on massive multi-source remote sensing data of different data source sites and provides a globally unified virtual memory access view;

the target remote sensing data set discovery module is used for discovering a target remote sensing data set from different data source sites using a data discovery strategy of large-station priority and space-time matching;

and the remote sensing data cube integration module is used for dynamically creating a cross-cloud remote sensing data cube using a virtual data cube storage model, wherein the remote sensing data cube is used for indexing and storing the discovered target remote sensing data set so as to realize virtual data integration of the remote sensing data cube.

The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the cross-cloud multi-source data cube discovery and integration method based on the distributed memory when executing the program.

The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a distributed memory-based cross-cloud multisource data cube discovery and integration method as described in any of the above.

According to the cross-cloud multi-source data cube discovery and integration method based on distributed memory provided by the invention, a cross-cloud distributed memory access layer is constructed to mount the massive multi-source remote sensing data of different data source sites and provide a globally unified virtual memory access view; a data discovery strategy of large-station priority and space-time matching is used to discover a target remote sensing data set from the different data source sites; and a virtual data cube storage model is used to dynamically create a cross-cloud remote sensing data cube, which indexes and stores the discovered target remote sensing data set and provides data support for subsequent applications. For data source sites spread across multiple cloud platforms, cross-cloud access to the target remote sensing data set is achieved at near-local, memory-level access speed, and the indexing and storage of the target remote sensing data set by the remote sensing data cube effectively addresses the integration, large-scale retrieval, and efficient access of massive cross-cloud multi-source remote sensing data. Explicit downloading of massive multi-source remote sensing data is thereby avoided, and cross-cloud in-memory data access and computing support is provided for larger-scale remote sensing data processing applications, which reduces the I/O throughput pressure caused by data-intensive computing and lets users concentrate on the analysis, computation, and application of remote sensing big data. The method helps release the full information potential of remote sensing data and is of great practical significance for large-scale scientific research such as global environmental change and global sustainable development.

Drawings

In order to more clearly illustrate the invention or the technical solutions of the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic flow chart of a method for finding and integrating cross-cloud multi-source data cubes based on a distributed memory;

FIG. 2 is an architecture diagram of constructing a cross-cloud distributed memory access layer in accordance with a first embodiment of the present invention;

FIG. 3 is an application diagram of the globally unified virtual memory access view of the distributed memory access layer in accordance with a first embodiment of the present invention;

FIG. 4 is a second flow chart of a method for finding and integrating cross-cloud multi-source data cubes based on distributed memory according to the present invention;

FIG. 5 is a flow chart of a data discovery strategy employing large station priority and space-time matching in accordance with a first embodiment of the present invention;

FIG. 6 is a third flow chart of a method for finding and integrating cross-cloud multi-source data cubes based on distributed memory according to the present invention;

FIG. 7 is a schematic flow chart of a method for finding and integrating cross-cloud multi-source data cubes based on distributed memory according to the present invention;

FIG. 8 is a schematic diagram of cross-cloud virtual data integration of a target remote sensing dataset using a virtual data cube storage model in accordance with a first embodiment of the present invention;

FIG. 9 is a logic implementation diagram of a method for finding and integrating cross-cloud multi-source data cubes based on a distributed memory;

FIG. 10 is a schematic flow chart of a method for finding and integrating cross-cloud multi-source data cubes based on distributed memory according to the present invention;

FIG. 11 is a schematic diagram of a cross-cloud multi-source data cube discovery and integration device based on a distributed memory according to the present invention;

FIG. 12 is a schematic structural diagram of an electronic device provided by the present invention.

Reference numerals:

21: a distributed memory access layer construction module; 22: a target remote sensing data set discovery module; 23: a remote sensing data cube integration module.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Example 1

Referring to fig. 1, the present embodiment provides a method for discovering and integrating cross-cloud multi-source data cubes based on distributed memory, including:

Step S1: constructing a cross-cloud distributed memory access layer, wherein the distributed memory access layer mounts the massive multi-source remote sensing data of different data source sites and provides a globally unified virtual memory access view;

In this step, the distributed memory access layer is constructed by deploying a distributed cluster system based on a distributed memory file system, and the distributed cluster system can perform single-point access to a plurality of data source sites simultaneously through the globally unified virtual memory access view.

Specifically, the cross-cloud distributed memory access layer is built as shown in FIG. 2, in which the distributed cluster system, deployed in high-availability mode, uses the Alluxio platform. Alluxio can serve as a distributed shared cache service, so that computing applications communicating with Alluxio can transparently cache frequently accessed data (especially data from remote locations) and obtain memory-level I/O throughput. In addition, Alluxio's tiered storage mechanism can make full use of memory, solid-state disks, and magnetic disks, reducing the cost of elastically scalable data-driven applications.

The data source sites of the different cloud platforms include Amazon cloud (AWS), Alibaba Cloud (OSS), Tencent Cloud (COS), and the like, as well as a local distributed file system (HDFS) and local storage (LOCAL). Cloud storage and object storage use semantics different from those of a traditional file system, and the performance implications of those semantics differ as well. Common file system operations (such as listing directories and renaming) on cloud storage and object storage systems typically incur significant performance overhead. Alluxio is therefore deployed together with the plurality of data source sites, and the massive remote sensing data stored at those sites is virtually mounted across clouds. With this deployment, reads are served from Alluxio rather than from the underlying cloud storage or object storage, which alleviates these problems and simplifies access to the different data source sites.

Regardless of the physical location of these data source sites, the globally unified virtual memory access view provided by Alluxio enables simultaneous single-point access to the remote sensing data storage systems of multiple data source sites. Besides connecting different types of data sources, Alluxio also allows users to connect to different versions of the same storage system at the same time, such as multiple versions of HDFS, without complex system configuration and management, which simplifies data management. Meanwhile, as shown in FIG. 3, the globally unified virtual memory access view and its operation interface support transparent caching of frequently accessed data (especially data from remote locations), providing memory-level I/O throughput for data discovery in the subsequent step S2 and for data cube integration and access in the subsequent steps S3 and S4.
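The idea of a single virtual namespace over several storage systems can be illustrated with a small mount table. This is a conceptual sketch only and is not Alluxio's API: the `MountTable` class, its method names, and the example bucket and host names are hypothetical.

```python
from typing import Dict


class MountTable:
    """Toy unified namespace: maps virtual path prefixes to underlying storage URIs."""

    def __init__(self) -> None:
        self._mounts: Dict[str, str] = {}  # virtual prefix -> underlying URI prefix

    def mount(self, virtual_prefix: str, source_uri: str) -> None:
        """Register an underlying store under a virtual directory."""
        self._mounts[virtual_prefix.rstrip("/")] = source_uri.rstrip("/")

    def resolve(self, virtual_path: str) -> str:
        """Translate a virtual path to the underlying URI; the longest prefix wins."""
        for prefix in sorted(self._mounts, key=len, reverse=True):
            if virtual_path == prefix or virtual_path.startswith(prefix + "/"):
                return self._mounts[prefix] + virtual_path[len(prefix):]
        raise KeyError(f"no mount covers {virtual_path}")

    def to_virtual(self, source_path: str) -> str:
        """Translate an underlying URI back to its virtual access path."""
        by_uri_length = sorted(self._mounts.items(), key=lambda kv: len(kv[1]), reverse=True)
        for prefix, uri in by_uri_length:
            if source_path == uri or source_path.startswith(uri + "/"):
                return prefix + source_path[len(uri):]
        raise KeyError(f"no mount covers {source_path}")
```

With such a table, a client addresses `/aws/...` or `/hdfs/...` without knowing which cloud holds the data; the reverse mapping corresponds to turning a discovered directory path into a virtual access path.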

Step S2: discovering a target remote sensing data set from the different data source sites using a data discovery strategy of large-station priority and space-time matching;

in this embodiment, referring to fig. 4, step S2 specifically includes:

Step S21: constructing a global data discovery task based on a cache, and determining the type of the global data discovery task;

Step S22: obtaining, based on a distributed crawler, the virtual mount directories of all data source sites from the distributed memory access layer, wherein the virtual mount directories form a tree-structured directory;

Step S23: traversing all data source sites in the virtual mount directory based on the global data discovery task, calculating a site priority value for each data source site, constructing a new site data discovery task for each data source site, and adding the new site data discovery tasks to a directory task scheduling queue;

Step S24: creating a plurality of data search threads, the data search threads selecting the site data discovery task with the largest site priority value from the directory task scheduling queue, recursively traversing the directory tree of the selected data source site to calculate the directory priority values of all subdirectories, and generating corresponding directory data discovery tasks that are added to the directory task scheduling queue;

Step S25: the data search threads selecting the directory data discovery task with the largest directory priority value from the directory task scheduling queue and recursively searching for data until scene data is discovered, so as to realize the large-station-priority data discovery strategy;

Step S26: obtaining the time and spatial range of the current scene data, and comparing them with the target parameters of the data discovery task to obtain a time matching degree and a space matching degree;

Step S27: calculating the space-time matching degree of the scene data from the time matching degree and the space matching degree, so as to realize the space-time-matching data discovery strategy;

Step S28: creating scene data analysis tasks for the remaining scene data in the directory containing the scene data, and adding the scene data analysis tasks to a scene data analysis task scheduling queue;

Step S29: the data search threads selecting the scene data analysis task with the largest space-time matching degree from the scene data analysis task scheduling queue, and storing the directory path of the scene data that meets the data discovery requirement, so as to complete discovery of the target remote sensing data set.
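The priority-driven traversal of steps S23 to S25 can be sketched with a single max-priority queue. The sketch is single-threaded and keeps the queue in process memory, whereas the method uses multiple search threads; the nested-dictionary tree and the `rank`, `dirs`, and `scenes` keys are hypothetical, and the rank values are supplied as inputs rather than computed.

```python
import heapq
from typing import Dict, List, Tuple


def discover(sites: Dict[str, dict]) -> List[str]:
    """Return directories that hold scene data, in the order they are reached."""
    queue: List[Tuple[float, str, dict]] = []
    for name, node in sites.items():
        # heapq is a min-heap, so priorities are negated to pop the largest first.
        # Paths are unique, so tuple comparison never falls through to the dict.
        heapq.heappush(queue, (-node["rank"], "/" + name, node))
    found: List[str] = []
    while queue:
        _, path, node = heapq.heappop(queue)
        if node.get("scenes"):  # scene data discovered in this directory
            found.append(path)
        for child, sub in node.get("dirs", {}).items():
            heapq.heappush(queue, (-sub["rank"], path + "/" + child, sub))
    return found
```

Site tasks and directory tasks share one queue here, mirroring the directory task scheduling queue of the method; a high-priority directory under one site can therefore be visited before a lower-priority site is opened at all.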

Specifically, based on the distributed crawler, the virtual mount directory and the directory metadata description file of each data source site are obtained from the distributed memory access layer. The directory metadata description file follows the SpatioTemporal Asset Catalog (STAC) standard, a general standard for describing geospatial information that supports data cube access and makes data easier to process, index, and discover. The virtual mount directory is a tree-structured directory list: the first layer is the root node, the second layer contains the site node of each data source, and the third and deeper layers are the hierarchical directory nodes under a given site. The virtual mount directory is traversed and priority values are calculated at every level, including the site priority value of each data source site and the directory priority value of each level of directory under the site. In a specific application, a priority value rankA represents the access priority of a data source site, rankB represents the access priority of a directory under a data source site, and a match value represents the degree of space-time match between scene data and the target requirement.

Based on the large-station-priority data discovery strategy, the data source sites of "large station" cloud platforms are selected from all cloud platforms registered in the current system for preferential data discovery; that is, data source sites with a larger data set scale, higher data access frequency, and larger access volume are searched first, so that data can be located quickly and data discovery efficiency is improved. The specific operation is as follows:

polling all data source sites (including Amazon cloud, Alibaba Cloud, Tencent Cloud, the distributed file system, local storage, and the like) of the cloud platforms registered in the current distributed memory access layer;
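Reading the time and spatial extent of one scene from a STAC-style metadata file can be sketched as follows. The sketch assumes a STAC Item with a two-dimensional `bbox` and a non-null `properties.datetime`; items that use `start_datetime` and `end_datetime` or a three-dimensional bounding box would need extra handling. The function name `scene_extent` is hypothetical.

```python
import json
from datetime import datetime
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (west, south, east, north)


def scene_extent(item_json: str) -> Tuple[datetime, BBox]:
    """Extract acquisition time and bounding box from a STAC Item document."""
    item = json.loads(item_json)
    # A trailing "Z" is rewritten so that older Python versions can parse it.
    stamp = item["properties"]["datetime"].replace("Z", "+00:00")
    west, south, east, north = item["bbox"]  # assumes a 2D bounding box
    return datetime.fromisoformat(stamp), (west, south, east, north)
```

The returned pair is what the time and space matching degrees would be computed from.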

respectively calculating a site priority value rankA of each data source site according to the data set scale, the data access frequency and the access quantity;

constructing new site data discovery tasks for each data source site, and adding the new site data discovery tasks into a catalog task scheduling queue;

For directory tasks at the third layer and deeper in the data discovery process, a directory priority value rankB is calculated for each directory to represent its access priority. The rankB value is calculated from the structure information of the directory tree and the initial information of the data set at mount time; directories with a larger rankB value are accessed first, helping the user find the target remote sensing data faster. The calculation of the rankB value is shown in formula (1):

（1）

Wherein, level is the level of the data directory, site_priority is the access priority of the data source site, type_priority is the access priority of the data type, and num_priority is the number priority of subdirectories.

Overall, the rankB value is positively correlated with the four indices level, site_priority, type_priority, and num_priority. Of these, level, site_priority, and type_priority have a larger influence on rankB, while num_priority has a smaller influence. The four indices together determine the rankB value; the larger the rankB value, the earlier the directory is accessed, so that the target data can be found earlier from a reliable and stable data site.

A site or directory is then searched and analyzed recursively: all subdirectories of the directory (a virtual directory in the distributed memory access layer) corresponding to the site data discovery task are traversed, the directory priority value rankB of each subdirectory is calculated, and a new directory data discovery task is generated for each subdirectory and added to the directory task scheduling queue (stored in the distributed in-memory database Redis).

The directory tree is traversed until scene data appears in a directory; one scene is then selected at random from that directory, its space-time matching degree (match value) is calculated and used in place of the match values of the other scene data, and new scene data analysis tasks are generated and added to the scene data task scheduling queue (stored in the distributed in-memory database Redis). Specifically, when analyzing the scene data in leaf nodes, one scene is selected at random from the parent directory of each leaf node, and its directory metadata description file is parsed to obtain its specific time, spatial extent, and other information. This information is compared with the target parameters of the data discovery task (target time, target spatial range, and the like) to calculate a match value that represents how well that scene matches the target remote sensing data. The same match value is taken to represent the matching degree of the other scene data in the same directory and serves as the priority of the subsequent scene data analysis tasks. The calculation of the match value is shown in formula (2):

（2）

Wherein, the liquid crystal display device comprises a liquid crystal display device,time_matchis the matching degree of the time of the foreground data and the target time,space_matchthe matching degree of the space range of the foreground data and the target space range is the association weight of the two indexes.
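The sampling shortcut, in which one randomly chosen scene lends its match value to every scene in the same directory, can be sketched as below. The match computation itself is passed in as a function because formula (2) is not reproduced here; `enqueue_scenes`, its parameters, and the in-process heap standing in for the Redis-backed queue are hypothetical.

```python
import heapq
import random
from typing import Callable, List, Tuple


def enqueue_scenes(directory: str, scenes: List[str],
                   match_of: Callable[[str], float],
                   queue: List[Tuple[float, str]],
                   rng: random.Random) -> float:
    """Score one random scene and queue every scene in the directory with that score."""
    sample = rng.choice(scenes)  # only this scene's metadata is parsed
    match = match_of(sample)
    for scene in scenes:
        # Negated priority: heapq pops the smallest item, we want the best match first.
        heapq.heappush(queue, (-match, f"{directory}/{scene}"))
    return match
```

The design trades accuracy for speed: scenes in one directory usually share sensor, date range, or tile, so one parsed metadata file is taken as representative of the rest.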

When calculating time_match, the main measure is the time distance between the time of the actual scene data and the target discovery time. The calculation distinguishes whether the data discovery target is a time point or a time period. When the data discovery target is a time period, time_match is calculated as shown in formula (3):

（3）

where T_s is the start time point of the data discovery target period, T_e is the end time point of the data discovery target period, and T_o is the data time point obtained by parsing the directory metadata description file of a scene.

When the data discovery target is a time point, this time point is treated as a minimal time period (data discovery time is measured in days, so the period is 1 day), and the determination is as shown in formulas (4) and (5):

（4）

（5）

where T_p is the time point of the data discovery target, T_s is the start time point of the data discovery target period, T_e is the end time point of the data discovery target period, and T_o is the data time point obtained by parsing the directory metadata description file of a scene.

Substituting formulas (4) and (5) into the calculation formula of time_match gives the time matching degree time_match for a time point, as shown in formula (6):

（6）

The footprint of scene data is an irregular polygon, and to simplify query analysis, the spatial range of each scene is usually represented by its minimum bounding rectangle (MBR). Thus, when calculating the spatial matching degree space_match, the key is to compare the MBR of each scene with the rectangular query frame of the data discovery target and to measure the spatial distance between the two rectangles. The spatial matching degree space_match is calculated as shown in formula (7):

（7）

where rectangle B represents the MBR of the scene data, rectangle S represents the rectangle of the data discovery target, B_x and B_y are the lengths of rectangle B in the x and y directions respectively, S_x and S_y are the lengths of rectangle S in the x and y directions respectively, and O_x and O_y are the distances between the center point of rectangle B and the center point of rectangle S in the x and y directions respectively.

When the two rectangles intersect, the spatial range of the scene data intersects the target spatial range and meets the spatial requirement of data discovery, so space_match is 1. When the two rectangles are disjoint, the spatial distance ratio of the two rectangles in the x and y directions is calculated from the center point and side lengths of each rectangle, and the minimum of the distance ratios in the x and y directions is taken as the overall spatial distance ratio of the two rectangles. The smaller the spatial distance ratio, the smaller the spatial distance and the greater the spatial matching degree, that is, the closer space_match is to 1.
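As a non-limiting illustration, the intersection case described above can be checked directly from the quantities defined for formula (7). The following is a minimal sketch, not the implementation of the embodiment: the function name `mbrs_intersect` is hypothetical, the rectangles are assumed to be axis-aligned, and rectangles that merely touch are assumed to count as intersecting. It covers only the case in which space_match is 1; the distance ratio used for disjoint rectangles is not reproduced here.

```python
def mbrs_intersect(b_x, b_y, s_x, s_y, o_x, o_y):
    """Return True when rectangle B (scene MBR) and rectangle S (target) intersect.

    b_x, b_y: side lengths of B; s_x, s_y: side lengths of S;
    o_x, o_y: distances between the two center points along x and y.
    """
    # Two axis-aligned rectangles overlap on an axis exactly when the center
    # distance on that axis does not exceed half the sum of the side lengths.
    overlap_x = abs(o_x) <= (b_x + s_x) / 2
    overlap_y = abs(o_y) <= (b_y + s_y) / 2
    return overlap_x and overlap_y
```

The test needs no corner coordinates, which matches the center-point and side-length parameters named in the description.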

w1 and w2 are the association weights of the two indexes and satisfy w1 + w2 = 1, where w1 is the weight of time_match and w2 is the weight of space_match. When the remote sensing dataset is organized by time, w1 > w2; when the remote sensing dataset is organized by space, w2 > w1.

Overall, the two indexes time_match and space_match are used to measure the space-time matching degree (match value) between scene data and the target remote sensing data. The larger the match value, the earlier the scene is analyzed and judged in the scene data analysis tasks, so that a target remote sensing dataset meeting the requirements can be found earlier among millions of multi-source remote sensing data items, which improves data discovery efficiency.
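As a non-limiting illustration, the combination of the two indexes can be sketched as follows. The sketch assumes that the match value is the weighted sum implied by the weight definitions (w1 + w2 = 1); the function name `match_value`, the default weight and the sample values are hypothetical, and time_match and space_match are taken as already computed inputs.

```python
def match_value(time_match, space_match, w1=0.5):
    """Combine the two indexes into one match value using weights w1 + w2 = 1."""
    if not 0.0 <= w1 <= 1.0:
        raise ValueError("w1 must lie in [0, 1]")
    w2 = 1.0 - w1  # enforce w1 + w2 = 1
    return w1 * time_match + w2 * space_match


# Hypothetical (time_match, space_match) pairs for three sampled scenes.
samples = {"scene_a": (1.0, 0.4), "scene_b": (0.5, 1.0), "scene_c": (0.2, 0.3)}

# Dataset assumed to be organized by time, so time_match gets the larger weight.
ranked = sorted(samples, key=lambda k: match_value(*samples[k], w1=0.7), reverse=True)
```

With w1 = 0.7 the scene whose time matches exactly is ranked first even though its spatial match is weaker, which is the intended effect of choosing w1 > w2 for time-organized datasets.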

A scene data analysis task with a larger match value is selected from the scene data task scheduling queue for further metadata analysis to judge whether the target parameter requirements of the global data discovery task are met, and the directory path of remote sensing data that meets the data discovery requirements is stored in the distributed memory database (Redis). The directory priority value rankB and the space-time matching degree (match value) calculated for the current scene are then used to update the directory priority value of the parent directory containing the scene and the space-time matching degree of the remaining scenes, so that task priorities are adjusted dynamically and the target remote sensing data are located and discovered earlier. The process is shown in fig. 5.
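As a non-limiting illustration, the scheduling behaviour described above (one sampled scene lends its match value to the other scenes in its directory, and the task with the largest match value is analyzed first) can be sketched with an in-process priority queue. This is a sketch under stated assumptions: Python's `heapq` stands in for the Redis-backed scheduling queue of the embodiment, and the function names and directory paths are hypothetical.

```python
import heapq
import itertools

_counter = itertools.count()  # tie-breaker so equal priorities pop in insertion order


def push_task(queue, match, scene_path):
    # heapq is a min-heap, so the match value is negated to pop the largest first.
    heapq.heappush(queue, (-match, next(_counter), scene_path))


def pop_task(queue):
    neg_match, _, scene_path = heapq.heappop(queue)
    return -neg_match, scene_path


def enqueue_directory(queue, scene_paths, sampled_match):
    """Give every scene in a directory the match value of the sampled scene."""
    for path in scene_paths:
        push_task(queue, sampled_match, path)


queue = []
enqueue_directory(queue, ["/siteA/2020/s1", "/siteA/2020/s2"], sampled_match=0.35)
enqueue_directory(queue, ["/siteB/2021/s3"], sampled_match=0.90)

# Tasks come out in descending order of match value.
order = [pop_task(queue)[1] for _ in range(len(queue))]
```

Re-prioritization after a scene has been analyzed would amount to pushing the remaining scenes of that directory again with the updated match value.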

Further, referring to fig. 6, step S21 specifically includes:

step S211: acquiring target parameters of a global data discovery task, wherein the target parameters comprise space-time areas;

step S212: searching whether a global data discovery task with space-time region intersections is cached in a distributed memory database based on the target parameters;

if the cache exists, the step S213 is performed: creating an incremental global data discovery task according to the existing global data discovery task, and only carrying out data discovery on the updated data in the time-space region;

if not, go to step S214: creating a new global data discovery task, and performing data discovery on all data in the space-time region.

Specifically, the distributed memory database is Redis. Redis is searched for a data discovery task whose space-time region intersects the target to determine whether such a task has been executed before. If not, a new data discovery task is generated to discover all data meeting the requirements; if so, an incremental data discovery task is generated, and data discovery is performed only on data in the space-time region that has been updated since the earlier task. The global data discovery task is the overall data discovery task and is independent of the site data discovery tasks. Compared with directly executing a new global data discovery task, an incremental global data discovery task searches only the updated directories for data that was not discovered before, which shortens data discovery time and improves data discovery efficiency.
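As a non-limiting illustration, the decision between steps S213 and S214 can be sketched as follows. The sketch assumes that a space-time region is a bounding box plus an inclusive range of days and that the cached tasks are available as a plain list; in the embodiment they are held in Redis. The names `Region` and `create_global_task` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Space-time region: bounding box plus an inclusive day range."""
    min_x: float
    min_y: float
    max_x: float
    max_y: float
    t_start: int
    t_end: int

    def intersects(self, other):
        # Regions intersect only if they overlap in x, in y and in time.
        return (self.min_x <= other.max_x and other.min_x <= self.max_x
                and self.min_y <= other.max_y and other.min_y <= self.max_y
                and self.t_start <= other.t_end and other.t_start <= self.t_end)


def create_global_task(target, cached_regions):
    """Return ("incremental", matches) if a cached task overlaps, else ("new", [])."""
    matches = [r for r in cached_regions if r.intersects(target)]
    if matches:
        return "incremental", matches
    return "new", []


cache = [Region(0, 0, 10, 10, 100, 200)]
kind_a, _ = create_global_task(Region(5, 5, 15, 15, 150, 250), cache)
kind_b, _ = create_global_task(Region(20, 20, 30, 30, 150, 250), cache)
```

Here `kind_a` is an incremental task because the target overlaps the cached region in both space and time, while `kind_b` is a new task because the spatial extents are disjoint.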

Step S3: dynamically creating a cloud-crossing remote sensing data cube by adopting a virtual data cube storage model, wherein the remote sensing data cube is used for indexing and storing a found target remote sensing data set so as to realize virtual data integration of the remote sensing data cube;

in this embodiment, referring to fig. 7, step S3 specifically includes:

step S31: virtual path mapping is carried out on the directory paths where the scene data meeting the data discovery requirement are located, and the directory paths are mapped into virtual access paths of the distributed cluster system;

step S32: dynamically creating a cross-cloud remote sensing data cube based on the virtual data cube storage model, and performing cross-cloud access on a target remote sensing data set by the remote sensing data cube according to a virtual access path;

step S33: the remote sensing data cube creates a data cube index to the target remote sensing dataset and stores the data cube index to the database.

Specifically, virtual path mapping is performed on the directory path where the scene data meeting the data discovery requirement is located, mapping it to a virtual access path of Alluxio so that transparent cross-cloud data access to the target remote sensing dataset is supported. A virtual data cube storage model is constructed to uniformly represent target remote sensing datasets with the same or overlapping space-time ranges in the form of a multi-dimensional data cube model. Specifically, the multi-dimensional array structure of the xarray component is used as the core of the in-memory data storage structure, and the virtual data cube storage model is built by extending and optimizing the Open Data Cube. A cross-cloud remote sensing data cube is constructed according to the user's target requirements and supports virtual data integration: the data in the target remote sensing dataset is not actually fetched into memory or local storage but remains in remote cloud storage, while still being integrated virtually through the virtual data cube storage model. That is, for target remote sensing data stored on a cloud platform, scene data can be prefetched and cached through the distributed memory access layer to provide memory-level data access; for data stored in a local distributed file system (HDFS) or on local disk (LOCAL), actual integration of the physical data is performed directly. A data cube index for the target remote sensing dataset is created and stored in PostgreSQL, the Open Data Cube database, as shown in fig. 8.
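As a non-limiting illustration, the virtual path mapping of step S31 can be sketched as a prefix lookup in a mount table. Everything named here is an assumption made for illustration: the mount table entries, the bucket and host names, and the function `to_virtual_path` are hypothetical, and the sketch does not call the Alluxio client.

```python
# Hypothetical mount table: source prefix -> mount point in the virtual namespace.
MOUNTS = {
    "s3://landsat-archive/": "/mnt/aws_landsat/",
    "hdfs://namenode:8020/rs/": "/mnt/local_hdfs/",
}

# Hypothetical master address; 19998 is Alluxio's default master RPC port.
ALLUXIO_ROOT = "alluxio://alluxio-master:19998"


def to_virtual_path(source_path):
    """Map a discovered directory path to its virtual access path."""
    # Try the longest prefix first so nested mounts resolve to the most specific one.
    for prefix in sorted(MOUNTS, key=len, reverse=True):
        if source_path.startswith(prefix):
            return ALLUXIO_ROOT + MOUNTS[prefix] + source_path[len(prefix):]
    raise KeyError(f"no mount covers {source_path}")


virtual = to_virtual_path("s3://landsat-archive/c2/L2/2021/044/034/")
```

A data cube index entry would then record the virtual path rather than the cloud-specific path, so that later reads go through the distributed memory access layer.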

In summary, as shown in fig. 9, the distributed memory-based cross-cloud multi-source data cube discovery and integration method constructs a cross-cloud distributed memory access layer that mounts massive multi-source remote sensing data from different data source sites and provides a globally unified virtual memory access view; discovers the target remote sensing dataset from the different data source sites using a data discovery strategy of large-station priority and space-time matching; and dynamically creates a cross-cloud remote sensing data cube using a virtual data cube storage model, the remote sensing data cube being used to index and store the discovered target remote sensing dataset and to provide data support for subsequent applications. For the data source sites of a plurality of cloud platforms, cross-cloud data access to the target remote sensing dataset is realized at near-local memory-level access speed, and the indexing and storage of the target remote sensing dataset by the remote sensing data cube effectively solves the problems of integration, large-scale retrieval and efficient access for massive cross-cloud multi-source target remote sensing data. Explicit downloading of massive multi-source remote sensing data is thereby avoided, and cross-cloud in-memory data access and computing support is provided for larger-scale remote sensing data processing applications, which reduces the I/O throughput pressure caused by data-intensive computing and lets users concentrate on the analysis, computation and application of remote sensing big data. The method promotes fuller use of the information potential of remote sensing data and is of practical significance for large-scale scientific research such as global environmental change and global sustainable development.

In this embodiment, after step S3, the method further includes performing data screening and data optimization on the target remote sensing dataset by using a quality-priority data screening method, and referring to fig. 10, the method specifically includes:

step S41: partitioning the target query region to obtain a plurality of target subspace regions with equal areas, and carrying out data screening on the plurality of target subspace regions in parallel;

step S42: respectively carrying out cross-cloud remote sensing data cube retrieval on each target subspace region to obtain a data list meeting target requirements;

step S43: generating corresponding polygons according to the boundary of the space range of each remote sensing data cube in the data list, and sorting all data in the data list according to a preset scoring function to form a candidate list;

step S44: when the candidate list is not empty and the target subspace region is not fully covered, selecting the scene data with the highest score from the candidate list each time and judging whether its polygon intersects the target subspace region; if so, performing step S45: adding the scene data to the result set and subtracting the area covered by the scene data from the target subspace region; if not, selecting the scene data with the next highest score from the candidate list and returning to step S44 to continue the judgment, so as to realize data screening and data optimization of the target remote sensing dataset.

Specifically, the scoring function is determined by data quality (cloud coverage): the better the data quality, the higher the score, i.e., score = -cloud_cover. Users can customize the scoring function according to actual needs as the basis for sorting and screening the data list. In this way, the quality-priority data screening method screens out high-quality remote sensing data covering the target space-time range and performs data screening and optimization on the discovered scene dataset, thereby realizing retrieval and optimization of the cross-cloud remote sensing data cube.
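As a non-limiting illustration, steps S43 to S45 can be sketched as a greedy selection. To keep the sketch self-contained, a target subspace region and each scene footprint are represented as sets of grid cells instead of polygons, so set intersection and set difference stand in for the polygon operations; this simplification, the function name `screen_scenes` and the sample scenes are assumptions made for illustration.

```python
def screen_scenes(target_cells, scenes, score=lambda s: -s["cloud_cover"]):
    """Greedy quality-first selection.

    target_cells: set of grid cells making up one target subspace region.
    scenes: list of dicts with "id", "cloud_cover" and "cells" (covered cells).
    """
    remaining = set(target_cells)
    candidates = sorted(scenes, key=score, reverse=True)  # best score first
    result = []
    for scene in candidates:
        if not remaining:  # region fully covered, stop early
            break
        if scene["cells"] & remaining:  # scene still contributes coverage
            result.append(scene["id"])
            remaining -= scene["cells"]  # subtract the covered part
    return result, remaining


target = {(x, y) for x in range(4) for y in range(2)}
scenes = [
    {"id": "A", "cloud_cover": 0.10, "cells": {(0, 0), (1, 0), (0, 1), (1, 1)}},
    {"id": "B", "cloud_cover": 0.60, "cells": {(1, 0), (2, 0), (1, 1), (2, 1)}},
    {"id": "C", "cloud_cover": 0.30, "cells": {(2, 0), (3, 0), (2, 1), (3, 1)}},
]
selected, uncovered = screen_scenes(target, scenes)
```

In this example the two least cloudy scenes A and C already cover the whole region, so the cloudier scene B is never selected. Passing a different `score` callable corresponds to the user-defined scoring function mentioned above.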

Example two

Referring to fig. 11, the present embodiment provides a cross-cloud multi-source data cube discovery and integration device based on a distributed memory, including:

the distributed memory access layer construction module 21 is configured to construct a cross-cloud distributed memory access layer, where the distributed memory access layer performs data mounting on massive multi-source remote sensing data of different data source sites and provides a globally unified virtual memory access view;

a target remote sensing dataset discovery module 22 for locating a target remote sensing dataset from different data source sites using a data discovery strategy based on large-site prioritization and space-time matching;

The remote sensing data cube integrating module 23 is configured to dynamically create a cloud-crossing remote sensing data cube based on the virtual data cube storage model, where the remote sensing data cube is configured to index and store the found target remote sensing data set, so as to implement virtual data integration of the remote sensing data cube.

Further, the target remote sensing dataset discovery module 22 specifically includes: a global data discovery task creation unit for constructing a global data discovery task based on the cache and determining the type of the global data discovery task; an acquisition unit for acquiring the virtual mount directories of all data source sites from the distributed memory access layer based on a distributed crawler, the virtual mount directories being tree-structured directories; a site priority value calculation unit for traversing all data source sites in the virtual mount directories based on the global data discovery task and calculating the site priority value of each data source site; a site data discovery task creation unit for creating a new site data discovery task for each data source site and adding it to the directory task scheduling queue; a data search thread creation unit for creating a plurality of data search threads; a first data search thread calling unit for calling the plurality of data search threads to select the site data discovery task with the largest site priority value from the directory task scheduling queue; a directory priority value calculation unit for recursively traversing the directory tree of the selected data source site to calculate the directory priority values of all subdirectories; a directory data discovery task creation unit for generating the corresponding directory data discovery tasks and adding them to the directory task scheduling queue; a second data search thread calling unit for calling the plurality of data search threads to select the directory data discovery task with the largest directory priority value from the directory task scheduling queue and recursively search for data until scene data is discovered, so as to realize the large-station-priority data discovery strategy; a scene data comparison unit for acquiring the time and spatial range of the scene data and comparing them with the target parameters of the data discovery task to obtain the time matching degree and the spatial matching degree; a space-time matching degree calculation unit for calculating the space-time matching degree of the scene data from the time matching degree and the spatial matching degree, so as to realize the space-time matching data discovery strategy; a scene data analysis task creation unit for creating scene data analysis tasks for the remaining scene data in the directory where the scene data is located and adding them to the scene data analysis task scheduling queue; and a virtual data path storage unit for storing, for the scene data analysis task with the largest space-time matching degree selected by a data search thread from the scene data analysis task scheduling queue, the directory path of the scene data that meets the data discovery requirement, so as to complete discovery of the target remote sensing dataset.

The remote sensing data cube integration module 23 specifically includes: the virtual path mapping unit is used for carrying out virtual path mapping on the directory path where the scene data meeting the data discovery requirement is located, and mapping the directory path into a virtual access path of the distributed cluster system; a virtual data cube storage model construction unit for constructing a virtual data cube storage model to uniformly represent a target remote sensing data set having the same or overlapping space-time ranges in a multi-dimensional data cube model form; the remote sensing data cube construction module is used for constructing a cloud-crossing remote sensing data cube based on the virtual data cube storage model, and the remote sensing data cube prefetches and caches the target remote sensing data set according to the virtual access path so as to provide data access of a memory level; and the index storage unit is used for creating a data cube index for the target remote sensing data set by the remote sensing data cube and storing the data cube index into the database.

The device also comprises: the data screening and optimizing module 24 is configured to perform data screening and data optimization on the target remote sensing dataset by using a quality-priority-based data screening method.

Further, the data filtering and optimizing module 24 specifically includes: the target query region blocking unit is used for blocking the target query region to obtain a plurality of target subspace regions with equal areas; the data screening unit is used for carrying out data screening on a plurality of target subspace areas in parallel; the remote sensing data cube searching unit is used for respectively carrying out cross-cloud remote sensing data cube searching on each target subspace area so as to obtain a data list meeting the target requirements; the polygon generation unit is used for generating a corresponding polygon according to the boundary of the space range of each remote sensing data cube in the data list; the sorting unit is used for sorting all data in the data list according to a preset scoring function to form a candidate list; the scene data selecting unit is used for selecting the scene data with the highest score from the candidate list each time when the candidate list is not empty and the target subspace area is not fully covered; the space region intersection judging unit is used for judging whether the polygon is intersected with the target subspace region or not; and the data optimization unit is used for adding the scene data into the result set when the polygon is intersected with the target subspace area and subtracting the area coverage part of the scene data from the target space area.

The implementation of the functions and actions of each module in the above device is detailed in the implementation of the corresponding steps of the above method, so for the relevant parts reference may be made to the description of the method embodiment, which is not repeated here. The device embodiment described above is merely illustrative. For example, the division into modules is only a division by logical function, and other divisions are possible in actual implementation: multiple modules or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be made through certain interfaces, and the indirect coupling or communication connection between devices or modules may be electrical, mechanical or of another form.

Example III

As shown in fig. 12, the present embodiment provides an electronic apparatus including: processor 310, communication interface (Communications Interface) 320, memory 330 and communication bus 340, wherein processor 310, communication interface 320, memory 330 accomplish communication with each other through communication bus 340. Processor 310 may invoke logic instructions in memory 330, processor 310 performing the distributed memory-based cross-cloud multi-source data cube discovery and integration method described in the method embodiments above, the method comprising:

and dynamically creating a cloud-crossing remote sensing data cube by adopting a virtual data cube storage model, wherein the remote sensing data cube is used for indexing and storing the found target remote sensing data set so as to realize virtual data integration of the remote sensing data cube.

Further, the logic instructions in the memory 330 described above may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art or in part, may be embodied in the form of a software product stored in a storage medium and comprising several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.

Example IV

The present embodiment provides a non-transitory computer readable storage medium, on which a computer program is stored, which when executed by a processor, implements the method for discovery and integration of cross-cloud multi-source data cubes based on distributed memory described in the foregoing method embodiment, where the method includes:

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. The method for finding and integrating the cross-cloud multi-source data cubes based on the distributed memory is characterized by comprising the following steps of:

dynamically creating a cross-cloud remote sensing data cube by adopting a virtual data cube storage model, wherein the remote sensing data cube is used for indexing and storing the found target remote sensing data set and specifically comprises the following steps:

virtual path mapping is carried out on the directory paths where the scene data meeting the data discovery requirement are located, and the directory paths are mapped into virtual access paths of the distributed cluster system;

creating a data cube index for the target remote sensing data set, and storing the data cube index into a database to realize virtual data integration of the remote sensing data cubes.

2. The method for discovering and integrating cross-cloud multi-source data cubes based on distributed memory according to claim 1, wherein the distributed memory access layer is constructed by deploying a distributed cluster system based on a distributed memory file system, and the distributed cluster system can simultaneously perform single-point access to a plurality of data source sites through the globally unified virtual memory access view.

3. The method for discovery and integration of cross-cloud multi-source data cubes based on distributed memory according to claim 2, wherein the data discovery strategy employing large-station prioritization and space-time matching is to discover target remote sensing data sets from different data source stations, comprising:

4. The method for finding and integrating a cross-cloud multi-source data cube based on a distributed memory according to claim 3, wherein the constructing a global data finding task based on a cache, determining a type of the global data finding task, comprises:

5. The method for finding and integrating a cross-cloud multi-source data cube based on a distributed memory according to claim 1, wherein the creating a cross-cloud remote sensing data cube based on a virtual data cube storage model dynamically, the remote sensing data cube performing cross-cloud access on the target remote sensing data set according to the virtual access path comprises:

6. The method for finding and integrating a cross-cloud multi-source data cube based on a distributed memory according to claim 1, wherein after dynamically creating a cross-cloud remote sensing data cube by using a virtual data cube storage model, the remote sensing data cube is used for indexing and storing the found target remote sensing data set so as to realize virtual data integration of the remote sensing data cube, the method further comprises adopting a quality priority data screening method to perform data screening and data optimization on the target remote sensing data set, and specifically comprises the following steps:

when the candidate list is not empty and the target subspace area is not fully covered, selecting the scene data with the highest score from the candidate list each time and judging whether the polygon intersects the target subspace area; if the polygon intersects the target subspace area, adding the scene data to a result set and subtracting the area covered by the scene data from the target subspace area; if the polygon does not intersect the target subspace area, selecting the scene data with the next highest score from the candidate list and continuing the judgment, so as to realize data screening and data optimization of the target remote sensing data set.

7. A distributed memory-based cross-cloud multi-source data cube discovery and integration apparatus, comprising:

the remote sensing data cube integration module is used for dynamically creating a cloud-crossing remote sensing data cube by adopting a virtual data cube storage model, and the remote sensing data cube is used for indexing and storing the found target remote sensing data set and is particularly used for:

8. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the distributed memory-based cross-cloud multi-source data cube discovery and integration method of any of claims 1-6 when the program is executed by the processor.

9. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the distributed memory-based cross-cloud multisource data cube discovery and integration method of any of claims 1-6.