CN103761291A - Geographical raster data parallel reading-writing method based on request aggregation - Google Patents

Geographical raster data parallel reading-writing method based on request aggregation

Info

Publication number
CN103761291A
Authority
CN
China
Prior art keywords
geographical
raster data
data
file
geographical raster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410020074.1A
Other languages
Chinese (zh)
Inventor
熊伟
陈荦
景宁
刘露
吴秋云
赫高进
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN201410020074.1A
Publication of CN103761291A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/29 Geographical information databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline, look ahead

Abstract

The invention provides a geographical raster data parallel reading-writing method based on request aggregation. In the technical scheme, all processes call GDAL (Geospatial Data Abstraction Library) to read the geographical raster data file to be processed and obtain the geographical raster metadata from it; each process calculates the partition size and offset of the geographical raster data it needs to read from the file according to a unified data partitioning scheme; one arbitrary process is responsible for creating a GTIFF output file and, once creation completes, broadcasts the creation-complete status to the other processes; the other processes then read the geographical raster data to be processed; each process completes its own calculation task and writes the results to the output file according to the unified data partitioning scheme. The method can process data in various formats, has a good parallel processing mechanism, and improves overall input/output efficiency.

Description

A geographical raster data parallel reading-writing method based on request aggregation
Technical field
The present invention relates to a parallel read-write method for geographical raster data files in a multi-node, multiprocessor cluster environment. Its field of application is the parallel processing of large-scale geographical raster data in geographic information systems.
Background
Geographical raster data is an important data type in geographic information systems and spatial information applications. It is mainly used to describe and express various kinds of sampled and statistical information about the earth's surface, and it is widely used in remote sensing image processing, digital terrain analysis, spatial statistics, and similar fields. Geographical raster data is organized by rows and columns of grid cells: an array of equally sized, uniformly distributed, tightly adjacent pixels (grid cells) represents the distribution of spatial features or phenomena. The size of the grid cell determines the precision of the geographic data within the area it covers; the finer the grid cell, the finer the geographic data it represents.
With the rapid progress of remote sensing and of surveying and mapping technology, both the spatial and the temporal resolution of geographical raster data have increased significantly. The regions that spatial information applications must compute over keep growing, and the demands on the complexity and computational accuracy of geographic process models increase steadily, so geocomputation is becoming more and more clearly both data-intensive and compute-intensive. Achieving high-performance processing has become a key factor limiting its further application. Using multiprocessor cluster computing environments and parallel computing to process geographical raster data efficiently has become an inevitable trend. Improving processor performance and adding processors can raise the parallel processing performance of a cluster, but if the input/output (I/O) of geographical raster data is still performed serially, I/O performance becomes the bottleneck for overall performance. Against this background, parallel access to geographical raster data has become an important part of processing such data efficiently.
At present, there are mainly two kinds of libraries that support parallel reading and writing of geographical raster data. The first is GDAL (Geospatial Data Abstraction Library). GDAL provides a unified data access interface and, through an abstract data model, supports an extensible set of geographical raster data formats. Because parallel reading and writing of geographical raster data requires the data to be partitioned, existing parallel processing algorithms usually partition by row, by column, or by block. The main problem with GDAL is that it supports parallel reading and writing only when the data is partitioned by row. When multiple processes use GDAL to read and write column or block partitions in parallel, read-write efficiency is very low, and the correctness of the written data cannot be guaranteed. The second kind comprises data model libraries suited to parallel reading and writing, such as HDF5 (Hierarchical Data Format version 5) and NetCDF (Network Common Data Form). However, relatively few applications store geographical raster data in these data models, so other commonly used data formats must be converted before processing, which makes applications more cumbersome.
There are two approaches to parallel reading and writing of geographical raster data. The first is the DDC (Data Distribution and Collection) method. It divides the processes taking part in the parallel computation into a master process and slave processes. Only the master process performs geographical raster data read and write operations; the slave processes perform the data processing, and the data being processed is sent and received between the slave processes and the master process through inter-process message passing. The drawback of the DDC method is that reading and writing in the master process easily becomes a bottleneck, and as the number of parallel processes grows, the cost of master-slave communication tends to increase computation latency. The second is the parallel read-write approach, which does not rely on a master process to distribute and collect data; instead, each process accesses the data relatively independently. Because all processes access data at the same time, the aggregate I/O bandwidth can increase substantially, which improves overall I/O efficiency. However, this approach needs the support of an underlying parallel file system; on a non-parallel file system, if the read and write requests are highly randomly distributed, I/O efficiency drops significantly.
Summary of the invention
The object of the invention is to improve the performance of parallel reading and writing of geographical raster data. By introducing the file view mechanism of MPI (Message Passing Interface), the invention reduces the number of discrete, fragmented data requests issued when multiple processes access geographical raster data concurrently, aggregating them into a small number of large, contiguous I/O requests. In the invention, the processes exchange only status information and no data, which improves the parallel read-write performance of geographical raster data files in a multi-node, multiprocessor cluster environment.
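As a rough illustration of what aggregating requests means, the following Python sketch merges byte ranges that touch or overlap into fewer, larger ranges. It is not the MPI file view mechanism itself; the function name `aggregate_requests` and the (offset, length) representation of a request are assumptions made only for this example.

```python
from typing import List, Tuple

# Hypothetical request representation: (byte offset, length in bytes).
Request = Tuple[int, int]


def aggregate_requests(requests: List[Request]) -> List[Request]:
    """Merge adjacent or overlapping byte ranges into larger contiguous ones."""
    merged: List[Request] = []
    for offset, length in sorted(requests):
        if merged and offset <= merged[-1][0] + merged[-1][1]:
            # The request starts inside or directly after the previous range,
            # so extend that range instead of issuing a separate request.
            prev_offset, prev_length = merged[-1]
            end = max(prev_offset + prev_length, offset + length)
            merged[-1] = (prev_offset, end - prev_offset)
        else:
            merged.append((offset, length))
    return merged


# Eight small 4-byte requests that together cover two contiguous 16-byte runs.
scattered = [(0, 4), (8, 4), (4, 4), (12, 4), (64, 4), (72, 4), (68, 4), (76, 4)]
print(aggregate_requests(scattered))
```

The point of the sketch is only the reduction in request count: many small accesses become a few large ones, which is the effect the text attributes to file views.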
The technical solution of the invention is a geographical raster data parallel reading-writing method based on request aggregation, in which several processes process the same geographical raster data file simultaneously, characterized in that it comprises the following steps:
The first step: in a multi-node, multiprocessor cluster environment, all processes call the GDAL library to read the geographical raster data file to be processed, obtain the geographical raster metadata from it, and record the metadata in the in-memory data structure PDataset. The metadata comprises: the MPI file handle, the number of raster columns, the number of raster rows, the number of raster bands, the grid cell data type, the number of bytes per data type, and the absolute offset address of the raster data within the geographical raster data file.
The second step: according to the metadata, each process calculates the partition size and offset of the geographical raster data it needs to read from the file, using a unified data partitioning scheme. The data may be partitioned by row, by column, or by block.
The third step: one arbitrary process reads the georeferencing information from the metadata of the file to be processed, creates an output file in GTIFF (Georeferenced Tagged Image File Format) format, and writes the georeferencing information and the metadata held in PDataset into the output file. After creation completes, this process broadcasts the creation-complete status to the other processes, and the other processes read their share of the geographical raster data from the file according to the unified data partitioning scheme.
The fourth step: each process completes its own calculation task, then opens the output file, sets its own file view, and writes its results to the output file according to the unified data partitioning scheme.
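The metadata record and the partition calculation described in the steps above can be sketched as follows. This is a minimal sketch under stated assumptions: `RasterMeta` is a hypothetical stand-in for PDataset (the MPI file handle is left out), `split` and `partition` are names invented for the example, ranges are half-open, and the source does not say how remainders are distributed, so the sketch simply gives the first partitions one extra row or column.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class RasterMeta:
    """Hypothetical stand-in for the metadata the text stores in PDataset."""
    x_size: int        # number of raster columns
    y_size: int        # number of raster rows
    bands: int         # number of bands
    elem_bytes: int    # bytes per grid cell value
    displacement: int  # absolute byte offset of the raster data in the file


def split(total: int, parts: int, index: int) -> Tuple[int, int]:
    """Half-open range [first, last) of part `index` when `total` is cut into `parts`."""
    base, extra = divmod(total, parts)
    first = index * base + min(index, extra)
    return first, first + base + (1 if index < extra else 0)


def partition(meta: RasterMeta, nprocs: int, rank: int, mode: str,
              blocks_per_row: int = 1) -> Tuple[int, int, int, int]:
    """Return (first_row, last_row, first_col, last_col) for one process."""
    if mode == "row":
        r0, r1 = split(meta.y_size, nprocs, rank)
        return r0, r1, 0, meta.x_size
    if mode == "column":
        c0, c1 = split(meta.x_size, nprocs, rank)
        return 0, meta.y_size, c0, c1
    if mode == "block":
        if nprocs % blocks_per_row != 0:
            raise ValueError("nprocs must be a multiple of blocks_per_row")
        block_rows = nprocs // blocks_per_row
        r0, r1 = split(meta.y_size, block_rows, rank // blocks_per_row)
        c0, c1 = split(meta.x_size, blocks_per_row, rank % blocks_per_row)
        return r0, r1, c0, c1
    raise ValueError(f"unknown mode: {mode}")


meta = RasterMeta(x_size=10, y_size=8, bands=1, elem_bytes=4, displacement=0)
for rank in range(4):
    print(rank, partition(meta, 4, rank, "block", blocks_per_row=2))
```

Because every process evaluates the same function with its own rank, no process has to send partition boundaries to another, which matches the statement that only status information is exchanged.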
The beneficial effects of the invention are:
(1) The invention can process multiple data formats. Because the GDAL library itself can read geographical raster data in many formats, the format of the geographical raster data read by the processes is not restricted.
(2) Geographical raster data can be read and written by row, by column, or by block; the partitioning scheme is not restricted.
(3) The parallel processing mechanism of the invention is good. The processes need to wait only once, while the output file is being created, and creating the output file involves only writing the output file header, so the waiting time is negligible.
(4) After finishing its calculation task, each process uses a file view, so random I/O requests can be aggregated, improving overall I/O efficiency.
Brief description of the drawings
Fig. 1 is a flow diagram of the invention;
Fig. 2 is a schematic diagram of the file view created in one embodiment of the invention;
Fig. 3 is a schematic diagram of a simulation experiment comparing the invention with another method.
Detailed description
The invention is further described below with reference to the accompanying drawings.
Fig. 1 is a flow diagram of the invention. As shown in the figure, suppose that n processes (P0, P1, P2, ..., Pn) process the same geographical raster data file simultaneously. All processes call the GDAL library to read the geographical raster data file to be processed, obtain the geographical raster metadata from it, and record the metadata in the in-memory data structure PDataset. According to the metadata, each process calculates the partition size and offset of the geographical raster data it needs to read from the file, using a unified data partitioning scheme. One arbitrary process reads the georeferencing information from the metadata of the file to be processed, creates the GTIFF output file, and writes the georeferencing information and the metadata held in PDataset into the output file. After creation completes, this process broadcasts the creation-complete status to the other processes, and the other processes read their share of the geographical raster data from the file according to the unified data partitioning scheme. Each process completes its own calculation task, then opens the output file, sets its own file view, and writes its results to the output file according to the unified data partitioning scheme.
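The control flow of Fig. 1 can be imitated on a single machine with the standard library alone. In the sketch below, threads stand in for MPI processes, a `threading.Event` stands in for the broadcast of the creation-complete status, a four-byte placeholder stands in for the GTIFF header, and the input raster is an in-memory byte string rather than a file read through GDAL. All of these substitutions, and every name in the code, are assumptions made for illustration.

```python
import os
import tempfile
import threading

ROWS, COLS, NPROCS = 8, 4, 4   # tiny single-band raster, one byte per cell
HEADER = b"HDR0"               # placeholder header, not a real GeoTIFF header


def worker(rank, src, path, created):
    # Row partitioning: each worker owns a contiguous band of rows.
    r0, r1 = rank * ROWS // NPROCS, (rank + 1) * ROWS // NPROCS
    if rank == 0:
        # One worker creates the output file and writes only the header,
        # then announces that creation is complete.
        with open(path, "wb") as f:
            f.write(HEADER)
            f.truncate(len(HEADER) + ROWS * COLS)
        created.set()
    else:
        # The only synchronisation point: wait for the status, not for data.
        created.wait()
    chunk = src[r0 * COLS:r1 * COLS]          # "read" this worker's share
    result = bytes(255 - v for v in chunk)    # toy computation: invert values
    with open(path, "r+b") as f:              # each worker writes its own region
        f.seek(len(HEADER) + r0 * COLS)
        f.write(result)


def run():
    src = bytes(range(ROWS * COLS))
    path = os.path.join(tempfile.mkdtemp(), "out.bin")
    created = threading.Event()
    threads = [threading.Thread(target=worker, args=(r, src, path, created))
               for r in range(NPROCS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    with open(path, "rb") as f:
        return src, f.read()


src, out = run()
print(out[:len(HEADER)], len(out))
```

No raster data passes between workers: each one derives its region from its rank and writes directly to the shared output file, and the creator only signals that the file exists.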
Fig. 2 is a schematic diagram of the file view created in one embodiment of the invention. In this embodiment, the unified data partitioning scheme is block partitioning. As shown in the figure, in the fourth step of the invention each process sets its own file view, which defines the positions in the output file that the process may operate on. A file view is defined by three elements: the absolute offset address (Displacement), the elementary type (ElementType), and the file type (FileType). Suppose that, for the n processes P0, P1, P2, ..., Pn, the in-memory data structure PDataset records the following: the number of grid cell rows of the raster to be processed is RasterYSize, the number of grid cell columns is RasterXSize, the grid cell data type serves as the elementary type ElementType, and the absolute offset address of the raster data within the file serves as Displacement.
Under the unified block partitioning scheme, the geographical raster data is divided into n blocks, and the partition size and offset of the data each process needs to read from the file are calculated, yielding the following parameters: the first row BlockFirstRow and last row BlockLastRow of the required block, and its first column BlockFirstColumn and last column BlockLastColumn. For each process, the number of rows in the processed block is BlockYSize = BlockLastRow - BlockFirstRow, and the number of columns is BlockXSize = BlockLastColumn - BlockFirstColumn. Supposing there are m blocks in each row of blocks, the data block handled by each process has size BlockXSize*BlockYSize, and the file type consists of BlockXSize elements of the elementary type followed by RasterXSize - BlockXSize holes. Once the file type is set, each process can set its file view from the above parameters.
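To make the file type concrete, the sketch below lists the byte ranges that such a block-shaped view would expose. It assumes a single-band raster stored uncompressed in row-major order, which the source does not state explicitly, and it does not call MPI; in MPI-IO a view of this shape would normally be installed with MPI_File_set_view. The function name `view_extents` and its parameter names are hypothetical.

```python
from typing import List, Tuple


def view_extents(displacement: int, elem_bytes: int, raster_x_size: int,
                 first_row: int, last_row: int,
                 first_col: int, last_col: int) -> List[Tuple[int, int]]:
    """Byte ranges (offset, length) visible through a block-shaped file view.

    Rows and columns are half-open ranges, so BlockYSize = last_row - first_row
    and BlockXSize = last_col - first_col, as in the text.
    """
    block_x = last_col - first_col
    block_y = last_row - first_row
    # One repetition of the file type: BlockXSize elements plus the holes.
    row_stride = raster_x_size * elem_bytes
    start = displacement + (first_row * raster_x_size + first_col) * elem_bytes
    return [(start + i * row_stride, block_x * elem_bytes) for i in range(block_y)]


# A 10-column raster of 4-byte cells whose data starts at byte 100;
# the block covers rows 4 to 7 and columns 5 to 9.
extents = view_extents(100, 4, 10, 4, 8, 5, 10)
hole_bytes = (10 - 5) * 4   # (RasterXSize - BlockXSize) cells skipped per row
print(extents, hole_bytes)
```

Each range is one row of the block, and the gap between consecutive ranges is exactly the hole described in the text, so a single write through the view covers the whole block instead of one request per row.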
Fig. 3 is a schematic diagram of a simulation experiment comparing the invention with another method. As shown in the figure, the curve with rectangular markers represents writing the parallel calculation results to the data file without a file view, that is, with non-aggregated requests. Its I/O performance (shown on the vertical axis) is significantly lower than that of the request aggregation approach (the result of the invention, shown by the curve with diamond markers). As the number of processes increases, the I/O performance of the non-aggregated parallel read-write approach decreases, whereas in the embodiment provided by the invention, I/O performance remains stable when the number of processes is below 32.
Parts of the invention not described in detail belong to common knowledge in the art.
The above is only a preferred embodiment of the invention. It should be noted that those of ordinary skill in the art may make improvements and modifications without departing from the principle of the invention, and such improvements and modifications shall also be regarded as falling within the scope of protection of the invention.

Claims (3)

1. A geographical raster data parallel reading-writing method based on request aggregation, in which several processes process the same geographical raster data file simultaneously, characterized in that it comprises the following steps:
The first step: in a multi-node, multiprocessor cluster environment, all processes call the Geospatial Data Abstraction Library to read the geographical raster data file to be processed, obtain the geographical raster metadata from it, and record the metadata in the in-memory data structure PDataset;
The second step: according to the metadata, each process calculates the partition size and offset of the geographical raster data it needs to read from the file, using a unified data partitioning scheme;
The third step: one arbitrary process reads the georeferencing information from the metadata of the file to be processed, creates an output file in Georeferenced Tagged Image File Format, and writes the georeferencing information and the metadata held in PDataset into the output file; after creation completes, this process broadcasts the creation-complete status to the other processes, and the other processes read their share of the geographical raster data from the file according to the unified data partitioning scheme;
The fourth step: each process completes its own calculation task, then opens the output file, sets its own file view, and writes its results to the output file according to the unified data partitioning scheme.
2. The geographical raster data parallel reading-writing method based on request aggregation according to claim 1, characterized in that the geographical raster metadata obtained comprises: the message passing interface file handle, the number of raster columns, the number of raster rows, the number of raster bands, the grid cell data type, the number of bytes per data type, and the absolute offset address of the raster data within the geographical raster data file.
3. The geographical raster data parallel reading-writing method based on request aggregation according to claim 2, characterized in that the unified data partitioning scheme partitions the data by row, by column, or by block.
CN201410020074.1A 2014-01-16 2014-01-16 Geographical raster data parallel reading-writing method based on request aggregation Pending CN103761291A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410020074.1A CN103761291A (en) 2014-01-16 2014-01-16 Geographical raster data parallel reading-writing method based on request aggregation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410020074.1A CN103761291A (en) 2014-01-16 2014-01-16 Geographical raster data parallel reading-writing method based on request aggregation

Publications (1)

Publication Number Publication Date
CN103761291A true CN103761291A (en) 2014-04-30

Family

ID=50528528

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410020074.1A Pending CN103761291A (en) 2014-01-16 2014-01-16 Geographical raster data parallel reading-writing method based on request aggregation

Country Status (1)

Country Link
CN (1) CN103761291A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104268237A (en) * 2014-09-28 2015-01-07 南京国图信息产业股份有限公司 Electronic map making batch parallel generation system and generation method thereof
CN104636491A (en) * 2015-02-28 2015-05-20 南京国图信息产业股份有限公司 Batch generating system and batch generating method for making electronic maps
CN105677488A (en) * 2016-01-12 2016-06-15 中国人民解放军国防科学技术大学 Method for constructing raster image pyramid in hybrid parallel mode
US10409814B2 (en) 2017-01-26 2019-09-10 International Business Machines Corporation Network common data form data management
CN113568736A (en) * 2021-06-24 2021-10-29 阿里巴巴新加坡控股有限公司 Data processing method and device
CN116662266A (en) * 2023-08-02 2023-08-29 中国科学院大气物理研究所 NetCDF data-oriented parallel reading and writing method and system
WO2024012153A1 (en) * 2022-07-14 2024-01-18 华为技术有限公司 Data processing method and apparatus

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8214371B1 (en) * 2003-07-18 2012-07-03 Teradata Us, Inc. Spatial indexing
CN102542035A (en) * 2011-12-20 2012-07-04 南京大学 Polygonal rasterisation parallel conversion method based on scanning line method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
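周建鑫 (Zhou Jianxin): "Research and Implementation of Parallel I/O for Geographical Raster Data", 《地理信息世界》 (Geomatics World) *
欧阳柳 (Ouyang Liu): "Research on Parallel Access Methods for Geographical Raster Data", 《计算机科学》 (Computer Science) *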

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104268237A (en) * 2014-09-28 2015-01-07 南京国图信息产业股份有限公司 Electronic map making batch parallel generation system and generation method thereof
CN104268237B (en) * 2014-09-28 2017-11-03 南京国图信息产业有限公司 The batch parallel generation system and its generation method of electronic cartography
CN104636491A (en) * 2015-02-28 2015-05-20 南京国图信息产业股份有限公司 Batch generating system and batch generating method for making electronic maps
CN105677488A (en) * 2016-01-12 2016-06-15 中国人民解放军国防科学技术大学 Method for constructing raster image pyramid in hybrid parallel mode
CN105677488B (en) * 2016-01-12 2019-05-17 中国人民解放军国防科学技术大学 A kind of hybrid parallel mode Raster Images pyramid construction method
US10409814B2 (en) 2017-01-26 2019-09-10 International Business Machines Corporation Network common data form data management
US10558665B2 (en) 2017-01-26 2020-02-11 International Business Machines Corporation Network common data form data management
CN113568736A (en) * 2021-06-24 2021-10-29 阿里巴巴新加坡控股有限公司 Data processing method and device
WO2024012153A1 (en) * 2022-07-14 2024-01-18 华为技术有限公司 Data processing method and apparatus
CN116662266A (en) * 2023-08-02 2023-08-29 中国科学院大气物理研究所 NetCDF data-oriented parallel reading and writing method and system
CN116662266B (en) * 2023-08-02 2023-10-03 中国科学院大气物理研究所 NetCDF data-oriented parallel reading and writing method and system

Similar Documents

Publication Publication Date Title
CN103761291A (en) Geographical raster data parallel reading-writing method based on request aggregation
US11405051B2 (en) Enhancing processing performance of artificial intelligence/machine hardware by data sharing and distribution as well as reuse of data in neuron buffer/line buffer
CN103336758B (en) The sparse matrix storage means of a kind of employing with the sparse row of compression of local information and the SpMV implementation method based on the method
Wang et al. A parallel file system with application-aware data layout policies for massive remote sensing image processing in digital earth
WO2020252799A1 (en) Parallel data access method and system for massive remote-sensing images
CN103761215B (en) Matrix transpose optimization method based on graphic process unit
CN103810125A (en) Active memory device gather, scatter, and filter
CN104036537A (en) Multiresolution Consistent Rasterization
CN108388527B (en) Direct memory access engine and method thereof
Yang et al. EdgeDB: An efficient time-series database for edge computing
CN104537125B (en) A kind of remote sensing image pyramid parallel constructing method based on message passing interface
Jain et al. Input/output in parallel and distributed computer systems
US20170357462A1 (en) Method and apparatus for improving performance of sequential logging in a storage device
US20210334234A1 (en) Distributed graphics processor unit architecture
He et al. A MPI-based parallel pyramid building algorithm for large-scale remote sensing images
CN104516822A (en) Memory access method and device
Puri et al. MPI-Vector-IO: Parallel I/O and partitioning for geospatial vector data
CN104679670A (en) Shared data caching structure and management method for FFT (fast Fourier transform) and FIR (finite impulse response) algorithms
US11030714B2 (en) Wide key hash table for a graphics processing unit
Lan et al. A lightweight time series main-memory database for IoT real-time services
US20140310507A1 (en) Methods of and apparatus for multidimensional indexing in microprocessor systems
Palmer et al. Efficient data IO for a parallel global cloud resolving model
No et al. High-performance scientific data management system
Lustosa et al. SAVIME: A multidimensional system for the analysis and visualization of simulation data
CN111788552A (en) System and method for low latency hardware memory

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20140430

WD01 Invention patent application deemed withdrawn after publication