CN116662266A - NetCDF data-oriented parallel reading and writing method and system - Google Patents

NetCDF data-oriented parallel reading and writing method and system

Info

Publication number
CN116662266A
CN116662266A (Application No. CN202310961228.6A; granted as CN116662266B)
Authority
CN
China
Prior art keywords
data
read
netcdf
dimension
name
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310961228.6A
Other languages
Chinese (zh)
Other versions
CN116662266B (en
Inventor
董理
王斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Atmospheric Physics of CAS
Original Assignee
Institute of Atmospheric Physics of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Atmospheric Physics of CAS filed Critical Institute of Atmospheric Physics of CAS
Priority to CN202310961228.6A priority Critical patent/CN116662266B/en
Publication of CN116662266A publication Critical patent/CN116662266A/en
Application granted granted Critical
Publication of CN116662266B publication Critical patent/CN116662266B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/11File system administration, e.g. details of archiving or snapshots
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/13File access structures, e.g. distributed indices
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/448Execution paradigms, e.g. implementations of programming paradigms
    • G06F9/4482Procedural
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5022Mechanisms to release resources
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a parallel read/write method and system for NetCDF data, relating to the field of computer data reading and writing. The user does not need to declare a large number of handle variables during use: all variables related to reading and writing are allocated and managed at the bottom layer, and the user only needs to mark the program objects with unique character strings. The user also does not need to write any program related to parallel reading and writing: all process-grouping parallel read/write functionality is implemented at the bottom layer, and the user only needs to pass in the externally used MPI communication domain and set the array index range when calling the read/write interface. At the same time, the invention supports directly reading and writing data of one to multiple dimensions, so the user does not need to flatten multidimensional data to one dimension before passing it to the read/write interface, which reduces the user's workload and makes reading and writing NetCDF data easier.

Description

NetCDF data-oriented parallel reading and writing method and system
Technical Field
The invention relates to the field of computer data reading and writing, and in particular to a parallel reading and writing method and system for NetCDF data.
Background
The meteorological and oceanographic fields produce large amounts of gridded data, such as global atmospheric reanalysis data, climate simulation data, and numerical weather prediction data. These scientific data are very important for studying climate change, weather forecasting, and other related applications. As spatial resolution keeps improving, and in particular as weather and climate models enter the 10 km resolution range, the size of a single data file rises sharply, so reading and writing the data efficiently and conveniently matters greatly for the productivity of research and operational personnel. The commonly used gridded-data formats include: 1) the GRIB format, of which GRIB2 is currently the mainstream; it is a dedicated data format with a high compression rate that is mainly used in numerical weather prediction models, and it standardizes and restricts model data to ensure portability, but it has weaker support for other application scenarios; 2) the NetCDF format (Network Common Data Form), developed by the Unidata program of the University Corporation for Atmospheric Research (UCAR) in the United States; the currently dominant variant is NetCDF-4/HDF5, which implements its bottom-layer functions, including parallel reading and writing, on top of HDF5 (Hierarchical Data Format version 5). Compared with HDF5 itself, NetCDF is simpler and better matches the usage habits of practitioners, so it has gradually become the most widely used data format in the meteorological and oceanographic fields; for example, the models participating in the Coupled Model Intercomparison Project (CMIP) all output data in the standardized NetCDF format.
NetCDF is a self-describing data format: data content such as dimension and variable information can be queried very conveniently through official or third-party tools. Reading the data into a user program, however, requires the user to write the input code, and writing data out to disk likewise requires a corresponding program. The general programming interface of NetCDF is fairly basic and low-level: the user must declare a large number of variables in the program, and the parallel read/write functions provided by the NetCDF-4 format require the user to implement the corresponding program logic, such as data blocking, caching, and array-index calculation. When the data volume is small and the usage scenario is simple, programs can be written directly through the C, C++, Fortran, and other language interfaces of NetCDF. But when the data volume becomes too large, for example hundreds of GB, or the parallel job reaches more than ten thousand cores, a simple input or output program written by the user will run into problems such as slow serial reading, difficulty in implementing parallel reading and writing, and large memory consumption caused by repeated unblocked parallel reads. On the other hand, the input data generally needs operations such as interpolation, and handling data boundaries requires a large amount of logic, which is error-prone and makes the program difficult to read and maintain.
Disclosure of Invention
Aiming at the above deficiencies in the prior art, the parallel reading and writing method and system for NetCDF data provided by the invention solve the problem that data in the NetCDF-4 format is difficult to read and write.
To achieve the above object, the invention adopts the following technical scheme:
the parallel read-write method for NetCDF data comprises the following steps:
S1, select the current operation, which is either reading in data or writing out data; if it is reading in data, go to step S2; if it is writing out data, go to step S8;
S2, open a NetCDF data file, and use the character string parameter name passed in through the user side as the unique index of the NetCDF data set in the NetCDF data file; access the dimension named name in the NetCDF data file, and obtain the size of the name dimension through optional parameters;
S3, set the attribute of the dimension named name in the NetCDF data file;
S4, enable the parallel read-in function through the passed-in MPI communication domain, and set the number of groups into which the parallel processes are divided through the optional parameter ngroup; call the NetCDF open-file program interface NF90_OPEN through the master process in the current process group, and set the file reading mode;
S5, judge whether the data indices on the longitude-latitude grid need to be selected through a coordinate range; if so, go to step S7; otherwise, go to step S6;
S6, perform the read operation by creating a cache on a process, and release the memory occupied by the data file in the bottom-layer program after the read operation;
S7, based on the attribute of the dimension named name, calculate the data covered by the coordinate range passed in by the user; if the covered data crosses the 0-degree longitude line, segment the covered data according to the latitude extent, read the different segments in the same way as step S6, splice the segment data that was read in, and release the memory occupied by the data file in the bottom-layer program; if the covered data does not cross the 0-degree longitude line, perform the read operation in the same way as step S6 and release the memory occupied by the data file in the bottom-layer program;
S8, create a NetCDF data file, and use the character string parameter name passed in through the user side as the unique index of the NetCDF data set in the NetCDF data file; add a dimension to the created NetCDF data file through the add_dim method, set the name and size of the dimension, set whether it is subdivided in parallel, and set whether a coordinate variable named name is added;
S9, add a variable to the created NetCDF data file through the add_var method, and set the variable's name, long name, unit, dimension names, variable type, and default value;
S10, call the NetCDF file-creation program interface NF90_CREATE, set the corresponding parameters for parallel write-out, and add the previously added dimensions and variables to the NetCDF data file;
S11, enable the parallel write-out function through the passed-in MPI communication domain, complete the data write-out, and release the memory occupied by the data file in the bottom-layer program.
The parallel read-write system for the NetCDF data comprises an operation selection module, an index establishment module, a dimension acquisition module, an attribute setting module, a read-in data preparation module, a first data read-in module, a second data read-in module, a basic data setting module, a variable adding module, a write-out data preparation module, a data write-out module and a memory release module; wherein:
the operation selection module is used for selecting read-in data or write-out data;
the index establishing module is used for opening a NetCDF data file, and taking a character string parameter name transmitted through a user side as a unique index of a NetCDF data set in the NetCDF data file;
the dimension acquisition module is used for accessing a dimension named as a name in the NetCDF data file and acquiring the size of the name dimension through selectable parameters;
the attribute setting module is used for setting an attribute named as a name dimension in the NetCDF data file;
The read-in data preparation module is used to enable the parallel read-in function through the passed-in MPI communication domain and to set the number of groups into which the parallel processes are divided through the optional parameter ngroup; it calls the NetCDF open-file program interface NF90_OPEN through the master process in the current process group and sets the file reading mode;
the first data read-in module is used to specify each process group's position in the global array through the optional parameters start and count, create a cache on the master process in each process group, read, one by one into the cache, the global array index ranges that the slave processes under each master process need to read in, and asynchronously send them to the corresponding slave processes, thereby completing one data read-in operation;
the second data read-in module is used to calculate, based on the attribute of the dimension named name, the data covered by the coordinate range passed in by the user; if the covered data crosses the 0-degree longitude line, it segments the covered data according to the latitude extent, reads the different segments with the same operation as the first data read-in module, and splices the segment data that was read in; if the covered data does not cross the 0-degree longitude line, it reads the data with the same operation as the first data read-in module;
The basic data setting module is used to create a NetCDF data file and to use the character string parameter name passed in through the user side as the unique index of the NetCDF data set in the NetCDF data file; it adds a dimension to the created NetCDF data file through the add_dim method, sets the name and size of the dimension, sets whether it is subdivided in parallel, and sets whether a coordinate variable named name is added;
the variable adding module is used to add a variable to the created NetCDF data file through the add_var method and to set the variable's name, long name, unit, dimension names, variable type, and default value;
the write-out data preparation module is used to call the NetCDF file-creation program interface NF90_CREATE, set the corresponding parameters for parallel write-out, and add the previously added dimensions and variables to the NetCDF data file;
the data write-out module is used to enable the parallel write-out function through the passed-in MPI communication domain and to set the number of groups into which the parallel processes are divided through the optional parameter ngroup; it specifies each process group's position in the global array through the optional parameters start and count, creates a cache on the master process in each process group, has each master process collect the global array index ranges that its slave processes need to write out, asynchronously receives the data sent by the slave processes into the corresponding cache, and writes the data out to the NetCDF data file one by one, completing one data write-out operation;
the memory release module is used to release the memory occupied by the data file in the bottom-layer program after a single data read-in or data write-out is completed.
The beneficial effects of the invention are as follows:
1. In use, the user does not need to declare a large number of handle variables (such as data set handles, dimension handles, and variable handles); all variables related to reading and writing are allocated and managed at the bottom layer (a hash-table data structure stores the data set, dimension, and variable information), and the user only needs to mark the program objects with unique character strings. At the same time, the read/write interface supports directly reading and writing data of one to multiple dimensions, so the user does not need to flatten multidimensional data to one dimension before passing it to the read/write interface, which reduces the user's workload and makes reading and writing NetCDF data easier.
2. The user does not need to write any program related to parallel reading and writing; all process-grouping parallel read/write functionality is implemented at the bottom layer. The user only needs to pass in the externally used MPI communication domain and set the array index range when calling the read/write interface.
3. For a NetCDF data file on a longitude-latitude grid, the user only needs to set the longitude-latitude range to be read in (that is, the minimum longitude, maximum longitude, minimum latitude, and maximum latitude), and the invention automatically handles the data boundary, such as the cyclic boundary in the east-west direction. When the read-in range includes 0 degrees longitude, the two-dimensional data to be read in is divided into several parts, for example on the two sides of the 0-degree longitude line; the invention allocates the needed memory for the array pointers passed in by the user, collects the several needed data blocks, and stores them in the allocated user array, thereby realizing automatic boundary filling.
Drawings
FIG. 1 is a schematic flow chart of the method;
FIG. 2 is a flow chart of one-time array reading in the embodiment;
FIG. 3 is a flow chart of writing an array once in the embodiment;
FIG. 4 is a schematic diagram of a block of data across 0 longitude lines on a longitude and latitude grid;
FIG. 5 is a schematic diagram of a Fortran program for implementing parallel data reading in according to the present invention.
Detailed Description
The following description of the embodiments of the present invention is provided to facilitate understanding of the invention by those skilled in the art. It should be understood, however, that the invention is not limited to the scope of the embodiments; for those skilled in the art, all inventions that make use of the inventive concept remain under the protection of the invention as long as the changes fall within the spirit and scope of the invention as defined by the appended claims.
As shown in fig. 1, the NetCDF data-oriented parallel read-write method includes the following steps:
S1, select the current operation, which is either reading in data or writing out data; if it is reading in data, go to step S2; if it is writing out data, go to step S8;
S2, open a NetCDF data file, and use the character string parameter name passed in through the user side as the unique index of the NetCDF data set in the NetCDF data file; access the dimension named name in the NetCDF data file, and obtain the size of the name dimension through optional parameters;
S3, set the attribute of the dimension named name in the NetCDF data file;
S4, enable the parallel read-in function through the passed-in MPI communication domain, and set the number of groups into which the parallel processes are divided through the optional parameter ngroup; call the NetCDF open-file program interface NF90_OPEN through the master process in the current process group, and set the file reading mode;
S5, judge whether the data indices on the longitude-latitude grid need to be selected through a coordinate range; if so, go to step S7; otherwise, go to step S6;
S6, perform the read operation by creating a cache on a process, and release the memory occupied by the data file in the bottom-layer program after the read operation;
S7, based on the attribute of the dimension named name, calculate the data covered by the coordinate range passed in by the user; if the covered data crosses the 0-degree longitude line, segment the covered data according to the latitude extent, read the different segments in the same way as step S6, splice the segment data that was read in, and release the memory occupied by the data file in the bottom-layer program; if the covered data does not cross the 0-degree longitude line, perform the read operation in the same way as step S6 and release the memory occupied by the data file in the bottom-layer program;
S8, create a NetCDF data file, and use the character string parameter name passed in through the user side as the unique index of the NetCDF data set in the NetCDF data file; add a dimension to the created NetCDF data file through the add_dim method, set the name and size of the dimension, set whether it is subdivided in parallel, and set whether a coordinate variable named name is added;
S9, add a variable to the created NetCDF data file through the add_var method, and set the variable's name, long name, unit, dimension names, variable type, and default value;
S10, call the NetCDF file-creation program interface NF90_CREATE, set the corresponding parameters for parallel write-out, and add the previously added dimensions and variables to the NetCDF data file;
S11, enable the parallel write-out function through the passed-in MPI communication domain, complete the data write-out, and release the memory occupied by the data file in the bottom-layer program.
In step S2, the specific method of using the character string parameter name passed in through the user side as the unique index of the NetCDF data set in the NetCDF data file is as follows: store the parameter name and the data set program object dataset as a key-value pair in a hash table, and access the data set program object dataset through name; the data set program object dataset includes the NetCDF file handle, the logical variable is_parallel that marks whether parallel I/O is enabled, the MPI communication domain, the process number, the number of groups, the group number, and the group MPI communication domain.
The logical variable is_parallel is set according to whether the MPI library is used at compile time, whether the user passes in an MPI communication domain, and whether the NetCDF library is compiled with the parallel I/O option. The attributes of the dimension named name include the coordinate span, whether it is periodic (cycle), and whether it needs to be flipped (flip).
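The name-keyed management described above can be illustrated with a minimal Python sketch (the patent's implementation is at the Fortran bottom layer; the class and function names here are hypothetical, not the patent's identifiers):

```python
# Minimal sketch of the bottom-layer registry: each data set program object
# is stored in a hash table under the user-supplied string name, so the
# user never touches NetCDF, dimension, or variable handles directly.
# All names here are illustrative placeholders.

class DatasetObject:
    def __init__(self, file_handle, is_parallel, comm=None,
                 nproc=1, ngroup=1, group_id=0, group_comm=None):
        self.file_handle = file_handle    # NetCDF file handle
        self.is_parallel = is_parallel    # logical flag for parallel I/O
        self.comm = comm                  # MPI communication domain
        self.nproc = nproc                # process number
        self.ngroup = ngroup              # number of groups
        self.group_id = group_id          # group number of this process
        self.group_comm = group_comm      # group MPI communication domain

_registry = {}  # hash table: name -> DatasetObject (key-value pairs)

def open_dataset(name, file_handle, is_parallel=False, **kw):
    _registry[name] = DatasetObject(file_handle, is_parallel, **kw)

def get_dataset(name):
    return _registry[name]  # the unique string is the only index needed

open_dataset("era5", file_handle=42, is_parallel=True, nproc=8, ngroup=2)
ds = get_dataset("era5")
```

The point of the design is that the string key replaces every handle variable the user would otherwise have to declare and thread through their program.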
The specific method of setting the number of groups into which the parallel processes are divided through the optional parameter ngroup in step S4 is as follows: set the number of groups according to the optional parameter ngroup entered by the user, assign the processes to the different groups as evenly as possible, take the first process of each group as the master process of that group, and take the other processes of each group as slave processes.
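The grouping rule just described can be sketched as follows (a hypothetical even-split scheme, since the patent only says "as evenly as possible"; the function name is illustrative):

```python
def assign_groups(nproc, ngroup):
    """Split nproc ranks into ngroup groups as evenly as possible.
    The first process of each group is the master, the rest are slaves.
    Returns a dict: rank -> (group number, is_master)."""
    base, extra = divmod(nproc, ngroup)
    assignment, rank = {}, 0
    for g in range(ngroup):
        size = base + (1 if g < extra else 0)  # spread the remainder
        for i in range(size):
            assignment[rank] = (g, i == 0)     # first process = master
            rank += 1
    return assignment

# Example: 10 processes in 4 groups -> group sizes 3, 3, 2, 2
groups = assign_groups(10, 4)
```

With this split, ranks 0, 3, 6, and 8 become the master processes of their respective groups.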
The specific method of step S6 is as follows: specify each process group's position in the global array through the optional parameters start and count, create a cache on the master process in each process group, read, one by one into the cache, the global array index ranges that the slave processes under each master process need to read in, asynchronously send them to the corresponding slave processes to complete one read operation, and release the memory occupied by the data file in the bottom-layer program.
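The master-side buffered read of step S6 can be sketched in pure Python as follows. The even partitioning of the global index range and the function names are assumptions for illustration (the patent does not fix the exact split), and the "asynchronous send" is stood in for by collecting the slices:

```python
def local_range(global_start, global_count, nproc, rank):
    """Split the 1-based global index range [global_start,
    global_start + global_count) into contiguous per-process slices,
    i.e. each process's (start, count) pair."""
    base, extra = divmod(global_count, nproc)
    count = base + (1 if rank < extra else 0)
    start = global_start + rank * base + min(rank, extra)
    return start, count

def master_read(data, nproc):
    """Master-side sketch: read each slave's slice of the global array
    into the master's cache one by one, then send it on (here the
    asynchronous send is just collecting the result per rank)."""
    out = {}
    for rank in range(nproc):
        s, c = local_range(1, len(data), nproc, rank)
        cache = data[s - 1 : s - 1 + c]   # read the slice into the cache
        out[rank] = cache                 # stands in for the async send
    return out
```

For a 10-element global array read by a 3-process group, the slices are of sizes 4, 3, and 3, covering the array without overlap.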
The specific method of step S7 comprises the following sub-steps:
S7-1, set a mode variable; obtain the data coordinate range [x1, x2] from the attribute of the dimension named name, and obtain the user-specified coordinate range [r1, r2] from the coordinate-range parameter passed in by the user;
s7-2, assigning a value to the mode:
if x1 is less than or equal to r1 and r2 is less than or equal to x2, making mode be 1, and entering a step S7-3;
if r1 is less than or equal to x1, x1 is less than r2, and r2 is less than or equal to x2, let mode be 2, and enter step S7-4;
if x1 is less than or equal to r1, r1 is less than x2, and x2 is less than or equal to r2, setting mode to be 3, and entering a step S7-5;
if r1 is less than or equal to x1 and x2 is less than or equal to r2, making mode be 4, and entering a step S7-6;
if x1, x2, r1 and r2 are in any other order, set mode to 0 and do no further processing;
S7-3, search for the first coordinate index i greater than r1; the starting index is of the data actually to be read is i-1. Search for the first coordinate index j greater than r2; the ending index ie of the data actually to be read is j-1. If no such coordinate index is found, take the dimension size of the data actually to be read as the ending index ie, and go to step S7-7;
S7-4, judge whether the dimension of the data actually to be read is periodic. If so, the data actually to be read has two blocks, divided by the 0-degree longitude line: for the first block, search from back to front for the first index i whose coordinate is smaller than r1 plus the coordinate span, record i+1 as the starting index is(1) of the first block, and take the dimension size as the ending index ie(1) of the first block; for the second block, record the starting index is(2) as 1, search from front to back for the first index j whose coordinate is greater than r2, record the ending index ie(2) as j-1, and go to step S7-7. Otherwise, the data actually to be read has only one block: record the starting index is as 1, search for the first index j whose coordinate is greater than r2, record the ending index ie as j-1, and go to step S7-7;
S7-5, judge whether the dimension of the data actually to be read is periodic. If so, the data actually to be read has two blocks, divided by the 0-degree longitude line: for the first block, search from front to back for the first index i whose coordinate is greater than r1, record i-1 as the starting index is(1) of the first block, and take the dimension size as the ending index ie(1) of the first block; for the second block, record the starting index is(2) as 1, search from front to back for the first index j whose coordinate is greater than r2 minus the dimension span, record the ending index ie(2) as j-1, and go to step S7-7. Otherwise, the data actually to be read has only one block: search from front to back for the first index i whose coordinate is greater than r1, record i-1 as the starting index is, take the dimension size as the ending index ie, and go to step S7-7;
S7-6, judge whether the dimension of the data actually to be read is periodic. If so, the data actually to be read has three blocks: for the first block, search from back to front for the first index i whose coordinate is smaller than r1 plus the dimension span, record i as the starting index is(1) of the first block, and take the dimension size as the ending index ie(1) of the first block; for the second block, record the starting index is(2) as 1 and take the dimension size as the ending index ie(2); for the third block, record the starting index is(3) as 1, search from front to back for the first index j whose coordinate is greater than r2 minus the dimension span, record the ending index ie(3) as j, and go to step S7-7. Otherwise, the data actually to be read has only one block: record the starting index is as 1, take the dimension size as the ending index ie, and go to step S7-7;
S7-7, for data actually to be read that has more than one block, read the different blocks in the same way as step S6, splice the block data that was read in to complete one read operation, and release the memory occupied by the data file in the bottom-layer program; for data actually to be read that has only one block, perform the read operation in the same way as step S6 and release the memory occupied by the data file in the bottom-layer program.
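The case analysis of steps S7-2 through S7-7 can be sketched in Python. The sketch uses 0-based indices (the patent uses 1-based Fortran numbering), and only the periodic two-block search of step S7-4 (mode 2, the range overhanging the left edge) is shown; all function names are illustrative:

```python
def coverage_mode(x1, x2, r1, r2):
    """Step S7-2: classify how the user range [r1, r2] overlaps the
    data coordinate range [x1, x2]."""
    if x1 <= r1 and r2 <= x2:
        return 1   # user range lies entirely inside the data range
    if r1 <= x1 < r2 <= x2:
        return 2   # user range overhangs the left edge (wraps past 0)
    if x1 <= r1 < x2 <= r2:
        return 3   # user range overhangs the right edge
    if r1 <= x1 and x2 <= r2:
        return 4   # user range covers the whole data range
    return 0       # any other ordering: no further processing

def wrap_blocks_mode2(coords, span, r1, r2):
    """Step S7-4 sketch for a periodic dimension: the covered data splits
    into two blocks on either side of the 0-degree longitude line."""
    n = len(coords)
    # Block 1: scan back to front for the first coordinate < r1 + span;
    # the block starts one past it and runs to the end of the dimension.
    i = next(k for k in range(n - 1, -1, -1) if coords[k] < r1 + span)
    block1 = (i + 1, n - 1)
    # Block 2: starts at the first index; scan front to back for the
    # first coordinate > r2 and stop one before it.
    j = next(k for k in range(n) if coords[k] > r2)
    block2 = (0, j - 1)
    return block1, block2

def splice(data, *blocks):
    """Step S7-7: read each block separately, then splice into one array."""
    return [x for a, b in blocks for x in data[a:b + 1]]

# Global longitudes 0, 30, ..., 330 (span 360); user asks for [-40, 50],
# which wraps across the 0-degree line.
lons = list(range(0, 360, 30))
b1, b2 = wrap_blocks_mode2(lons, 360, -40, 50)
```

In this example the two blocks are (11, 11) and (0, 1): the spliced result covers the longitudes 330, 0, and 30, which is exactly the requested range [-40, 50] mapped onto the periodic grid.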
The specific method of step S11 is as follows: enable the parallel write-out function through the passed-in MPI communication domain, and set the number of groups into which the parallel processes are divided through the optional parameter ngroup; specify each process group's position in the global array through the optional parameters start and count, create a cache on the master process in each process group, have each master process collect the global array index ranges that its slave processes need to write out, asynchronously receive the data sent by the slave processes into the corresponding cache, write the data out to the NetCDF data file one by one to complete one data write-out operation, and release the memory occupied by the data file in the bottom-layer program.
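The write-out path of step S11 mirrors the read path: the master gathers each slave's slice into its cache and writes the slices out one by one. A self-contained Python sketch (the even split and the function names are assumptions for illustration; a Python list stands in for the NetCDF file, and the asynchronous receive is stood in by indexing the contributions):

```python
def split_even(n, nproc):
    """Contiguous even split of a global range of n elements across
    nproc processes; returns per-rank 0-based starts and counts."""
    base, extra = divmod(n, nproc)
    starts, counts, s = [], [], 0
    for r in range(nproc):
        c = base + (1 if r < extra else 0)
        starts.append(s)
        counts.append(c)
        s += c
    return starts, counts

def master_write(contributions, n, nproc):
    """Step S11 sketch: the master receives each slave's data into its
    cache and writes the slices out one by one (here 'out' stands in
    for the NetCDF data file)."""
    out = [None] * n
    starts, counts = split_even(n, nproc)
    for rank in range(nproc):
        cache = contributions[rank]                        # async receive
        out[starts[rank] : starts[rank] + counts[rank]] = cache  # write out
    return out
```

Writing the three per-rank slices of a 10-element array reassembles the full global array in order.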
The parallel read-write system for the NetCDF data comprises an operation selection module, an index establishment module, a dimension acquisition module, an attribute setting module, a read-in data preparation module, a first data read-in module, a second data read-in module, a basic data setting module, a variable adding module, a write-out data preparation module, a data write-out module and a memory release module; wherein:
The operation selection module is used for selecting read-in data or write-out data;
the index establishing module is used for opening a NetCDF data file, and taking a character string parameter name transmitted through a user side as a unique index of a NetCDF data set in the NetCDF data file;
the dimension acquisition module is used for accessing a dimension named as a name in the NetCDF data file and acquiring the size of the name dimension through selectable parameters;
the attribute setting module is used for setting an attribute named as a name dimension in the NetCDF data file;
the read-in data preparation module is used for enabling the parallel read-in function through the incoming MPI communication domain and setting the number of groups into which the parallel processes are divided through the optional parameter ngroup; calling the NetCDF file-opening interface NF90_OPEN through the master process of the current process group, and setting the file reading mode;
the first data read-in module is used for specifying the position of each process group in the global array through the optional parameters start and count, creating a buffer on the master process of each process group, and having each master process read, one by one, the global-array number ranges that its slave processes need into the buffer and asynchronously send them to the corresponding slave processes, completing one read operation of the data to be read;
The second data reading module is used for calculating data covered by a coordinate range transmitted by a user based on an attribute named as a name dimension; if the covered data cross 0 longitude line, the covered data are segmented according to the size of the latitude, the same operation as the first data reading module is adopted to read the data of different segments, and the read segmented data are spliced; if the covered data does not cross the 0 longitude line, the data is read by adopting the same operation as the first data reading module;
the basic data setting module is used for creating a NetCDF data file, and taking the character string parameter name passed in from the user side as the unique index of the NetCDF data set in the NetCDF data file; adding a dimension to the created NetCDF data file through an add_dim method, and setting the name and size of the dimension, whether it is subdivided in parallel, and whether a coordinate variable named name is added;
the variable adding module is used for adding a variable into the created NetCDF data file through an add_var method and setting the name, long name, unit, dimension name, variable type and default value of the variable;
the write-out data preparation module is used for calling the NetCDF file-creation interface NF90_CREATE, setting the corresponding parameters for parallel write-out, and adding the previously added dimensions and variables to the NetCDF data file;
the data write-out module is used for enabling the parallel write-out function through the incoming MPI communication domain and setting the number of groups into which the parallel processes are divided through the optional parameter ngroup; specifying the position of each process group in the global array through the optional parameters start and count, creating a buffer on the master process of each process group, and having each master process collect the global-array number ranges that its slave processes need to write out, asynchronously receive the data sent by the slave processes into the corresponding buffer, and write the data out to the NetCDF data file one by one, completing one data write operation;
and the memory releasing module is used for releasing the memory occupied by the data file in the bottom layer program after finishing single data reading or data writing.
In a specific implementation, the MPI communication domain may be passed in through the optional parameter mpi_comm, and the number of groups into which the parallel processes are divided may be set through the optional parameter ngroup. The invention obtains the numbers of dimensions and variables by calling nf90_inquire, nf90_inquire_dimension and nf90_inquire_variable of the NetCDF library, loops over the dimensions to obtain the size of each dimension, and loops over the variables to obtain the dimensions corresponding to each variable. At the same time, a program object is created for each dimension and each variable, named dim and var respectively: the size of the dimension is recorded in dim, and a pointer array dims to the corresponding dimension objects is recorded in var. These objects are then stored in a data set program object in the form of a hash table, and subsequent interfaces obtain the data set program object corresponding to a given name by accessing the hash table.
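A minimal Python sketch of this hash-table organization, with dicts standing in for the Fortran hash tables and all names (including the file name) hypothetical, might look as follows:

```python
class Dim:
    """Dimension program object: records the dimension size."""
    def __init__(self, name, size):
        self.name, self.size = name, size

class Var:
    """Variable program object: holds pointers to its dimension objects."""
    def __init__(self, name, dims):
        self.name, self.dims = name, dims

class Dataset:
    """Data set program object: dimension and variable hash tables."""
    def __init__(self, name):
        self.name = name
        self.dims = {}   # dimension name -> Dim
        self.vars = {}   # variable name -> Var

datasets = {}  # character-string index -> Dataset

def open_dataset(name):
    """Register (or fetch) a data set under its string index."""
    return datasets.setdefault(name, Dataset(name))

# Populate as the nf90_inquire* loops would:
ds = open_dataset("sample.nc")
ds.dims["lon"] = Dim("lon", 360)
ds.dims["lat"] = Dim("lat", 181)
ds.vars["t2m"] = Var("t2m", [ds.dims["lon"], ds.dims["lat"]])
```

Subsequent interfaces can then look up datasets[name] by string index instead of carrying an integer handle through the user program.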
The invention can access the data set program object dataset corresponding to the incoming parameter dataset_name in the data set hash table through a get_dim method, then extract the dimension program object dim named name from the dimension hash table in the dataset, and obtain the corresponding information by providing optional parameters; at present this is mainly the dimension size (parameter size).
The invention can access the data set program object dataset corresponding to the incoming parameter dataset_name in the data set hash table through a set_dim method, then extract the dimension program object dim named name from the dimension hash table in the dataset, and set the attributes of the dimension dim, including the coordinate span, periodicity, and whether the coordinates need to be inverted (flip). These attributes take effect when the number ranges are calculated for filling array boundaries: for example, a periodic longitude dimension must take the 0-longitude line into account when filling boundaries, and the latitude dimension of some data runs from the north pole (90 degrees) to the south pole (-90 degrees), in which case the dimension can be marked as needing inversion, i.e. flip is set to true.
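The effect of these dimension attributes can be sketched in Python, with dicts in place of the Fortran objects and the helper function names hypothetical:

```python
def set_dim(dataset, name, span=None, cyclic=None, flip=None):
    """Record boundary-handling attributes on the dimension named `name`."""
    dim = dataset["dims"][name]
    if span is not None:
        dim["span"] = span      # coordinate span, e.g. 360 for longitude
    if cyclic is not None:
        dim["cyclic"] = cyclic  # periodic: the 0-longitude line matters
    if flip is not None:
        dim["flip"] = flip      # coordinates stored north-to-south

def apply_flip(dataset, name, values):
    """Invert data along a dimension marked flip (e.g. latitude 90 .. -90)."""
    if dataset["dims"][name].get("flip"):
        return values[::-1]
    return values

ds = {"dims": {"lat": {"size": 3}, "lon": {"size": 4}}}
set_dim(ds, "lat", flip=True)                # latitudes run from north pole south
set_dim(ds, "lon", span=360.0, cyclic=True)  # periodic longitude
```

With flip set, a latitude axis given as 90, 0, -90 would be returned to the user south-to-north.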
The invention can prepare for reading data through a start_input method. To avoid blocking caused by a large number of processes opening the NetCDF file at the same time, the invention only allows the master process to call the NetCDF file-opening interface NF90_OPEN and set the file reading mode, such as the NF90_MPIIO parallel mode supported by the NetCDF4 data format; the other slave processes do not access the file directly but obtain the data in the file through the master process.
The invention can read data through an input method whose interface is basically consistent with nf90_get_var of NetCDF, except that the data file is specified by the character string set in step S2 instead of the integer handle variable (commonly named ncid) that must be separately declared in NetCDF. The user's variable array may be multidimensional, with the upper and lower bounds of each dimension's numbering managed by the user program; if the variable array is a parallel block array, the user can pass the optional parameters start and count to specify the position of each process's array in the global array.
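The start/count convention, in which each process's local block is placed at a 1-based offset in the global array, can be illustrated with a small Python sketch (plain lists, hypothetical names; the patent performs the equivalent copy inside the master process's buffer logic):

```python
def place_block(global_array, start, count, block):
    """Copy a 2-D process-local block into the global array.
    `start` is 1-based and `count` is the block shape, mirroring the
    NetCDF start/count convention for parallel block arrays."""
    i0, j0 = start[0] - 1, start[1] - 1
    for i in range(count[0]):
        for j in range(count[1]):
            global_array[i0 + i][j0 + j] = block[i][j]
    return global_array

g = [[0] * 4 for _ in range(4)]
place_block(g, start=(3, 1), count=(2, 2), block=[[1, 1], [1, 1]])
```

Here the block of a process responsible for rows 3-4 and columns 1-2 of a 4x4 global array is placed with start=(3, 1) and count=(2, 2).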
The invention can create a NetCDF data file through a create_dataset method; similarly to step S2, the character string parameter name passed in from the user program is used as the unique index of the data set, and subsequent interface calls all use this character string index to access the data.
According to the invention, a dimension can be added to the NetCDF data file through an add_dim method: the parameter name sets the name of the dimension, the parameter size sets its size, the parameter decomp sets whether the dimension is subdivided in parallel, and the parameter add_var sets whether a coordinate variable named name is added.
According to the invention, a variable can be added to the NetCDF data file through an add_var method: the parameter name sets the name of the variable, the parameter long_name its long name, the parameter units its units, and the parameter dim_names the dimension name corresponding to each of its dimensions; the type of the variable, such as double-precision floating point, can be set through the optional parameter dtype, and a default value can be set through the optional parameter missing_value.
The invention can write out data through an output method whose interface is basically consistent with nf90_put_var of NetCDF, except that the data file is indexed by the character string set in step S8 instead of the integer handle variable (commonly named ncid) that must be separately declared in NetCDF. The user's variable array may be multidimensional, with the upper and lower bounds of each dimension's numbering managed by the user program; if the variable array is a parallel block array, the user can pass the optional parameters start and count to specify the position of each process's array in the global array. The output method creates a buffer on the master process of each group; the master process collects the array number ranges that its slave processes need to write out, asynchronously receives the data sent by the slave processes into the corresponding buffer, and finally writes the data into the data file one by one, completing one data write operation.
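The grouped gather-and-write pattern of the output method can be simulated serially in Python; there is no MPI here, ranks, groups and buffers are plain data structures, and all names are hypothetical:

```python
def grouped_write(blocks, ngroup):
    """Simulate the output method: ranks are split into ngroup groups as
    evenly as possible, each group's first rank acts as master, gathers
    its members' blocks into a buffer, and writes them out in order.
    `blocks` maps rank -> (start, data)."""
    ranks = sorted(blocks)
    per = -(-len(ranks) // ngroup)          # ceil division: even split
    written = []
    for gstart in range(0, len(ranks), per):
        group = ranks[gstart:gstart + per]  # group[0] acts as master
        buffer = [blocks[r] for r in group] # master gathers slave blocks
        for start, data in buffer:          # master writes one by one
            written.append((start, data))
    return written
```

With four ranks and ngroup=2, ranks 0 and 2 act as masters of groups {0, 1} and {2, 3}, corresponding to the two process groups of the example in fig. 3.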
In one embodiment of the invention, fig. 2 shows the specific operation of one array read, taking four processes as an example: process 1 and process 2 are divided into process group 1, and process 3 and process 4 into process group 2, so the array is divided into four blocks and each process needs to read one of them. The master process of each process group is responsible for reading data from the data file into its own buffer and then distributing the data blocks to the corresponding processes.
In one embodiment of the invention, fig. 3 shows the specific operation of one array write, taking four processes as an example: process 1 and process 2 are divided into process group 1, and process 3 and process 4 into process group 2, so the array is divided into four blocks and each process needs to write one of them. The master process of each process group is responsible for collecting the data blocks of its slave processes into the buffer and writing them out to the NetCDF data file one by one in parallel.
In one embodiment of the invention, fig. 4 shows a schematic diagram of reading a longitude-latitude grid data block, wherein the data block to be read spans the 0-longitude line; assuming the longitude range of the data is 0 to 360 degrees, the array returned to the user is actually composed of four sub-data blocks.
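The segmentation of a request that crosses the 0-longitude line can be sketched as follows (Python, hypothetical names; the patent performs the equivalent numbering computation in steps S7-4 to S7-6):

```python
def split_lon_range(r1, r2, lon0=0.0, span=360.0):
    """Split a requested longitude range [r1, r2] into segments that do
    not cross the 0-longitude line, for data stored on [lon0, lon0 + span)."""
    if r1 < lon0 and r2 > lon0:  # request wraps across the 0-longitude line
        return [(r1 + span, lon0 + span), (lon0, r2)]
    return [(r1, r2)]
```

A request for -30 to 40 degrees on data stored over 0 to 360 degrees becomes the two segments 330-360 and 0-40, which are read separately and then spliced.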
In one embodiment of the invention, fig. 5 shows a schematic Fortran program implementing parallel data reading with the invention. It can be seen that the amount of code required by the invention is only about 20% of that required by the official interface. On this basis, a user only needs to pass in an MPI communication domain and the number of parallel process groups to achieve parallel reading and writing; the group number can be set according to the actual hardware composition of the parallel file system in use, for example to match the number of storage controllers, which can further improve the parallel read-write efficiency of NetCDF data.
In summary, the invention designs a user-friendly parallel read-write method and system for the NetCDF data commonly used in the meteorological and ocean fields. The method and system are simple and convenient to use and do not require the user to declare a large number of handle variables, keeping the user program tidy and easy to manage; compared with directly using the official NetCDF program interface, they save a large amount of programming and greatly improve the readability of the program.

Claims (9)

1. The parallel read-write method for NetCDF data is characterized by comprising the following steps:
s1, selecting a current operation, and if the current operation is read-in data, entering a step S2; if the data is written out, the step S8 is carried out; the current operation is to read in data or write out data;
s2, opening a NetCDF data file, and taking a character string parameter name transmitted through a user side as a unique index of a NetCDF data set in the NetCDF data file; accessing a dimension named as a name in the NetCDF data file, and acquiring the size of the name dimension through optional parameters;
s3, setting an attribute named as a name dimension in the NetCDF data file;
S4, enabling the parallel read-in function through the incoming MPI communication domain, and setting the number of groups into which the parallel processes are divided through the optional parameter ngroup; calling the NetCDF file-opening interface NF90_OPEN through the master process of the current process group, and setting the file reading mode;
S5, judging whether the data numbers on the longitude and latitude grids need to be selected through the coordinate range, if so, entering a step S7; otherwise, entering step S6;
S6, performing the read operation by creating a buffer on a process, and releasing the memory occupied by the data file in the bottom layer program after the read operation;
s7, calculating data covered by a coordinate range transmitted by a user based on an attribute named as a name dimension; if the covered data cross the 0 longitude line, the covered data are segmented according to the size of the latitude, different segments are read in by adopting the same method as the step S6, the read segmented data are spliced, and the memory occupied by the data file in the bottom layer program is released; if the covered data does not cross the 0 longitude line, the same method as the step S6 is adopted for reading operation, and the memory occupied by the data file in the bottom program is released;
S8, creating a NetCDF data file, and taking the character string parameter name passed in from the user side as the unique index of the NetCDF data set in the NetCDF data file; adding a dimension to the created NetCDF data file through an add_dim method, and setting the name and size of the dimension, whether it is subdivided in parallel, and whether a coordinate variable named name is added;
S9, adding a variable into the created NetCDF data file through an add_var method, and setting the name, long name, unit, dimension name, variable type and default value of the variable;
s10, calling a program interface NF90_CREATE of the NetCDF open file, setting corresponding parameters according to parallel write-out data, and adding added dimensions and variables into the NetCDF data file;
s11, enabling a parallel writing-out function through an incoming MPI communication domain, completing data writing-out, and releasing the memory occupied by the data file in the bottom program.
2. The parallel read-write method for NetCDF data according to claim 1, wherein in step S2, the specific method for using the character string parameter name transmitted by the user side as the unique index of the NetCDF data set in the NetCDF data file is as follows:
storing the parameter name and the data set program object dataset as a key-value pair in a hash table, and accessing the data set program object dataset through name; wherein the data set program object dataset includes the NetCDF file handle, a logical variable is_parallel marking whether the file is parallel, the MPI communication domain, the process number, the number of groups, the group number, and the group MPI communication domain.
3. The NetCDF data-oriented parallel read-write method of claim 2, wherein the flag is_parallel is set according to whether an MPI library compiler is employed, whether the user passes in an MPI communication domain, and whether the NetCDF data file employs a parallel IO compilation option.
4. The NetCDF data-oriented parallel read-write method of claim 1, wherein the attribute named name dimension includes coordinate span, periodicity, and whether flip is needed to be inverted.
5. The parallel read-write method for NetCDF data according to claim 1, wherein the specific method for setting the number of parallel process divided groups by the optional parameter ngroup in step S4 is as follows:
setting the number of groups into which the parallel processes are divided according to the optional parameter ngroup passed in by the user, assigning the processes to the different groups as evenly as possible, taking the first process of each group as the master process of that group, and the other processes of each group as slave processes.
6. The parallel read-write method for NetCDF data according to claim 1, wherein the specific method of step S6 is as follows:
specifying the position of each process group in the global array through the optional parameters start and count, creating a buffer on the master process of each process group, having each master process read, one by one, the global-array number ranges that its slave processes need into the buffer and asynchronously send them to the corresponding slave processes, completing one read operation, and releasing the memory occupied by the data file in the bottom layer program.
7. The parallel read-write method for NetCDF data according to claim 1, wherein the specific method of step S7 comprises the following sub-steps:
s7-1, setting a mode variable, acquiring a data coordinate range [ x1, x2] according to an attribute named as a name dimension, and acquiring a user-specified coordinate range [ r1, r2] according to a coordinate range parameter transmitted by a user;
S7-2, assigning a value to mode:
if x1 ≤ r1 and r2 ≤ x2, set mode to 1 and enter step S7-3;
if r1 ≤ x1, x1 < r2 and r2 ≤ x2, set mode to 2 and enter step S7-4;
if x1 ≤ r1, r1 < x2 and x2 ≤ r2, set mode to 3 and enter step S7-5;
if r1 ≤ x1 and x2 ≤ r2, set mode to 4 and enter step S7-6;
if x1, x2, r1 and r2 are in any other size relationship, set mode to 0 and perform no further processing;
S7-3, search for the first coordinate number i larger than r1 and record i-1 as the starting number is of the data actually to be read; search for the first coordinate number j larger than r2 and record j-1 as the ending number ie of the data actually to be read; if no such coordinate number is found, take the dimension size as the ending number ie of the data actually to be read, and enter step S7-7;
S7-4, judging whether the dimension of the data actually to be read has periodicity; if so, the data actually to be read consists of two blocks divided by the 0-longitude line: for the first block, search from back to front for the first number i smaller than r1 plus the coordinate span, record i+1 as the starting number is(1) of the first block, and take the dimension size as its ending number ie(1); for the second block, record 1 as the starting number is(2), search from front to back for the first number j larger than r2, record j-1 as the ending number ie(2), and enter step S7-7; otherwise, the data actually to be read has only one block: record 1 as its starting number is, search for the first number j larger than r2, record j-1 as its ending number ie, and enter step S7-7;
S7-5, judging whether the dimension of the data actually to be read has periodicity; if so, the data actually to be read consists of two blocks divided by the 0-longitude line: for the first block, search from front to back for the first number i larger than r1, record i-1 as the starting number is(1) of the first block, and take the dimension size as its ending number ie(1); for the second block, record 1 as the starting number is(2), search from front to back for the first number j larger than r2 minus the dimension span, record j-1 as the ending number ie(2), and enter step S7-7; otherwise, the data actually to be read has only one block: search from front to back for the first number i larger than r1, record i-1 as its starting number is, take the dimension size as its ending number ie, and enter step S7-7;
S7-6, judging whether the dimension of the data actually to be read has periodicity; if so, the data actually to be read consists of three blocks: for the first block, search from back to front for the first number i smaller than r1 plus the dimension span, record i as the starting number is(1) of the first block, and take the dimension size as its ending number ie(1); for the second block, record 1 as the starting number is(2) and take the dimension size as the ending number ie(2); for the third block, record 1 as the starting number is(3), search from front to back for the first number j larger than r2 minus the dimension span, record j as the ending number ie(3), and enter step S7-7; otherwise, the data actually to be read has only one block: record 1 as its starting number is, take the dimension size as its ending number ie, and enter step S7-7;
S7-7, when the data actually to be read consists of more than one block, read the different blocks with the same method as step S6, splice the read blocks together to complete one read operation, and release the memory occupied by the data file in the bottom layer program; when the data actually to be read has only one block, perform the read operation with the same method as step S6 and release the memory occupied by the data file in the bottom layer program.
8. The parallel read-write method for NetCDF data according to claim 1, wherein the specific method of step S11 is as follows:
enabling the parallel write-out function through the incoming MPI communication domain, and setting the number of groups into which the parallel processes are divided through the optional parameter ngroup; specifying the position of each process group in the global array through the optional parameters start and count, creating a buffer on the master process of each process group, having each master process collect the global-array number ranges that its slave processes need to write out, asynchronously receive the data sent by the slave processes into the corresponding buffer, and write the data out to the NetCDF data file one by one, completing one data write operation and releasing the memory occupied by the data file in the bottom layer program.
9. The parallel read-write system for the NetCDF data is characterized by comprising an operation selection module, an index establishment module, a dimension acquisition module, an attribute setting module, a read-in data preparation module, a first data read-in module, a second data read-in module, a basic data setting module, a variable adding module, a write-out data preparation module, a data write-out module and a memory release module; wherein:
the operation selection module is used for selecting read-in data or write-out data;
the index establishing module is used for opening a NetCDF data file, and taking a character string parameter name transmitted through a user side as a unique index of a NetCDF data set in the NetCDF data file;
the dimension acquisition module is used for accessing a dimension named as a name in the NetCDF data file and acquiring the size of the name dimension through selectable parameters;
the attribute setting module is used for setting an attribute named as a name dimension in the NetCDF data file;
the read-in data preparation module is used for enabling the parallel read-in function through the incoming MPI communication domain and setting the number of groups into which the parallel processes are divided through the optional parameter ngroup; calling the NetCDF file-opening interface NF90_OPEN through the master process of the current process group, and setting the file reading mode;
the first data read-in module is used for specifying the position of each process group in the global array through the optional parameters start and count, creating a buffer on the master process of each process group, and having each master process read, one by one, the global-array number ranges that its slave processes need into the buffer and asynchronously send them to the corresponding slave processes, completing one read operation of the data to be read;
the second data reading module is used for calculating data covered by a coordinate range transmitted by a user based on an attribute named as a name dimension; if the covered data cross 0 longitude line, the covered data are segmented according to the size of the latitude, the same operation as the first data reading module is adopted to read the data of different segments, and the read segmented data are spliced; if the covered data does not cross the 0 longitude line, the data is read by adopting the same operation as the first data reading module;
the basic data setting module is used for creating a NetCDF data file, and taking the character string parameter name passed in from the user side as the unique index of the NetCDF data set in the NetCDF data file; adding a dimension to the created NetCDF data file through an add_dim method, and setting the name and size of the dimension, whether it is subdivided in parallel, and whether a coordinate variable named name is added;
The variable adding module is used for adding a variable into the created NetCDF data file through an add_var method and setting the name, long name, unit, dimension name, variable type and default value of the variable;
the write-out data preparation module is used for calling the NetCDF file-creation interface NF90_CREATE, setting the corresponding parameters for parallel write-out, and adding the previously added dimensions and variables to the NetCDF data file;
the data write-out module is used for enabling the parallel write-out function through the incoming MPI communication domain and setting the number of groups into which the parallel processes are divided through the optional parameter ngroup; specifying the position of each process group in the global array through the optional parameters start and count, creating a buffer on the master process of each process group, and having each master process collect the global-array number ranges that its slave processes need to write out, asynchronously receive the data sent by the slave processes into the corresponding buffer, and write the data out to the NetCDF data file one by one, completing one data write operation;
and the memory releasing module is used for releasing the memory occupied by the data file in the bottom layer program after finishing single data reading or data writing.
CN202310961228.6A 2023-08-02 2023-08-02 NetCDF data-oriented parallel reading and writing method and system Active CN116662266B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310961228.6A CN116662266B (en) 2023-08-02 2023-08-02 NetCDF data-oriented parallel reading and writing method and system


Publications (2)

Publication Number Publication Date
CN116662266A true CN116662266A (en) 2023-08-29
CN116662266B CN116662266B (en) 2023-10-03

Family

ID=87722881

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310961228.6A Active CN116662266B (en) 2023-08-02 2023-08-02 NetCDF data-oriented parallel reading and writing method and system

Country Status (1)

Country Link
CN (1) CN116662266B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103761291A (en) * 2014-01-16 2014-04-30 中国人民解放军国防科学技术大学 Geographical raster data parallel reading-writing method based on request aggregation
US20190146980A1 (en) * 2017-11-15 2019-05-16 Mapr Technologies, Inc. Reading own writes using context objects in a distributed database


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHAO Qiang; LONG Shaoqiao; SHEN Jiping; TANG Defu; LU Weixian: "Application of the NetCDF data format in storing oceanic section survey data", Journal of Zhejiang Ocean University (Natural Science), no. 01 *

Also Published As

Publication number Publication date
CN116662266B (en) 2023-10-03

Similar Documents

Publication Publication Date Title
CN112115198B (en) Urban remote sensing intelligent service platform
CN106779417A (en) The collection of engineering investigation information digitalization, management and integrated application method
CN102375879A (en) Mobile GIS (Geographic Information System) system based on intelligent mobile phone and application thereof
CN112579722A (en) High-customization remote sensing image automatic rapid map cutting method
CN101702245B (en) Extensible universal three-dimensional terrain simulation system
CN113568995B (en) Dynamic tile map manufacturing method and tile map system based on search conditions
CN108804602A (en) A kind of distributed spatial data storage computational methods based on SPARK
WO2023231665A1 (en) Distributed transaction processing method, system and device, and readable storage medium
CN114647716B (en) System suitable for generalized data warehouse
CN114048204A (en) Beidou grid space indexing method and device based on database inverted index
CN116662266B (en) NetCDF data-oriented parallel reading and writing method and system
CN105354310B (en) Map tile storage layout optimization method based on MapReduce
CN109597575A (en) A kind of storage of sectional type data and read method based on HDF5
CN110515993B (en) Tax data conversion method and system
CN115587084A (en) Comprehensive management system and method for geographic information data
CN108073706B (en) Method for transversely displaying longitudinal data of simulation system historical library
CN110019518B (en) Data processing method and device
Jin et al. Analysis of the Modeling Method and Application of 3D City Model based on the CityEngine
Tian et al. Marine information sharing and publishing system: a WebGIS approach
CN111159480A (en) Graph drawing method based on power grid GIS data
Chen et al. Research on embedded GIS based on wireless networks
CN116302579B (en) Space-time big data efficient loading rendering method and system for Web end
CN104809217B (en) A kind of GIS raster datas cloud storage method
Wu Research and development of mobile forestry GIS based on intelligent terminal
CN114969171B (en) Space-time consistent data display and playback method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant