CN105930381A

CN105930381A - Global Argo data storage and update method based on mixed database architecture

Info

Publication number: CN105930381A
Application number: CN201610230748.XA
Authority: CN
Inventors: 曹敏杰; 许建平; 刘增宏; 孙朝辉; 吴晓芬; 卢少磊
Original assignee: Second Institute of Oceanography SOA
Current assignee: Second Institute of Oceanography SOA
Priority date: 2016-04-13
Filing date: 2016-04-13
Publication date: 2016-09-07

Abstract

The invention discloses a global Argo data storage and update method based on a hybrid database architecture. The method comprises the following steps: 1) a data monitor monitors a specified directory, and new Argo data files are forwarded to a data server; 2) a data classifier divides the Argo data files collected on the data server into three categories according to file format; 3) a data controller checks whether the database already contains the current data and whether the data file is complete; 4) a data extractor extracts the relevant metadata and data blocks from the data files; 5) a data loading module uploads the unstructured data blocks to an HDFS distributed storage system; 6) a data archiving module archives the loaded data to form log files. With this method, the different categories of data files in the global Argo data are stored and updated on a single database platform, providing an efficient and flexibly extensible storage and update solution for global Argo data.

Description

Global Argo data storage and update method based on a hybrid database architecture

Technical field

The present invention relates to the field of data storage and update methods, and in particular to a global Argo data storage and update method based on a hybrid database architecture.

Background technology

The global Argo program is a global ocean observation program launched in 1998 by atmospheric and marine scientists from the United States, France, Japan and other countries. It aims to collect temperature and salinity profile data from the upper layers of the global ocean quickly, accurately and on a large scale, in order to improve the accuracy of climate prediction and to defend effectively against the threats that increasingly severe climate disasters (such as hurricanes, tornadoes, typhoons, ice storms, floods and droughts) pose to mankind. Over the past 15 years, the participating countries have deployed more than 12,000 Argo buoys in the global ocean and have cumulatively obtained about 1.5 million temperature and salinity profiles, forming a huge body of global Argo data.

As the volume of global Argo data keeps growing, and because Argo data are multi-source, heterogeneous, dynamic, multidimensional and massive, efficient storage and updating of global Argo data has always been a challenging problem. At present, Argo data are stored mainly as files. Extracting and classifying the raw data is difficult in this form, further data mining is hindered, and the approach cannot keep pace with the ever-growing body of Argo data. The various kinds of Argo data accumulated over the long term are stored in separate, isolated places, so they cannot be used together effectively and data updates cannot be made in a timely manner. Therefore, unifying the different types of global Argo data on a single database platform for storage and updating has become an urgent need for scientific research and operational work.

Summary of the invention

The object of the invention is to overcome the problems of the prior art by providing a global Argo data storage and update method based on a hybrid database architecture.

The global Argo data storage and update method based on a hybrid database architecture comprises the following steps:

1) a data monitor monitors a specified directory on a remote data host and, as soon as a new Argo data file is generated, forwards the data file to a data server;

2) a data classifier divides the Argo data files collected on the data server, according to file format, into three categories: global Argo buoy metadata, global Argo buoy observation profile data, and global Argo gridded data products;

3) a data controller checks whether the current data already exist in the database and whether the content of the data file is complete;

4) a data extractor extracts the relevant metadata and data blocks from the data file;

5) a data loading module uploads the unstructured data blocks to an HDFS distributed storage system, records the structured metadata in a PostgreSQL relational database, and establishes an index between the data blocks and the metadata;

6) a data archiving module archives the data that have been loaded, forming log files.

Step 1) is specifically as follows: the data monitor is a module resident on the data server. It periodically starts a thread that connects to each remote data host holding global Argo data and uses a log file to determine whether new Argo data have been generated in the specified directory. As soon as a new file is generated, the data file is forwarded to the data server and recorded in the log file.
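The log-based detection in step 1) can be illustrated with a minimal sketch. It assumes the remote directory listing is already available as a list of file names and that the log is held as a set of names seen so far; the function names and the file name `1900726_284.nc` are hypothetical and are not taken from the patent.

```python
from typing import Iterable, List, Set


def find_new_files(remote_listing: Iterable[str], logged: Set[str]) -> List[str]:
    """Return files present on the remote host but absent from the log."""
    return sorted(name for name in remote_listing if name not in logged)


def poll_once(remote_listing: Iterable[str], logged: Set[str]) -> List[str]:
    """One polling cycle: detect new files, then record them in the log."""
    new_files = find_new_files(remote_listing, logged)
    for name in new_files:
        # A real monitor would transfer the file to the data server here
        # (assumption: the transfer mechanism is not specified in the text).
        logged.add(name)
    return new_files


log = {"1900726_284.nc"}  # hypothetical file already recorded in the log
print(poll_once(["1900726_284.nc", "1900726_285.nc"], log))
```

Because each detected file is written to the log, a second polling cycle over the same listing reports nothing new, which matches the "record in the log file" behaviour described above.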

Step 2) is specifically as follows: the data classifier classifies the Argo data files collected on the data server. Files in DAT format are global Argo buoy metadata, files in NetCDF format are global Argo buoy observation profile data, and files in PNG format are global Argo gridded data products. The three classes of files are thereby assigned to their respective data centers.
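A minimal sketch of the format-based classification in step 2) follows. It assumes the file format can be read from the file name suffix (`.dat`, `.nc`, `.png`); the category labels are hypothetical names chosen for illustration.

```python
from pathlib import Path

# Assumption: formats are identified by suffix; labels are illustrative only.
CATEGORY_BY_SUFFIX = {
    ".dat": "buoy_metadata",      # DAT: global Argo buoy metadata
    ".nc": "profile_data",        # NetCDF: buoy observation profile data
    ".png": "gridded_product",    # PNG: gridded data products
}


def classify(file_name: str) -> str:
    """Map an Argo data file to one of the three categories by its suffix."""
    suffix = Path(file_name).suffix.lower()
    try:
        return CATEGORY_BY_SUFFIX[suffix]
    except KeyError:
        raise ValueError(f"unsupported Argo file format: {file_name}") from None


print(classify("1900726_285.nc"))
```

Rejecting unknown suffixes with an error, rather than guessing a category, keeps unexpected files out of the three data centers.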

Step 3) is specifically as follows: the data controller queries the PostgreSQL relational database by file name to determine whether the data file already exists, and judges from the file size whether the data file is complete. If the data file does not exist in the database and the data file is complete, it is identified as a new data file and can be loaded into the database.

Step 4) is specifically as follows: the data extractor extracts the corresponding metadata from the global Argo buoy metadata files, the global Argo buoy observation profile data files and the global Argo gridded data products, respectively. At the same time, it extracts the data blocks from the global Argo buoy observation profile data files and converts them into JSON files.
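The conversion of a profile data block to JSON in step 4) might look like the sketch below. It assumes the per-level variables have already been read from the NetCDF file into plain Python lists (reading NetCDF itself is outside this sketch); the JSON layout and the function name are assumptions, and the numeric values are placeholders rather than real observations.

```python
import json
from typing import Dict, List


def profile_to_json(wmo: str, cycle: int, levels: Dict[str, List[float]]) -> str:
    """Serialize one profile as JSON, one record per observation level."""
    lengths = {len(values) for values in levels.values()}
    if len(lengths) > 1:
        raise ValueError("all variables must have the same number of levels")
    count = lengths.pop() if lengths else 0
    records = [
        {name: values[i] for name, values in levels.items()}
        for i in range(count)
    ]
    return json.dumps({"wmo": wmo, "cycle": cycle, "levels": records})


# Placeholder values for pressure, temperature and salinity (not real data).
sample = {"PRES": [5.0, 10.0], "TEMP": [20.1, 19.8], "PSAL": [35.1, 35.2]}
print(profile_to_json("1900726", 285, sample))
```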

Step 5) specifically comprises the following sub-steps:

5.1) the data loading module records the structured metadata extracted in step 4) in the PostgreSQL relational database, mainly in three kinds of tables: the buoy metadata table, the buoy observation profile data information table and the buoy gridded data product information table. The buoy metadata table stores the metadata of all Argo buoys, that is, the technical parameters of each Argo buoy, including WMO number, platform number, transmission system, transmission repetition rate, positioning system, manufacturer, profile sampling direction, sensor information and cycle information. The buoy observation profile data information table stores the information relating to all observed profiles; to improve query efficiency, this table is split into multiple tables by year. It mainly includes buoy ID, WMO number, profile cycle number, profile observation direction, date, and longitude and latitude. The buoy gridded data product information table stores the information relating to all gridded data products, mainly including product category, product date and product extent;

5.2) the data loading module uploads the unstructured JSON and PNG file data blocks to the HDFS distributed storage system, where they are stored and redundantly backed up on multiple physical nodes; the access paths of the data blocks are kept on the master node of the cluster;

5.3) the data loading module also stores each data block access path together with its corresponding metadata in the PostgreSQL relational database, thereby establishing the index between data blocks and metadata and achieving hybrid storage of global Argo data across the HDFS distributed storage system and the PostgreSQL relational database.
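The index between metadata and data blocks described in sub-steps 5.1) to 5.3) can be sketched as a metadata row that carries the HDFS access path. To stay self-contained, the sketch uses Python's built-in `sqlite3` as a stand-in for PostgreSQL; the table naming scheme, the column names, the year and the HDFS path are all hypothetical.

```python
import sqlite3
from typing import Optional


def profile_table(year: int) -> str:
    """Name of the per-year profile table (naming scheme is hypothetical)."""
    return f"profile_info_{int(year)}"


def register_profile(conn: sqlite3.Connection, year: int, wmo: str,
                     cycle: int, hdfs_path: str) -> None:
    """Store profile metadata together with the path of its data block."""
    table = profile_table(year)
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} ("
        "wmo TEXT, cycle INTEGER, hdfs_path TEXT, PRIMARY KEY (wmo, cycle))"
    )
    conn.execute(f"INSERT INTO {table} VALUES (?, ?, ?)", (wmo, cycle, hdfs_path))


def lookup_path(conn: sqlite3.Connection, year: int, wmo: str,
                cycle: int) -> Optional[str]:
    """Follow the index from metadata to the data block access path."""
    row = conn.execute(
        f"SELECT hdfs_path FROM {profile_table(year)} WHERE wmo = ? AND cycle = ?",
        (wmo, cycle),
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL database
# Placeholder year and path, chosen only to exercise the sketch.
register_profile(conn, 2016, "1900726", 285, "/argo/profiles/1900726_285.json")
print(lookup_path(conn, 2016, "1900726", 285))
```

Splitting the profile table by year keeps each table small, which is the query-efficiency argument made in sub-step 5.1); the primary key on WMO number and cycle also rejects a duplicate load of the same profile.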

Step 6) is specifically as follows: the data archiving module archives the data that have been loaded, forming a separate log file per day for each of the three categories of global Argo data.
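One way to realize the per-day, per-category log files of step 6) is sketched below. The directory layout (`<root>/<category>/<date>.log`), the category labels and the function names are assumptions made for illustration.

```python
from datetime import date
from pathlib import Path

# Hypothetical labels for the three categories of global Argo data.
CATEGORIES = ("buoy_metadata", "profile_data", "gridded_product")


def log_path(root: Path, category: str, day: date) -> Path:
    """Location of the log file for one category on one day."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category}")
    return root / category / f"{day.isoformat()}.log"


def archive(root: Path, category: str, day: date, file_name: str) -> Path:
    """Append the name of a loaded file to that day's category log."""
    path = log_path(root, category, day)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as handle:
        handle.write(file_name + "\n")
    return path
```

Calling `archive` repeatedly on the same day appends to the same file, so each category accumulates exactly one log per day.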

Compared with the prior art, the present invention has the following beneficial effects:

1) The present invention overcomes the inability of the current file-based storage approach and of a single relational database to support coordinated storage and effective updating. It stores and updates the different types of data files in the global Argo data on a single database platform, providing an efficient and flexibly extensible storage and update solution for global Argo data.

2) The present invention uses HDFS to store the unstructured data in the global Argo data. Each data block is replicated to multiple nodes in the cluster, which improves the fault tolerance of the data, and nodes can be added or removed dynamically, which ensures that storage can scale. This provides a basic guarantee for efficient access to, and rapid updating of, global Argo data.

Brief description of the drawings

Fig. 1 is a flow chart of the global Argo data storage and update method based on a hybrid database architecture;

Fig. 2 is a diagram of the hybrid database architecture.

Detailed description of the invention

The present invention is further elaborated and illustrated below with reference to the accompanying drawings. The technical features of the embodiments of the present invention may be combined with one another provided they do not conflict.

As shown in Fig. 1, a global Argo data storage and update method based on a hybrid database architecture comprises the following steps:

1) A data monitor monitors a specified directory on a remote data host and, as soon as a new Argo data file is generated, forwards the data file to a data server. Specifically:

The data monitor is a module resident on the data server. It periodically starts a thread that connects to each remote data host holding global Argo data and uses a log file to determine whether new Argo data have been generated in the specified directory. As soon as a new file is generated, the data file is forwarded to the data server and recorded in the log file.

2) A data classifier divides the Argo data files collected on the data server, according to file format, into three categories: global Argo buoy metadata, global Argo buoy observation profile data, and global Argo gridded data products. Specifically:

The data classifier classifies the Argo data files collected on the data server. Files in DAT format are global Argo buoy metadata, files in NetCDF format are global Argo buoy observation profile data, and files in PNG format are global Argo gridded data products. The three classes of files are thereby assigned to their respective data centers.

3) A data controller checks whether the current data already exist in the database and whether the content of the data file is complete. Specifically:

The data controller queries the PostgreSQL relational database by file name to determine whether the data file already exists, and judges from the file size whether the data file is complete. If the data file does not exist in the database and the data file is complete, it is identified as a new data file and can be loaded into the database; otherwise it cannot be loaded. A reference value can be preset for judging whether the data file is complete. The reference value can be determined according to the current international Argo file standard; a normal profile file is 38 KB in size.
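The two conditions checked by the data controller can be written as a single predicate. The sketch below assumes the set of already-loaded file names has been fetched from the database beforehand and that completeness means the size in whole kilobytes equals the preset reference value; the text does not state how the comparison is made, so the exact-match rule and the names used here are assumptions.

```python
from typing import Set

REFERENCE_SIZE_KB = 38  # preset reference value given in the description


def can_load(file_name: str, size_kb: int, known_files: Set[str],
             reference_kb: int = REFERENCE_SIZE_KB) -> bool:
    """Return True only for a file that is both new and complete."""
    is_new = file_name not in known_files
    # Assumption: a file is complete when its size matches the reference value.
    is_complete = size_kb == reference_kb
    return is_new and is_complete


print(can_load("1900726_285.nc", 38, known_files=set()))
```

Keeping the reference value as a parameter reflects the statement that it is preset and may be derived from the current Argo file standard.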

4) A data extractor extracts the relevant metadata and data blocks from the data file. Specifically:

The data extractor extracts the corresponding metadata from the global Argo buoy metadata files, the global Argo buoy observation profile data files and the global Argo gridded data products, respectively. At the same time, it extracts the data blocks from the global Argo buoy observation profile data files and converts them into JSON files.

5) A data loading module uploads the unstructured data blocks to an HDFS distributed storage system, records the structured metadata in a PostgreSQL relational database, and establishes an index between the data blocks and the metadata. Specifically:

5.2) the data loading module uploads the unstructured JSON and PNG file data blocks (that is, the JSON files and the PNG-format global Argo gridded data products mentioned above) to the HDFS distributed storage system, where they are stored and redundantly backed up on multiple physical nodes; the access paths of the data blocks are kept on the master node of the cluster;

6) A data archiving module archives the data that have been loaded, forming a separate log file per day for each of the three categories of global Argo data.

The present invention is further illustrated below with an embodiment. The operating steps of the embodiment are consistent with the method described above, so for brevity some steps are not described again.

Embodiment

1) the data monitor monitors the specified directory on the remote data host (ftp://ftp.argo.org.cn/pub/ARGO); as soon as a new Argo data file is generated, for example when the new data file 1900726_285.nc is detected, the data file is forwarded to the data server;

2) the data classifier determines from its file format that the Argo data file 1900726_285.nc collected on the data server is global Argo buoy observation profile data, and transfers the file to the corresponding data center;

3) the data controller checks whether the current data already exist in the PostgreSQL relational database; if not, the file is identified as a new data file. The controller then judges from the file size whether the content of the data file is complete: if the file size is 38 KB, the file content is determined to be complete;

4) the data extractor extracts the relevant metadata and data block from the data file 1900726_285.nc. The metadata include WMO number 1900726, platform number 39506, transmission system ARGOS, positioning system ARGOS, manufacturer Webb, cycle number 285, and position 25.761° S, 115.159° W, among other information. The data block consists of the extracted observation profile itself, which is converted into the JSON file 1900726_285.json;

5) the data loading module uploads the unstructured file 1900726_285.json as a data block to the HDFS distributed storage system, where it is saved on the data nodes, and records the structured metadata in the buoy observation profile data information table of the PostgreSQL relational database. At the same time, the data block access path and its corresponding metadata are stored together in the PostgreSQL relational database, thereby establishing the index between the data block and the metadata, as shown in Fig. 2.
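In this embodiment the file name 1900726_285.nc carries the WMO number (1900726) and the cycle number (285). The sketch below assumes that profile files follow a `<WMO number>_<cycle>.nc` pattern, which is inferred from this single example rather than stated in the text; the helper names are hypothetical.

```python
import re
from pathlib import Path
from typing import NamedTuple


class ProfileName(NamedTuple):
    wmo: str
    cycle: int


# Assumed naming pattern: digits, underscore, digits, ".nc".
_PATTERN = re.compile(r"^(\d+)_(\d+)\.nc$")


def parse_profile_name(file_name: str) -> ProfileName:
    """Split a profile file name into WMO number and cycle number."""
    match = _PATTERN.match(file_name)
    if match is None:
        raise ValueError(f"unexpected profile file name: {file_name}")
    return ProfileName(wmo=match.group(1), cycle=int(match.group(2)))


def json_name(file_name: str) -> str:
    """Name of the JSON data block derived from a NetCDF profile file."""
    return Path(file_name).with_suffix(".json").name


parsed = parse_profile_name("1900726_285.nc")
print(parsed.wmo, parsed.cycle, json_name("1900726_285.nc"))
```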

The embodiment described above is only a preferred scheme of the present invention and is not intended to limit the invention. A person of ordinary skill in the relevant technical field may make various changes and modifications without departing from the spirit and scope of the present invention. Therefore, all technical schemes obtained by equivalent substitution or equivalent transformation fall within the protection scope of the present invention.

Claims

1. A global Argo data storage and update method based on a hybrid database architecture, characterized in that its steps are as follows:

1) a data monitor monitors a specified directory on a remote data host and, as soon as a new Argo data file is generated, forwards the data file to a data server;

2) a data classifier divides the Argo data files collected on the data server, according to file format, into three categories: global Argo buoy metadata, global Argo buoy observation profile data, and global Argo gridded data products;

3) a data controller checks whether the current data already exist in the database and whether the content of the data file is complete;

4) a data extractor extracts the relevant metadata and data blocks from the data file;

5) a data loading module uploads the unstructured data blocks to an HDFS distributed storage system, records the structured metadata in a PostgreSQL relational database, and establishes an index between the data blocks and the metadata;

2. The global Argo data storage and update method based on a hybrid database architecture according to claim 1, characterized in that step 1) is: the data monitor is a module resident on the data server; it periodically starts a thread that connects to each remote data host holding global Argo data and uses a log file to determine whether new Argo data have been generated in the specified directory; as soon as a new file is generated, the data file is forwarded to the data server and recorded in the log file.

3. The global Argo data storage and update method based on a hybrid database architecture according to claim 1, characterized in that step 2) is: the data classifier classifies the Argo data files collected on the data server; files in DAT format are global Argo buoy metadata, files in NetCDF format are global Argo buoy observation profile data, and files in PNG format are global Argo gridded data products; the three classes of files are thereby assigned to their respective data centers.

4. The global Argo data storage and update method based on a hybrid database architecture according to claim 1, characterized in that step 3) is: the data controller queries the PostgreSQL relational database by file name to determine whether the data file already exists, and judges from the file size whether the data file is complete; if the data file does not exist in the database and the data file is complete, it is identified as a new data file and can be loaded into the database.

5. The global Argo data storage and update method based on a hybrid database architecture according to claim 1, characterized in that step 4) is: the data extractor extracts the corresponding metadata from the global Argo buoy metadata files, the global Argo buoy observation profile data files and the global Argo gridded data products, respectively, and at the same time extracts the data blocks from the global Argo buoy observation profile data files and converts them into JSON files.

6. The global Argo data storage and update method based on a hybrid database architecture according to claim 1, characterized in that step 5) is:

5.1) the data loading module records the structured metadata extracted in step 4) in the PostgreSQL relational database, mainly in three kinds of tables: the buoy metadata table, the buoy observation profile data information table and the buoy gridded data product information table; the buoy metadata table stores the metadata of all Argo buoys, that is, the technical parameters of each Argo buoy, including WMO number, platform number, transmission system, transmission repetition rate, positioning system, manufacturer, profile sampling direction, sensor information and cycle information; the buoy observation profile data information table stores the information relating to all observed profiles and, to improve query efficiency, is split into multiple tables by year, including buoy ID, WMO number, profile cycle number, profile observation direction, date, and longitude and latitude; the buoy gridded data product information table stores the information relating to all gridded data products, including product category, product date and product extent;

5.2) the data loading module uploads the unstructured JSON and PNG file data blocks to the HDFS distributed storage system, where they are stored and redundantly backed up on multiple physical nodes; the access paths of the data blocks are kept on the master node of the cluster;

5.3) the data loading module also stores each data block access path together with its corresponding metadata in the PostgreSQL relational database, thereby establishing the index between data blocks and metadata and achieving hybrid storage of global Argo data across the HDFS distributed storage system and the PostgreSQL relational database.

7. The global Argo data storage and update method based on a hybrid database architecture according to claim 1, characterized in that step 6) is: the data archiving module archives the data that have been loaded, forming a separate log file per day for each of the three categories of global Argo data.