CN113111116B

CN113111116B - Ocean water environment data integration method of ocean comprehensive database

Info

Publication number: CN113111116B
Application number: CN202110516191.7A
Authority: CN
Inventors: 杨锦坤; 宋晓; 韩璐遥; 刘玉龙; 苗庆生; 徐珊珊; 董明媚; 宁鹏飞
Original assignee: NATIONAL MARINE DATA AND INFORMATION SERVICE
Current assignee: NATIONAL MARINE DATA AND INFORMATION SERVICE
Priority date: 2021-05-12
Filing date: 2021-05-12
Publication date: 2022-10-18
Anticipated expiration: 2041-05-12
Also published as: CN113111116A

Abstract

The invention provides a marine water body environment data integration method for a marine comprehensive database, comprising the following steps: S1, loading marine water environment data files, sorting them, and intelligently deduplicating the sorted data files; S2, verifying the quality range of the deduplicated data files and storing the verification result; and S3, parsing the verified data files to obtain data and storing the data in a database. The method reduces manual workload and operating errors, improves integration efficiency, and provides fine-grained, platform-based integration of marine water body environment resources.

Description

Ocean water environment data integration method of ocean comprehensive database

Technical Field

The invention belongs to the technical field of data integration, and particularly relates to a marine water body environment data integration method of a marine comprehensive database.

Background

Achieving efficient and accurate integration of marine environment data, and thereby efficient use of that data, is a key research direction in marine environment data integration.

Current methods for collecting and processing marine environment data commonly suffer from low data quality, incomplete dimensions, missing data, and unreliable values. In the conventional integration approach, integration methods and rules are matched manually to the original state of each type of data to be integrated, forming one-to-one or one-to-many integration strategies; flow rules are then configured in a data integration tool such as ETL (extract, transform, load) to form the integration flow. As a result, operators often face a heavy workload, and mistakes in rule settings frequently force repeated rounds of testing and tuning.

To achieve efficient processing and orderly integration of complex marine water environment data of many kinds, the invention starts from the problems of existing processing schemes and explores a data integration method suited to the particular application scenario of the marine environment. It shifts the integration approach from manually configured integration rules to processing by an integration device, with the aim of reducing manual workload, reducing operating errors, and improving integration efficiency. It provides a marine water environment data integration method for a marine comprehensive database that delivers fine-grained, platform-based integration of marine water environment resources.

Disclosure of Invention

In view of this, the invention aims to provide a method for integrating marine water environment data of a marine comprehensive database, so as to avoid manual processing errors and improve the integration efficiency of the water environment data.

In order to achieve the purpose, the technical scheme of the invention is realized as follows:

The marine water body environment data integration method of the marine comprehensive database comprises the following steps:

S1, loading marine water environment data files, sorting them, and intelligently deduplicating the sorted data files;

S2, verifying the quality range of the deduplicated data files and storing the verification result;

S3, parsing the verified data files to obtain data and storing the data in a database.

Further, step S1 specifically includes:

S11, loading the parsed marine water environment data files to form standard format files;

S12, sorting the data in the standard format files along different dimensions, the sorting result serving as the basis for intelligent deduplication;

S13, classifying the sorted data files with different clustering algorithms according to their data characteristics, marking the classified data, and intelligently deduplicating the data.

Further, step S2 specifically includes:

S21, performing data quality detection on each intelligently deduplicated data file with the detection method matched to its data characteristics;

S22, marking the data according to the quality detection result.

Further, in step S3, the data file verified in step S2 is converted in format according to the parsing configuration file and then stored in the database.

Compared with the prior art, the method has the following advantages:

The marine water environment data integration method reduces manual workload and operating errors, improves integration efficiency, and provides fine-grained, platform-based integration of marine water environment resources.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate an embodiment of the invention and, together with the description, serve to explain the invention and not to limit the invention. In the drawings:

FIG. 1 is a schematic deployment diagram of an integrated device platform according to an embodiment of the present invention;

FIG. 2 is a data processing flow diagram of a cluster computation module according to an embodiment of the present invention;

FIG. 3 is a specific processing flow diagram of the quality calculation module according to an embodiment of the present invention;

FIG. 4 is a flowchart of outputting to the water body comprehensive library according to an embodiment of the present invention.

Detailed Description

It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.

The present invention is described in detail below with reference to the embodiments and the accompanying drawings.

Step 1, loading the parsed marine water environment data files to form standard format files;

Step 2, sorting the data in the standard format files along different dimensions;

Step 3, classifying the sorted data files with different clustering algorithms according to their data characteristics, marking the classified data, and intelligently deduplicating the data according to the marks;

Step 4, performing data quality detection on each intelligently deduplicated data file with the detection method matched to its data characteristics;

Step 5, marking the quality of the data according to the quality detection result;

Step 6, converting the format of the data file verified in step 5 according to the parsing configuration file and storing it in a database.
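The six steps above can be pictured as three chained stages. The following is a minimal sketch under stated assumptions, not the patented implementation: the `Record` fields, the function names, the in-memory dict standing in for the database, and the value 0 for an unchecked quality indicator are all hypothetical. Only the stage order and the indicator values 1 (good) and 4 (bad) follow the description.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Record:
    time: str       # ISO 8601 timestamp, so string order equals time order
    lon: float
    lat: float
    depth: float
    value: float    # one observed element, e.g. water temperature
    flag: int = 0   # quality indicator; 0 = not yet checked (assumed convention)


def s1_sort_and_deduplicate(records: List[Record]) -> List[Record]:
    """S1: sort by time, longitude, latitude, depth, then drop exact repeats."""
    ordered = sorted(records, key=lambda r: (r.time, r.lon, r.lat, r.depth))
    seen = set()
    unique = []
    for r in ordered:
        key = (r.time, r.lon, r.lat, r.depth, r.value)
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


def s2_check_range(records: List[Record], valid_range: Tuple[float, float]) -> List[Record]:
    """S2: store the range-check result in each record's quality indicator."""
    low, high = valid_range
    for r in records:
        r.flag = 1 if low <= r.value <= high else 4
    return records


def s3_store(records: List[Record], database: Dict[str, List[Record]], table: str) -> int:
    """S3: append the verified records to a table and return how many were stored."""
    database.setdefault(table, []).extend(records)
    return len(records)


def integrate(records, valid_range, database, table) -> int:
    """Run S1, S2, and S3 in order."""
    unique = s1_sort_and_deduplicate(records)
    checked = s2_check_range(unique, valid_range)
    return s3_store(checked, database, table)
```

Calling `integrate(records, (-2.5, 40.0), {}, "temperature")` would run all three stages on a list of `Record` objects; the range used here is an illustrative value, not one taken from the specification.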

The invention relates to a marine water environment data integration method for a marine comprehensive database, and in particular to a method that uses an 'integration device platform' to output multi-source marine water environment data automatically into a marine water body integration database. It avoids manual processing errors and improves the efficiency of integrating water body environment data. By constructing an integration device, the method provides intelligent integration of data into the database, processes and integrates complex multi-source data, and forms a water body database.

The integration device is the processing center of the invention: the intelligent processing component that integrates marine water environment data and forms the marine water body comprehensive database. It comprises a clustering calculation module, a quality calculation module, and a module for output to the water body comprehensive library. The marine water environment data integration method of the marine comprehensive database is realized by building a platform of server hardware and software processing modules for the specific application scenario.

As shown in FIG. 1, the integration device platform is deployed on a data loading server. The integration device performs platform-based integration processing on the loaded water body data resources to form a water body comprehensive library, which is finally stored on a database server.

1. Clustering calculation module

The clustering calculation module is the data integration component of the integration device platform. Using a built-in deduplication technique, it automatically sorts the loaded marine water environment data from various sources along multiple dimensions such as time, longitude, latitude, and depth, and intelligently deduplicates the data according to the sorting result. FIG. 2 is the data processing flow chart of the clustering calculation module.

1) Automatic sorting: the clustering calculation module automatically derives a sorting method for data from different sources by mining historical data, so that the platform can form a basis for judgment through communication with the loading server and supply the clustering model with the basic information it needs for sorting and deduplication.

According to the data requirements of the marine water body comprehensive library, water body data are divided into nine disciplines: water temperature, salinity, water level, ocean current, waves, water color, transparency, sea luminescence, and sea ice. The built-in sorting method is based on the characteristics of each discipline.

2) Clustering model: the clustering model performs cluster analysis on each kind of water body environment data along dimensions such as time, longitude, latitude, and depth, marks data judged to be repeated within the range, and identifies and deletes redundant data according to the sorting to form a standard data set.

Different clustering algorithms are matched to the different characteristics of the water environment data in order to obtain a more accurate deduplication result. For example, DBSCAN clustering is used for temperature and salinity observation data: the radius is determined by training on historical data, data greater than or equal to the radius are marked, and the remaining data are marked separately, to serve as the basis for intelligent deduplication.

3) Intelligent deduplication: the time, longitude, latitude, depth, and other dimensions of the multi-source data are compared according to the sorting result and the cluster analysis result, and an nc file is generated according to the sorting marks.
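To make the clustering step concrete, here is a small pure-Python DBSCAN sketch. It is one possible reading of the description, not the module's actual code: it assumes the observations have already been scaled into comparable units, treats points that fall into the same cluster as duplicate candidates, and keeps the first member of each cluster in sort order. The names `dbscan` and `mark_duplicates`, and the values of `eps` and `min_pts`, are illustrative; the description says only that the radius is learned from historical data.

```python
import math
from typing import List, Optional, Sequence

NOISE = -1  # label for points that belong to no cluster


def region_query(points: Sequence[Sequence[float]], i: int, eps: float) -> List[int]:
    """Indices of all points within distance eps of point i (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points: Sequence[Sequence[float]], eps: float, min_pts: int) -> List[int]:
    """Brute-force DBSCAN returning one cluster label per point (NOISE = -1)."""
    labels: List[Optional[int]] = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # former noise becomes a border point
            if labels[j] is not None:
                continue                 # already assigned to a cluster
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return [label if label is not None else NOISE for label in labels]


def mark_duplicates(labels: List[int]) -> List[bool]:
    """Mark every cluster member except the first one seen as a duplicate."""
    seen = set()
    marks = []
    for label in labels:
        marks.append(label != NOISE and label in seen)
        if label != NOISE:
            seen.add(label)
    return marks
```

With `min_pts=2`, any two observations closer than `eps` form a cluster, so isolated observations stay labelled as noise and are never marked for deletion.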

2. Quality calculation module

The quality calculation module performs professional quality control based on the characteristics of marine operations: it verifies that the data fall within the quality range defined by the operational standard specification for each element, writes the verification result to the relevant quality field of the data record, and thereby gives the integrated water environment data a high-quality output.

The quality calculation module matches detection methods to each cluster-integrated nc file according to the characteristics of its discipline and runs quality detection file by file, profile by profile, and method by method. When bad data is found, the corresponding quality indicator is set to 4 (bad flag) and saved in the nc file. FIG. 3 shows the specific processing flow of the quality calculation module, which comprises the following steps:

Step 51, acquiring the basic library data table to be processed;

Step 52, obtaining from the log table the cut-off time of the last data extraction, together with the current system time;

Step 53, obtaining from the quality indicator configuration table the element field names to be processed in the current data table and the corresponding quality indicator field names;

Step 54, updating the log table state: calculation of the quality indicators of the table has started;

Step 55, judging whether there are element fields to be processed; if so, going to step 56, and if not, going to step 63;

Step 56, obtaining from the element specification table the valid value range of the element field to be processed;

Step 57, generating the SQL statement that evaluates and updates the quality indicator field;

Step 58, generating the SQL statement for the summary statistics temporary table: aggregating the statistics under the statistical logical primary keys corresponding to all incremental data and inserting them into a temporary table;

Step 59, generating the SQL statement that deletes from the quality indicator statistical intermediate table: deleting from the intermediate table, by statistical logical primary key, all records whose logical primary key exists in the temporary table;

Step 60, generating the SQL statement that inserts into the quality indicator statistical intermediate table: aggregating the data in the temporary table by statistical logical primary key and inserting it into the statistical intermediate table;

Step 61, generating the SQL statement that drops the temporary table;

Step 62, executing the generated SQL statements in sequence;

Step 63, updating the log table state: execution finished.
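Steps 56, 57, and 62 can be illustrated with a short sketch that generates the range-check SQL and runs it. This is an assumption-laden illustration using Python's standard `sqlite3` module: the table names `base_temperature` and `element_spec`, the column names, and the valid range of -2.5 to 40.0 are all hypothetical, and the statistics tables of steps 58 to 61 are left out.

```python
import sqlite3


def build_flag_update_sql(table: str, element_field: str, flag_field: str) -> str:
    """Step 57 sketch: UPDATE that sets the quality indicator from a range test.

    Identifiers are assumed to come from trusted configuration tables;
    the range limits are passed as bound parameters.
    """
    return (
        f"UPDATE {table} SET {flag_field} = "
        f"CASE WHEN {element_field} BETWEEN ? AND ? THEN 1 ELSE 4 END"
    )


def run_range_check(conn: sqlite3.Connection, table: str,
                    element_field: str, flag_field: str) -> None:
    # Step 56: read the valid value range from the element specification table.
    low, high = conn.execute(
        "SELECT min_value, max_value FROM element_spec WHERE field = ?",
        (element_field,),
    ).fetchone()
    statements = [(build_flag_update_sql(table, element_field, flag_field), (low, high))]
    # Step 62: execute the generated statements in order, in one transaction.
    with conn:
        for sql, params in statements:
            conn.execute(sql, params)


def demo() -> list:
    """Build a tiny in-memory database and return the resulting indicators."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE base_temperature (id INTEGER PRIMARY KEY, temp REAL, temp_qc INTEGER)"
    )
    conn.executemany(
        "INSERT INTO base_temperature (temp) VALUES (?)", [(12.5,), (99.0,), (-1.0,)]
    )
    conn.execute(
        "CREATE TABLE element_spec (field TEXT PRIMARY KEY, min_value REAL, max_value REAL)"
    )
    conn.execute("INSERT INTO element_spec VALUES ('temp', -2.5, 40.0)")
    run_range_check(conn, "base_temperature", "temp", "temp_qc")
    rows = conn.execute("SELECT temp_qc FROM base_temperature ORDER BY id").fetchall()
    conn.close()
    return [row[0] for row in rows]
```

Keeping the statements in a list before executing them mirrors the description's split between generating SQL (steps 57 to 61) and executing it (step 62).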

3. Output to water body comprehensive library module

The nc files produced by the clustering calculation and quality calculation models are output directly to the water body comprehensive library.

To output to the water body comprehensive library, a parsing configuration file is constructed; the loading program reads this configuration file, converts the content of the generated standard nc files into the corresponding Java objects, and writes the parsed content to the water body comprehensive library in batches.
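The configuration-driven conversion and batch write can be sketched as follows. The description's loader produces Java objects; this illustration uses Python and the standard `sqlite3` module instead, and represents the nc file content as a plain dict of equal-length lists so that it runs without a NetCDF library. The configuration layout, the variable and column names, and the batch size are all hypothetical.

```python
import sqlite3
from typing import Dict, List

# Hypothetical parsing configuration: nc variable name -> database column.
PARSE_CONFIG = {
    "table": "thermohaline",
    "fields": {
        "TIME": "obs_time",
        "LONGITUDE": "lon",
        "LATITUDE": "lat",
        "DEPTH": "depth",
        "TEMP": "temp",
        "TEMP_QC": "temp_qc",
    },
}


def to_rows(variables: Dict[str, list], config: dict) -> List[tuple]:
    """Turn column-wise variables into row tuples, ordered as in the config."""
    names = list(config["fields"])
    return list(zip(*(variables[name] for name in names)))


def write_batches(conn: sqlite3.Connection, rows: List[tuple],
                  config: dict, batch_size: int = 1000) -> int:
    """Insert rows in batches, committing once per batch; return rows written."""
    columns = ", ".join(config["fields"].values())
    marks = ", ".join("?" for _ in config["fields"])
    sql = f"INSERT INTO {config['table']} ({columns}) VALUES ({marks})"
    written = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        with conn:  # commits on success, rolls back on error
            conn.executemany(sql, batch)
        written += len(batch)
    return written
```

Because the column order is taken from the configuration rather than hard-coded, supporting another discipline in this sketch would only need a different configuration entry.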

The invention is explained by a specific example of an integration method of 'temperature and salinity data' of the marine water environment:

Thermohaline (temperature and salinity) data mainly comprise observation information (time, longitude, latitude, depth, water temperature, salinity, density, and sound velocity) together with auxiliary information such as observation instruments, navigation equipment, special projects, and tasks, which together form a complete and complex thermohaline data structure.

Step 1, loading thermohaline data files of various sources

The clustering calculation module loads the standard files, such as csv and nc files, produced by parsing marine water body temperature and salinity environment data from various sources.

Step 2, applying multi-dimensional automatic sorting to the loaded files

Using its built-in deduplication technique, the clustering calculation module automatically sorts the standard thermohaline files in csv, nc, and other formats produced by parsing each source, according to marine water body profile dimensions such as time, longitude, latitude, and depth.

The radius is determined with the DBSCAN clustering method, and data greater than or equal to the radius are marked. The marks are used to adjust priority and, combined with the sort, form the final sorting result, which serves as the basis for intelligent deduplication.
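The multi-dimensional sort in step 2 amounts to ordering profiles by a compound key. A minimal sketch follows, assuming each profile is a dict with ISO 8601 `time`, `lon`, `lat`, and `depth` entries; the optional `priority` entry is a hypothetical stand-in for the cluster-derived mark and is used here only as a tie-breaker, which is one possible reading of "priority adjustment".

```python
from datetime import datetime
from typing import Dict, List


def sort_key(profile: Dict) -> tuple:
    """Compound key: time, then longitude, latitude, depth, then priority."""
    return (
        datetime.fromisoformat(profile["time"]),
        profile["lon"],
        profile["lat"],
        profile["depth"],
        profile.get("priority", 0),  # hypothetical mark; lower sorts first
    )


def sort_profiles(profiles: List[Dict]) -> List[Dict]:
    """Return a new list ordered by the compound key (stable sort)."""
    return sorted(profiles, key=sort_key)
```

Because Python's sort is stable, profiles with identical keys keep their loading order, so the first-loaded source would win in a later keep-first deduplication.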

Step 3, intelligently deduplicating the sorted data files

Step 3 intelligently deduplicates data of the same profile according to the sorting result of step 2. The intelligent deduplication used in this embodiment is a function built into the clustering calculation module; it contains built-in deduplication rules and a clustering model and performs deduplication automatically according to the rules.

The deduplication rules are the main basis of the deduplication operation and are configured precisely according to the requirements of each discipline. Examples are as follows:

void on_comboBox_lat_lon_currentTextChanged(const QString &arg1);

Precision rule for longitude and latitude. For example, with a precision of 0.01, longitude and latitude are kept to two decimal places, and when data are compared, values that agree to two decimal places are treated as identical.

void on_comboBox_psal_currentTextChanged(const QString &arg1);

Precision rule for salinity data.

def removeDulpicate(self, df, same_dup):

For completely identical profiles, the duplicates are deleted directly.
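The precision rules can be sketched as a quantized comparison key. This is an illustration under assumptions, not the body of `removeDulpicate`, which the description does not show: it assumes records are dicts, that values are rounded rather than truncated to the configured precision, and that the first record in sort order is kept. The function names and the default salinity precision are hypothetical.

```python
from typing import Dict, List


def quantize(value: float, precision: float) -> int:
    """Map a value to an integer bucket so that equal buckets compare as equal."""
    return round(value / precision)


def remove_duplicates(records: List[Dict],
                      latlon_precision: float = 0.01,
                      psal_precision: float = 0.01) -> List[Dict]:
    """Keep the first record for each (time, lon, lat, depth, salinity) key."""
    seen = set()
    kept = []
    for r in records:
        key = (
            r["time"],
            quantize(r["lon"], latlon_precision),
            quantize(r["lat"], latlon_precision),
            r["depth"],
            quantize(r["psal"], psal_precision),
        )
        if key in seen:
            continue  # same position and salinity at the configured precision
        seen.add(key)
        kept.append(r)
    return kept
```

Using integer buckets rather than rounded floats avoids comparing floating-point values for equality directly.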

Step 4, checking the deduplicated data files and marking quality indicators

The quality calculation module matches detection methods to each cluster-integrated file according to the characteristics of its discipline and runs quality detection file by file, profile by profile, and method by method. When bad data is found, the corresponding quality indicator is set to 4 (bad flag) and saved in the nc file. This embodiment takes an nc file as the example.

(1) Reading profile data from nc files

self.profiles = argo.profile_from_nc(filePath, nc)

prof_count = len(self.profiles)

(2) Reading quality control parameters from a configuration file

import json

cfg_filePath = "./argo.json"

with open(cfg_filePath, 'r') as file:
    cfg = json.load(file)

pqc = ProfileQC(prof, cfg=cfg, csv=greylist, metafile=metafile)

(3) Quality detection by detection method

The quality calculation module integrates quality inspection test methods for the various data types of each discipline. For example, the tukey53H_norm detection method is used to quality-check the water body temperature and salinity files, as follows:

The tukey53H_norm detection method exploits the robustness of the median to create a smoothed data series, which is then compared with the observed values. The difference is normalized by the standard deviation of the observed data series after the large-scale variability has been removed.

For a single measurement $x_i$, where $i$ is the position of the observation, the test is evaluated as follows:

$x^{(1)}$ is the median of the five points from $x_{i-2}$ to $x_{i+2}$;

$x^{(2)}$ is the median of the three points from $x^{(1)}_{i-1}$ to $x^{(1)}_{i+1}$;

$x^{(3)}$ is defined by the Hanning smoothing filter: $x^{(3)}_i = \frac{1}{4}\left(x^{(2)}_{i-1} + 2x^{(2)}_i + x^{(2)}_{i+1}\right)$;

if $\frac{|x_i - x^{(3)}_i|}{\sigma} > k$, then $x_i$ is a spike, where $\sigma$ is the standard deviation of the low-pass filtered data.

(4) Marking quality indicators 1-4 according to the quality test result

The default behavior of the system is to assign flag 4 if the value produced by the test is greater than k = 1.5, and flag 1 if the value is below k = 1.5.
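The spike test and the flagging rule can be sketched in pure Python as below. This is an illustrative reading of the procedure, not the library routine the embodiment calls: it assumes that $\sigma$ is the population standard deviation of the smoothed series $x^{(3)}$ over the points where it is defined, and that points too close to the ends of the profile to be evaluated receive indicator 0 (an assumed "not evaluated" value that is not stated in the description).

```python
from statistics import median, pstdev
from typing import List, Optional


def tukey53h_norm(x: List[float]) -> List[Optional[float]]:
    """Return |x_i - x3_i| / sigma per point, or None where it is undefined."""
    n = len(x)
    x1: List[Optional[float]] = [None] * n
    for i in range(2, n - 2):                 # median of five points
        x1[i] = median(x[i - 2:i + 3])
    x2: List[Optional[float]] = [None] * n
    for i in range(3, n - 3):                 # median of three x1 values
        x2[i] = median(x1[i - 1:i + 2])
    x3: List[Optional[float]] = [None] * n
    for i in range(4, n - 4):                 # Hanning smoothing filter
        x3[i] = 0.25 * (x2[i - 1] + 2 * x2[i] + x2[i + 1])
    smooth = [v for v in x3 if v is not None]
    if len(smooth) < 2:
        return [None] * n                     # profile too short to evaluate
    sigma = pstdev(smooth)                    # assumed definition of sigma
    if sigma == 0:
        return [None] * n                     # constant series: nothing to normalize
    return [None if v is None else abs(xi - v) / sigma for xi, v in zip(x, x3)]


def to_flags(scores: List[Optional[float]], k: float = 1.5) -> List[int]:
    """Flag 4 above k, flag 1 otherwise, 0 where the test was not evaluated."""
    return [0 if s is None else (4 if s > k else 1) for s in scores]
```

Each smoothing pass shortens the evaluable range by one or two points at each end, so a profile needs at least nine points before any value can be tested in this sketch.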

Step 6, outputting the temperature and salinity files and writing them to the water body comprehensive library

The temperature and salinity nc files produced by the clustering calculation and quality calculation models are output directly to the water body comprehensive library through the parsing and loading program.

The parsing and loading program converts the content of the generated standard nc files into the corresponding Java objects according to the constructed parsing configuration file, and writes the parsed content to the water body comprehensive library in batches.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. The marine water environment data integration method of the marine comprehensive database is characterized by comprising the following steps of:

S3, parsing the verified data file to obtain data, and storing the data into a database;

the step S2 includes the steps of:

S22, performing quality marking on the data according to the quality detection result;

S21, comprising the following steps:

step 51, acquiring a basic library data table to be processed;

step 54, update the log table state: starting to calculate the quality indicator of the table;

step 56, obtaining the effective value range of the element field needing to be processed from the element specification table;

step 57, generating a judgment update quality indicator field SQL statement;

step 58, generating a summary statistics temporary table SQL statement: summarizing statistical results under statistical logic main keys corresponding to all incremental data, and inserting the statistical results into a temporary table;

step 60, generating an SQL statement of the statistical intermediate table of the insertion quality indicator: summarizing data in the temporary table according to the statistical logic main key, and inserting the data into the statistical intermediate table;

step 61, generating a SQL statement for deleting the temporary table;

step 62, executing the generated SQL statements in sequence;

step 63, updating the log table state: execution finished.

2. The method of claim 1, wherein: the step S1 specifically includes: s11, loading the analyzed marine water environment data file to form a standard format file;

s12, sorting the data in the standard format file according to different dimensions, and taking the sorting result as an intelligent duplicate removal basis;

3. The method of claim 1, wherein: in step S3, the data file verified in step S2 is stored in a database after being subjected to format conversion according to the parsing configuration file.