CN113111116A

CN113111116A - Ocean water environment data integration method of ocean comprehensive database

Info

Publication number: CN113111116A
Application number: CN202110516191.7A
Authority: CN
Inventors: 杨锦坤; 宋晓; 韩璐遥; 刘玉龙; 苗庆生; 徐珊珊; 董明媚; 宁鹏飞
Original assignee: NATIONAL MARINE DATA AND INFORMATION SERVICE
Current assignee: NATIONAL MARINE DATA AND INFORMATION SERVICE
Priority date: 2021-05-12
Filing date: 2021-05-12
Publication date: 2021-07-13
Anticipated expiration: 2041-05-12
Also published as: CN113111116B

Abstract

The invention provides a marine water body environment data integration method for a marine comprehensive database, comprising the following steps: S1, loading and sorting marine water body environment data files, and intelligently deduplicating the sorted data files; S2, verifying the quality range of the deduplicated data files and storing the verification result; and S3, parsing the verified data files to obtain data, and storing the data in a database. The method reduces the amount of manual operation, reduces operation errors, improves integration efficiency, and achieves refined, platform-based integration of marine water body environment resources.

Description

Ocean water environment data integration method of ocean comprehensive database

Technical Field

The invention belongs to the technical field of data integration, and particularly relates to a marine water body environment data integration method of a marine comprehensive database.

Background

Achieving efficient and accurate integration processing of marine environment data, and thereby efficient use of that data, is a key research direction in marine environment data integration.

At present, methods for collecting and processing marine environment information data commonly suffer from low data quality, incomplete dimensions, missing data, and deviations in reliability. In addition, in the traditional data integration mode, integration methods and rules are manually matched to the original state of each type of data to be integrated, forming one-to-one or one-to-many integration strategies; the integration flow rules are then configured in ETL or other data integration tools to form a data integration flow that achieves the integration target. As a result, staff often face a heavy workload, and mistakes in rule settings frequently force repeated testing and tuning.

To achieve efficient processing and orderly integration of diverse, complex marine water body environment data, the invention starts from the problems of existing processing schemes and develops a data integration method tailored to the particular application scenario of the marine environment. It shifts the integration approach from manually configured integration rules to processing by an integration device, with the aims of reducing manual operation, reducing operation errors, and improving integration efficiency. It thus provides a marine water body environment data integration method for a marine comprehensive database that achieves refined, platform-based integration of marine water body environment resources.

Disclosure of Invention

In view of this, the invention aims to provide a method for integrating marine water environment data of a marine comprehensive database, so as to avoid manual processing errors and improve the integration efficiency of the water environment data.

In order to achieve the purpose, the technical scheme of the invention is realized as follows:

The marine water body environment data integration method for a marine comprehensive database comprises the following steps:

S1, loading and sorting marine water body environment data files, and intelligently deduplicating the sorted data files;

S2, verifying the quality range of the deduplicated data files and storing the verification result;

and S3, parsing the verified data files to obtain data, and storing the data in a database.

Further, step S1 specifically includes:

S11, loading the parsed marine water body environment data files to form standard-format files;

S12, sorting the data in the standard-format files along different dimensions, and using the sorting result as the basis for intelligent deduplication;

and S13, classifying the sorted data files with different clustering algorithms according to their data characteristics, marking the classified data, and intelligently deduplicating them.

Further, step S2 specifically includes:

S21, performing data quality detection on each intelligently deduplicated data file, using the detection method matched to its data characteristics;

and S22, marking the data according to the quality detection result.

Further, in step S3, the data file verified in step S2 is subjected to format conversion according to the parsing configuration file and then stored in the database.

Compared with the prior art, the method has the following advantages:

The marine water body environment data integration method reduces the amount of manual operation, reduces operation errors, improves integration efficiency, and achieves refined, platform-based integration of marine water body environment resources.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate an embodiment of the invention and, together with the description, serve to explain the invention and not to limit the invention. In the drawings:

FIG. 1 is a schematic deployment diagram of an integrated device platform according to an embodiment of the present invention;

FIG. 2 is a data processing flow diagram of a cluster computation module according to an embodiment of the present invention;

FIG. 3 is a specific processing flow diagram of a quality calculation module according to an embodiment of the present invention;

FIG. 4 is a flowchart of outputting to a water body integration library according to an embodiment of the present invention.

Detailed Description

It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.

The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.

Step 1, loading the parsed marine water body environment data files to form standard-format files;

Step 2, sorting the data in the standard-format files along different dimensions;

Step 3, classifying the sorted data files with different clustering algorithms according to their data characteristics, marking the classified data, and intelligently deduplicating the data according to the marks;

Step 4, performing data quality detection on each intelligently deduplicated data file, using the detection method matched to its data characteristics;

Step 5, marking the data according to the quality detection result;

Step 6, converting the format of the data files verified in step 5 according to the parsing configuration file and storing them in a database.
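The six steps above can be sketched end to end as follows. This is a minimal illustration, not the implementation of the embodiment: the record layout, the valid ranges in `VALID_RANGE`, the table name `water_body`, and all function names are assumptions introduced here, and the clustering-based deduplication of step 3 is replaced by a plain key comparison to keep the sketch short.

```python
import sqlite3

# Hypothetical valid ranges; real limits would come from the element specification.
VALID_RANGE = {"temp": (-2.5, 40.0), "psal": (0.0, 41.0)}

def sort_records(records):
    """Step 2: order records by time, longitude, latitude and depth."""
    return sorted(records, key=lambda r: (r["time"], r["lon"], r["lat"], r["depth"]))

def deduplicate(records):
    """Step 3 (simplified): keep the first record for each space-time key."""
    seen, unique = set(), []
    for r in records:
        key = (r["time"], round(r["lon"], 2), round(r["lat"], 2), r["depth"])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

def flag_quality(records):
    """Steps 4-5: flag 1 when every element is in range, otherwise 4 (bad)."""
    for r in records:
        ok = all(lo <= r[name] <= hi for name, (lo, hi) in VALID_RANGE.items())
        r["qc"] = 1 if ok else 4
    return records

def store(records, conn):
    """Step 6: write the flagged records to a database table in one batch."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS water_body "
        "(time TEXT, lon REAL, lat REAL, depth REAL, temp REAL, psal REAL, qc INTEGER)"
    )
    conn.executemany(
        "INSERT INTO water_body VALUES (:time, :lon, :lat, :depth, :temp, :psal, :qc)",
        records,
    )
    conn.commit()

def integrate(records, conn):
    """Run sorting, deduplication, quality flagging and storage in order."""
    records = flag_quality(deduplicate(sort_records(records)))
    store(records, conn)
    return records
```

Calling `integrate(records, sqlite3.connect(":memory:"))` on a list of dictionaries with the keys `time`, `lon`, `lat`, `depth`, `temp`, and `psal` returns the deduplicated, flagged records and leaves the same rows in the table.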

The invention relates to a marine water body environment data integration method for a marine comprehensive database, and in particular to a method that uses an 'integration device platform' to automatically output multi-source marine water body environment data into a marine water body integration database, which avoids manual processing errors and improves the integration efficiency of water body environment data. By constructing an integration device, the method provides intelligent integration of data into the database, completes the processing and integration of complex multi-source data, and forms a water body database.

The integration device is the processing center of the invention: the intelligent processing component that integrates marine water body environment data and forms the marine water body comprehensive database. It comprises a clustering calculation module, a quality calculation module, and a module for output to the water body comprehensive library. In a specific application scenario, the marine water body environment data integration method is realized by building a platform from server hardware and software processing modules.

As shown in FIG. 1, the integration device platform is deployed on the data loading server. Through the integration device, it performs platform-based integration processing on the loaded water body data resources to form a water body comprehensive library, which is finally stored on the database server.

1. Clustering calculation module

The clustering calculation module is the data integration component of the integration device platform. Using its built-in deduplication technique, it automatically sorts the loaded marine water body environment data from various sources along multiple dimensions such as time, longitude, latitude, and depth, and intelligently removes duplicates according to the sorting result. FIG. 2 is a data processing flow diagram of the clustering calculation module.

1) Automatic sorting means that the clustering calculation module automatically derives a sorting method for data from different sources on the basis of historical data. Through communication with the loading server, the platform forms a basis for judgment and supplies the clustering model with the basic information it needs for sorting and deduplication.

Water body data are divided into nine subjects according to the data requirements of the marine water body comprehensive library: water temperature, salinity, water level, ocean current, waves, water color, transparency, sea luminescence, and sea ice. The built-in sorting method is based on the characteristics of each subject.

2) The clustering model performs cluster analysis on each type of water body environment data along dimensions such as time, longitude, latitude, and depth, marks the data judged to be duplicated within the range, and identifies and deletes redundant data according to the sorting result to form a standard data set.

Different clustering algorithms are matched to the water body environment data according to their characteristics in order to obtain a more accurate deduplication result. For example, DBSCAN clustering is used for temperature and salinity observation data: the radius is determined by training on historical data, data greater than or equal to the radius are marked, and the remaining data are marked separately, which serves as the basis for intelligent deduplication.
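The DBSCAN-based marking can be illustrated with a small pure-Python sketch. It rests on several assumptions that the text does not state: each observation is reduced to a tuple (time, longitude, latitude, depth) already scaled so that one unit is comparable in every dimension, the radius `eps` is supplied directly rather than learned from historical data, and observations that fall into a common cluster are treated as duplicate candidates, which is only one possible reading of the marking rule. The function names are hypothetical.

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one label per point, -1 meaning noise."""
    labels = [None] * len(points)

    def neighbors(i):
        # All points within distance eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1              # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster     # noise reached from a core point joins the cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = neighbors(j)
            if len(more) >= min_pts:    # j is a core point, so expand through it
                queue.extend(more)
    return labels

def mark_candidates(points, eps, min_pts=2):
    """Mark observations that share a cluster with at least one other observation."""
    return [label != -1 for label in dbscan(points, eps, min_pts)]
```

With `min_pts=2`, any two observations closer than `eps` form a cluster, so the marks single out groups of near-identical observations while isolated observations remain unmarked.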

3) Intelligent deduplication compares multi-source data across dimensions such as time, longitude, latitude, and depth according to the sorting result and the cluster analysis result, and generates an nc file according to the sorting marks.

2. Quality calculation module

The quality calculation module applies domain-specific quality control based on the characteristics of marine operations: it verifies that the data fall within the quality range defined by the operational standard specification for each element, writes the verification result into the relevant quality field of the data record, and thereby ensures high-quality output of the integrated water body environment data.

The quality calculation module checks each nc file produced by clustering integration, file by file, profile by profile, and method by method, using the detection methods matched to the characteristics of the subject. When bad data are found, it changes the corresponding quality flag to 4 (bad flag) and saves it in the nc file. The specific processing flow of the quality calculation module is shown in FIG. 3 and comprises the following steps:

Step 51, acquiring the basic library data table to be processed;

Step 52, obtaining from the log table the cut-off time of the last data extraction, together with the current system time;

Step 53, obtaining from the quality flag configuration table the names of the element fields to be processed in the current data table and the names of the corresponding quality flag fields;

Step 54, updating the log table state: quality flag calculation for the table has started;

Step 55, determining whether any element fields remain to be processed; if so, going to step 56, otherwise going to step 63;

Step 56, obtaining the valid value range of the element field to be processed from the element specification table;

Step 57, generating the SQL statement that evaluates and updates the quality flag field;

Step 58, generating the SQL statement for the summary statistics temporary table: summarizing the statistical results under the statistical logical primary keys corresponding to all incremental data, and inserting them into a temporary table;

Step 59, generating the SQL statement that deletes from the quality flag statistics intermediate table: deleting from the intermediate table, by statistical logical primary key, all records whose logical primary key exists in the temporary table;

Step 60, generating the SQL statement that inserts into the quality flag statistics intermediate table: summarizing the data in the temporary table by statistical logical primary key and inserting the result into the statistics intermediate table;

Step 61, generating the SQL statement that drops the temporary table;

Step 62, executing the generated SQL statements in sequence;

Step 63, updating the log table state: execution finished.
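Steps 57 to 62 can be pictured with a short sketch that generates the statements and runs them against SQLite. Everything named here is an assumption for illustration: the tables `qc_stat` and `tmp_stat`, the key column `station_id`, and the convention that in-range values receive flag 1. The sketch also processes the whole table instead of only incremental data, and it interpolates identifiers into the SQL text, which is acceptable only when they come from a trusted configuration table.

```python
import sqlite3

def build_statements(table, field, flag_field, low, high, key="station_id"):
    """Generate the SQL of steps 57 to 61 for one element field."""
    return [
        # Step 57: flag values outside the valid range as 4 (bad), others as 1.
        f"UPDATE {table} SET {flag_field} = CASE "
        f"WHEN {field} BETWEEN {low} AND {high} THEN 1 ELSE 4 END",
        # Step 58: summarise flag counts per logical primary key in a temporary table.
        f"CREATE TEMP TABLE tmp_stat AS SELECT {key}, {flag_field} AS flag, "
        f"COUNT(*) AS n FROM {table} GROUP BY {key}, {flag_field}",
        # Step 59: delete intermediate-table rows whose key appears in the temporary table.
        f"DELETE FROM qc_stat WHERE {key} IN (SELECT {key} FROM tmp_stat)",
        # Step 60: insert the fresh summary into the intermediate table.
        f"INSERT INTO qc_stat SELECT {key}, flag, n FROM tmp_stat",
        # Step 61: drop the temporary table.
        "DROP TABLE tmp_stat",
    ]

def run_quality_update(conn, table, field, flag_field, low, high):
    """Step 62: execute the generated statements in sequence."""
    for sql in build_statements(table, field, flag_field, low, high):
        conn.execute(sql)
    conn.commit()
```

Deleting by key before inserting (steps 59 and 60) makes the statistics update idempotent: rerunning the procedure for the same keys replaces their summary rows instead of duplicating them.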

3. Output to water body comprehensive library module

The nc file produced by the clustering calculation and quality calculation modules is output directly to the water body comprehensive library.

For output to the water body comprehensive library, a parsing configuration file is constructed; the loading program parses this configuration file, converts the content of the generated standard nc file into the corresponding Java objects, and writes the parsed content into the water body comprehensive library in batches.
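The configuration-driven loading can be sketched as follows. The embodiment describes Java objects; this sketch uses Python for brevity and is not the loader of the embodiment. The mapping in `CONFIG`, the `Observation` class, the table name `observation`, and the representation of the nc file content as a dictionary of equal-length lists are all assumptions made for illustration.

```python
import json
import sqlite3
from dataclasses import dataclass, astuple

@dataclass
class Observation:
    """Plays the role of the record object built by the loading program."""
    obs_time: str
    depth: float
    temperature: float

# Hypothetical parsing configuration: file variable name -> object attribute.
CONFIG = json.loads('{"TIME": "obs_time", "DEPH": "depth", "TEMP": "temperature"}')

def to_objects(variables, config):
    """Convert column-oriented file content into one object per observation."""
    renamed = {config[name]: values for name, values in variables.items() if name in config}
    count = len(next(iter(renamed.values())))
    return [Observation(**{attr: col[i] for attr, col in renamed.items()}) for i in range(count)]

def write_in_batches(conn, objects, batch_size=1000):
    """Insert the objects in batches of at most batch_size rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS observation (obs_time TEXT, depth REAL, temperature REAL)"
    )
    for start in range(0, len(objects), batch_size):
        batch = [astuple(o) for o in objects[start:start + batch_size]]
        conn.executemany("INSERT INTO observation VALUES (?, ?, ?)", batch)
        conn.commit()
```

Keeping the variable-to-attribute mapping in a configuration file means a new file layout can be supported by editing the mapping, without changing the loading code.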

The invention is illustrated below with a specific example: the integration of temperature and salinity data of the marine water body environment.

Temperature and salinity data mainly comprise observation information (time, longitude, latitude, depth, water temperature, salinity, density, and sound velocity) and auxiliary information such as observation instruments, navigation equipment, special projects, and tasks, which together form a complete and complex temperature and salinity data structure.

Step 1, loading thermohaline data files of various sources

The clustering calculation module connects to and loads the standard csv, nc, and similar files formed by parsing temperature- and salinity-related marine water body environment data from various sources.

Step 2, implementing multi-dimensional automatic sequencing for loaded files

Using its built-in deduplication technique, the clustering calculation module automatically sorts the standard temperature and salinity files in csv, nc, and other formats generated by parsing each source, according to seawater profile dimensions such as time, longitude, latitude, and depth.

The radius is determined with the DBSCAN clustering method, data greater than or equal to the radius are marked, and the marks are used to adjust priority; combined with the sorting, this yields the final sorting result, which serves as the basis for intelligently removing duplicate items.
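A multi-dimensional sort with a priority mark can be written in a few lines. The sketch below assumes that each record is a dictionary with a hypothetical `id` field and that a marked record should come first among records with identical time, longitude, latitude, and depth; the text does not spell out how the mark adjusts priority, so this tie-breaking rule is only one possible interpretation.

```python
def sort_profiles(records, marked_ids=frozenset()):
    """Sort by time, longitude, latitude and depth; marked records win ties."""
    return sorted(
        records,
        # False sorts before True, so a marked record precedes an unmarked one
        # when all four dimensions are equal.
        key=lambda r: (r["time"], r["lon"], r["lat"], r["depth"], r["id"] not in marked_ids),
    )
```

Because Python's sort is stable, records that are equal on every component of the key keep their original loading order.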

Step 3, intelligently deduplicating the sorted data files

Data belonging to the same profile are intelligently deduplicated according to the sorting result of step 2. The intelligent deduplication used in this embodiment is a function built into the clustering calculation module, which contains deduplication rules and a clustering model and performs deduplication automatically according to the rules.

The deduplication rules are the main basis for the deduplication operation and are configured precisely according to the requirements of each discipline. Examples are as follows:

void on_comboBox_lat_lon_currentTextChanged(const QString &arg1);

Precision rule for latitude and longitude. For example, with a precision of 0.01, latitude and longitude are rounded to two decimal places, and values that agree to two decimal places are treated as identical when data are compared.

void on_comboBox_psal_currentTextChanged(const QString &arg1);

Precision rule for salinity data.

def removeDulpicate(self,df,same_dup):

Profiles that are completely identical are deleted directly as duplicates.
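The effect of the precision rules on duplicate detection can be sketched as follows. This is not the `removeDulpicate` method of the embodiment, whose body is not given; the profile layout, the default precisions, and the function names `profile_key` and `remove_duplicates` are assumptions introduced for illustration.

```python
def profile_key(profile, latlon_precision=0.01, psal_precision=0.001):
    """Build a comparison key after snapping values to the configured precisions."""
    def snap(value, precision):
        # Integer number of precision steps, e.g. 120.004 at 0.01 gives 12000.
        return round(value / precision)

    levels = tuple(
        (depth, snap(psal, psal_precision)) for depth, psal in profile["levels"]
    )
    return (
        profile["time"],
        snap(profile["lat"], latlon_precision),
        snap(profile["lon"], latlon_precision),
        levels,
    )

def remove_duplicates(profiles, **precisions):
    """Keep the first profile for each key; later identical profiles are dropped."""
    seen, kept = set(), []
    for p in profiles:
        key = profile_key(p, **precisions)
        if key not in seen:
            seen.add(key)
            kept.append(p)
    return kept
```

A coarser precision merges more profiles: two positions that differ only in the third decimal place collapse to one key at a precision of 0.01 but stay distinct at 0.001.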

Step 4, checking the deduplicated data files and marking the quality flags

The quality calculation module performs quality detection on each file produced by clustering integration, file by file, profile by profile, and method by method, using the detection methods matched to the characteristics of the subject. If bad data are found, the corresponding quality flag is changed to 4 (bad flag) and saved in the nc file. This embodiment uses an nc file as an example.

(1) Reading profile data from nc files

self.profiles = argo.profile_from_nc(filePath, nc)

prof_count = len(self.profiles)

(2) Reading quality control parameters from a configuration file

import json

cfg_filePath = "./argo.json"

# Load the quality control parameters; the file is closed automatically.
with open(cfg_filePath, 'r') as file:
    cfg = json.load(file)

pqc = ProfileQC(prof, cfg=cfg, csv=greylist, metafile=metafile)

(3) Quality detection by detection method

The quality calculation module integrates quality inspection methods for the various data types of each discipline. For example, the tukey53H_norm detection method is used to check temperature and salinity files, as follows:

The tukey53H_norm detection method takes advantage of the robustness of the median to create a smoother data series, which is then compared with the observed values. The difference is normalized by the standard deviation of the observed data series after the large-scale variability has been removed.

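For a single measurement $x_i$, where $i$ is the position of the observation, the test is evaluated as follows:

$x^{(1)}$ is the median of the five points from $x_{i-2}$ to $x_{i+2}$;

$x^{(2)}$ is the median of the three points from $x^{(1)}_{i-1}$ to $x^{(1)}_{i+1}$;

$x^{(3)}$ is defined by the Hanning smoothing filter: $x^{(3)} = \frac{1}{4}\left(x^{(2)}_{i-1} + 2x^{(2)}_{i} + x^{(2)}_{i+1}\right)$;

if $\left|x_i - x^{(3)}\right| / \sigma > k$, then $x_i$ is a spike, where $\sigma$ is the standard deviation of the low-pass filtered data.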

(4) Marking quality flags 1 to 4 according to the quality test result

By default, if the value produced by the test is greater than k = 1.5, the data point is flagged 4; if it is lower than k = 1.5, it is flagged 1.
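The spike test and the flagging rule can be sketched in pure Python. This is an illustrative reading of the Tukey 53H procedure (median of five, median of three, Hanning filter), not the code used in the embodiment. Two choices are assumptions: σ is taken as the standard deviation of the Hanning-smoothed series, since the low-pass filter is not specified here, and points too close to the ends of the series to be evaluated receive flag 0, meaning not evaluated.

```python
from statistics import median, pstdev

def tukey53h_norm(x):
    """Return |x_i - x3_i| / sigma per point, or None where the test is undefined."""
    n = len(x)
    x1 = [None] * n
    for i in range(2, n - 2):
        x1[i] = median(x[i - 2:i + 3])                       # median of five points
    x2 = [None] * n
    for i in range(3, n - 3):
        x2[i] = median(x1[i - 1:i + 2])                      # median of three points of x1
    x3 = [None] * n
    for i in range(4, n - 4):
        x3[i] = 0.25 * (x2[i - 1] + 2 * x2[i] + x2[i + 1])   # Hanning smoothing filter
    smooth = [v for v in x3 if v is not None]
    # Assumption: sigma is computed from the smoothed series.
    sigma = pstdev(smooth) if len(smooth) > 1 else 0.0
    if sigma == 0.0:
        return [None] * n
    return [None if v is None else abs(xi - v) / sigma for xi, v in zip(x, x3)]

def flags(x, k=1.5):
    """Flag 4 above the threshold k, 1 otherwise, 0 where the test is undefined."""
    return [0 if s is None else (4 if s > k else 1) for s in tukey53h_norm(x)]
```

For a gently increasing series with one isolated outlier, the medians suppress the outlier in the smoothed series, so the normalized difference at that point is large and only that point is flagged 4.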

Step 6, outputting the temperature and salinity file and writing it into the water body comprehensive library

The temperature and salinity nc file produced by the clustering calculation and quality calculation modules is output directly to the water body comprehensive library by the parsing and loading program.

According to the constructed parsing configuration file, the parsing and loading program converts the content of the generated standard nc file into the corresponding Java objects and writes the parsed content into the water body comprehensive library in batches.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. The method for integrating the marine water environment data of the marine comprehensive database is characterized by comprising the following steps of:

2. The method of claim 1, wherein: step S1 specifically includes:

3. The method of claim 1, wherein: step S2 specifically includes:

and S22, marking the data according to the quality detection result.

4. The method of claim 1, wherein: in step S3, the data file verified in step S2 is converted in format according to the parsing configuration file and stored in the database.