CN113111116B - Ocean water environment data integration method of ocean comprehensive database - Google Patents

Info

Publication number
CN113111116B
Authority
CN
China
Prior art keywords
data
quality
file
statistical
integration
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110516191.7A
Other languages
Chinese (zh)
Other versions
CN113111116A (en
Inventor
杨锦坤
宋晓
韩璐遥
刘玉龙
苗庆生
徐珊珊
董明媚
宁鹏飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NATIONAL MARINE DATA AND INFORMATION SERVICE
Original Assignee
NATIONAL MARINE DATA AND INFORMATION SERVICE
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NATIONAL MARINE DATA AND INFORMATION SERVICE
Priority to CN202110516191.7A
Publication of CN113111116A
Application granted
Publication of CN113111116B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/258 Data format conversion from or to a database
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification

Abstract

The invention provides a marine water body environment data integration method for a marine comprehensive database, comprising the following steps: S1, loading and sorting marine water environment data files, and intelligently deduplicating the sorted data files; S2, verifying the deduplicated data files against quality ranges, and storing the verification results; and S3, parsing the verified data files to obtain data, and storing the data in a database. The method reduces manual workload and operating errors, improves integration efficiency, and enables fine-grained, platform-based integration of marine water body environment resources.

Description

Ocean water environment data integration method of ocean comprehensive database
Technical Field
The invention belongs to the technical field of data integration, and particularly relates to a marine water body environment data integration method of a marine comprehensive database.
Background
Efficient and accurate integration of marine environment data, which in turn enables efficient use of that data, is a key research direction in the field.
Current methods for collecting and processing marine environment information commonly suffer from low data quality, incomplete dimensions, missing data and questionable reliability. In the conventional approach, integration methods and rules are matched by hand to the original state of each type of data to be integrated, forming one-to-one or one-to-many integration strategies; the rules of the integration flow are then configured in a data integration tool such as an ETL (extract, transform, load) tool to form the integration flow. As a result, practitioners face a heavy workload, rules are easily misconfigured, and repeated testing and tuning is often required.
To process and integrate the many kinds of complex marine water environment data efficiently and in an orderly way, the invention starts from the problems of existing processing schemes and explores a data integration method suited to the particular application scenario of the marine environment. It shifts the integration approach from manually configured integration rules to processing by an integration device, with the aims of reducing manual workload, reducing operating errors and improving integration efficiency. The resulting marine water environment data integration method for a marine comprehensive database enables fine-grained, platform-based integration of marine water environment resources.
Disclosure of Invention
In view of this, the invention aims to provide a method for integrating marine water environment data of a marine comprehensive database, so as to avoid manual processing errors and improve the integration efficiency of the water environment data.
In order to achieve the purpose, the technical scheme of the invention is realized as follows:
The marine water body environment data integration method for a marine comprehensive database comprises the following steps:
S1, loading and sorting marine water environment data files, and intelligently deduplicating the sorted data files;
S2, verifying the deduplicated data files against quality ranges, and storing the verification results;
S3, parsing the verified data files to obtain data, and storing the data in a database.
Further, step S1 specifically includes:
S11, loading the parsed marine water environment data files to form standard-format files;
S12, sorting the data in the standard-format files along different dimensions, the sorting result serving as the basis for intelligent deduplication;
S13, classifying the sorted data files with different clustering algorithms according to their data characteristics, marking the classified data, and intelligently deduplicating it.
Further, step S2 specifically includes:
S21, performing data quality detection on each intelligently deduplicated data file using the detection method matched to its data characteristics;
S22, marking the data according to the quality detection results.
Further, in step S3, the data files verified in step S2 are format-converted according to the parsing configuration file and then stored in the database.
Compared with the prior art, the method has the following advantages:
The marine water environment data integration method reduces manual workload and operating errors, improves integration efficiency, and enables fine-grained, platform-based integration of marine water environment resources.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate an embodiment of the invention and, together with the description, serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 is a schematic deployment diagram of the integration device platform according to an embodiment of the present invention;
FIG. 2 is a data processing flow diagram of the clustering calculation module according to an embodiment of the present invention;
FIG. 3 is a detailed processing flow diagram of the quality calculation module according to an embodiment of the present invention;
FIG. 4 is a flowchart of output to the water body comprehensive library according to an embodiment of the present invention.
Detailed Description
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.
The marine water body environment data integration method for a marine comprehensive database comprises the following steps:
Step 1, loading the parsed marine water environment data files to form standard-format files;
Step 2, sorting the data in the standard-format files along different dimensions;
Step 3, classifying the sorted data files with different clustering algorithms according to their data characteristics, marking the classified data, and intelligently deduplicating the data according to the marks;
Step 4, performing data quality detection on each intelligently deduplicated data file using the detection method matched to its data characteristics;
Step 5, assigning quality flags to the data according to the quality detection results;
Step 6, format-converting the data files verified in step 5 according to the parsing configuration file, and storing them in the database.
The invention relates to a marine water environment data integration method for a marine comprehensive database, and in particular to a method that uses an "integration device platform" to output multi-source marine water environment data automatically into a marine water body integration database. This avoids manual processing errors and improves the efficiency of integrating water body environment data. By constructing an integration device, the method provides intelligent integration of data into the database, processes and integrates complex multi-source data, and forms a water body database.
The integration device is the processing center of the invention: the intelligent processing component that integrates marine water environment data and forms the marine water body comprehensive database. It comprises a clustering calculation module, a quality calculation module, and a module for output to the water body comprehensive library. In a specific application scenario, the method is realized by building a platform from server hardware and software processing modules.
As shown in FIG. 1, the integration device platform is deployed on a data loading server. The integration device performs platform-based integration of the loaded water body data resources to form the water body comprehensive library, which is ultimately stored on a database server.
1. Clustering calculation module
The clustering calculation module is the data integration component of the integration device platform. Using a built-in deduplication technique, it automatically sorts the loaded marine water environment data from various sources along multiple dimensions such as time, longitude, latitude and depth, and intelligently removes duplicates according to the sorting result. FIG. 2 shows the data processing flow of the clustering calculation module.
1) Automatic sorting: the clustering calculation module automatically derives a sorting method for data from different sources by training on historical data, so that the platform, by communicating with the loading server, can form a basis for judgment and provide basic judgment information for the sorting and deduplication performed by the clustering model.
According to the data requirements of the marine water body comprehensive library, water body data is divided into nine disciplines: water temperature, salinity, water level, ocean current, waves, water color, transparency, sea luminescence and sea ice. The built-in sorting method is based on the characteristics of each discipline.
2) Clustering model: each kind of water body environment data is cluster-analyzed along dimensions such as time, longitude, latitude and depth; data judged to be repeated within the range is marked, and redundant data is identified and deleted according to the sort order to form a standard data set.
Different clustering algorithms are matched to the different data characteristics of the water environment data in order to obtain a more accurate deduplication result. For example, DBSCAN clustering is used for temperature and salinity observation data: the radius is determined by training on historical data, data at or beyond the radius is marked, and the remaining data is marked separately, to serve as the basis for intelligent deduplication.
3) Intelligent deduplication: the multi-source data is compared along time, longitude, latitude, depth and other dimensions according to the sorting result and the cluster analysis result, and an nc file is generated according to the sorting marks.
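The clustering step can be illustrated with a minimal pure-Python DBSCAN. This is only a sketch under stated assumptions: the patent does not give its distance measure, feature scaling or parameters, so the Euclidean distance on (longitude, latitude) pairs, the hand-picked `eps` (the patent derives the radius from historical data) and `min_pts` are all hypothetical. Here, points that share a cluster label are treated as candidate duplicates and noise points (label -1) as unique observations, which is one possible reading of the marking described above.

```python
import math

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id >= 0, or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is a core point: keep expanding
    return labels

# Hypothetical (longitude, latitude) positions; the first three nearly coincide.
positions = [(120.50, 35.10), (120.50, 35.10), (120.501, 35.10), (121.00, 36.00)]
labels = dbscan(positions, eps=0.01, min_pts=2)
```

In practice the time and depth dimensions would need to be scaled to comparable units before being added to the distance, which this sketch leaves out.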
2. Quality calculation module
The quality calculation module performs domain-specific quality control based on the characteristics of marine operations: it verifies that data falls within the quality range given by the operational standard specification for each element, writes the verification result to the corresponding quality field of the data record, and thereby ensures high-quality output of the integrated water environment data.
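A range check of this kind might look like the following sketch. The valid ranges, the field names and the `_qc` suffix are hypothetical and chosen only for illustration; the patent takes the actual ranges from the standard specification of each element. The flag values 1 (good) and 4 (bad) follow the flags used elsewhere in this description.

```python
GOOD, BAD = 1, 4  # quality flag values used in this description

# Hypothetical valid ranges; real values come from the element specification.
VALID_RANGE = {"temp": (-2.5, 40.0), "psal": (0.0, 41.0)}

def range_flag(name, value):
    """Return GOOD if value lies inside the valid range of the element, else BAD."""
    low, high = VALID_RANGE[name]
    return GOOD if low <= value <= high else BAD

def check_record(record):
    """Build the quality fields for every known element present in the record."""
    return {f"{name}_qc": range_flag(name, record[name])
            for name in VALID_RANGE if name in record}

result = check_record({"temp": 15.2, "psal": 55.0})
```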
The quality calculation module selects the detection method that matches the discipline characteristics and checks each nc file produced by clustering integration, file by file, profile by profile and method by method. When bad data is found, the corresponding quality flag is changed to 4 (bad flag) and saved in the nc file. FIG. 3 shows the detailed processing flow of the quality calculation module, which comprises the following steps:
Step 51, acquiring the basic library data table to be processed;
Step 52, obtaining from the log table the cutoff time of the last data extraction, together with the current system time;
Step 53, obtaining from the quality flag configuration table the names of the element fields to be processed in the current data table and the names of their corresponding quality flag fields;
Step 54, updating the log table state: quality flag calculation for the table has started;
Step 55, determining whether any element fields remain to be processed; if so, going to step 56, and if not, going to step 63;
Step 56, obtaining the valid value range of the element field to be processed from the element specification table;
Step 57, generating the SQL statement that evaluates and updates the quality flag field;
Step 58, generating the SQL statement for the summary statistics temporary table: aggregating the statistics under the statistical logical primary keys corresponding to all incremental data, and inserting them into a temporary table;
Step 59, generating the SQL statement that deletes from the quality flag statistics intermediate table: deleting from the intermediate table all records whose statistical logical primary key exists in the temporary table;
Step 60, generating the SQL statement that inserts into the quality flag statistics intermediate table: aggregating the data in the temporary table by statistical logical primary key, and inserting it into the statistics intermediate table;
Step 61, generating the SQL statement that drops the temporary table;
Step 62, executing the generated SQL statements in sequence;
Step 63, updating the log table state: execution finished.
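Steps 57 to 62 can be sketched with Python's built-in `sqlite3` module. The table names (`obs`, `qc_stats`, `tmp_stats`), the column names and the use of `station_id` as the statistical logical primary key are assumptions made for illustration; the patent does not disclose its schema or database engine. For brevity the sketch processes the whole table rather than only incremental data, and omits the log table updates.

```python
import sqlite3

def build_statements(table, field, qc_field, low, high, key="station_id"):
    """Generate the SQL of steps 57-61. Identifiers are interpolated directly,
    so they must come from trusted configuration, never from user input."""
    return [
        # Step 57: flag values inside the valid range as 1 (good), others as 4 (bad).
        f"UPDATE {table} SET {qc_field} = "
        f"CASE WHEN {field} BETWEEN {low} AND {high} THEN 1 ELSE 4 END",
        # Step 58: aggregate flag counts per key into a temporary table.
        f"CREATE TEMP TABLE tmp_stats AS "
        f"SELECT {key}, {qc_field} AS flag, COUNT(*) AS n "
        f"FROM {table} GROUP BY {key}, {qc_field}",
        # Step 59: remove stale statistics for every key present in the temp table.
        f"DELETE FROM qc_stats WHERE {key} IN (SELECT {key} FROM tmp_stats)",
        # Step 60: insert the fresh statistics into the intermediate table.
        f"INSERT INTO qc_stats ({key}, flag, n) SELECT {key}, flag, n FROM tmp_stats",
        # Step 61: drop the temporary table.
        "DROP TABLE tmp_stats",
    ]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (station_id TEXT, temp REAL, temp_qc INTEGER)")
conn.execute("CREATE TABLE qc_stats (station_id TEXT, flag INTEGER, n INTEGER)")
conn.executemany("INSERT INTO obs VALUES (?, ?, 0)",
                 [("A", 12.5), ("A", 99.0), ("B", 8.1)])

# Step 62: execute the generated statements in sequence.
for sql in build_statements("obs", "temp", "temp_qc", -2.5, 40.0):
    conn.execute(sql)

stats = conn.execute(
    "SELECT station_id, flag, n FROM qc_stats ORDER BY station_id, flag").fetchall()
```

Deleting and re-inserting by key, as in steps 59 and 60, keeps the statistics table consistent when the same key is processed again in a later run.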
3. Output to water body comprehensive library module
The nc files produced by the clustering calculation and quality calculation modules are output directly to the water body comprehensive library.
To output to the water body comprehensive library, a parsing configuration file is constructed; a loading program parses the configuration file, converts the content of the generated standard nc files into corresponding Java objects, and writes the parsed content to the water body comprehensive library in batches.
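The configuration-driven conversion and batch write can be sketched as follows. The patent's loader produces Java objects; this illustration uses Python dataclasses instead, and starts from rows that have already been read out of an nc file. The mapping in `PARSE_CONFIG`, the `Observation` class, the `water_body` table and the batch size are all hypothetical.

```python
import sqlite3
from dataclasses import dataclass, astuple

# Hypothetical parsing configuration: nc variable name -> database column.
PARSE_CONFIG = {"TEMP": "temp", "PSAL": "psal", "TEMP_QC": "temp_qc"}

@dataclass
class Observation:
    temp: float
    psal: float
    temp_qc: int

def to_objects(rows):
    """Convert rows keyed by nc variable names into Observation objects."""
    return [Observation(**{PARSE_CONFIG[k]: v for k, v in row.items()
                           if k in PARSE_CONFIG})
            for row in rows]

def write_batches(conn, objects, batch_size=2):
    """Insert the objects in fixed-size batches, committing after each batch."""
    for start in range(0, len(objects), batch_size):
        batch = objects[start:start + batch_size]
        conn.executemany("INSERT INTO water_body VALUES (?, ?, ?)",
                         [astuple(obj) for obj in batch])
        conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE water_body (temp REAL, psal REAL, temp_qc INTEGER)")
rows = [
    {"TEMP": 12.5, "PSAL": 34.5, "TEMP_QC": 1},
    {"TEMP": 99.0, "PSAL": 34.6, "TEMP_QC": 4},
    {"TEMP": 8.1, "PSAL": 33.9, "TEMP_QC": 1},
]
write_batches(conn, to_objects(rows))
written = conn.execute("SELECT COUNT(*) FROM water_body").fetchone()[0]
```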
The invention is illustrated below with a specific example: integrating the temperature and salinity (thermohaline) data of the marine water environment.
Thermohaline data mainly comprises observation information (time, longitude, latitude, depth, water temperature, salinity, density and sound velocity) and auxiliary information (observing instruments, navigation equipment, special projects, tasks and the like), which together form a complete and complex thermohaline data structure.
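One way to picture such a record is the dataclass below. The class name, field names, units and optional defaults are assumptions for illustration only; the patent lists the kinds of information but not a concrete schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class ThermohalineRecord:
    # Observation information
    time: datetime
    longitude: float                        # degrees east (assumed unit)
    latitude: float                         # degrees north (assumed unit)
    depth: float                            # metres (assumed unit)
    temperature: Optional[float] = None     # water temperature
    salinity: Optional[float] = None
    density: Optional[float] = None
    sound_velocity: Optional[float] = None
    # Auxiliary information
    instrument: str = ""
    navigation_equipment: str = ""
    special_project: str = ""
    task: str = ""
    # Quality flags per element, e.g. {"temperature": 1}
    qc_flags: dict = field(default_factory=dict)

record = ThermohalineRecord(
    time=datetime(2020, 5, 1), longitude=120.5, latitude=35.1, depth=10.0,
    temperature=15.2, salinity=34.5, instrument="CTD",
)
```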
Step 1, loading thermohaline data files from various sources
The clustering calculation module loads standard files, such as csv and nc files, produced by parsing marine water body temperature and salinity data from various sources.
Step 2, automatically sorting the loaded files along multiple dimensions
Using its built-in deduplication technique, the clustering calculation module automatically sorts the standard thermohaline files (csv, nc and other formats) produced by parsing each source, along marine water body profile dimensions such as time, longitude, latitude and depth.
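A multi-dimensional sort of this kind can be sketched with a tuple sort key. The `Record` type, the key order (time, then longitude, latitude and depth) and the sample values are assumptions; the patent names the dimensions but does not fix their priority.

```python
from datetime import datetime
from typing import NamedTuple

class Record(NamedTuple):
    time: datetime
    lon: float
    lat: float
    depth: float
    source: str

def sort_records(records):
    """Sort by time first, then longitude, latitude and depth."""
    return sorted(records, key=lambda r: (r.time, r.lon, r.lat, r.depth))

records = [
    Record(datetime(2020, 5, 2), 120.50, 35.10, 10.0, "cruise_b"),
    Record(datetime(2020, 5, 1), 120.50, 35.10, 20.0, "cruise_a"),
    Record(datetime(2020, 5, 1), 120.50, 35.10, 10.0, "argo"),
]
ordered = sort_records(records)
```

After sorting, records that describe the same place and time sit next to each other, which makes the later duplicate comparison a matter of inspecting neighbours.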
The radius is determined with the DBSCAN clustering method, and data at or beyond the radius is marked; the marks are used to adjust priority and, combined with the sort order, form the final sorting result, which serves as the basis for intelligently removing duplicates.
Step 3, intelligently deduplicating the sorted data files
Duplicates among data of the same profile are removed intelligently according to the sorting result of step 2. The intelligent deduplication used in this embodiment is a function built into the clustering calculation module, which contains deduplication rules and a clustering model and performs deduplication automatically according to the rules.
The deduplication rules are the main basis for the deduplication operation and are configured precisely according to the requirements of each discipline. Examples are as follows:
void on_comboBox_lat_lon_currentTextChanged(const QString&arg1);
For example, if the precision rule for longitude and latitude is 0.01, longitude and latitude are kept to 2 decimal places; when data is compared, values that agree to 2 decimal places are treated as identical.
void on_comboBox_psal_currentTextChanged(const QString&arg1);
Precision rule for salinity data.
def removeDulpicate(self,df,same_dup):
For completely identical profiles, the duplicate is deleted directly.
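The precision rules and the removal of identical entries can be combined in a short sketch. It is not the patent's `removeDulpicate` implementation, whose body is not disclosed. The salinity precision of 0.001, the use of rounding (rather than truncation) and the record layout are assumptions; only the 0.01 precision for longitude and latitude comes from the text.

```python
from decimal import Decimal, ROUND_HALF_UP

# Precision per field; only the 0.01 for lon/lat is taken from the text.
PRECISION = {"lon": "0.01", "lat": "0.01", "psal": "0.001"}

def rounded(value, step):
    """Round value to the given step, going through str to avoid float noise."""
    return Decimal(str(value)).quantize(Decimal(step), rounding=ROUND_HALF_UP)

def dedup_key(record):
    """Records with equal keys are considered duplicates of each other."""
    return (record["time"], record["depth"]) + tuple(
        rounded(record[name], step) for name, step in PRECISION.items())

def remove_duplicates(records):
    """Keep the first record for each key, preserving the input order."""
    seen = set()
    kept = []
    for record in records:
        key = dedup_key(record)
        if key not in seen:
            seen.add(key)
            kept.append(record)
    return kept

records = [
    {"time": "2020-05-01T00:00", "lon": 120.504, "lat": 35.101, "depth": 10.0, "psal": 34.5001},
    {"time": "2020-05-01T00:00", "lon": 120.5041, "lat": 35.1049, "depth": 10.0, "psal": 34.5004},
    {"time": "2020-05-01T00:00", "lon": 120.506, "lat": 35.101, "depth": 10.0, "psal": 34.5001},
]
kept = remove_duplicates(records)
```

Because the first record of each key is kept, the result depends on the input order, which is why sorting (and any priority adjustment) has to happen before deduplication.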
Step 4, checking the deduplicated data files and marking quality flags
The quality calculation module matches a detection method to each file produced by clustering integration according to its discipline characteristics, and performs quality detection file by file, profile by profile and method by method. When bad data is found, the corresponding quality flag is changed to 4 (bad flag) and saved in the nc file. This embodiment uses an nc file as the example.
(1) Reading profile data from nc files
self.profiles=argo.profile_from_nc(filePath,nc)
prof_count=len(self.profiles)
(2) Reading quality control parameters from a configuration file
import json
cfg_filePath="./argo.json"
# Open the configuration file in a with-block so it is closed after loading
with open(cfg_filePath,'r') as file:
    cfg=json.load(file)
The parameters are then passed to the quality control object created for each profile:
pqc=ProfileQC(prof,cfg=cfg,csv=greylist,metafile=metafile)
(3) Quality detection by detection method
The quality calculation module integrates quality test methods for the various data types of each discipline. For example, the tukey53H_norm detection method is used to quality-check water body temperature and salinity files, as follows:
The tukey53H_norm detection method exploits the robustness of the median to create a smoother data sequence, which is then compared with the observed values. The difference is normalized by the standard deviation of the observed data sequence after the large-scale variability has been removed.
A single measurement $x_i$, where $i$ is the position of the observation, is evaluated as follows:
$x^{(1)}_i$ is the median of the five points from $x_{i-2}$ to $x_{i+2}$;
$x^{(2)}_i$ is the median of the three points from $x^{(1)}_{i-1}$ to $x^{(1)}_{i+1}$;
$x^{(3)}_i$ is defined by the Hanning smoothing filter:
$$x^{(3)}_i = \frac{1}{4}\left(x^{(2)}_{i-1} + 2x^{(2)}_i + x^{(2)}_{i+1}\right)$$
If
$$\frac{\left|x_i - x^{(3)}_i\right|}{\sigma} > k$$
then $x_i$ is a spike, where $\sigma$ is the standard deviation of the low-pass filtered data and $k$ is the test threshold.
(4) Marking quality flags 1-4 according to the quality test result
By default, the system assigns flag 4 if the value produced by the test is greater than k = 1.5, and flag 1 if it is below k = 1.5.
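The spike test and the default flagging can be sketched in pure Python as follows. This is an illustration of the Tukey 53H procedure under stated assumptions, not the system's own code: here sigma is taken as the standard deviation of the smoothed series, which is one reading of "low-pass filtered data", and points too close to either end of the profile for the filter windows to fit are left unflagged (`None`). The sample temperatures are invented.

```python
from statistics import median, pstdev

GOOD, BAD = 1, 4  # quality flags used in the text

def tukey53h_norm(x):
    """Return |x_i - x3_i| / sigma per point; None where the windows do not fit."""
    n = len(x)
    x1 = [None] * n  # running median of five points
    for i in range(2, n - 2):
        x1[i] = median(x[i - 2:i + 3])
    x2 = [None] * n  # running median of three points of x1
    for i in range(3, n - 3):
        x2[i] = median(x1[i - 1:i + 2])
    x3 = [None] * n  # Hanning-smoothed x2
    for i in range(4, n - 4):
        x3[i] = 0.25 * (x2[i - 1] + 2 * x2[i] + x2[i + 1])
    smooth = [v for v in x3 if v is not None]
    # Assumption: sigma is the standard deviation of the smoothed series x3.
    sigma = pstdev(smooth) if len(smooth) > 1 else 0.0
    if sigma == 0.0:
        return [None] * n  # too little variability to normalize
    return [None if s is None else abs(v - s) / sigma for v, s in zip(x, x3)]

def flag(values, k=1.5):
    """Flag 4 (bad) when the test value exceeds k, otherwise 1 (good)."""
    return [None if v is None else (BAD if v > k else GOOD) for v in values]

# Invented temperature profile with one obvious spike at index 5.
temps = [10.0, 10.1, 10.2, 10.3, 10.4, 15.0, 10.6, 10.7, 10.8, 10.9, 11.0]
flags = flag(tukey53h_norm(temps))
```

With eleven points, only indices 4 to 6 have full windows, so the remaining positions stay unflagged in this sketch; a production implementation would need a policy for the profile edges.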
Step 6, outputting the thermohaline files and writing them to the water body comprehensive library
The thermohaline nc files produced by the clustering calculation and quality calculation modules are output directly to the water body comprehensive library by a parsing and loading program.
According to the constructed parsing configuration file, the parsing and loading program converts the content of the generated standard nc files into corresponding Java objects and writes the parsed content to the water body comprehensive library in batches.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (3)

1. A marine water environment data integration method for a marine comprehensive database, characterized by comprising the following steps:
S1, loading and sorting marine water environment data files, and intelligently deduplicating the sorted data files;
S2, verifying the deduplicated data files against quality ranges, and storing the verification results;
S3, parsing the verified data files to obtain data, and storing the data in a database;
wherein step S2 includes the following steps:
S21, performing data quality detection on each intelligently deduplicated data file using the detection method matched to its data characteristics;
S22, assigning quality flags to the data according to the quality detection results;
and wherein S21 comprises the following steps:
Step 51, acquiring the basic library data table to be processed;
Step 52, obtaining from the log table the cutoff time of the last data extraction, together with the current system time;
Step 53, obtaining from the quality flag configuration table the names of the element fields to be processed in the current data table and the names of their corresponding quality flag fields;
Step 54, updating the log table state: quality flag calculation for the table has started;
Step 55, determining whether any element fields remain to be processed; if so, going to step 56, and if not, going to step 63;
Step 56, obtaining the valid value range of the element field to be processed from the element specification table; Step 57, generating the SQL statement that evaluates and updates the quality flag field;
Step 58, generating the SQL statement for the summary statistics temporary table: aggregating the statistics under the statistical logical primary keys corresponding to all incremental data, and inserting them into a temporary table;
Step 59, generating the SQL statement that deletes from the quality flag statistics intermediate table: deleting from the intermediate table all records whose statistical logical primary key exists in the temporary table;
Step 60, generating the SQL statement that inserts into the quality flag statistics intermediate table: aggregating the data in the temporary table by statistical logical primary key, and inserting it into the statistics intermediate table;
Step 61, generating the SQL statement that drops the temporary table;
Step 62, executing the generated SQL statements in sequence;
Step 63, updating the log table state: execution finished.
2. The method of claim 1, wherein step S1 specifically includes: S11, loading the parsed marine water environment data files to form standard-format files;
S12, sorting the data in the standard-format files along different dimensions, the sorting result serving as the basis for intelligent deduplication;
S13, classifying the sorted data files with different clustering algorithms according to their data characteristics, marking the classified data, and intelligently deduplicating it.
3. The method of claim 1, wherein in step S3 the data files verified in step S2 are format-converted according to the parsing configuration file and then stored in a database.
CN202110516191.7A 2021-05-12 2021-05-12 Ocean water environment data integration method of ocean comprehensive database Active CN113111116B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110516191.7A CN113111116B (en) 2021-05-12 2021-05-12 Ocean water environment data integration method of ocean comprehensive database

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110516191.7A CN113111116B (en) 2021-05-12 2021-05-12 Ocean water environment data integration method of ocean comprehensive database

Publications (2)

Publication Number Publication Date
CN113111116A CN113111116A (en) 2021-07-13
CN113111116B true CN113111116B (en) 2022-10-18

Family

ID=76722395

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110516191.7A Active CN113111116B (en) 2021-05-12 2021-05-12 Ocean water environment data integration method of ocean comprehensive database

Country Status (1)

Country Link
CN (1) CN113111116B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110716897A (en) * 2019-10-15 2020-01-21 北部湾大学 Cloud computing-based marine archive database parallelization construction method and device
CN110991940A (en) * 2019-12-24 2020-04-10 国家卫星海洋应用中心 Ocean observation data product quality online inspection method and device and server

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101197876B (en) * 2006-12-06 2012-02-29 中兴通讯股份有限公司 Method and system for multi-dimensional analysis of message service data
CN103678423A (en) * 2012-09-26 2014-03-26 深圳市世纪光速信息技术有限公司 Data file input system, device and method
CN104199907B (en) * 2014-08-28 2017-08-25 广州华多网络科技有限公司 Insert the method and device of data
CN110941593B (en) * 2019-12-03 2022-07-26 浪潮卓数大数据产业发展有限公司 File warehousing system and method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"基于数字海洋框架的海洋资料整合与共享服务管理模式浅析———以海洋公益性行业科研专项经费项目为例";耿姗姗;《海洋开发与管理》;20150215;第6-8页 *
"非信息化海洋环境历史资料抢救流程设计与关键技术研究";杨锦坤;《海洋信息》;20150815;第33-36页 *

Also Published As

Publication number Publication date
CN113111116A (en) 2021-07-13

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant