CN113111116A - Ocean water environment data integration method of ocean comprehensive database - Google Patents

Publication number
CN113111116A
Authority
CN
China
Prior art keywords
data
file
marine
integration
database
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110516191.7A
Other languages
Chinese (zh)
Other versions
CN113111116B (en)
Inventor
杨锦坤
宋晓
韩璐遥
刘玉龙
苗庆生
徐珊珊
董明媚
宁鹏飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NATIONAL MARINE DATA AND INFORMATION SERVICE
Original Assignee
NATIONAL MARINE DATA AND INFORMATION SERVICE
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NATIONAL MARINE DATA AND INFORMATION SERVICE filed Critical NATIONAL MARINE DATA AND INFORMATION SERVICE
Priority to CN202110516191.7A priority Critical patent/CN113111116B/en
Publication of CN113111116A publication Critical patent/CN113111116A/en
Application granted granted Critical
Publication of CN113111116B publication Critical patent/CN113111116B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/258 Data format conversion from or to a database
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a marine water body environment data integration method for a marine comprehensive database, comprising the following steps: S1, loading and sorting the marine water body environment data files, and intelligently deduplicating the sorted data files; S2, verifying the deduplicated data files against quality ranges and storing the verification results; and S3, parsing the verified data files to obtain data, and storing the data in a database. The method reduces manual work, reduces operating errors, improves integration efficiency, and provides fine-grained, platform-based integration of marine water body environment resources.

Description

Ocean water environment data integration method of ocean comprehensive database
Technical Field
The invention belongs to the technical field of data integration, and particularly relates to a marine water body environment data integration method of a marine comprehensive database.
Background
Efficient and accurate integration of marine environment data, and thereby efficient use of that data, is a key research direction in marine environment data integration.
At present, the various methods for collecting and processing marine environment data commonly suffer from low data quality, incomplete dimensions, missing data, and unreliable values. In addition, in the traditional data integration mode, integration methods and rules are matched manually to the original state of each type of data to be integrated, forming one-to-one or one-to-many integration strategies; integration flow rules are then configured in ETL or other data integration tools to form the integration flow. Operators therefore face a heavy workload, and mistakes in rule settings often force repeated testing and tuning.
Aiming at efficient processing and orderly integration of complex marine water body environment data, the invention starts from the problems of existing processing schemes, explores a data integration method suited to the specific application scenario of the marine environment, and shifts the technical approach from manually configured integration rules to processing by an integration device. With the goals of reducing manual work, reducing operating errors, and improving integration efficiency, it provides a marine water body environment data integration method for a marine comprehensive database, thereby providing fine-grained, platform-based integration of marine water body environment resources.
Disclosure of Invention
In view of this, the invention aims to provide a method for integrating marine water environment data of a marine comprehensive database, so as to avoid manual processing errors and improve the integration efficiency of the water environment data.
In order to achieve the purpose, the technical scheme of the invention is realized as follows:
The marine water body environment data integration method of the marine comprehensive database comprises the following steps:
S1, loading and sorting the marine water body environment data files, and intelligently deduplicating the sorted data files;
S2, verifying the deduplicated data files against quality ranges and storing the verification results;
S3, parsing the verified data files to obtain data, and storing the data in a database.
Further, step S1 specifically includes:
S11, loading the parsed marine water body environment data files to form standard-format files;
S12, sorting the data in the standard-format files by different dimensions, the sorting result serving as the basis for intelligent deduplication;
S13, classifying the sorted data files with different clustering algorithms according to their data characteristics, marking the classified data, and intelligently deduplicating.
Further, step S2 specifically includes:
S21, performing data quality detection on each intelligently deduplicated data file using the detection method matched to its data characteristics;
S22, marking the data according to the quality detection results.
Further, in step S3, the data files verified in step S2 are format-converted according to a parsing configuration file and then stored in the database.
Compared with the prior art, the method has the following advantages:
The marine water body environment data integration method reduces manual work, reduces operating errors, improves integration efficiency, and provides fine-grained, platform-based integration of marine water body environment resources.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate an embodiment of the invention and, together with the description, serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 is a schematic deployment diagram of an integrated device platform according to an embodiment of the present invention;
FIG. 2 is a data processing flow diagram of a cluster computation module according to an embodiment of the present invention;
FIG. 3 is a specific processing flow diagram of a quality calculation module according to an embodiment of the present invention;
FIG. 4 is a flowchart of outputting to a water body integration library according to an embodiment of the present invention.
Detailed Description
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.
The marine water body environment data integration method of the marine comprehensive database comprises the following steps:
Step 1, loading the parsed marine water body environment data files to form standard-format files;
Step 2, sorting the data in the standard-format files by different dimensions;
Step 3, classifying the sorted data files with different clustering algorithms according to their data characteristics, marking the classified data, and intelligently deduplicating the data according to the marks;
Step 4, performing data quality detection on each intelligently deduplicated data file using the detection method matched to its data characteristics;
Step 5, marking the data according to the quality detection results;
Step 6, format-converting the data files verified in step 5 according to the parsing configuration file and storing them in a database.
The invention relates to a marine water body environment data integration method for a marine comprehensive database, and in particular to a method that uses an "integration device platform" to output multi-source marine water body environment data automatically into a marine water body integration database. It avoids manual processing errors and improves the integration efficiency of water body environment data. By constructing an integration device, the method provides intelligent integration from data to database, processes and integrates complex multi-source data, and forms a water body database.
The integration device is the processing center of the invention: the intelligent processing component that integrates marine water body environment data and forms the marine water body comprehensive database. It comprises a clustering calculation module, a quality calculation module, and an output-to-water-body-comprehensive-library module. In a specific application scenario, the method is realized by building a platform from server hardware and software processing modules.
As shown in FIG. 1, the integration device platform is deployed on the data loading server. The integration device performs platform-based integration of the loaded water-body-related data resources to form the water body comprehensive library, which is finally stored on the database server.
1. Clustering calculation module
The clustering calculation module is the data integration component of the integration device platform. Using a built-in deduplication technique, it automatically sorts the loaded marine water body environment data from various sources by dimensions such as time, longitude, latitude, and depth, and intelligently deduplicates according to the sorting result. FIG. 2 is the data processing flow chart of the clustering calculation module.
1) Automatic sorting: the clustering calculation module automatically derives a sorting method for data from different sources by drilling into historical data, so that the platform, through communication with the loading server, can form a basis for judgment and supply basic judgment information for the sorting and deduplication performed by the clustering model.
According to the data requirements of the marine water body comprehensive library, water body data are divided into nine subjects: water temperature, salinity, water level, ocean current, waves, water color, transparency, sea luminescence, and sea ice. The built-in sorting method is based on the characteristics of each subject.
2) Clustering model: the clustering model performs cluster analysis on each kind of water body environment data along dimensions such as time, longitude, latitude, and depth, marks data judged to be repeated within the range, and identifies and deletes redundant data according to the sorting to form a standard data set.
Different clustering algorithms are matched to the water body environment data according to their data characteristics in order to obtain a more accurate deduplication result. For example, DBSCAN clustering is used for temperature-salinity observation data: the radius is determined by training on historical data, data greater than or equal to the radius are marked, and the remaining data are marked separately, the marks serving as the basis for intelligent deduplication.
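The following is a minimal sketch of how DBSCAN could group near-identical records along time, longitude, latitude, and depth so that records sharing a cluster become deduplication candidates. It is a plain-Python illustration rather than the implementation used by the invention: the per-dimension scales, `eps`, `min_pts`, and the sample records are hypothetical, and the radius is fixed by hand here instead of being trained on historical data.

```python
from math import sqrt

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points)
            if sqrt(sum((a - b) ** 2 for a, b in zip(points[i], q))) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: -1 for noise, otherwise a cluster id >= 0."""
    labels = [None] * len(points)          # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = -1                 # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster        # noise reached from a core point becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours) # j is a core point, keep expanding
    return labels

# Hypothetical records: (hours since a reference time, longitude, latitude, depth in m)
records = [
    (0.0, 120.50, 30.20, 10.0),
    (0.1, 120.50, 30.20, 10.0),    # near-copy of the first record from another source
    (48.0, 121.90, 31.70, 50.0),   # unrelated observation
]
scales = (1.0, 0.01, 0.01, 1.0)    # hypothetical tolerance per dimension
points = [tuple(v / s for v, s in zip(r, scales)) for r in records]
labels = dbscan(points, eps=1.0, min_pts=2)
```

Records that receive the same non-negative label are candidates for the same observation; a record labelled -1 has no close neighbour and is kept as it is.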
3) Intelligent deduplication: multi-source data are compared across dimensions such as time, longitude, latitude, and depth according to the sorting result and the cluster analysis result, and an nc file is generated according to the sorting marks.
2. Quality calculation module
The quality calculation module applies professional quality control based on marine business characteristics: it verifies the data against quality ranges according to the business standard specification for each element, writes the verification result to the relevant quality field of the data record, and thereby provides high-quality output for the integrated water body environment data.
The quality calculation module checks each nc file produced by clustering integration according to the detection methods matched to its subject characteristics, file by file, profile by profile, and method by method. When bad data are found, the corresponding quality flag is changed to 4 (bad flag) and saved to the nc file. The specific processing flow of the quality calculation module is shown in FIG. 3 and comprises the following steps:
step 51, acquiring the base library data table to be processed;
step 52, obtaining from the log table the cut-off time of the last data extraction, and obtaining the current system time;
step 53, acquiring from the quality flag configuration table the element field names to be processed in the current data table and the corresponding quality flag field names;
step 54, updating the log table state: quality flag calculation for the table has started;
step 55, judging whether there are element fields to be processed; if so, going to step 56, otherwise going to step 63;
step 56, obtaining from the element specification table the valid value range of each element field to be processed;
step 57, generating the SQL statement that evaluates and updates the quality flag field;
step 58, generating the SQL statement for the summary statistics temporary table: summarizing the statistical results under the statistical logical primary keys corresponding to all incremental data, and inserting them into a temporary table;
step 59, generating the SQL statement that deletes from the quality flag statistics intermediate table: deleting from the intermediate table all records whose statistical logical primary key exists in the temporary table;
step 60, generating the SQL statement that inserts into the quality flag statistics intermediate table: summarizing the data in the temporary table by statistical logical primary key and inserting it into the statistics intermediate table;
step 61, generating the SQL statement that drops the temporary table;
step 62, executing the generated SQL statements in sequence;
step 63, updating the log table state: execution finished.
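The range check at the heart of this flow (steps 56, 57, and 62) can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's standard `sqlite3` module in place of the actual database, the table `base_temp`, its fields, and the valid range of -2.5 to 40.0 are hypothetical, and the summary-statistics steps 58 to 61 are omitted.

```python
import sqlite3

def build_flag_update(table, field, flag_field):
    """Build the UPDATE that flags in-range values 1 and everything else 4.

    Table and field names are interpolated into the statement, so they must
    come from trusted configuration tables; the range bounds are bound as
    parameters when the statement is executed.
    """
    return (
        f"UPDATE {table} SET {flag_field} = "
        f"CASE WHEN {field} BETWEEN ? AND ? THEN 1 ELSE 4 END"
    )

# Hypothetical base table holding one element (water temperature) and its quality flag.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE base_temp (id INTEGER PRIMARY KEY, temp REAL, temp_qc INTEGER)"
)
conn.executemany("INSERT INTO base_temp (temp) VALUES (?)",
                 [(12.5,), (99.0,), (-1.0,)])

# Step 56: valid value range (hypothetical). Step 57: generate the statement.
valid_range = (-2.5, 40.0)
statements = [(build_flag_update("base_temp", "temp", "temp_qc"), valid_range)]

# Step 62: execute the generated statements in sequence.
for sql, params in statements:
    conn.execute(sql, params)

flags = [row[0] for row in
         conn.execute("SELECT temp_qc FROM base_temp ORDER BY id")]
```

Generating the statements first and executing them afterwards mirrors the separation between steps 57 to 61 and step 62, and keeps the generated SQL available for logging.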
3. Output to water body comprehensive library module
The nc files produced by the clustering calculation and quality calculation modules are output directly to the water body comprehensive library.
To output to the water body comprehensive library, a parsing configuration file is constructed; the loading program parses the configuration file, converts the content of the generated nc standard files into corresponding JAVA objects, and writes the parsed content into the water body comprehensive library in batches.
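As a rough sketch of this configuration-driven loading, the example below maps file variables to database columns, builds one object per record, and writes the objects in a single batch. It substitutes Python dataclasses for the JAVA objects named in the text and an in-memory SQLite table for the water body comprehensive library; the mapping `PARSE_CONFIG`, the `Observation` class, the table `water_body`, and the sample values are all hypothetical, and the nc file is represented by an already-loaded dictionary of variables.

```python
import sqlite3
from dataclasses import dataclass, astuple

# Hypothetical parsing configuration: file variable name -> database column name.
PARSE_CONFIG = {
    "TIME": "obs_time",
    "LONGITUDE": "lon",
    "LATITUDE": "lat",
    "TEMP": "temp",
    "TEMP_QC": "temp_qc",
}

@dataclass
class Observation:
    obs_time: str
    lon: float
    lat: float
    temp: float
    temp_qc: int

def to_objects(variables, config):
    """Rename configured variables to column names and build one object per record."""
    columns = {config[name]: values
               for name, values in variables.items() if name in config}
    count = len(next(iter(columns.values())))
    return [Observation(**{col: vals[i] for col, vals in columns.items()})
            for i in range(count)]

def write_batch(conn, rows):
    """Insert all rows with one executemany call and commit once."""
    conn.executemany(
        "INSERT INTO water_body (obs_time, lon, lat, temp, temp_qc) "
        "VALUES (?, ?, ?, ?, ?)",
        [astuple(row) for row in rows],
    )
    conn.commit()

# Stand-in for the content of a standard nc file; EXTRA is not configured and is ignored.
variables = {
    "TIME": ["2021-05-12T00:00", "2021-05-12T06:00"],
    "LONGITUDE": [120.5, 120.6],
    "LATITUDE": [30.2, 30.3],
    "TEMP": [18.4, 99.0],
    "TEMP_QC": [1, 4],
    "EXTRA": ["a", "b"],
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE water_body "
             "(obs_time TEXT, lon REAL, lat REAL, temp REAL, temp_qc INTEGER)")
rows = to_objects(variables, PARSE_CONFIG)
write_batch(conn, rows)
```

Keeping the variable-to-column mapping in configuration means a new file layout needs a new mapping rather than new loading code.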
The invention is illustrated below by a specific example, the integration of "temperature-salinity data" of the marine water body environment:
The temperature-salinity data mainly comprise the observations of time, longitude, latitude, depth, water temperature, salinity, density, and sound velocity, together with auxiliary information such as observation instruments, navigation equipment, special projects, and tasks, which together form a complete and complex temperature-salinity data structure.
Step 1, loading temperature-salinity data files from various sources
The clustering calculation module loads the standard csv, nc, and other files produced by parsing marine water body temperature-salinity environment data from various sources.
Step 2, multi-dimensional automatic sorting of the loaded files
Using its built-in deduplication technique, the clustering calculation module automatically sorts the standard temperature-salinity csv, nc, and other format files produced by parsing each source, by seawater profile dimensions such as time, longitude, latitude, and depth.
The radius is determined with the DBSCAN clustering method, data greater than or equal to the radius are marked, and the marks are used for priority adjustment; combined with the sorting, they form the final sorting result, which serves as the basis for intelligently deleting duplicates.
Step 3, intelligently deduplicating the sorted data files
Intelligent deduplication is applied to data of the same profile according to the sorting result of step 2. The intelligent deduplication used in this embodiment is a function built into the clustering calculation module, which contains deduplication rules and a clustering model and executes deduplication automatically according to the rules.
The deduplication rules are the main basis for the deduplication operation and are configured precisely according to the requirements of each discipline. Examples are as follows:
void on_comboBox_lat_lon_currentTextChanged(const QString &arg1);
Precision rule for longitude and latitude. For example, with a precision of 0.01, longitude and latitude are compared to 2 decimal places, and values that agree to 2 decimal places are treated as identical.
void on_comboBox_psal_currentTextChanged(const QString &arg1);
Precision rule for salinity data.
def removeDulpicate(self, df, same_dup):
For completely identical profiles, the duplicates are deleted directly.
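The precision rules and the removal of identical profiles can be sketched together as below. This is an illustration under assumptions, not the body of the function declared above: profiles are represented as plain dictionaries, the 3-decimal salinity precision is hypothetical (only the 0.01 longitude and latitude example comes from the text), and the first profile encountered is the one kept.

```python
def profile_key(profile, latlon_decimals=2, psal_decimals=3):
    """Build a hashable comparison key, rounding to the configured precision."""
    levels = tuple(
        (depth, round(psal, psal_decimals))
        for depth, psal in zip(profile["depth"], profile["psal"])
    )
    return (
        profile["time"],
        round(profile["lon"], latlon_decimals),
        round(profile["lat"], latlon_decimals),
        levels,
    )

def remove_duplicates(profiles):
    """Keep the first profile for each key, preserving the input (sorted) order."""
    seen = set()
    unique = []
    for profile in profiles:
        key = profile_key(profile)
        if key not in seen:
            seen.add(key)
            unique.append(profile)
    return unique

# Hypothetical profiles: the first two agree once rounded, the third differs in time.
profiles = [
    {"time": "2021-05-12T00:00", "lon": 120.504, "lat": 30.201,
     "depth": [0, 10], "psal": [34.5001, 34.6002]},
    {"time": "2021-05-12T00:00", "lon": 120.496, "lat": 30.199,
     "depth": [0, 10], "psal": [34.5003, 34.6001]},
    {"time": "2021-05-13T00:00", "lon": 120.504, "lat": 30.201,
     "depth": [0, 10], "psal": [34.5001, 34.6002]},
]
unique = remove_duplicates(profiles)
```

Because the first occurrence wins, the sorting result of step 2 decides which copy of a duplicated profile survives. Rounding compares values bin by bin, so two values that lie very close together but on opposite sides of a rounding boundary are not treated as identical.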
Step 4, checking the deduplicated data files and marking quality flags
The quality calculation module performs quality detection on each file produced by clustering integration, according to the detection methods matched to its subject characteristics, file by file, profile by profile, and method by method. If bad data are found, the corresponding quality flag is changed to 4 (bad flag) and saved to the nc file; this embodiment takes the nc file as an example.
(1) Reading profile data from nc files
self.profiles = argo.profile_from_nc(filePath, nc)
prof_count = len(self.profiles)
(2) Reading quality control parameters from a configuration file
import json
cfg_filePath = "./argo.json"
with open(cfg_filePath, "r") as file:
    cfg = json.load(file)
Constructing the quality control object from a profile and the configuration
pqc = ProfileQC(prof, cfg=cfg, csv=greylist, metafile=metafile)
(3) Quality detection by detection method
The quality calculation module integrates quality inspection methods for the various data types of each discipline. For example, a water body temperature-salinity file is checked with the "tukey53H_norm" detection method, as follows:
The tukey53H_norm detection method takes advantage of the robustness of the median to create a smoother data sequence, which is then compared with the observed values. The difference is normalized by the standard deviation of the observed data sequence after the large-scale variability has been removed.
For a single measurement $x_i$, where $i$ is the position of the observation, it is evaluated as follows:
$x^{(1)}$ is the median of the five points from $x_{i-2}$ to $x_{i+2}$;
$x^{(2)}$ is the median of the three points from $x^{(1)}_{i-1}$ to $x^{(1)}_{i+1}$;
$x^{(3)}$ is defined by the Hanning smoothing filter:
$$x^{(3)} = \frac{1}{4}\left(x^{(2)}_{i-1} + 2x^{(2)}_{i} + x^{(2)}_{i+1}\right)$$
If
$$\frac{\left|x_i - x^{(3)}\right|}{\sigma} > k$$
then $x_i$ is a spike, where $\sigma$ is the standard deviation of the low-pass filtered data.
(4) Marking quality flags 1 to 4 according to the quality test result
By default, if the value produced by the test is greater than k = 1.5, the measurement is flagged 4; if it is lower, the measurement is flagged 1.
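The filter and the default flagging rule can be sketched in plain Python as follows. This is a simplified illustration, not the library routine used in the embodiment: it assumes that σ is taken as the population standard deviation of the smoothed series itself, that points too close to either end to be smoothed receive flag 0 (not evaluated), and the sample temperatures are hypothetical.

```python
from statistics import median, pstdev

def tukey53h(x):
    """Return the 5-point median, 3-point median, Hanning-smoothed series.

    Entries are None where the filter is undefined (four points at each end).
    """
    n = len(x)
    u1 = [None] * n
    for i in range(2, n - 2):
        u1[i] = median(x[i - 2:i + 3])            # median of five points
    u2 = [None] * n
    for i in range(3, n - 3):
        u2[i] = median(u1[i - 1:i + 2])           # median of three points
    u3 = [None] * n
    for i in range(4, n - 4):
        u3[i] = 0.25 * (u2[i - 1] + 2 * u2[i] + u2[i + 1])  # Hanning filter
    return u3

def tukey53h_flags(x, k=1.5):
    """Flag 4 where |x_i - smooth_i| / sigma > k, flag 1 otherwise, 0 if not evaluated."""
    smooth = tukey53h(x)
    defined = [s for s in smooth if s is not None]
    # Assumption: sigma is the standard deviation of the smoothed series.
    sigma = pstdev(defined) if len(defined) > 1 else 0.0
    flags = []
    for xi, si in zip(x, smooth):
        if si is None or sigma == 0.0:
            flags.append(0)
        else:
            flags.append(4 if abs(xi - si) / sigma > k else 1)
    return flags

# Hypothetical temperature profile with one spike at index 5.
temps = [10.0, 10.1, 10.2, 10.3, 10.4, 15.0, 10.6, 10.7, 10.8, 10.9, 11.0]
flags = tukey53h_flags(temps)
```

Because the two median passes ignore isolated outliers, the spike barely affects the smoothed series, so its normalized difference stands well above the threshold while its neighbours stay below it.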
Step 6, outputting the temperature-salinity files to the water body comprehensive library
The temperature-salinity nc files that have been integrated by the clustering calculation and quality calculation modules are output directly to the water body comprehensive library through the parsing and loading program.
According to the constructed parsing configuration file, the parsing and loading program converts the content of the generated nc standard files into corresponding JAVA objects and writes the parsed content into the water body comprehensive library in batches.
The above describes only the preferred embodiments of the present invention and is not to be construed as limiting the invention; any modifications, equivalents, and improvements made within the spirit and principle of the present invention are intended to be included within its scope of protection.

Claims (4)

1. A marine water body environment data integration method of a marine comprehensive database, characterized by comprising the following steps:
S1, loading and sorting the marine water body environment data files, and intelligently deduplicating the sorted data files;
S2, verifying the deduplicated data files against quality ranges and storing the verification results;
S3, parsing the verified data files to obtain data, and storing the data in a database.
2. The method of claim 1, wherein step S1 specifically includes:
S11, loading the parsed marine water body environment data files to form standard-format files;
S12, sorting the data in the standard-format files by different dimensions, the sorting result serving as the basis for intelligent deduplication;
S13, classifying the sorted data files with different clustering algorithms according to their data characteristics, marking the classified data, and intelligently deduplicating.
3. The method of claim 1, wherein step S2 specifically includes:
S21, performing data quality detection on each intelligently deduplicated data file using the detection method matched to its data characteristics;
S22, marking the data according to the quality detection results.
4. The method of claim 1, wherein in step S3, the data files verified in step S2 are format-converted according to the parsing configuration file and stored in the database.
CN202110516191.7A 2021-05-12 2021-05-12 Ocean water environment data integration method of ocean comprehensive database Active CN113111116B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110516191.7A CN113111116B (en) 2021-05-12 2021-05-12 Ocean water environment data integration method of ocean comprehensive database

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110516191.7A CN113111116B (en) 2021-05-12 2021-05-12 Ocean water environment data integration method of ocean comprehensive database

Publications (2)

Publication Number Publication Date
CN113111116A true CN113111116A (en) 2021-07-13
CN113111116B CN113111116B (en) 2022-10-18

Family

ID=76722395

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110516191.7A Active CN113111116B (en) 2021-05-12 2021-05-12 Ocean water environment data integration method of ocean comprehensive database

Country Status (1)

Country Link
CN (1) CN113111116B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101197876A (en) * 2006-12-06 2008-06-11 中兴通讯股份有限公司 Method and system for multi-dimensional analysis of message service data
CN103678423A (en) * 2012-09-26 2014-03-26 深圳市世纪光速信息技术有限公司 Data file input system, device and method
CN104199907A (en) * 2014-08-28 2014-12-10 广州华多网络科技有限公司 Data inserting method and device
CN110716897A (en) * 2019-10-15 2020-01-21 北部湾大学 Cloud computing-based marine archive database parallelization construction method and device
CN110941593A (en) * 2019-12-03 2020-03-31 浪潮卓数大数据产业发展有限公司 File warehousing system and method
CN110991940A (en) * 2019-12-24 2020-04-10 国家卫星海洋应用中心 Ocean observation data product quality online inspection method and device and server

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
杨锦坤: ""非信息化海洋环境历史资料抢救流程设计与关键技术研究"", 《海洋信息》 *
耿姗姗: ""基于数字海洋框架的海洋资料整合与共享服务管理模式浅析———以海洋公益性行业科研专项经费项目为例"", 《海洋开发与管理》 *

Also Published As

Publication number Publication date
CN113111116B (en) 2022-10-18

Similar Documents

Publication Publication Date Title
KR102178295B1 (en) Decision model construction method and device, computer device and storage medium
US7657408B2 (en) Structural analysis apparatus, structural analysis method, and structural analysis program
CN108292204B (en) System and method for automatic address verification
CN108960269B (en) Feature acquisition method and device for data set and computing equipment
CN102693266A (en) Method of searching a data base, navigation device and method of generating an index structure
CN105989001A (en) Image searching method and device, and image searching system
CN108681505B (en) Test case ordering method and device based on decision tree
CN110991065B (en) Automatic identification method for design change in building information model
CN111709775A (en) House property price evaluation method and device, electronic equipment and storage medium
US8301584B2 (en) System and method for adaptive pruning
CN113313344B (en) Label system construction method and system fusing multiple modes
CN111105041B (en) Machine learning method and device for intelligent data collision
CN113111116B (en) Ocean water environment data integration method of ocean comprehensive database
CN114049016A (en) Index similarity judgment method, system, terminal device and computer storage medium
CN108734393A (en) Matching process, user equipment, storage medium and the device of information of real estate
CN112182140B (en) Information input method, device, computer equipment and medium combining RPA and AI
CN112148819A (en) Address recognition method and device combining RPA and AI
CN106959960B (en) Data acquisition method and device
CN115952150A (en) Multi-source heterogeneous data fusion method and device
CN114021716A (en) Model training method and system and electronic equipment
JP2014206382A (en) Target type identification device
CN114185785A (en) Natural language processing model test case reduction method for deep neural network
CN111667552B (en) S57 electronic chart depth range rapid judging and filling method and equipment
CN116882396A (en) Function point analysis method, device, computer equipment, storage medium and product
CN117034016A (en) Method, system, electronic equipment and medium for constructing communication radiation source data model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant