CN116303402A - Data cleaning method based on data warehouse - Google Patents

Data cleaning method based on data warehouse

Info

Publication number
CN116303402A
Authority
CN
China
Prior art keywords
data
cleaning
warehouse
attribute
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310312420.2A
Other languages
Chinese (zh)
Inventor
张莉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang College of Security Technology
Original Assignee
Zhejiang College of Security Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2023-03-28
Filing date
2023-03-28
Publication date
2023-06-23
Application filed by Zhejiang College of Security Technology filed Critical Zhejiang College of Security Technology
Priority to CN202310312420.2A priority Critical patent/CN116303402A/en
Publication of CN116303402A publication Critical patent/CN116303402A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2457 Query processing with adaptation to user needs
    • G06F16/24578 Query processing with adaptation to user needs using ranking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP

Abstract

The invention discloses a data cleaning method based on a data warehouse, comprising the following steps: S1: preprocessing: presetting an index statistics maintenance task table and an index cleaning code template, and selecting attributes for record matching, wherein the attributes represent the characteristics of a record; S2: when the synchronization time is reached, assigning a different weight to each attribute according to how important that attribute is in determining the similarity of two records; S3: test-running the data cleaning task; S4: the data extraction module (DEM) first formulates a data extraction specification and a data extraction standard. With this method, the automatic cleaning of inaccurate and non-standard material data can be supplemented by manual intervention: records that differ only because individual parameters in a data value are missing, misspelled, or written in different formats are matched by computing the similarity between material records, the matches are displayed in a corresponding order according to the data attributes, and manual confirmation of the matches completes the data cleaning.

Description

Data cleaning method based on data warehouse
Technical Field
The invention relates to the technical field of computer data processing, in particular to a data cleaning method based on a data warehouse.
Background
With the advent of the DT (data technology) era, the value of data has become increasingly prominent. For Internet platform operators and service providers, the demand for data across every line of business has reached an unprecedented level. How to analyze existing data in depth and mine its potential value has become a primary technical problem for those skilled in the art.
Currently, business teams and data processing technicians are building increasingly close working relationships, and one important area of collaboration is model deployment. Take the data processing of a trusted system as an example: by deploying a set of offline models, the system identifies whether the operation of a given account in a given environment is trusted, and it reduces disturbance to users by allowing only whitelisted operations, thereby improving the user experience. The trusted model assigns a trusted class based on fixed indices of the account under various kinds of environment information, such as MAC (media access control), UMID (unique material identifier) and TID (thread identifier); for example, index A > 1 and index B > 2 is identified as class one, and index A > 3 and index B > 4 is identified as class two. The model builders on the business team are responsible for determining the model indices and thresholds; the data processing technicians are responsible for cleaning the basic indices, deploying the model, and pushing the data to the application system, which closes the loop of the whole data link.
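The threshold example above (index A > 1 and index B > 2 for class one, index A > 3 and index B > 4 for class two) can be sketched as a small rule table. This is only an illustration: the function name `trusted_class` and the rule ordering are assumptions, since the text does not say which class wins when a record satisfies both rules. Here the stricter rule is checked first.

```python
from typing import Optional

# Thresholds copied from the example in the text. Each rule is
# (class label, lower bound for index A, lower bound for index B).
# Assumption: rules are ordered from strictest to loosest, so a record
# that satisfies both rules is reported as class two.
RULES = [
    (2, 3.0, 4.0),
    (1, 1.0, 2.0),
]


def trusted_class(index_a: float, index_b: float) -> Optional[int]:
    """Return the first matching trusted class, or None if no rule matches."""
    for label, min_a, min_b in RULES:
        if index_a > min_a and index_b > min_b:
            return label
    return None
```

Keeping the thresholds in a table rather than in nested conditionals reflects the division of labour described above: the model builder would change only the table, not the code.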
After the model builder submits a model deployment requirement, the data processing technician has to schedule the development, clean the indices, deploy the model, and so on, and only then is the whole process carried out. When data processing technicians are in short supply, model deployment can be severely delayed.
Therefore, how to implement data cleaning automatically and effectively, so as to ease the resource constraint and improve the working efficiency of technicians, is a technical problem to be solved by those skilled in the art.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a data cleaning method based on a data warehouse.
To achieve the above purpose, the present invention provides the following technical solution: a data cleaning method based on a data warehouse, comprising the following steps:
S1: preprocessing: presetting an index statistics maintenance task table and an index cleaning code template, and selecting attributes for record matching, wherein the attributes represent the characteristics of a record;
S2: when the synchronization time is reached, assigning a different weight to each attribute according to how important that attribute is in determining the similarity of two records;
S3: test-running the data cleaning task;
S4: the data extraction module (DEM) first formulates a data extraction specification and data extraction standards, and then extracts data from the enterprise department application system EAPPi;
S5: determining the target data to be cleaned from any data warehouse, and creating a cleaning task for the target data, wherein the cleaning task comprises the target data information and cleaning rules;
S6: determining the source data warehouse of the target data, and determining the target cleaning child node corresponding to the source data warehouse according to the mapping relation between data warehouses and cleaning nodes;
S7: determining the source data warehouse of the target data, and determining the target cleaning child node corresponding to the source data warehouse according to the mapping relation between data warehouses and cleaning nodes;
S8: constructing a data cleaning system, determining the source data warehouse of the target data by means of the system, and determining the target cleaning child node corresponding to the source data warehouse according to the mapping relation between data warehouses and cleaning nodes;
S9: displaying the result of the system's automatic cleaning;
S10: identifying multiple pieces of domain data belonging to the same domain, and comparing them with one another;
when the comparison finds data that differ, correcting the differing data according to the intra-domain data relationships of that domain;
S11: acquiring a source data table from the source data layer, and preprocessing the source data table through the data preprocessing layer to obtain a public data table.
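Steps S1 and S2 select matching attributes and give each one a weight that reflects its importance for record similarity. The patent does not specify the similarity measure, so the following is a minimal sketch that assumes a per-attribute string similarity from Python's `difflib.SequenceMatcher` and a normalized weighted sum; the function names and the attribute names in the usage note are hypothetical.

```python
from difflib import SequenceMatcher


def attribute_similarity(a, b) -> float:
    """Similarity of two attribute values in [0, 1]; missing values score 0."""
    if a is None or b is None:
        return 0.0
    left, right = str(a).strip().lower(), str(b).strip().lower()
    if not left or not right:
        return 0.0
    # Assumption: a generic edit-based ratio stands in for the unspecified measure.
    return SequenceMatcher(None, left, right).ratio()


def record_similarity(rec1: dict, rec2: dict, weights: dict) -> float:
    """Weighted average of per-attribute similarities over the chosen attributes."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    score = sum(
        weight * attribute_similarity(rec1.get(attr), rec2.get(attr))
        for attr, weight in weights.items()
    )
    return score / total
```

For example, with weights `{"name": 3, "spec": 1}`, two material records whose names match exactly but where one lacks a `spec` value score 0.75, so a missing low-weight parameter lowers the score without ruling out a match that a reviewer can then confirm manually.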
Preferably, in S1, the original raw material data needs to be collected.
Preferably, the data cleaning method further comprises:
conflict processing: merging or deleting the detected duplicate records within the same duplicate-record cluster, and retaining the correct records.
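The conflict-processing step can be illustrated with a small merge routine. The text does not define how the "correct" record is chosen, so this sketch assumes one plausible policy: for each field, keep the most frequent non-empty value in the cluster. The function name `merge_cluster` and the sample fields in the usage note are hypothetical.

```python
from collections import Counter


def merge_cluster(cluster: list) -> dict:
    """Collapse one cluster of duplicate records (dicts) into a single record."""
    if not cluster:
        raise ValueError("cluster must contain at least one record")
    # Collect field names in first-seen order so the output is deterministic.
    fields = []
    for record in cluster:
        for name in record:
            if name not in fields:
                fields.append(name)
    merged = {}
    for name in fields:
        # Assumption: None and the empty string both mean "value missing".
        values = [r[name] for r in cluster if r.get(name) not in (None, "")]
        merged[name] = Counter(values).most_common(1)[0][0] if values else None
    return merged
```

Given three records for the same material where one has an empty `spec` and another misspells the name, the merged record keeps the majority spelling and the non-empty `spec`. A production system would likely route low-agreement fields to manual confirmation instead of deciding automatically.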
Preferably, in S10, the specific step of cleaning the domain data comprises:
reading the field value of each record in the domain data, and replacing any field value that does not meet a preset condition with a preset value or a null value.
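The field-level cleaning described here amounts to checking each field against a condition and substituting a preset value or null when the check fails. A minimal sketch, assuming records are dictionaries and each rule pairs a validity predicate with its replacement; `clean_fields` and the example rules in the usage note are illustrative names, not part of the patent.

```python
def clean_fields(records: list, rules: dict) -> list:
    """Return cleaned copies of the records.

    rules maps a field name to (is_valid, replacement). A value that fails
    is_valid is replaced with replacement, which may be None (null).
    """
    cleaned = []
    for record in records:
        out = dict(record)  # copy so the source data is left untouched
        for field, (is_valid, replacement) in rules.items():
            # A field that is absent is checked as None.
            if not is_valid(out.get(field)):
                out[field] = replacement
        cleaned.append(out)
    return cleaned
```

For instance, with a rule that `weight_kg` must be a positive number (replacement `None`) and a rule that `unit` must be `"kg"` or `"g"` (replacement `"kg"`), the record `{"weight_kg": -5, "unit": "lbs"}` becomes `{"weight_kg": None, "unit": "kg"}`.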
Preferably, in S2, a data cleaning task is configured according to the index statistics maintenance task table and the index cleaning code template, wherein the index statistics maintenance task table contains the elements currently used for index cleaning and their corresponding data.
Preferably, in S11, the public data table and the other data tables that do not require preprocessing are further summarized through the data summarizing layer.
Preferably, the method further comprises: calling an extraction tool to acquire the cleaned data generated by each source data warehouse, and synchronizing the cleaned data to any data warehouse.
Compared with the prior art, the data cleaning method based on a data warehouse provided by the invention has the following beneficial effects:
1. The automatic cleaning of inaccurate and non-standard material data can be supplemented by manual intervention: records that differ only because individual parameters in a data value are missing, misspelled, or written in different formats are matched by computing the similarity between material records, the matches are sorted and displayed according to the data attributes, and manual confirmation of the matches completes the data cleaning.
2. The detection method can detect and correct errors across a large number of data sources, effectively reducing the complexity of cleaning and improving cleaning efficiency.
Detailed Description
The technical solutions of the embodiments of the present invention will be clearly and completely described below in conjunction with the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments.
Examples of the embodiments are shown in which the same or similar reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The following examples, which are given by way of illustration, are intended to illustrate the invention and are not to be construed as limiting the invention.
In the description of the present invention, it should be understood that the terms "center," "longitudinal," "transverse," "length," "width," "thickness," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," "clockwise," "counterclockwise," "axial," "radial," "circumferential," etc. indicate orientations or positional relationships, merely for convenience in describing the present invention and to simplify the description, and do not indicate or imply that the devices or elements referred to must have a particular orientation, be configured and operated in a particular orientation, and therefore should not be construed as limiting the present invention.
In the present invention, unless explicitly specified and limited otherwise, the terms "mounted," "connected," "secured," and the like are to be construed broadly, and may be, for example, fixedly connected, detachably connected, or integrally formed; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communicated with the inside of two elements or the interaction relationship of the two elements. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific circumstances.
The invention provides a data cleaning method based on a data warehouse, comprising the following steps:
S1: preprocessing: presetting an index statistics maintenance task table and an index cleaning code template, and selecting attributes for record matching, wherein the attributes represent the characteristics of a record, and the original raw material data needs to be collected;
S2: when the synchronization time is reached, assigning a different weight to each attribute according to how important that attribute is in determining the similarity of two records; specifically, a data cleaning task is configured according to the index statistics maintenance task table whose current state is valid and the index cleaning code template, wherein the index statistics maintenance task table contains the elements currently used for index cleaning and their corresponding data;
S3: test-running the data cleaning task;
S4: the data extraction module (DEM) first formulates a data extraction specification and data extraction standards, and then extracts data from the enterprise department application system EAPPi;
S5: determining the target data to be cleaned from any data warehouse, and creating a cleaning task for the target data, wherein the cleaning task comprises the target data information and cleaning rules;
S6: determining the source data warehouse of the target data, and determining the target cleaning child node corresponding to the source data warehouse according to the mapping relation between data warehouses and cleaning nodes;
S7: determining the source data warehouse of the target data, and determining the target cleaning child node corresponding to the source data warehouse according to the mapping relation between data warehouses and cleaning nodes;
S8: constructing a data cleaning system, determining the source data warehouse of the target data by means of the system, and determining the target cleaning child node corresponding to the source data warehouse according to the mapping relation between data warehouses and cleaning nodes;
S9: displaying the result of the system's automatic cleaning;
S10: identifying multiple pieces of domain data belonging to the same domain, and comparing them with one another, wherein the specific step of cleaning the domain data comprises:
reading the field value of each record in the domain data, and replacing any field value that does not meet a preset condition with a preset value or a null value;
when the comparison finds data that differ, correcting the differing data according to the intra-domain data relationships of that domain;
S11: acquiring a source data table from the source data layer, preprocessing the source data table through the data preprocessing layer to obtain a public data table, and summarizing the public data table together with the other data tables that do not require preprocessing through the data summarizing layer.
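Steps S5 to S8 describe creating a cleaning task for the target data and routing it to the cleaning child node mapped to the source data warehouse. The sketch below shows one way such a lookup could work. All names (`CleaningTask`, `dispatch`, the warehouse and node identifiers) are hypothetical, and the text does not describe how the mapping is stored, so a plain dictionary is assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CleaningTask:
    """A cleaning task holding the target data information and cleaning rules (S5)."""
    target_table: str
    source_warehouse: str
    rules: tuple


# Assumed mapping relation between data warehouses and cleaning child nodes.
NODE_BY_WAREHOUSE = {
    "warehouse_a": "clean-node-1",
    "warehouse_b": "clean-node-2",
}


def dispatch(task: CleaningTask, node_by_warehouse: dict) -> str:
    """Return the cleaning child node responsible for the task's source warehouse."""
    try:
        return node_by_warehouse[task.source_warehouse]
    except KeyError:
        raise LookupError(
            f"no cleaning node mapped to warehouse {task.source_warehouse!r}"
        ) from None
```

Failing loudly on an unmapped warehouse, instead of falling back to a default node, keeps a task from being cleaned by a node that has no access to its source data.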
Preferably, the data cleaning method further comprises:
conflict processing: merging or deleting the detected duplicate records within the same duplicate-record cluster, and retaining the correct records.
Preferably, the method further comprises: calling an extraction tool to acquire the cleaned data generated by each source data warehouse, and synchronizing the cleaned data to any data warehouse.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a reference structure" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

Claims (7)

1. A data cleaning method based on a data warehouse, comprising the following steps:
S1: preprocessing: presetting an index statistics maintenance task table and an index cleaning code template, and selecting attributes for record matching, wherein the attributes represent the characteristics of a record;
S2: when the synchronization time is reached, assigning a different weight to each attribute according to how important that attribute is in determining the similarity of two records;
S3: test-running the data cleaning task;
S4: the data extraction module (DEM) first formulates a data extraction specification and data extraction standards, and then extracts data from the enterprise department application system EAPPi;
S5: determining the target data to be cleaned from any data warehouse, and creating a cleaning task for the target data, wherein the cleaning task comprises the target data information and cleaning rules;
S6: determining the source data warehouse of the target data, and determining the target cleaning child node corresponding to the source data warehouse according to the mapping relation between data warehouses and cleaning nodes;
S7: determining the source data warehouse of the target data, and determining the target cleaning child node corresponding to the source data warehouse according to the mapping relation between data warehouses and cleaning nodes;
S8: constructing a data cleaning system, determining the source data warehouse of the target data by means of the system, and determining the target cleaning child node corresponding to the source data warehouse according to the mapping relation between data warehouses and cleaning nodes;
S9: displaying the result of the system's automatic cleaning;
S10: identifying multiple pieces of domain data belonging to the same domain, and comparing them with one another;
when the comparison finds data that differ, correcting the differing data according to the intra-domain data relationships of that domain;
S11: acquiring a source data table from the source data layer, and preprocessing the source data table through the data preprocessing layer to obtain a public data table.
2. The data cleaning method based on a data warehouse according to claim 1, wherein in S1, the original raw material data needs to be collected.
3. The data cleaning method based on a data warehouse according to claim 1, wherein the data cleaning method further comprises:
conflict processing: merging or deleting the detected duplicate records within the same duplicate-record cluster, and retaining the correct records.
4. The data cleaning method based on a data warehouse according to claim 1, wherein in S10, the specific step of cleaning the domain data comprises:
reading the field value of each record in the domain data, and replacing any field value that does not meet a preset condition with a preset value or a null value.
5. The data cleaning method based on a data warehouse according to claim 1, wherein in S2, a data cleaning task is configured according to the index statistics maintenance task table and the index cleaning code template, wherein the index statistics maintenance task table contains the elements currently used for index cleaning and their corresponding data.
6. The data cleaning method based on a data warehouse according to claim 1, wherein in S11, the public data table and the other data tables that do not require preprocessing are further summarized through the data summarizing layer.
7. The data cleaning method based on a data warehouse according to claim 1, further comprising: calling an extraction tool to acquire the cleaned data generated by each source data warehouse, and synchronizing the cleaned data to any data warehouse.
CN202310312420.2A 2023-03-28 2023-03-28 Data cleaning method based on data warehouse Pending CN116303402A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310312420.2A CN116303402A (en) 2023-03-28 2023-03-28 Data cleaning method based on data warehouse

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310312420.2A CN116303402A (en) 2023-03-28 2023-03-28 Data cleaning method based on data warehouse

Publications (1)

Publication Number Publication Date
CN116303402A true CN116303402A (en) 2023-06-23

Family

ID=86828616

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310312420.2A Pending CN116303402A (en) 2023-03-28 2023-03-28 Data cleaning method based on data warehouse

Country Status (1)

Country Link
CN (1) CN116303402A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117171145A (en) * 2023-06-28 2023-12-05 华远陆港智慧物流科技有限公司 Analysis processing method, equipment and storage medium for enterprise management system data
CN117171145B (en) * 2023-06-28 2024-03-26 华远陆港智慧物流科技有限公司 Analysis processing method, equipment and storage medium for enterprise management system data

Similar Documents

Publication Publication Date Title
EP3798846A1 (en) Operation and maintenance system and method
CN111459985B (en) Identification information processing method and device
EP4060942A1 (en) Configuration anomaly detection method, server and storage medium
CN109376196B (en) Method and device for batch synchronization of redo logs
CN111209274B (en) Data quality checking method, system, equipment and readable storage medium
CN116303402A (en) Data cleaning method based on data warehouse
CN110716539B (en) Fault diagnosis and analysis method and device
CN114153980A (en) Knowledge graph construction method and device, inspection method and storage medium
CN113051308A (en) Alarm information processing method, equipment, storage medium and device
CN110674231A (en) Data lake-oriented user ID integration method and system
CN116205396A (en) Data panoramic monitoring method and system based on data center
CN112187914A (en) Remote control robot management method and system
US7844601B2 (en) Quality of service feedback for technology-neutral data reporting
CN104503982B (en) A kind of method that CMDB configuration items reconcile
CN112068981A (en) Knowledge base-based fault scanning recovery method and system in Linux operating system
CN107291938A (en) Order Query System and method
CN111177016A (en) Software test defect management method
CN110941910A (en) Intelligent auxiliary method and system for power grid three-dimensional design review
CN111290969B (en) Software quality analysis method based on characteristic frequency statistics
CN112561388A (en) Information processing method, device and equipment based on Internet of things
CN108170825A (en) Distributed energy data monitoring cleaning method based on cloud platform
CN111444254B (en) SKL system file format conversion method and system
CN113868615A (en) Asset database configuration management method based on network monitoring
CN111352824A (en) Test method and device and computer equipment
CN111898961A (en) Error checking method suitable for same field of standing book data of similar power equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination