CN116303402A - Data cleaning method based on data warehouse

Info

Publication number: CN116303402A
Application number: CN202310312420.2A
Authority: CN
Inventors: 张莉
Original assignee: Zhejiang College of Security Technology
Current assignee: Zhejiang College of Security Technology
Priority date: 2023-03-28
Filing date: 2023-03-28
Publication date: 2023-06-23

Abstract

The invention discloses a data cleaning method based on a data warehouse, comprising the following steps. S1: preprocessing: presetting an index statistics maintenance task table and an index cleaning code template, and selecting attributes for record matching, wherein the attributes can represent the characteristics of a record. S2: when the synchronization time is reached, assigning a different weight to each attribute according to how important that attribute is in determining the similarity of two records. S3: performing a test run of the data cleaning task. S4: the data extraction module DEM first formulates a data extraction specification and a data extraction standard. With this method, automatic cleaning of inaccurate and non-standard material data can be supplemented by manual intervention: records whose data values have missing, misspelled, or differently formatted parameters are matched through similarity calculation over the material data, the matches are displayed in an order determined by the data attributes, and manual confirmation of the matches completes the data cleaning.

Description

Data cleaning method based on data warehouse

Technical Field

The invention relates to the technical field of computer data processing, in particular to a data cleaning method based on a data warehouse.

Background

With the advent of the DT (data technology) age, the value of data is increasingly prominent. For internet platform operators and service providers, the data requirements of each business are reaching an unprecedented level. How to analyze existing data in depth and mine its potential value has become a primary technical problem for those skilled in the art.

Currently, business teams and data processing technicians are gradually building closer collaboration, and one important area of collaboration is model deployment. Taking data processing for a trusted system as an example, the system deploys a set of offline models to identify whether an operation by a given account in a given environment is trusted, and reduces disturbance to users by allowing only whitelisted entries, which improves the user experience. The trusted model identifies trusted classes from the account and from fixed indexes under various kinds of environment information (MAC (media access control), UMID (unique material identifier), TID (thread identifier), etc.); for example, index A > 1 and index B > 2 is identified as class one, and index A > 3 and index B > 4 is identified as class two. The model builders on the business team are responsible for determining the model indexes and thresholds, while the data processing technicians are responsible for cleaning the basic indexes, deploying the model, and pushing the data to the application system, which closes the loop of the whole data link.

After a model builder submits a model deployment requirement, the data processing technician must carry out a series of operations such as development scheduling, index cleaning, and model deployment before the process is complete. When data processing technicians are in short supply, model deployment can be severely delayed.

Therefore, how to implement data cleaning automatically and effectively, so as to relieve the resource constraint and improve the working efficiency of technicians, is a technical problem to be solved by those skilled in the art.

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides a data cleaning method based on a data warehouse.

In order to achieve the above purpose, the present invention provides the following technical solutions: a data warehouse-based data cleansing method, comprising the steps of:

S1: preprocessing: presetting an index statistics maintenance task table and an index cleaning code template, and selecting attributes for record matching, wherein the attributes can represent the characteristics of a record;

S2: when the synchronization time is reached, assigning a different weight to each attribute according to how important that attribute is in determining the similarity of two records;

S3: performing a test run of the data cleaning task;

S4: the data extraction module DEM first formulates a data extraction specification and a data extraction standard, and then extracts data from an enterprise department application system EAPPi;

S5: determining target data to be cleaned from any data warehouse, and creating a cleaning task for the target data, wherein the cleaning task comprises the target data information and cleaning rules;

S6: determining the source data warehouse of the target data, and determining the target cleaning child node corresponding to the source data warehouse according to the mapping relation between data warehouses and cleaning nodes;

S7: determining the source data warehouse of the target data, and determining the target cleaning child node corresponding to the source data warehouse according to the mapping relation between data warehouses and cleaning nodes;

S8: constructing a data cleaning system, determining the source data warehouse of the target data through the system, and determining the target cleaning child node corresponding to the source data warehouse according to the mapping relation between data warehouses and cleaning nodes;

S9: displaying the automatic cleaning result of the system;

S10: identifying a plurality of domain data belonging to the same domain, and comparing the plurality of domain data under that domain;

when the comparison finds data that differ, correcting the differing data according to the intra-domain data relationship of that domain;

S11: acquiring a source data table from the source data layer, and preprocessing the source data table through the data preprocessing layer to obtain a common data table.
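The attribute selection and weighting of steps S1 and S2 can be illustrated with a minimal sketch. The method does not specify a similarity measure or concrete weights, so the example assumes the ratio from Python's `difflib.SequenceMatcher` as the per-attribute measure; the attribute names and the values in `WEIGHTS` are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical attribute weights (S2); the method does not give concrete values.
WEIGHTS = {"name": 0.5, "spec": 0.3, "unit": 0.2}


def attribute_similarity(a, b):
    """Similarity of two attribute values in [0, 1]; a missing value scores 0."""
    if not a or not b:
        return 0.0
    left, right = str(a).strip().lower(), str(b).strip().lower()
    return SequenceMatcher(None, left, right).ratio()


def record_similarity(rec_a, rec_b, weights=WEIGHTS):
    """Weighted average of the per-attribute similarities of two records."""
    total = sum(weights.values())
    score = sum(
        weight * attribute_similarity(rec_a.get(attr), rec_b.get(attr))
        for attr, weight in weights.items()
    )
    return score / total
```

With these assumed weights, two records that agree on `name` and `spec` but differ only by a missing `unit` still score about 0.8, so a less important attribute cannot by itself prevent a match.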

Preferably, in step S1, the original material data needs to be collected.

Preferably, the data cleaning method further includes:

conflict processing: merging or deleting the detected duplicate records within each duplicate-record cluster, and retaining the correct record.
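One possible form of the conflict processing is sketched below. The method does not state how the correct record is chosen, so the example assumes that the record with the most non-empty fields is retained and that its empty fields are filled from the other duplicates; `merge_cluster` and `resolve_conflicts` are hypothetical names.

```python
def merge_cluster(cluster):
    """Merge one cluster of duplicate records into a single record.

    Assumption: the record with the most non-empty fields is treated as the
    correct one, and its empty fields are filled from the other duplicates.
    """
    def filled(rec):
        return sum(1 for value in rec.values() if value not in (None, ""))

    ordered = sorted(cluster, key=filled, reverse=True)
    merged = dict(ordered[0])
    for rec in ordered[1:]:
        for key, value in rec.items():
            if merged.get(key) in (None, "") and value not in (None, ""):
                merged[key] = value
    return merged


def resolve_conflicts(clusters):
    """Return one retained record per non-empty duplicate cluster."""
    return [merge_cluster(cluster) for cluster in clusters if cluster]
```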

Preferably, in S10, the specific step of cleaning the domain data includes:

reading the field value of each record in the domain data, and replacing any field value that does not meet a preset condition with a preset value or a null value.
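The field-level cleaning described above can be sketched as follows. The preset conditions are modelled as per-field predicates and the preset values as a dictionary of defaults; both structures, and the name `clean_fields`, are assumptions made for illustration.

```python
def clean_fields(records, rules, defaults=None):
    """Replace field values that fail their rule with a preset value or None.

    rules: field name -> predicate returning True for acceptable values.
    defaults: field name -> preset replacement; fields without one get None.
    """
    defaults = defaults or {}
    cleaned = []
    for rec in records:
        out = dict(rec)  # leave the input record untouched
        for field, is_valid in rules.items():
            if field in out and not is_valid(out[field]):
                out[field] = defaults.get(field)
        cleaned.append(out)
    return cleaned


# Hypothetical rules: quantity must be a non-negative integer, unit must be known.
rules = {
    "qty": lambda v: isinstance(v, int) and v >= 0,
    "unit": lambda v: v in {"pcs", "kg"},
}
result = clean_fields([{"qty": -5, "unit": "piece"}], rules, defaults={"unit": "pcs"})
```

Here the invalid quantity becomes a null value because no preset value is given for it, while the unrecognized unit is replaced with the preset value `"pcs"`.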

Preferably, in step S2, a data cleaning task is configured according to the index statistics maintenance task table and the index cleaning code template, wherein the index statistics maintenance task table contains the elements currently used for index cleaning and their corresponding data.
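A minimal sketch of configuring cleaning tasks from a task table and a code template follows. It assumes the template is a SQL string with placeholders filled by `string.Template`, and that each task row carries a `status` column; the table contents, column names, and SQL shape are all hypothetical.

```python
from string import Template

# Hypothetical index cleaning code template; placeholders are filled per task row.
CLEANING_TEMPLATE = Template(
    "INSERT INTO $target_table "
    "SELECT $key, $expression AS $index_name FROM $source_table GROUP BY $key"
)

# Hypothetical index statistics maintenance task table.
TASK_TABLE = [
    {"index_name": "login_cnt", "expression": "COUNT(*)", "key": "account_id",
     "source_table": "ods_login", "target_table": "dw_index", "status": "valid"},
    {"index_name": "old_idx", "expression": "SUM(x)", "key": "account_id",
     "source_table": "ods_old", "target_table": "dw_index", "status": "invalid"},
]


def configure_cleaning_tasks(task_table, template=CLEANING_TEMPLATE):
    """Render one cleaning statement per task row whose status is valid."""
    tasks = []
    for row in task_table:
        if row.get("status") != "valid":
            continue  # skip rows that are not currently in effect
        params = {k: v for k, v in row.items() if k != "status"}
        tasks.append({"index_name": row["index_name"],
                      "sql": template.substitute(params)})
    return tasks
```

Keeping the statement shape in one template means a new index only requires a new row in the task table rather than new hand-written cleaning code.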

Preferably, in S11, the common data table and the other data tables that do not require preprocessing are summarized through the data summarizing layer.
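The layering of S11 can be sketched with plain Python data structures. The method does not define what preprocessing or summarizing consists of, so the example assumes that preprocessing trims strings and drops empty rows, and that summarizing joins rows from all tables on a shared key; both choices and the function names are hypothetical.

```python
def preprocess(source_rows):
    """Data preprocessing layer: trim strings and drop fully empty rows."""
    common = []
    for row in source_rows:
        cleaned = {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
        if any(v not in (None, "") for v in cleaned.values()):
            common.append(cleaned)
    return common


def summarize(common_table, other_tables, key):
    """Data summarizing layer: combine rows from all tables on a shared key."""
    summary = {}
    for table in [common_table, *other_tables]:
        for row in table:
            summary.setdefault(row[key], {}).update(row)
    return list(summary.values())


source_table = [{"id": 1, "name": " bolt "}, {"id": None, "name": ""}]
stock_table = [{"id": 1, "stock": 40}]  # a table that needs no preprocessing
common_table = preprocess(source_table)
summary = summarize(common_table, [stock_table], key="id")
```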

Preferably, the method further comprises: calling an extraction tool to acquire the cleaned data generated by each source data warehouse, and synchronizing the cleaned data to any data warehouse.
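The synchronization step can be sketched as a loop over the source data warehouses. The extraction tool is not named in the method, so it is represented here by a hypothetical callable `extract`, and the receiving data warehouse is modelled as a dictionary.

```python
def sync_cleaned_data(source_warehouses, extract, target):
    """Copy cleaned rows from every source warehouse into the target warehouse.

    extract: hypothetical extraction tool, called as extract(name) -> rows.
    target: dict standing in for the receiving warehouse, keyed by source name.
    Returns the number of rows synchronized.
    """
    synced = 0
    for name in source_warehouses:
        rows = list(extract(name))
        target[name] = rows
        synced += len(rows)
    return synced


# Hypothetical cleaned data held by two source warehouses.
cleaned_store = {"dw_sales": [{"id": 1}], "dw_materials": [{"id": 7}, {"id": 8}]}
target_warehouse = {}
count = sync_cleaned_data(cleaned_store, cleaned_store.__getitem__, target_warehouse)
```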

Compared with the prior art, the invention provides a data cleaning method based on a data warehouse, which has the following beneficial effects:

1. With the data cleaning method based on a data warehouse, automatic cleaning of inaccurate and non-standard material data can be supplemented by manual intervention: records whose data values have missing, misspelled, or differently formatted parameters are matched through similarity calculation over the material data, the matches are sorted and displayed according to the data attributes, and manual confirmation of the matches completes the data cleaning.

2. With the data cleaning method based on a data warehouse, the detection method allows errors from a large number of data sources to be detected and corrected, which effectively reduces the complexity of cleaning and improves cleaning efficiency.
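The combination of similarity matching, sorted display, and manual confirmation named in the first effect can be sketched as follows. The similarity measure (`difflib.SequenceMatcher`), the threshold of 0.6, and the `choose` callback standing in for the reviewer are all assumptions, not details given by the method.

```python
from difflib import SequenceMatcher


def text_similarity(a, b):
    """Case-insensitive similarity of two strings in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def rank_candidates(dirty_value, reference_values, threshold=0.6):
    """Return (value, score) pairs at or above the threshold, best match first."""
    scored = [(ref, text_similarity(dirty_value, ref)) for ref in reference_values]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


def confirm(dirty_value, candidates, choose):
    """Apply the reviewer's decision; `choose` returns a candidate index or None."""
    picked = choose(dirty_value, candidates)
    return candidates[picked][0] if picked is not None else dirty_value


candidates = rank_candidates("Hex bolt M8x20", ["hex bolt M8 x 20", "flat washer M8"])
accepted = confirm("Hex bolt M8x20", candidates, lambda value, options: 0)
```

Only the candidates above the threshold are shown to the reviewer, so the manual step is limited to confirming or rejecting a short ranked list.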

Detailed Description

The technical solutions of the embodiments of the present invention will be clearly and completely described below in conjunction with the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments.

Examples of the embodiments are shown in which the same or similar reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The following examples, which are given by way of illustration, are intended to illustrate the invention and are not to be construed as limiting the invention.

In the description of the present invention, it should be understood that the terms "center," "longitudinal," "transverse," "length," "width," "thickness," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," "clockwise," "counterclockwise," "axial," "radial," "circumferential," etc. indicate orientations or positional relationships, merely for convenience in describing the present invention and to simplify the description, and do not indicate or imply that the devices or elements referred to must have a particular orientation, be configured and operated in a particular orientation, and therefore should not be construed as limiting the present invention.

In the present invention, unless explicitly specified and limited otherwise, the terms "mounted," "connected," "secured," and the like are to be construed broadly, and may be, for example, fixedly connected, detachably connected, or integrally formed; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communicated with the inside of two elements or the interaction relationship of the two elements. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific circumstances.

The invention provides a data cleaning method based on a data warehouse, which comprises the following steps:

S1: preprocessing: presetting an index statistics maintenance task table and an index cleaning code template, and selecting attributes for record matching, wherein the attributes can represent the characteristics of a record, and the original material data needs to be collected;

S2: when the synchronization time is reached, assigning a different weight to each attribute according to how important that attribute is in determining the similarity of two records; specifically, configuring a data cleaning task according to the index statistics maintenance task table whose current state is valid and the index cleaning code template, wherein the index statistics maintenance task table contains the elements currently used for index cleaning and their corresponding data;

S3: performing a test run of the data cleaning task;

S5: determining target data to be cleaned from any data warehouse, and creating a cleaning task for the target data, wherein the cleaning task comprises the target data information and cleaning rules;

S6: determining the source data warehouse of the target data, and determining the target cleaning child node corresponding to the source data warehouse according to the mapping relation between data warehouses and cleaning nodes;

S7: determining the source data warehouse of the target data, and determining the target cleaning child node corresponding to the source data warehouse according to the mapping relation between data warehouses and cleaning nodes;

S8: constructing a data cleaning system, determining the source data warehouse of the target data through the system, and determining the target cleaning child node corresponding to the source data warehouse according to the mapping relation between data warehouses and cleaning nodes;

S9: displaying the automatic cleaning result of the system;

S10: identifying a plurality of domain data belonging to the same domain, and comparing the plurality of domain data under that domain, wherein the specific step of cleaning the domain data comprises:

reading the field value of each record in the domain data, and replacing any field value that does not meet a preset condition with a preset value or a null value;

when the comparison finds data that differ, correcting the differing data according to the intra-domain data relationship of that domain;

S11: acquiring a source data table from the source data layer, preprocessing the source data table through the data preprocessing layer to obtain a common data table, and summarizing the common data table and the other data tables that do not require preprocessing through the data summarizing layer.
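The creation of a cleaning task in S5 and its routing to a cleaning child node in S6 can be sketched as follows. The class `CleaningTask`, the warehouse names, and the node names in `WAREHOUSE_TO_NODE` are hypothetical and serve only to illustrate a mapping relation between data warehouses and cleaning nodes.

```python
from dataclasses import dataclass, field


@dataclass
class CleaningTask:
    """S5: a cleaning task carries target data information and cleaning rules."""
    source_warehouse: str
    target_table: str
    rules: list = field(default_factory=list)


# Hypothetical mapping relation between data warehouses and cleaning child nodes (S6).
WAREHOUSE_TO_NODE = {"dw_sales": "clean-node-1", "dw_materials": "clean-node-2"}


def route_task(task, mapping=WAREHOUSE_TO_NODE):
    """Return the cleaning child node mapped to the task's source warehouse."""
    try:
        return mapping[task.source_warehouse]
    except KeyError:
        raise ValueError(
            f"no cleaning node mapped for {task.source_warehouse!r}"
        ) from None


task = CleaningTask("dw_materials", "material_master", ["trim_whitespace"])
node = route_task(task)
```

Routing by source data warehouse lets each child node clean data next to where it is stored; an unmapped warehouse raises an error rather than silently falling back to a default node.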

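Preferably, the data cleaning method further comprises: conflict processing: merging or deleting the detected duplicate records within each duplicate-record cluster, and retaining the correct record.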

Preferably, the method further comprises: calling an extraction tool to acquire the cleaned data generated by each source data warehouse, and synchronizing the cleaned data to any data warehouse.

It is noted that relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.

Claims

1. A data warehouse-based data cleansing method, comprising the steps of:

S3: performing a test run of the data cleaning task;

S9: displaying the automatic cleaning result of the system;

2. The data warehouse-based data cleansing method as claimed in claim 1, wherein in step S1, the original material data needs to be collected.

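3. The data warehouse-based data cleansing method as claimed in claim 1, wherein the data cleansing method further comprises: conflict processing: merging or deleting the detected duplicate records within each duplicate-record cluster, and retaining the correct record.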

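4. The data cleansing method based on data warehouse according to claim 1, wherein in S10, the specific step of cleansing the domain data comprises: reading the field value of each record in the domain data, and replacing any field value that does not meet a preset condition with a preset value or a null value.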

5. The data cleaning method based on a data warehouse according to claim 1, wherein in S2, a data cleaning task is configured according to the index statistics maintenance task table and the index cleaning code template, wherein the index statistics maintenance task table contains the elements currently used for index cleaning and their corresponding data.

6. The data cleansing method according to claim 1, wherein in S11, the common data table is further summarized with other data tables without preprocessing through the data summarizing layer.

7. The data warehouse-based data cleansing method as claimed in claim 1, further comprising: calling an extraction tool to acquire the cleaned data generated by each source data warehouse, and synchronizing the cleaned data to any data warehouse.