CN116756133A - Data cleaning flow method

Info

Publication number: CN116756133A
Application number: CN202310774234.0A
Authority: CN
Inventors: 梁郁庆; 陈锡雁; 袁军; 蔡德全; 王力; 杨子勤
Original assignee: Zhejiang Tianxi Kitchen Appliance Co Ltd
Current assignee: Zhejiang Tianxi Kitchen Appliance Co Ltd
Priority date: 2023-06-28
Filing date: 2023-06-28
Publication date: 2023-09-15

Abstract

The invention discloses a data cleaning flow method comprising the following steps: step one: collecting data; step two: counting records that are duplicated across all fields; step three: deduplicating the data if the data table contains duplicate records; step four: converting the format of the data. The authority and reliability of a data source are judged from multiple angles.

Description

Data cleaning flow method

[ field of technology ]

The invention relates to the technical field of data cleaning, and in particular to a data cleaning flow method.

[ background Art ]

When industrial data from kitchenware manufacturing are integrated, the data are complex and varied, and data quality during integration is difficult to guarantee. Existing ETL (extract-transform-load) task script scheduling schemes are inefficient, update core business data slowly, and struggle to meet the data integration requirements of the big-data environment in the kitchenware manufacturing industry.

This project builds an autonomous, controllable, intelligent data resource management platform based on SOA-architecture industrial big data fusion, TAN-network cleaning of multi-source heterogeneous inaccurate data, and ETL data warehouse management and integration technology, thereby realizing autonomous, controllable, and intelligent management of data resources.

To solve the above problems and realize autonomous, controllable, and intelligent management of data resources, a data cleaning flow method is needed.

[ summary of the invention ]

The invention aims to solve the problems in the prior art and provides a data cleaning flow method that can judge the authority and reliability of a data source from multiple angles.

To achieve the above object, the present invention provides a data cleaning flow method comprising the following steps:

step one: collecting data;

step two: counting records that are duplicated across all fields;

step three: deduplicating the data if the data table contains duplicate records;

step four: converting the format of the data;

step five: applying default-value processing to the successfully converted data;

step six: performing coding standardization;

step seven: determining the golden data source;

step eight: performing data integration;

step nine: completing the data cleaning.

Preferably, in the third step, if the data table does not have duplicate data, the format conversion is directly performed.
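As an illustration only, steps two and three can be sketched in Python as follows. The patent does not specify an implementation; the function names and the representation of each record as a dictionary with hashable values are assumptions made for this sketch.

```python
from collections import Counter


def record_key(row):
    """Build a hashable key from every field of a record (full-field comparison)."""
    return tuple(sorted(row.items()))


def count_full_field_duplicates(rows):
    """Step two: count records that occur more than once across all fields."""
    counts = Counter(record_key(row) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


def deduplicate(rows):
    """Step three: keep the first occurrence of each record, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = record_key(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def dedup_if_needed(rows):
    """Skip deduplication when the table has no duplicates and pass it on unchanged."""
    if count_full_field_duplicates(rows):
        return deduplicate(rows)
    return list(rows)


rows = [
    {"id": 1, "name": "pan"},
    {"id": 2, "name": "pot"},
    {"id": 1, "name": "pan"},
]
cleaned = dedup_if_needed(rows)
```

Here two records count as duplicates only when every field matches, which is one reading of counting "in full fields"; a key-based definition of duplicates would need a different `record_key`.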

Preferably, in the fourth step, the format conversion includes date format conversion, character-to-number conversion, and the like.

Preferably, in the fifth step, if the format conversion of a data item fails, a special value is assigned to that item, and default-value processing is then performed.

Preferably, in the fifth step, the default-value processing covers dates, amounts, lengths, and the like.
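A minimal sketch of steps four and five follows, under stated assumptions. The patent names neither the special value nor the defaults, so the sentinel `CONVERSION_FAILED`, the accepted date formats, the field names, and the values in `DEFAULTS` are all hypothetical choices for illustration.

```python
from datetime import date, datetime

# Hypothetical "special value" that marks a failed conversion.
CONVERSION_FAILED = object()

# Assumed input date formats; the patent does not list any.
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%Y%m%d")

# Assumed defaults for the date, amount, and length field types.
DEFAULTS = {"date": date(1900, 1, 1), "amount": 0.0, "length": 0.0}


def to_date(text):
    """Step four: date format conversion; returns the sentinel on failure."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except (ValueError, AttributeError):
            continue
    return CONVERSION_FAILED


def to_number(text):
    """Step four: character-to-number conversion; returns the sentinel on failure."""
    try:
        return float(str(text).replace(",", "").strip())
    except ValueError:
        return CONVERSION_FAILED


def apply_default(kind, value):
    """Step five: replace failed or missing values with the default for the field type."""
    if value is CONVERSION_FAILED or value is None:
        return DEFAULTS[kind]
    return value


def clean_record(raw):
    """Convert each field, then run default-value processing on the result."""
    converted = {
        "date": to_date(raw.get("date")),
        "amount": to_number(raw.get("amount")),
        "length": to_number(raw.get("length")),
    }
    return {kind: apply_default(kind, value) for kind, value in converted.items()}


record = clean_record({"date": "2023-06-28", "amount": "1,200.50", "length": "abc"})
```

Using a dedicated sentinel object rather than `None` keeps "conversion failed" distinguishable from "value absent" until default-value processing decides how to treat each case.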

Preferably, in the seventh step, the golden data source determination process is as follows:

Step 1: judging whether the golden data source can be evaluated by an expert; if so, the expert determines the golden data source, and the determination ends;

Step 2: if the golden data source cannot be evaluated by an expert, the following determination flow is carried out:

S1, importing the sample data to be compared;

S2, counting the field null rate;

S3, calculating the data integrity index;

S4, checking the accuracy of the data by sampling;

S5, calculating the data accuracy index;

S6, collecting statistics on the data update time points;

S7, calculating the data timeliness index;

S8, counting the number of available records;

S9, calculating the data availability index;

S10, summarizing and calculating the data quality index;

S11, comparing the index scores of the data sources:

A. if the score ratio exceeds 2:1, the golden data source is determined, and the determination ends;

B. if the score ratio does not exceed 2:1, the number of times the data is referenced is checked through source data statistics;

B1. if the reference count ratio exceeds 1:1, the golden data source is determined, and the determination ends;

B2. if the reference count ratio does not exceed 1:1, no golden data source exists, and the determination ends.
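The decision flow above can be sketched as follows for the case of two candidate sources. This is an interpretation rather than the patented implementation: the `SourceStats` type, the function name, and the reading of the ratios as "higher value to lower value" between the two candidates are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SourceStats:
    name: str
    quality_score: float  # summarized data quality index from S10
    reference_count: int  # number of downstream references to this source


def pick_golden_source(a: SourceStats, b: SourceStats,
                       expert_choice: Optional[str] = None) -> Optional[str]:
    """Return the name of the golden data source, or None if there is none."""
    # Step 1: an expert evaluation, when available, settles the question.
    if expert_choice is not None:
        return expert_choice

    # S11, branch A: quality score ratio strictly greater than 2:1.
    high, low = (a, b) if a.quality_score >= b.quality_score else (b, a)
    if high.quality_score > 2 * low.quality_score:
        return high.name

    # Branch B: fall back to reference counts; B1 needs a ratio above 1:1.
    high, low = (a, b) if a.reference_count >= b.reference_count else (b, a)
    if high.reference_count > low.reference_count:
        return high.name

    # B2: the reference counts are tied, so no golden data source exists.
    return None


erp = SourceStats("erp", quality_score=90.0, reference_count=5)
mes = SourceStats("mes", quality_score=40.0, reference_count=9)
golden = pick_golden_source(erp, mes)
```

Writing the ratio test as `high > 2 * low` avoids a division by zero when the weaker source scores zero.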

Preferably, in S3, the integrity equals the number of complete records divided by the total number of records, multiplied by 100%.

Preferably, in S5, the accuracy equals the number of correct records divided by the total number of records, multiplied by 100%.

Preferably, in S7, the timeliness equals the number of records updated in time divided by the total number of records, multiplied by 100%.

Preferably, in S9, the availability equals the number of available records divided by the total number of records, multiplied by 100%.
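The four indices share one form, a record count divided by the total and expressed as a percentage, which the following sketch makes explicit. The function names are hypothetical, and the patent does not say how S10 combines the indices, so the unweighted mean in `overall_score` is purely an assumption.

```python
def percentage(part, total):
    """Return part / total as a percentage; an empty table scores 0."""
    if total == 0:
        return 0.0
    return part * 100.0 / total


def field_null_rate(rows, field):
    """S2: percentage of records whose given field is missing or empty."""
    nulls = sum(1 for row in rows if row.get(field) in (None, ""))
    return percentage(nulls, len(rows))


def quality_indices(total, complete, correct, timely, available):
    """S3, S5, S7, S9: the four indices as defined in the text."""
    return {
        "integrity": percentage(complete, total),
        "accuracy": percentage(correct, total),
        "timeliness": percentage(timely, total),
        "availability": percentage(available, total),
    }


def overall_score(indices):
    """S10 (assumed): unweighted mean of the four indices."""
    return sum(indices.values()) / len(indices)


indices = quality_indices(total=200, complete=180, correct=150, timely=100, available=190)
score = overall_score(indices)
```

With 200 sampled records of which 180 are complete, 150 correct, 100 updated in time, and 190 available, the indices are 90%, 75%, 50%, and 95%, and the assumed mean is 77.5.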

The beneficial effects of the invention are as follows: the authority and reliability of a data source are judged from multiple angles. First, expert judgment: a data source recognized as trusted within the enterprise is adopted. Second, index judgment: if no recognized trusted data source exists, the credibility of the data is judged by integrity, accuracy, timeliness, and availability. Finally, if the source still cannot be determined, it is determined by how often downstream data references it, since data applications typically reference more reliable data more often.

The features and advantages of the present invention will be described in detail by way of example with reference to the accompanying drawings.

[ description of the drawings ]

FIG. 1 is a flow chart of a data cleansing flow method of the present invention;

FIG. 2 is a flow chart of a golden data source determination for the data cleansing flow method of the present invention.

[ detailed description of the invention ]

Referring to FIG. 1 and FIG. 2, the data cleaning flow method of the present invention includes the following steps:

step one: collecting data;

step two: counting records that are duplicated across all fields;

step three: deduplicating the data if the data table contains duplicate records;

step four: converting the format of the data;

step five: applying default-value processing to the successfully converted data;

step six: performing coding standardization;

step seven: determining the golden data source;

step eight: performing data integration;

step nine: completing the data cleaning.

In the third step, if the data table contains no duplicate data, format conversion is performed directly.

In the fourth step, the format conversion includes date format conversion, character-to-number conversion, and the like.

In the fifth step, if the format conversion of a data item fails, a special value is assigned to that item, and default-value processing is then performed.

In the fifth step, the default-value processing covers dates, amounts, lengths, and the like.

In the seventh step, the golden data source determination flow is as follows:

S1, importing the sample data to be compared;

S2, counting the field null rate;

S3, calculating the data integrity index;

S4, checking the accuracy of the data by sampling;

S5, calculating the data accuracy index;

S6, collecting statistics on the data update time points;

S7, calculating the data timeliness index;

S8, counting the number of available records;

S9, calculating the data availability index;

S10, summarizing and calculating the data quality index;

S11, comparing the index scores of the data sources.

In S3, the integrity equals the number of complete records divided by the total number of records, multiplied by 100%.

In S5, the accuracy equals the number of correct records divided by the total number of records, multiplied by 100%.

In S7, the timeliness equals the number of records updated in time divided by the total number of records, multiplied by 100%.

In S9, the availability equals the number of available records divided by the total number of records, multiplied by 100%.

The working process of the invention is as follows:

In the working process of the data cleaning flow method, the authority and reliability of a data source are judged from multiple angles. First, expert judgment: a data source recognized as trusted within the enterprise is adopted. Second, index judgment: if no recognized trusted data source exists, the credibility of the data is judged by integrity, accuracy, timeliness, and availability. Finally, if the source still cannot be determined, it is determined by how often downstream data references it, since data applications typically reference more reliable data more often.

The above embodiments are illustrative of the present invention, and not limiting, and any simple modifications of the present invention fall within the scope of the present invention.

Claims

1. A data cleaning flow method, characterized by comprising the following steps:

step one: collecting data;

step two: counting records that are duplicated across all fields;

step three: deduplicating the data if the data table contains duplicate records;

step four: converting the format of the data;

step five: applying default-value processing to the successfully converted data;

step six: performing coding standardization;

step seven: determining the golden data source;

step eight: performing data integration;

step nine: completing the data cleaning.

2. The data cleaning flow method of claim 1, wherein: in the third step, if the data table contains no duplicate data, format conversion is performed directly.

3. The data cleaning flow method of claim 1, wherein: in the fourth step, the format conversion includes date format conversion, character-to-number conversion, and the like.

4. The data cleaning flow method of claim 1, wherein: in the fifth step, if the format conversion of a data item fails, a special value is assigned to that item, and default-value processing is then performed.

5. The data cleaning flow method of claim 1, wherein: in the fifth step, the default-value processing covers dates, amounts, lengths, and the like.

6. The data cleaning flow method of claim 1, wherein: in the seventh step, the golden data source determination process is as follows:

S1, importing the sample data to be compared;

S2, counting the field null rate;

S3, calculating the data integrity index;

S4, checking the accuracy of the data by sampling;

S5, calculating the data accuracy index;

S6, collecting statistics on the data update time points;

S7, calculating the data timeliness index;

S8, counting the number of available records;

S9, calculating the data availability index;

S10, summarizing and calculating the data quality index;

S11, comparing the index scores of the data sources.

7. The data cleaning flow method of claim 6, wherein: in S3, the integrity equals the number of complete records divided by the total number of records, multiplied by 100%.

8. The data cleaning flow method of claim 1, wherein: in S5, the accuracy equals the number of correct records divided by the total number of records, multiplied by 100%.

9. The data cleaning flow method of claim 1, wherein: in S7, the timeliness equals the number of records updated in time divided by the total number of records, multiplied by 100%.

10. The data cleaning flow method of claim 1, wherein: in S9, the availability equals the number of available records divided by the total number of records, multiplied by 100%.