CN116756133A - Data cleaning flow method - Google Patents

Info

Publication number
CN116756133A
Authority
CN
China
Prior art keywords
data
gold
flow method
source
data source
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310774234.0A
Other languages
Chinese (zh)
Inventor
梁郁庆
陈锡雁
袁军
蔡德全
王力
杨子勤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Tianxi Kitchen Appliance Co Ltd
Original Assignee
Zhejiang Tianxi Kitchen Appliance Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2023-06-28
Filing date
2023-06-28
Publication date
2023-09-15
Application filed by Zhejiang Tianxi Kitchen Appliance Co Ltd
Priority to CN202310774234.0A
Publication of CN116756133A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a data cleaning flow method comprising the following steps: step one, collecting data; step two, counting records that are duplicated across all fields; step three, deduplicating the data if the data table contains duplicate records; step four, converting the format of the data. The method judges the authority and reliability of a data source from multiple angles.

Description

Data cleaning flow method
[ Technical Field ]
The invention relates to the technical field of data cleaning, and in particular to a data cleaning flow method.
[ Background Art ]
When industrial data from kitchenware manufacturing are integrated, the data are complex and varied, and data quality is difficult to guarantee during integration. Existing ETL (extract-transform-load) task script scheduling schemes are inefficient and update core business data slowly, so they are poorly suited to the data integration requirements of the big data environment in the kitchenware manufacturing industry.
The project builds an autonomous, controllable and intelligent data resource management platform on three technologies: industrial big data fusion based on an SOA architecture, cleaning of multi-source heterogeneous inaccurate data in a TAN network, and ETL data warehouse management integration. The platform thereby realizes autonomous, controllable and intelligent management of data resources.
To solve the above problems and realize autonomous, controllable and intelligent management of data resources, a data cleaning flow method is needed.
[ Summary of the Invention ]
The invention aims to solve the problems in the prior art and provides a data cleaning flow method that judges the authority and reliability of a data source from multiple angles.
To achieve the above object, the present invention provides a data cleaning flow method comprising the following steps:
Step one: collecting data;
Step two: counting records that are duplicated across all fields;
Step three: deduplicating the data if the data table contains duplicate records;
Step four: converting the format of the data;
Step five: applying default value processing to the successfully converted data;
Step six: performing coding standardization;
Step seven: determining the gold data source;
Step eight: performing data integration;
Step nine: completing the data cleaning.
Preferably, in step three, if the data table contains no duplicate records, format conversion is performed directly.
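Steps two and three can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: it assumes each record is a Python dict with hashable values, and the field names `part_no` and `qty` are hypothetical.

```python
from collections import Counter

def count_full_field_duplicates(rows):
    """Step two: count redundant records that match on every field."""
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    # A group of n identical records contributes n - 1 redundant copies.
    return sum(n - 1 for n in counts.values() if n > 1)

def deduplicate(rows):
    """Step three: keep only the first occurrence of each identical record."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# Hypothetical sample data.
rows = [
    {"part_no": "K-100", "qty": "5"},
    {"part_no": "K-100", "qty": "5"},
    {"part_no": "K-200", "qty": "7"},
]
duplicates = count_full_field_duplicates(rows)
# When no duplicates are counted, deduplication is skipped and the data
# goes straight on to format conversion.
cleaned = deduplicate(rows) if duplicates else rows
```

Counting first makes the skip condition explicit: a table with zero full-field duplicates passes through unchanged.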
Preferably, in step four, the format conversion includes date format conversion, conversion of character strings to numbers, and the like.
Preferably, in step five, if the format conversion of a data item is unsuccessful, a special value is assigned to that item, and default value processing is then performed.
Preferably, in step five, the default value processing covers dates, amounts, lengths, and the like.
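One possible reading of steps four and five is sketched below. The source does not specify the special value, the accepted input formats, or the defaults, so `CONVERSION_FAILED`, the format list, the field names and the entries of `DEFAULTS` are all assumptions made for illustration.

```python
from datetime import datetime

# Assumed "special value" that marks a failed conversion.
CONVERSION_FAILED = None

# Hypothetical defaults for a date, an amount and a length field.
DEFAULTS = {"order_date": "1970-01-01", "amount": 0.0, "length_mm": 0.0}

def to_iso_date(text):
    """Step four: normalise a date string to YYYY-MM-DD, or flag failure."""
    for fmt in ("%Y-%m-%d", "%Y/%m/%d", "%d.%m.%Y"):
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except (ValueError, TypeError):
            continue
    return CONVERSION_FAILED

def to_number(text):
    """Step four: convert a character string to a number, or flag failure."""
    try:
        return float(text)
    except (ValueError, TypeError):
        return CONVERSION_FAILED

CONVERTERS = {"order_date": to_iso_date, "amount": to_number, "length_mm": to_number}

def convert_and_default(row):
    """Steps four and five: convert each field, then replace flagged values."""
    out = {}
    for field, convert in CONVERTERS.items():
        value = convert(row.get(field))
        if value is CONVERSION_FAILED:
            value = DEFAULTS[field]  # default value processing
        out[field] = value
    return out

record = convert_and_default(
    {"order_date": "2023/06/28", "amount": "12.50", "length_mm": "n/a"}
)
```

Here the unparseable length falls back to its default, while the date and amount are converted normally.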
Preferably, in step seven, the gold data source determination flow is as follows:
Step 1: judging whether the gold data source can be evaluated by an expert; if so, the expert determines the gold data source and the determination ends;
Step 2: if the gold data source cannot be evaluated by an expert, the following determination flow is carried out:
S1, importing the sample data to be compared;
S2, counting the field null rate;
S3, calculating the data integrity index;
S4, checking data accuracy by sampling;
S5, calculating the data accuracy index;
S6, counting the data update time points;
S7, calculating the data timeliness index;
S8, counting the number of available records;
S9, calculating the data availability index;
S10, summarizing and calculating the data quality index;
S11, comparing the data source index scores:
A. if the score ratio exceeds 2:1, the gold data source is determined and the determination ends;
B. if the score ratio does not exceed 2:1, the number of times the data is referenced is checked through source data statistics:
B1. if the reference-count ratio exceeds 1:1, the gold data source is determined and the determination ends;
B2. if the reference-count ratio does not exceed 1:1, there is no gold data source and the determination ends.
Preferably, in S3, the integrity equals the number of complete records divided by the total number of records, multiplied by 100%.
Preferably, in S5, the accuracy equals the number of correct records divided by the total number of records, multiplied by 100%.
Preferably, in S7, the timeliness equals the number of records updated in time divided by the total number of records, multiplied by 100%.
Preferably, in S9, the availability equals the number of available records divided by the total number of records, multiplied by 100%.
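The four indices of S3, S5, S7 and S9 can be written compactly as below. The symbols are introduced here for illustration only and do not appear in the source: N_total is the total number of records, and the other N terms are the counts of complete, correct, timely updated and available records.

```latex
\begin{aligned}
\text{Integrity}    &= \frac{N_{\text{complete}}}{N_{\text{total}}} \times 100\% \\
\text{Accuracy}     &= \frac{N_{\text{correct}}}{N_{\text{total}}} \times 100\% \\
\text{Timeliness}   &= \frac{N_{\text{timely}}}{N_{\text{total}}} \times 100\% \\
\text{Availability} &= \frac{N_{\text{available}}}{N_{\text{total}}} \times 100\%
\end{aligned}
```

As a hypothetical worked example, a sample of 1000 records of which 950 are complete has an integrity of 950 / 1000 × 100% = 95%.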
The beneficial effects of the invention are as follows: the authority and reliability of a data source are judged from multiple angles. First, expert judgment: a data source recognized as trustworthy within the enterprise is adopted. Second, index judgment: if there is no recognized trustworthy data source, the credibility of the data is judged by integrity, accuracy, timeliness and availability. Finally, if the gold data source still cannot be determined, it is determined by how often downstream data references each source, since data applications typically reference more reliable data more often.
The features and advantages of the present invention will be described in detail by way of example with reference to the accompanying drawings.
[ Description of the Drawings ]
FIG. 1 is a flow chart of the data cleaning flow method of the present invention;
FIG. 2 is a flow chart of the gold data source determination in the data cleaning flow method of the present invention.
[ Detailed Description of the Invention ]
Referring to FIG. 1 and FIG. 2, the data cleaning flow method of the present invention comprises the following steps:
Step one: collecting data;
Step two: counting records that are duplicated across all fields;
Step three: deduplicating the data if the data table contains duplicate records;
Step four: converting the format of the data;
Step five: applying default value processing to the successfully converted data;
Step six: performing coding standardization;
Step seven: determining the gold data source;
Step eight: performing data integration;
Step nine: completing the data cleaning.
In step three, if the data table contains no duplicate records, format conversion is performed directly.
In step four, the format conversion includes date format conversion, conversion of character strings to numbers, and the like.
In step five, if the format conversion of a data item is unsuccessful, a special value is assigned to that item, and default value processing is then performed.
In step five, the default value processing covers dates, amounts, lengths, and the like.
In step seven, the gold data source determination flow is as follows:
Step 1: judging whether the gold data source can be evaluated by an expert; if so, the expert determines the gold data source and the determination ends;
Step 2: if the gold data source cannot be evaluated by an expert, the following determination flow is carried out:
S1, importing the sample data to be compared;
S2, counting the field null rate;
S3, calculating the data integrity index;
S4, checking data accuracy by sampling;
S5, calculating the data accuracy index;
S6, counting the data update time points;
S7, calculating the data timeliness index;
S8, counting the number of available records;
S9, calculating the data availability index;
S10, summarizing and calculating the data quality index;
S11, comparing the data source index scores:
A. if the score ratio exceeds 2:1, the gold data source is determined and the determination ends;
B. if the score ratio does not exceed 2:1, the number of times the data is referenced is checked through source data statistics:
B1. if the reference-count ratio exceeds 1:1, the gold data source is determined and the determination ends;
B2. if the reference-count ratio does not exceed 1:1, there is no gold data source and the determination ends.
In S3, the integrity equals the number of complete records divided by the total number of records, multiplied by 100%.
In S5, the accuracy equals the number of correct records divided by the total number of records, multiplied by 100%.
In S7, the timeliness equals the number of records updated in time divided by the total number of records, multiplied by 100%.
In S9, the availability equals the number of available records divided by the total number of records, multiplied by 100%.
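The index-based branch of the determination flow (S10, S11 and branches A, B1, B2) might look like the sketch below for two candidate sources. Several points are assumptions, because the source does not specify them: the S10 summary is taken as an unweighted mean of the four indices, "exceeds 2:1" is read as the higher score being more than twice the lower, and the `SourceStats` type and the source names `erp` and `mes` are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class SourceStats:
    name: str
    total: int       # total records in the sample (must be > 0)
    complete: int    # complete records (S2-S3)
    correct: int     # records passing the sampled accuracy check (S4-S5)
    timely: int      # records updated in time (S6-S7)
    available: int   # available records (S8-S9)
    references: int  # times downstream data references this source

def quality_score(s):
    """S10: summarise the four indices; the unweighted mean is an assumption."""
    counts = [s.complete, s.correct, s.timely, s.available]
    return sum(100.0 * n / s.total for n in counts) / len(counts)

def pick_gold_source(a, b):
    """S11: return the gold source's name, or None when no source qualifies."""
    hi, lo = sorted((a, b), key=quality_score, reverse=True)
    if quality_score(hi) > 2 * quality_score(lo):  # branch A: ratio exceeds 2:1
        return hi.name
    more, less = sorted((a, b), key=lambda s: s.references, reverse=True)
    if more.references > less.references:          # branch B1: ratio exceeds 1:1
        return more.name
    return None                                    # branch B2: no gold source

# Hypothetical sample statistics for two candidate sources.
erp = SourceStats("erp", 1000, 950, 900, 800, 990, 12)
mes = SourceStats("mes", 1000, 400, 450, 300, 500, 3)
gold = pick_gold_source(erp, mes)
```

The comparisons multiply rather than divide, which avoids a division by zero when one source scores 0 or is never referenced.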
The working process of the invention is as follows:
During operation, the data cleaning flow method judges the authority and reliability of a data source from multiple angles. First, expert judgment: a data source recognized as trustworthy within the enterprise is adopted. Second, index judgment: if there is no recognized trustworthy data source, the credibility of the data is judged by integrity, accuracy, timeliness and availability. Finally, if the gold data source still cannot be determined, it is determined by how often downstream data references each source, since data applications typically reference more reliable data more often.
The above embodiments illustrate the present invention and do not limit it; any simple modification of the present invention falls within its scope of protection.

Claims (10)

1. A data cleaning flow method, characterized by comprising the following steps:
Step one: collecting data;
Step two: counting records that are duplicated across all fields;
Step three: deduplicating the data if the data table contains duplicate records;
Step four: converting the format of the data;
Step five: applying default value processing to the successfully converted data;
Step six: performing coding standardization;
Step seven: determining the gold data source;
Step eight: performing data integration;
Step nine: completing the data cleaning.
2. The data cleaning flow method of claim 1, wherein: in step three, if the data table contains no duplicate records, format conversion is performed directly.
3. The data cleaning flow method of claim 1, wherein: in step four, the format conversion includes date format conversion, conversion of character strings to numbers, and the like.
4. The data cleaning flow method of claim 1, wherein: in step five, if the format conversion of a data item is unsuccessful, a special value is assigned to that item, and default value processing is then performed.
5. The data cleaning flow method of claim 1, wherein: in step five, the default value processing covers dates, amounts, lengths, and the like.
6. The data cleaning flow method of claim 1, wherein: in step seven, the gold data source determination flow is as follows:
Step 1: judging whether the gold data source can be evaluated by an expert; if so, the expert determines the gold data source and the determination ends;
Step 2: if the gold data source cannot be evaluated by an expert, the following determination flow is carried out:
S1, importing the sample data to be compared;
S2, counting the field null rate;
S3, calculating the data integrity index;
S4, checking data accuracy by sampling;
S5, calculating the data accuracy index;
S6, counting the data update time points;
S7, calculating the data timeliness index;
S8, counting the number of available records;
S9, calculating the data availability index;
S10, summarizing and calculating the data quality index;
S11, comparing the data source index scores:
A. if the score ratio exceeds 2:1, the gold data source is determined and the determination ends;
B. if the score ratio does not exceed 2:1, the number of times the data is referenced is checked through source data statistics:
B1. if the reference-count ratio exceeds 1:1, the gold data source is determined and the determination ends;
B2. if the reference-count ratio does not exceed 1:1, there is no gold data source and the determination ends.
7. The data cleaning flow method of claim 6, wherein: in S3, the integrity equals the number of complete records divided by the total number of records, multiplied by 100%.
8. The data cleaning flow method of claim 1, wherein: in S5, the accuracy equals the number of correct records divided by the total number of records, multiplied by 100%.
9. The data cleaning flow method of claim 1, wherein: in S7, the timeliness equals the number of records updated in time divided by the total number of records, multiplied by 100%.
10. The data cleaning flow method of claim 1, wherein: in S9, the availability equals the number of available records divided by the total number of records, multiplied by 100%.
CN202310774234.0A 2023-06-28 2023-06-28 Data cleaning flow method Pending CN116756133A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310774234.0A CN116756133A (en) 2023-06-28 2023-06-28 Data cleaning flow method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310774234.0A CN116756133A (en) 2023-06-28 2023-06-28 Data cleaning flow method

Publications (1)

Publication Number Publication Date
CN116756133A true CN116756133A (en) 2023-09-15

Family

ID=87956863

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310774234.0A Pending CN116756133A (en) 2023-06-28 2023-06-28 Data cleaning flow method

Country Status (1)

Country Link
CN (1) CN116756133A (en)

Similar Documents

Publication Title
CN112256782B (en) Hadoop-based power big data processing system
CN108718345A (en) A kind of digitlization workshop industrial data Network Transmitting system
CN111178587A (en) Spark framework-based short-term power load rapid prediction method
CN112347071A (en) Power distribution network cloud platform data fusion method and power distribution network cloud platform
CN111125069B (en) Data cleaning fusion system
CN116777284A (en) Space and attribute data integrated quality inspection method
CN116756133A (en) Data cleaning flow method
CN116775632A (en) Near-real-time cleaning data execution method based on vehicle-mounted terminal acquisition data
CN116881535A (en) Public opinion comprehensive supervision system with timely early warning function
CN107766452B (en) Indexing system suitable for high-speed access of power dispatching data and indexing method thereof
CN111143651A (en) New media integration operation data acquisition analysis system for management
CN113986990B (en) Data resource acquisition and labeling method and device based on block chain data mining
CN111277614A (en) Industrial energy management system based on cloud data
CN110347726A (en) A kind of efficient time series data is integrated to store inquiry system and method
CN115470279A (en) Data source conversion method, device, equipment and medium based on enterprise data
CN110244096B (en) Method for automatically discovering and processing electric meter full code in electric energy metering system
CN115422275A (en) Data processing method, device, equipment and storage medium
CN115203290A (en) Fault diagnosis method based on multi-dimensional prefix span algorithm
CN115034128A (en) Evaluation method for intelligent wind power plant wind turbine generator set of intelligent wind power plant
US20170337644A1 (en) Data driven invocation of realtime wind market forecasting analytics
CN106777313A (en) Based on holographic time scale measurement electric network data calculated value and calculated value Component Analysis method
CN113283881A (en) Automatic auditing method and system for telecontrol information source
CN109524983B (en) Photovoltaic output modeling method based on typical state
CN109165089B (en) Non-preemptible scheduling method of overload real-time system based on MaxSAT optimal solution
CN116680328A (en) Data processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination