CN116756133A - Data cleaning flow method - Google Patents
Data cleaning flow method
- Publication number
- CN116756133A (application CN202310774234.0A)
- Authority
- CN
- China
- Prior art keywords
- data
- gold
- flow method
- source
- data source
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
- G06F16/254—Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Quality & Reliability (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a data cleaning flow method comprising the following steps: step one, collecting data; step two, counting the records that are duplicated across all fields; step three, deduplicating the data if the data table contains duplicate records; step four, converting the format of the data. The method judges the authority and reliability of a data source from multiple angles.
Description
[Technical Field]
The invention relates to the technical field of data cleaning, and in particular to a data cleaning flow method.
[Background Art]
When industrial data from kitchenware manufacturing is integrated, the data is complex and varied, and data quality is difficult to guarantee during integration. Existing ETL (extract-transform-load) task script scheduling schemes are inefficient and update core business data slowly, so they struggle to meet the data integration requirements of the big data environment in the kitchenware manufacturing industry.
This project builds an autonomous, controllable, and intelligent data resource management platform on three technologies: SOA-architecture industrial big data fusion, cleaning of TAN-network multi-source heterogeneous inaccurate data, and ETL-based data warehouse management integration. The platform realizes autonomous, controllable, and intelligent management of data resources.
To solve the above problems and realize autonomous, controllable, and intelligent management of data resources, a data cleaning flow method is needed.
[Summary of the Invention]
The invention aims to solve the problems in the prior art and provides a data cleaning flow method that judges the authority and reliability of a data source from multiple angles.
To achieve this object, the invention provides a data cleaning flow method comprising the following steps:
Step one: collect data;
Step two: count the records that are duplicated across all fields;
Step three: deduplicate the data if the data table contains duplicate records;
Step four: convert the format of the data;
Step five: apply default-value processing to the successfully converted data;
Step six: perform coding standardization;
Step seven: determine the golden data source;
Step eight: integrate the data;
Step nine: data cleaning is complete.
Preferably, in step three, if the data table contains no duplicate data, format conversion is performed directly.
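Steps two and three can be illustrated with a minimal Python sketch. It is not taken from the patent: the function names are hypothetical, and it assumes each record is a tuple of hashable field values and that the first occurrence of a duplicated record is the one to keep.

```python
from collections import Counter


def count_full_field_duplicates(rows):
    """Step two: count records whose values repeat across ALL fields."""
    counts = Counter(tuple(row) for row in rows)
    # Keep only the records that occur more than once.
    return {record: n for record, n in counts.items() if n > 1}


def deduplicate(rows):
    """Step three: drop repeated records, keeping the first occurrence."""
    if not count_full_field_duplicates(rows):
        # No duplicates: the table goes straight on to format conversion.
        return list(rows)
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique
```

Comparing whole tuples means two records count as duplicates only when every field matches, which is one reading of "duplicated across all fields".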
Preferably, in step four, the format conversion includes date format conversion, character-to-number conversion, and the like.
Preferably, in step five, if the format conversion of a data item is unsuccessful, a special value is assigned to that item, and default-value processing is then performed.
Preferably, in step five, the default-value processing covers dates, amounts, lengths, and the like.
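Steps four and five can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the sentinel `CONVERSION_FAILED`, the accepted date formats, and the default values in `DEFAULTS` are all hypothetical choices, and only the date and amount cases are shown.

```python
from datetime import date, datetime

# Hypothetical "special value" that marks a failed conversion.
CONVERSION_FAILED = object()

# Assumed input date formats and default values; the patent does not specify them.
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%Y%m%d")
DEFAULTS = {"date": date(1900, 1, 1), "amount": 0.0}


def to_date(text):
    """Step four: try each known date format in turn."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return CONVERSION_FAILED


def to_number(text):
    """Step four: convert a character string to a number."""
    try:
        return float(text.strip().replace(",", ""))
    except ValueError:
        return CONVERSION_FAILED


def apply_default(value, kind):
    """Step five: replace missing or failed values with the default for their kind."""
    if value is CONVERSION_FAILED or value is None:
        return DEFAULTS[kind]
    return value
```

Using a dedicated sentinel object rather than `None` keeps "conversion failed" distinguishable from "value was missing" until default-value processing runs.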
Preferably, in step seven, the golden data source determination flow is as follows:
Step1: determine whether the golden data source can be evaluated by an expert; if so, the expert determines the golden data source and the determination ends;
Step2: if the golden data source cannot be evaluated by an expert, the following determination flow is carried out:
S1, import the sample data to be compared;
S2, count the field null rate;
S3, calculate the data integrity index;
S4, check the accuracy of the data by sampling;
S5, calculate the data accuracy index;
S6, collect statistics on the data update time points;
S7, calculate the data timeliness index;
S8, count the number of available records;
S9, calculate the data availability index;
S10, summarize and calculate the data quality index;
S11, compare the index scores of the data sources:
A. if the score ratio exceeds 2:1, the golden data source is determined and the determination ends;
B. if the score ratio does not exceed 2:1, check the number of references to the data using the source data statistics;
B1. if the ratio of reference counts exceeds 1:1, the golden data source is determined and the determination ends;
B2. if the ratio of reference counts does not exceed 1:1, there is no golden data source and the determination ends.
Preferably, in S3, the integrity equals the number of complete records divided by the total number of records, multiplied by 100%.
Preferably, in S5, the accuracy equals the number of correct records divided by the total number of records, multiplied by 100%.
Preferably, in S7, the timeliness equals the number of records updated in time divided by the total number of records, multiplied by 100%.
Preferably, in S9, the availability equals the number of available records divided by the total number of records, multiplied by 100%.
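The four indices in S3, S5, S7, and S9 can be written compactly as below. The symbols are introduced here for illustration and do not appear in the patent: N_total is the total number of records, and the other counts are the complete, correct, timely updated, and available records respectively.

```latex
\begin{align*}
\text{Integrity}    &= \frac{N_{\text{complete}}}{N_{\text{total}}} \times 100\% \\
\text{Accuracy}     &= \frac{N_{\text{correct}}}{N_{\text{total}}} \times 100\% \\
\text{Timeliness}   &= \frac{N_{\text{timely}}}{N_{\text{total}}} \times 100\% \\
\text{Availability} &= \frac{N_{\text{available}}}{N_{\text{total}}} \times 100\%
\end{align*}
```

As a purely illustrative case, a sample of 1000 records of which 950 are complete would have an integrity of 950 / 1000 × 100% = 95%.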
The beneficial effects of the invention are as follows: the authority and reliability of a data source are judged from multiple angles. First, expert judgment: a data source recognized as trusted within the enterprise is taken as the golden data source. Second, index judgment: if there is no recognized trusted data source, the credibility of the data is judged by its integrity, accuracy, timeliness, and availability. Finally, if the golden data source still cannot be determined, it is determined by how often downstream data references each source, since data applications generally reference more reliable data more often.
The features and advantages of the present invention will be described in detail by way of example with reference to the accompanying drawings.
[Description of the Drawings]
FIG. 1 is a flow chart of the data cleaning flow method of the present invention;
FIG. 2 is a flow chart of the golden data source determination in the data cleaning flow method of the present invention.
[Detailed Description]
Referring to FIG. 1 and FIG. 2, the data cleaning flow method of the present invention comprises the following steps:
Step one: collect data;
Step two: count the records that are duplicated across all fields;
Step three: deduplicate the data if the data table contains duplicate records;
Step four: convert the format of the data;
Step five: apply default-value processing to the successfully converted data;
Step six: perform coding standardization;
Step seven: determine the golden data source;
Step eight: integrate the data;
Step nine: data cleaning is complete.
In step three, if the data table contains no duplicate data, format conversion is performed directly.
In step four, the format conversion includes date format conversion, character-to-number conversion, and the like.
In step five, if the format conversion of a data item is unsuccessful, a special value is assigned to that item, and default-value processing is then performed.
In step five, the default-value processing covers dates, amounts, lengths, and the like.
In step seven, the golden data source determination flow is as follows:
Step1: determine whether the golden data source can be evaluated by an expert; if so, the expert determines the golden data source and the determination ends;
Step2: if the golden data source cannot be evaluated by an expert, the following determination flow is carried out:
S1, import the sample data to be compared;
S2, count the field null rate;
S3, calculate the data integrity index;
S4, check the accuracy of the data by sampling;
S5, calculate the data accuracy index;
S6, collect statistics on the data update time points;
S7, calculate the data timeliness index;
S8, count the number of available records;
S9, calculate the data availability index;
S10, summarize and calculate the data quality index;
S11, compare the index scores of the data sources:
A. if the score ratio exceeds 2:1, the golden data source is determined and the determination ends;
B. if the score ratio does not exceed 2:1, check the number of references to the data using the source data statistics;
B1. if the ratio of reference counts exceeds 1:1, the golden data source is determined and the determination ends;
B2. if the ratio of reference counts does not exceed 1:1, there is no golden data source and the determination ends.
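The determination flow above can be sketched as a small decision function. This is one possible reading, not the patent's own code: it assumes exactly two candidate sources are compared, reads "exceeds 2:1" as the higher quality score being more than twice the lower, and reads "exceeds 1:1" as one source being referenced strictly more often than the other. The names `SourceStats` and `pick_golden_source` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceStats:
    name: str
    quality_score: float   # summarized data quality index from S10
    reference_count: int   # times downstream data references this source


def pick_golden_source(a: SourceStats, b: SourceStats,
                       expert_choice: Optional[str] = None) -> Optional[str]:
    """Return the name of the golden data source, or None if there is none."""
    # Step1: an expert verdict, when available, ends the determination.
    if expert_choice is not None:
        return expert_choice

    # S11 / A: score ratio above 2:1 (multiplication avoids dividing by zero).
    high, low = sorted((a, b), key=lambda s: s.quality_score, reverse=True)
    if high.quality_score > 2 * low.quality_score:
        return high.name

    # B / B1: otherwise compare reference counts; ratio above 1:1 wins.
    more, fewer = sorted((a, b), key=lambda s: s.reference_count, reverse=True)
    if more.reference_count > fewer.reference_count:
        return more.name

    # B2: no golden data source.
    return None
```

For example, with scores of 90 and 60 the 2:1 threshold is not met, so the source with more downstream references is chosen; with equal reference counts the function returns `None`.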
In S3, the integrity equals the number of complete records divided by the total number of records, multiplied by 100%.
In S5, the accuracy equals the number of correct records divided by the total number of records, multiplied by 100%.
In S7, the timeliness equals the number of records updated in time divided by the total number of records, multiplied by 100%.
In S9, the availability equals the number of available records divided by the total number of records, multiplied by 100%.
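A minimal Python sketch of S2 through S10 over a sample of records follows. Several points are assumptions rather than part of the disclosure: records are dictionaries with an `updated` date field, "correct" and "available" are decided by caller-supplied predicates, "updated in time" means updated on or after a cutoff date, and the summary score in S10 is taken as the unweighted mean of the four indices because the patent does not state how they are combined.

```python
def percentage(part, total):
    """Return part / total as a percentage; 0.0 for an empty sample."""
    return 100.0 * part / total if total else 0.0


def is_blank(value):
    return value is None or value == ""


def field_null_rate(records, field):
    """S2: share of records in which the given field is empty."""
    return percentage(sum(is_blank(r.get(field)) for r in records), len(records))


def quality_indices(records, required_fields, is_correct, is_usable, updated_since):
    """S3, S5, S7, S9: each index is a record count over the total, times 100%."""
    total = len(records)
    complete = sum(all(not is_blank(r.get(f)) for f in required_fields) for r in records)
    correct = sum(bool(is_correct(r)) for r in records)
    timely = sum(r["updated"] >= updated_since for r in records)
    usable = sum(bool(is_usable(r)) for r in records)
    return {
        "integrity": percentage(complete, total),
        "accuracy": percentage(correct, total),
        "timeliness": percentage(timely, total),
        "availability": percentage(usable, total),
    }


def overall_score(indices):
    """S10 (assumed): unweighted mean of the four indices."""
    return sum(indices.values()) / len(indices)
```

The resulting per-source summary scores are what the comparison in S11 would consume.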
The working process of the invention is as follows:
During operation of the data cleaning flow method, the authority and reliability of a data source are judged from multiple angles. First, expert judgment: a data source recognized as trusted within the enterprise is taken as the golden data source. Second, index judgment: if there is no recognized trusted data source, the credibility of the data is judged by its integrity, accuracy, timeliness, and availability. Finally, if the golden data source still cannot be determined, it is determined by how often downstream data references each source, since data applications generally reference more reliable data more often.
The above embodiment illustrates the invention and does not limit it; any simple modification of the invention falls within its scope of protection.
Claims (10)
1. A data cleaning flow method, characterized by comprising the following steps:
Step one: collect data;
Step two: count the records that are duplicated across all fields;
Step three: deduplicate the data if the data table contains duplicate records;
Step four: convert the format of the data;
Step five: apply default-value processing to the successfully converted data;
Step six: perform coding standardization;
Step seven: determine the golden data source;
Step eight: integrate the data;
Step nine: data cleaning is complete.
2. The data cleaning flow method of claim 1, wherein in step three, if the data table contains no duplicate data, format conversion is performed directly.
3. The data cleaning flow method of claim 1, wherein in step four, the format conversion includes date format conversion, character-to-number conversion, and the like.
4. The data cleaning flow method of claim 1, wherein in step five, if the format conversion of a data item is unsuccessful, a special value is assigned to that item, and default-value processing is then performed.
5. The data cleaning flow method of claim 1, wherein in step five, the default-value processing covers dates, amounts, lengths, and the like.
6. The data cleaning flow method of claim 1, wherein in step seven, the golden data source determination flow is as follows:
Step1: determine whether the golden data source can be evaluated by an expert; if so, the expert determines the golden data source and the determination ends;
Step2: if the golden data source cannot be evaluated by an expert, the following determination flow is carried out:
S1, import the sample data to be compared;
S2, count the field null rate;
S3, calculate the data integrity index;
S4, check the accuracy of the data by sampling;
S5, calculate the data accuracy index;
S6, collect statistics on the data update time points;
S7, calculate the data timeliness index;
S8, count the number of available records;
S9, calculate the data availability index;
S10, summarize and calculate the data quality index;
S11, compare the index scores of the data sources:
A. if the score ratio exceeds 2:1, the golden data source is determined and the determination ends;
B. if the score ratio does not exceed 2:1, check the number of references to the data using the source data statistics;
B1. if the ratio of reference counts exceeds 1:1, the golden data source is determined and the determination ends;
B2. if the ratio of reference counts does not exceed 1:1, there is no golden data source and the determination ends.
7. The data cleaning flow method of claim 6, wherein in S3, the integrity equals the number of complete records divided by the total number of records, multiplied by 100%.
8. The data cleaning flow method of claim 1, wherein in S5, the accuracy equals the number of correct records divided by the total number of records, multiplied by 100%.
9. The data cleaning flow method of claim 1, wherein in S7, the timeliness equals the number of records updated in time divided by the total number of records, multiplied by 100%.
10. The data cleaning flow method of claim 1, wherein in S9, the availability equals the number of available records divided by the total number of records, multiplied by 100%.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310774234.0A CN116756133A (en) | 2023-06-28 | 2023-06-28 | Data cleaning flow method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310774234.0A CN116756133A (en) | 2023-06-28 | 2023-06-28 | Data cleaning flow method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116756133A (en) | 2023-09-15 |
Family
ID=87956863
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310774234.0A Pending CN116756133A (en) | 2023-06-28 | 2023-06-28 | Data cleaning flow method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116756133A (en) |
- 2023-06-28: CN application CN202310774234.0A, published as CN116756133A (en), active, Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112256782B (en) | Hadoop-based power big data processing system | |
CN108718345A (en) | A kind of digitlization workshop industrial data Network Transmitting system | |
CN111178587A (en) | Spark framework-based short-term power load rapid prediction method | |
CN112347071A (en) | Power distribution network cloud platform data fusion method and power distribution network cloud platform | |
CN111125069B (en) | Data cleaning fusion system | |
CN116777284A (en) | Space and attribute data integrated quality inspection method | |
CN116756133A (en) | Data cleaning flow method | |
CN116775632A (en) | Near-real-time cleaning data execution method based on vehicle-mounted terminal acquisition data | |
CN116881535A (en) | Public opinion comprehensive supervision system with timely early warning function | |
CN107766452B (en) | Indexing system suitable for high-speed access of power dispatching data and indexing method thereof | |
CN111143651A (en) | New media integration operation data acquisition analysis system for management | |
CN113986990B (en) | Data resource acquisition and labeling method and device based on block chain data mining | |
CN111277614A (en) | Industrial energy management system based on cloud data | |
CN110347726A (en) | A kind of efficient time series data is integrated to store inquiry system and method | |
CN115470279A (en) | Data source conversion method, device, equipment and medium based on enterprise data | |
CN110244096B (en) | Method for automatically discovering and processing electric meter full code in electric energy metering system | |
CN115422275A (en) | Data processing method, device, equipment and storage medium | |
CN115203290A (en) | Fault diagnosis method based on multi-dimensional prefix span algorithm | |
CN115034128A (en) | Evaluation method for intelligent wind power plant wind turbine generator set of intelligent wind power plant | |
US20170337644A1 (en) | Data driven invocation of realtime wind market forecasting analytics | |
CN106777313A (en) | Based on holographic time scale measurement electric network data calculated value and calculated value Component Analysis method | |
CN113283881A (en) | Automatic auditing method and system for telecontrol information source | |
CN109524983B (en) | Photovoltaic output modeling method based on typical state | |
CN109165089B (en) | Non-preemptible scheduling method of overload real-time system based on MaxSAT optimal solution | |
CN116680328A (en) | Data processing method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||