CN112506903B - Data quality representation method using specimen line - Google Patents
- Publication number
- CN112506903B
- Authority
- CN
- China
- Prior art keywords
- data
- database
- repair
- check
- data quality
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/23—Updating
Abstract
The invention discloses a verification method, in the field of data governance, that uses a specimen line to measure the effect of data quality work. The method comprises: collecting data into a target database through files, interfaces, library tables, manual entry and similar means; setting a detection model and detecting the target database; obtaining the detection results to form a problem database; repairing the problems and optimizing the sources of the target database so that data problems are solved at the source; cycling the data detection and repair tasks multiple times; calculating the data accuracy before repair and the data accuracy after repair; and finally using the specimen line to continuously show the trends of the "marked line" and the "local line" during data quality governance. Through the specimen line, the invention gives users a continuous, whole-process analysis of data quality management, shows the effect of treating both symptoms and root causes, solves data quality problems at the source, and promotes a virtuous cycle of data quality.
Description
Technical Field
The invention relates to the technical field of data governance, and in particular to a verification method that uses a specimen line to measure the effect of data quality work in data governance.
Background
With the deepening of informatization, enterprises and government bodies accumulate ever larger volumes of data. Because of differing sources, acquisition modes and using departments, non-uniform storage, incomplete data specifications and similar causes, much of this data becomes dirty data and seriously hinders data analysis and decision-making. Accordingly, enterprises and government agencies increasingly treat improving data quality as an important task of data governance.
At present, the effect of data quality work is mainly presented as a quality analysis report for a single batch of data or for certain specific data. Such a report reflects only the current data quality of one link, from one side; it lacks continuity and cannot comprehensively verify the effect of data quality management.
Therefore, a continuous, whole-process analysis of data quality management is needed to verify the overall effect of data quality management work and to guide that work.
Disclosure of Invention
To address the shortcomings of existing methods for verifying the results of data quality work, the invention proposes a specimen-line representation for analyzing data quality management work, which shows the effect of treating both symptoms and root causes, solves data quality problems at the source, and promotes a virtuous cycle of data quality.
To achieve this purpose, the invention adopts a specimen-line data quality representation, which specifically comprises the following steps:
Step one: a user collects data into a target database through file exchange, interface exchange, library table exchange, manual collection and similar means.
Step two: a detection model is set, and the data in the target database is detected.
The detection model comprises rules such as format normalization check, referential integrity check, null check, missing-data check, data quantity check, uniqueness check, value range check, logic check, consistency check, cross-comparison check and timeliness check.
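The patent does not say how the detection rules are implemented. As a minimal sketch, three of the listed rule types (null check, format normalization check, value range check) could be written as small functions applied to each record. All names here (`null_check`, `format_check`, `range_check`, `detect`, the `phone` and `age` fields) are hypothetical and chosen only for illustration.

```python
import re
from typing import Callable, Dict, List, Optional

Record = Dict[str, Optional[str]]
# A rule returns the name of the offending field, or None if the record passes.
Rule = Callable[[Record], Optional[str]]


def null_check(field: str) -> Rule:
    """Flag records where `field` is missing or blank."""
    def rule(rec: Record) -> Optional[str]:
        value = rec.get(field)
        return field if value is None or value.strip() == "" else None
    return rule


def format_check(field: str, pattern: str) -> Rule:
    """Flag records whose non-empty `field` does not fully match `pattern`."""
    compiled = re.compile(pattern)

    def rule(rec: Record) -> Optional[str]:
        value = rec.get(field)
        if value and not compiled.fullmatch(value):
            return field
        return None
    return rule


def range_check(field: str, low: float, high: float) -> Rule:
    """Flag records whose numeric `field` is non-numeric or outside [low, high]."""
    def rule(rec: Record) -> Optional[str]:
        value = rec.get(field)
        if not value:
            return None  # emptiness is left to the null check
        try:
            number = float(value)
        except ValueError:
            return field
        return None if low <= number <= high else field
    return rule


def detect(records: List[Record], rules: List[Rule]) -> List[dict]:
    """Apply every rule to every record; return one problem entry per failure."""
    problems = []
    for index, rec in enumerate(records):
        for rule in rules:
            bad_field = rule(rec)
            if bad_field is not None:
                problems.append(
                    {"row": index, "field": bad_field, "value": rec.get(bad_field)}
                )
    return problems


# Hypothetical sample data, not taken from the patent.
records = [
    {"id": "1", "phone": "13800000000", "age": "30"},
    {"id": "2", "phone": "", "age": "200"},
    {"id": "3", "phone": "12ab", "age": "41"},
]
rules = [null_check("phone"), format_check("phone", r"\d{11}"), range_check("age", 0, 150)]
problems = detect(records, rules)
```

Each entry in `problems` carries the checked field name and value, which is the information the problem database is described as storing.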
Step three: the detection result is obtained, a problem database is formed, and the same-day detected data increment $M_t$ and the same-day data increment $N_t$ of the problem database are obtained.
Here $M_t$ is the increment of detected data in the target database at the $t$-th detection, and $N_t$ is the increment of problem data in the problem database at the $t$-th detection.
Further, the problem database is constructed as follows:
The problem database contains a basic field $Z_w$ and a mark field $Z_b$. The basic field $Z_w$ comes from the fields of the target database and includes the checked field name, the checked field value, etc.; the mark field $Z_b$ marks the status of the problem data and includes a repair identifier, etc.
For data entering the problem database for the first time, $Z_b$ defaults to unrepaired.
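One possible layout for such a problem database is sketched below with Python's built-in `sqlite3` module. The table name `problem_data`, the column names and the helper functions are assumptions made for illustration; the patent only specifies that basic fields (checked field name and value) and a mark field with a repair identifier exist, and that the mark defaults to unrepaired.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE problem_data (
        id                INTEGER PRIMARY KEY AUTOINCREMENT,
        detection_round   INTEGER NOT NULL,                    -- t (hypothetical column)
        check_field_name  TEXT    NOT NULL,                    -- basic field Z_w
        check_field_value TEXT,                                -- basic field Z_w
        repair_status     TEXT    NOT NULL DEFAULT 'unrepaired' -- mark field Z_b
    )
    """
)


def insert_problem(conn, detection_round, field_name, field_value):
    """Insert one problem row; repair_status falls back to 'unrepaired'."""
    cur = conn.execute(
        "INSERT INTO problem_data (detection_round, check_field_name, check_field_value) "
        "VALUES (?, ?, ?)",
        (detection_round, field_name, field_value),
    )
    return cur.lastrowid


def mark_repaired(conn, problem_id):
    """Flip the mark field of one problem row to 'repaired'."""
    conn.execute(
        "UPDATE problem_data SET repair_status = 'repaired' WHERE id = ?",
        (problem_id,),
    )


first = insert_problem(conn, 1, "phone", "")
second = insert_problem(conn, 1, "age", "200")
mark_repaired(conn, first)
statuses = [row[0] for row in conn.execute(
    "SELECT repair_status FROM problem_data ORDER BY id")]
```

Putting the default in the schema means the insert path never has to set the repair mark explicitly.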
Step four: the user repairs the problem data according to the detection result.
The number $R_t$ of repaired problems in the problem database on the current day is calculated from the status of the problem database.
Step five: the user optimizes the sources of the target database, solving data problems at the source.
Source optimization of the target database comprises optimization of the business process, the acquisition process and the processing process.
Step six: steps two to five are repeated, carrying out the data detection and repair tasks multiple times.
The problem database construction in step three does not need to be repeated after the first construction.
Further, at the $t$-th detection of the target database, the server updates the problem database and records $M_t$, $N_t$ and $R_t$ as follows:
S1. Record the detected data increment $M_t$ of the target database at the $t$-th detection;
S2. Insert the new problem data into the problem database and record $N_t$, the number of problems newly added to the problem database;
S3. For problem data that has been repaired, the server updates $Z_b$ to repaired and records $R_t$, the increment of repaired problems in the problem database after the $t$-th detection.
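The bookkeeping in S1 to S3 can be sketched as a small in-memory store that logs one record per detection round. The class and method names (`ProblemStore`, `RoundRecord`, `record_round`) and the string keys identifying problems are hypothetical; a real deployment would presumably keep this state in the problem database itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class RoundRecord:
    t: int  # detection round number
    m: int  # M_t: detected data increment
    n: int  # N_t: newly added problems
    r: int  # R_t: problems newly marked as repaired


@dataclass
class ProblemStore:
    status: Dict[str, str] = field(default_factory=dict)   # problem key -> Z_b
    history: List[RoundRecord] = field(default_factory=list)

    def record_round(self, detected_count: int,
                     new_problem_keys: Set[str],
                     repaired_keys: Set[str]) -> RoundRecord:
        t = len(self.history) + 1
        # S2: insert only problems that are not already stored.
        added = 0
        for key in sorted(new_problem_keys):
            if key not in self.status:
                self.status[key] = "unrepaired"
                added += 1
        # S3: flip the mark field for problems repaired in this round.
        repaired = 0
        for key in sorted(repaired_keys):
            if self.status.get(key) == "unrepaired":
                self.status[key] = "repaired"
                repaired += 1
        # S1: the detected increment is recorded together with N_t and R_t.
        record = RoundRecord(t=t, m=detected_count, n=added, r=repaired)
        self.history.append(record)
        return record


store = ProblemStore()
first = store.record_round(100, {"row7:phone", "row9:age"}, {"row7:phone"})
second = store.record_round(0, set(), {"row7:phone", "row9:age"})
```

Counting a repair only when the status actually changes keeps $R_t$ an increment, so a problem repaired in an earlier round is not counted twice.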
Step seven: the data accuracy $P$ before repair and the data accuracy $Q$ after repair are calculated for the $n$-th detection, as follows:
S1. Calculate the data accuracy before repair:
$$P = \frac{\sum_{t=1}^{n} M_t - \sum_{t=1}^{n} N_t}{\sum_{t=1}^{n} M_t} \times 100\%$$
where $M_t$ is the detected data increment of the target database at the $t$-th detection, $N_t$ is the problem data increment at the $t$-th detection, and $t = 1, 2, \ldots, n$.
S2. Calculate the data accuracy after repair:
$$Q = \frac{\sum_{t=1}^{n} M_t - \sum_{t=1}^{n} N_t + \sum_{t=1}^{n} R_t}{\sum_{t=1}^{n} M_t} \times 100\%$$
where $M_t$ is the detected data increment of the target database at the $t$-th detection, $N_t$ is the problem data increment at the $t$-th detection, $R_t$ is the increment of problem data successfully repaired according to the $t$-th detection result, and $t = 1, 2, \ldots, n$.
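A small function makes the two rates concrete. It assumes that the pre-repair accuracy is the share of all detected data not flagged as a problem, and that the post-repair accuracy additionally counts repaired problems as correct, which is how the definitions of $M_t$, $N_t$ and $R_t$ read. The function name `accuracy_rates` and the sample numbers are hypothetical.

```python
from typing import Sequence, Tuple


def accuracy_rates(m: Sequence[int], n: Sequence[int],
                   r: Sequence[int]) -> Tuple[float, float]:
    """Return (P, Q) as fractions after len(m) detection rounds.

    m[t], n[t], r[t] hold the per-round increments M_t, N_t and R_t.
    """
    if not (len(m) == len(n) == len(r)):
        raise ValueError("m, n and r must cover the same rounds")
    total = sum(m)
    if total == 0:
        raise ValueError("no data has been detected yet")
    problems = sum(n)
    repaired = sum(r)
    p = (total - problems) / total             # accuracy before repair
    q = (total - problems + repaired) / total  # accuracy after repair
    return p, q


# Two rounds of made-up counts.
p, q = accuracy_rates([1000, 500], [100, 20], [60, 30])
print(f"P = {p:.1%}, Q = {q:.1%}")
```

Because the sums run over all rounds so far, each call yields one point on each trend line rather than a per-batch score.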
Step eight: by repeatedly executing step seven, the data accuracy before repair and the data accuracy after repair are obtained for each round, and the specimen line continuously shows the trends of the "marked line" and the "local line" during data quality governance.
The trend of the "marked line" represents the effect of governing the data quality problems that have been detected, i.e. how well the data quality work deals with existing problems; a rising "marked line" reflects the diligence of the responsible data department. The trend of the "local line" reflects comprehensive analysis of the data quality problems that leads to optimization of the data acquisition process, the business process and the business system data, so that quality problems are solved at the data source; it reflects how mature the responsible data department's management of source data quality is.
The trend line of the data accuracy after repair is the "marked line". With continuous checking and repair, this accuracy rises steadily, showing the effect of treating the symptoms.
The trend line of the data accuracy before repair is the "local line". Through continuous checking, feedback, training and explanation, the accuracy of the original data rises steadily, so that overall data quality and quality awareness develop in a benign direction, showing the effect of treating the root cause.
When both the "marked line" and the "local line" approach 100% and the curves flatten, the data quality management work has achieved the goal of treating both symptoms and root causes; this is the ideal, sustainable mode of data quality management.
Compared with the existing representation mode, the invention has the beneficial effects that:
(1) It provides users with continuous, visual analysis of data quality and intuitively reflects the effect of data quality management.
(2) By calculating and plotting the "marked line" and the "local line", it continuously tracks the effect of data governance, which helps improve data quality at the source and forms a virtuous cycle of data quality.
(3) In the course of data quality management, it is also an effective means of measuring the data management maturity of the responsible data department.
Drawings
In order to more clearly illustrate the technical solution of the present invention, the drawings used for the description will be briefly introduced below. It is evident that the drawings in the following description are only some embodiments of the present invention and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art.
FIG. 1 is a flow chart of the specimen-line data quality representation method of the present invention.
FIG. 2 is a schematic view of the "marked line" and the "local line" of the present invention.
Detailed Description
The invention will be described in further detail with reference to the accompanying drawings and specific embodiments. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to FIG. 1, the flow of the specimen-line data quality representation method of the invention includes: detecting the target database, obtaining the detection result, forming a problem database, repairing the problems, cycling the data detection and repair tasks multiple times, calculating the data accuracy before repair and after repair each day, and displaying the data quality governance trend with the specimen line. The specific steps are as follows:
Step one: a user collects data into a target database through file exchange, interface exchange, library table exchange, manual collection and similar means.
Step two: a detection model is set, and the data in the target database is detected.
In this example, the detection model contains format normalization check, referential integrity check, null check and logic check rules.
Step three: the detection result is obtained, a problem database is formed, and the same-day detected data increment $M_t$ and the same-day data increment $N_t$ of the problem database are obtained.
In this example, $t = 1$ at the first detection.
Further, the problem database is constructed as follows:
The problem database contains a basic field $Z_w$ and a mark field $Z_b$. The basic field $Z_w$ comes from the fields of the target database and includes the checked field name, the checked field value, etc.; the mark field $Z_b$ marks the status of the problem data and includes a repair identifier, etc.
For data entering the problem database in the current round, $Z_b$ defaults to unrepaired.
Step four: the user repairs the problem data according to the detection result.
The number $R_t$ of repaired problems in the problem database on the current day is calculated from the status of the problem database.
In this example, $t = 1$ at the first calculation.
Step five: the user optimizes the sources of the target database, solving data problems at the source.
Source optimization of the target database comprises optimization of the business process, the acquisition process and the processing process.
Step six: steps two to five are repeated, carrying out the data detection and repair tasks multiple times.
The problem database construction in step three does not need to be repeated after the first construction.
At the 2nd detection of the target database, the server updates the problem database and records $M_t$, $N_t$ and $R_t$ as follows:
S1. Record the detected data increment $M_2$ of the target database at the 2nd detection;
S2. Insert the new problem data into the problem database and record $N_2$, the number of problems newly added to the problem database at the 2nd detection;
S3. For problem data that has been repaired, the server updates $Z_b$ to repaired and records $R_2$, the increment of repaired problems in the problem database at the 2nd detection.
In this example, the data detection task is executed 5 times in total; the detection process is recorded in Table 1:
Table 1. Record of the 5 detection rounds in this example
Number of detections $t$ | Detected data increment $M_t$ | Problem number increment $N_t$ | Repaired problem number increment $R_t$ |
---|---|---|---|
1 | $M_1$ | $N_1$ | $R_1$ |
2 | $M_2$ | $N_2$ | $R_2$ |
3 | $M_3 = 0$ | $N_3 = 0$ | $R_3$ |
4 | $M_4$ | $N_4$ | $R_4$ |
5 | $M_5$ | $N_5 = 0$ | $R_5$ |
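To see how such a record table turns into two trend lines, the sketch below fills in made-up counts that follow the same pattern as Table 1 (no new data or problems in round 3, no new problems in round 5) and computes the cumulative accuracy after each round. The numbers are hypothetical and are not taken from the patent; the accuracy definitions assume that pre-repair accuracy is the share of detected data not flagged as problems and that post-repair accuracy also counts repaired problems as correct.

```python
from itertools import accumulate

# (t, M_t, N_t, R_t): hypothetical counts shaped like Table 1.
rounds = [
    (1, 1000, 200, 100),
    (2, 500, 50, 80),
    (3, 0, 0, 40),     # no new data, no new problems, repairs continue
    (4, 500, 20, 30),
    (5, 500, 0, 15),   # after source optimization: no new problems
]

# Running totals of each increment up to and including round t.
cum_m = list(accumulate(m for _, m, _, _ in rounds))
cum_n = list(accumulate(n for _, _, n, _ in rounds))
cum_r = list(accumulate(r for _, _, _, r in rounds))

# "Local line": accuracy before repair; "marked line": accuracy after repair.
local_line = [(m - n) / m for m, n in zip(cum_m, cum_n)]
marked_line = [(m - n + r) / m for m, n, r in zip(cum_m, cum_n, cum_r)]

for (t, *_), p, q in zip(rounds, local_line, marked_line):
    print(f"round {t}: local line P = {p:.1%}, marked line Q = {q:.1%}")
```

With these sample counts both lines rise from round to round, and the marked line still rises in round 3 even though the local line stays flat, because repairs continue while no new data arrives.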
S1. The target database is detected for the 3rd time according to step two.
S2. According to steps three and four, the detection result is obtained:
the total amount of data in the target database is unchanged, $M_3 = 0$;
no new problems are found, $N_3 = 0$;
the increment of repaired problems is $R_3$.
S3. The target database is detected and repaired for the 4th time according to steps two to four, and the detection result is obtained:
the detected data volume of the target database increases, recorded as $M_4$;
the problem data increment is $N_4$;
the increment of repaired problems is $R_4$.
S4. According to step five, the user optimizes the business process.
S5. The target database is detected for the 5th time according to steps two to four, and the detection result is obtained:
the detected data volume of the target database increases, recorded as $M_5$;
the problem data increment is $N_5 = 0$;
the increment of repaired problems is $R_5$.
Step seven: the data accuracy $P$ before repair and the data accuracy $Q$ after repair are calculated for the 5th detection, as follows:
S1. Calculate the data accuracy before repair:
$$P = \frac{\sum_{t=1}^{5} M_t - \sum_{t=1}^{5} N_t}{\sum_{t=1}^{5} M_t} \times 100\%$$
where $\sum_{t=1}^{5} M_t$ represents the sum of the detected data increments of the 1st to 5th detections in this example, and $\sum_{t=1}^{5} N_t$ represents the sum of the problem data increments of the 1st to 5th detections in this example.
S2. Calculate the data accuracy after repair:
$$Q = \frac{\sum_{t=1}^{5} M_t - \sum_{t=1}^{5} N_t + \sum_{t=1}^{5} R_t}{\sum_{t=1}^{5} M_t} \times 100\%$$
where $\sum_{t=1}^{5} M_t$ represents the sum of the detected data increments of the 1st to 5th detections in this example; $\sum_{t=1}^{5} N_t$ represents the sum of the problem data increments of the 1st to 5th detections in this example; and $\sum_{t=1}^{5} R_t$ represents the sum of the problem data increments successfully repaired in the 1st to 5th detections in this example.
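As a worked expansion, assume that the pre-repair accuracy is the share of detected data not flagged as problems and that the post-repair accuracy additionally counts repaired problems as correct. Substituting the Table 1 values $M_3 = 0$, $N_3 = 0$ and $N_5 = 0$ then gives the following (the subscript 5 on $P$ and $Q$ is added here only for illustration):

```latex
\begin{aligned}
P_5 &= \frac{(M_1 + M_2 + M_4 + M_5) - (N_1 + N_2 + N_4)}{M_1 + M_2 + M_4 + M_5} \times 100\% \\
Q_5 &= \frac{(M_1 + M_2 + M_4 + M_5) - (N_1 + N_2 + N_4) + (R_1 + R_2 + R_3 + R_4 + R_5)}{M_1 + M_2 + M_4 + M_5} \times 100\%
\end{aligned}
```

Under this reading, round 3 adds nothing to the denominator or to the problem count, yet $R_3$ still raises $Q_5$, so repair work shows up in the marked line even in a round with no new data.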
Step eight: by repeatedly executing step seven, the data accuracy before repair and the data accuracy after repair are obtained for each round, and the specimen line continuously shows the trends of the "marked line" and the "local line" during data quality governance.
The trend of the "marked line" represents the effect of governing the data quality problems that have been detected, i.e. how well the data quality work deals with existing problems; a rising "marked line" reflects the diligence of the responsible data department. The trend of the "local line" reflects comprehensive analysis of the data quality problems that leads to optimization of the data acquisition process, the business process and the business system data, so that quality problems are solved at the data source; it reflects how mature the responsible data department's management of source data quality is.
The trend line of the data accuracy after repair is the "marked line". With continuous checking and repair, this accuracy rises steadily, showing the effect of treating the symptoms.
The trend line of the data accuracy before repair is the "local line". Through continuous checking, feedback, training and explanation, the accuracy of the original data rises steadily, so that overall data quality and quality awareness develop in a benign direction, showing the effect of treating the root cause.
When both the "marked line" and the "local line" approach 100% and the curves flatten, the data quality management work has achieved the goal of treating both symptoms and root causes; this is the ideal, sustainable mode of data quality management.
Claims (3)
1. A data quality representation method using a specimen line, comprising the steps of:
step one, a user collects data to a target database through file exchange, interface exchange, library table exchange and manual collection modes;
step two, setting a detection model, and detecting data in a target database;
step three, obtaining a detection result, forming a problem database, and obtaining the same-day detected data increment $M_t$ and the same-day data increment $N_t$ of the problem database;
step four, repairing the problems, wherein the user repairs the problem data according to the detection result, and the number $R_t$ of repaired problems in the problem database on the current day is calculated according to the status of the problem database;
step five, optimizing the source of the target database by a user, solving the data problem from the source, wherein the source optimization of the target database comprises optimization of a business process, a collection process and a processing process;
step six, repeating the step two to the step five, and carrying out repeated data detection and repair tasks;
the problem database construction step in the third step is unnecessary to be repeatedly constructed after the first construction;
step seven, calculating the data accuracy $P$ before repair and the data accuracy $Q$ after repair for the $n$-th detection, as follows:
S1, calculating the data accuracy before repair:
$$P = \frac{\sum_{t=1}^{n} M_t - \sum_{t=1}^{n} N_t}{\sum_{t=1}^{n} M_t} \times 100\%$$
wherein the data increment $N_t$ is the problem data increment of the problem database at the $t$-th detection, the detected data increment $M_t$ is the detected data increment of the target database at the $t$-th detection, the repaired problem number $R_t$ is the problem data increment successfully repaired according to the $t$-th detection result, and $t = 1, 2, \ldots, n$;
S2, calculating the data accuracy after repair:
$$Q = \frac{\sum_{t=1}^{n} M_t - \sum_{t=1}^{n} N_t + \sum_{t=1}^{n} R_t}{\sum_{t=1}^{n} M_t} \times 100\%$$
step eight, obtaining the data accuracy before repair and the data accuracy after repair by repeatedly executing step seven, and continuously showing the trends of the marked line and the local line in the data quality governance process through the specimen line;
in step eight:
the trend of the marked line represents the effect of governing the detected data quality problems, and reflects how well the data quality work deals with existing problems;
the trend of the local line reflects comprehensive analysis of the data quality problems, so that the data acquisition process, the business process and the business system data are optimized and quality problems are solved at the data source;
the trend line of the data accuracy after repair is the marked line, and the trend line of the data accuracy before repair is the local line.
2. The data quality representation method using a specimen line according to claim 1, wherein in step two:
the detection model comprises format normalization check, referential integrity check, null check, missing-data check, data quantity check, uniqueness check, value range check, logic check, consistency check, cross-comparison check and timeliness check rules.
3. The data quality representation method using a specimen line according to claim 1, wherein in step three, the problem database is constructed as follows:
the problem database comprises a basic field $Z_w$ and a mark field $Z_b$, wherein the basic field $Z_w$ comes from the fields of the target database and comprises the checked field name and the checked field value, and the mark field $Z_b$ is a field for marking the data status and comprises a repair identifier; for data entering the problem database for the first time, $Z_b$ defaults to unrepaired.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011390902.2A CN112506903B (en) | 2020-12-02 | 2020-12-02 | Data quality representation method using specimen line |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112506903A (en) | 2021-03-16 |
CN112506903B (en) | 2024-02-23 |
Family
ID=74969159
Family Applications (1)
Application Number | Status | Priority Date | Filing Date | Title |
---|---|---|---|---|
CN202011390902.2A (CN112506903B) | Active | 2020-12-02 | 2020-12-02 | Data quality representation method using specimen line |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112506903B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||