CN112506903B - Data quality representation method using specimen line - Google Patents

Publication number
CN112506903B
Authority
CN
China
Prior art keywords
data
database
repair
check
data quality
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011390902.2A
Other languages
Chinese (zh)
Other versions
CN112506903A (en)
Inventor
Inventor requested that the name not be published
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Longshi Information Technology Co ltd
Original Assignee
Suzhou Longshi Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Longshi Information Technology Co ltd filed Critical Suzhou Longshi Information Technology Co ltd
Priority to CN202011390902.2A priority Critical patent/CN112506903B/en
Publication of CN112506903A publication Critical patent/CN112506903A/en
Application granted granted Critical
Publication of CN112506903B publication Critical patent/CN112506903B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating

Abstract

The invention discloses a verification method, in the field of data governance, that uses a specimen line to measure the effect of data quality work. The method comprises: collecting data into a target database through files, interfaces, database tables, manual entry and the like; setting a detection model and checking the target database; obtaining the detection results to form a problem database; repairing the problems and optimizing the sources of the target database so that data problems are solved at the source; cycling through the data detection and repair tasks multiple times; calculating the data accuracy before repair and the data accuracy after repair; and finally using the specimen line to show continuously how the "marked line" (symptoms) and the "local line" (root causes) change over the course of data quality governance. Through the specimen line, the invention gives users continuous, whole-process analysis of data quality management, reflects the effect of treating both symptoms and root causes, solves data quality problems at the source, and promotes a virtuous cycle of data quality.

Description

Data quality representation method using specimen line
Technical Field
The invention relates to the technical field of data governance, and in particular to a verification method that uses a specimen line to measure the effect of data quality work within data governance.
Background
As informatization deepens, enterprises and government bodies accumulate ever larger volumes of data. Because the data come from different sources, are collected in different ways, are used by different departments, are stored inconsistently and follow incomplete data specifications, much of it becomes dirty data, which is a major obstacle to data analysis and decision-making. Enterprises and government bodies therefore increasingly treat raising the level of data quality as an important task of data governance.
At present, the results of data quality work are shown mainly as a quality analysis report for a single batch of data or for one particular data set. Such a report reflects only one aspect of the current data quality at one stage; it lacks continuity and cannot comprehensively verify the effect of data quality management.
Providing continuous, whole-process analysis for data quality management therefore makes it possible to verify the overall effect of data quality management work and offers guidance for that work.
Disclosure of Invention
To address the shortcomings of existing methods for verifying the results of data quality work, the invention is the first to propose a specimen-line representation for analysing data quality management work. It reflects the effect of treating both symptoms and root causes, solves data quality problems at the source, and promotes a virtuous cycle of data quality.
To achieve this, the invention adopts a specimen-line representation of data quality, which comprises the following steps:
Step one, a user collects data into a target database through file exchange, interface exchange, database table exchange, manual collection and other modes.
Step two, a detection model is set and the data in the target database is checked.
The detection model comprises rules such as format normalization check, referential integrity check, null check, missing data check, data volume check, uniqueness check, value range check, logic check, consistency check, cross-comparison check and timeliness check.
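Two of the rule types listed above (null check and value range check) can be sketched as simple row-level predicates. This is only an illustrative sketch: the function names, the row layout and the example field `age` are assumptions made for the example and are not taken from the patent.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
Rule = Callable[[Row], bool]  # a rule returns True when the row passes


def null_check(field: str) -> Rule:
    """Pass when the field is present and neither None nor an empty string."""
    return lambda row: row.get(field) not in (None, "")


def value_range_check(field: str, lo: float, hi: float) -> Rule:
    """Pass when the field holds a number inside the closed range [lo, hi]."""
    def rule(row: Row) -> bool:
        value = row.get(field)
        return isinstance(value, (int, float)) and lo <= value <= hi
    return rule


def detect(rows: List[Row], rules: Dict[str, Rule]) -> List[Tuple[int, str]]:
    """Return (row index, rule name) for every failed check."""
    problems = []
    for index, row in enumerate(rows):
        for name, rule in rules.items():
            if not rule(row):
                problems.append((index, name))
    return problems


# Hypothetical target-database rows and detection model.
rows = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},
    {"id": 3, "age": 212},
]
rules = {
    "age_not_null": null_check("age"),
    "age_in_range": value_range_check("age", 0, 150),
}
problems = detect(rows, rules)
```

Set-level rules such as the uniqueness check or the data volume check need the whole table rather than one row, so they would be written as functions over `rows` instead of over a single `Row`.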
Step three, the detection result is obtained and a problem database is formed, giving the detected data increment M_t for the day and the data increment N_t of the problem database for the day.
Here N_t is the increment of problem data in the problem database at the t-th detection, and M_t is the increment of detected data in the target database at the t-th detection.
Further, the problem database is constructed as follows:
The problem database contains a basic field Z_w and a flag field Z_b. The basic field Z_w comes from the target database fields and includes the checked field name, the checked field value and so on. The flag field Z_b marks the state of the problem data and includes a repair identifier and so on.
For data entering the problem database for the first time, Z_b defaults to unrepaired.
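The problem database record described above could be modelled as follows. This is a minimal sketch, assuming one record per failed check; the class and attribute names are hypothetical, and only the split into basic fields (Z_w) and a flag field (Z_b) with an "unrepaired" default comes from the description.

```python
from dataclasses import dataclass
from enum import Enum


class RepairStatus(Enum):
    UNREPAIRED = "unrepaired"
    REPAIRED = "repaired"


@dataclass
class ProblemRecord:
    # Basic fields (Z_w): copied from the target database.
    check_field_name: str
    check_field_value: object
    # Flag field (Z_b): repair identifier, unrepaired on first entry.
    repair_status: RepairStatus = RepairStatus.UNREPAIRED

    def mark_repaired(self) -> None:
        """Flip the flag field once the user has repaired the data."""
        self.repair_status = RepairStatus.REPAIRED


record = ProblemRecord("age", 212)
status_on_entry = record.repair_status
record.mark_repaired()
```

Keeping the repair state in a separate flag field means the repaired count for a round can be read off the problem database itself, without a second bookkeeping table.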
Step four, the user repairs the problem data according to the detection result.
The number R_t of repaired problems in the problem database for the day is calculated from the state of the problem database.
Step five, the user optimizes the sources of the target database so that data problems are solved at the source.
Source optimization of the target database comprises optimization of the business process, the acquisition process and the processing process.
Step six, steps two to five are repeated so that the data detection and repair tasks are carried out multiple times.
The problem database constructed in step three does not need to be rebuilt after it has been constructed the first time.
Further, at the t-th detection of the target database, the server updates the problem database and records M_t, N_t and R_t as follows:
S1, record the detected data increment M_t of the target database at the t-th detection;
S2, insert the new problem data into the problem database and record N_t, the number of problems newly added to the problem database;
S3, for problem data that has been repaired, the server updates Z_b to repaired and records R_t, the increment of repaired problem data in the problem database after the t-th detection.
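The per-round bookkeeping in S1 to S3 can be sketched as a small log object. The class name and the sanity checks are assumptions added for illustration; in particular, the rule that cumulative repairs may not exceed cumulative problems is an inference from the definitions, not something the patent states.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DetectionLog:
    m: List[int] = field(default_factory=list)  # M_t: detected data increment
    n: List[int] = field(default_factory=list)  # N_t: new problem increment
    r: List[int] = field(default_factory=list)  # R_t: repaired problem increment

    def record(self, m_t: int, n_t: int, r_t: int) -> int:
        """Store the three increments of one round and return its number t."""
        if min(m_t, n_t, r_t) < 0:
            raise ValueError("increments must be non-negative")
        # Assumed consistency rule: repairs can only apply to recorded problems.
        if sum(self.r) + r_t > sum(self.n) + n_t:
            raise ValueError("cannot repair more problems than were found")
        self.m.append(m_t)
        self.n.append(n_t)
        self.r.append(r_t)
        return len(self.m)


# Hypothetical counts for three rounds.
log = DetectionLog()
log.record(1000, 200, 100)
log.record(500, 80, 90)
t = log.record(0, 0, 60)  # a round with no new data can still record repairs
```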
Step seven, the data accuracy P before repair and the data accuracy Q after repair are calculated for the n-th detection as follows:
S1, calculate the data accuracy before repair:
$$P = \frac{\sum_{t=1}^{n} M_t - \sum_{t=1}^{n} N_t}{\sum_{t=1}^{n} M_t} \times 100\%$$
S2, calculate the data accuracy after repair:
$$Q = \frac{\sum_{t=1}^{n} M_t - \sum_{t=1}^{n} N_t + \sum_{t=1}^{n} R_t}{\sum_{t=1}^{n} M_t} \times 100\%$$
where M_t is the detected data increment of the target database at the t-th detection, N_t is the problem data increment at the t-th detection, R_t is the increment of problem data successfully repaired according to the t-th detection result, and t = 1, 2, …, n.
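Step seven can be sketched in a few lines. The sketch assumes that P is the share of cumulative detected data that was not flagged as a problem, and that Q additionally counts repaired problems as correct, both expressed as percentages; this reading follows from the definitions of M_t, N_t and R_t, and the function name and example counts are hypothetical.

```python
from typing import Sequence, Tuple


def accuracy_rates(
    m: Sequence[int], n: Sequence[int], r: Sequence[int]
) -> Tuple[float, float]:
    """Return (P, Q) in percent after the rounds recorded in m, n and r."""
    if not (len(m) == len(n) == len(r)):
        raise ValueError("m, n and r must cover the same rounds")
    total_m, total_n, total_r = sum(m), sum(n), sum(r)
    if total_m == 0:
        raise ValueError("no data has been detected yet")
    p = 100.0 * (total_m - total_n) / total_m            # before repair
    q = 100.0 * (total_m - total_n + total_r) / total_m  # after repair
    return p, q


# Hypothetical counts for two detection rounds.
p, q = accuracy_rates([1000, 500], [100, 20], [60, 40])
```

Because both rates divide by the same cumulative total, Q can never fall below P, and the gap between them is exactly the share of data that was repaired.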
Step eight, by repeating step seven, the data accuracy before repair and the data accuracy after repair are obtained for each detection, and the specimen line continuously shows how the "marked line" and the "local line" change during data quality governance.
The trend of the "marked line" represents how effectively detected data quality problems are being treated, that is, the effect of data quality work on existing problems; a rising "marked line" reflects the diligence of the department responsible for the data. The trend of the "local line" reflects how far comprehensive analysis of data quality problems has led to optimization of the data acquisition process, the business process and the business system data, so that data quality problems are solved at the source; it reflects the data management maturity of the department responsible for the data.
The "marked line" is the trend line of the data accuracy after repair. With continuous checking and repair the data accuracy rises, which shows the effect of treating the symptoms.
The "local line" is the trend line of the data accuracy before repair. Through continuous checking, feedback, training and explanation, the accuracy of the original data rises, so that overall data quality and quality awareness develop in a healthy direction, which shows the effect of treating the root cause.
When the "marked line" and the "local line" both approach 100% and the curves flatten, data quality management has achieved the aim of treating both symptoms and root causes; this is currently the ideal and sustainable mode of data quality management.
Compared with existing representations, the invention has the following beneficial effects:
(1) It provides users with continuous visual analysis of data quality and directly reflects the effect of data quality management.
(2) Calculating and displaying the "marked line" and the "local line" enables continuous tracking and analysis of the effect of data governance, which helps improve data quality at the source and form a virtuous cycle of data quality.
(3) Within data quality management, it is also an effective means of measuring the data management maturity of the department responsible for the data.
Drawings
To illustrate the technical solution of the invention more clearly, the drawings used in the description are briefly introduced below. The drawings described below show only some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of the specimen-line data quality representation method of the invention.
FIG. 2 is a schematic diagram of the "marked line" and the "local line" of the invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and specific embodiments. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art on the basis of these embodiments without inventive effort fall within the scope of the invention.
Referring to FIG. 1, the flow of the specimen-line data quality representation method of the invention comprises: checking the target database, obtaining the detection results, forming a problem database, repairing the problems, cycling through the data detection and repair tasks multiple times, calculating the data accuracy before repair and the data accuracy after repair each day, and displaying the data quality governance trend with the specimen line. The specific steps are as follows:
Step one, a user collects data into a target database through file exchange, interface exchange, database table exchange, manual collection and other modes.
Step two, a detection model is set and the data in the target database is checked.
In this example, the detection model contains format normalization check, referential integrity check, null check and logic check rules.
Step three, the detection result is obtained and a problem database is formed, giving the detected data increment M_t for the day and the data increment N_t of the problem database for the day.
In this example, t = 1 at the first detection.
Further, the problem database is constructed as follows:
The problem database contains a basic field Z_w and a flag field Z_b. The basic field Z_w comes from the target database fields and includes the checked field name, the checked field value and so on. The flag field Z_b marks the state of the problem data and includes a repair identifier and so on.
For data entering the problem database in the current detection, Z_b defaults to unrepaired.
Step four, the user repairs the problem data according to the detection result.
The number R_t of repaired problems in the problem database for the day is calculated from the state of the problem database.
In this example, t = 1 at the first calculation.
Step five, the user optimizes the sources of the target database so that data problems are solved at the source.
Source optimization of the target database comprises optimization of the business process, the acquisition process and the processing process.
Step six, steps two to five are repeated so that the data detection and repair tasks are carried out multiple times.
The problem database constructed in step three does not need to be rebuilt after it has been constructed the first time.
At the 2nd detection of the target database, the server updates the problem database and records M_t, N_t and R_t as follows:
S1, record the detected data increment M_2 of the target database at the 2nd detection;
S2, insert the new problem data into the problem database and record N_2, the number of problems newly added to the problem database at the 2nd detection;
S3, for problem data that has been repaired, the server updates Z_b to repaired and records R_2, the increment of repaired problem data in the problem database at the 2nd detection.
In this example, the data detection task is performed 5 times in a cycle; the detection process is recorded in Table 1:
Table 1. Record of the 5 detection cycles in this example
| Detection number t | Detected data increment M_t | Problem increment N_t | Repaired problem increment R_t |
|---|---|---|---|
| 1 | M_1 | N_1 | R_1 |
| 2 | M_2 | N_2 | R_2 |
| 3 | M_3 = 0 | N_3 = 0 | R_3 |
| 4 | M_4 | N_4 | R_4 |
| 5 | M_5 | N_5 = 0 | R_5 |
S1, the target database is checked for the 3rd time according to step two.
S2, according to steps three and four, the detection result is obtained:
the total amount of data in the target database is unchanged, M_3 = 0;
no new problems are found, N_3 = 0;
the increment of repaired problem data is R_3.
S3, the target database is checked and repaired for the 4th time according to steps two to four, and the detection result is obtained:
the amount of detected data in the target database increases and is recorded as M_4;
the increment of problem data is N_4;
the increment of repaired problem data is R_4.
S4, according to step five, the user optimizes the business process.
S5, the target database is checked for the 5th time according to steps two to four, and the detection result is obtained:
the amount of detected data in the target database increases and is recorded as M_5;
the increment of problem data is N_5 = 0;
the increment of repaired problem data is R_5.
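The five-round walk-through above can be turned into the two trend lines numerically. The sketch below assumes that both accuracy rates are ratios of cumulative sums expressed as percentages, and it uses made-up counts that respect M_3 = 0, N_3 = 0 and N_5 = 0; none of the numbers come from the patent, and the function name is hypothetical.

```python
from itertools import accumulate
from typing import List, Sequence, Tuple


def trend_lines(
    m: Sequence[int], n: Sequence[int], r: Sequence[int]
) -> Tuple[List[float], List[float]]:
    """Return (local_line, marked_line): accuracy before and after repair per round."""
    local_line: List[float] = []
    marked_line: List[float] = []
    # accumulate() yields the running totals up to and including round t.
    for cum_m, cum_n, cum_r in zip(accumulate(m), accumulate(n), accumulate(r)):
        if cum_m == 0:
            raise ValueError("no data has been detected yet")
        local_line.append(round(100.0 * (cum_m - cum_n) / cum_m, 1))
        marked_line.append(round(100.0 * (cum_m - cum_n + cum_r) / cum_m, 1))
    return local_line, marked_line


# Hypothetical increments for the five rounds of Table 1.
M = [1000, 500, 0, 400, 600]  # M_3 = 0: no new data in round 3
N = [200, 80, 0, 40, 0]       # N_3 = 0 and N_5 = 0: no new problems
R = [100, 90, 60, 50, 10]     # repairs continue in every round
local_line, marked_line = trend_lines(M, N, R)
```

With these counts the "local line" moves only when new data arrives, while the "marked line" keeps rising in round 3 because repairs continue even though nothing new was detected.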
Step seven, the data accuracy P before repair and the data accuracy Q after repair are calculated at the 5th detection as follows:
S1, calculate the data accuracy before repair:
$$P = \frac{\sum_{t=1}^{5} M_t - \sum_{t=1}^{5} N_t}{\sum_{t=1}^{5} M_t} \times 100\%$$
where $\sum_{t=1}^{5} M_t$ is the sum of the detected data increments of the 1st to 5th detections in this example, and $\sum_{t=1}^{5} N_t$ is the sum of the problem data increments of the 1st to 5th detections in this example.
S2, calculate the data accuracy after repair:
$$Q = \frac{\sum_{t=1}^{5} M_t - \sum_{t=1}^{5} N_t + \sum_{t=1}^{5} R_t}{\sum_{t=1}^{5} M_t} \times 100\%$$
where $\sum_{t=1}^{5} M_t$ is the sum of the detected data increments of the 1st to 5th detections in this example, $\sum_{t=1}^{5} N_t$ is the sum of the problem data increments of the 1st to 5th detections in this example, and $\sum_{t=1}^{5} R_t$ is the sum of the increments of problem data successfully repaired in the 1st to 5th detections in this example.
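Substituting the zero entries of Table 1 (M_3 = 0, N_3 = 0, N_5 = 0) gives the expanded forms below. This is a worked expansion under the assumption that both rates are ratios of cumulative sums expressed as percentages, as the symbol definitions suggest.

```latex
P = \frac{(M_1 + M_2 + M_4 + M_5) - (N_1 + N_2 + N_4)}{M_1 + M_2 + M_4 + M_5} \times 100\%

Q = \frac{(M_1 + M_2 + M_4 + M_5) - (N_1 + N_2 + N_4) + (R_1 + R_2 + R_3 + R_4 + R_5)}{M_1 + M_2 + M_4 + M_5} \times 100\%
```

Round 3 contributes nothing to the denominator or to the problem count, yet R_3 still appears in Q, which is why repairs made in a round without new data raise only the accuracy after repair.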
Step eight, by repeating step seven, the data accuracy before repair and the data accuracy after repair are obtained for each detection, and the specimen line continuously shows how the "marked line" and the "local line" change during data quality governance.
The trend of the "marked line" represents how effectively detected data quality problems are being treated, that is, the effect of data quality work on existing problems; a rising "marked line" reflects the diligence of the department responsible for the data. The trend of the "local line" reflects how far comprehensive analysis of data quality problems has led to optimization of the data acquisition process, the business process and the business system data, so that data quality problems are solved at the source; it reflects the data management maturity of the department responsible for the data.
The "marked line" is the trend line of the data accuracy after repair. With continuous checking and repair the data accuracy rises, which shows the effect of treating the symptoms.
The "local line" is the trend line of the data accuracy before repair. Through continuous checking, feedback, training and explanation, the accuracy of the original data rises, so that overall data quality and quality awareness develop in a healthy direction, which shows the effect of treating the root cause.
When the "marked line" and the "local line" both approach 100% and the curves flatten, data quality management has achieved the aim of treating both symptoms and root causes; this is currently the ideal and sustainable mode of data quality management.
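The closing condition (both lines close to 100% and flattening) can be checked mechanically. The sketch below is one possible reading: the thresholds `gap`, `slope` and `window` are hypothetical tuning parameters, since the description gives no numeric criteria, and the example series are invented.

```python
from typing import Sequence


def is_stable(
    series: Sequence[float],
    target: float = 100.0,
    gap: float = 2.0,     # assumed: how far below the target still counts as near
    slope: float = 0.5,   # assumed: largest round-to-round change that counts as flat
    window: int = 3,      # assumed: number of most recent rounds to inspect
) -> bool:
    """Return True when the last `window` points are near the target and flat."""
    tail = list(series[-window:])
    if len(tail) < window:
        return False
    near_target = all(target - value <= gap for value in tail)
    flat = all(abs(b - a) <= slope for a, b in zip(tail, tail[1:]))
    return near_target and flat


# Hypothetical accuracy series in percent.
marked_line = [96.0, 99.2, 99.5, 99.6]  # accuracy after repair
local_line = [80.0, 81.3, 83.2, 87.2]   # accuracy before repair
both_stable = is_stable(marked_line) and is_stable(local_line)
```

In this example the "marked line" has settled but the "local line" has not, which corresponds to symptoms being treated while the source of the problems still needs work.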

Claims (3)

1. A data quality representation method using a specimen line, comprising the steps of:
step one, a user collects data to a target database through file exchange, interface exchange, library table exchange and manual collection modes;
step two, setting a detection model, and detecting data in a target database;
step three, obtaining a detection result, forming a problem database, and obtaining the detected data increment M_t for the day and the data increment N_t of the problem database for the day;
step four, repairing the problems, wherein a user repairs the problem data according to the detection result, and the number R_t of repaired problems in the problem database for the day is calculated according to the state of the problem database;
step five, optimizing the source of the target database by a user, solving the data problem from the source, wherein the source optimization of the target database comprises optimization of a business process, a collection process and a processing process;
step six, repeating the step two to the step five, and carrying out repeated data detection and repair tasks;
the problem database constructed in step three does not need to be rebuilt after it has been constructed the first time;
step seven, calculating the data accuracy P before repair and the data accuracy Q after repair for the n-th detection, as follows:
s1, calculating the data accuracy before repair:
$$P = \frac{\sum_{t=1}^{n} M_t - \sum_{t=1}^{n} N_t}{\sum_{t=1}^{n} M_t} \times 100\%$$
wherein N_t is the problem data increment of the problem database at the t-th detection, M_t is the detected data increment of the target database at the t-th detection, R_t is the increment of problem data successfully repaired according to the t-th detection result, and t = 1, 2, …, n;
s2, calculating the data accuracy after repair:
$$Q = \frac{\sum_{t=1}^{n} M_t - \sum_{t=1}^{n} N_t + \sum_{t=1}^{n} R_t}{\sum_{t=1}^{n} M_t} \times 100\%$$
step eight, obtaining the data accuracy before repair and the data accuracy after repair by repeatedly executing step seven, and continuously showing, through the specimen line, the change trend of the marked line and the local line during data quality governance;
in step eight:
the change trend of the marked line represents the effect of treating the detected data quality problems and reflects the effect of data quality work on existing problems;
the change trend of the local line reflects comprehensive analysis of the data quality problems, whereby the data acquisition process, the business process and the business system data are optimized and data quality problems are solved at the source;
the trend line of the data accuracy after repair is the marked line, and the trend line of the data accuracy before repair is the local line.
2. The data quality representation method using a specimen line according to claim 1, wherein in step two:
the detection model comprises format normalization check, referential integrity check, null check, missing data check, data volume check, uniqueness check, value range check, logic check, consistency check, cross-comparison check and timeliness check rules.
3. The data quality representation method using a specimen line according to claim 1, wherein in step three the problem database is constructed as follows:
the problem database comprises a basic field Z_w and a flag field Z_b, wherein the basic field Z_w comes from the target database fields and comprises the checked field name and the checked field value, and the flag field Z_b is a field for marking the state of the problem data and comprises a repair identifier; for data entering the problem database for the first time, Z_b defaults to unrepaired.
CN202011390902.2A 2020-12-02 2020-12-02 Data quality representation method using specimen line Active CN112506903B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011390902.2A CN112506903B (en) 2020-12-02 2020-12-02 Data quality representation method using specimen line

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011390902.2A CN112506903B (en) 2020-12-02 2020-12-02 Data quality representation method using specimen line

Publications (2)

Publication Number Publication Date
CN112506903A CN112506903A (en) 2021-03-16
CN112506903B true CN112506903B (en) 2024-02-23

Family

ID=74969159

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011390902.2A Active CN112506903B (en) 2020-12-02 2020-12-02 Data quality representation method using specimen line

Country Status (1)

Country Link
CN (1) CN112506903B (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150025872A1 (en) * 2013-07-16 2015-01-22 Raytheon Company System, method, and apparatus for modeling project reliability

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104461042A (en) * 2013-09-16 2015-03-25 百度在线网络技术(北京)有限公司 Japanese input method and system with automatic error correcting function based on backspace key
WO2019100771A1 (en) * 2017-11-24 2019-05-31 阿里巴巴集团控股有限公司 Question pushing method and device
CN108513251A (en) * 2018-02-13 2018-09-07 北京天元创新科技有限公司 A kind of localization method and system based on MR data
CN110032552A (en) * 2019-03-27 2019-07-19 国网山东省电力公司青岛供电公司 Standardized system and method based on equipment alteration information and scheduling online updating
CN110554013A (en) * 2019-08-29 2019-12-10 华夏安健物联科技(青岛)有限公司 method for realizing rapid identification and comparison by using fluorescence spectrum characteristic information
CN111143334A (en) * 2019-11-13 2020-05-12 深圳市华傲数据技术有限公司 Data quality closed-loop control method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
基于编辑规则和主数据的数据修复技术研究;杨辉;中国优秀硕士学位论文全文数据库 信息科技辑;20170715(第07期);I138-566 *
电网线损数据质量治理技术研究;姚劲松;辛永;黄文思;陆鑫;陈婧;霍成军;;工业仪表与自动化装置;20180415(第02期);21-24 *

Also Published As

Publication number Publication date
CN112506903A (en) 2021-03-16

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant