CN112506903B - Data quality representation method using specimen line

Info

Publication number: CN112506903B
Application number: CN202011390902.2A
Authority: CN
Inventors: Name withheld on request
Original assignee: Suzhou Longshi Information Technology Co ltd
Current assignee: Suzhou Longshi Information Technology Co ltd
Priority date: 2020-12-02
Filing date: 2020-12-02
Publication date: 2024-02-23
Anticipated expiration: 2040-12-02
Also published as: CN112506903A

Abstract

The invention discloses a verification method, in the field of data governance, that uses a specimen line to measure the effect of data quality work. Data are collected into a target database through files, interfaces, database tables, manual entry and the like; a detection model is then set and the target database is detected; the detection results form a problem database; the problems are repaired and the source of the target database is optimized, so that data problems are solved at the source; the data detection and repair tasks are cycled several times; the data accuracy before repair and the data accuracy after repair are calculated; finally, the specimen line continuously shows the trends of the "marked line" and the "local line" over the course of data quality governance. Through the specimen line, the invention gives users a continuous, whole-process analysis of data quality management, reflects the effect of treating both symptoms and root causes, solves data quality problems at the source, and promotes a virtuous cycle of data quality.

Description

Data quality representation method using specimen line

Technical Field

The invention relates to the technical field of data governance, and in particular to a verification method that uses a specimen line to measure the effect of data quality work in data governance.

Background

With the deepening of informatization, enterprises and government bodies accumulate ever larger volumes of data. Because the data come from different sources, are acquired in different ways, are used by different departments, are stored non-uniformly and follow imperfect specifications, dirty data arise and seriously hinder data analysis and decision-making. Accordingly, enterprises and government agencies increasingly treat improving data quality as an important task of data governance.

At present, the results of data quality work are mainly presented as a quality analysis report for a single batch of data or for one particular data set. Such a report reflects only one-sidedly the current state of data quality at a single stage; it lacks continuity of analysis and cannot comprehensively verify the effect of data quality management.

Therefore, providing continuous and whole-process analysis for data quality management makes it possible to verify the overall effect of data quality work, and has guiding significance for that work.

Disclosure of Invention

In view of the shortcomings of existing methods for verifying the results of data quality work, the invention proposes, for the first time, a specimen-line representation for analysing data quality management work, which reflects the effect of treating both symptoms and root causes, solves data quality problems at the source, and promotes a virtuous cycle of data quality.

To achieve the above purpose, the invention adopts a specimen-line representation of data quality, which specifically comprises the following steps:

Step one, a user collects data into a target database through file exchange, interface exchange, database table exchange, manual collection and other modes.

Step two, setting a detection model, and detecting the data in the target database.

The detection model comprises rules such as format normalization check, referential integrity check, null check, missing data check, data volume check, uniqueness check, value range check, logic check, consistency check, cross-comparison check and timeliness check.
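A detection model of this kind can be sketched as a set of small rule functions applied to every record. The sketch below covers only three of the listed rules (null check, format normalization check, value range check); the record layout, the rule names and the `detect` helper are illustrative assumptions and are not taken from the patent.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical record layout: field name -> string value (or None if absent).
Record = Dict[str, Optional[str]]
# A rule inspects one record and returns the offending field name, or None.
Rule = Callable[[Record], Optional[str]]


def _is_blank(value: Optional[str]) -> bool:
    return value is None or value.strip() == ""


def null_check(field: str) -> Rule:
    """Flag records whose field is missing or empty."""
    def rule(rec: Record) -> Optional[str]:
        return field if _is_blank(rec.get(field)) else None
    return rule


def format_check(field: str, pattern: str) -> Rule:
    """Flag non-empty values that do not fully match a regular expression."""
    compiled = re.compile(pattern)

    def rule(rec: Record) -> Optional[str]:
        value = rec.get(field)
        if _is_blank(value):
            return None  # emptiness is left to the null check
        return None if compiled.fullmatch(value) else field
    return rule


def value_range_check(field: str, low: float, high: float) -> Rule:
    """Flag non-empty values that are not numbers within [low, high]."""
    def rule(rec: Record) -> Optional[str]:
        value = rec.get(field)
        if _is_blank(value):
            return None
        try:
            number = float(value)
        except ValueError:
            return field
        return None if low <= number <= high else field
    return rule


def detect(records: List[Record], rules: Dict[str, Rule]) -> List[Tuple[int, str, str]]:
    """Return (record index, rule name, field name) for every violation."""
    problems = []
    for index, rec in enumerate(records):
        for rule_name, rule in rules.items():
            bad_field = rule(rec)
            if bad_field is not None:
                problems.append((index, rule_name, bad_field))
    return problems
```

Each tuple returned by `detect` corresponds to one candidate row for the problem database; rules that need more than one record at a time, such as uniqueness or cross-comparison checks, would need a different signature.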

Step three, obtaining a detection result, forming a problem database, and obtaining the detected data increment M_t of the current day and the data increment N_t of the problem database for the current day.

Here, N_t is the increment of problem data in the problem database at the t-th detection, and M_t is the increment of detected data in the target database at the t-th detection.

Further, the problem database is constructed as follows:

The problem database contains a basic field Z_w and a flag field Z_b. The basic field Z_w comes from the fields of the target database and includes the check field name, the check field value, etc.; the flag field Z_b marks the status of the problem data and includes a repair identifier, etc.

For data entering the problem database for the first time, Z_b defaults to unrepaired.
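One possible layout for such a problem database is sketched below with Python's built-in `sqlite3` module. The table name, the column names and the two flag values are assumptions made for illustration; the patent only specifies that there are basic fields (Z_w), a flag field (Z_b), and a default of unrepaired.

```python
import sqlite3
from typing import Optional


def create_problem_db() -> sqlite3.Connection:
    """Create an in-memory problem database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE problem_data (
            id                INTEGER PRIMARY KEY,
            detection_round   INTEGER NOT NULL,  -- t, the detection that found the problem
            check_field_name  TEXT NOT NULL,     -- basic field Z_w
            check_field_value TEXT,              -- basic field Z_w
            repair_flag       TEXT NOT NULL DEFAULT 'unrepaired'  -- flag field Z_b
                              CHECK (repair_flag IN ('unrepaired', 'repaired'))
        )
        """
    )
    return conn


def add_problem(conn: sqlite3.Connection, detection_round: int,
                field_name: str, field_value: Optional[str]) -> int:
    """Insert one problem row; the flag is left to its default (unrepaired)."""
    cur = conn.execute(
        "INSERT INTO problem_data (detection_round, check_field_name, check_field_value) "
        "VALUES (?, ?, ?)",
        (detection_round, field_name, field_value),
    )
    return cur.lastrowid


def mark_repaired(conn: sqlite3.Connection, problem_id: int) -> None:
    """Update the flag field Z_b of one problem row to repaired."""
    conn.execute(
        "UPDATE problem_data SET repair_flag = 'repaired' WHERE id = ?",
        (problem_id,),
    )


def count_by_flag(conn: sqlite3.Connection, flag: str) -> int:
    """Count problem rows that currently carry the given flag value."""
    row = conn.execute(
        "SELECT COUNT(*) FROM problem_data WHERE repair_flag = ?", (flag,)
    ).fetchone()
    return row[0]
```

Because the flag is a column default, rows inserted by a detection run start out unrepaired without the detection code having to set the flag explicitly.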

Step four, the user repairs the problem data according to the detection result.

The number R_t of repaired problems in the problem database for the current day is calculated from the state of the problem database.

Step five, the user optimizes the source of the target database, solving data problems at the source.

The source optimization of the target database comprises optimization of the business process, the acquisition process and the processing process.

Step six, repeating steps two to five to carry out the data detection and repair tasks multiple times.

The problem database construction in step three does not need to be repeated after the first construction.

Further, after the t-th detection of the target database, the server updates the problem database and records M_t, N_t and R_t as follows:

S1, recording the detected data increment M_t of the target database at the t-th detection;

S2, inserting the new problem data into the problem database and recording N_t, where N_t is the number of problems newly added to the problem database;

S3, for problem data that has been repaired, the server updates Z_b to repaired and records R_t, where R_t is the increment of repaired problem data in the problem database after the t-th detection.
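The bookkeeping in S1 to S3 can be sketched as a small in-memory log with one entry per detection. The class and method names are hypothetical, and the check that repairs never exceed outstanding problems is an added assumption rather than something the patent states.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RoundRecord:
    t: int              # detection number
    detected: int       # M_t, detected data increment
    new_problems: int   # N_t, problem data increment
    repaired: int = 0   # R_t, repaired problem increment


@dataclass
class DetectionLog:
    rounds: List[RoundRecord] = field(default_factory=list)

    def record_detection(self, detected: int, new_problems: int) -> RoundRecord:
        """S1 and S2: open round t and store M_t and N_t."""
        if detected < 0 or new_problems < 0:
            raise ValueError("increments must be non-negative")
        record = RoundRecord(t=len(self.rounds) + 1,
                             detected=detected,
                             new_problems=new_problems)
        self.rounds.append(record)
        return record

    def record_repairs(self, repaired: int) -> None:
        """S3: add newly repaired problems to R_t of the latest round."""
        if not self.rounds:
            raise ValueError("no detection has been recorded yet")
        if repaired < 0:
            raise ValueError("repaired count must be non-negative")
        # Assumption: only problems already in the problem database can be repaired.
        outstanding = sum(r.new_problems - r.repaired for r in self.rounds)
        if repaired > outstanding:
            raise ValueError("more repairs than outstanding problems")
        self.rounds[-1].repaired += repaired
```

Note that R_t may be positive in a round where M_t and N_t are both zero, since problems found in earlier rounds can still be repaired later.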

Step seven, calculating, for the n-th detection, the data accuracy P before repair and the data accuracy Q after repair, as follows:

S1, calculating the data accuracy before repair:

$$P = \frac{\sum_{t=1}^{n} M_t - \sum_{t=1}^{n} N_t}{\sum_{t=1}^{n} M_t} \times 100\%$$

where M_t is the detected data increment of the target database at the t-th detection, N_t is the problem data increment at the t-th detection, R_t is the increment of problem data successfully repaired according to the t-th detection result, and t = 1, 2, …, n.

S2, calculating the data accuracy after repair:

$$Q = \frac{\sum_{t=1}^{n} M_t - \sum_{t=1}^{n} N_t + \sum_{t=1}^{n} R_t}{\sum_{t=1}^{n} M_t} \times 100\%$$
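A minimal sketch of the two accuracy calculations follows. It assumes that P is the share of all detected records that never entered the problem database and that Q additionally counts repaired problems as correct; this reading is inferred from the definitions of M_t, N_t and R_t, and the function name is hypothetical. The rates are returned as fractions between 0 and 1 rather than percentages.

```python
from typing import Sequence, Tuple


def accuracy_rates(m: Sequence[int], n: Sequence[int], r: Sequence[int]) -> Tuple[float, float]:
    """Return (P, Q) after len(m) detections.

    m[t-1], n[t-1], r[t-1] hold M_t, N_t and R_t for detection t.
    """
    if not (len(m) == len(n) == len(r)):
        raise ValueError("m, n and r must cover the same detections")
    total_m, total_n, total_r = sum(m), sum(n), sum(r)
    if total_m <= 0:
        raise ValueError("no data has been detected yet")
    p = (total_m - total_n) / total_m            # accuracy before repair
    q = (total_m - total_n + total_r) / total_m  # accuracy after repair
    return p, q
```

For example, with two detections of 100 and 50 records, 10 and 5 new problems, and 4 and 6 repairs, the sums are 150, 15 and 10, so P = 135/150 and Q = 145/150.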

Step eight, by repeatedly executing step seven, the data accuracy before repair and the data accuracy after repair are obtained, and the specimen line continuously shows the trends of the "marked line" and the "local line" over the course of data quality governance.

The trend of the "marked line" represents the effect of remedying the data quality problems that have been detected, that is, how well the data quality work deals with existing problems; a rising "marked line" reflects the diligence of the department responsible for the data. The trend of the "local line" reflects comprehensive analysis of data quality problems, whereby the data acquisition process, the business process and the business system data are optimized and data quality problems are solved at the source. The trend of the "local line" thus reflects how well source data quality problems are being governed and the data management maturity of the responsible department.

The trend line of the data accuracy after repair is the "marked line". With continuous checking and repair, the data accuracy becomes higher and higher, showing the effect of treating the symptoms.

The trend line of the data accuracy before repair is the "local line". Through continuous inspection, feedback, training and explanation, the accuracy of the original data becomes higher and higher, so that overall data quality and quality awareness develop in a benign direction, showing the effect of treating the root cause.

When the "marked line" and the "local line" both approach 100% and the curves become increasingly flat, data quality management has achieved the purpose of treating both symptoms and root causes, which is an ideal and sustainable mode of data quality management.
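The two trend lines can be sketched as running series, one point per detection, together with a simple test for "close to 100% and flat". The accuracy definitions are the same inferred reading as for step seven, and the window, tolerance and target thresholds are arbitrary illustrative values, not figures from the patent.

```python
from itertools import accumulate
from typing import List, Sequence, Tuple


def trend_lines(m: Sequence[int], n: Sequence[int],
                r: Sequence[int]) -> Tuple[List[float], List[float]]:
    """Return (local_line, marked_line) as fractions, one point per detection.

    local_line[k]  is the accuracy before repair after k+1 detections.
    marked_line[k] is the accuracy after repair after k+1 detections.
    """
    if not (len(m) == len(n) == len(r)):
        raise ValueError("m, n and r must cover the same detections")
    local_line, marked_line = [], []
    for cum_m, cum_n, cum_r in zip(accumulate(m), accumulate(n), accumulate(r)):
        if cum_m <= 0:
            raise ValueError("the first detection must cover some data")
        local_line.append((cum_m - cum_n) / cum_m)
        marked_line.append((cum_m - cum_n + cum_r) / cum_m)
    return local_line, marked_line


def has_levelled_off(line: Sequence[float], window: int = 3,
                     tolerance: float = 0.01, target: float = 0.99) -> bool:
    """True if the last `window` points are all >= target and vary by <= tolerance."""
    tail = list(line)[-window:]
    if len(tail) < window:
        return False
    return min(tail) >= target and max(tail) - min(tail) <= tolerance
```

Plotting both series against the detection number gives the specimen line chart; a "marked line" that has levelled off while the "local line" is still climbing would indicate that repairs are keeping up but source optimization is not yet complete.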

Compared with the existing representation mode, the invention has the beneficial effects that:

(1) Continuous visual analysis of data quality is provided for users, and the effect of data quality management is reflected intuitively.

(2) The calculation and representation of the "marked line" and the "local line" enable continuous tracking and analysis of the effect of data governance, which helps improve data quality at the source and form a virtuous cycle of data quality.

(3) In the process of data quality management, the method is also an effective means of measuring the data management maturity of the department responsible for the data.

Drawings

In order to more clearly illustrate the technical solution of the present invention, the drawings used for the description will be briefly introduced below. It is evident that the drawings in the following description are only some embodiments of the present invention and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art.

FIG. 1 is a flow chart of the specimen-line data quality representation method of the present invention.

FIG. 2 is a schematic diagram of the "marked line" and the "local line" of the present invention.

Detailed Description

The invention will be described in further detail with reference to the accompanying drawings and specific embodiments. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Referring to FIG. 1, the flow of the specimen-line data quality representation method of the invention includes: detecting the target database; obtaining the detection result and forming a problem database; repairing the problems; cycling the data detection and repair tasks multiple times; calculating, each day, the data accuracy before repair and the data accuracy after repair; and displaying the data quality governance trend with the specimen line. The specific steps are as follows:

Step two, setting a detection model, and detecting the data in the target database.

In this example, the detection model contains format normalization check, referential integrity check, null check and logic check rules.

In this example, t=1 when first detected.

Further, a problem database is constructed according to the following method:

For data currently entering the problem database, Z_b defaults to unrepaired.

In this example, t=1 at the first calculation.

After the 2nd detection of the target database, the server updates the problem database and records M_t, N_t and R_t as follows:

S1, recording the detected data increment M_2 of the target database at the 2nd detection;

S2, inserting the new problem data into the problem database and recording N_2, where N_2 is the number of problems newly added to the problem database at the 2nd detection;

S3, for problem data that has been repaired, the server updates Z_b to repaired and records R_2, where R_2 is the increment of repaired problem data in the problem database at the 2nd detection.

In this example, the data detection task is cycled 5 times in total; the detection process is recorded in Table 1:

Table 1. Record of the 5 detection cycles in this example

Detection number t	Detected data increment M_t	Problem data increment N_t	Repaired problem increment R_t
1	M_1	N_1	R_1
2	M_2	N_2	R_2
3	M_3 = 0	N_3 = 0	R_3
4	M_4	N_4	R_4
5	M_5	N_5 = 0	R_5

S1, performing the 3rd detection of the target database according to step two.

S2, obtaining the detection result according to steps three and four:

the total amount of data in the target database is unchanged, M_3 = 0;

there are no new problems, N_3 = 0;

the increment of repaired problem data is R_3.

S3, performing the 4th detection and repair of the target database according to steps two to four, and obtaining the detection result:

the amount of detected data in the target database increases and is recorded as M_4;

the increment of problem data is N_4;

the increment of repaired problem data is R_4.

S4, according to step five, the user optimizes the business process.

S5, performing the 5th detection and repair of the target database according to steps two to four, and obtaining the detection result:

the amount of detected data in the target database increases and is recorded as M_5;

the increment of problem data is N_5 = 0;

the increment of repaired problem data is R_5.

Step seven, calculating the data accuracy P before repair and the data accuracy Q after repair for the 5th detection, as follows:

S1, calculating the data accuracy before repair:

$$P = \frac{\sum_{t=1}^{5} M_t - \sum_{t=1}^{5} N_t}{\sum_{t=1}^{5} M_t} \times 100\%$$

where $\sum_{t=1}^{5} M_t$ represents the sum of the detected data increments of the 1st to 5th detections in this example, and $\sum_{t=1}^{5} N_t$ represents the sum of the problem data increments of the 1st to 5th detections in this example.

S2, calculating the data accuracy after repair:

$$Q = \frac{\sum_{t=1}^{5} M_t - \sum_{t=1}^{5} N_t + \sum_{t=1}^{5} R_t}{\sum_{t=1}^{5} M_t} \times 100\%$$

where $\sum_{t=1}^{5} M_t$ represents the sum of the detected data increments of the 1st to 5th detections in this example; $\sum_{t=1}^{5} N_t$ represents the sum of the problem data increments of the 1st to 5th detections in this example; and $\sum_{t=1}^{5} R_t$ represents the sum of the problem data increments successfully repaired in the 1st to 5th detections in this example.
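Substituting the zero entries of Table 1 (M_3 = 0, N_3 = 0, N_5 = 0) into these sums gives a reduced form for the fifth detection. The expressions below assume that P is the share of detected data not entered into the problem database and that Q additionally counts repaired problem data as correct; this is a reading inferred from the symbol definitions in this example, not a quotation of the original formulas.

```latex
\begin{aligned}
P &= \frac{(M_1 + M_2 + M_4 + M_5) - (N_1 + N_2 + N_4)}{M_1 + M_2 + M_4 + M_5} \times 100\% \\[4pt]
Q &= \frac{(M_1 + M_2 + M_4 + M_5) - (N_1 + N_2 + N_4) + (R_1 + R_2 + R_3 + R_4 + R_5)}{M_1 + M_2 + M_4 + M_5} \times 100\%
\end{aligned}
```

Under this reading, the fifth detection adds M_5 to the denominator without adding any problem data, so P after the fifth detection is higher than after the fourth whenever M_5 > 0 and at least one problem has been recorded. This is the rise in the "local line" that follows the business process optimization in S4.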

Step eight, by repeatedly executing step seven, the data accuracy before repair and the data accuracy after repair are obtained, and the specimen line continuously shows the trends of the "marked line" and the "local line" over the course of data quality governance.

Claims

1. A data quality representation method using a specimen line, comprising the steps of:

step one, a user collects data into a target database through file exchange, interface exchange, database table exchange and manual collection modes;

step two, setting a detection model, and detecting data in a target database;

step three, obtaining a detection result, forming a problem database, and obtaining the detected data increment M_t of the current day and the data increment N_t of the problem database for the current day;

step four, repairing the problems, wherein a user repairs the problem data according to the detection result, and the number R_t of repaired problems in the problem database for the current day is calculated from the state of the problem database;

step five, optimizing the source of the target database by a user, solving the data problem from the source, wherein the source optimization of the target database comprises optimization of a business process, a collection process and a processing process;

step six, repeating the step two to the step five, and carrying out repeated data detection and repair tasks;

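the problem database construction in step three need not be repeated after the first construction;

step seven, calculating, for the n-th detection, the data accuracy P before repair and the data accuracy Q after repair, as follows:

S1, calculating the data accuracy before repair:

$$P = \frac{\sum_{t=1}^{n} M_t - \sum_{t=1}^{n} N_t}{\sum_{t=1}^{n} M_t} \times 100\%$$

wherein, at the t-th detection, the data increment N_t is the increment of problem data in the problem database, the detected data increment M_t is the increment of detected data in the target database, and the repaired problem number R_t is the increment of problem data successfully repaired according to the t-th detection result, t = 1, 2, …, n;

S2, calculating the data accuracy after repair:

$$Q = \frac{\sum_{t=1}^{n} M_t - \sum_{t=1}^{n} N_t + \sum_{t=1}^{n} R_t}{\sum_{t=1}^{n} M_t} \times 100\%$$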

step eight, by repeatedly executing step seven, obtaining the data accuracy before repair and the data accuracy after repair, and continuously showing, through the specimen line, the trends of the marked line and the local line over the course of data quality governance;

in step eight:

the trend of the marked line represents the effect of remedying the data quality problems that have been detected, and reflects how well the data quality work deals with existing problems;

the trend of the local line reflects comprehensive analysis of data quality problems, whereby the data acquisition process, the business process and the business system data are optimized and data quality problems are solved at the source;

the trend line of the data accuracy after repair is the marked line, and the trend line of the data accuracy before repair is the local line.

2. The data quality representation method using a specimen line according to claim 1, wherein in step two:

the detection model comprises format normalization check, referential integrity check, null check, missing data check, data volume check, uniqueness check, value range check, logic check, consistency check, cross-comparison check and timeliness check rules.

3. The data quality representation method using a specimen line according to claim 1, wherein in step three, the problem database is constructed as follows:

the problem database comprises a basic field Z_w and a flag field Z_b, wherein the basic field Z_w comes from the fields of the target database and comprises a check field name and a check field value, and the flag field Z_b is a field for marking the data status and comprises a repair identifier; for data entering the problem database for the first time, Z_b defaults to unrepaired.