CN115292298A - Data quality verification system based on metadata - Google Patents

Publication number
CN115292298A
Authority
CN
China
Prior art keywords
metadata
constraint
data quality
evaluation
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210824074.1A
Other languages
Chinese (zh)
Inventor
刘磊
徐奎东
毛志军
汤士伟
谢志宇
徐瀚昌
姜锋
沈欢
杨秋芬
潘宁
张丽
马玉刚
党忠妍
汪森然
王卫新
周融
王奇
韦法林
田亚龙
张志航
吕军成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
WONDERS INFORMATION CO Ltd
Original Assignee
WONDERS INFORMATION CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by WONDERS INFORMATION CO Ltd
Priority to CN202210824074.1A
Publication of CN115292298A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3466 Performance evaluation by tracing or monitoring
    • G06F11/3476 Data logging
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2457 Query processing with adaptation to user needs
    • G06F16/24573 Query processing with adaptation to user needs using data annotations, e.g. user-defined metadata
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Library & Information Science (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a metadata-based data quality verification system comprising a metadata management subsystem, a data quality evaluation subsystem and a data quality display subsystem. To manage data quality problems more efficiently and accurately, the invention provides an automatic, metadata-based data quality evaluation model that performs quality verification, report evaluation and problem tracing on the data accessed by a regional platform. Automatic evaluation reduces human intervention, lowers the likelihood of errors when verification indexes are configured manually, and improves working efficiency. The invention takes metadata as the standard: the metadata describes the database, table, field and assessment index information and serves as the basis for assessing data quality.

Description

Data quality verification system based on metadata
Technical Field
The invention relates to a data quality verification system.
Background
With the development of computer information technology, enterprises generate large amounts of data. Not all of this data meets application requirements, and its quality is uneven, so poor data quality still restricts the development of enterprises. A data quality control platform is therefore needed to determine which problems the received data has and why it fails to meet application requirements, so that data quality can be improved. Through such a platform an enterprise can evaluate and analyze data quality and find the problems in its data. In existing data quality control platforms, however, data is not unified because standard definitions, business terms, calculation bases and the like are inconsistent within enterprises, which makes data quality difficult to improve.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: in existing data quality control platforms, data is not unified because standard definitions, business terms, calculation bases and the like are inconsistent within enterprises, which makes data quality difficult to improve.
To solve the above technical problem, the technical solution of the invention provides a metadata-based data quality verification system comprising a metadata management subsystem, a data quality evaluation subsystem and a data quality display subsystem, wherein:
the metadata management subsystem defines a metadata model according to the quality check classification; after a data source is configured, it stores the database, table and field information of the data source in a metadata database, and generates specific data quality assessment indexes of different standard types based on the collected database, table and field information and the metadata model, wherein:
the metadata model is described by a Class and comprises a basic information class baseInfo, a constraint class constraint and an evaluation configuration class evaluation; the basic information of the metadata is defined in baseInfo, the basic standard normalization definitions and the data-set-oriented constraint definitions are implemented in constraint, and the evaluation parameters are configured in evaluation;
the data quality assessment indexes are divided into constraint, association, normalization, timeliness and stability;
based on the data quality assessment indexes generated by the metadata management subsystem, the data quality evaluation subsystem selects one or more metadata indexes from all the data quality assessment indexes to generate a data quality evaluation model, and then analyzes and calculates the data quality evaluation model to obtain the data quality control index assessment results, wherein:
the data quality evaluation model is a set of data quality scoring standards generated, according to requirements, by selecting one or more of the metadata indexes produced by metadata mapping;
for metadata indexes belonging to association, the data quality evaluation subsystem calculates the association rate = M / N, where a table T1 uploaded to the data quality verification system by an organization contains N records, of which M records can be associated with a table T2 uploaded by the same organization;
for metadata indexes belonging to constraint, the data quality evaluation subsystem calculates the constraint conformity rate = M / N, where a table T1 uploaded to the data quality verification system by an organization contains N records, of which M records have a corresponding record in a table T2 uploaded by the same organization;
for metadata indexes belonging to consistency, the data quality evaluation subsystem calculates the consistency rate between the detailed data statistics and the business operation data, consistency rate = (TOTAL1 - TOTAL2) / TOTAL1, where TOTAL1 is the data volume an organization reports to the data quality verification system through table T1 and TOTAL2 is the data volume the same organization reports through table T2;
for metadata indexes belonging to normalization, the data quality evaluation subsystem calculates the normalization rate = M / N, where a table T1 uploaded to the data quality verification system by an organization contains N records, of which M records have the target field filled in according to the specification;
for metadata indexes belonging to timeliness, the data quality evaluation subsystem calculates the average difference in days = M / N, where a table T1 uploaded to the data quality verification system by an organization contains N records, and M is the sum, in days, of the differences between the organization's last upload time and the business occurrence time;
for metadata indexes belonging to stability, the data quality evaluation subsystem calculates the proportion of non-outage days = (N - M) / N, where N is the number of days on which an organization is expected to upload to the data quality verification system and M is the number of outage days;
and the data quality display subsystem displays the data quality control index assessment results.
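The rates defined above are simple ratios, so they can be illustrated with a short Python sketch. The function names and the input shapes (lists of keys, pairs of dates) are assumptions made for illustration and do not come from the patent. The constraint conformity rate and the normalization rate have the same M / N form as the association rate.

```python
from datetime import date


def association_rate(t1_keys, t2_keys):
    """M / N: share of T1 records whose key also appears in T2."""
    t1_keys = list(t1_keys)
    t2_lookup = set(t2_keys)
    matched = sum(1 for key in t1_keys if key in t2_lookup)  # M
    return matched / len(t1_keys)  # N = len(t1_keys); empty T1 raises ZeroDivisionError


def consistency_rate(total1, total2):
    """(TOTAL1 - TOTAL2) / TOTAL1, as stated in the text."""
    return (total1 - total2) / total1


def average_lag_days(records):
    """M / N, where M sums (upload date - business date) in days over N records."""
    records = list(records)
    total_days = sum((uploaded - occurred).days for uploaded, occurred in records)
    return total_days / len(records)


def non_outage_ratio(upload_days, outage_days):
    """(N - M) / N for N expected upload days and M outage days."""
    return (upload_days - outage_days) / upload_days
```

For example, an organization that was expected to upload on 30 days and missed 3 of them would have a non-outage ratio of 0.9 under this reading.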
Preferably, the metadata management subsystem includes a metadata model construction module, a metadata collection module, and a metadata mapping module, wherein:
the metadata model is constructed by the metadata model construction module, is described by a Class, and consists of a basic information class baseInfo, a constraint class constraint and an evaluation configuration class evaluation;
the basic information class baseInfo defines:
metadata type code baseInfo/metaType, whose value range includes: 01 denotes integrity; 02 denotes consistency; 03 denotes normalization; 04 denotes timeliness; 05 denotes stability;
metadata type name baseInfo/metaTypeName;
metadata type description baseInfo/metaTypeDesc;
table code baseInfo/tableCode;
metadata name baseInfo/metaName;
metadata code baseInfo/metaCode;
to achieve integrity, the constraint class constraint defines:
association table code constraint/tableCode;
association relation node constraint/constraintRelations;
table field constraint/constraintRelations/constraintRelation/column;
association table field constraint/constraintRelations/columnR;
filter condition node constraint/constraintRelations/filters;
table field constraint/constraintRelations/filters/filter/column;
expression constraint/constraintRelations/filters/filter/expression, whose value range includes: 01 denotes equal to; 02 denotes greater than; 03 denotes greater than or equal to; 04 denotes less than; 05 denotes less than or equal to; 06 denotes starts with a string; 07 denotes ends with a string;
table field value constraint/constraintRelations/filters/filter/condVal;
to achieve normalization, the constraint class constraint defines:
table field code constraint/columnCode;
check type constraint/dicType, whose value range includes: 01 denotes value domain normalization; 02 denotes dictionary normalization; 03 denotes non-null value domain; 04 denotes specified format normalization; 05 denotes rationality;
check type value constraint/dicValue;
filter condition node constraint/constraintRelations/filters;
table field constraint/constraintRelations/filters/filter/column;
expression constraint/constraintRelations/filters/filter/expression, whose value range includes: 01 denotes equal to; 02 denotes greater than; 03 denotes greater than or equal to; 04 denotes less than; 05 denotes less than or equal to; 06 denotes starts with a string; 07 denotes ends with a string;
table field value constraint/constraintRelations/filters/filter/condVal;
to achieve consistency, the constraint class constraint defines:
comparison table code constraint/tableCode;
comparison relation node constraint/constraintRelations;
table field constraint/constraintRelations/constraintRelation/column;
table field statistical function constraint/constraintRelations/constraintRelation/columnExp;
comparison table field constraint/constraintRelations/constraintRelation/columnR;
comparison table field statistical function constraint/constraintRelations/constraintRelation/columnRExp;
filter condition node constraint/constraintRelations/filters;
table field constraint/constraintRelations/filters/filter/column;
expression constraint/constraintRelations/filters/filter/expression, whose value range includes: 01 denotes equal to; 02 denotes greater than; 03 denotes greater than or equal to; 04 denotes less than; 05 denotes less than or equal to; 06 denotes starts with a string; 07 denotes ends with a string;
table field value constraint/constraintRelations/filters/filter/condVal;
to achieve timeliness, the constraint class constraint defines:
business occurrence time field constraint/timeColumnCode;
the evaluation configuration class evaluation defines:
weight evaluation/weight;
scoring standard node evaluation/range;
lower limit value evaluation/range one/lowerLimit;
upper limit value evaluation/range one/upperLimit;
score evaluation/range one/score;
wherein the evaluation interval is defined by the upper limit value evaluation/range one/upperLimit and the lower limit value evaluation/range one/lowerLimit;
the metadata acquisition module stores the database, table and field information of the data source in the metadata database, the collected table information comprising three parts: business metadata, technical metadata and management metadata, wherein:
the business metadata comprises the business classification of the table, the business description of the table, the table name and the table user;
the technical metadata comprises table codes, field names, field types, field lengths, primary keys and the like;
the management metadata comprises the table creation time and the last query time;
the metadata mapping module generates specific data quality assessment indexes of different standard types based on the database, table and field information collected by the metadata acquisition module and the metadata model constructed by the metadata model construction module.
Preferably, the data quality evaluation subsystem comprises a data quality evaluation module and a data quality analysis module, wherein:
the data quality evaluation module selects one or more metadata indexes from all the data quality assessment indexes to generate the data quality evaluation model;
the data quality analysis module analyzes and calculates the data quality evaluation model to obtain the data quality control index assessment results; it classifies and summarizes the calculation results of metadata indexes of different index types, stores them in a data warehouse, generates data quality reports based on different thematic reports, and uses the big data platform technology Presto for query analysis, wherein each metadata index has a full score of 10 points, and the evaluation methods for the various metadata indexes are shown in the following table:
Figure BDA0003745666280000051 (table of evaluation methods for the various metadata indexes; image not reproduced in this text)
in the above table, a, b and c denote the scoring thresholds of each metadata index;
the data quality analysis module obtains the organization score by calculation based on the following formula:
Figure BDA0003745666280000061 (organization score formula; image not reproduced in this text)
in the formula, metadata index score_i denotes the score calculated after the data quality analysis module assesses the i-th metadata index, and metadata index weight_i denotes the preset weight of the i-th metadata index.
Preferably, the data quality reports generated by the data quality analysis module comprise a data quality assessment report and an anomaly tracking report, wherein:
the data quality assessment report displays the quality assessment results and the organization ranking information obtained from the organization scores, and visualizes the results that the data quality analysis module generates based on the data quality evaluation model;
based on the metadata index information and the analysis results of the data quality evaluation model, the data quality analysis module obtains the database, table and primary key of each erroneous record to form the anomaly tracking report, so that users can trace data quality problems to their source.
In order to manage the data quality problem more efficiently and accurately, the invention provides an automatic data quality evaluation model based on metadata, which is used for carrying out quality verification, report evaluation and problem tracing on data accessed to a regional platform. The invention takes the metadata as the standard, describes the database, the table information, the fields and the assessment index information, and becomes the basis for assessing the data quality.
Drawings
FIG. 1 is a flow chart of the system of the present invention.
Detailed Description
The invention will be further illustrated with reference to the following specific examples. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that various changes or modifications of the present invention may be made by those skilled in the art after reading the teaching of the present invention, and such equivalents may fall within the scope of the present invention as defined in the appended claims.
The metadata-based data quality control system disclosed in this embodiment consists of three parts: a metadata management subsystem, a data quality evaluation subsystem and a data quality display subsystem. The data quality display subsystem displays data quality information and is a conventional software system well known to those skilled in the art, so this embodiment describes in detail mainly the other two subsystems: the metadata management subsystem, which includes a metadata model construction module, a metadata acquisition module and a metadata mapping module; and the data quality evaluation subsystem, which includes a data quality evaluation module and a data quality analysis module. The data quality evaluation module further includes a data quality evaluation model construction unit and a model analysis unit, and the data quality analysis module further includes a data quality assessment report unit and an anomaly tracking unit.
With reference to FIG. 1:
(1) Metadata management subsystem
The metadata management subsystem consists of three parts: a metadata model construction module, a metadata acquisition module and a metadata mapping module.
The metadata model is a model that describes metadata and is the cornerstone on which the entire data quality control system operates. The metadata model is constructed by the metadata model construction module, is described by a Class, and consists of a basic information class baseInfo, a constraint class constraint and an evaluation configuration class evaluation.
The basic information class baseInfo defines:
metadata type code baseInfo/metaType: in this embodiment, the value range includes: 01 denotes integrity; 02 denotes consistency; 03 denotes normalization; 04 denotes timeliness; 05 denotes stability;
metadata type name baseInfo/metaTypeName;
metadata type description baseInfo/metaTypeDesc;
table code baseInfo/tableCode;
metadata name baseInfo/metaName;
metadata code baseInfo/metaCode.
The constraint class constraint implements the basic standard normalization definitions, namely value domain normalization, dictionary normalization, non-null value domain, specified format normalization and rationality, and also the data-set-oriented constraint definitions, namely integrity, consistency, timeliness and stability.
To achieve integrity, the constraint class constraint defines:
association table code constraint/tableCode;
association relation node constraint/constraintRelations;
table field constraint/constraintRelations/constraintRelation/column;
association table field constraint/constraintRelations/columnR;
filter condition node constraint/constraintRelations/filters;
table field constraint/constraintRelations/filters/filter/column;
expression constraint/constraintRelations/filters/filter/expression: in this embodiment, the value range includes: 01 denotes equal to; 02 denotes greater than; 03 denotes greater than or equal to; 04 denotes less than; 05 denotes less than or equal to; 06 denotes starts with a string; 07 denotes ends with a string;
table field value constraint/constraintRelations/filters/filter/condVal.
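As an illustration of how the filter expression codes 01 to 07 might be applied to a record, the following Python sketch maps each code to a comparison. The dictionary keys mirror the column, expression and condVal nodes named above; the dict-based record layout and the rule that all filters must hold together (logical AND) are assumptions made for illustration, not statements from the patent.

```python
import operator

# Expression codes from the text mapped to comparisons (value, condVal) -> bool.
FILTER_OPS = {
    "01": operator.eq,                                # equal to
    "02": operator.gt,                                # greater than
    "03": operator.ge,                                # greater than or equal to
    "04": operator.lt,                                # less than
    "05": operator.le,                                # less than or equal to
    "06": lambda v, c: str(v).startswith(str(c)),     # starts with a string
    "07": lambda v, c: str(v).endswith(str(c)),       # ends with a string
}


def row_matches(row, filters):
    """Return True when the record satisfies every filter (assumed AND semantics)."""
    return all(
        FILTER_OPS[f["expression"]](row[f["column"]], f["condVal"])
        for f in filters
    )
```

A record such as {"dept": "A01", "age": 30} would pass a filter list requiring dept to start with "A" (code 06) and age to be at least 18 (code 03).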
To achieve normalization, the constraint class constraint defines:
table field code constraint/columnCode;
check type constraint/dicType: in this embodiment, the value range includes: 01 denotes value domain normalization; 02 denotes dictionary normalization; 03 denotes non-null value domain; 04 denotes specified format normalization; 05 denotes rationality;
check type value constraint/dicValue;
filter condition node constraint/constraintRelations/filters;
table field constraint/constraintRelations/filters/filter/column;
expression constraint/constraintRelations/filters/filter/expression: in this embodiment, the value range includes: 01 denotes equal to; 02 denotes greater than; 03 denotes greater than or equal to; 04 denotes less than; 05 denotes less than or equal to; 06 denotes starts with a string; 07 denotes ends with a string;
table field value constraint/constraintRelations/filters/filter/condVal.
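The normalization check types can be sketched in the same way. The patent does not specify how dicValue is encoded, so the interpretations below are assumptions for illustration only: a (low, high) pair for type 01, a set of allowed codes for type 02, no value for type 03, and a regular expression for type 04. Type 05 (rationality) is business-specific and is left out of this sketch.

```python
import re


def is_conformant(value, dic_type, dic_value=None):
    """Check one field value against a check type (dicValue encodings are assumed)."""
    if value is None or value == "":
        return False  # assumption: an empty value fails every check type
    if dic_type == "01":  # value domain: assumed inclusive (low, high) pair
        low, high = dic_value
        return low <= value <= high
    if dic_type == "02":  # dictionary: assumed set of allowed codes
        return value in dic_value
    if dic_type == "03":  # non-null: any non-empty value conforms
        return True
    if dic_type == "04":  # specified format: assumed regular expression
        return re.fullmatch(dic_value, str(value)) is not None
    raise ValueError(f"unsupported check type: {dic_type}")


def normalization_rate(values, dic_type, dic_value=None):
    """M / N, where M counts the values that conform to the check."""
    values = list(values)
    conforming = sum(is_conformant(v, dic_type, dic_value) for v in values)
    return conforming / len(values)
```

The result of normalization_rate corresponds to the normalization rate M / N used when normalization indexes are assessed.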
To achieve consistency, the constraint class constraint defines:
comparison table code constraint/tableCode;
comparison relation node constraint/constraintRelations;
table field constraint/constraintRelations/constraintRelation/column;
table field statistical function constraint/constraintRelations/constraintRelation/columnExp;
comparison table field constraint/constraintRelations/constraintRelation/columnR;
comparison table field statistical function constraint/constraintRelations/constraintRelation/columnRExp;
filter condition node constraint/constraintRelations/filters;
table field constraint/constraintRelations/filters/filter/column;
expression constraint/constraintRelations/filters/filter/expression: in this embodiment, the value range includes: 01 denotes equal to; 02 denotes greater than; 03 denotes greater than or equal to; 04 denotes less than; 05 denotes less than or equal to; 06 denotes starts with a string; 07 denotes ends with a string;
table field value constraint/constraintRelations/filters/filter/condVal.
To achieve timeliness, the constraint class constraint defines:
business occurrence time field constraint/timeColumnCode.
Stability: whether an organization has uploaded records can be determined from the table code defined in the basic information class baseInfo, so nothing needs to be added to the constraint class.
The evaluation configuration class evaluation defines:
weight evaluation/weight;
scoring standard node evaluation/range;
lower limit value evaluation/range one/lowerLimit;
upper limit value evaluation/range one/upperLimit;
score evaluation/range one/score;
wherein the evaluation interval is defined by the upper limit value evaluation/range one/upperLimit and the lower limit value evaluation/range one/lowerLimit.
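A minimal sketch of how such evaluation intervals could turn a calculated rate into a score is given below. The ScoreRange class and score_for function are hypothetical names; treating each interval as half-open, [lowerLimit, upperLimit), and returning 0 when no interval matches are assumptions, since the text does not state how boundaries or unmatched values are handled.

```python
from dataclasses import dataclass


@dataclass
class ScoreRange:
    lower_limit: float  # corresponds to lowerLimit
    upper_limit: float  # corresponds to upperLimit
    score: float        # score awarded when the value falls in the interval


def score_for(value, ranges):
    """Return the score of the first interval containing value, else 0.0 (assumed)."""
    for r in ranges:
        if r.lower_limit <= value < r.upper_limit:  # assumed half-open interval
            return r.score
    return 0.0


# Illustrative configuration only; the thresholds are made up for the example.
EXAMPLE_RANGES = [
    ScoreRange(0.95, float("inf"), 10.0),
    ScoreRange(0.80, 0.95, 6.0),
]
```

With this example configuration, a rate of 0.97 would score 10 points and a rate of 0.80 would score 6 points.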
In this embodiment, the metadata model definition is entered through the metadata model management menu page.
Metadata acquisition module:
After a data source is configured, the metadata acquisition module stores the database, table and field information of the data source in the metadata database. The collected table information consists of three parts: business metadata, technical metadata and management metadata, wherein:
the business metadata comprises the business classification of the table, the business description of the table, the table name and the table user;
the technical metadata comprises table codes, field names, field types, field lengths, primary keys and the like;
the management metadata comprises the table creation time and the last query time.
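The collection of technical metadata can be illustrated with Python's standard sqlite3 module. SQLite is used here only as a convenient stand-in, because the patent does not name a database product; the collect_metadata function and the output keys (tableCode, fieldName, fieldType, primaryKey) are hypothetical names chosen for this sketch.

```python
import sqlite3


def collect_metadata(conn):
    """Collect table and field metadata from a SQLite connection (stand-in source)."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    collected = []
    for table in tables:
        # PRAGMA table_info yields: cid, name, type, notnull, dflt_value, pk
        for _cid, name, col_type, _notnull, _default, pk in conn.execute(
            f'PRAGMA table_info("{table}")'
        ):
            collected.append(
                {
                    "tableCode": table,
                    "fieldName": name,
                    "fieldType": col_type,
                    "primaryKey": bool(pk),
                }
            )
    return collected
```

In a real deployment the same information would come from the catalog views of whichever database the data source uses, and the business and management metadata would be collected alongside it.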
Metadata mapping module:
The metadata mapping module generates specific data quality assessment indexes of different standard types based on the database, table and field information collected by the metadata acquisition module and the metadata model constructed by the metadata model construction module. The data quality assessment indexes are divided into five categories: integrity (including constraint and association), consistency, normalization, timeliness and stability. The generated data quality assessment indexes are the source of the indexes that the data quality evaluation subsystem checks. In this embodiment, index content in the following format is generated:
Figure BDA0003745666280000101 (format of the generated index content; image not reproduced in this text)
(2) Data quality evaluation subsystem
Based on the data quality assessment indexes generated by the metadata management subsystem, the data quality evaluation subsystem selects one or more metadata indexes from all the data quality assessment indexes to generate a data quality evaluation model; the data quality analysis module then analyzes and calculates the data quality evaluation model to obtain the data quality control index assessment results.
The data quality evaluation model is an important component of data quality control: based on the requirements of the project customer, one or more of the metadata indexes produced by metadata mapping are selected to generate a set of data quality scoring standards.
According to the assessment requirements, the monitoring indexes of each organization are assessed over different time dimensions (week/month/quarter/year). The assessment method for each index is as follows:
Figure BDA0003745666280000111 (table of assessment methods for each index; image not reproduced in this text)
For example:
Figure BDA0003745666280000112 (example; image not reproduced in this text)
Figure BDA0003745666280000121 (example, continued; image not reproduced in this text)
The data quality analysis module automatically calculates each metadata index in the data quality evaluation model by means of mathematical formulas, Java data processing packages, functions, stored procedures and the like. It classifies and summarizes the calculation results of the different index types, stores them in a data warehouse, generates data quality reports based on different thematic reports, and uses the big data platform technology Presto for query analysis.
Based on the automatically calculated metadata index results and according to the assessment requirements, the data quality analysis module assesses each metadata index of each organization over different time dimensions (week/month/quarter/year). Each metadata index has a full score of 10 points. The evaluation methods for the various metadata indexes are as follows:
Figure BDA0003745666280000122 (table of evaluation methods for the various metadata indexes; image not reproduced in this text)
Figure BDA0003745666280000131 (table, continued; image not reproduced in this text)
Note: the scoring thresholds (a/b/c) differ from index to index and are configured in the system according to management requirements.
The data quality analysis module obtains the organization score by calculation based on the following formula:
Figure BDA0003745666280000132 (organization score formula; image not reproduced in this text)
In the formula, metadata index score_i denotes the score calculated after the data quality analysis module assesses the i-th metadata index, and metadata index weight_i denotes the preset weight of the i-th metadata index.
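The formula itself is only available as an image, so its exact form cannot be confirmed from this text. The Python sketch below assumes one plausible reading, a weighted sum of the per-index scores; the organization_score function is a hypothetical name, and dividing by the sum of the weights would be an equally plausible variant.

```python
def organization_score(indexes):
    """Weighted sum of (score_i, weight_i) pairs; an assumed form of the formula."""
    return sum(score * weight for score, weight in indexes)


# Illustrative values only: three indexes scored out of 10 with preset weights.
example = [(10, 0.5), (6, 0.3), (8, 0.2)]
```

Under this assumption the example list yields 10 x 0.5 + 6 x 0.3 + 8 x 0.2, which is 8.4 up to floating-point rounding.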
The data quality reports generated by the data quality analysis module include a data quality assessment report and an anomaly tracking report.
The data quality assessment report presents, in various forms (such as charts and reports), the quality assessment results and the organization ranking information obtained from the organization scores, and visualizes the results that the data quality analysis module generates based on the data quality evaluation model.
Based on the metadata index information and the analysis results of the data quality evaluation model, the data quality analysis module obtains the database, table and primary key of each erroneous record to form the anomaly tracking report, so that users can trace data quality problems to their source.
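The anomaly tracking idea can be sketched for an association index as follows. The anomaly_records function, its parameters and the dict-based record layout are hypothetical and chosen for illustration; the sketch simply lists the database, table and primary key of every T1 record whose association field has no match in T2.

```python
def anomaly_records(database, table, rows, pk_column, fk_column, related_keys):
    """List database, table and primary key for records that fail the association."""
    related = set(related_keys)
    return [
        {"database": database, "table": table, "primaryKey": row[pk_column]}
        for row in rows
        if row[fk_column] not in related
    ]


# Illustrative data only: visit records referencing patient ids.
visits = [
    {"visit_id": 1, "patient_id": "p1"},
    {"visit_id": 2, "patient_id": "p9"},
]
report = anomaly_records("his", "visit", visits, "visit_id", "patient_id", ["p1", "p2"])
```

Here the report would contain one entry pointing at the visit record with primary key 2, which is the record a user would need to inspect.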
A specific implementation of the metadata-based data quality control system comprises the following steps:
Step 1. Operation and maintenance personnel maintain the metadata model information in the metadata model.
Step 2. Operation and maintenance personnel maintain the data source information in data source management, including the database driver, IP address, port, account, and password.
Step 3. The metadata acquisition module acquires the metadata of the databases, tables, and fields from the data source.
Step 4. Metadata quality indexes are generated from the metadata model of step 1 and the database, table, and field metadata of step 3.
Step 5. The data quality evaluation model is maintained: the metadata quality indexes generated in step 4 are selected and an execution period is set, which generates the data quality evaluation model.
Step 6. The data quality evaluation model is analyzed to generate the assessment index results, which are stored in the data warehouse; thematic reports are generated, and the assessment report is displayed at the front end in chart form.
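Step 3 can be illustrated with a small Python sketch that reads table and field metadata from a data source. SQLite is used here only because it ships with Python; the patent does not name a database product, and the function name `collect_table_metadata` and the `visit` table are hypothetical.

```python
import sqlite3
from typing import Dict, List


def collect_table_metadata(conn: sqlite3.Connection) -> Dict[str, List[dict]]:
    """Return {table name: [field descriptors]} for every table in the database."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    metadata: Dict[str, List[dict]] = {}
    for table in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        metadata[table] = [
            {"field": name, "type": col_type, "primary_key": bool(pk)}
            for _cid, name, col_type, _notnull, _default, pk in columns
        ]
    return metadata


# Hypothetical data source with a single table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE visit (visit_id INTEGER PRIMARY KEY, patient_id TEXT, visit_time TEXT)"
)
collected = collect_table_metadata(conn)
```

The collected dictionary corresponds to the technical metadata (table code, field name, field type, primary key) that step 4 would combine with the metadata model.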

Claims (4)

1. A data quality verification system based on metadata is characterized by comprising a metadata management subsystem, a data quality evaluation subsystem and a data quality display subsystem, wherein:
the metadata management subsystem defines a metadata model according to quality check classification, stores database, table and field information of a data source into a metadata database after the data source is configured, and generates specific data quality assessment indexes of different standard types based on the collected database, table, field information and metadata model, wherein:
the metadata model is described through Class, and is composed of basic information Class baseInfo, constraint Class constraint and evaluation configuration Class evaluation, the basic information of the metadata is defined in the basic information Class baseInfo, the definition of basic standard normalization and the definition of data set-oriented constraint are realized in the constraint Class constraint, and evaluation parameters are configured in the evaluation configuration Class evaluation;
the data quality assessment indexes are divided into constraint, relevance, normalization, timeliness and stability;
the data quality evaluation subsystem selects one or more metadata indexes from all the data quality evaluation indexes to generate a data quality evaluation model based on the data quality evaluation indexes generated by the metadata management subsystem, and then the data quality evaluation model is analyzed and calculated to obtain a data quality control index evaluation result, wherein:
the data quality evaluation model is defined according to requirements: one or more metadata indexes generated by metadata mapping are selected to form a set of data quality scoring standards;
for metadata indexes belonging to relevance, the data quality evaluation subsystem calculates the relevance rate of the metadata index, relevance rate = M/N, where a table T1 uploaded to the data quality verification system by an organization contains N records, of which M records can be associated with a table T2 uploaded by the same organization;
for metadata indexes belonging to constraint, the data quality evaluation subsystem calculates the constraint conformance rate of the metadata index, constraint conformance rate = M/N, where a table T1 uploaded to the data quality verification system by an organization contains N records, of which M records have a matching record in a table T2 uploaded by the same organization;
the data quality evaluation subsystem calculates, for the metadata index, the consistency rate between the detail-data statistics and business operations, consistency rate = (TOTAL1 - TOTAL2)/TOTAL1, where the data volume reported by an organization to the data quality verification system through a table T1 is TOTAL1 and the data volume reported by the same organization through a table T2 is TOTAL2;
for metadata indexes belonging to normalization, the data quality evaluation subsystem calculates the normalization rate of the metadata index, normalization rate = M/N, where a table T1 uploaded to the data quality verification system by an organization contains N records, of which M records have the target field filled in according to the specification;
for metadata indexes belonging to timeliness, the data quality evaluation subsystem calculates the average difference in days of the metadata index, average difference in days = M/N, where a table T1 uploaded to the data quality verification system by an organization contains N records, and the differences between the last upload time and the business time of the organization sum to M days;
for metadata indexes belonging to stability, the data quality evaluation subsystem calculates the proportion of non-outage days = (N-M) ÷ N, where the organization was due to upload to the data quality verification system on N days and uploading was interrupted on M days;
and the data quality display subsystem is used for displaying the examination result of the data quality control index.
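The ratios defined in claim 1 can be sketched in Python as follows. This is an illustrative reading, not the patented implementation: all function names are hypothetical, records are represented by their join keys or field values, and the constraint conformance rate is not shown separately because it has the same M/N shape as the relevance rate.

```python
from datetime import date
from typing import Callable, Hashable, Iterable, Sequence


def _ratio(m: float, n: float) -> float:
    """Return m / n, rejecting an empty or invalid denominator."""
    if n <= 0:
        raise ValueError("N must be positive")
    return m / n


def relevance_rate(t1_keys: Sequence[Hashable], t2_keys: Iterable[Hashable]) -> float:
    """M/N: share of T1 records whose key also appears in T2."""
    t2 = set(t2_keys)
    matched = sum(1 for key in t1_keys if key in t2)
    return _ratio(matched, len(t1_keys))


def consistency_rate(total1: float, total2: float) -> float:
    """(TOTAL1 - TOTAL2) / TOTAL1, as written in the claim."""
    return _ratio(total1 - total2, total1)


def normalization_rate(values: Sequence[object],
                       is_normalized: Callable[[object], bool]) -> float:
    """M/N: share of records whose target field passes the specification check."""
    conforming = sum(1 for value in values if is_normalized(value))
    return _ratio(conforming, len(values))


def average_lag_days(upload_dates: Sequence[date],
                     business_dates: Sequence[date]) -> float:
    """M/N: summed (upload - business) day differences over the record count."""
    if len(upload_dates) != len(business_dates):
        raise ValueError("one upload date is needed per business date")
    total_days = sum((u - b).days for u, b in zip(upload_dates, business_dates))
    return _ratio(total_days, len(upload_dates))


def non_outage_ratio(total_days: int, outage_days: int) -> float:
    """(N - M) / N: share of days on which uploading was not interrupted."""
    return _ratio(total_days - outage_days, total_days)
```

For example, under these assumptions `relevance_rate(["a", "b", "c", "d"], ["a", "c", "x"])` evaluates to 0.5, because two of the four T1 keys are found in T2.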
2. The metadata-based data quality verification system of claim 1, wherein the metadata management subsystem comprises a metadata model construction module, a metadata collection module, and a metadata mapping module, wherein:
the metadata model is constructed by a metadata model construction module, the metadata model is described by Class, and the metadata model is composed of basic information Class baseInfo, constraint Class constraint and evaluation configuration Class evaluation:
the basic information class baseInfo defines:
metadata type code baseInfo/metaType, whose value range includes: 01 denotes integrity; 02 denotes consistency; 03 denotes normalization; 04 denotes timeliness; 05 denotes stability;
metadata type name baseInfo/metaTypeName;
metadata type description baseInfo/metaTypeDesc;
table code baseInfo/tableCode;
metadata name baseInfo/metaName;
metadata code baseInfo/metaCode;
to implement integrity, the constraint class constraint defines:
association table code constraint/tableCode;
association relation node constraint/constraintRelations;
table field constraints/column;
association table field constraint/constraintRelations/columnR;
filter condition node constraint/constraintRelations/filters;
table field constraint/constraintRelations/filters/filter/column;
expression constraint/constraintRelations/filters/filter/expression, whose value range includes: 01 means equal to; 02 means greater than; 03 means greater than or equal to; 04 means less than; 05 means less than or equal to; 06 means starts with the string; 07 means ends with the string;
table field value constraint/constraintRelations/filters/filter/condVal;
to implement normalization, the constraint class constraint defines:
table field code constraint/columnCode;
check type constraint/dicType, whose value range includes: 01 value-range specification; 02 dictionary specification; 03 non-null value range; 04 format specification; 05 rationality;
check type value constraint/dicValue;
filter condition node constraint/constraintRelations/filters;
table field constraint/constraintRelations/filters/filter/column;
expression constraint/constraintRelations/filters/filter/expression, whose value range includes: 01 means equal to; 02 means greater than; 03 means greater than or equal to; 04 means less than; 05 means less than or equal to; 06 means starts with the string; 07 means ends with the string;
table field value constraint/constraintRelations/filters/filter/condVal;
to implement consistency, the constraint class constraint defines:
comparison table code constraint/tableCode;
comparison relation node constraint/constraintRelations;
table field constraints/column;
table field statistical function constraint/constraintRelations/constraintRelation/columnExp;
comparison table field constraint/constraintRelations/columnR;
comparison table field statistical function constraint/constraintRelations/constraintRelation/columnRExp;
filter condition node constraint/constraintRelations/filters;
table field constraint/constraintRelations/filters/filter/column;
expression constraint/constraintRelations/filters/filter/expression, whose value range includes: 01 means equal to; 02 means greater than; 03 means greater than or equal to; 04 means less than; 05 means less than or equal to; 06 means starts with the string; 07 means ends with the string;
table field value constraint/constraintRelations/filters/filter/condVal;
to implement timeliness, the constraint class constraint defines:
business occurrence time field constraint/timeColumnCode;
the evaluation configuration class evaluation defines:
weight evaluation/weight;
scoring standard node evaluation/range;
lower limit value evaluation/range one/lowerLimit;
upper limit value evaluation/range one/upperLimit;
score evaluation/range one/score;
wherein each scoring interval is defined by the upper limit value evaluation/range one/upperLimit and the lower limit value evaluation/range one/lowerLimit;
the metadata acquisition module stores the database, table, and field information of the data source into the metadata database, and the information acquired for each table comprises three parts: business metadata, technical metadata, and management metadata, wherein:
the business metadata comprises the business classification of the table, the business description of the table, the table name, and the table user;
the technical metadata comprises the table code, field names, field types, field lengths, primary key, and the like;
the management metadata comprises the table creation time and the last query time;
the metadata mapping module generates specific data quality assessment indexes of different standard types based on the database, table, and field information acquired by the metadata acquisition module and the metadata model constructed by the metadata model construction module.
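Two parts of the metadata model in claim 2 lend themselves to a short Python sketch: the filter expression codes 01 to 07 and the scoring intervals of the evaluation class. The class and function names are hypothetical, and two points are assumptions the claim does not settle: multiple filters are combined with logical AND, and each interval includes its lower limit and excludes its upper limit.

```python
import operator
from dataclasses import dataclass
from typing import Any, Callable, Dict, Mapping, Optional, Sequence

# Expression codes as listed in the claim.
OPERATORS: Dict[str, Callable[[Any, Any], bool]] = {
    "01": operator.eq,   # equal to
    "02": operator.gt,   # greater than
    "03": operator.ge,   # greater than or equal to
    "04": operator.lt,   # less than
    "05": operator.le,   # less than or equal to
    "06": lambda value, prefix: str(value).startswith(str(prefix)),
    "07": lambda value, suffix: str(value).endswith(str(suffix)),
}


@dataclass(frozen=True)
class Filter:
    column: str      # filters/filter/column
    expression: str  # filters/filter/expression, one of "01".."07"
    cond_val: Any    # filters/filter/condVal


def record_matches(record: Mapping[str, Any], filters: Sequence[Filter]) -> bool:
    """ASSUMPTION: a record is selected only if it satisfies every filter (AND)."""
    return all(OPERATORS[f.expression](record[f.column], f.cond_val) for f in filters)


@dataclass(frozen=True)
class ScoreRange:
    lower_limit: float
    upper_limit: float
    score: float


def score_for(value: float, ranges: Sequence[ScoreRange]) -> Optional[float]:
    """ASSUMPTION: intervals are half-open, lower_limit <= value < upper_limit."""
    for interval in ranges:
        if interval.lower_limit <= value < interval.upper_limit:
            return interval.score
    return None  # value falls outside every configured interval


# Hypothetical configuration: two filters and two scoring intervals.
adult_in_dept_a = [Filter("dept", "06", "A"), Filter("age", "03", 18)]
rate_ranges = [ScoreRange(0.9, 1.01, 10.0), ScoreRange(0.6, 0.9, 6.0)]
```

With this configuration, `record_matches({"dept": "A01", "age": 40}, adult_in_dept_a)` is true, and `score_for(0.95, rate_ranges)` yields 10.0.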
3. The metadata-based data quality verification system of claim 1, wherein the data quality assessment subsystem comprises a data quality assessment module and a data quality analysis module, wherein:
the data quality evaluation module is used for selecting one or more metadata indexes from all data quality assessment indexes to generate a data quality evaluation model,
the data quality analysis module is used for analyzing and calculating the data quality evaluation model to obtain a data quality control index evaluation result; the data quality analysis module classifies and summarizes the calculation results of metadata indexes of different index types, stores them in a data warehouse, generates data quality reports based on different thematic reports, and performs query analysis using the big data platform technology Presto, wherein: each metadata index has a full score of 10 points, and the evaluation method for each type of metadata index is shown in the following table:
Figure FDA0003745666270000051
in the above table, a, b, and c represent the scoring threshold of each metadata index;
the data quality analysis module obtains the organization score by calculation based on the following formula:
Figure FDA0003745666270000052
in the formula, metadata index score_i is the score calculated after the data quality analysis module assesses the i-th metadata index, and metadata index weight_i is the preset weight of the i-th metadata index.
4. A metadata-based data quality verification system as claimed in claim 3, wherein the data quality reports generated by the data quality analysis module include data quality assessment reports and anomaly tracking reports, wherein:
the data quality assessment report is used for displaying quality assessment results and organization ranking information obtained based on organization scores, and the results generated by the data quality analysis module based on the data quality assessment model are visualized through the data quality assessment report;
and, based on the metadata index information and the analysis results of the data quality evaluation model, the data quality analysis module obtains the database, table, and primary key of each erroneous record and compiles them into an anomaly tracking report, so that users can trace data quality problems back to their source.
CN202210824074.1A 2022-07-14 2022-07-14 Data quality verification system based on metadata Pending CN115292298A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210824074.1A CN115292298A (en) 2022-07-14 2022-07-14 Data quality verification system based on metadata

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210824074.1A CN115292298A (en) 2022-07-14 2022-07-14 Data quality verification system based on metadata

Publications (1)

Publication Number Publication Date
CN115292298A true CN115292298A (en) 2022-11-04

Family

ID=83823232

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210824074.1A Pending CN115292298A (en) 2022-07-14 2022-07-14 Data quality verification system based on metadata

Country Status (1)

Country Link
CN (1) CN115292298A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117743796A (en) * 2023-12-21 2024-03-22 太平洋资产管理有限责任公司 Instruction set automatic quality check method and system based on investment annotation data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination