CN114090558A

CN114090558A - Data quality management method and device for database

Info

Publication number: CN114090558A
Application number: CN202111329182.3A
Authority: CN
Inventors: 鲍梦瑶; 刘佳伟; 章鹏; 张谦; 殷雪梅
Original assignee: Alipay Hangzhou Information Technology Co Ltd; Ant Blockchain Technology Shanghai Co Ltd
Current assignee: Alipay Hangzhou Information Technology Co Ltd; Ant Blockchain Technology Shanghai Co Ltd
Priority date: 2021-11-10
Filing date: 2021-11-10
Publication date: 2022-02-25

Abstract

An embodiment of the specification provides a data quality management method and device for a database. The method comprises the following steps: acquiring a target SQL statement for the database; parsing the target SQL statement to obtain a plurality of data objects and a target association relationship among the plurality of data objects, wherein a single data object is a field or a data table; updating a pre-established data relationship record according to the plurality of data objects and the target association relationship, wherein the data relationship record comprises at least part of the data objects in the database and the existing association relationships among them; monitoring the data quality of the plurality of data objects, and judging whether there is a problem data object whose data quality does not meet requirements; when the judgment result shows that a problem data object exists, querying, from the updated data relationship record, a target data object that has a preset association relationship with the problem data object; and performing data quality management for the problem data object and the target data object. Data management efficiency can thereby be improved.

Description

Data quality management method and device for database

Technical Field

One or more embodiments of the present description relate to the field of computers, and more particularly, to a data quality management method and apparatus for a database.

Background

Driven by technological innovation, more and more enterprises are embarking on digital transformation. Every part of digital transformation revolves around data, including but not limited to data acquisition, accumulation, processing, and insight. Enterprises have also begun to treat data as an "asset" and to pay more attention to how to manage it effectively and realize its value. Effective data quality management supports objective analysis and decision making, and is the basis on which enterprises realize digital transformation.

Typically, data is stored in a database comprising a large number of data tables, with tens of fields per data table on average. Given the huge amount of data in a database, the efficiency of data quality management for the database needs to be improved.

Disclosure of Invention

One or more embodiments of the present specification describe a data quality management method and apparatus for a database, which can improve data management efficiency.

In a first aspect, there is provided a data quality management method for a database, the database comprising a plurality of data tables, each data table comprising a plurality of fields, the method comprising:

obtaining a target Structured Query Language (SQL) statement for performing a target behavior operation with respect to the database;

parsing the target SQL statement to obtain a plurality of data objects related to the target behavior operation and a target association relationship among the plurality of data objects, wherein a single data object is a field or a data table;

updating a pre-established data relationship record according to the plurality of data objects and the target association relationship, wherein the data relationship record comprises at least part of the data objects in the database and the existing association relationships among them;

monitoring the data quality of the plurality of data objects, and judging whether there is, among the plurality of data objects, a problem data object whose data quality does not meet the requirement;

when the judgment result is that the problem data object exists, querying, for the problem data object, a target data object having a preset association relationship with the problem data object from the updated data relationship record;

and performing data quality management for the problem data object and the target data object.

In one possible embodiment, the obtaining a target SQL statement for performing a target behavior operation on the database includes:

periodically collecting the historical SQL statements of the database, and taking the collected historical SQL statements as the target SQL statements.

In another possible embodiment, the obtaining a target SQL statement for performing a target behavior operation on the database includes:

capturing an SQL statement submitted and run by a user as the target SQL statement.

In a possible implementation manner, the monitoring the data quality of the plurality of data objects and judging whether there is a problem data object whose data quality does not meet the requirement among the plurality of data objects includes:

determining the evaluation score of any one of the plurality of data objects according to a preset quality evaluation rule, and determining a data object whose evaluation score falls within a preset interval as a problem data object.

In one possible embodiment, the data relationship records are stored in the form of a graph, the graph comprising nodes and connecting edges, the nodes corresponding to the data objects and the connecting edges corresponding to the associative relationships.

Further, the updating the pre-established data relationship record according to the plurality of data objects and the target association relationship includes:

adding a node corresponding to a first field to the pre-established graph, the first field being included in the plurality of data objects but not included in the data relationship record; the first field belongs to a first data table that already exists in the data relationship record;

adding a connecting edge corresponding to a first association relationship to the graph according to the first association relationship, included in the target association relationship, between the first field and a second field that already exists in the data relationship record; the second field belongs to a second data table that already exists in the data relationship record;

adding a connecting edge corresponding to a second association relationship to the graph according to the second association relationship, included in the target association relationship, between the first field and the first data table.

Further, the updating the pre-established data relationship record according to the plurality of data objects and the target association relationship includes:

adding a node corresponding to a third field and a node corresponding to a third data table to the pre-established graph, the third field and the third data table to which it belongs being included in the plurality of data objects but not included in the data relationship record;

adding a connecting edge corresponding to a third association relationship to the graph according to the third association relationship, included in the target association relationship, between the third field and a fourth field that already exists in the data relationship record; the fourth field belongs to a fourth data table that already exists in the data relationship record;

adding a connecting edge corresponding to a fourth association relationship to the graph according to the fourth association relationship, included in the target association relationship, between the third field and the third data table.

Further, the querying, from the updated data relationship record, a target data object having a preset association relationship with the problem data object includes:

determining a graph query statement according to the preset association relationship;

and querying a corresponding node from the updated graph by using the graph query statement, wherein the node corresponds to the target data object.

In one possible embodiment, the target data object includes:

an upstream data object having a preset association relationship with the problem data object and/or a downstream data object having a preset association relationship with the problem data object; if the first data object generates the second data object through any behavior operation, the first data object is an upstream data object of the second data object, and the second data object is a downstream data object of the first data object.

In a possible implementation manner, the problem data object and the target data object are both fields, and the preset association relationship includes truncation, where the truncation extracts a substring of the character string corresponding to the problem data object, or extracts a substring of the character string corresponding to the target data object.

In a possible implementation, the performing data quality management on the problem data object and the target data object includes:

monitoring the data quality of the target data object, and judging whether the data quality of the target data object does not meet the requirement;

and issuing an alert for the problem data object and for any target data object whose data quality is judged not to meet the requirement.

Further, the target data object comprises at least one downstream data object having a preset association relationship with the problem data object;

the performing data quality management for the problem data object and the target data object includes:

issuing an alert for the problem data object and the at least one downstream data object.

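Further, the target data object comprises a plurality of upstream data objects which have a preset association relationship with the problem data object;

the performing data quality management for the problem data object and the target data object includes:

searching, among the plurality of upstream data objects, for the upstream data object in which the problem first occurred, and taking that upstream data object as the root cause for data quality management.

In a possible implementation, the performing data quality management for the problem data object and the target data object includes:

feeding back the problem data object and the target data object to a data technician for data governance, or cleaning and organizing them according to preset data governance rules.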

In a second aspect, there is provided a data quality management apparatus for a database, the database comprising a plurality of data tables, each data table comprising a plurality of fields, the apparatus comprising:

an obtaining unit, configured to obtain a target Structured Query Language (SQL) statement for performing a target behavior operation on the database;

a parsing unit, configured to parse the target SQL statement obtained by the obtaining unit to obtain a plurality of data objects related to the target behavior operation and a target association relationship among the plurality of data objects, wherein a single data object is a field or a data table;

an updating unit, configured to update a pre-established data relationship record according to the plurality of data objects and the target association relationship obtained by the parsing unit, wherein the data relationship record comprises at least part of the data objects in the database and the existing association relationships among them;

a monitoring unit, configured to monitor the data quality of the plurality of data objects obtained by the parsing unit and to judge whether there is, among the plurality of data objects, a problem data object whose data quality does not meet the requirement;

a query unit, configured to query, for the problem data object, a target data object having a preset association relationship with the problem data object from the data relationship record updated by the updating unit when the judgment result of the monitoring unit indicates that the problem data object exists;

and the management unit is used for carrying out data quality management on the problem data object obtained by the monitoring unit and the target data object obtained by the query unit.

In a third aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of the first aspect.

In a fourth aspect, there is provided a computing device comprising a memory having stored therein executable code and a processor that, when executing the executable code, implements the method of the first aspect.

According to the method and the device provided by the embodiments of the specification, a target SQL statement for performing a target behavior operation on the database is first obtained. The target SQL statement is then parsed to obtain a plurality of data objects related to the target behavior operation and a target association relationship among the plurality of data objects, wherein a single data object is a field or a data table. Next, a pre-established data relationship record is updated according to the plurality of data objects and the target association relationship, wherein the data relationship record comprises at least part of the data objects in the database and the existing association relationships among them. The data quality of the plurality of data objects is then monitored, and it is judged whether there is, among the plurality of data objects, a problem data object whose data quality does not meet the requirement. When the judgment result is that the problem data object exists, a target data object having a preset association relationship with the problem data object is queried from the updated data relationship record. Finally, data quality management is performed for the problem data object and the target data object. As can be seen from the above, in the embodiments of the present specification, the plurality of data objects related to the target behavior operation and the target association relationship among them are obtained automatically by parsing the SQL statement; the pre-established data relationship record is then updated, and the target data object having a preset association relationship with the problem data object is queried from the updated record, so that data quality management is performed on both the problem data object and the target data object. The process is highly automated, reduces labor cost, and can improve data management efficiency.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings based on these drawings without creative effort.

FIG. 1 is a schematic diagram illustrating an implementation scenario of an embodiment disclosed herein;

FIG. 2 illustrates a flow diagram of a method for data quality management for a database, according to one embodiment;

FIG. 3 illustrates a schematic diagram of a data quality monitoring method according to one embodiment;

FIG. 4 illustrates a schematic diagram of an application scenario for data quality management according to one embodiment;

FIG. 5 illustrates a system architecture diagram for data quality management of a database, according to one embodiment;

FIG. 6 illustrates an offline data quality management flow diagram according to one embodiment;

FIG. 7 illustrates an online data quality management flow diagram according to one embodiment;

FIG. 8 illustrates a schematic block diagram of a data quality management apparatus for a database, according to one embodiment.

Detailed Description

The scheme provided by the specification is described below with reference to the accompanying drawings.

FIG. 1 is a schematic view of an implementation scenario of an embodiment disclosed in this specification. The implementation scenario involves data quality management for a database that includes a plurality of data tables, each data table including a plurality of fields, wherein a field corresponds to a column. Referring to FIG. 1, take as an example a database that includes table one and table two. Table one is an original data table in the database and includes field 1. The behavior operation on the database is to create table two, in which field 1 is in a truncation relationship with field 1 in table one, and field 2 is also in a truncation relationship with field 1 in table one. That is, field 1 and field 2 in table two are each obtained by extracting a substring of the character string corresponding to field 1 in table one. For example, if the character string corresponding to field 1 in table one is abcd, then field 1 in table two corresponds to the substring ab and field 2 in table two corresponds to the substring cd. Because field 1 and field 2 in table two are derived from field 1 in table one, if the data quality of field 1 in table one does not meet the requirement, the data quality of field 1 or field 2 in table two will, with high probability, not meet the requirement either; conversely, if the data quality of field 1 or field 2 in table two does not meet the requirement, the data quality of field 1 in table one may also not meet the requirement. In other words, for data objects that have an association relationship, the judgment results of whether their data quality meets the requirement are also associated.
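The truncation scenario of FIG. 1 can be illustrated with a minimal Python sketch. The function name `truncate_field_1` and the split position (2) are assumptions chosen only to reproduce the abcd example; they are not part of the original disclosure.

```python
from typing import Optional, Tuple


def truncate_field_1(value: Optional[str]) -> Tuple[Optional[str], Optional[str]]:
    """Derive field 1 and field 2 of table two from field 1 of table one.

    Hypothetical helper: the split position 2 is assumed so that
    "abcd" yields "ab" and "cd", as in the FIG. 1 example.
    """
    if value is None:
        # A null value upstream produces null values in both downstream fields.
        return None, None
    return value[:2], value[2:]


field_1, field_2 = truncate_field_1("abcd")
bad_1, bad_2 = truncate_field_1(None)
```

A null value in field 1 of table one yields null values in both derived fields, which is one way to see why quality judgments of associated data objects tend to be associated.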

The association relationship is also called a lineage (blood) relationship: it describes the upstream and downstream relationships between data, and generally includes copying, truncation, concatenation, conversion and the like. Typically, the data of one field is processed to obtain the data of another field.

Data quality: the degree to which data fits the purpose of a data consumer in a business environment and meets the specific requirements of a business scenario. It covers multiple aspects, such as accuracy, integrity, timeliness, relevance, consistency, reliability, and accessibility.

In this embodiment of the present specification, data quality monitoring may be performed on data objects in a database, where a single data object is a field or a data table. There are many ways of monitoring. For example, whether the data quality of a field or a data table meets the requirement may be determined by monitoring the number of null values in the field or the data table; generally, the larger the number of null values, the lower the data quality. For example, in FIG. 1, the number of null values in field 1 of table two is 2, and the number of null values in table two is 4; whether the data quality meets the requirement can be determined by comparing these counts with a preset null-value threshold.
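The null-value check can be sketched as follows. This is a minimal illustration under stated assumptions: the rows of table two, the helper names, and the threshold value 3 are hypothetical; only the counts (2 for field 1, 4 for table two) come from the description of FIG. 1.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def null_count(rows: List[Row], field: str) -> int:
    """Count the rows in which `field` is missing or null."""
    return sum(1 for row in rows if row.get(field) is None)


def table_null_count(rows: List[Row], fields: List[str]) -> int:
    """Total number of null values across the given fields of a table."""
    return sum(null_count(rows, field) for field in fields)


def meets_requirement(count: int, max_nulls: int) -> bool:
    """Data quality is acceptable when the null count stays within the threshold."""
    return count <= max_nulls


# Assumed contents of table two, chosen to match the counts stated for FIG. 1.
table_two: List[Row] = [
    {"field_1": "ab", "field_2": "cd"},
    {"field_1": None, "field_2": None},
    {"field_1": None, "field_2": None},
]
MAX_NULLS = 3  # hypothetical preset threshold

field_1_nulls = null_count(table_two, "field_1")
table_nulls = table_null_count(table_two, ["field_1", "field_2"])
field_1_ok = meets_requirement(field_1_nulls, MAX_NULLS)
table_ok = meets_requirement(table_nulls, MAX_NULLS)
```

With this assumed threshold, field 1 passes and table two as a whole becomes a problem data object; a different threshold would of course change the outcome.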

It should be noted that the above is only an example of data quality monitoring, and in practice, the data quality monitoring mode varies according to different requirements of specific service scenarios on data quality, and is not limited to this example. For example, it can also be determined whether the data quality of a field or a data table meets the requirement by monitoring the data update frequency of the field or the data table, and generally, the smaller the data update frequency, the lower the data quality.

In the embodiment of the present description, data quality monitoring may be performed by combining a plurality of data quality evaluation indexes.

Data quality management, also known as data governance: data governance is the whole set of management activities in an organization that involve the use of data. It effectively identifies various data quality problems, establishes data supervision, forms a data quality management system, monitors and discloses data quality problems, and provides detailed problem queries and quality improvement suggestions, thereby comprehensively improving the integrity, accuracy, timeliness, consistency and legality of data, reducing data management costs, and reducing the decision deviations and losses caused by unreliable data.

By parsing the operation statements executed against the database, the embodiments of the specification can automatically obtain a plurality of data objects related to a target behavior operation and a target association relationship among the plurality of data objects, and then perform data quality management based on the association relationship.

FIG. 2 shows a flowchart of a data quality management method for a database according to an embodiment, the database comprising a plurality of data tables, each comprising a plurality of fields. The method may be based on the implementation scenario shown in FIG. 1. As shown in FIG. 2, the data quality management method for a database in this embodiment includes the following steps: step 21, obtaining a target Structured Query Language (SQL) statement for performing a target behavior operation on the database; step 22, parsing the target SQL statement to obtain a plurality of data objects related to the target behavior operation and a target association relationship among the plurality of data objects, wherein a single data object is a field or a data table; step 23, updating a pre-established data relationship record according to the plurality of data objects and the target association relationship, wherein the data relationship record comprises at least part of the data objects in the database and the existing association relationships among them; step 24, monitoring the data quality of the plurality of data objects, and judging whether there is, among the plurality of data objects, a problem data object whose data quality does not meet the requirement; step 25, when the judgment result is that the problem data object exists, querying, for the problem data object, a target data object having a preset association relationship with the problem data object from the updated data relationship record; and step 26, performing data quality management for the problem data object and the target data object. Specific ways of performing the above steps are described below.

First, in step 21, a target SQL statement for performing a target behavioral operation on the database is obtained. It is understood that various behavior operations for the database may be implemented by executing SQL statements, which may include, but are not limited to, creating a data table, adding fields to an existing data table, and so on.

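In one example, the obtaining a target SQL statement for performing a target behavior operation on the database includes:

periodically collecting the historical SQL statements of the database, and taking the collected historical SQL statements as the target SQL statements.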

This example may be used for offline data quality management, that is, monitoring whether the data quality of data objects involved in historical behavioral operations against the database meet requirements.

In another example, the obtaining a target SQL statement for performing a target behavior operation on the database includes:

capturing an SQL statement submitted and run by a user as the target SQL statement.

This example may be used for online data quality management, that is, monitoring whether the data quality of the data objects involved in the current behavioral operation for the database meets requirements.

Then, in step 22, the target SQL statement is parsed to obtain a plurality of data objects related to the target behavior operation and a target association relationship between the plurality of data objects, where a single data object is a field or a data table. It is understood that the association relationship may include a relationship between fields, a relationship between fields and data tables, and a relationship between data tables and data tables.

SQL parsing is the cornerstone of constructing data lineage relationships. It mainly parses the lineage relationships described in SQL between fields and tables, between fields and fields, and between tables and tables. Generally, the relationships between fields may include copy (copy), truncation (substr), concatenation (concat), and the like; the relationship between tables is dependency (depended); and the relationship between a field and a table is belonging (belong). The parsed lineage relationships may be represented as triples (source_node, target_node, relation), where source_node is the identifier of the source node, target_node is the identifier of the target node, and relation is the relationship between the nodes. For example, consider the following SQL:

CREATE TABLE Table1 AS

SELECT substr(identify_no, 0, 6) AS identify_no_first, substr(identify_no, 6, 12) AS identify_no_rest

FROM Table2;

The lineage relationships obtained by SQL parsing include:

(Table2.identify_no, Table1.identify_no_first, substr), representing that the identify_no_first field (the first part of the identity number) in Table1 is in a truncation relationship with the identify_no field in Table2;

(Table2.identify_no, Table1.identify_no_rest, substr), representing that the identify_no_rest field (the remaining part of the identity number) in Table1 is in a truncation relationship with the identify_no field in Table2;

(Table2, Table1, depended), representing that Table1 depends on Table2;

(Table1.identify_no_first, Table1, belong), representing that the identify_no_first field belongs to Table1;

(Table1.identify_no_rest, Table1, belong), representing that the identify_no_rest field belongs to Table1;

(Table2.identify_no, Table2, belong), representing that the identify_no field belongs to Table2.

Mature third-party libraries are currently available for SQL parsing, so the underlying principle is not described here.
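To make the triple representation concrete, the following is a minimal Python sketch that extracts lineage triples from the example statement. It is not the parsing approach of the disclosure, which relies on a mature third-party parser: the regular expressions here only handle the one assumed pattern (CREATE TABLE ... AS SELECT func(column, ...) AS alias ... FROM table), and the names `parse_lineage`, `CTAS`, and `COLUMN` are hypothetical.

```python
import re
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (source_node, target_node, relation)

# Assumed statement shape: CREATE TABLE <target> AS SELECT <cols> FROM <source>;
CTAS = re.compile(
    r"create\s+table\s+(?P<target>\w+)\s+as\s+select\s+(?P<cols>.+?)"
    r"\s+from\s+(?P<source>\w+)\s*;?\s*$",
    re.IGNORECASE | re.DOTALL,
)
# Assumed column shape: func(column, ...) AS alias
COLUMN = re.compile(
    r"(?P<func>\w+)\(\s*(?P<src>\w+)[^)]*\)\s*as\s+(?P<alias>\w+)",
    re.IGNORECASE,
)


def parse_lineage(sql: str) -> List[Triple]:
    """Return lineage triples for one simple CREATE TABLE ... AS SELECT statement."""
    match = CTAS.match(sql.strip())
    if match is None:
        raise ValueError("statement shape not supported by this sketch")
    target, source = match["target"], match["source"]
    triples: List[Triple] = []
    seen_source_fields: Set[str] = set()
    for col in COLUMN.finditer(match["cols"]):
        src_field = f"{source}.{col['src']}"
        dst_field = f"{target}.{col['alias']}"
        # Field-to-field relation, named after the SQL function (e.g. substr).
        triples.append((src_field, dst_field, col["func"].lower()))
        # Field-to-table relation for the newly created field.
        triples.append((dst_field, target, "belong"))
        if src_field not in seen_source_fields:
            seen_source_fields.add(src_field)
            triples.append((src_field, source, "belong"))
    # Table-to-table relation.
    triples.append((source, target, "depended"))
    return triples


example_sql = """
CREATE TABLE Table1 AS
SELECT substr(identify_no, 0, 6) AS identify_no_first,
       substr(identify_no, 6, 12) AS identify_no_rest
FROM Table2;
"""
example_triples = parse_lineage(example_sql)
```

For the example statement this yields the same six triples listed above. Real SQL (joins, subqueries, expressions over several columns) needs a proper parser rather than regular expressions.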

Then, in step 23, a pre-established data relationship record is updated according to the plurality of data objects and the target association relationship, where the data relationship record includes at least a part of the data objects in the database and the existing association relationship therebetween. It is understood that, after the step 22 is executed, the data relationship record is updated, and as the number of the SQL statements is increased, the number of the data objects and the association relationships included in the data relationship record is increased.

In one example, the data relationship records are stored in the form of a graph, the graph including nodes and connecting edges, the nodes corresponding to the data objects, the connecting edges corresponding to the associative relationships.

In the embodiments of the present description, the graph may take the form of a graph database. A graph database is a non-relational database that uses graph theory to store relationship information between entities. Compared with a relational database, a graph database can be queried conveniently and quickly and can support various kinds of computation and reasoning.

It is understood that each node has a node identifier, which can uniquely identify a field or a data table. For example, a Globally Unique Identifier (GUID) is used as the node identifier, which is in the specific form of project_name.

It will be appreciated that, in one case, the update adds nodes and connecting edges to the graph, where each added node represents a new field belonging to a data table that already has a corresponding node in the graph.

It will be appreciated that, in another case, the update adds nodes and connecting edges to the graph, where the added nodes include both a node representing a new field and a node representing the new data table to which that field belongs.
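The two update cases can be sketched with a small in-memory graph. This is only an illustrative stand-in for a graph database: the class `LineageGraph`, its `update` method, and the table `Table3` with field `id_copy` are hypothetical names introduced for the example.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (source_node, target_node, relation)


class LineageGraph:
    """In-memory stand-in for the graph-form data relationship record."""

    def __init__(self) -> None:
        self.nodes: Set[str] = set()
        self.edges: Set[Triple] = set()

    def update(self, triples: Iterable[Triple]) -> Tuple[Set[str], Set[Triple]]:
        """Add only the nodes and edges not yet recorded; return what was added."""
        new_nodes: Set[str] = set()
        new_edges: Set[Triple] = set()
        for source, target, relation in triples:
            for node in (source, target):
                if node not in self.nodes:
                    self.nodes.add(node)
                    new_nodes.add(node)
            edge = (source, target, relation)
            if edge not in self.edges:
                self.edges.add(edge)
                new_edges.add(edge)
        return new_nodes, new_edges


# Pre-established record: Table2 and Table1 with one derived field.
graph = LineageGraph()
graph.update([
    ("Table2.identify_no", "Table2", "belong"),
    ("Table2.identify_no", "Table1.identify_no_first", "substr"),
    ("Table1.identify_no_first", "Table1", "belong"),
    ("Table2", "Table1", "depended"),
])

# Case 1: a new field of a table that already exists in the record.
case1_nodes, case1_edges = graph.update([
    ("Table2.identify_no", "Table1.identify_no_rest", "substr"),
    ("Table1.identify_no_rest", "Table1", "belong"),
])

# Case 2: a new field together with the new table it belongs to (hypothetical Table3).
case2_nodes, case2_edges = graph.update([
    ("Table2.identify_no", "Table3.id_copy", "copy"),
    ("Table3.id_copy", "Table3", "belong"),
])
```

In case 1 only the field node is added; in case 2 both the field node and the table node are added. Re-applying triples that are already recorded changes nothing, so the record can be updated repeatedly as more SQL statements are parsed.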

Next, in step 24, the data quality of the plurality of data objects is monitored, and it is judged whether there is, among the plurality of data objects, a problem data object whose data quality does not meet the requirement. It can be understood that different business scenarios usually have different requirements on data quality; the data quality monitoring approach is therefore flexible and can be determined according to those requirements.

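In one example, the monitoring the data quality of the plurality of data objects and judging whether there is a problem data object whose data quality does not meet the requirement among the plurality of data objects includes:

determining the evaluation score of any one of the plurality of data objects according to a preset quality evaluation rule, and determining a data object whose evaluation score falls within a preset interval as a problem data object.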

FIG. 3 shows a schematic diagram of a data quality monitoring method according to one embodiment. Referring to FIG. 3, data quality monitoring is realized through a series of manually configured evaluation rules. The data quality evaluation rules are kept consistent with the data quality evaluation indexes, so as to improve data quality and thereby the commercial value of the data. Data quality evaluation indexes may include, but are not limited to, data uniqueness, data consistency, data accuracy, data relevance, data integrity, and data timeliness. A data quality evaluation rule generally involves an evaluation object, evaluation indexes, weights, and expected values. The evaluation object can be any field or data table in the database, and "evaluation index" is short for "data quality evaluation index". When a plurality of evaluation indexes are involved, a corresponding weight and expected value can be set for each evaluation index; the evaluation score of the evaluation object is then determined comprehensively from the difference between the actual value and the expected value of each evaluation index together with the corresponding weight, and whether the evaluation object is a problem data object is determined from the evaluation score.
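One possible form of such a scoring rule is sketched below. The description does not fix a formula, so the weighted sum of absolute differences, the index names, the weights, and the preset interval are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class IndexRule:
    """One evaluation index with its weight and expected value (illustrative)."""
    name: str
    weight: float
    expected: float


def evaluation_score(rules: List[IndexRule], observed: Dict[str, float]) -> float:
    """Assumed scoring: weighted sum of |actual value - expected value| per index."""
    return sum(rule.weight * abs(observed[rule.name] - rule.expected) for rule in rules)


def is_problem(score: float, interval: Tuple[float, float]) -> bool:
    """An object is a problem data object when its score falls in the preset interval."""
    low, high = interval
    return low <= score <= high


rules = [
    IndexRule(name="integrity", weight=0.5, expected=1.0),
    IndexRule(name="timeliness", weight=0.5, expected=1.0),
]
observed = {"integrity": 0.5, "timeliness": 1.0}  # hypothetical measurements
PROBLEM_INTERVAL = (0.2, 1.0)  # hypothetical preset interval

score = evaluation_score(rules, observed)
problem = is_problem(score, PROBLEM_INTERVAL)
```

Here a larger score means a larger deviation from the expected values, so the problem interval sits at the high end; a rule set that scores quality directly would place the interval at the low end instead.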

In step 25, when the judgment result is that the problem data object exists, a target data object having a preset association relationship with the problem data object is queried, for the problem data object, from the updated data relationship record. It is understood that the preset association relationship may be set manually, and may include, but is not limited to, association relationships such as copying, truncation, concatenation, and conversion.

In one example, the data relationship records are stored in the form of a graph, the graph including nodes and connecting edges, the nodes corresponding to the data objects and the connecting edges corresponding to the association relationships. The querying, from the updated data relationship record, a target data object having a preset association relationship with the problem data object includes:

determining a graph query statement according to the preset association relationship;

and querying a corresponding node from the updated graph by using the graph query statement, wherein the node corresponds to the target data object.

In this example, the query operation is converted into a graph query statement, so as to search the graph database for the target data object having a preset association relationship with the problem data object. Commonly used graph query languages make it convenient and fast to perform various computations and reasoning over graphs.

In an example, the problem data object and the target data object are both fields, and the preset association relationship includes truncation, where the truncation extracts a substring of the character string corresponding to the problem data object, or extracts a substring of the character string corresponding to the target data object.

The query for the preset association relationship in this example may be converted into a graph query statement. For example, a query for truncation operations on sensitive fields is written in the Gremlin language as: g.V().has("issensive", 1).outE().has("depentatype", "substr"). This graph query statement traverses all sensitive data on the data lineage graph (nodes whose issensive attribute is 1) and checks whether the dependency relationship on the outgoing edges of those nodes is substr; if it is truncation, the nodes and the related edges are returned. The attributes of the returned nodes may then be parsed to obtain the executor, execution time, and so on of the SQL statement, which facilitates the subsequent work of auditors. Here, sensitive data refers to private data.
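The logic of that traversal can be mirrored in plain Python over an in-memory edge list, as a rough sketch of what the graph query computes. This is not Gremlin and does not use a graph database client; the property name `is_sensitive`, the function name, and the sample data are assumptions for illustration.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str, str]  # (source_node, target_node, depend_type)


def sensitive_truncations(
    node_props: Dict[str, Dict[str, int]], edges: List[Edge]
) -> List[Edge]:
    """Return outgoing edges of sensitive nodes whose dependency type is substr."""
    return [
        (source, target, depend_type)
        for source, target, depend_type in edges
        if node_props.get(source, {}).get("is_sensitive") == 1
        and depend_type == "substr"
    ]


# Assumed sample graph: the identity number field is marked as sensitive.
node_props = {
    "Table2.identify_no": {"is_sensitive": 1},
    "Table1.identify_no_first": {"is_sensitive": 0},
    "Table1.identify_no_rest": {"is_sensitive": 0},
}
edges = [
    ("Table2.identify_no", "Table1.identify_no_first", "substr"),
    ("Table2.identify_no", "Table1.identify_no_rest", "substr"),
    ("Table2.identify_no", "Table2", "belong"),
    ("Table1.identify_no_first", "Table1", "belong"),
]

hits = sensitive_truncations(node_props, edges)
```

Each returned edge identifies both the sensitive source field and the field produced by truncating it, which corresponds to the nodes and related edges returned by the graph query.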

Finally, in step 26, data quality management is performed for the problem data object and the target data object. It will be appreciated that data quality management is not only performed for problem data objects, but also for target data objects having a preset association relationship therewith.

In one example, the target data object includes:

In the embodiments of the present description, a uniform data quality management manner or a differentiated data quality management manner may be adopted for an upstream data object of a problem data object and a downstream data object of the problem data object.

In one example, the performing data quality management for the problem data object and the target data object includes:

In this example, no distinction is made between upstream and downstream target data objects: data quality monitoring is performed uniformly on every target data object, and an alarm is issued for the problem data object and for each target data object whose data quality is determined not to meet the requirement.

In one example, the target data object includes at least one downstream data object having a preset association relationship with the problem data object;

In this example, for the downstream data object having the preset association relation with the problem data object, data quality monitoring is not needed any more, and an alarm is directly issued.

In one example, the target data object includes a plurality of upstream data objects having a preset association relationship with the problem data object;

Fig. 4 shows a schematic diagram of an application scenario of data quality management according to an embodiment. Referring to fig. 4, data quality monitoring is performed on a current data node in the graph database. If it is determined that the current data node has no problem, data quality monitoring continues. If it is determined that the current data node has a problem, lineage diffusion is performed, that is, target nodes having a preset association relationship with the current data node are searched for in the graph database. The target nodes found may include both upstream data nodes and downstream data nodes of the current data node. Data quality management for the upstream data nodes mainly involves problem backtracking, and data quality management for the downstream data nodes mainly involves service alarms. A service alarm means that, when a data quality problem is found, an alarm is sent to the data manager or audit center responsible for the downstream data nodes. Problem backtracking means that, when a data quality problem is found, the upstream of the data node is traced to find the root of the problem. Data quality management also includes data governance, which means cleaning and sorting out, across the whole flow, the data quality problems that have been found.
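The upstream and downstream search described above can be sketched as a breadth-first traversal over a directed edge list. This is only an illustration under stated assumptions: the edge list format `(upstream, downstream)`, the function names, and the table names are hypothetical, and a real deployment would issue graph queries against the graph database instead.

```python
from collections import deque


def reachable(start, adjacency):
    """Breadth-first search returning every node reachable from start, excluding start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen and nxt != start:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def lineage_diffusion(problem_node, edges):
    """Split the nodes related to problem_node into backtracking and alarm targets."""
    downstream_adj, upstream_adj = {}, {}
    for src, dst in edges:
        downstream_adj.setdefault(src, []).append(dst)
        upstream_adj.setdefault(dst, []).append(src)
    return {
        "backtrack": reachable(problem_node, upstream_adj),   # upstream nodes
        "alarm": reachable(problem_node, downstream_adj),     # downstream nodes
    }


# Hypothetical lineage: t_raw -> t_clean -> {t_report, t_export}
edges = [("t_raw", "t_clean"), ("t_clean", "t_report"), ("t_clean", "t_export")]
plan = lineage_diffusion("t_clean", edges)
```

Here the upstream set is handed to problem backtracking and the downstream set to service alarms, mirroring the two roles described for fig. 4.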

For example, problem backtracking looks for the upstream node at which the problem originates. Suppose that an excessively low update frequency indicates a data quality problem, and that the update frequency of node 4 is found to be too low. The update frequency of its upstream node 3 is also low, so backtracking continues further upstream; the update frequency of node 2, which is upstream of node 3, is also low, but the update frequency of node 1, which is upstream of node 2, is normal. The data update frequency of node 2 therefore needs to be corrected first, and nodes 3 and 4 are then treated in turn according to the lineage relationship.
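The node 1 to node 4 example above can be traced with a short sketch. It assumes a simple chain in which each node has at most one problematic upstream node to follow; the function name, the `upstream_of` mapping, and the `update_ok` flags are hypothetical illustration data, not part of the described method.

```python
def backtrack_root(problem_node, upstream_of, has_problem):
    """Follow problematic upstream nodes until none remain; return the repair order."""
    path = [problem_node]
    visited = {problem_node}
    current = problem_node
    while True:
        bad_parents = [
            p for p in upstream_of.get(current, ())
            if has_problem(p) and p not in visited
        ]
        if not bad_parents:
            break
        current = bad_parents[0]   # sketch: follow the first problematic parent only
        visited.add(current)
        path.append(current)
    # The last node reached is the root; repair it first, then move downstream.
    return list(reversed(path))


upstream_of = {4: [3], 3: [2], 2: [1]}                # node -> its upstream nodes
update_ok = {1: True, 2: False, 3: False, 4: False}   # False: update frequency too low
order = backtrack_root(4, upstream_of, lambda n: not update_ok[n])
```

With this data the repair order is node 2, then node 3, then node 4, which matches the treatment order described in the example.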

For example, in data governance, if a large amount of data is missing, the relevant data can be requested again from the data source, or a new data source can be added, to improve data completeness; if the update frequencies of upstream and downstream data nodes are inconsistent, the code that generates the downstream node can be checked so that the update frequencies are made consistent.

According to the method provided by the embodiments of this specification, a target SQL statement for performing a target behavior operation on the database is first obtained. The target SQL statement is then parsed to obtain a plurality of data objects related to the target behavior operation and a target association relationship among the plurality of data objects, where a single data object is a field or a data table. A pre-established data relationship record is then updated according to the plurality of data objects and the target association relationship, where the data relationship record includes at least part of the data objects in the database and the existing association relationships among them. Next, data quality monitoring is performed on the plurality of data objects to determine whether any of them is a problem data object whose data quality does not meet the requirement. When the determination result is that a problem data object exists, a target data object having a preset association relationship with the problem data object is queried from the updated data relationship record. Finally, data quality management is performed for the problem data object and the target data object. As can be seen from the above, in the embodiments of this specification, the plurality of data objects related to the target behavior operation and the target association relationship among them are obtained automatically by parsing the SQL statement, the pre-established data relationship record is then updated, and the target data object having a preset association relationship with the problem data object is queried from the updated data relationship record, so that data quality management is performed on both the problem data object and the target data object. The process is highly automated, reduces labor cost, and can improve data management efficiency.

FIG. 5 illustrates a system architecture diagram for data quality management of a database, according to one embodiment. Referring to fig. 5, the system architecture may be divided into a data layer, a lineage layer, a functional layer, and an application layer. The data layer collects logs, metadata, and SQL statements. Based on the SQL parser, the data lineage (blood relationship) may be periodically generated and incrementally synchronized into a graph database; the resulting graph may be referred to as a data lineage graph, and the lineage includes relationships between fields, between fields and tables, and between tables. Nodes in the lineage layer are tables and columns, and edges are the relationships among the nodes. The graph computation engine performs the data quality query, computation, and monitoring, as defined by the upper layers, over the upstream and downstream nodes of the data. The functional layer contains manually defined data quality monitoring methods, for example monitoring according to the number of null values, the data update frequency, and the like; when a quality problem is found at the current data node, the upstream and downstream nodes of that node are tracked through a graph query statement, so that data quality is monitored across the whole process. The offline scheduled governance in the functional layer is mainly used for finding data quality problems from SQL statements that have already been executed. The online real-time governance is mainly used for discovering, by means of an ETL Hook, SQL statements with potential data quality risks among the SQL statements that are about to be executed. Three available application scenarios for data quality management are defined in the application layer: the aforementioned service alarm, problem backtracking, and data governance.

FIG. 6 illustrates an offline data quality management flow diagram according to one embodiment. Referring to fig. 6, the system may use a timer to collect the historical SQL statements executed by database operators, then use an SQL parsing module to parse the collected SQL statements and organize the results into nodes and edges of lineage relationships, and then insert the nodes and edges into the previously constructed data lineage graph. Quality verification is performed on the data in the database using preset data quality monitoring rules. When a data quality problem is found, the graph computation engine is used to track upstream and downstream on the updated data lineage graph. At the same time, the data quality problem is fed back to a data technician for data governance, or cleaning and sorting are performed directly according to preset data governance rules.
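The parse-and-insert portion of the offline flow can be sketched as follows. This is a deliberately simplified illustration: it uses regular expressions instead of a real SQL parsing module, handles only simple `INSERT ... SELECT` statements at table level, and all function names and table names are hypothetical.

```python
import re

# Simplified patterns; a production system would use a full SQL parser.
INSERT_RE = re.compile(r"insert\s+(?:into|overwrite\s+table)\s+([\w.]+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def table_lineage(sql):
    """Return (source_table, target_table) edges for a simple INSERT ... SELECT."""
    target = INSERT_RE.search(sql)
    if target is None:
        return []          # statements that write nothing add no lineage edges
    return [(src, target.group(1)) for src in SOURCE_RE.findall(sql)]


def merge_into_graph(graph_edges, statements):
    """Insert the edges parsed from each historical statement into the edge set."""
    for sql in statements:
        graph_edges.update(table_lineage(sql))
    return graph_edges


# Hypothetical batch of historical statements collected by the timer.
history = [
    "INSERT INTO dw.orders_daily SELECT o.id, u.name "
    "FROM ods.orders o JOIN ods.users u ON o.uid = u.id",
    "SELECT * FROM dw.orders_daily",
]
graph_edges = merge_into_graph(set(), history)
```

Using a set for the edges makes repeated runs idempotent, which suits the incremental insertion into an existing lineage graph described above.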

FIG. 7 illustrates an online data quality management flow diagram according to one embodiment. Referring to fig. 7, the system first uses an ETL Hook to capture an SQL statement submitted for execution by a user, then parses the SQL statement using the SQL parsing module, and then inserts the parsed lineage relationships into the data lineage graph. At the same time, data quality monitoring is performed on the newly updated data content. If a data quality problem is found, the graph computation engine is used to track upstream and downstream on the updated data lineage graph, the data quality problem is fed back to a data technician for data governance, and the SQL statement is then executed. If no data quality problem is found, the SQL statement is executed directly.
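The control flow of the online case can be sketched as a single handler that a hook would call for each submitted statement. Everything here is a hypothetical stand-in: the handler name and the injected callables (`parse`, `check_quality`, `trace`, `notify`, `execute`) are assumptions for illustration and do not correspond to any specific ETL Hook API.

```python
def handle_submitted_sql(sql, parse, lineage, check_quality, trace, notify, execute):
    """Parse a captured statement, update lineage, report problems, then execute."""
    edges = parse(sql)                           # lineage edges found in the statement
    lineage.update(edges)                        # insert them into the lineage graph
    touched = {node for edge in edges for node in edge}
    problems = sorted(n for n in touched if not check_quality(n))
    for node in problems:
        notify(node, trace(node, lineage))       # feed back the problem and related nodes
    return execute(sql), problems                # the statement runs in both branches


# Hypothetical stand-ins for the real modules.
log = []
lineage = set()
result, problems = handle_submitted_sql(
    "INSERT INTO b SELECT * FROM a",
    parse=lambda sql: [("a", "b")],
    lineage=lineage,
    check_quality=lambda node: node != "b",      # pretend table b fails its checks
    trace=lambda node, g: sorted(s for s, d in g if d == node),
    notify=lambda node, related: log.append((node, related)),
    execute=lambda sql: "executed",
)
```

Passing the modules in as callables keeps the sketch independent of any particular parser, graph store, or execution engine.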

According to an embodiment of another aspect, there is also provided a data quality management apparatus for a database, the database including a plurality of data tables, each data table including a plurality of fields, the apparatus being configured to perform the method provided by the embodiments of the present specification. Fig. 8 shows a schematic block diagram of a data quality management apparatus for a database according to one embodiment. As shown in fig. 8, the apparatus 800 includes:

an obtaining unit 81, configured to obtain a target Structured Query Language (SQL) statement for performing a target behavior operation on the database;

a parsing unit 82, configured to parse the target SQL statement obtained by the obtaining unit 81 to obtain a plurality of data objects related to the target behavior operation and a target association relationship among the plurality of data objects, where a single data object is a field or a data table;

an updating unit 83, configured to update a pre-established data relationship record according to the plurality of data objects and the target association relationship obtained by the parsing unit 82, where the data relationship record includes at least part of the data objects in the database and the existing association relationships among them;

a monitoring unit 84, configured to perform data quality monitoring on the plurality of data objects obtained by the parsing unit 82, and determine whether a problem data object whose data quality does not meet the requirement exists among the plurality of data objects;

a query unit 85, configured to, when the determination result of the monitoring unit 84 is that the problem data object exists, query, for the problem data object, a target data object having a preset association relationship with the problem data object from the data relationship record updated by the updating unit 83;

a management unit 86, configured to perform data quality management on the problem data object obtained by the monitoring unit 84 and the target data object obtained by the querying unit 85.

Optionally, as an embodiment, the monitoring unit 84 is specifically configured to determine an evaluation score of any one of the plurality of data objects according to a preset quality evaluation rule, and determine a data object with the evaluation score in a preset interval as a problem data object.
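The score-and-interval check described above can be sketched as follows. The quality evaluation rule used here (fraction of non-null values) and the interval [0.0, 0.8) are hypothetical choices made only for illustration; the specification leaves the rule and the interval as presets.

```python
def evaluation_score(values):
    """Hypothetical rule: fraction of non-null values in a field (1.0 is best)."""
    if not values:
        return 0.0
    return sum(v is not None for v in values) / len(values)


def find_problem_objects(fields, low=0.0, high=0.8):
    """Flag data objects whose score falls in the preset interval [low, high)."""
    return sorted(
        name for name, values in fields.items()
        if low <= evaluation_score(values) < high
    )


# Hypothetical field samples.
fields = {
    "orders.amount": [10, 20, None, 40],        # score 0.75
    "orders.user_id": [1, 2, 3, 4],             # score 1.0
    "orders.coupon": [None, None, None, "A"],   # score 0.25
}
problems = find_problem_objects(fields)
```

With these samples, `orders.amount` and `orders.coupon` fall inside the interval and are treated as problem data objects, while `orders.user_id` is not.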

Optionally, as an embodiment, the target data object includes:

Optionally, as an embodiment, the management unit 86 includes:

the monitoring subunit is used for monitoring the data quality of the target data object and judging whether the data quality of the target data object does not meet the requirement;

and the alarm subunit is used for sending an alarm aiming at the problem data object and the target data object of which the judgment result of the monitoring subunit is that the data quality does not meet the requirement.

the management unit 86 is specifically configured to issue an alarm for the problem data object and the at least one downstream data object.

the management unit 86 is specifically configured to find an upstream data object that initially has a problem among the plurality of upstream data objects, and use the upstream data object as a root of data quality management.

Optionally, as an embodiment, the management unit 86 is specifically configured to feed back the problem data object and the target data object to a data technician for data governance, or perform cleaning and sorting according to preset data governance rules.

With the apparatus provided in this specification, first, the obtaining unit 81 obtains a target SQL statement for performing a target behavior operation on the database; then, the parsing unit 82 parses the target SQL statement to obtain a plurality of data objects related to the target behavior operation and a target association relationship among the plurality of data objects, where a single data object is a field or a data table; then, the updating unit 83 updates a pre-established data relationship record according to the plurality of data objects and the target association relationship, where the data relationship record includes at least part of the data objects in the database and the existing association relationships among them; then, the monitoring unit 84 performs data quality monitoring on the plurality of data objects, and determines whether a problem data object whose data quality does not meet the requirement exists among the plurality of data objects; when the determination result is that the problem data object exists, the query unit 85 queries, for the problem data object, a target data object having a preset association relationship with the problem data object from the updated data relationship record; finally, the management unit 86 performs data quality management for the problem data object and the target data object.
As can be seen from the above, in the embodiments of this specification, the plurality of data objects related to the target behavior operation and the target association relationship among them are obtained automatically by parsing the SQL statement, the pre-established data relationship record is then updated, and the target data object having a preset association relationship with the problem data object is queried from the updated data relationship record, so that data quality management is performed on both the problem data object and the target data object. The process is highly automated, reduces labor cost, and can improve data management efficiency.

According to an embodiment of another aspect, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in connection with fig. 2.

According to an embodiment of yet another aspect, there is also provided a computing device comprising a memory having stored therein executable code, and a processor that, when executing the executable code, implements the method described in connection with fig. 2.

Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.

The above embodiments describe the objects, technical solutions, and advantages of the present invention in further detail. It should be understood that the above are only exemplary embodiments of the present invention and are not intended to limit its scope; any modifications, equivalent substitutions, improvements, and the like made on the basis of the technical solutions of the present invention shall fall within the scope of the present invention.

Claims

1. A data quality management method for a database, the database comprising a plurality of data tables, each data table comprising a plurality of fields, the method comprising:

acquiring a target Structured Query Language (SQL) statement for performing a target behavior operation on the database;

2. The method of claim 1, wherein said obtaining a target SQL statement for performing a target behavioral operation on the database comprises:

3. The method of claim 1, wherein said obtaining a target SQL statement for performing a target behavioral operation on the database comprises:

4. The method of claim 1, wherein said monitoring data quality of said plurality of data objects and determining if there is a problem data object of said plurality of data objects that does not meet said data quality requirement comprises:

5. The method of claim 1, wherein the data relationship records are stored in the form of a graph, the graph including nodes and connecting edges, the nodes corresponding to the data objects and the connecting edges corresponding to associative relationships.

6. The method of claim 5, wherein said updating a pre-established data relationship record based on said plurality of data objects and said target association comprises:

7. The method of claim 5, wherein said updating a pre-established data relationship record based on said plurality of data objects and said target association comprises:

8. The method of claim 5, wherein the querying, from the updated data relationship record, for a target data object having a preset association relationship with the problem data object comprises:

determining a graph query statement according to the preset association relationship;

9. The method of claim 1, wherein the target data object comprises:

10. The method according to claim 1, wherein the problem data object and the target data object are fields, and the preset association relationship includes truncation, wherein the truncation is to extract a substring of a character string corresponding to the problem data object, or the truncation is to extract a substring of a character string corresponding to the target data object.

11. The method of claim 1, wherein the performing data quality management for the problem data object and the target data object comprises:

12. The method of claim 9, wherein the target data object comprises at least one downstream data object having a preset association relationship with the problem data object;

13. The method of claim 9, wherein the target data object comprises a plurality of upstream data objects having a preset association relationship with the problem data object;

14. The method of claim 1, wherein the performing data quality management for the problem data object and the target data object comprises:

15. A data quality management apparatus for a database, the database comprising a plurality of data tables, each data table comprising a plurality of fields, the apparatus comprising:

16. The apparatus according to claim 15, wherein the monitoring unit is specifically configured to determine an evaluation score of any one of the plurality of data objects according to a preset quality evaluation rule, and determine a data object with the evaluation score in a preset interval as a problem data object.

17. The apparatus of claim 15, wherein the target data object comprises:

18. The apparatus of claim 15, wherein the management unit comprises:

19. The apparatus of claim 17, wherein the target data object comprises at least one downstream data object having a preset association relationship with the problem data object;

the management unit is specifically configured to issue an alert for the problem data object and the at least one downstream data object.

20. The apparatus of claim 17, wherein the target data object comprises a plurality of upstream data objects having a preset association relationship with the problem data object;

the management unit is specifically configured to find an upstream data object that initially has a problem among the plurality of upstream data objects, and use the upstream data object as a root of data quality management.

21. The apparatus according to claim 15, wherein the management unit is specifically configured to feed back the problem data object and the target data object to a data technician for data governance, or perform cleaning and sorting according to preset data governance rules.

22. A computer-readable storage medium, on which a computer program is stored which, when executed in a computer, causes the computer to carry out the method of any one of claims 1-14.

23. A computing device comprising a memory having executable code stored therein and a processor that, when executing the executable code, implements the method of any of claims 1-14.