CN116126843A - Data quality evaluation method and device, electronic equipment and storage medium - Google Patents

Data quality evaluation method and device, electronic equipment and storage medium

Info

Publication number
CN116126843A
Authority
CN
China
Prior art keywords
data
quality evaluation
evaluated
data quality
rule
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211700716.3A
Other languages
Chinese (zh)
Inventor
李旭阳
宋亚红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dt Dream Technology Co Ltd
Original Assignee
Hangzhou Dt Dream Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dt Dream Technology Co Ltd
Priority to CN202211700716.3A
Publication of CN116126843A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating
    • G06F16/2365 Ensuring data consistency and integrity
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/258 Data format conversion from or to a database
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Automatic Analysis And Handling Materials Therefor (AREA)

Abstract

The application provides a data quality assessment method, a data quality assessment device, electronic equipment and a storage medium. The method is applied to a data warehouse and comprises the following steps: acquiring a data blood relationship corresponding to the data to be evaluated in a target data table, wherein the data blood relationship indicates an inheritance relationship of the data; determining, based on the data blood relationship, a data quality evaluation rule configured in an upstream data table from which the data to be evaluated is inherited; and carrying out data quality evaluation on the data to be evaluated based on the data quality evaluation rule.

Description

Data quality evaluation method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of big data technologies, and in particular, to a data quality evaluation method, a device, an electronic apparatus, and a storage medium.
Background
In big data processing, data may be managed through a data warehouse. In a data warehouse, data is typically stored in multiple levels of data tables. To ensure the stability and accuracy of the data in the data warehouse, it is necessary to evaluate the quality of the data in each data table. Because the data in a data warehouse is typically large in volume and closely interrelated, this data quality assessment process is often complex and inefficient.
Disclosure of Invention
In view of this, the present specification provides the following methods, apparatuses, electronic devices, and storage media.
In a first aspect of the present application, there is provided a data quality assessment method, the method being applied to a data warehouse; the method comprises the following steps:
acquiring a data blood relationship corresponding to the data to be evaluated in the target data table; wherein the data blood relationship indicates an inheritance relationship of the data;
determining a data quality evaluation rule configured in an upstream data table inherited by the data to be evaluated based on the data blood relationship;
and carrying out data quality evaluation on the data to be evaluated based on the data quality evaluation rule.
In a second aspect of the present application, there is provided a data quality assessment apparatus, the apparatus being for application to a data warehouse; the device comprises:
an acquisition unit, configured to acquire a data blood relationship corresponding to the data to be evaluated in the target data table; wherein the data blood relationship indicates an inheritance relationship of the data;
a determining unit, configured to determine, based on the data blood relationship, a data quality evaluation rule configured in an upstream data table inherited by the data to be evaluated;
and an evaluation unit, configured to carry out data quality evaluation on the data to be evaluated based on the data quality evaluation rule.
In a third aspect of the present application, there is provided an electronic device, including a communication interface, a processor, a memory, and a bus, where the communication interface, the processor, and the memory are connected to each other by the bus;
the memory stores machine readable instructions that, when invoked by the processor, perform the method of:
acquiring a data blood relationship corresponding to the data to be evaluated in the target data table; wherein the data blood relationship indicates an inheritance relationship of the data;
determining a data quality evaluation rule configured in an upstream data table inherited by the data to be evaluated based on the data blood relationship;
and carrying out data quality evaluation on the data to be evaluated based on the data quality evaluation rule.
In a fourth aspect of the present application, there is provided a machine-readable storage medium storing machine-readable instructions that, when invoked and executed by a processor, implement the method of:
acquiring a data blood relationship corresponding to the data to be evaluated in the target data table; wherein the data blood relationship indicates an inheritance relationship of the data;
determining a data quality evaluation rule configured in an upstream data table inherited by the data to be evaluated based on the data blood relationship;
and carrying out data quality evaluation on the data to be evaluated based on the data quality evaluation rule.
The above embodiments of the present specification have at least the following advantageous effects:
In the above embodiment, the inheritance relationship of the data to be evaluated in the target data table is determined from its data blood relationship, the quality evaluation rule configured in the upstream data table from which the data is inherited is inherited along with the data, and the data to be evaluated is then evaluated against that rule. In this way, quality evaluation rules can be inherited along the data blood relationships between the data tables at every level of the data warehouse, and data quality is evaluated over the whole life cycle of the data. This ensures that high-quality data circulates effectively and improves the stability and accuracy of upper-layer data applications and data services.
Drawings
FIG. 1 is a flow chart illustrating a method of data quality assessment in accordance with an exemplary embodiment;
FIG. 2 is a schematic diagram of a data blood relationship shown in an exemplary embodiment;
FIG. 3 is a schematic diagram of a data quality assessment rule configuration, as shown in an exemplary embodiment;
FIG. 4 is a diagram illustrating inheritance of a data quality assessment rule in accordance with an exemplary embodiment;
FIG. 5 is a schematic diagram illustrating an improvement quality determination for data quality assessment in accordance with an exemplary embodiment;
FIG. 6 is a hardware configuration diagram of an electronic device in which a data quality assessment apparatus is located, according to an exemplary embodiment;
FIG. 7 is a block diagram of a data quality assessment apparatus according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as detailed in the accompanying claims.
It should be noted that: in other embodiments, the steps of the corresponding method are not necessarily performed in the order shown and described in this specification. In some other embodiments, the method may include more or fewer steps than described in this specification. Furthermore, individual steps described in this specification, in other embodiments, may be described as being split into multiple steps; while various steps described in this specification may be combined into a single step in other embodiments.
To help those skilled in the art better understand the technical solutions in the embodiments of the present specification, the related art concerning data quality evaluation is briefly described below.
The data table is the carrier for storing data in a data warehouse and is the minimum unit of data management. During data warehouse construction, data is processed reliably and effectively through multi-level data governance, such as an ODS original library, an STD standard library, a DW subject library, and a DM thematic library, and secure data services are provided externally.
The data blood relationship (also known as data lineage) records information such as the source and the processing of field-level data in a data table. For any field, the data blood relationship makes clear which field of which table it originates from and what processing logic was applied.
Data quality generally refers to the degree to which data meets the purposes of its consumers and the specific requirements of the business scenario in the business environment.
Data quality assessment generally refers to the process of detecting corrupted or inaccurate records in a collection of records, a database table, or a database, and may include identifying incomplete, incorrect, inaccurate, or irrelevant portions of the data. Its purpose is to improve data quality in preparation for later business demands. It is typically a step that must be taken after data collection to ensure that subsequent analysis of the data is accurate and reliable.
Data quality improvement generally refers to the process of correcting the corrupted or inaccurate records detected by the data quality assessment. Invalid data, such as dirty data or coarse data, may be replaced, modified, or deleted.
In the field of big data, large amounts of data often need to be processed, and this data can be managed by a data warehouse. In the data warehouse, data tables serve as the storage carrier; data is transformed reliably and effectively through multi-level data tables, such as those of an ODS original library, an STD standard library, a DW subject library, and a DM thematic library, and secure data services are provided externally.
To ensure the stability and accuracy of the data in the data warehouse, it is necessary to evaluate the quality of the data in each data table in the data warehouse. Because the amount of data in a data warehouse is typically large and interrelated, the above-described data quality assessment process is often complex and inefficient.
In the related art, data quality evaluation rules are generally configured only for key data tables in a data warehouse, so evaluation is only local. With this approach, when problem data is found, the problem must be traced back along the data blood relationship of the problem data, which is time-consuming and labor-intensive and allows data pollution to spread.
In view of this, the present specification proposes a technical solution in which the data quality evaluation rules of data are inherited from an upstream table through the data blood relationship of the data.
In implementation, the data blood relationship corresponding to the data to be evaluated in the target data table can be acquired first, wherein the data blood relationship indicates an inheritance relationship of the data to be evaluated; based on the data blood relationship, a data quality evaluation rule configured in an upstream data table inherited by the data to be evaluated is determined; and data quality evaluation is carried out on the data to be evaluated based on the data quality evaluation rule.
It can be seen that, in the technical solution of the present specification, the inheritance relationship of the data to be evaluated in the target data table is determined from its data blood relationship, the quality evaluation rule configured in the upstream data table from which the data is inherited is inherited along with the data, and the data to be evaluated is then evaluated against that rule. In this way, quality evaluation rules can be inherited along the data blood relationships between the data tables at every level of the data warehouse, and data quality is evaluated over the whole life cycle of the data. This ensures that high-quality data circulates effectively and improves the stability and accuracy of upper-layer data applications and data services.
The following describes the present application through specific embodiments and in connection with specific application scenarios.
Referring to FIG. 1, FIG. 1 is a flowchart illustrating a data quality assessment method according to an exemplary embodiment. The method is applied to a data warehouse and may include the following steps:
Step 102: acquiring a data blood relationship corresponding to the data to be evaluated in the target data table; wherein the data blood relationship indicates an inheritance relationship of the data.
In a data warehouse, data may typically be stored in the form of data tables in a multi-level data table structure. The data of the data table of the later hierarchy is typically inherited and processed from the data in the data table of the earlier hierarchy.
Referring to FIG. 2, FIG. 2 is a schematic diagram of a data blood relationship shown in an exemplary embodiment.
In one common data warehouse structure, the data in the data warehouse may be divided into layers such as an original library ODS, a standard library STD, a subject library DW, and a thematic library DM.
The data in the data tables of the standard library STD can be inherited from the data tables of the original library ODS, the data in the data tables of the subject library DW can be inherited from the data tables of the standard library STD, the data in the data tables of the thematic library DM can be inherited from the data tables of the subject library DW, and so on.
Typically, when data enters a data warehouse, it first enters the source-aligned layer, that is, a data table of the original library.
As shown in FIG. 2, the original library ODS contains an ODS data table. After data processing and governance, field 1 of the ODS data table is inherited by field 1 of STD data table 1 in the standard library STD, and fields 2 and 4 of the ODS data table are inherited by fields 2 and 4 of STD data table 2 in the standard library STD. After further data processing and governance, field 1 of STD data table 1 is inherited by field 1 of the DW data table in the subject library DW, and field 2 of STD data table 2 is inherited by field 2 of the DW data table. Fields 1 and 2 of the DW data table are in turn processed and inherited by fields 1 and 2 of the DM data table in the thematic library DM.
The arrows in FIG. 2 represent the inheritance relationships of the data, that is, the data blood relationships. The absence of an arrow for a piece of data in FIG. 2 does not mean that the data has no data blood relationship.
The data blood relationship of the data of each field may be stored in the corresponding data table in the data warehouse. The data blood relationship stored in a data table should at least describe the source of the data, that is, the identifier of the upstream table and the upstream field.
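As an illustration of what such a field-level record might hold, the following is a minimal Python sketch of part of the relationships in FIG. 2. The class names, table names, and field names (`FieldRef`, `LineageEdge`, `ods_table`, and so on) are assumptions made for this example and do not come from the specification; the optional `transform` attribute stands in for the recorded processing logic and is left empty here for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class FieldRef:
    """Identifies one field of one data table."""
    table: str
    field: str


@dataclass(frozen=True)
class LineageEdge:
    """One data blood relationship: `target` inherits from `source`."""
    target: FieldRef
    source: FieldRef
    transform: Optional[str] = None  # processing logic, e.g. an SQL expression


# Hypothetical encoding of the arrows in FIG. 2 (names are illustrative only).
LINEAGE = [
    LineageEdge(FieldRef("std_table_1", "field_1"), FieldRef("ods_table", "field_1")),
    LineageEdge(FieldRef("std_table_2", "field_2"), FieldRef("ods_table", "field_2")),
    LineageEdge(FieldRef("std_table_2", "field_4"), FieldRef("ods_table", "field_4")),
    LineageEdge(FieldRef("dw_table", "field_1"), FieldRef("std_table_1", "field_1")),
    LineageEdge(FieldRef("dw_table", "field_2"), FieldRef("std_table_2", "field_2")),
    LineageEdge(FieldRef("dm_table", "field_1"), FieldRef("dw_table", "field_1")),
    LineageEdge(FieldRef("dm_table", "field_2"), FieldRef("dw_table", "field_2")),
]


def upstream_of(ref: FieldRef, lineage: List[LineageEdge] = LINEAGE) -> Optional[FieldRef]:
    """Return the field that `ref` directly inherits from, or None."""
    for edge in lineage:
        if edge.target == ref:
            return edge.source
    return None


def trace_to_origin(ref: FieldRef, lineage: List[LineageEdge] = LINEAGE) -> List[FieldRef]:
    """Follow the lineage upstream until a field with no recorded source."""
    path = [ref]
    while (src := upstream_of(path[-1], lineage)) is not None:
        path.append(src)
    return path
```

Calling `trace_to_origin(FieldRef("dm_table", "field_1"))` walks from the DM data table through the DW and STD tables back to the ODS data table.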
When data enters a data table of a data warehouse, data quality evaluation rules for the data fields can be configured according to the data fields of the data table.
The data quality evaluation rule describes a criterion for judging whether data quality is good or bad. Generally, the evaluation result should be binary, for example pass or fail, valid or invalid, problem-free or problematic.
The data quality evaluation rule may be a technical rule applied to the data itself or a business rule applied to a relationship between business attributes. For example, the data quality assessment rules may include one or more of null verification, identification number verification, date format verification, vector surface rationality verification, and the like.
The same or different data quality assessment rules may be employed for different data fields, and one or more data quality assessment rules may be employed for the same data field.
The related technicians can flexibly configure the data quality evaluation rules according to the characteristics and the use scene of the data, and the specific configuration mode of the data quality evaluation rules is not specifically limited in the specification.
Referring to fig. 3, fig. 3 is a schematic diagram illustrating a data quality evaluation rule configuration according to an exemplary embodiment.
The original library ODS contains an ODS data table with fields 1, 2, 3, and 4. When the data quality evaluation rules are configured, a uniqueness rule can be configured for field 1, meaning that no value may be repeated among the data under the field; a null value check rule can be configured for fields 1 and 2, meaning that no value under the field may be null; an identity card verification rule can be configured for field 3, meaning that every value under the field must meet the format requirement of identity card data; and a date check rule can be configured for field 4, meaning that every value under the field must meet the format requirement of date data.
When data quality rules need to be configured for the data to be evaluated in a target data table that is not in the source-aligned layer (the original library ODS), the data blood relationship corresponding to the data to be evaluated in the target data table can be acquired first. The data blood relationship indicates the inheritance relationship of the data.
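The configuration described for FIG. 3 can be sketched in Python as a mapping from fields to check functions, each returning a binary result. This is only a sketch under stated assumptions: the function names, the `%Y-%m-%d` date format, and the 18-character identity card pattern (format only, no checksum) are choices made for the example, and the sample values are invented.

```python
import re
from datetime import datetime


def not_null(values):
    """Null value check: no value may be None or an empty string."""
    return all(v is not None and v != "" for v in values)


def unique(values):
    """Uniqueness check: no value may be repeated."""
    return len(values) == len(set(values))


def id_card_format(values):
    """Assumed format check: 17 digits followed by a digit or X."""
    pattern = re.compile(r"^\d{17}[\dXx]$")
    return all(isinstance(v, str) and pattern.fullmatch(v) is not None for v in values)


def date_format(values, fmt="%Y-%m-%d"):
    """Date check: every value must parse with the assumed format."""
    for v in values:
        try:
            datetime.strptime(v, fmt)
        except (TypeError, ValueError):
            return False
    return True


# Rule configuration for the ODS data table, following the FIG. 3 description.
ODS_RULES = {
    "field_1": [unique, not_null],
    "field_2": [not_null],
    "field_3": [id_card_format],
    "field_4": [date_format],
}


def evaluate(table, rules):
    """Apply every configured rule to its field; each result is pass/fail."""
    return {
        field: {check.__name__: check(table.get(field, [])) for check in checks}
        for field, checks in rules.items()
    }


# Invented sample data: field_2 contains a null, field_4 an invalid month.
ods_table = {
    "field_1": ["a", "b", "c"],
    "field_2": ["x", None],
    "field_3": ["11010519491231002X"],
    "field_4": ["2022-12-28", "2022-13-01"],
}
report = evaluate(ods_table, ODS_RULES)
```

In this sample, `report` marks field 1 and field 3 as passing and flags the null in field 2 and the invalid date in field 4.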
Step 104: based on the data blood relationship, a data quality evaluation rule configured in an upstream data table inherited by the data to be evaluated is determined.
After the data blood relationship of the data to be evaluated is obtained, the source of the data to be evaluated, that is, the field in the upstream data table from which it is inherited, can be determined, and the data quality evaluation rule configured for that field in the upstream table can be obtained directly.
Since the data to be evaluated is usually inherited directly from the upstream table, the data quality evaluation rule configured for the corresponding data in the upstream table can be inherited and configured as the data quality evaluation rule of the data to be evaluated.
Referring to fig. 4, fig. 4 is a schematic diagram illustrating inheritance of a data quality assessment rule according to an exemplary embodiment.
FIG. 4 shows the inheritance of data quality evaluation rules that results when the rules configured for the ODS data table of the original library ODS in the embodiment of FIG. 3 are inherited along the data blood relationships of the data warehouse in the embodiment of FIG. 2.
It can be seen that the data quality evaluation rule configured for each field of the ODS data table can be passed down layer by layer, according to the data blood relationship, to STD data table 1 and STD data table 2 in the standard library STD, to the DW data table in the subject library DW, and to the DM data table in the thematic library DM.
In other words, data quality evaluation rules can be inherited and passed between the data tables at each level of the data warehouse through the layer-by-layer data blood relationships of those tables.
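The layer-by-layer inheritance can be sketched as a walk upstream along the blood relationship until a field with configured rules is reached. The dictionary layout, the table and field names, and the function `resolve_rules` are assumptions made for this example; rules are represented only by name.

```python
# (table, field) -> (upstream table, upstream field); hypothetical names
# loosely following the arrows of FIG. 2.
LINEAGE = {
    ("std_table_1", "field_1"): ("ods_table", "field_1"),
    ("std_table_2", "field_2"): ("ods_table", "field_2"),
    ("std_table_2", "field_4"): ("ods_table", "field_4"),
    ("dw_table", "field_1"): ("std_table_1", "field_1"),
    ("dw_table", "field_2"): ("std_table_2", "field_2"),
    ("dm_table", "field_1"): ("dw_table", "field_1"),
    ("dm_table", "field_2"): ("dw_table", "field_2"),
}

# Rules configured explicitly, here only on the ODS data table (FIG. 3).
CONFIGURED = {
    ("ods_table", "field_1"): ["unique", "not_null"],
    ("ods_table", "field_2"): ["not_null"],
    ("ods_table", "field_3"): ["id_card_format"],
    ("ods_table", "field_4"): ["date_format"],
}


def resolve_rules(field, lineage=LINEAGE, configured=CONFIGURED):
    """Return the rules of the nearest upstream field that has any configured.

    Walks the blood relationship upstream; the `seen` set guards against
    accidental cycles in the lineage records.
    """
    seen = set()
    current = field
    while current is not None and current not in seen:
        if current in configured:
            return list(configured[current])
        seen.add(current)
        current = lineage.get(current)
    return []


dm_rules = resolve_rules(("dm_table", "field_1"))
```

Here `dm_rules` holds the uniqueness and null value rules that were configured three layers upstream on field 1 of the ODS data table.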
Step 106: and carrying out data quality evaluation on the data to be evaluated based on the data quality evaluation rule.
After the data quality evaluation rule is configured for the data to be evaluated in the target data table, the data quality of the data to be evaluated can be evaluated according to that rule.
In the above embodiment, the inheritance relationship of the data to be evaluated in the target data table is determined from its data blood relationship, the quality evaluation rule configured in the upstream data table from which the data is inherited is inherited along with the data, and the data to be evaluated is then evaluated against that rule. In this way, quality evaluation rules can be inherited along the data blood relationships between the data tables at every level of the data warehouse, and data quality is evaluated over the whole life cycle of the data. This ensures that high-quality data circulates effectively and improves the stability and accuracy of upper-layer data applications and data services.
The data in the target data table is not necessarily inherited from the upstream table unchanged; the data may be converted, computed on, or otherwise processed during inheritance. In that case, the data quality evaluation rule configured for the data in the upstream table may not be applicable in the target data table.
For data that cannot correctly inherit the data quality evaluation rule, the rule can be corrected through manual checking and an appropriate data quality evaluation rule reconfigured for the data, or a new data quality evaluation rule can be generated automatically through automatic detection; the present specification does not specifically limit this.
In an exemplary embodiment, the determining, based on the data blood relationship, a data quality evaluation rule configured in an upstream data table inherited by the data to be evaluated includes:
determining whether the data to be evaluated is processed or not based on the data blood-edge relationship;
if the data to be evaluated is processed, automatically generating a data quality evaluation rule of the data to be evaluated according to the data elements of the data to be evaluated;
and if the data to be evaluated is not processed, determining a data quality evaluation rule configured in an upstream data table inherited by the data to be evaluated.
According to the data blood relationship of the data to be evaluated, it can be determined whether the data to be evaluated was obtained directly from an upstream data table or was processed.
The present specification does not specifically limit how to determine whether the data to be evaluated is processed. For example, the inheritance manner or processing manner of the data to be evaluated may be read directly from the data blood relationship, or the data to be evaluated may be compared with the data it inherits from in the upstream data table, located through the data blood relationship, to check whether the two are identical.
If the data is not processed and is inherited directly, the data quality evaluation rule of the data to be evaluated can also be inherited directly from the upstream data table and configured into the target data table;
if the data is processed, then the data quality evaluation rule cannot be inherited directly from the upstream data table.
In that case, the data quality evaluation rule of the data to be evaluated can be generated automatically from the data rule or the data element corresponding to the data to be evaluated in the target data table.
According to this embodiment, whether the data to be evaluated is processed is confirmed, and when the data is processed, the data quality evaluation rule for the data to be evaluated is generated automatically from its data elements. This prevents erroneous inheritance of data quality evaluation rules and achieves quality evaluation over the whole life cycle of the data, which ensures effective circulation of high-quality data and improves the stability and accuracy of upper-layer data applications and data services.
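The branch described in this embodiment can be sketched as follows. The `Lineage` record, the shape of the data element (a dictionary with `type` and `nullable` keys), and the rule names are all assumptions made for this example; the specification does not prescribe how a rule is generated from a data element.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lineage:
    upstream_table: str
    upstream_field: str
    transform: Optional[str] = None  # None means the data was copied unchanged


def is_processed(lineage, upstream_values=None, target_values=None):
    """Decide whether the data was processed during inheritance.

    First reads the processing manner recorded in the blood relationship;
    if none is recorded and both value lists are supplied, falls back to
    comparing the upstream data with the target data.
    """
    if lineage.transform is not None:
        return True
    if upstream_values is not None and target_values is not None:
        return list(upstream_values) != list(target_values)
    return False


def generate_rules_from_element(data_element):
    """Hypothetical generator: derive rule names from a data element."""
    rules = []
    if not data_element.get("nullable", True):
        rules.append("not_null")
    if data_element.get("type") == "date":
        rules.append("date_format")
    return rules


def determine_rules(lineage, upstream_rules, data_element,
                    upstream_values=None, target_values=None):
    """Generate rules for processed data; otherwise inherit them upstream."""
    if is_processed(lineage, upstream_values, target_values):
        return generate_rules_from_element(data_element)
    key = (lineage.upstream_table, lineage.upstream_field)
    return list(upstream_rules.get(key, []))


upstream_rules = {
    ("table_a", "name"): ["not_null"],
    ("table_a", "birthday"): ["date_format"],
}
copied = Lineage("table_a", "name")
processed = Lineage("table_a", "birthday", transform="CAST(birthday AS DATE)")
```

With these records, `determine_rules(copied, upstream_rules, {"type": "string"})` inherits the upstream rule, while the processed field receives rules generated from its own data element.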
In one exemplary embodiment shown in the present description, the data blood-edge relationship indicates a manner of processing the data to be evaluated;
after determining, based on the data blood relationship, the data quality evaluation rule configured in the upstream data table inherited by the data to be evaluated, the method further includes:
converting the acquired data quality evaluation rule into a data quality evaluation rule for the data to be evaluated, based on the processing manner of the data to be evaluated indicated by the data blood relationship.
When the data in a data table is processed during inheritance, the processing manner can be written into the data blood relationship of the data.
Specifically, the processing manner may be recorded, for example, as an SQL statement. The processing manner of the data can then be recovered from the data blood relationship of the data.
When the corresponding processing manner is obtained from the data blood relationship of the data to be evaluated in the target data table, the data quality evaluation rule configured for the inherited data field in the upstream data table can be automatically processed in a corresponding way, so as to obtain a data quality evaluation rule suitable for the data to be evaluated in the target data table.
For example, suppose the data to be evaluated in the target data table is inherited from the gender field of an upstream table A, and the gender field of table A is configured with a valid value check rule whose valid values are "male" and "female". When the data to be evaluated inherits the gender field of upstream table A, it is processed according to the following rule: the value "male" is converted to the value "1" and the value "female" is converted to the value "0". According to this processing rule, the valid value check rule can be converted into a new valid value check rule whose valid values are "0" and "1", which is then configured as the data quality evaluation rule of the data to be evaluated in the target data table.
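The gender example can be sketched in Python as a value replacement applied to the rule itself. The rule layout (a dictionary with `type` and `values`) and the function names are assumptions made for this illustration.

```python
def convert_valid_values(valid_values, value_mapping):
    """Apply the data's value replacement to the rule's valid values.

    Values without a mapping entry are assumed to pass through unchanged.
    """
    return {value_mapping.get(v, v) for v in valid_values}


def check_valid_values(values, rule):
    """Valid value check: every value must be one of the rule's valid values."""
    return all(v in rule["values"] for v in values)


# Rule configured on the gender field of upstream table A.
upstream_rule = {"type": "valid_values", "values": {"male", "female"}}

# Processing manner recorded in the blood relationship, as a value mapping.
mapping = {"male": "1", "female": "0"}

# Converted rule for the data to be evaluated in the target data table.
target_rule = {
    "type": upstream_rule["type"],
    "values": convert_valid_values(upstream_rule["values"], mapping),
}
```

The converted rule accepts the processed values "0" and "1" and rejects an unconverted value such as "male", so the target field is checked against what it actually stores.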
For different processing modes, different processing modes of the data quality evaluation rule should be corresponding. The implementation of the processing mode of the data quality evaluation rule is not particularly limited in this specification. For example, a value replacement method similar to the above example may be adopted, or a correspondence relation between the data blood-edge relationship and the processing method of the data quality evaluation rule may be stored in advance in the data warehouse, or learning of the processing method of the data blood-edge relationship to the quality evaluation rule may be realized by a machine model or a deep learning model, or the like.
In the above embodiment, the processing mode of the data to be evaluated is obtained from the data blood relationship, and the data quality evaluation rule in the upstream data table is automatically processed accordingly to obtain a data quality evaluation rule suitable for the data to be evaluated in the target data table. This realizes quality evaluation over the full life cycle of the data, ensures the effective circulation of high-quality data, and improves the stability and accuracy of upper-layer data applications and data services.
Data in a data warehouse is typically updated and circulated over time, and thus data quality assessment of data in a data warehouse needs to be performed from time to time.
Typically, different data tables in a data warehouse have a relatively fixed data update period.
In an exemplary embodiment shown in the present specification, the performing, based on the data quality evaluation rule, data quality evaluation on the data to be evaluated includes:
and carrying out data quality evaluation on the data to be evaluated periodically based on the data quality evaluation rule and the data updating period of the target data table.
The data quality evaluation can be carried out on the data to be evaluated in the target data table periodically according to the data updating period of the target data table. Typically, the time period of the periodic evaluation should be no less than the data update period of the target data table.
Specifically, as shown in fig. 3, a quality monitoring task for the target data table may be generated to perform the data quality evaluation. Running this quality monitoring task automatically completes the data quality evaluation and the monitoring alarms for the target data table.
According to the embodiment, the data quality evaluation is carried out on the data to be evaluated in the target data table at regular intervals, so that the quality evaluation of the full life cycle of the data in the time dimension is realized, the effective circulation of high-quality data is ensured, and the stability and the accuracy of the upper layer data application and the data service are improved.
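As a rough illustration of the periodic evaluation described above, the following Python sketch derives evaluation times from a table's update period and rejects an evaluation period shorter than the update period. The function name `evaluation_times` and the use of fixed intervals are assumptions for illustration; the specification does not prescribe a scheduling mechanism.

```python
from datetime import datetime, timedelta


def evaluation_times(first_run, update_period, evaluation_period, count):
    """Return the first `count` evaluation times for a target data table.

    The evaluation period is required to be no less than the table's
    data update period, as stated in the description.
    """
    if evaluation_period < update_period:
        raise ValueError("evaluation period should be no less than the update period")
    return [first_run + i * evaluation_period for i in range(count)]


# A table updated daily and evaluated daily (hypothetical values).
runs = evaluation_times(
    first_run=datetime(2023, 1, 1),
    update_period=timedelta(days=1),
    evaluation_period=timedelta(days=1),
    count=3,
)
```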
In one exemplary embodiment shown in the present description, the method further comprises:
generating a first data quality evaluation result of the target data table; the data quality assessment results include problem data that fails the data quality assessment.
When the data quality evaluation for the target data table is completed, a first data quality evaluation result can be generated so that the data provider can rectify the problem data accordingly.
In order to facilitate the modification of the data provider, the first data quality evaluation result may include all problem data that fails the data quality evaluation.
In one exemplary embodiment shown in the present specification, the method further includes:
obtaining a modified target data table, and updating the target data table;
based on the data quality evaluation rule, performing data quality evaluation again for the updated target data table, and generating a second data quality evaluation result;
and determining the rectifying quality of the target data table based on the second data quality evaluation result and the first data quality evaluation result.
After the data provider performs rectification on the data, the target data table can be updated, and after the target data table is updated, data quality evaluation is performed on the data to be evaluated in the target data table again, and a second data quality evaluation result can be generated aiming at the re-performed evaluation. Similarly, the second data quality evaluation result may include all problem data that fails the data quality evaluation at the time of the re-evaluation.
And comparing the first data quality evaluation result with the second data quality evaluation result, namely evaluating the quality of the data rectification.
This specification does not specifically limit which numerical value is used to describe the quality of the data rectification.
For example, the difference in, or the proportional reduction of, the problem data in the second data quality evaluation result relative to the problem data in the first data quality evaluation result may be used as a measure of the rectification quality.
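The two measures mentioned here can be written down as a short Python sketch. The function name `rectification_metrics` and the sample counts are illustrative assumptions, and treating an empty first result as a ratio of 0.0 is a choice made only for this sketch.

```python
def rectification_metrics(first_problem_count, second_problem_count):
    """Compare the problem-data counts of two evaluation results.

    Returns the absolute reduction and the reduction as a proportion of
    the first count.
    """
    difference = first_problem_count - second_problem_count
    if first_problem_count == 0:
        # Nothing needed rectifying; the ratio is defined as 0.0 here (assumption).
        return difference, 0.0
    return difference, difference / first_problem_count


# Illustrative counts: 500 problem records before and 130 after rectification.
difference, reduction_ratio = rectification_metrics(500, 130)
```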
In an exemplary embodiment shown in the present specification, the determining the quality of improvement of the target data table based on the second data quality evaluation result and the first data quality evaluation result includes:
extracting a first preset number of problem data based on the first data quality evaluation result;
counting a second number of repeated problem data in the problem data contained in the second data quality evaluation result and the first preset number of the problem data;
and determining an improvement quality of the target data table based on the first number and the second number.
As shown in fig. 5, fig. 5 is a schematic diagram illustrating an improvement quality determination for data quality assessment according to an exemplary embodiment.
When the first data quality evaluation result is generated for the data to be evaluated in the target data table, a first preset number of problem data may be extracted from the problem table E, which is formed by all the problem data contained in the first data quality evaluation result, so as to form a problem table E_xf.
The first preset number of extracted problem data may be set according to conditions such as the data amount of the target data table. For example, if the problem table E contains 500 items of problem data, the first preset number may be set to 50.
The extracted part of problem data may be extracted randomly from the problem table E, or may be extracted according to measurement dimensions such as integrity, normalization, timeliness, accuracy, uniqueness, consistency, etc., which is not specifically limited in this description.
After the problem table E is fed back to the corresponding data provider or data rectifying party, the data provider or data rectifying party can conduct data rectifying on the data in the target data table.
After the data modification is completed, the target data table is updated, and data quality evaluation is performed again for the updated target data table, so that a second data quality evaluation result and a corresponding problem table E' are generated.
The number of problem data in the problem table E' is counted, and the problem data in the problem table E' is compared with the problem data in the problem table E_xf.
The number of problem data in the problem table E', and the second number, that is, the number of problem data in the problem table E' that repeat problem data contained in the problem table E_xf, can be used as evaluation dimensions of the quality of the data rectification.
For example, suppose that after the data has been rectified and evaluated again, the problem table E' contains 130 items of problem data and the second number is 30, that is, 30 items of problem data repeat the problem data extracted before the rectification. It can then be considered that 50 - 30 = 20 of the extracted problem data were rectified successfully, so the rectification efficiency relative to the first evaluation is 20/50 = 40%.
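The sampling-based calculation in this example can be reproduced with the following Python sketch. It assumes that each item of problem data can be identified by a comparable key (here, integers) and that extraction is random; the function names and the synthetic data are hypothetical.

```python
import random


def sample_problems(problem_ids, sample_size, seed=None):
    """Randomly extract the first preset number of problem data from table E."""
    rng = random.Random(seed)
    return set(rng.sample(sorted(problem_ids), sample_size))


def rectification_efficiency(sampled_ids, second_problem_ids):
    """Return (rectified count, second number, efficiency) for the sample."""
    first_number = len(sampled_ids)
    if first_number == 0:
        raise ValueError("the sample must not be empty")
    # Second number: sampled problem data that still appears after rectification.
    second_number = len(set(sampled_ids) & set(second_problem_ids))
    fixed = first_number - second_number
    return fixed, second_number, fixed / first_number


# Problem table E with 500 items; E_xf is a sample of 50 of them.
problem_table_e = set(range(500))
problem_table_e_xf = sample_problems(problem_table_e, 50, seed=0)

# Synthetic E': 30 sampled items remain unfixed, plus 100 other problem items.
still_broken = sorted(problem_table_e_xf)[:30]
problem_table_e_prime = set(still_broken) | set(range(1000, 1100))

fixed, second_number, efficiency = rectification_efficiency(
    problem_table_e_xf, problem_table_e_prime
)
print(fixed, second_number, efficiency)  # prints: 20 30 0.4
```

Comparing only against the sample E_xf keeps the check cheap, at the cost of estimating the rectification efficiency from a subset rather than from all of problem table E.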
The above embodiment intuitively reflects the quality of data rectification through the change of the quantity of the problem data before and after rectification, and provides effective reference for data management.
In conventional quality evaluation reports, the data quality evaluation results are typically presented in technical language, which is difficult for data managers and business system personnel to understand.
In view of this, in one exemplary embodiment shown in the present specification, the method further includes:
and generating a quantized data quality assessment result report of the target data table based on the first data quality assessment result and a preset quantization scoring rule.
And quantizing the first data quality evaluation result generated aiming at the target data table according to a preset quantization scoring rule to generate a quantized data quality evaluation result report.
For example, the quantized data quality evaluation result report produced in this way may include a data quality evaluation result expressed on a percentage (100-point) scale, so that the user can intuitively perceive the data quality of the data to be evaluated.
In the present specification, the target data table may be a certain data table, or may be a plurality of data tables, all data tables at a certain level, or the like, and the number or the range of the target data tables is not specifically limited in the present specification.
In an exemplary embodiment shown in the present specification, the preset quantization scoring rule includes at least one of the following:
a quantization scoring rule based on a quantitative proportion of problem data in the target data table;
a quantization scoring rule based on a quantitative ratio of target data tables in which problem data exists;
and a quantization scoring rule based on the quantitative proportion of the problem data in the target data table and the quantitative proportion of the target data table with the problem data in the target data table.
The quantization score may be calculated based on the proportion of problem data in the target data tables, the proportion of target data tables that contain problem data, or the like, and these proportions may be assigned different weights.
For example, the quantization scoring rule described above may be determined according to the following formula:
[Formula image BDA0004023944330000141; not reproduced in this text]
wherein, in the above formula, z represents the quantization score; p represents the number of target data tables that contain problem data; q represents the total number of target data tables; x_i represents the number of problem data in the i-th target data table; y_i represents the total amount of data in the i-th target data table; s represents the statistical weight of the dimension measuring the proportion of target data tables that contain problem data; and t represents the statistical weight of the dimension measuring the proportion of problem data in the target data tables.
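The formula itself is not reproduced in this text, so the following Python sketch is not the formula of the specification. It shows only one possible weighted scoring rule that is consistent with the symbol definitions above, under the assumptions that the score lies on a 100-point scale, that s + t = 1, and that the problem-data proportion is computed over all tables together. The function name `quantized_score` is hypothetical.

```python
def quantized_score(problem_counts, total_counts, s=0.5, t=0.5):
    """One possible scoring rule (an assumption, not the patent's formula).

    problem_counts[i] is x_i, total_counts[i] is y_i, s weights the
    proportion of tables with problem data, t weights the proportion of
    problem data.
    """
    q = len(total_counts)                          # total number of target tables
    p = sum(1 for x in problem_counts if x > 0)    # tables containing problem data
    total_records = sum(total_counts)
    if q == 0 or total_records == 0:
        raise ValueError("at least one non-empty target data table is required")
    table_ratio = p / q
    record_ratio = sum(problem_counts) / total_records
    return 100 * (1 - (s * table_ratio + t * record_ratio))


# Three target tables: x = [0, 10, 40], y = [100, 100, 300] (illustrative values).
z = quantized_score([0, 10, 40], [100, 100, 300], s=0.5, t=0.5)
```

Under this assumed rule, a set of tables with no problem data scores 100, and the score falls as either proportion grows.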
In the above embodiment, the data quality condition is quantized through the custom scoring rule and the weight setting, so that service personnel can understand the overall data quality profile conveniently, and the data quality performance evaluation between different hierarchy dimensions can be directly performed.
The present specification also provides an embodiment of a data quality evaluation apparatus corresponding to the embodiment of the data quality evaluation method described above.
Referring to fig. 6, fig. 6 is a hardware configuration diagram of an electronic device in which a data quality evaluation apparatus is located in an exemplary embodiment. At the hardware level, the device includes a processor 602, an internal bus 604, a network interface 606, a memory 608, and a non-volatile memory 610, and may of course also include other hardware required by the service. One or more embodiments of the present description may be implemented in software, for example by the processor 602 reading a corresponding computer program from the non-volatile memory 610 into the memory 608 and then running it. Of course, in addition to a software implementation, one or more embodiments of the present description do not exclude other implementations, such as a logic device or a combination of software and hardware; that is, the execution subject of the following processing flow is not limited to logic units, but may also be hardware or a logic device.
Referring to fig. 7, fig. 7 is a block diagram illustrating a data quality assessment apparatus according to an exemplary embodiment. The data quality evaluation device can be applied to the electronic equipment shown in fig. 6 to realize the technical scheme of the specification. Wherein, the data quality evaluation device may include:
an acquiring unit 701, configured to acquire a data blood-edge relationship corresponding to data to be evaluated in a target data table; wherein the data blood relationship indicates an inheritance relationship of the data;
a determining unit 702, configured to determine, based on the data blood-edge relationship, a data quality evaluation rule configured by the data to be evaluated in an upstream data table inherited by the data to be evaluated;
an evaluation unit 703, configured to perform data quality evaluation on the data to be evaluated based on the data quality evaluation rule.
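How the three units cooperate can be illustrated with a minimal Python sketch. The dictionary-based lineage and rule stores, the table and field names, and the function names are all hypothetical; the specification does not define these data structures.

```python
# Data blood relationship: (table, field) -> inherited (upstream table, field).
lineage = {("target", "gender"): ("table_a", "gender")}

# Quality rules configured on upstream fields, expressed as predicates.
rules = {("table_a", "gender"): [lambda value: value in {"male", "female"}]}


def acquire_lineage(table, field):
    """Acquiring unit: look up the upstream field that the data inherits from."""
    return lineage.get((table, field))


def determine_rules(upstream):
    """Determining unit: fetch the rules configured on the upstream field."""
    return rules.get(upstream, [])


def evaluate(values, checks):
    """Evaluation unit: return the problem data that fails any rule."""
    return [value for value in values if not all(check(value) for check in checks)]


upstream = acquire_lineage("target", "gender")
problem_data = evaluate(["male", "female", "unknown"], determine_rules(upstream))
```

In this sketch the only value that fails the inherited rule is "unknown", so it is the only item reported as problem data.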
In this embodiment, the determining unit 702 is specifically configured to determine whether the data to be evaluated is processed based on the data blood-edge relationship;
if the data to be evaluated is processed, automatically generating a data quality evaluation rule of the data to be evaluated according to the data elements of the data to be evaluated;
and if the data to be evaluated is not processed, determining a data quality evaluation rule configured in an upstream data table inherited by the data to be evaluated.
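The branch performed by the determining unit can be sketched as follows. The shape of the lineage edge, the data element metadata (`nullable`, `allowed_values`), and the rule dictionaries are assumptions made for illustration only.

```python
def rule_from_data_element(element):
    """Generate a rule from data element metadata (hypothetical metadata keys)."""
    rule = {"not_null": not element.get("nullable", True)}
    if "allowed_values" in element:
        rule["valid_values"] = set(element["allowed_values"])
    return rule


def determine_rule(edge, upstream_rules, data_elements):
    """Pick the rule source based on whether the data was processed."""
    if edge["processed"]:
        # Processed data: generate the rule from the data element of the field.
        return rule_from_data_element(data_elements[edge["data_element"]])
    # Unprocessed data: reuse the rule configured in the upstream data table.
    return upstream_rules[(edge["upstream_table"], edge["upstream_field"])]


data_elements = {"gender_code": {"nullable": False, "allowed_values": ["0", "1"]}}
upstream_rules = {("table_a", "gender"): {"not_null": True, "valid_values": {"male", "female"}}}

processed_edge = {"processed": True, "data_element": "gender_code",
                  "upstream_table": "table_a", "upstream_field": "gender"}
unprocessed_edge = dict(processed_edge, processed=False)

generated_rule = determine_rule(processed_edge, upstream_rules, data_elements)
inherited_rule = determine_rule(unprocessed_edge, upstream_rules, data_elements)
```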
In this embodiment, the data blood-edge relationship indicates a processing manner of the data to be evaluated;
the device further comprises: a conversion unit configured to:
and converting the acquired data quality evaluation rule into a data quality evaluation rule aiming at the data to be evaluated based on the processing mode of the data to be evaluated indicated by the data blood relationship.
In this embodiment, the foregoing evaluation unit 703 is specifically configured to perform data quality evaluation on the data to be evaluated periodically based on the data quality evaluation rule and the data update period of the target data table.
In this embodiment, the apparatus further includes: a first generation unit, configured to generate a first data quality evaluation result of the target data table; the data quality assessment results include problem data that fails the data quality assessment.
In this embodiment, the apparatus further includes: a rectification quality determination unit for:
obtaining a modified target data table, and updating the target data table;
based on the data quality evaluation rule, performing data quality evaluation again for the updated target data table, and generating a second data quality evaluation result;
and determining the rectifying quality of the target data table based on the second data quality evaluation result and the first data quality evaluation result.
In this embodiment, the correction quality determining unit is specifically configured to: extracting a first preset number of problem data based on the first data quality evaluation result;
counting a second number of repeated problem data in the problem data contained in the second data quality evaluation result and the first preset number of the problem data;
and determining an improvement quality of the target data table based on the first number and the second number.
In this embodiment, the apparatus further includes: a second generating unit for:
and generating a quantized data quality assessment result report of the target data table based on the first data quality assessment result and a preset quantization scoring rule.
In this embodiment, the preset quantization scoring rule includes at least one of the following:
a quantization scoring rule based on a quantitative proportion of problem data in the target data table;
a quantization scoring rule based on a quantitative ratio of target data tables in which problem data exists;
and a quantization scoring rule based on the quantitative proportion of the problem data in the target data table and the quantitative proportion of the target data table with the problem data in the target data table.
In this embodiment, the data quality evaluation rule includes at least one of:
a uniqueness rule; null value verification rules; identity card verification rules; date verification rules.
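The specification does not say how each of these rules is checked. The Python sketch below shows one plausible check per rule type, under stated assumptions: dates use the `YYYY-MM-DD` format, and the identity card check assumes the 18-character resident identity number with the checksum defined in GB 11643-1999. All function names are hypothetical.

```python
from datetime import datetime

# Checksum weights and check characters assumed from GB 11643-1999.
_WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
_CHECK_CODES = "10X98765432"


def find_duplicates(values):
    """Uniqueness rule: return the set of values that occur more than once."""
    seen, duplicates = set(), set()
    for value in values:
        if value in seen:
            duplicates.add(value)
        seen.add(value)
    return duplicates


def check_not_null(value):
    """Null value rule: reject None and blank strings."""
    return value is not None and str(value).strip() != ""


def check_date(value, fmt="%Y-%m-%d"):
    """Date rule: the value must parse as a real calendar date."""
    try:
        datetime.strptime(value, fmt)
        return True
    except (ValueError, TypeError):
        return False


def check_id_card(value):
    """Identity card rule: 17 digits plus a matching check character."""
    if not isinstance(value, str) or len(value) != 18:
        return False
    body = value[:17]
    if not (body.isascii() and body.isdigit()):
        return False
    total = sum(int(digit) * weight for digit, weight in zip(body, _WEIGHTS))
    return _CHECK_CODES[total % 11] == value[17].upper()
```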
The implementation process of the functions and roles of each unit in the above device is specifically shown in the implementation process of the corresponding steps in the above method, and will not be described herein again.
For the device embodiments, reference is made to the description of the method embodiments for the relevant points, since they essentially correspond to the method embodiments. The apparatus embodiments described above are illustrative only, in that the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purposes of the present description. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. A typical implementation device is a computer, which may be in the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email device, game console, tablet computer, wearable device, or a combination of any of these devices.
In a typical configuration, a computer includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in computer-readable media, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage, quantum memory, graphene-based storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The terminology used in the one or more embodiments of the specification is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the specification. As used in this specification, one or more embodiments and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in one or more embodiments of the present description to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of one or more embodiments of the present description. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
The foregoing description of the preferred embodiment(s) is (are) merely intended to illustrate the embodiment(s) of the present invention, and it is not intended to limit the embodiment(s) of the present invention to the particular embodiment(s) described.

Claims (10)

1. A data quality assessment method is applied to a data warehouse; characterized in that the method comprises:
acquiring a data blood relationship corresponding to the data to be evaluated in the target data table; wherein the data blood relationship indicates an inheritance relationship of the data;
determining a data quality evaluation rule configured in an upstream data table inherited by the data to be evaluated based on the data blood relationship;
and carrying out data quality evaluation on the data to be evaluated based on the data quality evaluation rule.
2. The method of claim 1, wherein the determining, based on the data blood-edge relationship, a data quality evaluation rule for which the data under evaluation is configured in an upstream data table inherited by the data under evaluation comprises:
determining whether the data to be evaluated is processed or not based on the data blood-edge relationship;
if the data to be evaluated is processed, automatically generating a data quality evaluation rule of the data to be evaluated according to the data elements of the data to be evaluated;
and if the data to be evaluated is not processed, determining a data quality evaluation rule configured in an upstream data table inherited by the data to be evaluated.
3. The method of claim 1, wherein the data blood relationship indicates a manner in which the data to be evaluated is processed;
after determining, based on the data blood-edge relationship, a data quality evaluation rule for the data to be evaluated configured in an upstream data table inherited by the data to be evaluated, the method further includes:
and converting the acquired data quality evaluation rule into a data quality evaluation rule aiming at the data to be evaluated based on the processing mode of the data to be evaluated indicated by the data blood relationship.
4. The method according to claim 1, wherein the method further comprises:
generating a first data quality evaluation result of the target data table; the data quality assessment results include problem data that fails the data quality assessment.
5. The method according to claim 4, wherein the method further comprises:
obtaining a modified target data table, and updating the target data table;
based on the data quality evaluation rule, performing data quality evaluation again for the updated target data table, and generating a second data quality evaluation result;
and determining the rectifying quality of the target data table based on the second data quality evaluation result and the first data quality evaluation result.
6. The method of claim 5, wherein the determining the quality of improvement of the target data table based on the second data quality assessment result and the first data quality assessment result comprises:
extracting a first preset number of problem data based on the first data quality evaluation result;
counting a second number of repeated problem data in the problem data contained in the second data quality evaluation result and the first preset number of the problem data;
and determining an improvement quality of the target data table based on the first number and the second number.
7. The method according to claim 4, wherein the method further comprises:
and generating a quantized data quality assessment result report of the target data table based on the first data quality assessment result and a preset quantization scoring rule.
8. A data quality assessment device is applied to a data warehouse; characterized in that the device comprises:
the acquisition unit is used for acquiring a data blood-edge relationship corresponding to the data to be evaluated in the target data table; wherein the data blood relationship indicates an inheritance relationship of the data;
a determining unit, configured to determine, based on the data blood-edge relationship, a data quality evaluation rule that the data to be evaluated is configured in an upstream data table inherited by the data to be evaluated;
and the evaluation unit is used for evaluating the data quality of the data to be evaluated based on the data quality evaluation rule.
9. An electronic device, comprising a communication interface, a processor, a memory and a bus, wherein the communication interface, the processor and the memory are connected with each other through the bus;
the memory stores machine readable instructions, and the processor performs the method of any of claims 1-7 by invoking the machine readable instructions.
10. A machine-readable storage medium storing machine-readable instructions which, when invoked and executed by a processor, implement the method of any one of claims 1-7.
CN202211700716.3A 2022-12-28 2022-12-28 Data quality evaluation method and device, electronic equipment and storage medium Pending CN116126843A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211700716.3A CN116126843A (en) 2022-12-28 2022-12-28 Data quality evaluation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211700716.3A CN116126843A (en) 2022-12-28 2022-12-28 Data quality evaluation method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116126843A true CN116126843A (en) 2023-05-16

Family

ID=86300181

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211700716.3A Pending CN116126843A (en) 2022-12-28 2022-12-28 Data quality evaluation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116126843A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117473493A (en) * 2023-12-28 2024-01-30 杭州数智政通科技有限公司 Data tracing and quality detection method and system based on data elements
CN117557200A (en) * 2024-01-10 2024-02-13 宁波安得智联科技有限公司 Warehouse adjustment plan evaluation method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination