CN116521662A - Method, device, equipment and medium for detecting effect of data cleaning - Google Patents

Publication number
CN116521662A
Authority
CN
China
Prior art keywords
data
field
target
rule
target field
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310451991.4A
Other languages
Chinese (zh)
Inventor
廖扬勇
魏莱
郝文鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dt Dream Technology Co Ltd
Original Assignee
Hangzhou Dt Dream Technology Co Ltd
Application filed by Hangzhou Dt Dream Technology Co Ltd
Priority to CN202310451991.4A
Publication of CN116521662A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application provides a method, a device, equipment and a medium for detecting the effect of data cleaning, and relates to the technical field of big data. The method comprises: acquiring a target data table and an original data table, the target data table being obtained by cleaning the data corresponding to at least one target field in the original data table; acquiring the quality rule corresponding to each target field; comparing the amounts of abnormal data of the at least one target field in the target data table and in the original data table, to obtain the variation of the abnormal data of each target field under the corresponding quality rule; and determining the quality improvement rate of the data table based on the variation of the abnormal data of each target field under the corresponding quality rule. By comparing the original data table before cleaning with the target data table after cleaning, the variation of abnormal data in the data table is determined and the quality improvement rate of the data table can be effectively determined, so that the effect of data cleaning can be evaluated automatically on the basis of that quality improvement rate.

Description

Method, device, equipment and medium for detecting effect of data cleaning
Technical Field
The present application relates to the field of big data technologies, and in particular, to a method, an apparatus, a device, and a medium for detecting an effect of data cleaning.
Background
Data cleaning is the process of improving data quality: under unified data standards and cleaning rules, and through technical operations such as associating data standards, configuring cleaning rules and developing cleaning tasks, the data passes through a series of cleaning steps such as filtering, deduplication, null removal and standardized conversion, so that dirty data is cleaned, erroneous data is corrected and good data is output.
At present, a problem summary is output after data cleaning is completed. This usually means feeding a list of the problem data found back to the data source unit, or reporting the cleaning results and problems to the data warehouse builders as text and charts. However, beyond finding data quality problems, there is no unified, standard and accurate quantitative method for evaluating whether data quality has improved after cleaning, so the important value of data cleaning in data warehouse construction cannot be demonstrated.
Disclosure of Invention
The object of the present application is to solve at least to some extent one of the above technical problems.
Therefore, the application provides a method, a device, equipment and a medium for detecting the effect of data cleaning, which are used for acquiring a target data table and an original data table; the target data table is obtained by cleaning data corresponding to at least one target field in the original data table; acquiring quality rules corresponding to each target field; comparing the data quantity of the abnormal data of at least one target field in the target data table and the original data table respectively to obtain the variation quantity of the abnormal data of each target field under the corresponding quality rule; and determining the quality improvement rate of the data table based on the variation of the abnormal data of each target field under the corresponding quality rule. Therefore, the change amount of abnormal data of the data table is determined by comparing the original data table before data cleaning with the target data table after data cleaning, and the quality improvement rate of the data table can be effectively determined, so that the effect of automatically evaluating the data cleaning based on the quality improvement rate of the data table can be realized.
An embodiment of a first aspect of the present application provides a method for detecting an effect of data cleaning, where the method includes: acquiring a target data table and an original data table; the target data table is obtained by cleaning data corresponding to at least one target field in the original data table; acquiring quality rules corresponding to the target fields; comparing the data quantity of the abnormal data of the at least one target field in the target data table and the original data table respectively to obtain the variation quantity of the abnormal data of each target field under the corresponding quality rule; and determining the quality improvement rate of the data table based on the variation of the abnormal data of each target field under the corresponding quality rule.
Optionally, comparing the data amounts of the abnormal data of the at least one target field in the target data table and in the original data table, to obtain the variation of the abnormal data of each target field under the corresponding quality rule, includes: for any one of the target fields, in the case that the target field has a corresponding newly added field in the target data table, determining the variation of the abnormal data of the target field under the corresponding quality rule according to the total data amount of the newly added field in the target data table and a first abnormal data amount that does not conform to the corresponding quality rule; and in the case that the target field has no corresponding newly added field in the target data table, determining the variation of the abnormal data of the target field under the corresponding quality rule according to a second abnormal data amount of the target field in the original data table and a third abnormal data amount of the target field in the target data table.
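The two cases above can be sketched in Python as follows. This is only an illustration under stated assumptions: the passage names the inputs but does not give the arithmetic, so the subtraction in each function is one plausible reading, and the function and parameter names are hypothetical.

```python
def variation_with_new_field(new_field_total: int, first_abnormal: int) -> int:
    """Case 1: the cleaned values were written to a newly added field.

    Assumption: the variation is the number of values in the new field
    that conform to the quality rule (total minus non-conforming).
    """
    return new_field_total - first_abnormal


def variation_in_place(second_abnormal: int, third_abnormal: int) -> int:
    """Case 2: no newly added field; the field was cleaned in place.

    Assumption: the variation is the drop in abnormal records from the
    original table (second amount) to the target table (third amount).
    """
    return second_abnormal - third_abnormal


print(variation_with_new_field(1000, 50))  # 950
print(variation_in_place(120, 30))         # 90
```

Under this reading, a negative result from `variation_in_place` would mean that the target data table holds more abnormal records than the original data table.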
Optionally, the determining the quality improvement rate of the data table based on the variation of the abnormal data of each target field under the corresponding quality rule includes: for any target field, determining the field quality improvement rate of the target field according to the change amount of the abnormal data of the target field under the corresponding quality rule; and determining the quality improvement rate of the data table according to the first weight of each target field in the original data table and the field quality improvement rate.
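As a minimal sketch of the table-level step, the following Python function combines per-field improvement rates with per-field weights. It assumes a plain weighted sum and assumes the first weights already sum to 1; the passage above does not state how the first weight is derived, and the field names and numbers are made up for illustration.

```python
from typing import Dict


def table_quality_improvement_rate(
    field_rates: Dict[str, float],
    field_weights: Dict[str, float],
) -> float:
    """Weighted sum of field quality improvement rates (assumed formula)."""
    missing = set(field_rates) - set(field_weights)
    if missing:
        raise KeyError(f"no weight configured for fields: {sorted(missing)}")
    return sum(field_weights[name] * rate for name, rate in field_rates.items())


# Hypothetical target fields with illustrative rates and weights.
rates = {"phone": 0.8, "id_card": 0.5}
weights = {"phone": 0.6, "id_card": 0.4}
print(round(table_quality_improvement_rate(rates, weights), 4))  # 0.68
```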
Optionally, the determining, according to the change amount of the abnormal data of the target field under the corresponding quality rule, the field quality improvement rate of the target field includes: according to the variable quantity of the target field under the corresponding quality rule, determining the data quality improvement rate of the target field under the corresponding quality rule; and determining the field quality improvement rate of the target field according to the data quality improvement rate of the target field under the corresponding quality rule and the field rule weight of the corresponding quality rule in the original data table.
Optionally, the determining, according to the change amount of the target field under the corresponding quality rule, the data quality improvement rate of the target field under the corresponding quality rule includes: under the condition that the target field has a corresponding newly added field in the target data table, determining the data quality improvement rate of the target field under a corresponding quality rule according to the change amount of the target field under the corresponding quality rule and the ratio of the total data amount of the newly added field in the target data table; and under the condition that the target field does not have a corresponding newly added field in the target data table, determining the data quality improvement rate of the target field under the corresponding quality rule according to the ratio of the variable quantity of the target field under the corresponding quality rule to the fourth abnormal data quantity of the target field in the original data table.
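The ratio described above can be illustrated with a short Python sketch. The denominator is the total data amount of the newly added field in the first case and the fourth abnormal data amount in the original data table in the second case. Returning 0.0 for a zero denominator is an assumption made here to keep the function total; the source does not say how that case is handled.

```python
def rule_improvement_rate(variation: int, denominator: int) -> float:
    """Data quality improvement rate of one field under one quality rule.

    denominator is either the total data amount of the newly added field
    or the fourth abnormal data amount of the field in the original table.
    """
    if denominator <= 0:
        # Assumption: nothing to improve, so report no improvement.
        return 0.0
    return variation / denominator


print(rule_improvement_rate(950, 1000))  # 0.95 (newly added field case)
print(rule_improvement_rate(90, 120))    # 0.75 (in-place case)
```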
Optionally, the determining the field quality improvement rate of the target field according to the data quality improvement rate of the target field under the corresponding quality rule and the field rule weight of the corresponding quality rule in the original data table includes: under the condition that the quality rule corresponding to the target field is one, determining the field quality improvement rate of the target field according to the data quality improvement rate of the target field under the corresponding quality rule and the quality rule weight of the corresponding quality rule in the original data table; and under the condition that the quality rules corresponding to the target field are multiple, determining the field quality improvement rate of the target field according to the data quality improvement rate of the target field under each corresponding quality rule and the field rule weight of each corresponding quality rule in the original data table.
Optionally, the field rule weight of the quality rule corresponding to the target field in the original data table is obtained through the following steps: for any target field, acquiring a rule weight value of a quality rule corresponding to the target field in the original data table; and determining the field rule weight of the quality rule corresponding to the target field in the original data table according to the rule weight value of the quality rule corresponding to the target field in the original data table.
Optionally, determining the field rule weight of the quality rule corresponding to the target field in the original data table, according to the rule weight value of that quality rule in the original data table, includes: in the case that the target field corresponds to one quality rule, determining the field rule weight of that quality rule to be a set weight value; in the case that the target field corresponds to multiple quality rules, determining a first coefficient based on the sum of the rule weight values, in the original data table, of all the quality rules corresponding to the target field; and, for any quality rule corresponding to the target field, determining the field rule weight of that quality rule according to the ratio of its rule weight value to the first coefficient.
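The weighting scheme can be sketched as a simple normalization. The sketch below assumes that the set weight value for a single rule is 1.0 and that the first coefficient equals the sum of the rule weight values; both are assumptions, since the text only says the coefficient is "based on" that sum. The rule names are hypothetical.

```python
from typing import Dict

SET_WEIGHT = 1.0  # assumed set weight value for a field with a single rule


def field_rule_weights(rule_weight_values: Dict[str, float]) -> Dict[str, float]:
    """Turn raw rule weight values into field rule weights."""
    if not rule_weight_values:
        raise ValueError("a target field needs at least one quality rule")
    if len(rule_weight_values) == 1:
        (rule,) = rule_weight_values
        return {rule: SET_WEIGHT}
    first_coefficient = sum(rule_weight_values.values())  # assumed: plain sum
    if first_coefficient <= 0:
        raise ValueError("rule weight values must sum to a positive number")
    return {r: v / first_coefficient for r, v in rule_weight_values.items()}


def field_improvement_rate(
    rule_rates: Dict[str, float], weights: Dict[str, float]
) -> float:
    """Weighted sum of per-rule improvement rates (assumed combination)."""
    return sum(weights[r] * rule_rates[r] for r in weights)


weights = field_rule_weights({"not_null": 3, "format": 1})
print(weights)  # {'not_null': 0.75, 'format': 0.25}
print(round(field_improvement_rate({"not_null": 0.8, "format": 0.4}, weights), 4))  # 0.7
```

Normalizing by the sum keeps the field rule weights of one field adding up to 1, so a field with several rules is not counted more heavily than a field with one.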
Optionally, the method for detecting the effect of data cleaning further includes: for any target field, determining whether cleaning abnormality occurs according to the variation of the abnormal data of the target field under the corresponding quality rule; responding to the occurrence of cleaning abnormality, and generating and sending prompt information according to the change quantity of the abnormal data of the target field under the corresponding quality rule; the prompt message is used for prompting that the cleaning rule is invalid or the cleaning rule is wrong.
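A hedged sketch of such a check is shown below. The passage does not define the decision thresholds, so the mapping used here is an assumption: a negative variation is treated as a sign that the cleaning rule may be wrong, and a zero variation on a field that had abnormal data is treated as a sign that the rule may be invalid. The function name and message wording are hypothetical.

```python
from typing import Optional


def cleaning_alert(
    field: str, rule: str, variation: int, original_abnormal: int
) -> Optional[str]:
    """Return prompt information for a cleaning abnormality, or None."""
    if variation < 0:
        # Assumed reading: cleaning produced more abnormal data than before.
        return (
            f"Cleaning rule for field '{field}' under quality rule '{rule}' "
            f"may be wrong: abnormal records increased by {-variation}."
        )
    if variation == 0 and original_abnormal > 0:
        # Assumed reading: cleaning ran but changed nothing that mattered.
        return (
            f"Cleaning rule for field '{field}' under quality rule '{rule}' "
            f"may be invalid: {original_abnormal} abnormal records unchanged."
        )
    return None


print(cleaning_alert("phone", "format", -5, 40))
print(cleaning_alert("phone", "format", 0, 40))
print(cleaning_alert("phone", "format", 30, 40))  # None
```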
According to the data cleaning effect detection method, a target data table and an original data table are obtained; the target data table is obtained by cleaning data corresponding to at least one target field in the original data table; acquiring quality rules corresponding to each target field; comparing the data quantity of the abnormal data of at least one target field in the target data table and the original data table respectively to obtain the variation quantity of the abnormal data of each target field under the corresponding quality rule; and determining the quality improvement rate of the data table based on the variation of the abnormal data of each target field under the corresponding quality rule. Therefore, the change amount of abnormal data of the data table is determined by comparing the original data table before data cleaning with the target data table after data cleaning, and the quality improvement rate of the data table can be effectively determined, so that the effect of automatically evaluating the data cleaning based on the quality improvement rate of the data table can be realized.
An embodiment of a second aspect of the present application provides an effect detection apparatus for data cleansing, the apparatus including:
the first acquisition module is used for acquiring a target data table and an original data table; the target data table is obtained by cleaning data corresponding to at least one target field in the original data table;
The second acquisition module is used for acquiring the quality rule corresponding to each target field;
the comparison module is used for comparing the data quantity of the abnormal data of the at least one target field in the target data table and the original data table respectively so as to obtain the variation quantity of the abnormal data of each target field under the corresponding quality rule;
the first determining module is used for determining the quality improvement rate of the data table based on the variation of the abnormal data of each target field under the corresponding quality rule.
Optionally, the comparing module is specifically configured to, for any one of the target fields, determine, when the target field has a corresponding added field in the target data table, a change amount of abnormal data of the target field under a corresponding quality rule according to a total data amount of the added field in the target data table and a first abnormal data amount that does not conform to the corresponding quality rule; and under the condition that the target field does not have a corresponding newly added field in the target data table, determining the change amount of the abnormal data of the target field under a corresponding quality rule according to the second abnormal data amount of the target field in the original data table and the third abnormal data amount in the target data table.
Optionally, the first determining module is specifically configured to determine, for any one of the target fields, a field quality improvement rate of the target field according to a change amount of abnormal data of the target field under a corresponding quality rule; and determining the quality improvement rate of the data table according to the first weight of each target field in the original data table and the field quality improvement rate.
Optionally, the first determining module is specifically configured to determine, according to the amount of change of the target field under the corresponding quality rule, a data quality improvement rate of the target field under the corresponding quality rule; and determining the field quality improvement rate of the target field according to the data quality improvement rate of the target field under the corresponding quality rule and the field rule weight of the corresponding quality rule in the original data table.
Optionally, the first determining module is specifically configured to determine, when the target field has a corresponding newly added field in the target data table, a data quality improvement rate of the target field under a corresponding quality rule according to a ratio of the change amount of the target field under the corresponding quality rule to a total data amount of the newly added field in the target data table; and under the condition that the target field does not have a corresponding newly added field in the target data table, determining the data quality improvement rate of the target field under the corresponding quality rule according to the ratio of the variable quantity of the target field under the corresponding quality rule to the fourth abnormal data quantity of the target field in the original data table.
Optionally, the first determining module is specifically configured to determine, when the quality rule corresponding to the target field is one, a field quality improvement rate of the target field according to a data quality improvement rate of the target field under the corresponding quality rule and a quality rule weight of the corresponding quality rule in the original data table; and under the condition that the quality rules corresponding to the target field are multiple, determining the field quality improvement rate of the target field according to the data quality improvement rate of the target field under each corresponding quality rule and the field rule weight of each corresponding quality rule in the original data table.
Optionally, the field rule weight of the quality rule corresponding to the target field in the original data table is obtained through the following modules:
the third acquisition module is used for acquiring a rule weight value of the quality rule corresponding to any target field in the original data table;
and the second determining module is used for determining the field rule weight of the quality rule corresponding to the target field in the original data table according to the rule weight value of the quality rule corresponding to the target field in the original data table.
Optionally, the second determining module is specifically configured to: in the case that the target field corresponds to one quality rule, determine the field rule weight of that quality rule to be a set weight value; in the case that the target field corresponds to multiple quality rules, determine a first coefficient based on the sum of the rule weight values, in the original data table, of all the quality rules corresponding to the target field; and, for any quality rule corresponding to the target field, determine the field rule weight of that quality rule according to the ratio of its rule weight value to the first coefficient.
Optionally, the data cleaning effect detection device further includes:
the third determining module is used for determining whether cleaning abnormality occurs according to the change amount of the abnormal data of any target field under the corresponding quality rule;
the processing module is used for responding to the occurrence of cleaning abnormality, generating and sending prompt information according to the change quantity of the abnormal data of the target field under the corresponding quality rule; the prompt message is used for prompting that the cleaning rule is invalid or the cleaning rule is wrong.
According to the data cleaning effect detection device, the target data table and the original data table are obtained; the target data table is obtained by cleaning data corresponding to at least one target field in the original data table; acquiring quality rules corresponding to each target field; comparing the data quantity of the abnormal data of at least one target field in the target data table and the original data table respectively to obtain the variation quantity of the abnormal data of each target field under the corresponding quality rule; and determining the quality improvement rate of the data table based on the variation of the abnormal data of each target field under the corresponding quality rule. Therefore, the change amount of abnormal data of the data table is determined by comparing the original data table before data cleaning with the target data table after data cleaning, and the quality improvement rate of the data table can be effectively determined, so that the effect of automatically evaluating the data cleaning based on the quality improvement rate of the data table can be realized.
An embodiment of a third aspect of the present application provides an electronic device, including: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the effect detection method of data cleansing as described in the first aspect when the program is executed.
An embodiment of a fourth aspect of the present application proposes a non-transitory computer-readable storage medium, on which a computer program is stored, which program, when executed by a processor, implements the effect detection method of data cleansing as described in the first aspect.
An embodiment of a fifth aspect of the present application proposes a computer program product comprising a computer program which, when executed by a processor of an electronic device, enables the electronic device to perform the effect detection method of data cleaning as described in the first aspect.
Additional aspects and advantages of the application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a schematic diagram of the positioning of data cleansing in a data warehouse provided herein;
FIG. 2 is a schematic flow chart of data cleaning in a source layer provided in the present application;
FIG. 3 is a flow chart of a method for detecting an effect of data cleansing according to an embodiment of the present disclosure;
FIG. 4 is a flow chart of a method for detecting an effect of data cleaning according to a second embodiment of the present disclosure;
FIG. 5 is a flow chart of a method for detecting an effect of data cleaning according to the third embodiment of the present application;
FIG. 6 is a flow chart of a method for detecting an effect of data cleaning according to a fourth embodiment of the present disclosure;
FIG. 7 is a flow chart of a method for detecting an effect of data cleaning according to a fifth embodiment of the present disclosure;
FIG. 8 is a schematic diagram of a data flow between a source layer and a standard layer according to the present disclosure;
FIG. 9 is a schematic structural diagram of a data cleaning effect detection device according to a sixth embodiment of the present disclosure;
FIG. 10 is a schematic structural diagram of an electronic device according to a seventh embodiment of the present application.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below by referring to the drawings are exemplary and intended for the purpose of explaining the present application and are not to be construed as limiting the present application.
Currently, the positioning of data cleansing in a data warehouse is shown in FIG. 1:
Source layer (ODS (Operational Data Store) layer): data from the business source (or data source) systems, i.e. the source data, is loaded into the source layer of the big data platform through ETL (Extract-Transform-Load). The business source system data may include structured data, semi-structured data and unstructured data. Note that the business source system data does not need to be processed in any way during data access.
Standard layer (STD (Standard) layer): the data in the source layer is explored, a data standard (data elements and data dictionaries) is formed with reference to national/industry standards, and the source layer data is then cleaned and converted according to that data standard to form standardized data. Note that the table structure after cleaning directly inherits the table structure of the source layer.
Theme layer: the relevant data from each standardized or cleaned business system is analyzed, integrated, classified and fused from a macroscopic perspective, abstracted into entity objects of the business domain, and finally formed into normalized, complete and consistent data sets of those entity objects. For example, in the field of work-safety supervision, several business systems all involve production enterprises; the enterprise-related information in the different business systems can be analyzed, extracted and designed into a comprehensive enterprise data model, and data fusion then yields enterprise theme information that can serve various business scenarios and business domains.
Thematic layer: also called the data mart layer. According to the requirements of upper-layer business applications, data meeting specific business scenarios can be generated from the data of the theme layer and the standard layer to support those applications.
The data management process mainly comprises the following steps:
Source layer flow: relying mainly on an ETL tool, and according to the actual condition of the business source system data, the data is extracted into the source layer of the big data resource pool in full or incrementally, on a schedule or in real time; the data is not processed in any way during this step.
Standard layer flow: standardized cleaning is performed on the basis of the data in the source layer. The flow of cleaning the source layer data is shown in FIG. 2 and mainly comprises the following steps:
2.1 Analysis of data exploration results
Data exploration and analysis are performed on the data in the source layer, mainly examining the null rate, maximum text length, value range, code distribution and so on of the field data in each data table, so that the data of every field in the table can be fully understood from the exploration results.
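The exploration metrics named above can be computed for a single field roughly as follows. This is a minimal sketch over an in-memory list of values, not the patent's tooling; treating empty strings as null and reporting the value range as a minimum and maximum are assumptions, and the sample values are made up.

```python
from collections import Counter
from typing import Any, Dict, List, Optional


def profile_field(values: List[Optional[str]]) -> Dict[str, Any]:
    """Compute simple exploration metrics for one field of a data table."""
    total = len(values)
    # Assumption: both None and the empty string count as null.
    non_null = [v for v in values if v is not None and v != ""]
    return {
        "null_rate": (total - len(non_null)) / total if total else 0.0,
        "max_text_length": max((len(v) for v in non_null), default=0),
        "min_value": min(non_null, default=None),
        "max_value": max(non_null, default=None),
        "code_distribution": dict(Counter(non_null)),
    }


# Hypothetical values of a coded field such as a gender code.
print(profile_field(["01", "02", "01", None, ""]))
```

The `code_distribution` entry makes unexpected codes visible at a glance, which is the kind of information a later quality rule (for example, "value must be in the data dictionary") would build on.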
2.2 Data standard mapping
Based on this understanding of the fields, and with reference to national/industry standards, all standard data elements for the data table fields (mainly qualifiers, data elements and data dictionaries) can be sorted out and imported into the data cleaning system. The fields in the data table are then quickly mapped to the corresponding standard data elements, so that cleaning rules can be generated intelligently from the standard data elements.
2.3 Data cleaning
Cleaning rules for the fields of the source layer data tables can be configured from the standard data elements and from custom cleaning rules; a data cleaning task is then generated and the data is cleaned in a standardized way to form standard data.
2.4 Problem summary output
The cleaning results can be checked, classified and summarized, the problem-table data in the problem library can be analyzed, and a summary report can be provided to the user.
Theme layer flow:
1) The standardized or cleaned data is analyzed in depth and abstracted into a business data model.
2) Based on the preceding data exploration and the developers' survey of the business systems, a basic understanding is gained of how authoritative each business system's data is for each line of business.
3) Theme development is performed by configuring theme fusion rules; the fusion rules may include horizontal splitting, vertical splitting, multi-table merging, multi-table joining, single-table authoritative deduplication, custom rules and the like.
Thematic layer flow: driven by the business requirements of business users, the data can be processed following a dimensional modeling approach, which includes defining dimensions, sorting out the indicators to be calculated, designing dimension hierarchies and so on, to generate data sets that serve the decision-analysis needs of the business systems.
However, existing data cleansing schemes suffer mainly from several drawbacks:
1. After data cleaning is completed, the improvement in data quality is not quantitatively evaluated, so the value of data cleaning in data warehouse construction cannot be effectively assessed.
At present, although a problem summary is output after data cleaning is completed, the list of problem data found is usually just fed back to the data source unit, or the cleaning results and problems are reported to the data warehouse builder as text and charts. Beyond discovering data quality problems, there is no unified, standard and accurate quantitative method for evaluating whether the quality of the source-layer data has actually improved, so the important value of data cleaning in data warehouse construction cannot be demonstrated.
Although the quality of the cleaned data is checked in some scenarios, the data changes during cleaning, so a discovered data quality problem may stem either from the source data or from an improper cleaning operation. These problems overlap with the data quality problems of the data tables in the data warehouse, which makes field-level data quality problems difficult to locate; as a result, the improvement in data quality after cleaning is not effectively evaluated or measured.
2. There is no mechanism for automatically judging, and raising alarms on, whether the data cleaning rules and logic configured by data service personnel are correct and effective.
After completing a data cleaning task, data service personnel cannot quickly and accurately learn whether the configured rules or logic are valid and correct. Current data quality system alarms mainly inform data service personnel of quality problems in a data table; whether a quality problem is caused by an operation in the data cleaning process or by abnormal source data must still be judged manually after cleaning is completed, so the workflow is cumbersome, the workload is large and accuracy is difficult to guarantee.
3. Data cleaning is not effectively connected with data quality and data lineage (the blood-edge relationship), and the locating, handling, feedback and verification of data problems do not form a closed data quality loop for data cleaning.
The data cleaning system, the data quality system and the data lineage system are already basic functional components of a big data warehouse platform, and all serve to explore, locate, improve and check data quality. In practice, solving and verifying a data quality problem is often harder than discovering it. Data quality checks and data cleaning basically operate on a single table, and data lineage is mainly used for viewing. Data quality problems, however, cross layers: after data cleaning, the data moves from the source layer to the standard layer. These capabilities therefore need to be organically combined when evaluating the data quality improvement achieved by data cleaning, so as to form a closed data quality loop for data problems in data cleaning.
Aiming at the problems, the embodiment of the application provides a method, a device, equipment and a medium for detecting the effect of data cleaning.
The method for detecting the effect of data cleaning provided in the present application will be described in detail with reference to fig. 3.
Fig. 3 is a flowchart of a method for detecting an effect of data cleaning according to an embodiment of the present disclosure.
The data cleaning effect detection method of the embodiment of the application may be executed by the data cleaning effect detection device provided by the embodiment of the application. The data cleaning effect detection device can be applied to electronic equipment to execute data cleaning effect detection.
The electronic device may be any stationary or mobile computing device capable of performing data processing, for example, a mobile computing device such as a notebook computer, a smart phone, a wearable device, or a stationary computing device such as a desktop computer, or a server, or other types of computing devices, which are not limited in this application.
As shown in fig. 3, the method for detecting the effect of data cleaning includes the following steps:
Step 301, obtaining a target data table and an original data table; the target data table is obtained by cleaning data corresponding to at least one target field in the original data table.
In the embodiment of the present application, the original data table may be any one data table.
In the embodiment of the present application, the target field may be a field in the original data table, and the number of target fields may be, but is not limited to, one, which is not limited in the present application.
In the embodiment of the application, the original data table can be obtained, and the data corresponding to each target field in the original data table can be cleaned, so that the cleaned target data table can be obtained.
As an application scenario, an original data table may be acquired from a patch source layer (ODS layer) of a data warehouse, and a target data table after cleaning the original data table may be acquired from a standard layer (STD layer) of the data warehouse.
Step 302, a quality rule corresponding to each target field is obtained.
In the embodiment of the present application, each target field may have a corresponding quality rule, and the quality rule corresponding to each target field may be used to perform quality detection on the data corresponding to that target field. For example, when the target field is "identification card number", the quality rule corresponding to the target field is "identification card verification", which detects whether the data corresponding to the target field "identification card number" exists and is correct; for another example, when the target field is "name", the quality rule corresponding to the target field is "null value check", which detects whether the data corresponding to the target field "name" is null.
It should be noted that, the quality rule corresponding to any target field may be one or may be plural, which is not limited in this application.
As an example, assume that a target field "identification card number" exists in the original data table, and the quality rule corresponding to the target field may include "identification card check" and "uniqueness check"; assuming that the target field "name" exists in the original data table, the quality rule corresponding to the target field may include "null value check".
It should be noted that the above examples of the quality rules corresponding to the target fields are merely exemplary, and in practical applications, the quality rules corresponding to the target fields may be set according to needs, which is not limited in this application.
In the embodiment of the present application, a quality rule corresponding to each target field may be obtained.
For example, a corresponding quality rule may be pre-configured for each target field. For example, the original data table includes the target fields of "identification card number", "name", "gender code" and "update time", and the quality rules corresponding to each target field in the original data table are shown in table 1:
Table 1 Quality rules corresponding to each target field in the original data table
| Target field | Quality rule |
| --- | --- |
| Identification card number | Identity card verification |
| Identification card number | Uniqueness verification |
| Name | Null value verification |
| Gender code | Dictionary table value field verification |
| Update time | Date and time verification |
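As a rough illustration of how a pre-configured mapping like Table 1 might be held in memory, the following Python sketch maps each target field to one or more quality rule names. The names `QUALITY_RULES`, `rules_for` and `field_rule_pairs` are hypothetical and are not taken from the application; the rule names are plain labels, not executable checks.

```python
# Hypothetical configuration mirroring Table 1: each target field maps to
# one or more quality rule names (labels only, chosen for illustration).
QUALITY_RULES = {
    "identification card number": [
        "identity card verification",
        "uniqueness verification",
    ],
    "name": ["null value verification"],
    "gender code": ["dictionary table value field verification"],
    "update time": ["date and time verification"],
}


def rules_for(field):
    """Return the quality rules configured for a target field (empty if none)."""
    return QUALITY_RULES.get(field, [])


def field_rule_pairs(rules):
    """Flatten the mapping into (field, rule) pairs, one pair per table row."""
    return [(field, rule) for field, names in rules.items() for rule in names]
```

Keeping a list per field reflects the statement that a target field may have one quality rule or several.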
Step 303, comparing the data amounts of the abnormal data in the target data table and the original data table respectively for at least one target field, so as to obtain the variation of the abnormal data of each target field under the corresponding quality rule.
In the embodiment of the application, the data quantity of the abnormal data of at least one target field in the target data table and the original data table can be compared, so that the variation of the abnormal data of each target field under the corresponding quality rule can be obtained.
Step 304, determining the quality improvement rate of the data table based on the variation of the abnormal data of each target field under the corresponding quality rule.
In the embodiment of the application, the quality improvement rate of the data table can be used for indicating the quality improvement rate of the original data table after the cleaning treatment.
In the embodiment of the application, the quality improvement rate of the data table can be determined based on the variation of the abnormal data of each target field under the corresponding quality rule.
According to the data cleaning effect detection method, a target data table and an original data table are obtained; the target data table is obtained by cleaning data corresponding to at least one target field in the original data table; acquiring quality rules corresponding to each target field; comparing the data quantity of the abnormal data of at least one target field in the target data table and the original data table respectively to obtain the variation quantity of the abnormal data of each target field under the corresponding quality rule; and determining the quality improvement rate of the data table based on the variation of the abnormal data of each target field under the corresponding quality rule. Therefore, the change amount of abnormal data of the data table is determined by comparing the original data table before data cleaning with the target data table after data cleaning, and the quality improvement rate of the data table can be effectively determined, so that the effect of automatically evaluating the data cleaning based on the quality improvement rate of the data table can be realized.
In order to clearly explain how to compare the data amounts of the abnormal data in the target data table and the original data table of at least one target field respectively in the above embodiments of the present application, so as to obtain the variation amounts of the abnormal data of each target field under the corresponding quality rule, the present application further provides a data cleaning effect detection method.
Fig. 4 is a flow chart of a method for detecting an effect of data cleaning according to a second embodiment of the present disclosure.
As shown in fig. 4, according to the above embodiment of the present application, the method for detecting the effect of data cleaning may further include the following steps:
Step 401, for any target field, when the target field has a corresponding newly added field in the target data table, determining the change amount of abnormal data of the target field under the corresponding quality rule according to the total data amount of the newly added field in the target data table and a first abnormal data amount which does not conform to the corresponding quality rule.
In one possible implementation manner of the embodiment of the present application, a blood-edge relationship (which may also be referred to as a mapping relationship) between the fields in the original data table and the fields in the target data table may be maintained. According to the maintained relationship, the field in the target data table that has a blood-edge relationship with each target field in the original data table can be determined, that is, the field in the target data table corresponding to each target field is determined.
As an example, the target field in the original data table of the patch source layer (ODS layer) includes an identification number, a name, a sex code, an update time; after data cleaning, the fields in the target data table of the standard layer (STD layer) comprise an identity card number, a name, a gender code, a gender name and an update time; the blood-edge relationship between the target field in the original data table and the field in the target data table is shown in table 2:
TABLE 2 blood relationship between target fields in raw data tables and fields in target data tables
Here, the target field "identification card number" in the original data table has a blood-edge relationship with the field "identification card number" in the target data table; the target field "name" in the original data table has a blood-edge relationship with the field "name" in the target data table; and the target field "update time" in the original data table has a blood-edge relationship with the field "update time" in the target data table. In particular, the target field "gender code" in the original data table has a blood-edge relationship with both the field "gender code" and the field "gender name" in the target data table, because the target field "gender code" undergoes dictionary standardized cleaning during data cleaning, in which the "gender code" is dictionary-decoded. For example, when the "gender code" value stored in the original data table of the patch source layer (ODS layer) is 1, dictionary standardized cleaning decodes it into the gender name "male"; when the stored "gender code" value is 2, dictionary standardized cleaning decodes it into the gender name "female".
In the embodiment of the present application, the added field may be a field that is added to the target data table in comparison with the target field in the original data table.
In the embodiment of the application, the target field may have a corresponding newly added field in the target data table, that is, the target field may have a newly added field having a blood-related relationship (or mapping relationship) with the target field in the target data table.
Continuing the above example, compared with the original data table before data cleaning, the field "gender name" is newly added to the target data table after data cleaning, and the newly added field "gender name" has a blood-edge relationship with the target field "gender code" in the original data table. Therefore, in table 2, the target field "gender code" in the original data table has the corresponding newly added field "gender name" in the target data table.
In this embodiment of the present application, for a newly added field corresponding to a target field in a target data table, a quality rule corresponding to the newly added field may be a quality rule corresponding to a target field having a blood-related relationship with the newly added field, that is, the newly added field may inherit a quality rule corresponding to a target field having a blood-related relationship with the newly added field.
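The lineage lookup and rule inheritance described above can be sketched as follows. This is a minimal Python illustration, assuming the blood-edge relationship is stored as a dictionary from each original-table field to the target-table fields derived from it; `LINEAGE`, `SOURCE_RULES`, `newly_added_fields` and `inherited_rules` are hypothetical names, not part of the application.

```python
# Hypothetical blood-edge (lineage) map: original-table field -> target-table fields.
LINEAGE = {
    "identification card number": ["identification card number"],
    "name": ["name"],
    "gender code": ["gender code", "gender name"],
    "update time": ["update time"],
}

# Hypothetical quality rules configured on the original-table target fields.
SOURCE_RULES = {
    "identification card number": ["identity card verification", "uniqueness verification"],
    "name": ["null value verification"],
    "gender code": ["dictionary table value field verification"],
    "update time": ["date and time verification"],
}


def newly_added_fields(lineage):
    """Map each target field to the target-table fields that did not exist in the original table."""
    source_fields = set(lineage)
    added = {}
    for source, targets in lineage.items():
        new = [t for t in targets if t not in source_fields]
        if new:
            added[source] = new
    return added


def inherited_rules(lineage, source_rules):
    """Give each newly added field the quality rules of its lineage-related target field."""
    return {
        new_field: list(source_rules.get(source, []))
        for source, new_fields in newly_added_fields(lineage).items()
        for new_field in new_fields
    }
```

With the example data, only "gender name" is detected as newly added, and it inherits the rule configured for "gender code".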
In the embodiment of the application, the total data amount of the newly added field in the target data table and the first abnormal data amount which does not accord with the corresponding quality rule can be determined.
As an example, assume that the target data table is shown in table 3, wherein the newly added field is "gender name", the total data amount of the newly added field in the target data table is 3, and the first abnormal data amount that does not meet the corresponding quality rule is 1.
TABLE 3 target data sheet
In this embodiment of the present application, for any target field, when a target field has a corresponding newly added field in a target data table, the amount of change of abnormal data of the target field under a corresponding quality rule may be determined according to the total data amount of the newly added field in the target data table and the first abnormal data amount that does not conform to the corresponding quality rule.
As a possible implementation manner, the change amount of the abnormal data of the target field under the corresponding quality rule can be determined according to the difference between the total data amount of the newly added field and the first abnormal data amount.
For example, the difference between the total data amount of the newly added field and the first abnormal data amount may be used as the change amount of the abnormal data of the target field under the corresponding quality rule. Still referring to table 3 in the above example, the total data amount of the newly added field "gender name" in the target data table is 3 and the first abnormal data amount that does not meet the corresponding quality rule is 1, so the difference, 3 - 1 = 2, may be used as the change amount of the abnormal data of the target field "gender code" (to which the newly added field corresponds) under the corresponding quality rule.
Therefore, under the condition that the target field has a corresponding newly added field in the target data table, the change amount of the abnormal data of the target field under the corresponding quality rule can be effectively determined.
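The calculation for a target field that has a newly added field can be sketched in Python as below. The column values are hypothetical, since the contents of table 3 are not reproduced in this text; they are chosen only so that the total is 3 and one value fails the check, matching the numbers in the example. The function names are likewise assumptions.

```python
def count_rule_violations(values, allowed):
    """Count values that fall outside the allowed dictionary value set."""
    return sum(1 for v in values if v not in allowed)


def variation_with_new_field(total_count, first_abnormal_count):
    """Change amount = total data amount of the new field - first abnormal data amount."""
    if not 0 <= first_abnormal_count <= total_count:
        raise ValueError("abnormal count must lie between 0 and the total count")
    return total_count - first_abnormal_count


# Hypothetical "gender name" column of the cleaned table (assumed, not from table 3).
gender_names = ["male", "female", None]
allowed_names = {"male", "female"}

total = len(gender_names)
first_abnormal = count_rule_violations(gender_names, allowed_names)
variation = variation_with_new_field(total, first_abnormal)
```

Here the change amount effectively counts how many values of the newly added field pass the inherited rule.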
Step 402, determining the variation of the abnormal data of the target field under the corresponding quality rule according to the second abnormal data amount of the target field in the original data table and the third abnormal data amount in the target data table when the target field does not have the corresponding newly added field in the target data table.
It is understood that the target field may not have a corresponding newly added field in the target data table. Still referring to table 2 in the above example, as shown in table 2, the target fields "id number", "name" and "update time" do not have corresponding newly added fields in the target data table.
In the embodiment of the present application, the second abnormal data amount may indicate a data amount of abnormal data of which the target field does not conform to the corresponding quality rule in the original data table.
In the embodiment of the present application, the third abnormal data amount may indicate a data amount of abnormal data of which the target field does not conform to the corresponding quality rule in the target data table.
In the embodiment of the application, in the case that the target field does not have a corresponding newly added field in the target data table, the second abnormal data amount of the target field in the original data table and the third abnormal data amount in the target data table may be determined.
As an example, assuming that the original data table is shown in table 4, the target data table is shown in table 3, where, for the target field "identification card number", it may be determined that the second abnormal data amount of the target field that does not conform to the corresponding quality rule "uniqueness check" in the original data table is 1, and the third abnormal data amount that does not conform to the corresponding quality rule "uniqueness check" in the target data table is 0.
Table 4 Original data table
| Identification card number | Name | Gender code | Update time |
| --- | --- | --- | --- |
| 42220219930330XX12 | Zhang San | 1 | 2023-01-01 11:05:10 |
| 42220219930330XX12 | Wang Wu | 2 | 2023-01-02 10:05:20 |
| 42220219910330XX13 | NULL | 1 | 20230102100520 |
| 42220219920330XX14 | Li Si | 5 | 20230102100520 |
| NULL | Zhao Liu | 1 | 20230102100520 |
Still referring to tables 3 and 4, for the target field "id card number", it may be determined that the second abnormal data amount of the target field that does not conform to the corresponding quality rule "id card check" in the original data table is 1, and the third abnormal data amount that does not conform to the corresponding quality rule "id card check" in the target data table is 0.
In the embodiment of the present application, when the target field does not have a corresponding newly added field in the target data table, the change amount of the abnormal data of the target field under the corresponding quality rule may be determined according to the second abnormal data amount of the target field in the original data table and the third abnormal data amount in the target data table.
As a possible implementation manner, the change amount of the abnormal data of the target field under the corresponding quality rule may be determined according to the difference between the second abnormal data amount and the third abnormal data amount.
For example, a difference between the second abnormal data amount and the third abnormal data amount may be used as the change amount of the abnormal data of the target field under the corresponding quality rule.
Still referring to tables 3 and 4 in the above examples, for the target field "identification card number", it may be determined that the second abnormal data amount of the target field that does not conform to the corresponding quality rule "identification card check" in the original data table is 1, the third abnormal data amount of the target field that does not conform to the corresponding quality rule "identification card check" in the target data table is 0, and the difference 1 between the second abnormal data amount and the third abnormal data amount may be used as the variation of the abnormal data of the target field "identification card number" under the corresponding quality rule "identification card check".
Therefore, under the condition that the target field does not have a corresponding newly added field in the target data table, the change amount of the abnormal data of the target field under the corresponding quality rule can be effectively determined.
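The following Python sketch works through this case for the "identification card number" column of table 4. Two points are assumptions rather than statements from the application: the uniqueness check is taken to count each redundant repeat of a non-null value once, and the cleaned column is invented for illustration because the contents of table 3 are not reproduced in this text. The sketch also treats the "identification card check" as a simple null check, which is a simplification.

```python
from collections import Counter


def count_nulls(values):
    """Count missing values (None stands in for NULL)."""
    return sum(1 for v in values if v is None)


def count_duplicates(values):
    """Count redundant repeats among non-null values (each extra copy counts once)."""
    counts = Counter(v for v in values if v is not None)
    return sum(c - 1 for c in counts.values())


def variation_without_new_field(second_abnormal, third_abnormal):
    """Change amount = abnormal count before cleaning - abnormal count after cleaning."""
    return second_abnormal - third_abnormal


# "Identification card number" column of the original data table (table 4).
original_ids = [
    "42220219930330XX12",
    "42220219930330XX12",
    "42220219910330XX13",
    "42220219920330XX14",
    None,
]
# Hypothetical cleaned column (assumed: duplicate and null rows removed).
cleaned_ids = ["42220219930330XX12", "42220219910330XX13", "42220219920330XX14"]

uniqueness_variation = variation_without_new_field(
    count_duplicates(original_ids), count_duplicates(cleaned_ids)
)
null_variation = variation_without_new_field(
    count_nulls(original_ids), count_nulls(cleaned_ids)
)
```

A positive result means cleaning reduced the number of abnormal values; a negative result would mean cleaning introduced new ones.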
As an example, assuming that the original data table is shown in table 4, the quality rules corresponding to the target fields in the original data table are shown in table 1, the original data table 4 is subjected to data cleaning, the target data table after data cleaning is shown in table 3, and the variation of the abnormal data of the target fields under the corresponding quality rules is shown in table 5:
Table 5 Variation of abnormal data of each target field under the corresponding quality rule
| Target field | Quality rule | Variation |
| --- | --- | --- |
| Identification card number | Identity card verification | 1 |
| Identification card number | Uniqueness verification | 1 |
| Name | Null value verification | 1 |
| Gender code | Dictionary table value field verification | 2 |
| Update time | Date and time verification | 3 |
In one possible implementation manner of the embodiment of the present application, for any target field, whether a cleaning abnormality occurs may be determined according to the amount of change of the abnormal data of the target field under the corresponding quality rule; in response to the occurrence of the cleaning abnormality, generating and sending prompt information according to the change amount of the abnormal data of the target field under the corresponding quality rule; the prompt information is used for prompting that the cleaning rule is invalid or the cleaning rule is wrong.
It should be noted that the data corresponding to each target field in the original data table may be cleaned according to the cleaning rule, so as to obtain the cleaned target data table. The cleaning rule may include, for example, date and time standardization, named entity cleaning, longitude and latitude conversion, blank removal, case conversion, character string interception, precision standardization, dictionary standardization, full-width/half-width conversion, specific character removal, designated data filtering, repeated redundant data removal, and the like, which is not limited in this application.
As an example, for any target field, in the case that a corresponding newly added field exists in the target data table in the target field, when the change amount of the abnormal data of the target field under the corresponding quality rule is equal to 0, that is, the difference between the total data amount of the newly added field in the target data table and the first abnormal data amount which does not conform to the corresponding quality rule is equal to 0, it is indicated that a cleaning abnormality occurs; at this time, corresponding first prompt information can be generated according to the variation of the abnormal data of the target field under the corresponding quality rule; the first prompt information may be used to indicate that the cleaning rule is invalid.
As another example, for any target field, in the case where the target field has a corresponding newly added field in the target data table, when the change amount of the abnormal data of the target field under the corresponding quality rule is greater than 0, that is, the difference between the total data amount of the newly added field and the first abnormal data amount is greater than 0, it is indicated that the cleaning rule is valid, and at this time, no hint information needs to be generated.
As a further example, for any target field, in the case where the target field does not have a corresponding newly added field in the target data table, when the amount of change of the abnormal data of the target field under the corresponding quality rule is less than 0, that is, the difference between the second abnormal data amount of the target field in the original data table and the third abnormal data amount in the target data table is less than 0, it is indicated that the abnormal data amount after cleaning is greater than the abnormal data amount before cleaning, and cleaning abnormality occurs; at this time, corresponding second prompt information can be generated according to the variation of the abnormal data of the target field under the corresponding quality rule; the second prompt message may be used to indicate a cleaning rule error.
As still another example, for any target field, in the case where the target field does not have a corresponding newly added field in the target data table, when the change amount of the abnormal data of the target field under the corresponding quality rule is equal to 0, that is, the difference between the second abnormal data amount and the third abnormal data amount is equal to 0, the abnormal data amount after cleaning is equal to the abnormal data amount before cleaning, and a cleaning abnormality occurs; at this time, corresponding third prompt information can be generated according to the change amount of the abnormal data of the target field under the corresponding quality rule; the third prompt information may be used to indicate that the cleaning rule is invalid.
As yet another example, for any target field, in the case where the target field does not have a corresponding newly added field in the target data table, when the change amount of the abnormal data of the target field under the corresponding quality rule is greater than 0, that is, the difference between the second abnormal data amount and the third abnormal data amount is greater than 0, the abnormal data amount after cleaning is smaller than the abnormal data amount before cleaning, and the cleaning rule corresponding to the target field is valid; at this time, no prompt information needs to be generated.
Therefore, when a cleaning abnormality occurs, the related staff can be explicitly prompted that the cleaning rule is invalid or wrong, so that they can adjust the cleaning rule in time.
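The decision logic in the examples above can be summarized in a short Python sketch. The function name `cleaning_prompt` and the prompt strings are hypothetical. The case of a negative change amount for a field with a newly added field is not described in the text, so the sketch rejects it as invalid input; that choice is an assumption.

```python
def cleaning_prompt(variation, has_new_field):
    """Return a prompt message for a cleaning abnormality, or None if cleaning looks effective."""
    if has_new_field:
        if variation < 0:
            # Not covered by the text: total minus abnormal count cannot be negative.
            raise ValueError("variation cannot be negative for a newly added field")
        if variation == 0:
            return "cleaning rule invalid"
        return None
    if variation < 0:
        # More abnormal data after cleaning than before.
        return "cleaning rule error"
    if variation == 0:
        # Cleaning left the abnormal data amount unchanged.
        return "cleaning rule invalid"
    return None


# Usage with hypothetical per-field results: (field, variation, has_new_field).
results = [
    ("gender code", 2, True),
    ("name", 0, False),
    ("update time", -1, False),
]
prompts = {field: cleaning_prompt(v, new) for field, v, new in results}
```

Fields whose prompt is `None` need no notification; the others could be forwarded to the staff responsible for the cleaning rule.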
According to the data cleaning effect detection method, under the condition that a corresponding newly added field exists in a target data table in any target field, according to the total data amount of the newly added field in the target data table and the first abnormal data amount which does not accord with the corresponding quality rule, the change amount of abnormal data of the target field under the corresponding quality rule is determined; and under the condition that the target field does not have a corresponding newly added field in the target data table, determining the change amount of the abnormal data of the target field under the corresponding quality rule according to the second abnormal data amount of the target field in the original data table and the third abnormal data amount in the target data table. Therefore, the change amount of the abnormal data of each target field in the original data table under the corresponding quality rule can be effectively determined.
In order to clearly explain how to determine the quality improvement rate of the data table based on the variation of the abnormal data of each target field under the corresponding quality rule in any embodiment of the application, the application also provides a data cleaning effect detection method.
Fig. 5 is a flow chart of a method for detecting an effect of data cleaning according to a third embodiment of the present disclosure.
As shown in fig. 5, according to any embodiment of the present application, the method for detecting an effect of data cleaning may further include the following steps:
step 501, for any target field, determining a field quality improvement rate of the target field according to the change amount of the abnormal data of the target field under the corresponding quality rule.
In this embodiment of the present application, for any target field, the field quality improvement rate of the target field may be determined according to the amount of change of the abnormal data of the target field under the corresponding quality rule.
Step 502, determining the quality improvement rate of the data table according to the first weight of each target field in the original data table and the field quality improvement rate of each target field.
In the embodiment of the application, the first weight of each target field in the original data table can be obtained.
As one possible implementation, a field weight value of each target field may be obtained; a second coefficient may be determined based on a sum of field weight values of the respective target fields; for any target field, the first weight of the target field in the original data table may be determined according to the duty ratio of the field weight value of the target field to the second coefficient.
It should be noted that, the field weight value of each target field may be set by related staff according to the dimensions of data management, service application, security protection, etc. of the field, for example, the field weight value of the target field "id card number" is 80, the field weight value of the target field "name" is 60, the field weight value of the target field "sex code" is 40, and the field weight value of the target field "update time" is 20.
It should be further noted that the above examples of the field weight values of the respective target fields are merely exemplary, and may be other in practical applications, which is not limited in this application.
As an example, assume that there are n target fields and that the field weight value of the i-th target field is $Fwv_i$, where n is a positive integer greater than 0, $i \in [1, n]$ and i is a positive integer. The second coefficient is determined as $\sum_{i=1}^{n} Fwv_i$. For the j-th target field, the first weight $Fw_j$ of the j-th target field in the original data table can be determined according to the ratio of the field weight value of the j-th target field to the second coefficient:
$$Fw_j = \frac{Fwv_j}{\sum_{i=1}^{n} Fwv_i}$$
where j is a positive integer and $j \in [1, n]$.
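As a small worked illustration of the first weight, the Python sketch below normalizes the example field weight values given earlier (80, 60, 40 and 20) by their sum, which plays the role of the second coefficient. The function name `first_weights` and the dictionary layout are assumptions made for this example.

```python
def first_weights(field_weight_values):
    """Divide each field weight value by the sum of all field weight values."""
    second_coefficient = sum(field_weight_values.values())
    if second_coefficient <= 0:
        raise ValueError("the sum of field weight values must be positive")
    return {
        field: value / second_coefficient
        for field, value in field_weight_values.items()
    }


# Example field weight values taken from the description above.
FIELD_WEIGHT_VALUES = {
    "identification card number": 80,
    "name": 60,
    "gender code": 40,
    "update time": 20,
}

weights = first_weights(FIELD_WEIGHT_VALUES)
```

With a second coefficient of 200, the four first weights come out as 0.4, 0.3, 0.2 and 0.1, and they sum to 1.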
In the embodiment of the application, the quality improvement rate of the data table can be determined according to the first weight of each target field in the original data table and the field quality improvement rate.
As an example, assume that the original data table has n target fields, that the first weight of the j-th target field in the original data table is $Fw_j$, and that the field quality improvement rate of the j-th target field is $Fr_j$. The data table quality improvement rate Tr may then be determined according to the following formula:
$$Tr = \sum_{j=1}^{n} Fw_j \times Fr_j \qquad (2)$$
where n is a positive integer, $j \in [1, n]$, and j is a positive integer.
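The weighted combination of field quality improvement rates can be sketched as follows. The function name is hypothetical, and the field quality improvement rates in the example are made-up values chosen only to exercise the arithmetic; only the first weights (40%, 30%, 20%, 10%) follow from the example field weight values in the text.

```python
def table_quality_improvement_rate(first_weights, field_rates):
    """Weighted sum over target fields: Tr = sum(Fw_j * Fr_j)."""
    if set(first_weights) != set(field_rates):
        raise ValueError("every target field needs both a first weight and a field rate")
    return sum(first_weights[name] * field_rates[name] for name in first_weights)


# First weights derived from the example field weight values 80, 60, 40, 20.
fw = {"identification card number": 0.4, "name": 0.3, "gender code": 0.2, "update time": 0.1}
# Hypothetical field quality improvement rates, for illustration only.
fr = {"identification card number": 1.0, "name": 0.5, "gender code": 0.5, "update time": 0.0}

tr = table_quality_improvement_rate(fw, fr)
print(f"{tr:.2%}")
```

Because the first weights sum to 1, Tr stays within the range spanned by the individual field rates.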
According to the data cleaning effect detection method, the field quality improvement rate of any target field is determined according to the change amount of abnormal data of the target field under the corresponding quality rule; and determining the quality improvement rate of the data table according to the first weight of each target field and the field quality improvement rate. Therefore, on one hand, the field quality improvement rate of each target field can be effectively determined, and on the other hand, the first weight of each target field is given, so that the data table quality improvement rate can be effectively determined based on the first weight of each target field and the field quality improvement rate.
In order to clearly explain how to determine the field quality improvement rate of the target field according to the change amount of the abnormal data of the target field under the corresponding quality rule in the above embodiment of the present application, the present application further provides a data cleaning effect detection method.
Fig. 6 is a flow chart of a method for detecting an effect of data cleaning according to a fourth embodiment of the present disclosure.
As shown in fig. 6, according to the above embodiment of the present application, the method for detecting the effect of data cleansing may further include the following steps:
Step 601, determining the data quality improvement rate of the target field under the corresponding quality rule according to the change amount of the target field under the corresponding quality rule.
In the embodiment of the application, the data quality improvement rate of the target field under the corresponding quality rule can be determined according to the change amount of the target field under the corresponding quality rule.
Step 602, determining the field quality improvement rate of the target field according to the data quality improvement rate of the target field under the corresponding quality rule and the field rule weight of the corresponding quality rule in the original data table.
It should be noted that, the quality rule corresponding to each target field may have a corresponding field rule weight in the original data table.
In order to clearly illustrate how to obtain the field rule weight of the quality rule corresponding to each target field in the original data table, in one possible implementation manner of the embodiment of the present application, for any target field, the rule weight value of the quality rule corresponding to the target field in the original data table may be obtained; and the field rule weight of the quality rule corresponding to the target field in the original data table can be determined according to the rule weight value of the quality rule corresponding to the target field in the original data table.
It should be noted that, the rule weight value of the quality rule corresponding to each target field in the original data table may be set by related staff according to actual requirements. For example, assume that the target fields in the original data table are "identification card number", "name", "gender code" and "update time", respectively, and the rule weight values of the quality rules corresponding to the target fields in the original data table are shown in table 6:
Table 6: Rule weight values of the quality rules corresponding to each target field in the original data table
It should be noted that, the foregoing examples of the rule weight values of the quality rules corresponding to the target fields in the original data table are merely exemplary, and in practical application, the rule weight values of the quality rules corresponding to the target fields in the original data table may be other, and may be set as required, which is not limited in this application.
In the embodiment of the present application, the field rule weight of the quality rule corresponding to the target field in the original data table may be determined according to the rule weight value of the quality rule corresponding to the target field in the original data table.
As a possible implementation, when the target field has only one corresponding quality rule, the field rule weight of that quality rule may be determined to be a set weight value.
In the embodiment of the present application, the set weight value may be preset, for example to 100% or 80%, which is not limited in this application.
As an example, still referring to Table 6 and assuming that the set weight value is 100%, the target field "gender code" has only one corresponding quality rule, namely "dictionary table value field check", so the field rule weight of the quality rule "dictionary table value field check" corresponding to the target field "gender code" is 100%.
As another possible implementation, when the target field has multiple corresponding quality rules, a first coefficient may be determined as the sum of the rule weight values of the quality rules corresponding to the target field in the original data table; and, for any quality rule corresponding to the target field, the field rule weight of that quality rule may be determined according to the ratio of the rule weight value of the quality rule to the first coefficient.
As an example, assume that the target field has m corresponding quality rules and that the rule weight value of the i-th quality rule corresponding to the target field in the original data table is $Frwv_i$, where m is a positive integer greater than 1, $i \in [1, m]$, and i is a positive integer. The first coefficient is determined as $\sum_{i=1}^{m} Frwv_i$. For the j-th quality rule corresponding to the target field, the field rule weight $Frw_j$ of the j-th quality rule may be determined according to the ratio of the rule weight value of the j-th quality rule to the first coefficient:
$$Frw_j = \frac{Frwv_j}{\sum_{i=1}^{m} Frwv_i} \qquad (3)$$
where j is a positive integer and $j \in [1, m]$.
For example, still referring to Table 6, the quality rules corresponding to the target field "identification card number" are "identification card check" and "uniqueness check". The rule weight value of the quality rule "identification card check" in the original data table is 30 and that of the quality rule "uniqueness check" is 20, so the sum of the rule weight values is 50 and the first coefficient is 50. For the quality rule "identification card check", the ratio of its rule weight value to the first coefficient is 30/50, so its field rule weight is 60%. For the quality rule "uniqueness check", the ratio is 20/50, so its field rule weight is 40%.
Therefore, the field rule weight of the quality rule corresponding to each target field in the original data table can be effectively obtained.
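The two cases above (a single quality rule, or several) can be sketched as follows. The function name `field_rule_weights` and the `set_weight` parameter name are hypothetical; the rule weight values 30 and 20 are the ones used in the example for the target field "identification card number".

```python
def field_rule_weights(rule_weight_values, set_weight=1.0):
    """Return the field rule weight of each quality rule configured for one target field."""
    if not rule_weight_values:
        raise ValueError("the target field needs at least one quality rule")
    if len(rule_weight_values) == 1:
        # Single rule: use the preset weight (100% by default in this sketch).
        (rule_name,) = rule_weight_values
        return {rule_name: set_weight}
    # Several rules: the "first coefficient" is the sum of their rule weight values.
    first_coefficient = sum(rule_weight_values.values())
    if first_coefficient <= 0:
        # Assumption: a zero sum is not covered by the text, so it is rejected here.
        raise ValueError("rule weight values must sum to a positive number")
    return {
        rule_name: value / first_coefficient
        for rule_name, value in rule_weight_values.items()
    }


id_rules = field_rule_weights({"identification card check": 30, "uniqueness check": 20})
gender_rules = field_rule_weights({"dictionary table value field check": 100})
print(id_rules, gender_rules)
```

Here the two rules of "identification card number" receive 60% and 40%, and the single rule of "gender code" receives the preset 100%.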
In the embodiment of the application, the field quality improvement rate of the target field can be determined according to the data quality improvement rate of the target field under the corresponding quality rule and the field rule weight of the corresponding quality rule in the original data table.
As a possible implementation manner, when the quality rule corresponding to the target field is one, the field quality improvement rate of the target field can be determined according to the data quality improvement rate of the target field under the quality rule and the field rule weight of the quality rule in the original data table.
As an example, when the quality rule corresponding to the target field is one, the product of the data quality improvement rate of the target field under the corresponding quality rule and the field rule weight of the corresponding quality rule in the original data table may be taken as the field quality improvement rate of the target field.
For example, assume that the original data table is shown in Table 4, the target data table is shown in Table 3, and the quality rules corresponding to each target field are shown in Table 1. The target field "name" has only one corresponding quality rule, namely "null check", so the field quality improvement rate of the target field "name" can be determined as Rr × Frw, the product of the data quality improvement rate Rr of the target field "name" under the quality rule "null check" and the field rule weight Frw of the quality rule "null check" in the original data table.
As another possible implementation manner, when the quality rules corresponding to the target field are multiple, the field quality improvement rate of the target field may be determined according to the data quality improvement rate of the target field under each corresponding quality rule and the field rule weight of each corresponding quality rule in the original data table.
As an example, when the quality rule corresponding to the target field is multiple, for any quality rule corresponding to the target field, determining the sub-field quality improvement rate of the target field under the quality rule according to the product of the data quality improvement rate of the target field under the quality rule and the field rule weight of the quality rule in the original data table; and the field quality improvement rate of the target field can be determined according to the sum of the quality improvement rates of the subfields.
For example, assume that the target field has m corresponding quality rules, that the data quality improvement rate of the target field under the i-th corresponding quality rule is $Rr_i$, and that the field rule weight of the i-th quality rule corresponding to the target field in the original data table is $Frw_i$, where m is a positive integer, $i \in [1, m]$, and i is a positive integer. For the i-th quality rule corresponding to the target field, the sub-field quality improvement rate of the target field under the i-th quality rule may be determined as $Rr_i \times Frw_i \times 100\%$ from the data quality improvement rate $Rr_i$ and the field rule weight $Frw_i$. The field quality improvement rate Fr of the target field may then be determined as the sum of the sub-field quality improvement rates:
$$Fr = \sum_{i=1}^{m} Rr_i \times Frw_i \times 100\% \qquad (4)$$
According to the data cleaning effect detection method, the data quality improvement rate of the target field under the corresponding quality rule is determined according to the change amount of the target field under the corresponding quality rule; and the field quality improvement rate of the target field is determined according to the data quality improvement rate of the target field under the corresponding quality rule and the field rule weight of the corresponding quality rule in the original data table. Therefore, the field quality improvement rate of each target field in the original data table can be effectively determined.
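A minimal sketch of the field-level aggregation follows. The function name is hypothetical, and the data quality improvement rates in the example are invented for illustration; the field rule weights 60% and 40% are those derived in the example for "identification card number". The single-rule case is simply the sum over one term.

```python
def field_quality_improvement_rate(rule_rates, rule_weights):
    """Sum of sub-field rates for one target field: Fr = sum(Rr_i * Frw_i)."""
    if set(rule_rates) != set(rule_weights):
        raise ValueError("every quality rule needs both a rate and a field rule weight")
    return sum(rule_rates[rule] * rule_weights[rule] for rule in rule_rates)


# Field rule weights from the example in the text (30/50 and 20/50).
frw = {"identification card check": 0.6, "uniqueness check": 0.4}
# Hypothetical data quality improvement rates under each rule.
rr = {"identification card check": 0.5, "uniqueness check": 1.0}

fr_id = field_quality_improvement_rate(rr, frw)
print(f"{fr_id:.0%}")
```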
In order to clearly explain how to determine the data quality improvement rate of the target field under the corresponding quality rule according to the change amount of the target field under the corresponding quality rule in the above embodiment of the present application, the present application further provides a data cleaning effect detection method.
Fig. 7 is a flow chart of a method for detecting an effect of data cleaning according to a fifth embodiment of the present application.
As shown in fig. 7, according to the above embodiment of the present application, the method for detecting the effect of data cleansing may further include the following steps:
Step 701, in the case where the target field has a corresponding newly added field in the target data table, determining the data quality improvement rate of the target field under the corresponding quality rule according to the ratio of the change amount of the target field under the corresponding quality rule to the total data amount of the newly added field in the target data table.
It should be noted that the explanation given for step 401 of the case where the target field has a corresponding newly added field in the target data table also applies to this embodiment and is not repeated here.
In the embodiment of the present application, when the target field has a corresponding newly added field in the target data table, the data quality improvement rate of the target field under the corresponding quality rule may be determined according to the ratio of the change amount of the target field under the corresponding quality rule to the total data amount of the newly added field in the target data table.
As an example, in the case where the target field has a corresponding newly added field in the target data table, assume that the total data amount of the newly added field corresponding to the target field in the target data table is A and that the change amount of the target field under the corresponding quality rule is B. The ratio $B/A$ of the change amount of the target field under the corresponding quality rule to the total data amount of the newly added field in the target data table may be taken as the data quality improvement rate of the target field under the corresponding quality rule.
For example, assume that the original data table is shown in Table 4, the target data table is shown in Table 3, and the change amount of the abnormal data of each target field in the original data table under the corresponding quality rule is shown in Table 5, where the target fields include "identification card number", "name", "gender code", and "update time". The target field "gender code" has a corresponding newly added field "gender name" in the target data table. The total data amount of the newly added field "gender name" in the target data table is 3, and the change amount of the target field "gender code" under the corresponding quality rule is 2. According to the ratio of the change amount to the total data amount of the newly added field, the data quality improvement rate of the target field "gender code" under the corresponding quality rule is 2/3.
Step 702, determining a data quality improvement rate of the target field under the corresponding quality rule according to a ratio of a change amount of the target field under the corresponding quality rule to a fourth abnormal data amount of the target field in the original data table under the condition that the target field does not have the corresponding newly added field in the target data table.
In this embodiment of the present application, when the target field does not have a corresponding newly added field in the target data table, the data quality improvement rate of the target field under the corresponding quality rule may be determined according to the ratio of the change amount of the target field under the corresponding quality rule to the fourth abnormal data amount of the target field in the original data table.
As an example, when the target field has no corresponding newly added field in the target data table, assume that the fourth abnormal data amount of the target field in the original data table is D and that the change amount of the target field under the corresponding quality rule is C. The ratio $C/D$ of the change amount to the fourth abnormal data amount may be taken as the data quality improvement rate of the target field under the corresponding quality rule.
For example, assume that the original data table is shown in Table 4, the target data table is shown in Table 3, and the change amount of the abnormal data of each target field in the original data table under the corresponding quality rule is shown in Table 5, where the target fields include "identification card number", "name", "gender code", and "update time". The target field "identification card number" has no corresponding newly added field in the target data table. The fourth abnormal data amount, that is, the amount of data of the target field "identification card number" in the original data table that does not conform to the corresponding quality rule "identification card check", may be determined; the change amount of the target field "identification card number" under the quality rule "identification card check" is 1; and the ratio of this change amount to the fourth abnormal data amount may be determined as the data quality improvement rate of the target field "identification card number" under the quality rule "identification card check".
As another example, for the target field "identification card number", the fourth abnormal data amount that does not conform to the corresponding quality rule "uniqueness check" in the original data table may be determined to be 1, and the change amount of the target field "identification card number" under the quality rule "uniqueness check" may be determined to be 1. The ratio of this change amount to the fourth abnormal data amount may be determined as the data quality improvement rate of the target field "identification card number" under the quality rule "uniqueness check".
According to the data cleaning effect detection method, in the case where the target field has a corresponding newly added field in the target data table, the data quality improvement rate of the target field under the corresponding quality rule is determined according to the ratio of the change amount of the target field under the corresponding quality rule to the total data amount of the newly added field in the target data table; and in the case where the target field has no corresponding newly added field in the target data table, the data quality improvement rate of the target field under the corresponding quality rule is determined according to the ratio of the change amount of the target field under the corresponding quality rule to the fourth abnormal data amount of the target field in the original data table. Therefore, the data quality improvement rate of each target field under the corresponding quality rule can be effectively determined.
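The choice of denominator in steps 701 and 702 can be sketched as follows. The function and parameter names are hypothetical. The text does not say what happens when the denominator is zero, so rejecting that case is an assumption of this sketch.

```python
def data_quality_improvement_rate(change_amount,
                                  new_field_total=None,
                                  original_abnormal_amount=None):
    """Rate of one target field under one quality rule.

    Step 701: if the field has a newly added field, divide by that field's total data amount.
    Step 702: otherwise divide by the fourth abnormal data amount in the original table.
    """
    if new_field_total is not None:
        denominator = new_field_total
    elif original_abnormal_amount is not None:
        denominator = original_abnormal_amount
    else:
        raise ValueError("either new_field_total or original_abnormal_amount is required")
    if denominator <= 0:
        # Assumption: a zero denominator is not addressed in the text.
        raise ValueError("the denominator must be positive")
    return change_amount / denominator


# "gender code": newly added field "gender name" has 3 rows, change amount is 2.
gender_rate = data_quality_improvement_rate(2, new_field_total=3)
# "identification card number" under "uniqueness check": change amount 1, abnormal amount 1.
unique_rate = data_quality_improvement_rate(1, original_abnormal_amount=1)
print(gender_rate, unique_rate)
```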
In order to more clearly describe the effect detection method of data cleaning of the present application, the following description is made in detail with reference to examples.
As an application scenario, the method for detecting the effect of data cleaning may be applied where data cleaning is performed based on a data warehouse. Fig. 8 shows the data flow between the source layer and the standard layer in such a scenario, where:
Source layer (ODS layer): stores the data accessed from each data source without processing the data content, and provides the original data for data cleaning.
Cleaning rules: common cleaning rules include date and time standardization, cleaning of named entities (such as resident identification card numbers, mailbox addresses, unified social credit codes, and mobile phone numbers), longitude and latitude conversion, whitespace removal, case conversion, character string truncation, precision standardization, dictionary standardization, full-width/half-width conversion, removal of specific characters, filtering of specified data, removal of duplicate redundant data, and the like. It should be noted that some cleaning rules, such as dictionary standardization, add a new field to the cleaned data table, whereas others, such as case conversion, do not.
Field mapping relationship: the mapping relationship (the association relationship is referred to as the mapping relationship in this application) between the data tables and fields before and after data cleaning can be obtained through a data lineage system. Data lineage is the dependency relationship that data processing logic forms among data tables and among data fields, and the data lineage system is a tool for managing it.
As an example, the original data table of the source layer (ODS layer), such as personnel basic information (ods_ryjbxx), contains the fields identification card number (sfzh), name (xm), gender code (xbdm), and update time (gxsj). After data cleaning, the fields in the target data table personnel basic information (std_ryjbxx) of the standard layer (STD layer) are identification card number (sfzh), name (xm), gender code (xbdm), gender name (xbmc), and update time (gxsj). The mapping relationship between the fields in the original data table and the fields in the target data table is shown in Table 7:
Table 7: Mapping relationship between fields in the original data table and fields in the target data table
The gender code (xbdm) field undergoes dictionary standardization during data cleaning, so the target data table of the STD layer gains the newly added field gender name (xbmc) after data cleaning. For example, the original data table of the ODS layer stores the gender codes 1 and 2, whose corresponding standard dictionary values are male and female, respectively; when the data table quality improvement rate is calculated, the field to which the gender code (xbdm) of the ODS layer is mapped is the gender name (xbmc) of the STD layer.
Standard layer (STD layer): stores the standardized data obtained after data cleaning.
After data cleaning is completed, the inventors use the calculated data table quality improvement rate to evaluate how much the data quality has improved. The data table quality improvement rate indicates the improvement of the data quality after data cleaning relative to the data quality before data cleaning, and is one way of evaluating the effect of data cleaning. To calculate it, the data quality of each row of each field of the data table before and after data cleaning is obtained and analyzed, and the field quality improvement rate and the data table quality improvement rate are calculated. The data table quality improvement rate depends on the field quality improvement rates and can be obtained by a weighted calculation over them.
It should be further noted that, in calculating the data table quality improvement rate, a data lineage system may be used to determine the mapping relationships between the source layer tables and fields and the standard layer tables and fields before and after data cleaning. The calculation also relies on a data quality system, which may be a tool for evaluating, monitoring, and managing data quality based on data quality rules; the quality of the fields before and after cleaning is checked through the data quality system.
According to the characteristics of the fields being cleaned, two cases are distinguished:
In the first case, data cleaning adds no field to the cleaned standard layer table, as with case conversion and identification card check. Before and after data cleaning, the mapping relationship between the ODS layer table field and the STD layer table field is "1 to 1", and the STD layer table field can directly inherit the quality rule of the ODS layer table field. For example, for the mapping in Table 7 of the "identification card number" field between the ODS layer and STD layer personnel basic information tables, the STD layer table field "identification card number" can directly inherit the quality rule of the ODS layer table field "identification card number".
In the second case, data cleaning adds a new field to the cleaned standard layer table, as with dictionary table cleaning. The mapping relationship between the ODS layer table field and the STD layer table fields is then "1 to 2". For example, for the cleaning of the "gender code" field in Table 7, the ODS layer has "gender code (xbdm)", and after data cleaning the STD layer has two corresponding fields, "gender code (xbdm)" and "gender name (xbmc)". In this case, the mapping between the STD layer "gender name (xbmc)" and the ODS layer "gender code (xbdm)" may be used, and the STD layer field "gender name (xbmc)" inherits the quality rule of the ODS layer "gender code (xbdm)".
Based on the above description, the method for detecting the effect of data cleaning in the present application may include the following steps:
Step one: according to the quality requirements that data warehouse construction places on data table fields, a service expert may configure quality rules and field quality rule weight values (referred to as rule weight values in this application) for each target field in the original data table of the ODS layer, and the field rule weight of each quality rule of a target field may be calculated from the quality rules configured for that field. The field rule weight indicates the share of a single quality rule among all rules configured for the corresponding field. It should be noted that quality rules are configured in order to check the quality of a field, and rule weight values are configured in order to identify the relative importance of the quality rules of the same field, so that the quality of the field can be evaluated more accurately.
The field rule weight of the quality rule for each target field may be calculated in the following manner:
Assume that the target field has m corresponding quality rules and that the field quality rule weight value of the i-th quality rule corresponding to the target field in the original data table is $Frwv_i$, where m is a positive integer, $i \in [1, m]$, and i is a positive integer. Based on the sum of the field quality rule weight values of the quality rules corresponding to the target field in the original data table, the first coefficient is determined as $\sum_{i=1}^{m} Frwv_i$. For the j-th quality rule corresponding to the target field, the field rule weight $Frw_j$ of the j-th quality rule is determined from the ratio of the field quality rule weight value of the j-th quality rule to the first coefficient, according to formula (3), where $j \in [1, m]$ and j is a positive integer.
The quality rule weight value may be an integer value.
As an example, Table 8 shows the quality rules, field quality rule weight values, and field rule weights configured for each field of the personnel basic information table (ods_ryjbxx table) of the source layer.
Table 8: Quality rules and field rule weights of the target fields of the personnel basic information table

| Field name | Quality rule | Field quality rule weight value | Field rule weight (%) |
| --- | --- | --- | --- |
| Identification card number | Identification card check | 30 | 60 |
| Identification card number | Uniqueness check | 20 | 40 |
| Name | Null check | 100 | 100 |
| Gender code | Dictionary table value field check | 100 | 100 |
| Update time | Date and time check | 100 | 100 |
Step two: configure a field weight value for each target field of the original data table, and calculate the table field weight (the first weight in this application) of each target field from the field weight values of the target fields. The table field weight is the share of a field's weight value in the sum of the field weight values of all target fields in the original data table.
It should be noted that the service expert may evaluate the field weight value of a target field according to dimensions of the field such as data management, service application, and security protection. The table field weights are configured in order to distinguish the importance of the target fields in the original data table, so that the influence of the quality of a single field on the quality of the whole data table can be evaluated more accurately.
The table field weight for each target field in the original data table may be calculated as follows:
Assume that there are n target fields and that the field weight value of the i-th target field is $Fwv_i$, where n is a positive integer, $i \in [1, n]$, and i is a positive integer. The second coefficient is determined as $\sum_{i=1}^{n} Fwv_i$. For the j-th target field, the table field weight $Fw_j$ of the j-th target field in the original data table is determined from the ratio of the field weight value of the j-th target field to the second coefficient, according to formula (1), where $j \in [1, n]$ and j is a positive integer.
As an example, Table 9 shows the field weight values and table field weights configured for each field of the personnel basic information table (ods_ryjbxx table) of the source layer.
Table 9: Field weight values and table field weights of the fields of the source layer personnel basic information table

| Field name | Field weight value | Table field weight (%) |
| --- | --- | --- |
| Identification card number | 80 | 40 |
| Name | 60 | 30 |
| Gender code | 40 | 20 |
| Update time | 20 | 10 |
Step three: determine whether data cleaning adds a new field to the target data table. The mapping relationship (or lineage relationship) between the target fields of the ODS table (referred to as the original data table in this application) and the fields of the STD table (referred to as the target data table in this application) can be obtained through the data lineage system; the mapping relationship between the ODS table fields and the STD table fields is shown in Table 7. Where there is no newly added field, the STD table field may inherit the quality rule of the corresponding ODS table field. Where there is a newly added field, the newly added field of the STD table may inherit the quality rule of the ODS table target field that has a mapping relationship with it, and the other STD table field mapped to that same target field may be excluded from the evaluation calculation. For example, when the newly added field of the STD table is "gender name (xbmc)" and the ODS table field that has a mapping relationship with it is "gender code (xbdm)", the field "gender code (xbdm)" in the STD table may be excluded from the calculation.
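The field selection in step three can be sketched as follows. The function name and the dictionary layout of the lineage mapping are hypothetical, and treating an STD field as "newly added" when its name does not occur among the ODS fields is an assumption of this sketch; a real data lineage system would supply that information directly. The field identifiers are those of the ods_ryjbxx and std_ryjbxx example.

```python
def evaluated_std_fields(lineage):
    """Map each ODS target field to the STD field that inherits its quality rules.

    `lineage` maps an ODS field name to the list of STD field names derived from it.
    """
    ods_fields = set(lineage)
    selected = {}
    for ods_field, std_fields in lineage.items():
        if not std_fields:
            raise ValueError(f"no STD field is mapped to {ods_field!r}")
        # Assumption: an STD field whose name is absent from the ODS table is newly added.
        new_fields = [name for name in std_fields if name not in ods_fields]
        # "1 to 2" case: evaluate the newly added field and skip the same-named one.
        # "1 to 1" case: evaluate the single mapped STD field.
        selected[ods_field] = new_fields[0] if new_fields else std_fields[0]
    return selected


lineage = {
    "sfzh": ["sfzh"],
    "xm": ["xm"],
    "xbdm": ["xbdm", "xbmc"],  # dictionary standardization adds xbmc
    "gxsj": ["gxsj"],
}
print(evaluated_std_fields(lineage))
```

With this mapping, "xbdm" is evaluated through "xbmc", and the STD field "xbdm" is left out of the calculation.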
Step four: according to the configured quality rules, quality checks can be run on the ODS table and the STD table through a data quality system. The indexes that can be obtained through statistics include: the amount of abnormal data in each field of the data table, before and after data cleaning, that does not conform to the corresponding quality rule, and the total amount of data in each field of the data table before and after data cleaning. From these indexes, the change amount of the abnormal data of each target field of the ODS table under the corresponding quality rule after data cleaning can be calculated. The statistics fall into two cases:
1) For any target field in the ODS table, if after data cleaning the STD table contains a newly added field that has a lineage relationship (or mapping relationship) with the target field, determine the total data amount of the newly added field in the STD table and the first abnormal data amount, that is, the amount of data in the newly added field that does not conform to the corresponding quality rule. The change amount of the abnormal data of the target field under the corresponding quality rule can then be determined from the difference between the total data amount of the newly added field and the first abnormal data amount.
For example, assuming that the total data amount of the newly added field in the STD table is staod and the first abnormal data amount of the newly added field in the STD table is stuaod, the change amount Edr of the abnormal data of the target field that has a mapping relationship with the newly added field under the corresponding quality rule is:
Edr=staod-stuaod; (5)
By analyzing the change amount of the abnormal data of the newly added field, it is possible to evaluate whether the cleaning rule or logic is correct. There are two cases:
1. When Edr=0, the first abnormal data amount of the newly added field after data cleaning equals the total data amount of the newly added field after data cleaning, which indicates that the cleaning rule or logic is invalid; the data quality system can remind the data service personnel that the cleaning rule or logic may need to be adjusted.
2. When Edr>0, the first abnormal data amount of the newly added field after data cleaning is smaller than the total data amount of the newly added field after data cleaning, which indicates that the cleaning rule or logic is valid and has taken effect.
2) If, after data cleaning, the STD table contains no newly added field mapped to the target field in the ODS table, determine the second abnormal data amount of the target field in the ODS table and the third abnormal data amount of the target field in the STD table. The change amount of the abnormal data of the target field under the corresponding quality rule can then be determined from the difference between the second abnormal data amount and the third abnormal data amount.
For example, assuming that the third abnormal data amount of the target field in the STD table is stuaod and the second abnormal data amount of the target field in the ODS table is otuaod, the change amount Edr of the abnormal data of the target field under the corresponding quality rule is:
Edr=otuaod-stuaod; (6)
By analyzing the change amount of the abnormal data of the target field, it is possible to evaluate whether the cleaning rule or logic is correct. There are three cases:
1. When Edr<0, the third abnormal data amount of the target field in the STD table after data cleaning is larger than the second abnormal data amount of the target field in the ODS table before cleaning, which indicates that the cleaning rule or logic is wrong; the data service personnel can be notified through the data quality system to adjust the cleaning rule or logic promptly.
2. When Edr=0, the third abnormal data amount of the target field in the STD table after data cleaning equals the second abnormal data amount of the target field in the ODS table before cleaning, which indicates that the cleaning rule or logic is invalid; the data quality system can remind the data service personnel that the cleaning rule or logic may need to be adjusted.
3. When Edr>0, the third abnormal data amount of the target field in the STD table after data cleaning is smaller than the second abnormal data amount of the target field in the ODS table before cleaning, which indicates that the cleaning rule or logic is valid and has taken effect.
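Formulas (5) and (6) and the case analysis above can be summarized in a short Python sketch. The function names and the result labels "valid", "invalid" and "wrong" are assumptions made here for illustration. The text does not discuss Edr<0 for a newly added field, so the sketch treats that input as inconsistent, since an abnormal count cannot exceed the total count.

```python
def edr_with_new_field(staod, stuaod):
    """Formula (5): total amount of the newly added STD field minus its abnormal amount."""
    return staod - stuaod


def edr_without_new_field(otuaod, stuaod):
    """Formula (6): abnormal amount in the ODS table minus abnormal amount in the STD table."""
    return otuaod - stuaod


def classify_cleaning(edr, has_new_field):
    """Map the change amount Edr to an assessment of the cleaning rule or logic."""
    if edr > 0:
        return "valid"    # the cleaning rule took effect
    if edr == 0:
        return "invalid"  # the cleaning rule had no effect
    if has_new_field:
        # Assumption: not covered by the text; abnormal rows cannot exceed total rows.
        raise ValueError("abnormal amount exceeds total amount of the newly added field")
    return "wrong"        # cleaning produced more abnormal data than before


# Illustrative numbers only; they are not taken from Tables 10 or 11.
print(classify_cleaning(edr_with_new_field(staod=4, stuaod=0), has_new_field=True))
print(classify_cleaning(edr_without_new_field(otuaod=2, stuaod=3), has_new_field=False))
```

A caller could use the "invalid" and "wrong" results to trigger the reminder or notification to data service personnel mentioned in the text.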
As an example, the data in the personnel basic information table of the ODS layer is as shown in table 10:
TABLE 10 personnel basic information Table of ODS layer data
After data cleaning, the resulting data of the STD-layer personnel basic information table is as shown in Table 3. The cleaning includes operations such as filtering out records with an empty identification card number, removing duplicate identification card numbers (ordered by update time), stripping blanks from the gender field, translating gender codes through the gender code dictionary table, and normalizing the update time. Through the quality check system, the relevant data indexes of the personnel basic information table before and after data cleaning can be counted, as shown in Table 11:
TABLE 11 personnel basic information Table related data indicators before and after data cleaning
Step five: after data cleaning, calculate the data quality improvement rate of a single target field under a given quality rule. There are two cases:
1) For any target field in the ODS table, if after data cleaning the STD table contains a newly added field that has a lineage relationship (or mapping relationship) with the target field, the total data amount of the newly added field in the STD table can be determined. The data quality improvement rate of the target field under the corresponding quality rule can then be determined from the ratio of the change amount of the abnormal data of the target field under the corresponding quality rule to the total data amount of the newly added field in the STD table.
As an example, assuming that the change amount of the abnormal data of the target field that has a mapping relationship with the newly added field under the corresponding quality rule is Edr and the total data amount of the newly added field in the STD table is staod, the data quality improvement rate Rr of that target field under the corresponding quality rule may be determined according to the following formula:
Rr=Edr/staod*100%; (7)
2) If, after data cleaning, the STD table contains no newly added field mapped to the target field in the ODS table, determine the fourth abnormal data amount of the target field in the ODS table. The data quality improvement rate of the target field under the corresponding quality rule is then determined from the ratio of the change amount of the target field under the corresponding quality rule to the fourth abnormal data amount.
As an example, assuming that the change amount of the abnormal data of the target field under the corresponding quality rule is Edr and the fourth abnormal data amount of the target field in the ODS table is otuaod, the data quality improvement rate Rr of the target field under the corresponding quality rule may be determined according to the following formula:
Rr=Edr/otuaod*100%; (8)
Accordingly, using formula (7) and formula (8), the data quality improvement rate of each target field in Table 11 under the corresponding quality rule is calculated as shown in Table 12:
Table 12 Data quality improvement rate of each target field under the corresponding quality rule
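As a hedged illustration of formulas (7) and (8), the following Python sketch computes the data quality improvement rate Rr for both cases. The function names are assumptions introduced here, and the sample numbers are illustrative rather than values from Table 11. The text does not say what happens when the denominator is zero, so returning `None` in that case is also an assumption.

```python
def rr_with_new_field(staod, stuaod):
    """Formula (7): Rr = Edr / staod * 100%, with Edr = staod - stuaod."""
    if staod == 0:
        return None  # assumption: rate is undefined when the field has no data
    return (staod - stuaod) * 100 / staod


def rr_without_new_field(otuaod, stuaod):
    """Formula (8): Rr = Edr / otuaod * 100%, with Edr = otuaod - stuaod."""
    if otuaod == 0:
        return None  # assumption: nothing was abnormal before cleaning
    return (otuaod - stuaod) * 100 / otuaod


# Illustrative inputs only.
print(rr_with_new_field(staod=4, stuaod=1))      # 3 of 4 rows conform after cleaning
print(rr_without_new_field(otuaod=3, stuaod=1))  # 2 of 3 abnormal rows were fixed
```

Note that the two cases use different denominators: the total data amount of the newly added field in the first case, and the abnormal data amount before cleaning in the second.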
Step six: calculate the field quality improvement rate of a single target field after data cleaning.
For any quality rule corresponding to a single target field, the subfield quality improvement rate of the target field under that quality rule is determined from the product of the data quality improvement rate of the target field under the quality rule and the field rule weight of the quality rule in the ODS table; the field quality improvement rate of the target field can then be determined from the sum of the subfield quality improvement rates.
As an example, assume that the target field corresponds to m quality rules, that the data quality improvement rate of the target field under its i-th quality rule is $Rr_i$, and that the field rule weight of the i-th quality rule corresponding to the target field in the ODS table (the original data table) is $Frw_i$, where m is a positive integer, $i \in [1, m]$, and i is a positive integer. For the i-th quality rule corresponding to the target field, the subfield quality improvement rate of the target field under that rule is determined from $Rr_i$ and $Frw_i$ as $Rr_i \cdot Frw_i \cdot 100\%$. The field quality improvement rate Fr of the target field is then the sum of the subfield quality improvement rates over the m quality rules, that is, it is determined according to formula (4).
Thus, based on tables 12 and 8, the field quality improvement rate of each target field of the ODS layer personnel basic information table can be calculated using the formula (4), as shown in table 13:
table 13 field quality improvement ratio of each target field of the personnel basic information table of ODS layer
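The weighted sum described in step six can be sketched as follows. This is only an illustration of the sum of subfield quality improvement rates that the text attributes to formula (4); the function name and the sample rates and weights are hypothetical and are not taken from Tables 8, 12 or 13.

```python
def field_quality_improvement_rate(rule_results):
    """Sum of Rr_i * Frw_i over the quality rules of one target field.

    rule_results: iterable of (rr_percent, frw_percent) pairs, one per rule.
    Returns the field quality improvement rate Fr in percent.
    """
    return sum(rr * frw / 100 for rr, frw in rule_results)


# Hypothetical field with two rules: Rr = 100% at weight 60%, Rr = 50% at weight 40%.
fr_two_rules = field_quality_improvement_rate([(100, 60), (50, 40)])
# A field with a single rule carrying the full weight keeps its Rr unchanged.
fr_one_rule = field_quality_improvement_rate([(50, 100)])
print(fr_two_rules, fr_one_rule)
```

In the hypothetical two-rule case the subfield rates are 60% and 20%, giving a field quality improvement rate of 80%.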
Step seven: calculate the data table quality improvement rate.
The data table quality improvement rate is determined according to the table field weight and the field quality improvement rate of each target field in the ODS table.
As an example, assume that the ODS table has n target fields, that the table field weight of the j-th target field is $Fw_j$, and that the field quality improvement rate of the j-th target field is $Fr_j$; the data table quality improvement rate Tr can then be determined according to formula (2), where n is a positive integer, $j \in [1, n]$, and j is a positive integer.
Thus, based on tables 9 and 13, the data table quality improvement rate Q of the ODS layer personnel basic information table of table 10 after data management or data cleaning can be calculated by using the formula (9) as follows:
Q=100%*40%+50%*30%+66.7%*20%+100%*10%=78.34%; (9)
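The arithmetic of formula (9) can be checked with a short Python sketch. The function name is an assumption introduced here; the four (table field weight, field quality improvement rate) pairs are exactly the terms that appear in formula (9).

```python
def table_quality_improvement_rate(fields):
    """Weighted sum of field quality improvement rates.

    fields: iterable of (table_field_weight_percent, field_rate_percent) pairs.
    Returns the data table quality improvement rate in percent.
    """
    return sum(fw * fr / 100 for fw, fr in fields)


# (Fw_j, Fr_j) pairs in the order of the terms of formula (9).
formula_9_terms = [(40, 100), (30, 50), (20, 66.7), (10, 100)]
q = table_quality_improvement_rate(formula_9_terms)
print(round(q, 2))
```

The four terms contribute 40, 15, 13.34 and 10 percentage points respectively, which add up to the 78.34% stated in formula (9).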
In summary, on the basis of the data cleaning flow of a data warehouse and with the aim of improving data quality, a method for detecting the effect of data cleaning is provided. The method can evaluate how much the data quality has improved after data cleaning, and can also provide a strategy for judging whether the cleaning rules and logic configured for data cleaning are correct.
According to the data cleaning effect detection method, once data cleaning has been performed, the data cleaning system, the data lineage system, and the data quality system can be used together: the improvement in data quality after data cleaning is evaluated using the quality rules corresponding to the data and the table field weights and field rule weights provided by the relevant staff. In this way the effect of data cleaning can be verified, and problems in the data cleaning process can be found proactively and in time.
Corresponding to the data cleaning effect detection methods provided in the above embodiments, an embodiment of the present application further provides a data cleaning effect detection device. Since the data cleaning effect detection device provided in the embodiment of the present application corresponds to the data cleaning effect detection method provided in the above several embodiments, implementation of the data cleaning effect detection method is also applicable to the data cleaning effect detection device provided in the embodiment, and will not be described in detail in the embodiment.
Fig. 9 is a schematic structural diagram of a data cleaning effect detection device according to a sixth embodiment of the present application.
As shown in fig. 9, the data cleaning effect detection apparatus 900 may include: a first acquisition module 910, a second acquisition module 920, a comparison module 930, and a first determination module 940.
Wherein, the first obtaining module 910 is configured to obtain a target data table and an original data table; the target data table is obtained by cleaning data corresponding to at least one target field in the original data table.
The second obtaining module 920 is configured to obtain the quality rules corresponding to each target field.
And the comparison module 930 is configured to compare the data amounts of the abnormal data in the target data table and the original data table, respectively, for at least one target field, so as to obtain the variation of the abnormal data of each target field under the corresponding quality rule.
The first determining module 940 is configured to determine a quality improvement rate of the data table based on a variation of the abnormal data of each target field under the corresponding quality rule.
In one possible implementation of the embodiment of the present application, the comparison module 930 is configured to: for any target field, in the case where the target field has a corresponding newly added field in the target data table, determine the change amount of the abnormal data of the target field under the corresponding quality rule according to the total data amount of the newly added field in the target data table and the first abnormal data amount that does not conform to the corresponding quality rule; and in the case where the target field has no corresponding newly added field in the target data table, determine the change amount of the abnormal data of the target field under the corresponding quality rule according to the second abnormal data amount of the target field in the original data table and the third abnormal data amount of the target field in the target data table.
In one possible implementation manner of the embodiment of the present application, a first determining module 940 is configured to: for any target field, determining the field quality improvement rate of the target field according to the variation of the abnormal data of the target field under the corresponding quality rule; and determining the quality improvement rate of the data table according to the first weight of each target field in the original data table and the field quality improvement rate.
In one possible implementation manner of the embodiment of the present application, a first determining module 940 is configured to: according to the variable quantity of the target field under the corresponding quality rule, determining the data quality improvement rate of the target field under the corresponding quality rule; and determining the field quality improvement rate of the target field according to the data quality improvement rate of the target field under the corresponding quality rule and the field rule weight of the corresponding quality rule in the original data table.
In one possible implementation of the embodiment of the present application, the first determining module 940 is configured to: in the case where the target field has a corresponding newly added field in the target data table, determine the data quality improvement rate of the target field under the corresponding quality rule according to the ratio of the change amount of the target field under the corresponding quality rule to the total data amount of the newly added field in the target data table; and in the case where the target field has no corresponding newly added field in the target data table, determine the data quality improvement rate of the target field under the corresponding quality rule according to the ratio of the change amount of the target field under the corresponding quality rule to the fourth abnormal data amount of the target field in the original data table.
In one possible implementation manner of the embodiment of the present application, a first determining module 940 is configured to: under the condition that the quality rule corresponding to the target field is one, determining the field quality improvement rate of the target field according to the data quality improvement rate of the target field under the corresponding quality rule and the field rule weight of the corresponding quality rule in the original data table; and under the condition that the number of the quality rules corresponding to the target field is multiple, determining the field quality improvement rate of the target field according to the data quality improvement rate of the target field under each corresponding quality rule and the field rule weight of each corresponding quality rule in the original data table.
In one possible implementation manner of the embodiment of the present application, the field rule weight of the quality rule corresponding to the target field in the original data table is obtained through the following modules:
and the third acquisition module is used for acquiring a rule weight value of the quality rule corresponding to the target field in the original data table aiming at any target field.
And the second determining module is used for determining the field rule weight of the quality rule corresponding to the target field in the original data table according to the rule weight value of the quality rule corresponding to the target field in the original data table.
In a possible implementation of the embodiment of the present application, the second determining module is configured to: in the case where the target field corresponds to one quality rule, determine the field rule weight of that quality rule as a set weight value; in the case where the target field corresponds to a plurality of quality rules, determine a first coefficient based on the sum of the rule weight values of all quality rules corresponding to the target field in the original data table; and, for any quality rule corresponding to the target field, determine the field rule weight of the quality rule according to the ratio of the rule weight value of the quality rule to the first coefficient.
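The behavior of the second determining module can be sketched in Python as follows. The function name, the rule names and the sample rule weight values are hypothetical. The text only says that a single rule receives "a set weight value"; taking that value to be 100% is an assumption of this sketch.

```python
def field_rule_weights(rule_weight_values, set_weight=100.0):
    """Return the field rule weight (in percent) of each quality rule of one field.

    rule_weight_values: {rule_name: rule weight value}
    set_weight: weight used when the field has exactly one rule
                (assumed here to be 100%).
    """
    if not rule_weight_values:
        raise ValueError("the target field has no quality rule")
    if len(rule_weight_values) == 1:
        (rule,) = rule_weight_values
        return {rule: set_weight}
    # First coefficient: sum of the rule weight values of all rules of the field.
    first_coefficient = sum(rule_weight_values.values())
    if first_coefficient <= 0:
        raise ValueError("rule weight values must sum to a positive number")
    return {
        rule: value * 100 / first_coefficient
        for rule, value in rule_weight_values.items()
    }


# Hypothetical rules and weight values for illustration.
single = field_rule_weights({"not_null": 70})
several = field_rule_weights({"not_null": 60, "length_is_18": 40})
print(single, several)
```

With one rule the rule weight value itself is irrelevant and the set weight is returned; with several rules each weight is its share of the first coefficient.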
In a possible implementation manner of the embodiment of the present application, the effect detection apparatus 900 for data cleaning may further include:
and the third determining module is used for determining whether cleaning abnormality occurs according to the change amount of the abnormal data of the target field under the corresponding quality rule aiming at any target field.
The processing module is used for responding to the occurrence of cleaning abnormality, generating and sending prompt information according to the variation of the abnormal data of the target field under the corresponding quality rule; the prompt information is used for prompting that the cleaning rule is invalid or the cleaning rule is wrong.
According to the data cleaning effect detection device, the target data table and the original data table are obtained; the target data table is obtained by cleaning data corresponding to at least one target field in the original data table; acquiring quality rules corresponding to each target field; comparing the data quantity of the abnormal data of at least one target field in the target data table and the original data table respectively to obtain the variation quantity of the abnormal data of each target field under the corresponding quality rule; and determining the quality improvement rate of the data table based on the variation of the abnormal data of each target field under the corresponding quality rule. Therefore, the change amount of abnormal data of the data table is determined by comparing the original data table before data cleaning with the target data table after data cleaning, and the quality improvement rate of the data table can be effectively determined, so that the effect of automatically evaluating the data cleaning based on the quality improvement rate of the data table can be realized.
In order to implement the foregoing embodiments, the present application further provides an electronic device, and fig. 10 is a schematic structural diagram of the electronic device provided in the seventh embodiment of the present application. The electronic device includes:
memory 1001, processor 1002, and a computer program stored on memory 1001 and executable on processor 1002.
The processor 1002 implements the effect detection method of data cleansing provided in the above-described embodiment when executing the program.
Further, the electronic device further includes:
a communication interface 1003 for communication between the memory 1001 and the processor 1002.
Memory 1001 for storing computer programs that may be run on processor 1002.
Memory 1001 may include high-speed RAM memory and may also include non-volatile memory (non-volatile memory), such as at least one disk memory.
The processor 1002 is configured to implement the effect detection method for cleaning data according to the foregoing embodiment when executing the program.
If the memory 1001, the processor 1002, and the communication interface 1003 are implemented independently, they may be connected to each other through a bus and communicate with each other over it. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be classified as an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in Fig. 10, but this does not mean that there is only one bus or only one type of bus.
Alternatively, in a specific implementation, if the memory 1001, the processor 1002, and the communication interface 1003 are integrated on a chip, the memory 1001, the processor 1002, and the communication interface 1003 may complete communication with each other through internal interfaces.
The processor 1002 may be a central processing unit (Central Processing Unit, abbreviated as CPU), or an application specific integrated circuit (Application Specific Integrated Circuit, abbreviated as ASIC), or one or more integrated circuits configured to implement embodiments of the present application.
In order to achieve the above-described embodiments, the embodiments of the present application also propose a non-transitory computer-readable storage medium on which a computer program is stored, which when executed by a processor, implements the effect detection method of data cleansing as provided in the above-described embodiments.
In order to achieve the above embodiments, the embodiments of the present application further provide a computer program product; when instructions in the computer program product are executed by a processor, the effect detection method for data cleaning provided in the above embodiments is implemented.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present application, the meaning of "plurality" is at least two, such as two, three, etc., unless explicitly defined otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and additional implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the embodiments of the present application.
Logic and/or steps represented in the flowcharts or otherwise described herein, for example an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, a processor-containing system, or another system that can fetch the instructions from the instruction execution system, apparatus, or device and execute them. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program is printed, as the program can be captured electronically, for instance by optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It is to be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, they may be implemented using any one or a combination of the following techniques, which are well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.
Those of ordinary skill in the art will appreciate that all or a portion of the steps carried out in the method of the above-described embodiments may be implemented by a program to instruct related hardware, where the program may be stored in a computer readable storage medium, and where the program, when executed, includes one or a combination of the steps of the method embodiments.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a computer readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product.
The above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, or the like. Although embodiments of the present application have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the application, and that variations, modifications, alternatives, and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the application.

Claims (13)

1. A method for detecting the effect of data cleaning, the method comprising:
acquiring a target data table and an original data table; the target data table is obtained by cleaning data corresponding to at least one target field in the original data table;
acquiring quality rules corresponding to the target fields;
comparing the data quantity of the abnormal data of the at least one target field in the target data table and the original data table respectively to obtain the variation quantity of the abnormal data of each target field under the corresponding quality rule;
and determining the quality improvement rate of the data table based on the variation of the abnormal data of each target field under the corresponding quality rule.
2. The method according to claim 1, wherein comparing the data amounts of the abnormal data of the at least one target field in the target data table and the original data table, respectively, to obtain the variation amount of the abnormal data of each target field under the corresponding quality rule, comprises:
For any one of the target fields, under the condition that a corresponding newly added field exists in the target data table in the target field, determining the change amount of abnormal data of the target field under a corresponding quality rule according to the total data amount of the newly added field in the target data table and the first abnormal data amount which does not accord with the corresponding quality rule;
and under the condition that the target field does not have a corresponding newly added field in the target data table, determining the change amount of the abnormal data of the target field under a corresponding quality rule according to the second abnormal data amount of the target field in the original data table and the third abnormal data amount in the target data table.
3. The method of claim 1, wherein determining the data table quality improvement rate based on the amount of change in the anomaly data for each of the target fields under the corresponding quality rules comprises:
for any target field, determining the field quality improvement rate of the target field according to the change amount of the abnormal data of the target field under the corresponding quality rule;
and determining the quality improvement rate of the data table according to the first weight of each target field in the original data table and the field quality improvement rate.
4. A method according to claim 3, wherein said determining a field quality improvement rate of the target field according to the amount of change of the abnormal data of the target field under the corresponding quality rule comprises:
determining the data quality improvement rate of the target field under the corresponding quality rule according to the change amount of the abnormal data of the target field under the corresponding quality rule;
and determining the field quality improvement rate of the target field according to the data quality improvement rate of the target field under the corresponding quality rule and the field rule weight of the corresponding quality rule in the original data table.
5. The method of claim 4, wherein said determining a data quality improvement rate for said target field under a corresponding quality rule based on said amount of change of said target field under a corresponding quality rule comprises:
under the condition that the target field has a corresponding newly added field in the target data table, determining the data quality improvement rate of the target field under the corresponding quality rule according to the ratio of the change amount of the target field under the corresponding quality rule to the total data amount of the newly added field in the target data table;
and under the condition that the target field does not have a corresponding newly added field in the target data table, determining the data quality improvement rate of the target field under the corresponding quality rule according to the ratio of the change amount of the target field under the corresponding quality rule to a fourth abnormal data amount of the target field in the original data table.
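The two ratios in claim 5 can be sketched as a single Python function that picks its denominator by branch. The names are hypothetical, and returning 0.0 for a zero denominator is an assumption that the claim does not address.

```python
from typing import Optional


def data_quality_improvement_rate(
    change: int,
    *,
    new_field_total: Optional[int] = None,
    fourth_abnormal: Optional[int] = None,
) -> float:
    """Improvement rate of one target field under one quality rule.

    Pass new_field_total when the field has a newly added field in the
    target data table; otherwise pass fourth_abnormal, the abnormal
    count of the field in the original data table.
    """
    if (new_field_total is None) == (fourth_abnormal is None):
        raise ValueError("pass exactly one of new_field_total or fourth_abnormal")
    denominator = new_field_total if new_field_total is not None else fourth_abnormal
    if denominator == 0:
        # Assumption: nothing to measure against, so report no improvement.
        return 0.0
    return change / denominator
```

For example, a change amount of 160 against 200 originally abnormal values yields a rate of 0.8.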
6. The method of claim 4, wherein determining the field quality improvement rate of the target field based on the data quality improvement rate of the target field under the corresponding quality rule and the field rule weight of the corresponding quality rule in the original data table comprises:
under the condition that the quality rule corresponding to the target field is one, determining the field quality improvement rate of the target field according to the data quality improvement rate of the target field under the corresponding quality rule and the field rule weight of the corresponding quality rule in the original data table;
and under the condition that the quality rules corresponding to the target field are multiple, determining the field quality improvement rate of the target field according to the data quality improvement rate of the target field under each corresponding quality rule and the field rule weight of each corresponding quality rule in the original data table.
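One way to read claim 6 is as a weighted sum. This is an assumption, since the claim only says the field quality improvement rate is determined according to the per-rule rates and weights. The symbols are introduced here for illustration: $r_{f,k}$ is the data quality improvement rate of target field $f$ under its $k$-th quality rule, $\omega_{f,k}$ is the field rule weight of that rule in the original data table, and $K_f$ is the number of quality rules corresponding to $f$.

```latex
R_f = \sum_{k=1}^{K_f} \omega_{f,k}\, r_{f,k},
\qquad \text{and for } K_f = 1:\quad R_f = \omega_{f,1}\, r_{f,1}.
```

If the single-rule weight is set to 1, the field quality improvement rate $R_f$ equals the rate under that one rule.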
7. The method of claim 4, wherein the field rule weights of the quality rules corresponding to the target fields in the original data table are obtained by:
for any target field, acquiring a rule weight value of a quality rule corresponding to the target field in the original data table;
and determining the field rule weight of the quality rule corresponding to the target field in the original data table according to the rule weight value of the quality rule corresponding to the target field in the original data table.
8. The method according to claim 7, wherein the determining the field rule weight of the quality rule corresponding to the target field in the original data table according to the rule weight value of the quality rule corresponding to the target field in the original data table includes:
under the condition that the quality rule corresponding to the target field is one, determining the field rule weight of the quality rule corresponding to the target field as a set weight value;
determining a first coefficient based on the sum of rule weight values of all the quality rules corresponding to the target field in the original data table under the condition that the quality rules corresponding to the target field are multiple;
and for any quality rule corresponding to the target field, determining the field rule weight of the quality rule according to the ratio of the rule weight value of the quality rule to the first coefficient.
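The weight normalization of claim 8 can be sketched as follows. Two points are assumptions: the set weight value for a single rule is taken to be 1.0, and the first coefficient is taken to be exactly the sum of the rule weight values (the claim says only that it is based on that sum). The function name is hypothetical.

```python
from typing import Dict, Mapping


def field_rule_weights(
    rule_weight_values: Mapping[str, float],
    single_rule_weight: float = 1.0,
) -> Dict[str, float]:
    """Turn raw rule weight values of one target field into field rule weights."""
    if not rule_weight_values:
        raise ValueError("the target field needs at least one quality rule")
    if len(rule_weight_values) == 1:
        # Single rule: use the set weight value (assumed to be 1.0).
        (rule,) = rule_weight_values
        return {rule: single_rule_weight}
    # Multiple rules: the first coefficient is assumed to be the plain sum.
    first_coefficient = sum(rule_weight_values.values())
    if first_coefficient <= 0:
        raise ValueError("rule weight values must sum to a positive number")
    return {
        rule: value / first_coefficient
        for rule, value in rule_weight_values.items()
    }
```

With raw values of 3 and 1 for two rules, the resulting field rule weights are 0.75 and 0.25, so they sum to 1 for the field.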
9. The method according to any one of claims 1-8, further comprising:
for any target field, determining whether cleaning abnormality occurs according to the variation of the abnormal data of the target field under the corresponding quality rule;
in response to the occurrence of a cleaning abnormality, generating and sending prompt information according to the change amount of the abnormal data of the target field under the corresponding quality rule; the prompt information is used for prompting that the cleaning rule is invalid or that the cleaning rule is erroneous.
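A minimal sketch of the check in claim 9, for a field cleaned in place. The thresholds are assumptions, not taken from the claim: a change amount of zero is treated as an invalid (ineffective) cleaning rule, and a negative change amount, meaning more abnormal data after cleaning, is treated as an erroneous cleaning rule. The function name and message wording are hypothetical.

```python
from typing import Optional


def cleaning_prompt(field: str, rule: str, change: int) -> Optional[str]:
    """Return prompt text if the change amount signals a cleaning abnormality."""
    if change > 0:
        # Abnormal data decreased: no abnormality under this assumption.
        return None
    if change == 0:
        return (
            f"Field '{field}', rule '{rule}': abnormal data unchanged; "
            "the cleaning rule may be invalid."
        )
    return (
        f"Field '{field}', rule '{rule}': abnormal data increased by {-change}; "
        "the cleaning rule may be erroneous."
    )
```

Sending the message (for example, to an operator console) is left out, because the claim does not specify a channel.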
10. An effect detection device for data cleaning, the device comprising:
the first acquisition module is used for acquiring a target data table and an original data table; the target data table is obtained by cleaning data corresponding to at least one target field in the original data table;
the second acquisition module is used for acquiring the quality rule corresponding to each target field;
the comparison module is used for comparing the data quantity of the abnormal data of the at least one target field in the target data table and the original data table respectively so as to obtain the variation quantity of the abnormal data of each target field under the corresponding quality rule;
the first determining module is used for determining the quality improvement rate of the data table based on the variation of the abnormal data of each target field under the corresponding quality rule.
11. An electronic device, comprising:
a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the effect detection method of data cleaning according to any one of claims 1-9 when executing the program.
12. A non-transitory computer-readable storage medium having stored thereon a computer program, wherein the program when executed by a processor implements the effect detection method of data cleaning according to any one of claims 1 to 9.
13. A computer program product comprising a computer program which, when executed by a processor of an electronic device, enables the electronic device to perform the effect detection method of data cleaning as claimed in any one of claims 1-9.
CN202310451991.4A 2023-04-23 2023-04-23 Method, device, equipment and medium for detecting effect of data cleaning Pending CN116521662A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310451991.4A CN116521662A (en) 2023-04-23 2023-04-23 Method, device, equipment and medium for detecting effect of data cleaning

Publications (1)

Publication Number Publication Date
CN116521662A (en) 2023-08-01

Family

ID=87404093

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310451991.4A Pending CN116521662A (en) 2023-04-23 2023-04-23 Method, device, equipment and medium for detecting effect of data cleaning

Country Status (1)

Country Link
CN (1) CN116521662A (en)

Similar Documents

Publication Publication Date Title
WO2021184727A1 (en) Data abnormality detection method and apparatus, electronic device and storage medium
US20200034749A1 (en) Training corpus refinement and incremental updating
US10789225B2 (en) Column weight calculation for data deduplication
TW202029079A (en) Method and device for identifying irregular group
CN111242793B (en) Medical insurance data abnormality detection method and device
CN112885481A (en) Case grouping method, case grouping device, electronic equipment and storage medium
CN109299085A (en) A kind of data processing method, electronic equipment and storage medium
CN113468034A (en) Data quality evaluation method and device, storage medium and electronic equipment
CN114840531B (en) Data model reconstruction method, device, equipment and medium based on blood edge relation
CN111931047A (en) Artificial intelligence-based black product account detection method and related device
CN116126843A (en) Data quality evaluation method and device, electronic equipment and storage medium
de Mast et al. Modeling and evaluating repeatability and reproducibility of ordinal classifications
CN112949697A (en) Method and device for confirming pipeline abnormity and computer readable storage medium
KR102218374B1 (en) Method and Apparatus for Measuring Quality of De-identified Data for Unstructured Transaction
CN111680083A (en) Intelligent multi-stage government financial data acquisition system and data acquisition method
CN116521662A (en) Method, device, equipment and medium for detecting effect of data cleaning
CN106155866A (en) A kind of method and device of monitoring CPU core frequency
CN112395179B (en) Model training method, disk prediction method, device and electronic equipment
CN105824871B (en) A kind of picture detection method and equipment
CN115034580A (en) Quality evaluation method and device for fusion data set
US11568153B2 (en) Narrative evaluator
CN114840767A (en) Service recommendation method based on artificial intelligence and related equipment
CN109710651B (en) Data type identification method and device
CN114155578A (en) Portrait clustering method, device, electronic equipment and storage medium
Talburt et al. Evaluating and improving data fusion accuracy

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination