CN112380204B - Data quality evaluation method and device - Google Patents


Info

Publication number
CN112380204B
Authority
CN
China
Prior art keywords
data
task
data quality
quality evaluation
type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011281552.6A
Other languages
Chinese (zh)
Other versions
CN112380204A (en)
Inventor
项颂
何林强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co Ltd
Priority to CN202011281552.6A
Publication of CN112380204A
Application granted
Publication of CN112380204B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21: Design, administration or maintenance of databases
    • G06F16/215: Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Abstract

The invention discloses a data quality evaluation method and device that address a problem in the prior art, namely that evaluating the data to be processed by different tasks against a uniform standard yields unreasonable results, and thereby improve the flexibility and rationality of data quality evaluation. The data quality evaluation method comprises the following steps: receiving constraint condition information input for a plurality of data rows of the collected data as a whole or for the field values in each data row, wherein the constraint condition information corresponds to the requirements of a first task that processes the collected data and is used to generate a data quality evaluation task; and receiving an input operation instruction that sets the relative execution order of the data quality evaluation task and the first task, and executing the two tasks in the determined relative execution order, so that the data quality evaluation task performs a targeted quality evaluation of the data processed by the first task.

Description

Data quality evaluation method and device
Technical Field
The invention relates to the technical field of big data, in particular to a method and a device for evaluating data quality.
Background
With the continuous development of big data technology, the idea that "data is an asset" has gained wide acceptance, and the importance of data keeps growing. However, not all data can become an asset; whether data is really valuable depends on its quality. Effective evaluation of data quality is not only basic data processing work but also the foundation for downstream links of the data pipeline such as data cleaning and data mining, and it is a necessary precondition for users to develop upper-layer applications, mine the value of data and make correct decisions.
At present, data is collected for different application scenarios, but the collected data is generally evaluated against a uniform standard, without considering that different application scenarios place different requirements on data quality. Moreover, after quality evaluation, data judged unqualified is either left unused or deleted, which wastes part of the data resources.
It can be seen that, in the prior art, evaluating data from different application scenarios against a uniform data quality standard cannot meet the differing quality requirements of those scenarios; and once data fails the quality evaluation it can no longer be used, so data resources are wasted.
Disclosure of Invention
The invention provides a method and a device for evaluating data quality, which can solve the problem of unreasonable evaluation caused by adopting a uniform evaluation standard to evaluate the quality of data required to be processed by different tasks in the prior art, thereby improving the flexibility and the rationality of data quality evaluation.
In a first aspect, an embodiment of the present invention provides a data quality assessment method, which is applied to perform data quality assessment on collected data processed by a first task, where the collected data includes a plurality of data rows, each of the plurality of data rows includes a plurality of field values, and when the first task is scheduled to be executed, the method includes:
receiving constraint condition information input for the plurality of data rows of the collected data as a whole or for the field values in each data row, the constraint condition information corresponding to the requirements of the first task for processing the collected data and being used to generate a data quality evaluation task;
and receiving an input operation instruction that sets the relative execution order of the data quality evaluation task and the first task, and executing the data quality evaluation task and the first task in the determined relative execution order, so that the data quality evaluation task performs a targeted quality evaluation of the data processed by the first task, wherein the relative execution order differs between different first tasks and their corresponding data quality evaluation tasks.
In an embodiment of the present invention, the collected data may be regarded as comprising a plurality of data rows, each of which comprises a plurality of field values, and the first task may be regarded as a task that needs to process the collected data. When the first task is scheduled to be executed, a constraint condition on the collected data can be set according to the actual requirement of the first task. For example, data that satisfies the constraint condition can be used for the first task while data that does not satisfy it cannot; or the first task succeeds if the constraint condition is satisfied and fails if it is not. The relative execution order of the first task and the data quality evaluation task formed from these constraint conditions can also be set according to the type of the first task, for example so that the data quality evaluation task is executed before the first task, or the first task before the data quality evaluation task. In this way the data quality evaluation task can evaluate, in a targeted manner, the quality of the data that the first task needs to process. Because the data quality evaluation task is generated from the actual requirement of the first task and the relative execution order is set according to the specific type of the first task, the method avoids the unreasonable results that arise in the prior art when data to be processed by different tasks is evaluated against a uniform standard, and improves the flexibility and rationality of data quality evaluation.
Optionally, the constraint condition information at least includes a rule type, a rule strength and a check type, the rule type is used to indicate that the data rows are evaluated as a whole or each field value in each data row is evaluated, the rule strength is used to indicate whether the data quality evaluation task and the first task are independent from each other, and the check type is used to indicate a reference value type, a reference value size and a comparison relationship adopted when the data rows are evaluated as a whole or each field value in each data row is evaluated.
In the embodiment of the invention, the constraint conditions for forming the data quality evaluation task at least comprise a rule type, a rule strength and a verification type, and a user can set the constraint conditions according to the actual requirement of the first task on the data to be processed, so that the process of evaluating the data quality of the acquired data is more flexible and reasonable.
Optionally, if the first task is a data development type task based on the collected data, the operation instruction is used to set the data quality evaluation task to be executed before the first task.
In the embodiment of the present invention, if the first task is a data development type task, that is, the first task needs to depend on the acquired data, the data quality of the acquired data directly affects the execution effect of the first task. Therefore, when the relative execution order of the data quality evaluation task and the first task is set, it is more reasonable to cause the data quality evaluation task to be executed prior to the first task.
Optionally, if the rule type in the constraint condition information indicates that each field value in each data row is evaluated, and the rule strength indicates that the data quality evaluation task and the first task are in a non-independent relationship, executing the data quality evaluation task and the first task based on the determined relative execution order includes:
evaluating each field value in each data row based on the reference value type, the reference value size and the comparison relationship contained in the check type, and determining an evaluation result for each data row, wherein the evaluation result indicates whether each field value in the data row satisfies a first constraint condition formed by the reference value type, the reference value size and the comparison relationship;
and if the evaluation results show that the number of data rows whose field values all satisfy the first constraint condition is greater than or equal to a first preset threshold, executing the first task based on those data rows.
In the embodiment of the present invention, when the first task is a data development type task, the corresponding data quality evaluation task may focus on evaluating each field value of each of the data rows included in the collected data. A data row is considered to meet the quality requirement for executing the first task only when every field value in it satisfies the first constraint condition formed by the reference value type, the reference value size and the comparison relationship. Because the data quality evaluation task and the first task are in a non-independent relationship, that is, the result of the data quality evaluation task determines whether the first task is executed, the first task is executed only if the number of data rows meeting the quality requirement reaches the preset threshold. In this method the collected data is screened by the data quality evaluation task, and the first task is executed using only the data that meets its quality requirement, which improves the execution effect of the first task.
Optionally, if the rule strength information indicates that the data quality assessment task and the first task are in an independent relationship, the method further includes:
and if the evaluation results show that the number of data rows whose field values all satisfy the first constraint condition is smaller than the first preset threshold, executing the first task based on the collected data.
In the embodiment of the invention, the data quality evaluation task and the first task are in an independent relationship, that is, the result of the data quality evaluation task has no bearing on whether the first task is executed. Thus, the first task may be executed based on the collected data even if the number of data rows that meet the quality requirement for executing the first task does not reach the preset threshold. The purpose is to learn the quality of the collected data through the data quality evaluation task without affecting the normal execution of the first task.
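The two optional cases above (non-independent and independent rule strength) can be sketched in Python as follows. This is only an illustrative sketch, not the patented implementation: the names `evaluate_rows` and `select_input`, the representation of a data row as a dictionary, and the choice to run on the passing rows whenever the threshold is met (even under independent strength) are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, float]                      # hypothetical: one data row as field -> value
Checks = Dict[str, Callable[[float], bool]]  # hypothetical: field -> first-constraint predicate


def evaluate_rows(rows: List[Row], checks: Checks) -> List[Row]:
    """Return the rows in which every checked field value satisfies its constraint."""
    return [
        row for row in rows
        if all(name in row and ok(row[name]) for name, ok in checks.items())
    ]


def select_input(rows: List[Row], checks: Checks, threshold: int,
                 independent: bool) -> Tuple[bool, List[Row]]:
    """Decide whether the first task runs and on which rows.

    Returns (run_first_task, rows_to_process).
    """
    passing = evaluate_rows(rows, checks)
    if len(passing) >= threshold:
        # Enough qualifying rows: run the first task on those rows only.
        return True, passing
    if independent:
        # Independent strength: the result does not block the first task,
        # which then runs on all of the collected data.
        return True, list(rows)
    # Non-independent strength and too few qualifying rows: do not run.
    return False, []
```

For instance, with the check `{"age": lambda v: 20 <= v <= 30}` and a threshold of 2, a set of three rows of which two fall in the interval would run the first task on those two rows.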
Optionally, the method further includes:
storing at least one input repair instruction corresponding to at least one data quality problem that may exist in the collected data and cause the first constraint condition not to be satisfied, and establishing a correspondence between the at least one data quality problem and the at least one repair instruction;
and if the evaluation result shows that a first data row only partially satisfies the first constraint condition and the corresponding first data quality problem is among the at least one data quality problem, determining a first repair instruction corresponding to the first data quality problem based on the correspondence between the at least one data quality problem and the at least one repair instruction, and calling the first repair instruction to repair the first data quality problem.
In the embodiment of the present invention, a correspondence between possible data quality problems and their repair instructions may be established in advance. If a field value in a first data row does not satisfy the first constraint condition formed by the reference value type, the reference value size and the comparison relationship, the first data row is considered to have a data quality problem. If that problem is among the data quality problems recorded in advance, the corresponding repair instruction can be looked up through the correspondence and used to repair the problem. The repaired first data row can then be used for executing the first task, which avoids wasting data resources.
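The correspondence between data quality problems and repair instructions described above can be sketched as a lookup table from a problem identifier to a repair function. The problem name `phone_has_spaces`, the `diagnose` helper and the phone-number example are hypothetical and chosen only for illustration; the source does not specify how problems are identified or what a repair instruction looks like.

```python
from typing import Callable, Dict, Optional

Row = Dict[str, object]
Repair = Callable[[Row], Row]


def strip_phone_spaces(row: Row) -> Row:
    """Hypothetical repair instruction: remove spaces from the phone field."""
    fixed = dict(row)  # copy so the original row is left untouched
    fixed["phone"] = str(row["phone"]).replace(" ", "")
    return fixed


# Pre-established correspondence: data quality problem -> repair instruction.
REPAIRS: Dict[str, Repair] = {"phone_has_spaces": strip_phone_spaces}


def diagnose(row: Row) -> Optional[str]:
    """Hypothetical detector that names the problem found in a row, if any."""
    if " " in str(row.get("phone", "")):
        return "phone_has_spaces"
    return None


def repair_if_known(row: Row) -> Row:
    """Apply the stored repair when the row's problem is a known one."""
    problem = diagnose(row)
    repair = REPAIRS.get(problem) if problem is not None else None
    return repair(row) if repair is not None else row
```

A row that is repaired this way can be fed back into the field-level evaluation instead of being discarded.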
Optionally, if the first task is a data import task based on the collected data, the operation instruction is used to set the first task to be executed before the data quality evaluation task.
In the embodiment of the present invention, if the first task is a data import type task, for example importing the collected data into a database on a server for storage, the data quality of the collected data may be considered unrelated to the execution of the first task. Therefore, when the relative execution order of the data quality evaluation task and the first task is set, it is more reasonable to import the collected data into the server first and then evaluate its quality on the server; that is, the first task should be executed before the data quality evaluation task.
Optionally, if the rule type in the constraint condition information indicates that the plurality of data rows is evaluated as a whole, executing the data quality evaluation task and the first task based on the determined relative execution order includes:
executing the first task based on the collected data, wherein the first task imports the plurality of data rows included in the collected data into a preset storage location and numbers them;
after determining that the first task has finished, evaluating the plurality of data rows as a whole based on the reference value type, the reference value size and the comparison relationship contained in the check type, and determining an evaluation result for the plurality of data rows as a whole, wherein the evaluation result indicates whether the plurality of data rows as a whole satisfies a second constraint condition formed by the reference value type, the reference value size and the comparison relationship;
and if the evaluation result shows that the second constraint condition is not satisfied, outputting prompt information that reminds the user that the amount of collected data processed by the first task is insufficient or excessive.
In the embodiment of the present invention, when the first task is a data import type task, the corresponding data quality evaluation task may focus on evaluating the data rows included in the collected data as a whole. After the first task is completed, the imported data rows as a whole are evaluated against a second constraint condition formed by the reference value type, the reference value size and the comparison relationship. If the evaluation result does not satisfy the second constraint condition, corresponding prompt information is output to the user, for example that the imported data volume is insufficient or excessive, so that the user can take targeted action.
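The import-then-evaluate flow described above can be sketched as follows. The in-memory dictionary standing in for the preset storage location, the function names, and the use of a lower and an upper bound as the second constraint condition are assumptions made for this example only.

```python
from typing import Dict, List, Optional

Row = Dict[str, object]


def import_rows(rows: List[Row], store: Dict[int, Row]) -> int:
    """First task: import rows into a (hypothetical) store and number them from 1."""
    for number, row in enumerate(rows, start=1):
        store[number] = row
    return len(rows)


def row_count_prompt(count: int, minimum: int, maximum: int) -> Optional[str]:
    """Evaluate the rows as a whole after the import has finished.

    Returns prompt text when the second constraint is violated, else None.
    """
    if count < minimum:
        return f"imported data volume is insufficient: {count} < {minimum}"
    if count > maximum:
        return f"imported data volume exceeds the limit: {count} > {maximum}"
    return None
```

Running the evaluation only after `import_rows` returns mirrors the order in which the first task precedes the data quality evaluation task.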
In a second aspect, an embodiment of the present invention provides an apparatus for evaluating data quality, which is applied to perform data quality evaluation on collected data processed by a first task, where the collected data includes a plurality of data rows, and each data row in the plurality of data rows includes a plurality of field values, the apparatus including:
a receiving unit, configured to receive, when the first task is scheduled to be executed, constraint condition information input for the plurality of data rows of the collected data as a whole or for the field values in each data row, the constraint condition information corresponding to the requirements of the first task for processing the collected data and being used to generate a data quality evaluation task;
and an execution unit, configured to receive an input operation instruction that sets the relative execution order of the data quality evaluation task and the first task, and to execute the data quality evaluation task and the first task in the determined relative execution order, so that the data quality evaluation task performs a targeted quality evaluation of the data processed by the first task, wherein the relative execution order differs between different first tasks and their corresponding data quality evaluation tasks.
Optionally, the constraint condition information at least includes a rule type, a rule strength, and a check type, where the rule type is used to indicate that the data rows are evaluated as a whole or each field value in each data row is evaluated, the rule strength is used to indicate whether the data quality evaluation task and the first task are independent from each other, and the check type is used to indicate a reference value type, a reference value size, and a comparison relationship that are adopted when the data rows are evaluated as a whole or each field value in each data row is evaluated.
Optionally, if the first task is a data development type task based on the collected data, the operation instruction is used to set the data quality evaluation task to be executed before the first task.
Optionally, if the rule type in the constraint condition information indicates that each field value in each data row is evaluated, and the rule strength indicates that the data quality evaluation task and the first task are in a non-independent relationship, the execution unit is specifically configured to:
evaluate each field value in each data row based on the reference value type, the reference value size and the comparison relationship contained in the check type, and determine an evaluation result for each data row, wherein the evaluation result indicates whether each field value in the data row satisfies a first constraint condition formed by the reference value type, the reference value size and the comparison relationship;
and if the evaluation results show that the number of data rows whose field values all satisfy the first constraint condition is greater than or equal to a first preset threshold, execute the first task based on those data rows.
Optionally, if the rule strength information indicates that the data quality assessment task and the first task are in an independent relationship, the execution unit is further configured to:
and if the evaluation result is determined that the number of the data rows which all meet the first constraint condition is smaller than the first preset threshold, executing the first task based on the acquired data.
Optionally, the apparatus further comprises:
a processing unit, configured to store at least one input repair instruction corresponding to at least one data quality problem that may exist in the collected data and cause the first constraint condition not to be satisfied, and to establish a correspondence between the at least one data quality problem and the at least one repair instruction;
and a repair unit, configured to, when it is determined that a first data row only partially satisfies the first constraint condition and the corresponding first data quality problem is among the at least one data quality problem, determine a first repair instruction corresponding to the first data quality problem based on the correspondence between the at least one data quality problem and the at least one repair instruction, and call the first repair instruction to repair the first data quality problem.
Optionally, if the first task is a data import type task based on the collected data, the operation instruction is used to set the first task to be executed before the data quality evaluation task.
Optionally, if the rule type in the constraint condition information indicates that the plurality of data rows is evaluated as a whole, the execution unit is specifically configured to:
execute the first task based on the collected data, wherein the first task imports the plurality of data rows included in the collected data into a preset storage location and numbers them;
after determining that the first task has finished, evaluate the plurality of data rows as a whole based on the reference value type, the reference value size and the comparison relationship contained in the check type, and determine an evaluation result for the plurality of data rows as a whole, wherein the evaluation result indicates whether the plurality of data rows as a whole satisfies a second constraint condition formed by the reference value type, the reference value size and the comparison relationship;
and if the evaluation result shows that the second constraint condition is not satisfied, output prompt information that reminds the user that the amount of collected data processed by the first task is insufficient or excessive.
In a third aspect, an embodiment of the present invention provides an apparatus for evaluating data quality, where the apparatus includes a processor and a memory, and the processor is configured to execute a computer program stored in the memory to implement the steps of the method according to the embodiment of the first aspect.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the method as described in the embodiment of the first aspect.
Drawings
Fig. 1 is a schematic flowchart of a method for evaluating data quality according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of an apparatus for evaluating data quality according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of an apparatus for evaluating data quality according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments.
In the prior art, data collected under different application scenarios are generally evaluated by using a uniform data quality evaluation standard, and differences of the data quality requirements of the different application scenarios are not considered, so that results of data quality evaluation may be unreasonable.
In view of this, the present invention provides a data quality evaluation method, in which a corresponding data quality evaluation task can be generated according to an actual requirement for processing data required by a first task that is currently scheduled to be executed, and a relative execution order of the first task and the data quality evaluation task can be set according to a specific type of the first task, so that the data quality evaluation task can perform quality evaluation on data that the first task needs to process, thereby improving flexibility and rationality of data quality evaluation.
The technical solution provided by the embodiments of the invention is described below with reference to the accompanying drawings. Referring to Fig. 1, an embodiment of the present invention provides a method for evaluating data quality, the flow of which is as follows:
Step 101: receive constraint condition information input for the plurality of data rows of the collected data as a whole or for the field values in each data row, the constraint condition information corresponding to the requirements of the first task for processing the collected data and being used to generate a data quality evaluation task.
In an embodiment of the present invention, the collected data may be considered to include a plurality of data rows, and each data row includes a plurality of field values. Each data row may be considered a data sample, and the field values in each data row correspond to respective attributes, such as name, age, gender, identification number, telephone number, and the like. The first task may be considered a task that processes the collected data. When the first task is scheduled to be executed, either the plurality of data rows as a whole or the field values in each data row must meet the requirement for executing the first task. Therefore, once the task scheduled to be executed is determined to be the first task, the data quality evaluation task may be generated according to the actual requirement of the first task on the data it processes.
As one possible implementation, the system executing the first task may receive constraint condition information input for the plurality of data rows of the collected data as a whole or for the field values in each data row, the constraint condition information corresponding to the requirements of the first task for processing the collected data and being used to generate the data quality evaluation task. For example, the constraint conditions that make up the data quality evaluation task may include a rule type, a rule strength, a check type, and so on. The specific meanings of the rule type, the rule strength and the check type are described below.
The rule type indicates whether the actual requirement of the first task on the collected data applies to the plurality of data rows as a whole or to the field values in each data row. For example, if the first task requires the amount of collected data to reach a preset threshold, such as 1000 data samples, the rule type should be set to apply to the plurality of data rows as a whole; if the first task needs to process data samples whose age lies in the interval [20,30], the rule type should be set to apply to the field values in each data row.
The rule strength indicates whether the data quality evaluation task is independent of the first task, that is, whether the result of the data quality evaluation task affects the execution of the first task. For example, if the first task places a high requirement on the quality of the collected data, the rule strength may be set so that the data quality evaluation task and the first task are not independent; once the result of the data quality evaluation task shows that the quality of the collected data does not meet the requirement of the first task, the first task is not executed. If the first task places a low requirement on the quality of the collected data, the rule strength may be set so that the data quality evaluation task and the first task are independent, and the first task is executed even if the result of the data quality evaluation task shows that the quality of the collected data does not meet the requirement of the first task.
The check type indicates the reference value type, the reference value size and the comparison relationship used when evaluating the plurality of data rows as a whole or evaluating each field value in each data row. The reference value type may be an interval value (e.g., [20,30]) or a fixed value (e.g., 30), and the comparison relationship may be greater than, greater than or equal to, less than, less than or equal to, or equal to. For example, if the first task needs to process data samples whose age lies in the interval [20,30], the reference value type may be set to an interval value, the reference values to 20 and 30, and the comparison relationship to greater than or equal to 20 and less than or equal to 30. As another example, if the first task needs to process data samples whose age is 30, the reference value type may be set to a fixed value, the reference value to 30, and the comparison relationship to equal to.
It should be understood that the constraints for constituting the data quality evaluation task are not limited to those exemplified above, and the user may add corresponding constraints according to the actual requirements when the first task processes the collected data, and is not particularly limited herein.
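As a rough illustration of how the rule type, rule strength and check type described above might be represented, the following Python sketch models them as small data types. All class, field and enum member names are hypothetical, and the closed-interval semantics follow the [20,30] example in the text; the source does not prescribe any particular data structure.

```python
import operator
from dataclasses import dataclass
from enum import Enum
from typing import Optional

# Comparison relationships named in the text, mapped to Python operators.
_COMPARISONS = {
    ">": operator.gt, ">=": operator.ge,
    "<": operator.lt, "<=": operator.le, "==": operator.eq,
}


class RuleType(Enum):
    WHOLE = "whole"  # evaluate the plurality of data rows as a whole
    FIELD = "field"  # evaluate field values in each data row


class RuleStrength(Enum):
    NON_INDEPENDENT = "non_independent"  # result decides whether the first task runs
    INDEPENDENT = "independent"          # first task runs regardless of the result


@dataclass(frozen=True)
class CheckType:
    reference_type: str             # "fixed" or "interval"
    reference: float                # fixed value, or lower bound of the interval
    upper: Optional[float] = None   # upper bound, used only for "interval"
    comparison: str = "=="          # used only for "fixed"

    def accepts(self, value: float) -> bool:
        if self.reference_type == "interval":
            return self.reference <= value <= self.upper
        return _COMPARISONS[self.comparison](value, self.reference)


@dataclass(frozen=True)
class Constraint:
    rule_type: RuleType
    strength: RuleStrength
    check: CheckType
    field: Optional[str] = None  # field name for FIELD rules, None for WHOLE rules
```

Under these assumptions, the age example becomes `Constraint(RuleType.FIELD, RuleStrength.NON_INDEPENDENT, CheckType("interval", 20, 30), field="age")`, and the 1000-sample example becomes a `RuleType.WHOLE` constraint with `CheckType("fixed", 1000, comparison=">=")` applied to the row count.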
Step 102: the method comprises the steps of receiving an input operation instruction used for setting a relative execution sequence of a data quality evaluation task and a first task, and executing the data quality evaluation task and the first task based on the determined relative execution sequence, so that the data quality evaluation task performs targeted quality evaluation on data processed by the first task, wherein the relative execution sequence between different first tasks and corresponding data quality evaluation tasks is different.
In the embodiment of the present invention, after the data quality assessment task is generated according to the actual requirement of the first task, the relative execution order of the data quality assessment task and the first task may be further set according to the specific type of the first task, so as to ensure the reasonability of the execution order between the data quality assessment task and the first task.
As a possible implementation, the system for executing the first task may receive an input operation instruction for setting the relative execution order of the data quality evaluation task and the first task. Specifically, if the first task is a data development type task based on the collected data, for example a training task that uses the age and height data of students in a certain area to train a model for predicting heights at different ages, the quality of the collected data directly affects the final effect of the training task, so the operation instruction should set the data quality evaluation task to be executed before the first task. If the first task is a data import type task, for example a task that imports the collected data into a database on a server for storage, the user typically does not care how many data samples have been added to the database while the import is in progress, but often needs to determine the number of data samples once the import has finished, so the operation instruction should set the first task to be executed before the data quality evaluation task.
The specific process of executing the data quality evaluation task and the first task based on the determined relative execution order is described below for the case where the first task is a data development type task and the case where it is a data import type task.
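The ordering rule above can be sketched as follows. This is a hedged illustration: in the source the order is set by an operation instruction input by the user, whereas this sketch derives it directly from a task kind. The function name `run_in_order` and the labels `"development"` and `"import"` are hypothetical.

```python
from typing import Callable, List


def run_in_order(
    task_kind: str,
    first_task: Callable[[], None],
    quality_task: Callable[[], None],
) -> List[str]:
    """Run the two tasks in the relative order suggested for the task kind.

    Returns the names of the tasks in the order they were executed.
    """
    if task_kind == "development":
        # Data quality directly affects the result, so evaluate first.
        order = [("quality_evaluation", quality_task), ("first_task", first_task)]
    elif task_kind == "import":
        # The row count is only meaningful once the import has finished.
        order = [("first_task", first_task), ("quality_evaluation", quality_task)]
    else:
        raise ValueError(f"unknown task kind: {task_kind!r}")

    executed = []
    for name, task in order:
        task()
        executed.append(name)
    return executed
```

For instance, `run_in_order("import", do_import, evaluate)` would call `do_import` before `evaluate`, where both callables are placeholders supplied by the caller.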
In the first case: the first task is a data development type task, and the data quality evaluation task is executed before the first task.
When the first task is a data development type task, the rule type of the corresponding data quality evaluation task may be set to evaluate each field value in each data row.
If the user has a high requirement on the execution effect of the first task, the data quality evaluation task and the first task may be set to a non-independent relationship when the rule strength is set. Taking as an example a training task that uses the age and height data of students in a certain area to train a preset model for predicting the heights of different age groups, a high requirement on the execution effect means that the user expects the finally formed prediction model to be highly accurate. With the non-independent relationship, the first task is executed only when the number of data rows meeting the quality requirement for executing the first task reaches a preset threshold, which improves the execution effect of the first task.
Specifically, the system for executing the first task may evaluate each field value in each data row based on the reference value type, the reference value size, and the comparison relationship contained in the check type set in the constraint condition, thereby determining the evaluation result of each data row. It should be understood that, since each data row includes a plurality of field values, a data row is considered to meet the quality requirement for executing the first task only if every field value in that data row satisfies the first constraint condition formed by the reference value type, the reference value size, and the comparison relationship. If the number of data rows in the collected data whose field values all satisfy the first constraint condition is greater than or equal to the first preset threshold, the first task may be executed based on those data rows.
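The row-level evaluation and threshold gate can be sketched as below. This is a minimal sketch under stated assumptions: rows are plain dictionaries, each field check is a callable, and the field names, value ranges, and function names (`row_passes`, `rows_for_first_task`) are hypothetical and not taken from the source.

```python
from typing import Any, Callable, Dict, List, Optional

Row = Dict[str, Any]
FieldCheck = Callable[[Any], bool]


def row_passes(row: Row, checks: Dict[str, FieldCheck]) -> bool:
    """A row passes only if every checked field is present and satisfies its check."""
    return all(name in row and check(row[name]) for name, check in checks.items())


def rows_for_first_task(
    rows: List[Row], checks: Dict[str, FieldCheck], threshold: int
) -> Optional[List[Row]]:
    """Return the passing rows if there are at least `threshold` of them, else None."""
    passing = [row for row in rows if row_passes(row, checks)]
    return passing if len(passing) >= threshold else None


# Illustrative checks for the student age/height example (ranges are assumptions).
checks = {
    "age": lambda v: v is not None and 6 <= v <= 18,
    "height_cm": lambda v: v is not None and 90 <= v <= 210,
}
rows = [
    {"age": 10, "height_cm": 140},
    {"age": 12, "height_cm": None},   # missing height fails the check
    {"age": 40, "height_cm": 175},    # age outside the assumed range
    {"age": 15, "height_cm": 168},
]
```

With a threshold of 2, the first and last rows are returned and the first task would run on them; with a threshold of 3, the function returns `None` and the first task would not run.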
If the user has a low requirement on the execution effect of the first task, the data quality evaluation task and the first task may be set to an independent relationship when the rule strength is set. Continuing the example of a training task that uses the age and height data of students in a certain area to train a preset model for predicting the heights of different age groups, a low requirement means that the user does not need the finally formed prediction model to be highly accurate and only wishes to learn the overall data quality of the training data while the first task is executed. In this case, even if the number of data rows in the collected data whose field values all satisfy the first constraint condition does not reach the first preset threshold, the first task may still be executed based on the collected data.
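The effect of the rule strength on which data the first task receives can be sketched as follows. This is an assumption-laden illustration: the function name `choose_input` and the boolean flag `independent` are hypothetical, and the choice to pass all collected rows whenever the relationship is independent is an interpretation of the text, which only spells out the below-threshold case.

```python
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


def choose_input(
    collected: List[Row],
    passing: List[Row],
    threshold: int,
    independent: bool,
) -> Optional[List[Row]]:
    """Pick the rows the first task should run on, or None if it must not run.

    independent=True  -> evaluation is informational; run on the collected data.
    independent=False -> run only on passing rows, and only if enough of them exist.
    """
    if independent:
        return collected
    if len(passing) >= threshold:
        return passing
    return None
```

The design point is that the evaluation result is computed in both modes; only the non-independent mode lets it block or narrow the first task.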
In some embodiments, it is noted that in the prior art, once the data quality evaluation task determines that data does not meet the requirements of the current task, the data cannot be used, which wastes data resources. Therefore, in the embodiment of the present invention, the data quality problems indicated by the evaluation result of the data quality evaluation task can be repaired, and the repaired data can be used to execute the first task, thereby avoiding the waste of data resources.
As a possible implementation, the system for executing the first task may store at least one repair instruction, input by the user, corresponding to at least one data quality problem that exists in the collected data and does not satisfy the first constraint condition, so as to establish a correspondence between the at least one data quality problem and the at least one repair instruction. If the data quality evaluation task determines that some field values in the first data row satisfy the first constraint condition while others do not, for example the first field value does not satisfy the first constraint condition, the first field value is considered to have a first data quality problem. If the first data quality problem is among the at least one data quality problem, the first repair instruction corresponding to the first data quality problem is determined according to the correspondence between the at least one data quality problem and the at least one repair instruction, and the first repair instruction is called to repair the first data quality problem.
For example, suppose the first field value in the first data row should be a null value, but the data quality evaluation result indicates that it is not. It may then be determined that the first field value has a first data quality problem. If the first data quality problem is among the pre-established data quality problems, the first repair instruction corresponding to it may be determined based on the correspondence between data quality problems and repair instructions. Here the first repair instruction may be a delete instruction that deletes the existing first field value, so that the first field value is restored to a null value and the first data row meets the data quality requirement of the first task.
It should be appreciated that, because it is not possible to know before the data quality evaluation task is executed which data quality problems will occur in the collected data, only a few relatively general repair instructions may be stored in the system at first. To continuously improve the data quality repair function, new data quality problems that appear after a data quality evaluation task is executed can be collected, corresponding repair methods can be determined, and the new data quality problems and their repair instructions can then be stored in the system.
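The correspondence between data quality problems and repair instructions can be sketched as a lookup table. This is a minimal sketch under stated assumptions: the registry `REPAIRS`, the problem key `"expected_null"`, and the function names are hypothetical, and a repair instruction is modelled as a Python callable that mutates the row.

```python
from typing import Any, Callable, Dict

Row = Dict[str, Any]
RepairFn = Callable[[Row, str], None]


def clear_field(row: Row, field: str) -> None:
    """Delete instruction from the example: restore the field to a null value."""
    row[field] = None


# Hypothetical registry mapping a data quality problem to its repair instruction.
REPAIRS: Dict[str, RepairFn] = {
    "expected_null": clear_field,
}


def repair(row: Row, field: str, problem: str) -> bool:
    """Apply the stored repair for `problem`; return False if none is registered."""
    fix = REPAIRS.get(problem)
    if fix is None:
        return False
    fix(row, field)
    return True
```

A problem with no registered repair leaves the row untouched and returns `False`, which is the point at which a new problem could be collected and a repair added to the registry later.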
In the second case: the first task is a data import type task, and the first task is executed before the data quality evaluation task.
When the first task is a data import type task, the rule type of the corresponding data quality evaluation task may be set to evaluate the plurality of data rows as a whole. Specifically, the first task may be executed to import the collected data into a preset storage location, for example a server, and the imported data may be numbered at the same time. To determine whether the quantity of collected data imported into the server meets the quality requirement, after the first task is executed the plurality of data rows may be evaluated as a whole based on the reference value type, the reference value size, and the comparison relationship contained in the check type set in the constraint condition. When the evaluation result indicates that the plurality of data rows as a whole does not satisfy the second constraint condition formed by the reference value type, the reference value size, and the comparison relationship, prompt information may be output to the user, indicating that the number of data rows in the collected data processed by the first task is insufficient or excessive.
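The whole-dataset check after an import can be sketched as a row-count comparison against an interval. This is an illustration only: the function name `count_prompt`, the inclusive bounds, and the message wording are assumptions, and the source does not specify how the prompt is delivered to the user.

```python
from typing import Optional


def count_prompt(imported_rows: int, low: int, high: int) -> Optional[str]:
    """Compare the imported row count with an inclusive [low, high] interval.

    Returns a prompt message when the second constraint is violated, else None.
    """
    if imported_rows < low:
        return (
            f"Imported {imported_rows} rows; "
            f"fewer than the expected minimum of {low}."
        )
    if imported_rows > high:
        return (
            f"Imported {imported_rows} rows; "
            f"more than the expected maximum of {high}."
        )
    return None
```

In this sketch the check runs only after the import has finished, matching the ordering in which the first task precedes the data quality evaluation task.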
Referring to fig. 2, based on the same inventive concept, an embodiment of the present invention provides an apparatus for evaluating data quality, which is applied to perform data quality evaluation on collected data processed by a first task, where the collected data includes a plurality of data lines, and each data line in the plurality of data lines includes a plurality of field values, the apparatus including: a receiving unit 201 and an executing unit 202.
A receiving unit 201, configured to receive, when the first task is scheduled to be executed, constraint condition information input for the plurality of data rows of the collected data or for each field value in each data row, the constraint condition information corresponding to a requirement of the first task for processing the collected data and being used to generate a data quality evaluation task;
and the execution unit 202 is used for receiving an input operation instruction for setting a relative execution sequence of the data quality evaluation task and the first task, and executing the data quality evaluation task and the first task based on the determined relative execution sequence, so that the data quality evaluation task performs targeted quality evaluation on the data processed by the first task, wherein the relative execution sequence between different first tasks and corresponding data quality evaluation tasks is different.
Optionally, the constraint condition information at least includes a rule type, a rule strength, and a check type, where the rule type is used to indicate that the whole of the multiple data rows are evaluated or that each field value in each data row is evaluated, the rule strength is used to indicate whether the data quality evaluation task and the first task are independent of each other, and the check type is used to indicate a reference value type, a size of the reference value, and a comparison relationship that are adopted when the whole of the multiple data rows are evaluated or each field value in each data row is evaluated.
Optionally, if the first task is a data development type task based on the collected data, the operation instruction is used to set the data quality evaluation task to be executed before the first task.
Optionally, if the rule type in the constraint condition information indicates to evaluate each field value in each data line, and the rule strength information indicates that the data quality evaluation task and the first task are in a non-independent relationship, the execution unit 202 is specifically configured to:
evaluating each field value in each data line based on the reference value type, the size of the reference value and the comparison relation contained in the check type, and determining the evaluation result of each data line, wherein the evaluation result is used for representing whether each field value in each data line meets a first constraint condition formed by the reference value type, the size of the reference value and the comparison relation;
and if the evaluation result indicates that the number of data rows whose field values all meet the first constraint condition is greater than or equal to a first preset threshold, executing the first task based on those data rows.
Optionally, if the rule strength information indicates that the data quality assessment task is independent from the first task, the execution unit 202 is further specifically configured to:
and if the evaluation result indicates that the number of data rows whose field values all meet the first constraint condition is smaller than the first preset threshold, executing the first task based on the collected data.
Optionally, the data quality evaluation apparatus further includes:
the processing unit is used for storing at least one input repair instruction corresponding to at least one data quality problem which exists in the collected data and does not meet the first constraint condition, and establishing a corresponding relation between the at least one data quality problem and the at least one repair instruction;
and the repairing unit is used for determining a first repairing instruction corresponding to the first data quality problem based on the corresponding relation between the at least one data quality problem and the at least one repairing instruction and calling the first repairing instruction to repair the first data quality problem when the first data quality problem corresponding to the first data line partially meeting the first constraint condition exists in the at least one data quality problem.
Optionally, if the first task is a data import type task based on the collected data, the operation instruction is used to set the first task to be executed before the data quality evaluation task.
Optionally, if the rule type in the constraint condition information indicates that the multiple data rows are evaluated as a whole, the execution unit 202 is specifically configured to:
executing a first task based on the acquired data, wherein the first task is used for importing a plurality of data lines included in the acquired data into a preset storage position and numbering the data lines;
after it is determined that the first task has been executed, evaluating the plurality of data rows as a whole based on the reference value type, the size of the reference value and the comparison relation contained in the check type, and determining the evaluation result of the plurality of data rows as a whole, wherein the evaluation result is used for representing whether the plurality of data rows as a whole meet a second constraint condition formed by the reference value type, the size of the reference value and the comparison relation;
and if the evaluation result indicates that the second constraint condition is not met, outputting prompt information, wherein the prompt information is used for reminding a user that the quantity of the collected data processed by the first task is insufficient or excessive.
Referring to fig. 3, based on the same inventive concept, an embodiment of the present invention provides an apparatus for evaluating data quality, which includes at least one processor 301, where the processor 301 is configured to execute a computer program stored in a memory to implement the steps of the method for evaluating data quality shown in fig. 1 provided in the embodiment of the present invention.
Optionally, the processor 301 may specifically be a central processing unit or an application-specific integrated circuit (ASIC), and may be one or more integrated circuits for controlling the execution of programs.
Optionally, the apparatus may further comprise a memory 302 connected to the at least one processor 301. The memory 302 may comprise ROM, RAM, and disk memory. The memory 302 is used for storing data required by the processor 301 during operation, that is, instructions executable by the at least one processor 301, and the at least one processor 301 executes the method shown in fig. 1 by executing the instructions stored in the memory 302. There may be one or more memories 302. The memory 302 is also shown in fig. 3, but it should be understood that the memory 302 is not a mandatory functional block and is therefore drawn with a dotted line in fig. 3.
The physical devices corresponding to the receiving unit 201 and the executing unit 202 may be the processor 301. The apparatus may be used to perform the method provided by the embodiment shown in fig. 1. Therefore, regarding the functions that can be realized by each functional module in the apparatus, reference may be made to the corresponding description in the embodiment shown in fig. 1, and details are not repeated.
Embodiments of the present invention also provide a computer storage medium, where the computer storage medium stores computer instructions, and when the computer instructions are executed on a computer, the computer is caused to execute the method as described in fig. 1.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (10)

1. An evaluation method of data quality, applied to data quality evaluation of collected data processed by a first task, the collected data including a plurality of data rows, each of the plurality of data rows including a plurality of field values, the method comprising, when the first task is scheduled to be executed:
receiving constraint information input for the plurality of data rows of the collected data or for the respective field values in each data row, the constraint information corresponding to a requirement of the first task for processing the collected data and being used for generating a data quality evaluation task, the constraint information including at least a rule type, a rule strength and a check type, wherein the rule type is used for indicating whether the plurality of data rows are evaluated as a whole or each field value in each data row is evaluated, the rule strength is used for indicating whether the data quality evaluation task and the first task are mutually independent, and the check type is used for indicating a reference value type, a reference value size and a comparison relation adopted when the plurality of data rows are evaluated as a whole or each field value in each data row is evaluated;
receiving an input operation instruction for setting a relative execution sequence of the data quality evaluation task and the first task, and executing the data quality evaluation task and the first task based on the determined relative execution sequence, so that the data quality evaluation task performs targeted quality evaluation on the data processed by the first task, wherein the relative execution sequence between different first tasks and corresponding data quality evaluation tasks is different.
2. The method of claim 1, wherein the operational instructions are configured to configure the data quality assessment task to be performed prior to the first task if the first task is a data development type task based on the collected data.
3. The method of claim 2, wherein if the rule type in the constraint information indicates that each field value in each data row is evaluated and the rule strength indicates that the data quality evaluation task and the first task are in a non-independent relationship, executing the data quality evaluation task and the first task based on the determined relative execution sequence comprises:
evaluating each field value in each data line based on the reference value type, the size of the reference value and the comparison relation contained in the check type, and determining an evaluation result of each data line, wherein the evaluation result is used for representing whether each field value in each data line meets a first constraint condition formed by the reference value type, the size of the reference value and the comparison relation;
and if the evaluation result is that the number of the partial data rows which all meet the first constraint condition is larger than or equal to a first preset threshold value, executing the first task based on the partial data rows.
4. The method of claim 3, wherein if the rule strength information indicates an independent relationship between the data quality assessment task and the first task, the method further comprises:
and if the evaluation result is determined that the number of the data rows which all meet the first constraint condition is smaller than the first preset threshold, executing the first task based on the acquired data.
5. The method of claim 3, further comprising:
storing at least one input repair instruction corresponding to at least one data quality problem which exists in the acquired data and does not meet the first constraint condition, and establishing a corresponding relation between the at least one data quality problem and the at least one repair instruction;
and if the evaluation result is that a first data quality problem corresponding to a first data line partially meeting the first constraint condition exists in the at least one data quality problem, determining a first repair instruction corresponding to the first data quality problem based on the corresponding relation between the at least one data quality problem and the at least one repair instruction, and calling the first repair instruction to repair the first data quality problem.
6. The method of claim 1, wherein the operating instructions are configured to configure the first task to be performed prior to the data quality assessment task if the first task is a data import type task based on the collected data.
7. The method of claim 6, wherein, if the rule type in the constraint information indicates that the plurality of data rows are evaluated as a whole, sequentially performing the data quality evaluation task and the first task based on the determined execution order comprises:
executing the first task based on the acquired data, wherein the first task is used for importing a plurality of data lines included in the acquired data into a preset storage position and numbering the data lines;
after it is determined that the first task has been executed completely, evaluating the plurality of data rows as a whole based on the reference value type, the size of the reference value and the comparison relation contained in the check type, and determining the evaluation result of the plurality of data rows as a whole, wherein the evaluation result is used for representing whether the plurality of data rows as a whole meet a second constraint condition formed by the reference value type, the size of the reference value and the comparison relation;
and if the evaluation result indicates that the second constraint condition is not met, outputting prompt information, wherein the prompt information is used for reminding a user that the quantity of the collected data processed by the first task is insufficient or excessive.
8. An apparatus for evaluating data quality, the apparatus being adapted to perform data quality evaluation on collected data processed by a first task, the collected data including a plurality of data lines, each of the plurality of data lines including a plurality of field values, the apparatus comprising:
a receiving unit, configured to receive, when the first task is scheduled to be executed, constraint information input for the plurality of data rows of the collected data or for the respective field values in each data row, the constraint information corresponding to a requirement of the first task for processing the collected data and being used for generating a data quality evaluation task, the constraint information including at least a rule type, a rule strength and a check type, wherein the rule type is used for indicating whether the plurality of data rows are evaluated as a whole or each field value in each data row is evaluated, the rule strength is used for indicating whether the data quality evaluation task and the first task are mutually independent, and the check type is used for indicating a reference value type, a reference value size and a comparison relation adopted when the plurality of data rows are evaluated as a whole or each field value in each data row is evaluated;
and the execution unit is used for receiving an input operation instruction for setting the relative execution sequence of the data quality evaluation task and the first task, and executing the data quality evaluation task and the first task based on the determined relative execution sequence so that the data quality evaluation task performs targeted quality evaluation on the data processed by the first task, wherein the relative execution sequence between different first tasks and corresponding data quality evaluation tasks is different.
9. An apparatus for assessing the quality of data, the apparatus comprising at least one processor and a memory coupled to the at least one processor, the at least one processor being configured to implement the steps of the method according to any one of claims 1 to 7 when executing a computer program stored in the memory.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202011281552.6A 2020-11-16 2020-11-16 Data quality evaluation method and device Active CN112380204B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011281552.6A CN112380204B (en) 2020-11-16 2020-11-16 Data quality evaluation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011281552.6A CN112380204B (en) 2020-11-16 2020-11-16 Data quality evaluation method and device

Publications (2)

Publication Number Publication Date
CN112380204A CN112380204A (en) 2021-02-19
CN112380204B true CN112380204B (en) 2022-08-19

Family

ID=74584821

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011281552.6A Active CN112380204B (en) 2020-11-16 2020-11-16 Data quality evaluation method and device

Country Status (1)

Country Link
CN (1) CN112380204B (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7571069B1 (en) * 2006-12-22 2009-08-04 Hewlett-Packard Development Company, L.P. Data assurance workflow derivation and execution
CN101576893A (en) * 2008-05-09 2009-11-11 北京世纪拓远软件科技发展有限公司 Method and system for analyzing data quality
CN109992576A (en) * 2019-03-01 2019-07-09 苏州龙石信息科技有限公司 A kind of government data quality evaluation and abnormal data recovery technique based on big data technology
CN110309131A (en) * 2019-04-12 2019-10-08 北京星网锐捷网络技术有限公司 The method for evaluating quality and device of massive structured data

Also Published As

Publication number Publication date
CN112380204A (en) 2021-02-19

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant