CN111026742A - Data quality evaluation method and device, computer equipment and storage medium - Google Patents

Info

Publication number
CN111026742A
Authority
CN
China
Prior art keywords
evaluation result
data
accuracy
timeliness
uniqueness
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911231211.5A
Other languages
Chinese (zh)
Inventor
韩超
李勇波
卢子忱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cloud Computing Center of CAS
Cloud Computing Industry Technology Innovation and Incubation Center of CAS
Original Assignee
Cloud Computing Center of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cloud Computing Center of CAS
Priority to CN201911231211.5A
Publication of CN111026742A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Complex Calculations (AREA)

Abstract

The application relates to a data quality evaluation method, a data quality evaluation device, computer equipment and a storage medium. The method comprises the following steps: the server acquires a data set, wherein the data set comprises a plurality of data rows, and the data rows comprise a plurality of field values; for each data row, the server evaluates the data quality of the data row according to the integrity, the accuracy, the correctness, the uniqueness and the timeliness of each field value in the data row to obtain an evaluation result of the data row; and the server evaluates the data quality of the data set according to the evaluation result of each data row to obtain the evaluation result of the data set. The method can improve the accuracy of data quality evaluation.

Description

Data quality evaluation method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a data quality assessment method, apparatus, computer device, and storage medium.
Background
With the development of computer technology, big data is applied ever more widely. Such applications depend on large amounts of data, and the quality of that data directly affects how well they work, so the evaluation of data quality has become a research hotspot.
In the related art, data is stored in a database. When data quality needs to be evaluated, the data to be evaluated, which may be a plurality of field values, is retrieved from the database, and the field values are evaluated to obtain an evaluation result for each field value.
However, evaluating each field value in isolation cannot yield a comprehensive evaluation result, so the accuracy of the data quality evaluation result is reduced.
Disclosure of Invention
In view of the above technical problems, it is necessary to provide a data quality assessment method, apparatus, computer device and storage medium capable of improving the accuracy of data quality assessment.
In a first aspect, a data quality assessment method is provided, which includes:
obtaining a data set, wherein the data set comprises a plurality of data rows, and the data rows comprise a plurality of field values;
for each data row, evaluating the data quality of the data row according to the integrity, the accuracy, the correctness, the uniqueness and the timeliness of each field value in the data row to obtain an evaluation result of the data row, wherein the integrity of the field value is used for representing whether the field value is missing, the accuracy of the field value is used for representing the proximity degree between the field value and a preset field value, the correctness of the field value is used for representing the degree of the field value following a preset syntax rule, the uniqueness of the field value is used for representing whether the field value is unique in the data row, and the timeliness of the field value is used for representing the time length from generation to use of the field value;
and evaluating the data quality of the data set according to the evaluation result of each data row to obtain the evaluation result of the data set.
In one embodiment, the evaluating the data quality of the data line according to the integrity, the accuracy, the correctness, the uniqueness and the timeliness of each field value in the data line to obtain an evaluation result of the data line includes:
evaluating the integrity of the data line according to the integrity of each field value in the data line to obtain an integrity evaluation result;
evaluating the accuracy of the data line according to the accuracy of each field value in the data line to obtain an accuracy evaluation result;
evaluating the correctness of the data line according to the correctness of each field value in the data line to obtain a correctness evaluation result;
evaluating the uniqueness of the data line according to the uniqueness of each field value in the data line to obtain a uniqueness evaluation result;
evaluating the timeliness of the data row according to the timeliness of each field value in the data row to obtain a timeliness evaluation result;
and obtaining the evaluation result of the data line according to the integrity evaluation result, the accuracy evaluation result, the correctness evaluation result, the uniqueness evaluation result and the timeliness evaluation result.
In one embodiment, the evaluating the integrity of the data line according to the integrity of each field value in the data line to obtain an integrity evaluation result includes:
acquiring integrity scores corresponding to all field values in the data row, wherein the integrity scores are used for indicating whether the corresponding field values are missing or not;
and adding the integrity scores corresponding to the field values in the data row to obtain an integrity total score, and taking the integrity total score as the integrity evaluation result.
In one embodiment, the evaluating the accuracy of the data line according to the accuracy of each field value in the data line to obtain an accuracy evaluation result includes:
acquiring accuracy scores corresponding to all field values in the data row, wherein the accuracy scores are used for representing the closeness degree between the corresponding field values and preset field values;
and adding the accuracy scores corresponding to the field values in the data row to obtain an accuracy total score, and taking the accuracy total score as the accuracy evaluation result.
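The accuracy embodiment above can be sketched as follows. The text does not say how "closeness" to the preset field value is measured, so this sketch assumes a string similarity ratio from Python's standard `difflib` module as a stand-in; the function names `accuracy_score` and `accuracy_total` are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Sequence


def accuracy_score(value: str, preset: str) -> float:
    """Closeness between a field value and its preset value, in [0, 1].

    Assumption: closeness is approximated by difflib's similarity ratio;
    the source does not define the measure.
    """
    return SequenceMatcher(None, value, preset).ratio()


def accuracy_total(row: Sequence[str], presets: Sequence[str]) -> float:
    """Add the accuracy scores of all field values in one data row."""
    if len(row) != len(presets):
        raise ValueError("each field value needs exactly one preset value")
    return sum(accuracy_score(v, p) for v, p in zip(row, presets))


if __name__ == "__main__":
    # The second value differs from its preset in one character.
    print(accuracy_total(["zhangSan", "runing"], ["zhangSan", "running"]))
```

Any other distance or similarity measure could be substituted for `accuracy_score` without changing the summation step.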
In one embodiment, the evaluating the correctness of the data line according to the correctness of each field value in the data line to obtain a correctness evaluation result includes:
acquiring correctness scores corresponding to all field values in the data line, wherein the correctness scores are used for indicating whether the corresponding field values follow a preset grammar rule or not;
and adding the correctness scores corresponding to the field values in the data row to obtain a total correctness score, and taking the total correctness score as the correctness evaluation result.
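As an illustration of the correctness embodiment above, the sketch below assumes that a "preset grammar rule" can be expressed as a regular expression per field. The rule table `RULES`, its field names and its patterns are hypothetical and are not taken from the application.

```python
import re
from typing import Mapping, Pattern

# Hypothetical rule table: one regular expression per field name.
RULES: Mapping[str, Pattern[str]] = {
    "age": re.compile(r"\d{1,3}"),
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
}


def correctness_score(field: str, value: str,
                      rules: Mapping[str, Pattern[str]] = RULES) -> int:
    """Return 1 if the value follows the field's rule, otherwise 0."""
    pattern = rules.get(field)
    if pattern is None:
        # Assumption: a field without a rule is treated as correct.
        return 1
    return 1 if pattern.fullmatch(value) else 0


def correctness_total(row: Mapping[str, str]) -> int:
    """Add the correctness scores of all field values in one data row."""
    return sum(correctness_score(f, v) for f, v in row.items())


if __name__ == "__main__":
    print(correctness_total({"age": "20", "date": "2019-12-05"}))
```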
In one embodiment, the evaluating the uniqueness of the data line according to the uniqueness of each field value in the data line to obtain a uniqueness evaluation result includes:
acquiring uniqueness scores corresponding to all field values in the data row, wherein the uniqueness scores are used for representing whether the corresponding field values are unique in the data row;
and multiplying the uniqueness scores corresponding to the field values in the data row to obtain a total uniqueness score, and taking the total uniqueness score as the uniqueness evaluation result.
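A minimal sketch of the uniqueness embodiment above, assuming each uniqueness score is 1 when the field value occurs exactly once in the data row and 0 otherwise (the text does not fix the scale). Under that assumption the product is 1 only when every field value in the row is unique. The function names are hypothetical.

```python
from collections import Counter
from math import prod
from typing import List, Sequence


def uniqueness_scores(row: Sequence[str]) -> List[int]:
    """Score each field value: 1 if it occurs once in the row, else 0."""
    counts = Counter(row)
    return [1 if counts[value] == 1 else 0 for value in row]


def uniqueness_total(row: Sequence[str]) -> int:
    """Multiply the uniqueness scores of all field values in the row."""
    return prod(uniqueness_scores(row))


if __name__ == "__main__":
    print(uniqueness_total(["a", "b", "c"]))  # every value is unique
    print(uniqueness_total(["a", "b", "a"]))  # "a" is repeated
```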
In one embodiment, the evaluating the timeliness of the data row according to the timeliness of each field value in the data row to obtain a timeliness evaluation result includes:
acquiring timeliness scores corresponding to all field values in the data row, wherein the timeliness scores are used for representing the duration from generation to use of the corresponding field values;
and adding the timeliness scores corresponding to the field values in the data row to obtain a timeliness total score, and taking the timeliness total score as the timeliness evaluation result.
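The timeliness embodiment above could look like the sketch below. The text only says that the score represents the duration from generation to use, so the mapping from duration to score is an assumption here: a linear decay from 1 to 0 over a hypothetical `max_age` window.

```python
from datetime import datetime, timedelta
from typing import Sequence, Tuple


def timeliness_score(generated_at: datetime, used_at: datetime,
                     max_age: timedelta) -> float:
    """Assumed mapping: 1.0 for a fresh value, falling linearly to 0.0."""
    age = used_at - generated_at
    if age < timedelta(0):
        raise ValueError("a field value cannot be used before it is generated")
    return max(0.0, 1.0 - age / max_age)


def timeliness_total(stamps: Sequence[Tuple[datetime, datetime]],
                     max_age: timedelta) -> float:
    """Add the timeliness scores of all field values in one data row."""
    return sum(timeliness_score(g, u, max_age) for g, u in stamps)


if __name__ == "__main__":
    used = datetime(2019, 12, 5)
    stamps = [(datetime(2019, 12, 4), used), (datetime(2019, 12, 1), used)]
    print(timeliness_total(stamps, max_age=timedelta(days=2)))
```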
In one embodiment, the evaluating result of each data line is a numerical value, and the evaluating the data quality of the data set according to the evaluating result of each data line to obtain the evaluating result of the data set includes:
calculating a weighted average value of the evaluation result of each data line;
and obtaining the evaluation result of the data set according to the calculation result.
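The weighted average over data rows described above can be sketched as follows. The text does not say how the weights are chosen, so equal weights are assumed when none are given; `dataset_result` is a hypothetical name.

```python
from typing import Optional, Sequence


def dataset_result(row_results: Sequence[float],
                   weights: Optional[Sequence[float]] = None) -> float:
    """Weighted average of the per-row evaluation results."""
    if not row_results:
        raise ValueError("the data set needs at least one data row")
    if weights is None:
        # Assumption: every data row counts equally by default.
        weights = [1.0] * len(row_results)
    if len(weights) != len(row_results):
        raise ValueError("each data row needs exactly one weight")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("the weights must sum to a positive number")
    weighted_sum = sum(r * w for r, w in zip(row_results, weights))
    return weighted_sum / total_weight


if __name__ == "__main__":
    print(dataset_result([0.5, 1.0]))          # equal weights
    print(dataset_result([0.0, 1.0], [1, 3]))  # second row weighted 3x
```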
In one embodiment, the integrity evaluation result, the accuracy evaluation result, the correctness evaluation result, the uniqueness evaluation result, and the timeliness evaluation result are respectively a numerical value, and the obtaining the evaluation result of the data line according to the integrity evaluation result, the accuracy evaluation result, the correctness evaluation result, the uniqueness evaluation result, and the timeliness evaluation result includes:
performing weighted calculation on the integrity evaluation result, the accuracy evaluation result, the correctness evaluation result, the uniqueness evaluation result and the timeliness evaluation result;
and obtaining the evaluation result of the data line according to the calculation result.
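One way to read the weighted calculation above is a weighted sum of the five per-dimension results. In the sketch below, the `RowDimensions` container and the equal default weights are assumptions made for illustration; the application does not specify the weights.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowDimensions:
    """Five numeric values, one per quality dimension (results or weights)."""
    integrity: float
    accuracy: float
    correctness: float
    uniqueness: float
    timeliness: float


# Assumption: equal weights; real deployments would tune these.
DEFAULT_WEIGHTS = RowDimensions(0.2, 0.2, 0.2, 0.2, 0.2)


def row_result(results: RowDimensions,
               weights: RowDimensions = DEFAULT_WEIGHTS) -> float:
    """Weighted sum of the five evaluation results of one data row."""
    return (results.integrity * weights.integrity
            + results.accuracy * weights.accuracy
            + results.correctness * weights.correctness
            + results.uniqueness * weights.uniqueness
            + results.timeliness * weights.timeliness)


if __name__ == "__main__":
    print(row_result(RowDimensions(0.75, 1.0, 1.0, 1.0, 0.5)))
```

Reusing one dataclass for both results and weights keeps the two aligned by field name, so a dimension cannot be silently paired with the wrong weight.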
In a second aspect, there is provided a data quality evaluation apparatus, the apparatus comprising:
a first obtaining module, configured to obtain a data set, where the data set includes a plurality of data lines, and the data lines include a plurality of field values;
a first evaluation module, configured to evaluate, for each data row, the data quality of the data row according to the integrity, the accuracy, the correctness, the uniqueness and the timeliness of each field value in the data row to obtain an evaluation result of the data row, wherein the integrity of the field value is used for representing whether the field value is missing, the accuracy of the field value is used for representing the proximity degree between the field value and a preset field value, the correctness of the field value is used for representing the degree of the field value following a preset syntax rule, the uniqueness of the field value is used for representing whether the field value is unique in the data row, and the timeliness of the field value is used for representing the time length from generation to use of the field value;
and the second evaluation module is used for evaluating the data quality of the data set according to the evaluation result of each data row to obtain the evaluation result of the data set.
In one embodiment, the first evaluation module comprises:
the integrity evaluation submodule is used for evaluating the integrity of the data row according to the integrity of each field value in the data row to obtain an integrity evaluation result;
the accuracy evaluation submodule is used for evaluating the accuracy of the data line according to the accuracy of each field value in the data line to obtain an accuracy evaluation result;
the correctness evaluation submodule is used for evaluating the correctness of the data line according to the correctness of each field value in the data line to obtain a correctness evaluation result;
the uniqueness evaluation submodule is used for evaluating the uniqueness of the data line according to the uniqueness of each field value in the data line to obtain a uniqueness evaluation result;
the timeliness evaluation submodule is used for evaluating the timeliness of the data line according to the timeliness of each field value in the data line to obtain a timeliness evaluation result;
and the data line evaluation submodule is used for obtaining the evaluation result of the data line according to the integrity evaluation result, the accuracy evaluation result, the correctness evaluation result, the uniqueness evaluation result and the timeliness evaluation result.
In one embodiment, the integrity evaluation sub-module is specifically configured to:
acquiring integrity scores corresponding to all field values in the data row, wherein the integrity scores are used for indicating whether the corresponding field values are missing or not;
and adding the integrity scores corresponding to the field values in the data row to obtain an integrity total score, and taking the integrity total score as the integrity evaluation result.
In one embodiment, the accuracy evaluation sub-module is specifically configured to:
acquiring accuracy scores corresponding to all field values in the data row, wherein the accuracy scores are used for representing the closeness degree between the corresponding field values and preset field values;
and adding the accuracy scores corresponding to the field values in the data row to obtain an accuracy total score, and taking the accuracy total score as the accuracy evaluation result.
In one embodiment, the correctness evaluation sub-module is specifically configured to:
acquiring correctness scores corresponding to all field values in the data line, wherein the correctness scores are used for indicating whether the corresponding field values follow a preset grammar rule or not;
and adding the correctness scores corresponding to the field values in the data row to obtain a total correctness score, and taking the total correctness score as the correctness evaluation result.
In one embodiment, the uniqueness evaluation submodule is specifically configured to:
acquiring uniqueness scores corresponding to all field values in the data row, wherein the uniqueness scores are used for representing whether the corresponding field values are unique in the data row;
and multiplying the uniqueness scores corresponding to the field values in the data row to obtain a total uniqueness score, and taking the total uniqueness score as the uniqueness evaluation result.
In one embodiment, the timeliness assessment sub-module is specifically configured to:
acquiring timeliness scores corresponding to all field values in the data row, wherein the timeliness scores are used for representing the duration from generation to use of the corresponding field values;
and adding the timeliness scores corresponding to the field values in the data row to obtain a timeliness total score, and taking the timeliness total score as the timeliness evaluation result.
In one embodiment, the second evaluation module is specifically configured to: calculating a weighted average value of the evaluation result of each data line;
and obtaining the evaluation result of the data set according to the calculation result.
In one embodiment, the first evaluation module further comprises a computation submodule for:
performing weighted calculation on the integrity evaluation result, the accuracy evaluation result, the correctness evaluation result, the uniqueness evaluation result and the timeliness evaluation result;
and obtaining the evaluation result of the data line according to the calculation result.
The beneficial effects brought by the technical scheme provided by the embodiment of the application at least comprise:
the embodiment of the application provides a data quality evaluation method which can solve the problems in the related technology. In the data quality evaluation method, a server acquires a data set, wherein the data set comprises a plurality of data lines, and the data lines comprise a plurality of field values; for each data row, the server evaluates the data quality of the data row according to the integrity, the accuracy, the correctness, the uniqueness and the timeliness of each field value in the data row to obtain an evaluation result of the data row, wherein the integrity of the field value is used for representing whether the field value is missing, the accuracy of the field value is used for representing the proximity degree between the field value and a preset field value, the correctness of the field value is used for representing the degree of the field value following a preset syntax rule, the uniqueness of the field value is used for representing whether the field value is unique in the data row, and the timeliness of the field value is used for representing the time length from generation to use of the field value; the server evaluates the data quality of the data set according to the evaluation result of each data row to obtain the evaluation result of the data set; according to the method and the device, the data quality of the data row is evaluated according to the integrity, the accuracy, the correctness, the uniqueness and the timeliness of each field value to obtain the evaluation result of the data row, the data set is evaluated according to the evaluation result of the data row, and finally the evaluation result of the data set is obtained.
Drawings
FIG. 1 is a diagram of an application environment of a data quality assessment method in one embodiment;
FIG. 2 is a schematic flow chart diagram illustrating a data quality assessment method according to one embodiment;
FIG. 3 is a schematic flow chart of the data quality evaluation step in another embodiment;
FIG. 4 is a schematic flow chart diagram of a data quality assessment method in another embodiment;
FIG. 5 is a schematic flow chart diagram of a data quality assessment method in another embodiment;
FIG. 6 is a schematic flow chart diagram illustrating a data quality assessment method according to another embodiment;
FIG. 7 is a block diagram showing the structure of a data quality evaluating apparatus according to an embodiment;
FIG. 8 is a block diagram showing the construction of a data quality evaluating apparatus according to another embodiment;
FIG. 9 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The data quality evaluation method provided by the application can be applied to the application environment shown in fig. 1. The application environment can comprise a server 101 and a terminal 102. A database is maintained in the server 101 and data are stored in the database; the server 101 evaluates the quality of the data and sends the evaluation result to the terminal 102. The server 101 establishes a communication connection with the terminal 102 by wired or wireless means.
The terminal 102 may be, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices, and the server 101 may be implemented by an independent server or a server cluster formed by a plurality of servers.
In one embodiment, as shown in fig. 2, a data quality evaluation method is provided, which is described by taking the application of the method to the server in fig. 1 as an example, and includes the following steps:
Step 201, the server obtains a data set, where the data set includes a plurality of data rows, and each data row includes a plurality of field values.
The data set is a set composed of data and can present the data as a table. The data set includes a plurality of data rows, and each data row is composed of a plurality of field values. A field value characterizes the attribute value of a field and can be a number or text; for example, if the field is "name", the corresponding field value can be a specific name such as "Zhang San".
The server retrieves the data set to be evaluated from a database. The data set can present data as a table with a plurality of rows and a plurality of columns. The first row from the top stores the fields, and each value in a column is a field value corresponding to that column's field. The field values in one row form one data row, and a plurality of data rows form the data set.
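The table layout described above (a first row of fields followed by data rows) can be sketched as a small conversion step. The in-memory list-of-lists input and the name `to_data_rows` are assumptions; the application itself only says the data set comes from a database.

```python
from typing import Dict, List, Sequence


def to_data_rows(table: Sequence[Sequence[str]]) -> List[Dict[str, str]]:
    """Split a table into data rows keyed by the fields in its first row."""
    if not table:
        raise ValueError("the table needs at least a header row of fields")
    fields, *rows = table
    data_rows = []
    for number, row in enumerate(rows, start=1):
        if len(row) != len(fields):
            raise ValueError(
                f"data row {number} has {len(row)} values, "
                f"expected {len(fields)}"
            )
        # Pair each field value with the field of its column.
        data_rows.append(dict(zip(fields, row)))
    return data_rows


if __name__ == "__main__":
    table = [
        ["name", "age", "hobby", "score"],
        ["Zhang San", "20", "running", "95"],
    ]
    print(to_data_rows(table))
```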
Step 202, the server evaluates the data quality of each data row according to the integrity, accuracy, correctness, uniqueness and timeliness of each field value in the data row, so as to obtain an evaluation result of the data row.
The integrity of the field value is used for representing whether the field value is missing or not, the accuracy of the field value is used for representing the proximity degree between the field value and a preset field value, the correctness of the field value is used for representing the degree of the field value following a preset syntax rule, the uniqueness of the field value is used for representing whether the field value is unique in the data row, and the timeliness of the field value is used for representing the time length from generation to use of the field value.
The integrity of a field value refers to whether the field value is complete, that is, whether the field value is missing. The accuracy of a field value refers to the proximity between the field value and a preset field value; for example, if the field value is "zhangSan" but the preset field value is spelled differently, the field value deviates from the preset field value. The correctness of a field value refers to whether the field value conforms to the specified rules and falls within the specified range; for example, the height of an adult generally does not exceed 2.5 m, and human blood pressure is generally in the range 90 mmHg < systolic blood pressure < 140 mmHg and 60 mmHg < diastolic blood pressure < 90 mmHg. The uniqueness of a field value refers to whether the field value is unique in the data row in which it is located. The timeliness of a field value refers to the length of time from generation to use of the field value.
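The range examples given for correctness (adult height and blood pressure) can be written as simple checks. This is only a sketch: the bounds are the ones quoted in the text, while the function names and the lower bound of 0 for height are assumptions.

```python
def height_is_plausible(height_m: float) -> bool:
    """Adult height generally does not exceed 2.5 m.

    Assumption: a height must also be greater than 0.
    """
    return 0.0 < height_m <= 2.5


def blood_pressure_is_plausible(systolic_mmhg: float,
                                diastolic_mmhg: float) -> bool:
    """90 < systolic < 140 and 60 < diastolic < 90, both in mmHg."""
    return 90 < systolic_mmhg < 140 and 60 < diastolic_mmhg < 90


if __name__ == "__main__":
    print(height_is_plausible(1.75))
    print(blood_pressure_is_plausible(120, 80))
    print(blood_pressure_is_plausible(150, 80))
```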
Since the data row comprises a plurality of field values, the evaluation of each data row needs to evaluate each field value in the data row first, and for each field value, the integrity, the accuracy, the correctness, the uniqueness and the timeliness of the field value can be evaluated, and the obtained evaluation result is the evaluation result of the data row.
Step 203, the server evaluates the data quality of the data set according to the evaluation result of each data row to obtain the evaluation result of the data set.
Since the data set includes a plurality of data rows, after the evaluation result of the data row is obtained, if the evaluation result of the data set is to be obtained, the data quality of the data set can be evaluated according to the evaluation result of each data row, and finally, the evaluation result of the data set is obtained.
In the data quality evaluation method, a server acquires a data set, wherein the data set comprises a plurality of data lines, and the data lines comprise a plurality of field values; for each data row, the server evaluates the data quality of the data row according to the integrity, the accuracy, the correctness, the uniqueness and the timeliness of each field value in the data row to obtain an evaluation result of the data row, wherein the integrity of the field value is used for representing whether the field value is missing, the accuracy of the field value is used for representing the proximity degree between the field value and a preset field value, the correctness of the field value is used for representing the degree of the field value following a preset syntax rule, the uniqueness of the field value is used for representing whether the field value is unique in the data row, and the timeliness of the field value is used for representing the time length from generation to use of the field value; the server evaluates the data quality of the data set according to the evaluation result of each data row to obtain the evaluation result of the data set; according to the method and the device, the data quality of the data row is evaluated according to the integrity, the accuracy, the correctness, the uniqueness and the timeliness of each field value to obtain the evaluation result of the data row, the data set is evaluated according to the evaluation result of the data row, and finally the evaluation result of the data set is obtained.
In one embodiment, as shown in fig. 3, a step of evaluating the data quality of the data line according to the integrity, accuracy, correctness, uniqueness and timeliness of each field value in the data line to obtain an evaluation result of the data line is proposed, including:
Step 301, the server evaluates the integrity of the data row according to the integrity of each field value in the data row to obtain an integrity evaluation result.
The integrity evaluation of the data row cannot be separated from the integrity of each field value. The server therefore evaluates the integrity of each field value and then evaluates the integrity of the data row according to the integrity of each field value in the data row, finally obtaining an integrity evaluation result.
Step 302, the server evaluates the accuracy of the data row according to the accuracy of each field value in the data row to obtain an accuracy evaluation result.
The data row needs to be evaluated not only in terms of integrity but also in terms of accuracy. The accuracy evaluation of the data row cannot be separated from the accuracy of each field value. The server therefore evaluates the accuracy of each field value and then evaluates the accuracy of the data row according to the accuracy of each field value in the data row, finally obtaining an accuracy evaluation result.
Step 303, the server evaluates the correctness of the data row according to the correctness of each field value in the data row to obtain a correctness evaluation result.
Correctness is an important aspect of evaluating the data row. Because the correctness evaluation of the data row cannot be separated from the correctness of each field value, the server evaluates the correctness of each field value and then evaluates the correctness of the data row according to the correctness of each field value in the data row, finally obtaining a correctness evaluation result.
Step 304, the server evaluates the uniqueness of the data row according to the uniqueness of each field value in the data row to obtain a uniqueness evaluation result.
The data row needs to be evaluated not only in terms of integrity and accuracy but also in terms of uniqueness. The uniqueness evaluation of the data row cannot be separated from the uniqueness of each field value. The server therefore evaluates the uniqueness of each field value and then evaluates the uniqueness of the data row according to the uniqueness of each field value in the data row, finally obtaining a uniqueness evaluation result.
Step 305, the server evaluates the timeliness of the data row according to the timeliness of each field value in the data row to obtain a timeliness evaluation result.
If the timeliness of the data row is too poor, the final evaluation result of the data set has no reference value, so the timeliness of the data row needs to be evaluated. Because the timeliness evaluation of the data row cannot be separated from the timeliness of each field value, the server evaluates the timeliness of each field value and then evaluates the timeliness of the data row according to the timeliness of each field value in the data row, finally obtaining a timeliness evaluation result.
Step 306, the server obtains an evaluation result of the data row according to the integrity evaluation result, the accuracy evaluation result, the correctness evaluation result, the uniqueness evaluation result, and the timeliness evaluation result.
After the integrity evaluation result, the accuracy evaluation result, the correctness evaluation result, the uniqueness evaluation result and the timeliness evaluation result are obtained, these evaluation results are combined to finally obtain the evaluation result of the data row.
In the data quality evaluation model provided in this embodiment, the integrity of the data line is evaluated according to the integrity of each field value in the data line, so as to obtain an integrity evaluation result; evaluating the accuracy of the data line according to the accuracy of each field value in the data line to obtain an accuracy evaluation result; evaluating the correctness of the data line according to the correctness of each field value in the data line to obtain a correctness evaluation result; evaluating the uniqueness of the data line according to the uniqueness of each field value in the data line to obtain a uniqueness evaluation result; evaluating the timeliness of the data row according to the timeliness of each field value in the data row to obtain a timeliness evaluation result; and obtaining the evaluation result of the data row according to the integrity evaluation result, the accuracy evaluation result, the correctness evaluation result, the uniqueness evaluation result and the timeliness evaluation result, and further providing data support for obtaining the evaluation result of the data set.
In one embodiment, as shown in fig. 4, there is further provided a step of evaluating the integrity, accuracy, correctness, uniqueness and timeliness of the data line according to the integrity, accuracy, correctness, uniqueness and timeliness of each field value in the data line, and obtaining the evaluation result of the integrity, accuracy, correctness, uniqueness and timeliness, including:
Step 401, the server obtains an integrity score corresponding to each field value in the data row, where the integrity score is used to indicate whether the corresponding field value is missing.
The server calculates the integrity score corresponding to each field value in the data row according to formula (1):
$$S(p_i) = \begin{cases} 0, & p_i \text{ is null} \\ 1, & \text{otherwise} \end{cases} \tag{1}$$
where $p_i$ is a field value in the data row, $n$ is the number of field values, the field values in the data row are represented as the set $P = \{p_1, p_2, p_3, \ldots, p_n\}$, and $S(p_i)$ indicates whether the field value is null.
Illustratively, a data row has 4 fields, "name", "age", "hobby" and "achievement", whose field values are "Zhang San", "20 years old", "running" and "95 points", respectively. If the field value "running" is missing, the integrity score corresponding to that field value is 0, and the scores of the remaining field values are 1.
Step 402, the server adds the integrity scores corresponding to the field values in the data row to obtain a total integrity score, and the total integrity score is used as the integrity evaluation result.
After the integrity scores corresponding to the field values in the data row are obtained, the integrity scores are added to obtain an integrity total score, and the integrity total score can be used as an integrity evaluation result.
Please refer to formula (2):
$$I = \frac{1}{n} \sum_{i=1}^{n} S(p_i) \tag{2}$$
where $I$ is the integrity evaluation result, $p_i$ is a field value in the data row, $n$ is the number of field values, the field values in the data row are represented as the set $P = \{p_1, p_2, p_3, \ldots, p_n\}$, and $S(p_i)$ indicates whether the field value is null.
Illustratively, as described above, the field values are "Zhang San", "20 years old", "running" and "95 points". If the field value "running" is missing, its integrity score is 0 and the others are 1, so according to formula (2) the total integrity score is 1/4 × (1 + 1 + 0 + 1) = 3/4.
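The integrity calculation of steps 401 and 402 can be sketched as follows. This is a minimal illustration rather than the claimed implementation: the function names are hypothetical, and treating both `None` and the empty string as "missing" is an assumption, since the text only says the score indicates whether a field value is null.

```python
from typing import Optional, Sequence


def integrity_score(value: Optional[str]) -> int:
    """Per-field score S(p_i): 0 if the value is missing, 1 otherwise."""
    # Assumption: None and the empty string both count as "missing".
    return 0 if value is None or value == "" else 1


def integrity_total(row: Sequence[Optional[str]]) -> float:
    """Row-level result: the per-field scores summed and divided by n."""
    if not row:
        raise ValueError("row must contain at least one field value")
    return sum(integrity_score(v) for v in row) / len(row)


# The example row with the "hobby" value missing.
row = ["Zhang San", "20 years old", None, "95 points"]
print(integrity_total(row))  # 0.75
```

The printed value matches the 3/4 obtained in the worked example above.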
In step 403, the server obtains accuracy scores corresponding to the field values in the data row, where the accuracy scores are used to represent the closeness between the corresponding field values and preset field values.
The server calculates the accuracy score corresponding to each field value in the data row as follows.
When the evaluated field value is a numerical value, the accuracy score is calculated using formula (3):
[Formula (3): equation image not reproduced in this text]
where $p_i$ is a field value in the data row, $n$ is the number of field values, the field values in the data row are represented as the set $P = \{p_1, p_2, p_3, \ldots, p_n\}$, $p_0$ is the preset field value, and the distance term in formula (3) is the distance between $p_i$ and $p_0$; the greater the distance, the lower the accuracy score.
Illustratively, a drug contains 0.88 mg of a certain component while the preset amount of that component is 0.78 mg, giving an error of 0.10 mg; the distance between $p_i$ and $p_0$ and the resulting accuracy score are then obtained from formula (3) [inline equation images not reproduced in this text].
Applying this calculation to the field values in the data row yields the accuracy score when the evaluated field values are numerical values.
When the evaluated field value consists of characters, the accuracy score is calculated using formula (4):
[Formula (4): equation image not reproduced in this text]
where $v_i$ is the number of characters in the evaluated field value that match the preset field value, and $v$ is the total number of characters.
Illustratively, one field value is "running ABC" and the preset field value is "running ABD", where one letter or one Chinese character counts as one character. Here "running", "A" and "B" are characters that also appear in the preset field value, so $v_i$ is 3, and $v$ is 5; the accuracy score is therefore 3/5. Applying this calculation to the field values in the data row yields the accuracy score when the evaluated field values are characters.
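A character-based accuracy score of this kind can be sketched as below. The text does not state how matching characters are counted, so this sketch assumes a position-by-position comparison and takes $v$ to be the length of the preset field value; the function name and the sample strings are hypothetical.

```python
def char_accuracy(value: str, preset: str) -> float:
    """Ratio v_i / v of matching characters to total characters."""
    if not preset:
        raise ValueError("preset must be non-empty")
    # Assumption: a character "matches" when it equals the preset character
    # at the same position; v is the length of the preset value.
    matches = sum(1 for a, b in zip(value, preset) if a == b)
    return matches / len(preset)


# Three of the five characters agree with the preset value.
print(char_accuracy("ABCDE", "ABCXY"))  # 0.6
```

A different matching rule (for example, counting characters regardless of position) would change $v_i$ and hence the score, so the rule should be fixed before scores from different fields are compared.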
In step 404, the server adds the accuracy scores corresponding to the field values in the data row to obtain an accuracy total score, and the accuracy total score is used as the accuracy evaluation result.
After the accuracy scores corresponding to the field values in the data row are obtained, the accuracy scores are added to obtain an accuracy total score, and the accuracy total score can be used as an accuracy evaluation result.
The total accuracy score is calculated from the per-field accuracy scores given by formulas (3) and (4) above.
In step 405, the server obtains correctness scores corresponding to the field values in the data row, where the correctness scores are used to indicate whether the corresponding field values comply with a predetermined syntax rule.
The server calculates the correctness score corresponding to each field value in the data row according to formula (5):
[Formula (5): equation image not reproduced in this text]
where $p_i$ is a field value in the data row, $n$ is the number of field values, the field values in the data row are represented as the set $P = \{p_1, p_2, p_3, \ldots, p_n\}$, and $S(p_i)$ determines whether the field value is within the preset range for that field value.
Illustratively, the field value "32031119910213xxxxx" represents an ID number. An ID number typically has 15 to 18 digits, whereas this field value has 19, so it does not follow the predetermined syntax rule.
In step 406, the server adds the correctness scores corresponding to the field values in the data row to obtain a total correctness score, and uses the total correctness score as the correctness evaluation result.
After the correctness scores corresponding to the field values in the data row are obtained, the correctness scores are added to obtain a total correctness score, and the total correctness score can be used as a correctness evaluation result.
Please refer to formula (6):
[Formula (6): equation image not reproduced in this text]
where $p_i$ is a field value in the data row, $n$ is the number of field values, the field values in the data row are represented as the set $P = \{p_1, p_2, p_3, \ldots, p_n\}$, and $S(p_i)$ determines whether the field value is within the preset range for that field value.
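Steps 405 and 406 can be illustrated with a short sketch. The rule shown is a hypothetical stand-in for the ID-number example (a length check of 15 to 18 characters); real syntax rules would be supplied per field. Scoring a field as 1 when it follows its rule and 0 otherwise is an assumption, and the total is the plain sum described in step 406, without any normalization.

```python
from typing import Callable, Sequence

Rule = Callable[[str], bool]


def id_length_rule(value: str) -> bool:
    """Hypothetical syntax rule: an ID number has 15 to 18 characters."""
    return 15 <= len(value) <= 18


def correctness_score(value: str, rule: Rule) -> int:
    """Per-field score: 1 if the value follows its rule, 0 otherwise (assumed)."""
    return 1 if rule(value) else 0


def correctness_total(values: Sequence[str], rules: Sequence[Rule]) -> int:
    """Sum of the per-field correctness scores, as described in step 406."""
    if len(values) != len(rules):
        raise ValueError("each field value needs exactly one rule")
    return sum(correctness_score(v, r) for v, r in zip(values, rules))


# The 19-character value from the example violates the rule.
print(correctness_score("32031119910213xxxxx", id_length_rule))  # 0
# One invalid and one valid (18-character) ID number.
print(correctness_total(
    ["32031119910213xxxxx", "320311199102131234"],
    [id_length_rule, id_length_rule],
))  # 1
```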
In step 407, the server obtains the uniqueness score corresponding to each field value in the data row, where the uniqueness score is used to characterize whether the corresponding field value is unique in the data row.
The server calculates the uniqueness score corresponding to each field value in the data row according to formula (7):
[Formula (7): equation image not reproduced in this text]
where $p_i$ is a field value in the data row, $n$ is the number of rows, the field values in the data row are represented as the set $P = \{p_1, p_2, p_3, \ldots, p_n\}$, and $S(p_i)$ determines whether the field value is consistent with the preset field value.
In step 408, the server multiplies the uniqueness scores corresponding to the field values in the data row to obtain a total uniqueness score, and uses the total uniqueness score as the uniqueness evaluation result.
After the uniqueness scores corresponding to the field values in the data row are obtained, the uniqueness scores are multiplied to obtain a total uniqueness score, and the total uniqueness score can be used as the uniqueness evaluation result.
Please refer to formula (8):
$$U = \prod_{i=1}^{n} S(p_i) \tag{8}$$
where $U$ is the uniqueness evaluation result, $p_i$ is a field value in the data row, $n$ is the number of field values, the field values in the data row are represented as the set $P = \{p_1, p_2, p_3, \ldots, p_n\}$, and $S(p_i)$ determines whether the field value is consistent with the preset field value.
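The multiplication in step 408 means that a single non-unique field value drives the whole result to zero. The sketch below illustrates this under an assumption taken from the description of step 407: a field value scores 1 if it occurs exactly once in its data row and 0 otherwise. The function names are hypothetical.

```python
from collections import Counter
from math import prod
from typing import Hashable, List, Sequence


def uniqueness_scores(row: Sequence[Hashable]) -> List[int]:
    """Per-field scores: 1 if the value occurs once in the row, else 0 (assumed)."""
    counts = Counter(row)
    return [1 if counts[v] == 1 else 0 for v in row]


def uniqueness_total(row: Sequence[Hashable]) -> int:
    """Product of the per-field scores, as described in step 408."""
    return prod(uniqueness_scores(row))


print(uniqueness_total(["Zhang San", "20", "running"]))    # 1
print(uniqueness_total(["Zhang San", "20", "Zhang San"]))  # 0
```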
In step 409, the server obtains the timeliness score corresponding to each field value in the data row, and the timeliness score is used for representing the time length from generation to use of the corresponding field value.
The server calculates the timeliness score corresponding to each field value in the data row according to formula (9):
[Formula (9): equation image not reproduced in this text]
where $t_0$ is the current time and $t$ is the time at which the data row was generated, measured in days.
Illustratively, if the data is generated on the first day and used on the third day, the timeliness score is obtained by substituting these two times into formula (9) [inline equation image not reproduced in this text].
Step 410, the server adds the timeliness scores corresponding to the field values in the data row to obtain a total timeliness score, and the total timeliness score is used as the timeliness evaluation result.
After the timeliness scores corresponding to the field values in the data row are obtained, the timeliness scores are added, so that a timeliness total score can be obtained, and the timeliness total score can serve as a timeliness evaluation result.
In the data quality evaluation method provided by this embodiment, the server evaluates the integrity, accuracy, correctness, uniqueness and timeliness aspects of the data line according to the integrity, accuracy, correctness, uniqueness and timeliness aspects of each field value in the data line to obtain integrity, accuracy, correctness, uniqueness and timeliness evaluation results, thereby providing data support for subsequent evaluation of the data set.
In one embodiment, as shown in fig. 5, since the integrity evaluation result, the accuracy evaluation result, the correctness evaluation result, the uniqueness evaluation result, and the timeliness evaluation result are each a numerical value, there is provided a step of obtaining the evaluation result of the data row according to these five evaluation results, which includes:
in step 501, the server calculates a weighted average of the integrity evaluation result, the accuracy evaluation result, the correctness evaluation result, the uniqueness evaluation result, and the timeliness evaluation result.
After obtaining the integrity evaluation result, the accuracy evaluation result, the correctness evaluation result, the uniqueness evaluation result and the timeliness evaluation result, calculating a weighted average of the evaluation results.
Please refer to formula (10):
[Formula (10): equation image not reproduced in this text]
where $U$ is the uniqueness evaluation result, $W$ is the correctness evaluation result, $A$ is the accuracy evaluation result, $T$ is the timeliness evaluation result, and $I$ is the integrity evaluation result. If the uniqueness evaluation result is 0, then $R$ is 0.
Step 502, the server obtains the evaluation result of the data line according to the calculation result.
R is calculated from U, W, A, T and I according to formula (10); this calculation result is the evaluation result of the data row.
In the data quality evaluation method provided by this embodiment, the integrity evaluation result, the accuracy evaluation result, the correctness evaluation result, the uniqueness evaluation result, and the timeliness evaluation result are subjected to weighted calculation; and acquiring the evaluation result of the data line according to the calculation result, wherein the evaluation result of the data line is comprehensively acquired according to the integrity evaluation result, the accuracy evaluation result, the correctness evaluation result, the uniqueness evaluation result and the timeliness evaluation result, so that the acquired evaluation result of the data line is more accurate.
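The row-level combination can be sketched as follows. Because formula (10) is not reproduced in this text, the exact form is an assumption: the sketch returns 0 when the uniqueness result is 0, as stated above, and otherwise takes a weighted average of the other four results. The equal default weights and the function name are hypothetical.

```python
from typing import Tuple


def row_result(
    u: float,
    w: float,
    a: float,
    t: float,
    i: float,
    weights: Tuple[float, float, float, float] = (0.25, 0.25, 0.25, 0.25),
) -> float:
    """Combine the five evaluation results into a row-level result R."""
    # Stated in the text: a uniqueness result of 0 forces R to 0.
    if u == 0:
        return 0.0
    # Assumption: the remaining results W, A, T, I enter as a weighted average.
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    w_w, w_a, w_t, w_i = weights
    return (w_w * w + w_a * a + w_t * t + w_i * i) / total_weight


print(row_result(1, 1.0, 0.5, 0.5, 1.0))  # 0.75
print(row_result(0, 1.0, 1.0, 1.0, 1.0))  # 0.0
```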
In one embodiment, as shown in fig. 6, since the evaluation result of each of the data lines is a numerical value, a step of evaluating the data quality of the data set according to the evaluation result of each of the data lines to obtain the evaluation result of the data set is provided, which includes:
step 601, the server calculates the weighted average of the evaluation results of each data line.
Please refer to formula (11):
[Formula (11): equation image not reproduced in this text]
where $r_i$ is the evaluation result of the $i$-th data row, $m$ is the number of rows, and the evaluation results of the data rows are represented as the set $R = \{r_1, r_2, r_3, \ldots, r_m\}$.
In step 602, the server obtains the evaluation result of the data set according to the calculation result.
Q is calculated according to formula (11); this calculation result is the evaluation result of the data set.
In the data quality evaluation method provided by this embodiment, the server calculates a weighted average of the evaluation results of the data rows and obtains the evaluation result of the data set according to the calculation result. Because the evaluation result of the data set is derived from the evaluation result of every data row, it is more comprehensive.
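Steps 601 and 602 amount to a weighted average over the row-level results. A minimal sketch follows; the text does not say how the weights are chosen, so defaulting to equal weights is an assumption, and the function name is hypothetical.

```python
from typing import Optional, Sequence


def dataset_result(
    row_results: Sequence[float],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Weighted average Q of the row-level evaluation results r_1..r_m."""
    if not row_results:
        raise ValueError("at least one row result is required")
    # Assumption: equal weights when none are supplied.
    if weights is None:
        weights = [1.0] * len(row_results)
    if len(weights) != len(row_results):
        raise ValueError("one weight per row result is required")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * r for w, r in zip(weights, row_results)) / total_weight


print(dataset_result([0.5, 1.0]))              # 0.75
print(dataset_result([0.0, 1.0], [1.0, 3.0]))  # 0.75
```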
It should be understood that although the steps in the flow charts of figs. 2-6 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in figs. 2-6 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and these sub-steps or stages are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 7, there is provided a data quality evaluation apparatus 700 including: a first obtaining module 701, a first evaluating module 702, and a second evaluating module 703, wherein:
a first obtaining module 701, configured to obtain a data set, where the data set includes a plurality of data lines, and the data lines include a plurality of field values;
a first evaluation module 702, configured to evaluate, for each data row, the data quality of the data row according to the integrity, accuracy, correctness, uniqueness, and timeliness of each field value in the data row, to obtain an evaluation result of the data row, where the integrity of a field value characterizes whether the field value is missing, the accuracy of a field value characterizes how close the field value is to a preset field value, the correctness of a field value characterizes how well the field value follows a predetermined syntax rule, the uniqueness of a field value characterizes whether the field value is unique in the data row in which it is located, and the timeliness of a field value characterizes the length of time from generation of the field value to its use;
the second evaluation module 703 is configured to evaluate the data quality of the data set according to the evaluation result of each data line, so as to obtain the evaluation result of the data set.
Referring to fig. 8, in addition to the modules of the data quality evaluation apparatus 700, the data quality evaluation apparatus 800 further includes an integrity evaluation sub-module 704, an accuracy evaluation sub-module 705, a correctness evaluation sub-module 706, a uniqueness evaluation sub-module 707, a timeliness evaluation sub-module 708, a data line evaluation sub-module 709, and a calculation sub-module 710.
In one embodiment, the first evaluation module 702 comprises:
the integrity evaluation submodule 704 is configured to evaluate the integrity of the data line according to the integrity of each field value in the data line, so as to obtain an integrity evaluation result;
the accuracy evaluation sub-module 705 is configured to evaluate the accuracy of the data line according to the accuracy of each field value in the data line, so as to obtain an accuracy evaluation result;
the correctness evaluating sub-module 706 is configured to evaluate correctness of the data line according to correctness of each field value in the data line, so as to obtain a correctness evaluation result;
the uniqueness evaluation submodule 707 is configured to evaluate uniqueness of the data line according to uniqueness of each field value in the data line, so as to obtain a uniqueness evaluation result;
the timeliness evaluation submodule 708 is configured to evaluate timeliness of the data line according to timeliness of each field value in the data line, so as to obtain a timeliness evaluation result;
the data line evaluation submodule 709 is configured to obtain an evaluation result of the data line according to the integrity evaluation result, the accuracy evaluation result, the correctness evaluation result, the uniqueness evaluation result, and the timeliness evaluation result.
In one embodiment, the integrity evaluation sub-module 704 is specifically configured to:
acquiring integrity scores corresponding to all field values in the data row, wherein the integrity scores are used for indicating whether the corresponding field values are missing or not;
and adding the integrity scores corresponding to the field values in the data row to obtain an integrity total score, and taking the integrity total score as the integrity evaluation result.
The accuracy evaluation sub-module 705 is specifically configured to:
acquiring accuracy scores corresponding to all field values in the data row, wherein the accuracy scores are used for representing the closeness degree between the corresponding field values and preset field values;
and adding the accuracy scores corresponding to the field values in the data row to obtain an accuracy total score, and taking the accuracy total score as the accuracy evaluation result.
The correctness evaluation sub-module 706 is specifically configured to:
acquiring correctness scores corresponding to all field values in the data line, wherein the correctness scores are used for indicating whether the corresponding field values follow a preset grammar rule or not;
and adding the correctness scores corresponding to the field values in the data row to obtain a total correctness score, and taking the total correctness score as the correctness evaluation result.
The uniqueness evaluation submodule 707 is specifically configured to:
acquiring uniqueness scores corresponding to all field values in the data row, wherein the uniqueness scores are used for representing whether the corresponding field values are unique in the data row;
and multiplying the uniqueness scores corresponding to the field values in the data row to obtain a total uniqueness score, and taking the total uniqueness score as the uniqueness evaluation result.
The timeliness assessment sub-module 708 is specifically configured to:
acquiring timeliness scores corresponding to all field values in the data row, wherein the timeliness scores are used for representing the duration from generation to use of the corresponding field values;
and adding the timeliness scores corresponding to the field values in the data row to obtain a timeliness total score, and taking the timeliness total score as the timeliness evaluation result.
The data line evaluation submodule 709 is specifically configured to:
and obtaining the evaluation result of the data line according to the integrity evaluation result, the accuracy evaluation result, the correctness evaluation result, the uniqueness evaluation result and the timeliness evaluation result.
In one embodiment, the second evaluation module is specifically configured to: calculating a weighted average value of the evaluation result of each data line;
and obtaining the evaluation result of the data set according to the calculation result.
In one embodiment, the first evaluation module further comprises a computation submodule for:
performing weighted calculation on the integrity evaluation result, the accuracy evaluation result, the correctness evaluation result, the uniqueness evaluation result and the timeliness evaluation result;
and obtaining the evaluation result of the data line according to the calculation result.
For specific limitations of the data quality evaluation device, reference may be made to the above limitations of the data quality evaluation method, which are not repeated here. The modules in the data quality evaluation apparatus described above may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory in the computer device in software form, so that the processor can call them and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 9. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing data quality evaluation data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a data quality assessment method.
Those skilled in the art will appreciate that the architecture shown in fig. 9 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program:
obtaining a data set, wherein the data set comprises a plurality of data rows, and the data rows comprise a plurality of field values;
for each data row, evaluating the data quality of the data row according to the integrity, the accuracy, the correctness, the uniqueness and the timeliness of each field value in the data row to obtain an evaluation result of the data row, wherein the integrity of the field value is used for representing whether the field value is missing, the accuracy of the field value is used for representing the proximity degree between the field value and a preset field value, the correctness of the field value is used for representing the degree of the field value following a preset syntax rule, the uniqueness of the field value is used for representing whether the field value is unique in the data row, and the timeliness of the field value is used for representing the time length from generation to use of the field value;
and evaluating the data quality of the data set according to the evaluation result of each data row to obtain the evaluation result of the data set.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
evaluating the integrity of the data line according to the integrity of each field value in the data line to obtain an integrity evaluation result;
evaluating the accuracy of the data line according to the accuracy of each field value in the data line to obtain an accuracy evaluation result;
evaluating the correctness of the data line according to the correctness of each field value in the data line to obtain a correctness evaluation result;
evaluating the uniqueness of the data line according to the uniqueness of each field value in the data line to obtain a uniqueness evaluation result;
evaluating the timeliness of the data row according to the timeliness of each field value in the data row to obtain a timeliness evaluation result;
and obtaining the evaluation result of the data line according to the integrity evaluation result, the accuracy evaluation result, the correctness evaluation result, the uniqueness evaluation result and the timeliness evaluation result.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
acquiring integrity scores corresponding to all field values in the data row, wherein the integrity scores are used for indicating whether the corresponding field values are missing or not;
and adding the integrity scores corresponding to the field values in the data row to obtain an integrity total score, and taking the integrity total score as the integrity evaluation result.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
acquiring accuracy scores corresponding to all field values in the data row, wherein the accuracy scores are used for representing the closeness degree between the corresponding field values and preset field values;
and adding the accuracy scores corresponding to the field values in the data row to obtain an accuracy total score, and taking the accuracy total score as the accuracy evaluation result.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
acquiring correctness scores corresponding to all field values in the data line, wherein the correctness scores are used for indicating whether the corresponding field values follow a preset grammar rule or not;
and adding the correctness scores corresponding to the field values in the data row to obtain a total correctness score, and taking the total correctness score as the correctness evaluation result.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
acquiring uniqueness scores corresponding to all field values in the data row, wherein the uniqueness scores are used for representing whether the corresponding field values are unique in the data row;
and multiplying the uniqueness scores corresponding to the field values in the data row to obtain a total uniqueness score, and taking the total uniqueness score as the uniqueness evaluation result.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
acquiring timeliness scores corresponding to all field values in the data row, wherein the timeliness scores are used for representing the duration from generation to use of the corresponding field values;
and adding the timeliness scores corresponding to the field values in the data row to obtain a timeliness total score, and taking the timeliness total score as the timeliness evaluation result.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
calculating the weighted average value of the evaluation result of each data line;
and obtaining the evaluation result of the data set according to the calculation result.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
performing weighted calculation on the integrity evaluation result, the accuracy evaluation result, the correctness evaluation result, the uniqueness evaluation result and the timeliness evaluation result;
and obtaining the evaluation result of the data line according to the calculation result.
In one embodiment, a readable storage medium is provided, having stored thereon a computer program which, when executed by a processor, performs the steps of:
obtaining a data set, wherein the data set comprises a plurality of data rows, and the data rows comprise a plurality of field values;
for each data row, evaluating the data quality of the data row according to the integrity, the accuracy, the correctness, the uniqueness and the timeliness of each field value in the data row to obtain an evaluation result of the data row, wherein the integrity of the field value is used for representing whether the field value is missing, the accuracy of the field value is used for representing the proximity degree between the field value and a preset field value, the correctness of the field value is used for representing the degree of the field value following a preset syntax rule, the uniqueness of the field value is used for representing whether the field value is unique in the data row, and the timeliness of the field value is used for representing the time length from generation to use of the field value;
and evaluating the data quality of the data set according to the evaluation result of each data row to obtain the evaluation result of the data set.
In one embodiment, the computer program when executed by the processor further performs the steps of:
evaluating the integrity of the data line according to the integrity of each field value in the data line to obtain an integrity evaluation result;
evaluating the accuracy of the data line according to the accuracy of each field value in the data line to obtain an accuracy evaluation result;
evaluating the correctness of the data line according to the correctness of each field value in the data line to obtain a correctness evaluation result;
evaluating the uniqueness of the data line according to the uniqueness of each field value in the data line to obtain a uniqueness evaluation result;
evaluating the timeliness of the data row according to the timeliness of each field value in the data row to obtain a timeliness evaluation result;
and obtaining the evaluation result of the data line according to the integrity evaluation result, the accuracy evaluation result, the correctness evaluation result, the uniqueness evaluation result and the timeliness evaluation result.
In one embodiment, the computer program when executed by the processor further performs the steps of:
acquiring integrity scores corresponding to all field values in the data row, wherein the integrity scores are used for indicating whether the corresponding field values are missing or not;
and adding the integrity scores corresponding to the field values in the data row to obtain an integrity total score, and taking the integrity total score as the integrity evaluation result.
In one embodiment, the computer program when executed by the processor further performs the steps of:
acquiring accuracy scores corresponding to all field values in the data row, wherein the accuracy scores are used for representing the closeness degree between the corresponding field values and preset field values;
and adding the accuracy scores corresponding to the field values in the data row to obtain an accuracy total score, and taking the accuracy total score as the accuracy evaluation result.
In one embodiment, the computer program when executed by the processor further performs the steps of:
acquiring correctness scores corresponding to all field values in the data line, wherein the correctness scores are used for indicating whether the corresponding field values follow a preset grammar rule or not;
and adding the correctness scores corresponding to the field values in the data row to obtain a total correctness score, and taking the total correctness score as the correctness evaluation result.
In one embodiment, the computer program when executed by the processor further performs the steps of:
acquiring uniqueness scores corresponding to all field values in the data row, wherein the uniqueness scores are used for representing whether the corresponding field values are unique in the data row;
and multiplying the uniqueness scores corresponding to the field values in the data row to obtain a total uniqueness score, and taking the total uniqueness score as the uniqueness evaluation result.
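Unlike the other dimensions, the uniqueness scores are multiplied rather than added. Under the assumed convention that a unique field value scores 1 and a repeated one scores 0, a single repeated value drives the whole product to 0. The sketch below uses hypothetical function names and requires Python 3.8 or later for `math.prod`.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def uniqueness_scores(row: Sequence[Hashable]) -> list:
    """Assumed convention: 1 if the value occurs exactly once in the row, else 0."""
    counts = Counter(row)
    return [1 if counts[v] == 1 else 0 for v in row]


def uniqueness_total(row: Sequence[Hashable]) -> int:
    """Multiply the per-field uniqueness scores to obtain the total uniqueness score."""
    return math.prod(uniqueness_scores(row))


print(uniqueness_total(["a", "b", "c"]))  # every value is unique in the row
print(uniqueness_total(["a", "b", "a"]))  # "a" is repeated, so the product is 0
```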
In one embodiment, the computer program when executed by the processor further performs the steps of:
acquiring timeliness scores corresponding to all field values in the data row, wherein the timeliness scores are used for representing the duration from generation to use of the corresponding field values;
and adding the timeliness scores corresponding to the field values in the data row to obtain a timeliness total score, and taking the timeliness total score as the timeliness evaluation result.
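A timeliness score derived from the duration between generation and use might look like the sketch below. The linear decay, the 10-day `max_age` default, and the function names are assumptions; the specification states only that the score represents that duration.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple


def timeliness_score(generated_at: datetime, used_at: datetime,
                     max_age: timedelta = timedelta(days=10)) -> float:
    """Assumed linear decay: 1.0 for a value used immediately, 0.0 at max_age or older."""
    age = used_at - generated_at
    if age < timedelta(0):
        raise ValueError("used_at must not precede generated_at")
    return max(0.0, 1.0 - age / max_age)


def timeliness_total(fields: Iterable[Tuple[datetime, datetime]]) -> float:
    """Add the per-field timeliness scores to obtain the timeliness total score."""
    return sum(timeliness_score(generated, used) for generated, used in fields)


used = datetime(2019, 12, 5)
fields = [(datetime(2019, 12, 1), used), (datetime(2019, 11, 15), used)]
print(timeliness_total(fields))
```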
In one embodiment, the computer program when executed by the processor further performs the steps of:
calculating a weighted average of the evaluation results of the data rows;
and obtaining the evaluation result of the data set according to the calculation result.
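The data-set level aggregation can be sketched as a weighted average over the row results. How the row weights are chosen is not stated in the specification, so the weights in this example are purely illustrative, as is the name `dataset_score`.

```python
from typing import Sequence


def dataset_score(row_scores: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted average of the per-row evaluation results."""
    if len(row_scores) != len(weights):
        raise ValueError("row_scores and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(s * w for s, w in zip(row_scores, weights)) / total_weight


# The first row counts three times as much as the second: (4*3 + 2*1) / 4.
print(dataset_score([4.0, 2.0], [3.0, 1.0]))
```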
In one embodiment, the computer program when executed by the processor further performs the steps of:
performing weighted calculation on the integrity evaluation result, the accuracy evaluation result, the correctness evaluation result, the uniqueness evaluation result and the timeliness evaluation result;
and obtaining the evaluation result of the data row according to the calculation result.
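The weighted calculation over the five dimension results can be sketched as a weighted sum. The `RowResults` container, the function name `row_score`, and the example weights are hypothetical; the specification does not prescribe particular weights or say whether they are normalized.

```python
from dataclasses import astuple, dataclass
from typing import Sequence


@dataclass(frozen=True)
class RowResults:
    """The five per-row evaluation results, each a numerical value."""
    integrity: float
    accuracy: float
    correctness: float
    uniqueness: float
    timeliness: float


def row_score(results: RowResults, weights: Sequence[float]) -> float:
    """Weighted sum of the five results; weights follow the field order above."""
    values = astuple(results)
    if len(weights) != len(values):
        raise ValueError("exactly five weights are required")
    return sum(v * w for v, w in zip(values, weights))


results = RowResults(integrity=4, accuracy=3.5, correctness=4, uniqueness=1, timeliness=2.0)
print(row_score(results, (0.3, 0.2, 0.2, 0.1, 0.2)))
```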
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features contains no contradiction, it should be considered to fall within the scope of this specification.
The above embodiments express only several implementations of the present application. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (12)

1. A method for evaluating data quality, the method comprising:
obtaining a data set, wherein the data set comprises a plurality of data rows, and each data row comprises a plurality of field values;
for each data row, evaluating the data quality of the data row according to the integrity, the accuracy, the correctness, the uniqueness and the timeliness of each field value in the data row to obtain an evaluation result of the data row, wherein the integrity of the field value is used for representing whether the field value is missing, the accuracy of the field value is used for representing the proximity degree between the field value and a preset field value, the correctness of the field value is used for representing the degree of the field value following a preset syntax rule, the uniqueness of the field value is used for representing whether the field value is unique in the data row, and the timeliness of the field value is used for representing the time length from generation to use of the field value;
and evaluating the data quality of the data set according to the evaluation result of each data row to obtain the evaluation result of the data set.
2. The method according to claim 1, wherein the evaluating the data quality of the data row according to the integrity, the accuracy, the correctness, the uniqueness and the timeliness of each field value in the data row to obtain the evaluation result of the data row comprises:
evaluating the integrity of the data row according to the integrity of each field value in the data row to obtain an integrity evaluation result;
evaluating the accuracy of the data row according to the accuracy of each field value in the data row to obtain an accuracy evaluation result;
evaluating the correctness of the data row according to the correctness of each field value in the data row to obtain a correctness evaluation result;
evaluating the uniqueness of the data row according to the uniqueness of each field value in the data row to obtain a uniqueness evaluation result;
evaluating the timeliness of the data row according to the timeliness of each field value in the data row to obtain a timeliness evaluation result;
and obtaining the evaluation result of the data row according to the integrity evaluation result, the accuracy evaluation result, the correctness evaluation result, the uniqueness evaluation result and the timeliness evaluation result.
3. The method according to claim 2, wherein the evaluating the integrity of the data row according to the integrity of each field value in the data row to obtain an integrity evaluation result comprises:
acquiring integrity scores corresponding to all field values in the data row, wherein the integrity scores are used for indicating whether the corresponding field values are missing or not;
and adding the integrity scores corresponding to the field values in the data row to obtain an integrity total score, and taking the integrity total score as the integrity evaluation result.
4. The method of claim 2, wherein the evaluating the accuracy of the data row according to the accuracy of each field value in the data row, and obtaining an accuracy evaluation result comprises:
acquiring accuracy scores corresponding to all field values in the data row, wherein the accuracy scores are used for representing the closeness degree between the corresponding field values and preset field values;
and adding the accuracy scores corresponding to the field values in the data row to obtain an accuracy total score, and taking the accuracy total score as the accuracy evaluation result.
5. The method according to claim 2, wherein the evaluating the correctness of the data row according to the correctness of each field value in the data row to obtain a correctness evaluation result comprises:
acquiring correctness scores corresponding to all field values in the data row, wherein the correctness scores are used for indicating whether the corresponding field values follow a preset syntax rule or not;
and adding the correctness scores corresponding to the field values in the data row to obtain a total correctness score, and taking the total correctness score as the correctness evaluation result.
6. The method according to claim 2, wherein the evaluating the uniqueness of the data row according to the uniqueness of each field value in the data row to obtain a uniqueness evaluation result comprises:
acquiring uniqueness scores corresponding to all field values in the data row, wherein the uniqueness scores are used for representing whether the corresponding field values are unique in the data row;
and multiplying the uniqueness scores corresponding to the field values in the data row to obtain a total uniqueness score, and taking the total uniqueness score as the uniqueness evaluation result.
7. The method according to claim 2, wherein the evaluating the timeliness of the data row according to the timeliness of the respective field values in the data row, and obtaining a timeliness evaluation result, comprises:
acquiring timeliness scores corresponding to all field values in the data row, wherein the timeliness scores are used for representing the duration from generation to use of the corresponding field values;
and adding the timeliness scores corresponding to the field values in the data row to obtain a total timeliness score, and taking the total timeliness score as the timeliness evaluation result.
8. The method of claim 1, wherein the evaluation result of each of the data rows is a numerical value, and the evaluating the data quality of the data set according to the evaluation result of each of the data rows to obtain the evaluation result of the data set comprises:
calculating a weighted average of the evaluation results of the data rows;
and obtaining the evaluation result of the data set according to the calculation result.
9. The method according to claim 2, wherein the integrity evaluation result, the accuracy evaluation result, the correctness evaluation result, the uniqueness evaluation result, and the timeliness evaluation result are respectively a numerical value, and the obtaining the evaluation result of the data row according to the integrity evaluation result, the accuracy evaluation result, the correctness evaluation result, the uniqueness evaluation result, and the timeliness evaluation result comprises:
performing weighted calculation on the integrity evaluation result, the accuracy evaluation result, the correctness evaluation result, the uniqueness evaluation result and the timeliness evaluation result;
and obtaining the evaluation result of the data row according to the calculation result.
10. An apparatus for evaluating data quality, the apparatus comprising:
a first obtaining module, configured to obtain a data set, wherein the data set comprises a plurality of data rows, and each data row comprises a plurality of field values;
a first evaluation module, configured to, for each data row, evaluate the data quality of the data row according to the integrity, the accuracy, the correctness, the uniqueness and the timeliness of each field value in the data row to obtain an evaluation result of the data row, wherein the integrity of the field value is used for representing whether the field value is missing, the accuracy of the field value is used for representing the proximity degree between the field value and a preset field value, the correctness of the field value is used for representing the degree of the field value following a preset syntax rule, the uniqueness of the field value is used for representing whether the field value is unique in the data row, and the timeliness of the field value is used for representing the time length from generation to use of the field value;
and a second evaluation module, configured to evaluate the data quality of the data set according to the evaluation result of each data row to obtain the evaluation result of the data set.
11. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 9 when executing the computer program.
12. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 9.
CN201911231211.5A 2019-12-05 2019-12-05 Data quality evaluation method and device, computer equipment and storage medium Pending CN111026742A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911231211.5A CN111026742A (en) 2019-12-05 2019-12-05 Data quality evaluation method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111026742A true CN111026742A (en) 2020-04-17

Family

ID=70208024

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911231211.5A Pending CN111026742A (en) 2019-12-05 2019-12-05 Data quality evaluation method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111026742A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150134591A1 (en) * 2013-09-24 2015-05-14 Here Global B.V. Method, apparatus, and computer program product for data quality analysis
CN104462744A (en) * 2014-10-09 2015-03-25 广东工业大学 Data quality control method suitable for cardiovascular remote monitoring system
CN106383984A (en) * 2016-08-30 2017-02-08 南京邮电大学 Big data quality effective evaluation method based on MMTD
CN108345985A (en) * 2018-01-09 2018-07-31 国网瑞盈电力科技(北京)有限公司 A kind of power distribution network Data Quality Assessment Methodology and system
CN108764705A (en) * 2018-05-24 2018-11-06 国信优易数据有限公司 A kind of data quality accessment platform and method
CN109299062A (en) * 2018-07-02 2019-02-01 北京市天元网络技术股份有限公司 A kind of quality evaluating method and system towards document category digital resource metadata

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
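周强: "Application research and practice of PDCA cycle theory in foreign exchange data quality management", 微型电脑应用 (Microcomputer Applications), vol. 33, no. 01, pages 62 - 66 *
田仲 et al.: "Research and design of a general data quality scoring system", 标准科学 (Standard Science), no. 5, pages 95 - 99 *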

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112286899A (en) * 2020-10-30 2021-01-29 南方电网科学研究院有限责任公司 Meter data quality evaluation method, meter reading center terminal, system, equipment and medium
CN112488528A (en) * 2020-12-01 2021-03-12 东莞中国科学院云计算产业技术创新与育成中心 Data set processing method, device, equipment and storage medium
CN113094031A (en) * 2021-03-16 2021-07-09 上海晓途网络科技有限公司 Factor generation method and device, computer equipment and storage medium
CN113094031B (en) * 2021-03-16 2024-02-20 上海晓途网络科技有限公司 Factor generation method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination