CN106056287A

CN106056287A - Equipment and method for carrying out data quality evaluation on data set based on context

Info

Publication number: CN106056287A
Application number: CN201610388931.2A
Authority: CN
Inventors: 阮彤; 申翔宇; 叶琪; 李阳; 赵亮
Original assignee: Shanghai Data Trading Center Ltd; East China University of Science and Technology
Current assignee: Shanghai Data Trading Center Ltd; East China University of Science and Technology
Priority date: 2016-06-03
Filing date: 2016-06-03
Publication date: 2016-10-26

Abstract

The invention provides equipment and a method for carrying out data quality evaluation on a data set based on a context. The method comprises the steps of: acquiring the data set to be evaluated and a domain context corresponding to the data set; selecting an evaluation metric for evaluating data quality according to the data set and the domain context; sampling the data set and determining a data subset to be evaluated; calculating an evaluation result based on the evaluation metric according to the data subset to be evaluated, the evaluation metric and the domain context; and aggregating and ranking the evaluation results to obtain an evaluation result of the data set. Compared with the prior art, the equipment and the method evaluate the data subset obtained by sampling the data set according to the obtained domain context and the evaluation metric selected by a user, so that they fully reflect the user requirements, evaluate the data set objectively and comprehensively, and yield an intuitive and comparable evaluation result.

Description

Device and method for evaluating data quality of data set based on context

Technical Field

The present invention relates to data quality assessment technologies, and in particular, to an apparatus and method for performing data quality assessment on a data set based on context.

Background

As big data technology matures and develops, big data is applied ever more widely in business, and the interaction, integration, exchange and even trading of big data are increasing. Although current big data storage and mining technologies have gradually matured, the existence of a large number of "data islands" restricts the circulation of data and the emergence of its value. Only by evaluating the quality of data, pricing it reasonably, and enabling big data transactions can industry information barriers be broken, production efficiency be optimized and improved, and industrial innovation be deeply promoted.

In the field of data transactions, data is bought and sold as a commodity. Data is a logical entity and is abstract: its function, performance and other characteristics can only be understood through operation, observation, analysis, thinking and judgment. In addition, data is clearly non-visual in nature. Therefore, the most important criterion for evaluating data in the field of data transactions is data quality. Existing studies on data quality assessment generally fall into three categories: (1) quality evaluation for data from a specific field or a specific source, aimed at a certain enterprise or organization; for example, Chinese patent application No. 201310714474.8, entitled "fire risk data evaluation method for an electric vehicle charging and battery replacing station", discloses data quality evaluation for data from the specific source of electric vehicle charging and battery replacing stations; (2) research on specific problems in the general field, which focuses either on finding a new metric, such as a metric related to data complexity, or on an automatic calculation method for a particular metric, such as error rate; (3) research oriented towards a general data quality framework, such as the ISO 8000 data quality standard. Because the data sources faced by a big data transaction platform are complex, the existing research cannot solve the problem of evaluating data quality across a wide range of fields.

In addition, data quality evaluation is highly correlated with the application scenario, and a quality evaluation detached from the application scenario cannot meet the requirements of the future data buyers of a transaction platform. However, a quality assessment that depends entirely on specific requirements and user preferences is strongly subjective and loses objectivity. From the perspective of the definition of quality, ISO 8000 refers to the definition in ISO 9000:2005: "degree to which a set of inherent characteristics fulfils requirements". The academic community also generally accepts the notion that "high-quality data should be data that can sufficiently meet the user's usage requirements". In the prior art, the paper "enterprise-oriented informatization data quality assessment research", published in "computer technology and development", 2011, No. 1, designs a service framework for data quality assessment by introducing the reusable-service idea of SOA, explains services such as input and output, process management and automatic assessment based on the framework, and realizes all functional requirements in the form of Web Services components. In addition, the paper "Quality evaluation of Web document content data extracted based on facts", published in "computer science", 2014, No. 11, proposes a Fact-based Quality Assessment method (FQA), which constructs a target document context on the Web and extracts the facts of the Web document content; then establishes references for the accuracy and completeness dimensions by adopting voting and graph iteration strategies, respectively; and finally compares the facts of the target document with the dimension references to quantify accuracy and completeness.

However, the existing data quality evaluation techniques described above still need further improvement.

Disclosure of Invention

In view of the above-mentioned drawbacks of the data quality assessment apparatus in the prior art, the present invention provides an apparatus and method for performing data quality assessment on a data set based on context.

According to one aspect of the present invention, there is provided a computer-implemented method for data quality assessment of a data set based on context, comprising the steps of:

acquiring a data set to be evaluated and a domain context corresponding to the data set;

selecting an evaluation metric for evaluating data quality based on the data set and the domain context;

sampling the data set and determining a data subset to be evaluated;

calculating an evaluation result based on the evaluation metric according to the data subset to be evaluated, the evaluation metric and the domain context; and

aggregating and ranking the evaluation results to obtain an evaluation result of the data set.

In one embodiment, between the step of sampling the data set and the step of calculating the evaluation result, the method further comprises the step of: performing schema alignment between the data subset to be evaluated and the domain context by using a schema alignment library.

In one embodiment, the data set is sampled using stratified sampling, systematic sampling, or random sampling to determine the subset of data to be evaluated.

In one embodiment, the evaluation result is calculated according to the subset of data to be evaluated, the evaluation metric and the domain context by at least one of:

-directly calculating from the definition of the evaluation metric;

-automatically detecting according to a metric formula of said evaluation metric;

-manual evaluation.

In one embodiment, automatically detecting according to the metric formula of the evaluation metric comprises: defining templates for field constraints or inter-field constraints of the data subset to be evaluated; instantiating the defined templates according to the specific data of the data subset to be evaluated to generate test cases for querying the data subset to be evaluated; executing the test cases to obtain query results, the query results returning error data; and calculating the evaluation result according to the error data and the metric formula of the evaluation metric.

In one embodiment, calculating the evaluation result by manual evaluation comprises: randomly distributing evaluation tasks to N evaluators according to the data subset to be evaluated and the evaluation metric, wherein N is an odd number greater than or equal to 3; setting an evaluation period according to the size of the data subset to be evaluated, and acquiring the respective evaluation results of the evaluators within the evaluation period; correcting deviations in the evaluation results according to the respective evaluation results to obtain corrected evaluation results; and calculating an average value of the corrected evaluation results to obtain an evaluation result based on the evaluation metric.

In accordance with another aspect of the present invention, there is provided an apparatus for data quality assessment of a data set based on context, comprising:

a presentation module for acquiring a data set to be evaluated and a domain context corresponding to the data set;

a selection module for selecting an evaluation metric for evaluating data quality based on the data set and the domain context;

a sampling module for sampling the data set and determining a data subset to be evaluated;

a calculation module for calculating an evaluation result based on the evaluation metric according to the data subset to be evaluated, the evaluation metric and the domain context; and

an aggregation ranking module for aggregating and ranking the evaluation results to obtain an evaluation result of the data set.

In one embodiment, the apparatus further includes a schema alignment module, configured to perform schema alignment between the data subset to be evaluated and the domain context according to a schema alignment library, so as to obtain an aligned data subset to be evaluated.

In one embodiment, the calculation module calculates the evaluation result by at least one of the following methods:

-directly calculating from the definition of the evaluation metric;

-manual evaluation.

In one embodiment, the domain context includes a context name, a reference schema, a reference dataset, a dictionary dataset, a use-case set, and a metrics aggregation library.

Compared with the prior art, the context-based data quality evaluation device and method provided by the invention can evaluate the data subset obtained by sampling the data set according to the obtained domain context and the evaluation metric selected by the user, so that the user requirements are fully reflected, the data set can be evaluated comprehensively and objectively, and an intuitive and comparable evaluation result is obtained.

Other features and aspects of the present invention will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.

Drawings

The various aspects of the present invention will become more apparent to the reader after reading the detailed description of the invention with reference to the attached drawings, in which:

FIG. 1 illustrates a block flow diagram of a method for data quality assessment of a data set based on context, in accordance with an embodiment of the present invention;

FIG. 2 illustrates a preferred embodiment of the data quality assessment method of FIG. 1;

FIG. 3A shows a first embodiment of calculating an evaluation result from a subset of data to be evaluated, an evaluation metric, and a domain context in the data quality evaluation method of FIG. 1;

FIG. 3B illustrates a second embodiment of computing an evaluation result based on a subset of data to be evaluated, an evaluation metric, and a domain context in the data quality evaluation method of FIG. 1; and

FIG. 4 shows a block diagram of an apparatus for data quality assessment of a data set based on context, in accordance with another embodiment of the present invention.

Detailed Description

In order to make the present disclosure more thorough and complete, reference is made to the accompanying drawings, in which like references indicate the same or similar elements, and to the various embodiments of the invention described below. However, it will be understood by those of ordinary skill in the art that the examples provided below are not intended to limit the scope of the present invention. In addition, the drawings are only for illustrative purposes and are not drawn to scale.

Specific embodiments of various aspects of the present invention are described in further detail below with reference to the accompanying drawings.

FIG. 1 illustrates a block flow diagram of a method for data quality assessment of a data set based on context, in accordance with an embodiment of the present invention.

Referring to fig. 1, in this embodiment, the method for evaluating data quality of a data set based on context according to the present invention is implemented through steps S110 to S150.

In detail, steps S110 and S120 are performed first: a data set to be evaluated and a domain context corresponding to the data set are obtained, and then an evaluation metric for evaluating data quality is selected according to the data set and the domain context. For example, when obtaining the corresponding domain context (application context), if the system lacks a context consistent with the evaluator's requirements, a new context is customized; if the system has a context that is substantially consistent with the evaluator's requirements, the existing context is customized according to the user's requirements. An evaluation metric is then selected based on the domain context and the data set to be evaluated.

In the embodiments of the present invention, the data set to be evaluated includes, but is not limited to, a relational database; for example, it may also be a knowledge base or the like. An evaluation metric is a data quality indicator against which the user intends to evaluate the data set, and the evaluation metrics include specific metric indicators for each quality dimension. For example, a quality dimension may be richness, accuracy, completeness, consistency, timeliness, availability, data service access performance, queryability, informativeness, and the like. Richness can be further divided into sub-dimensions such as data size, schema size, and class hierarchy depth; the metric indicators of the data size sub-dimension include the number of tables, the number of instances, the number of main entity records, the number of facts, and the like. The domain context includes a context name, a reference schema, a reference data set, a dictionary data set, a use-case set, and a metric aggregation library. The context name describes the domain to which the context belongs; the reference schema contains the standard schema of data in the domain, specifying which fields the data should include and which constraints apply to those fields; the reference data set is a set of good-quality sample data from the domain; the dictionary data set contains the standard dictionary library of the domain; the use-case set contains test cases used for calculating usage quality; and the metric aggregation library contains the weights of the metrics, indicating the relative importance of each metric.
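As an illustration only, the six components of the domain context listed above could be held in a simple record type. The following Python sketch is an assumption about one possible representation; the class name `DomainContext`, its field names, and the sample values are hypothetical and do not come from the original disclosure.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DomainContext:
    """Hypothetical container for the six components named in the description."""
    name: str                              # domain the context belongs to
    reference_schema: Dict[str, str]       # field name -> constraint description
    reference_dataset: List[dict]          # good-quality sample records
    dictionary: Dict[str, List[str]]       # field name -> standard vocabulary
    use_cases: List[str] = field(default_factory=list)            # test cases for usage quality
    metric_weights: Dict[str, float] = field(default_factory=dict)  # metric aggregation library


# Illustrative values only.
ctx = DomainContext(
    name="population records",
    reference_schema={"gender": "value from the dictionary", "birth_date": "ISO date"},
    reference_dataset=[{"gender": "male", "birth_date": "1980-01-01"}],
    dictionary={"gender": ["male", "female"]},
    metric_weights={"accuracy": 0.5, "completeness": 0.3, "timeliness": 0.2},
)
```

Keeping the metric weights inside the context reflects the statement that the metric aggregation library belongs to the domain context, so a different domain can rank the same metrics differently.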

Next, step S130 is performed to sample the data set and determine a data subset to be evaluated. That is, a data subset suitable for evaluation is constructed by sampling the large data set with a data sampling method, and the metric calculation is then performed on that subset. Preferably, the data set is sampled using a stratified sampling method, a systematic sampling method, or a random sampling method. The stratified sampling method divides the data set into several layers according to certain characteristics, determines the total amount of data in each layer, extracts a certain amount of observation data from each layer, and combines the data extracted from all layers to form the sample. The systematic sampling method divides the data into equal parts of n units each (n being the total data amount divided by the sample size), randomly selects the k-th observation unit from the first part, and then takes one observation unit from each subsequent part at equal intervals to form the sample. The random sampling method randomly draws the required number of observation units from the population without replacement to form the sample.
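The three sampling methods described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function names, the `seed` parameter, the fixed per-layer fraction, and the sample rows are all hypothetical, and `systematic_sample` assumes that the requested size is positive and does not exceed the number of rows.

```python
import random
from collections import defaultdict


def simple_random_sample(rows, size, seed=0):
    """Draw `size` rows without replacement."""
    return random.Random(seed).sample(rows, size)


def systematic_sample(rows, size, seed=0):
    """Pick a random start in the first interval, then take every `step`-th row.

    Assumes 0 < size <= len(rows).
    """
    step = len(rows) // size
    start = random.Random(seed).randrange(step)
    return [rows[start + i * step] for i in range(size)]


def stratified_sample(rows, key, fraction, seed=0):
    """Group rows into layers by `key`, then sample the same fraction from each layer."""
    rng = random.Random(seed)
    layers = defaultdict(list)
    for row in rows:
        layers[key(row)].append(row)
    sample = []
    for members in layers.values():
        count = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, count))
    return sample


# Illustrative data: 100 records, 25 in the "west" layer and 75 in the "east" layer.
rows = [{"id": i, "region": "west" if i % 4 == 0 else "east"} for i in range(100)]
random_subset = simple_random_sample(rows, 10)
systematic_subset = systematic_sample(rows, 10)
stratified_subset = stratified_sample(rows, key=lambda r: r["region"], fraction=0.2)
```

Sampling each layer at the same fraction keeps the layer proportions of the full data set in the subset, which is one common way to choose the per-layer amount; the description itself only says that "a certain amount" is taken from each layer.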

Then, step S140 is executed to calculate an evaluation result based on the evaluation metric according to the data subset to be evaluated, the evaluation metric and the domain context. In different embodiments of the present invention, the evaluation result may be calculated directly according to the definition of the evaluation metric, detected automatically according to the metric formula of the evaluation metric, or obtained by manual evaluation. A detailed description is given below in conjunction with FIG. 3A and FIG. 3B.

Finally, step S150 is executed to obtain the evaluation result of the data set by aggregating and ranking the evaluation results of the individual evaluation metrics. For example, after all selected evaluation metrics have been calculated, the data subset has a percentile score on each evaluation metric; all scores are then aggregated into a final data quality score, and data sets are ranked by that score. Preferably, the aggregation and ranking can be performed in one of three ways: first, according to scoring criteria set by domain experts in the domain context; second, according to weights set for each evaluation metric by the data evaluator; and third, according to weights learned by a machine learning method from the importance of each dimension in the context.
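The aggregation in step S150 can be illustrated with a weighted average, which is one plausible reading of "aggregating scores with metric weights"; the description does not fix a formula. In the sketch below, the function names, the 0 to 100 score range, the weights, and the two sample data sets are assumptions made for illustration.

```python
def aggregate_score(metric_scores, weights):
    """Weighted average of per-metric scores (each assumed to lie in 0..100)."""
    total_weight = sum(weights[m] for m in metric_scores)
    return sum(metric_scores[m] * weights[m] for m in metric_scores) / total_weight


def rank_datasets(scores_by_dataset, weights):
    """Return (data set name, aggregate score) pairs, best first."""
    totals = {name: aggregate_score(s, weights) for name, s in scores_by_dataset.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical weights, e.g. taken from the metric aggregation library of a context.
weights = {"accuracy": 0.5, "completeness": 0.3, "timeliness": 0.2}
scores = {
    "dataset_a": {"accuracy": 90, "completeness": 80, "timeliness": 70},
    "dataset_b": {"accuracy": 60, "completeness": 95, "timeliness": 85},
}
ranking = rank_datasets(scores, weights)
```

With these illustrative numbers, `dataset_a` aggregates to about 83 and `dataset_b` to about 75.5, so `dataset_a` ranks first. Dividing by the total weight keeps the result on the same scale even when the weights do not sum to one.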

FIG. 2 illustrates a preferred embodiment of the data quality assessment method of FIG. 1. Compared with FIG. 1, the main difference in this embodiment is that a step S160 is added between step S130 and step S140, in which a schema alignment library is used to perform schema alignment between the data subset to be evaluated and the domain context. That is, the fields of the data subset to be evaluated and of the domain context are looked up in the field mapping relations of the schema alignment library, and fields that have a mapping relation are treated as the same field.

Here, the schema alignment library is constructed as follows. First, a synonym library is built, comprising a Chinese synonym library, an English synonym library, and a Chinese-English cross-reference library. When the data schema provided by a data supplier contains pinyin or pinyin initials, the supplier is required to provide the corresponding full Chinese names, which are added to the synonym library. Second, the synonym library is used to express the fields of the two data set schemas in unified Chinese, and then the character similarity of the fields in the two schemas and the similarity of the constraints on those fields (such as value-range similarity and data-type similarity) are calculated. Third, field pairs with high similarity between the two schemas are identified from the calculated similarities, and a mapping between the fields of the two schemas is constructed. Finally, a domain expert reviews and supplements the constructed mapping and removes incorrect mapping relations, yielding the schema alignment library.

Therefore, by aligning the data subset to be evaluated with the domain context according to the established schema alignment library, the correspondence between the data set and the domain context becomes more accurate, and data in fields that have the same meaning but different names can be evaluated effectively and accurately.
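The automatic part of the alignment procedure (synonym normalization, character similarity, and mapping of high-similarity field pairs) can be sketched as below. This is a simplified illustration: it uses the standard-library `difflib.SequenceMatcher` as the character similarity measure, which the source does not specify, and it omits constraint similarity and the expert review step. The synonym entries, the threshold of 0.8, and all function names are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical synonym library: maps supplier field names to a unified name.
SYNONYMS = {"sex": "gender", "xb": "gender", "dob": "birth_date", "birthday": "birth_date"}


def normalize(field_name):
    """Lower-case the name and replace it with its unified synonym, if any."""
    name = field_name.strip().lower()
    return SYNONYMS.get(name, name)


def field_similarity(a, b):
    """Character-level similarity of two normalized field names, in 0..1."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def propose_mapping(dataset_fields, reference_fields, threshold=0.8):
    """For each data set field, propose the most similar reference field above the threshold."""
    mapping = {}
    for f in dataset_fields:
        best = max(reference_fields, key=lambda r: field_similarity(f, r))
        if field_similarity(f, best) >= threshold:
            mapping[f] = best
    return mapping


mapping = propose_mapping(
    ["Sex", "DOB", "birth_dt", "salary"],
    ["gender", "birth_date", "name"],
)
```

In this example `Sex` and `DOB` are resolved through the synonym library, `birth_dt` is matched to `birth_date` by character similarity, and `salary` is left unmapped; in the described procedure a domain expert would then review such proposals before they enter the schema alignment library.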

Referring to FIG. 3A, when error data is automatically detected according to the metric formula of the evaluation metric, the data in the data subset to be evaluated that does not meet the metric requirement is counted, and the evaluation result of the data subset is calculated from it. The automatic detection method is implemented through steps S210 to S240 and mainly comprises template definition, template instantiation, query execution to obtain error data, and calculation of the evaluation result. Specifically,

in step S210, a template is defined, i.e., a template for field constraints or inter-field constraints of the data subset to be evaluated, such as a value-range template, a comparison template, or a regular-expression template. A value-range template specifies that the value of a field must fall within a given range, for example that a person's gender is male or female; a comparison template specifies the relationship between the value of one field and the value of another field in the same record, for example that a person's date of death is later than the person's date of birth;

in step S220, the template is instantiated, i.e., the defined template is instantiated according to the specific data in the data subset to be evaluated, generating a test case (an SQL query) that can be run against the data subset to be evaluated. There are various ways to instantiate a template to generate a test case: a test case may be generated automatically from the schema of the data set; or a domain expert may select a suitable template and then instantiate it based on knowledge of the data set; or a domain expert may write a template, which is then instantiated to obtain a test case;

in step S230, the query is executed to obtain error data: the test case is executed against the data subset to be evaluated to obtain a query result, and the query result returns the error data. Executing a test case has two possible outcomes: if the returned result shows that error data exists, the error data is obtained; otherwise, no error data is recorded for that test case;

in step S240, an evaluation result is calculated — an evaluation result based on the evaluation metric is calculated according to the error data and the metric formula of the evaluation metric.
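Steps S210 to S240 can be illustrated end to end with an in-memory SQLite database. The sketch below is an assumption-laden example rather than the patented implementation: the template strings, the `person` table and its rows, and the metric formula (accuracy as one minus the share of erroneous records) are all hypothetical choices made for illustration.

```python
import sqlite3

# S210: hypothetical templates; the str.format placeholders are filled in at instantiation.
RANGE_TEMPLATE = "SELECT * FROM {table} WHERE {field} NOT IN ({allowed})"
COMPARE_TEMPLATE = "SELECT * FROM {table} WHERE {left} {bad_op} {right}"

# Illustrative data subset to be evaluated.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER, gender TEXT, birth TEXT, death TEXT)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?, ?)",
    [
        (1, "male", "1950-01-01", "2010-05-01"),
        (2, "female", "1960-03-02", "1959-01-01"),   # death before birth
        (3, "unknown", "1970-07-07", "2020-02-02"),  # gender outside the value range
        (4, "female", "1980-09-09", "2021-11-11"),
    ],
)

# S220: instantiate the templates into test cases (SQL queries that return erroneous rows).
test_cases = [
    RANGE_TEMPLATE.format(table="person", field="gender", allowed="'male', 'female'"),
    COMPARE_TEMPLATE.format(table="person", left="death", bad_op="<=", right="birth"),
]

# S230: execute each test case and collect the ids of erroneous records.
error_ids = set()
for sql in test_cases:
    error_ids.update(row[0] for row in conn.execute(sql))

# S240: assumed metric formula, accuracy = 1 - erroneous records / total records.
total = conn.execute("SELECT COUNT(*) FROM person").fetchone()[0]
accuracy = 1 - len(error_ids) / total
```

Each test case is written so that it selects the records that violate the constraint; an empty result therefore means the constraint holds for the whole subset. Dates are stored as ISO-formatted text so that string comparison orders them correctly.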

Fig. 3B shows a second embodiment of calculating an evaluation result according to the data subset to be evaluated, the evaluation metric, and the domain context in the data quality evaluation method of fig. 1.

Referring to FIG. 3B, in the manual evaluation mode, a manual evaluation is carried out according to the evaluation metric and the data subset to be evaluated, and the evaluation result based on the evaluation metric is calculated from the results of that manual evaluation. The manual evaluation mode is implemented through steps S310 to S340 and mainly comprises task allocation, acquisition of the manual evaluation results, correction of evaluation result deviations, and calculation of the evaluation result. Specifically,

in step S310, task allocation, namely, the evaluation tasks are randomly allocated to a plurality of evaluators according to the data subsets to be evaluated and the evaluation metrics. For example, the number of evaluators is greater than or equal to 3, and the number of evaluators is odd;

in step S320, manual evaluation results are obtained — an evaluation period, such as a limited time period of 4 hours, 8 hours, 24 hours, or 48 hours, is set according to the size of the data subset to be evaluated, and the respective evaluation results of the evaluators are obtained within this time period. If the time range is exceeded, the evaluation task of the evaluator is cancelled;

in step S330, the evaluation result deviation is corrected, that is, deviations in the evaluation results are corrected according to the respective evaluation results to obtain corrected evaluation results. If the evaluation results are inconsistent, that is, the deviation of each evaluation result is greater than or equal to 0.15, the method returns to step S310 to redistribute the tasks;

in step S340, an evaluation result is calculated — from the corrected evaluation result, an average value is calculated to obtain an evaluation result based on the evaluation metric.
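The consistency check and averaging of steps S330 and S340 can be sketched as follows. The description does not define how the deviation is measured, so this example assumes scores in the range 0 to 1 and takes the deviation of a score to be its distance from the median of all scores; that definition, the function name, and the use of `None` to signal redistribution are hypothetical.

```python
from statistics import mean, median

DEVIATION_LIMIT = 0.15  # threshold stated in the description


def combine_manual_scores(scores):
    """Return the mean score, or None when the task should be redistributed (back to S310)."""
    if len(scores) < 3 or len(scores) % 2 == 0:
        raise ValueError("the number of evaluators must be odd and at least 3")
    centre = median(scores)
    # Assumed definition: deviation = distance of each score from the median.
    if any(abs(s - centre) >= DEVIATION_LIMIT for s in scores):
        return None
    return mean(scores)


consistent = combine_manual_scores([0.80, 0.85, 0.90])    # scores agree closely
inconsistent = combine_manual_scores([0.50, 0.85, 0.90])  # one score deviates strongly
```

With an odd number of evaluators the median is always one of the submitted scores, which may be one reason the description requires an odd N; this reading is an inference and is not stated in the source.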

Those skilled in the art will appreciate that in some embodiments, other ways of calculating the evaluation result than that shown in fig. 3A and 3B may be used, such as a direct calculation method, in which the evaluation result of the metric is obtained by direct calculation according to the definition of the evaluation metric and the data subset to be evaluated. In the process of calculating the metrics, some metrics can be directly calculated, for example, the number of tables and the number of entities can be directly counted by a computer. In addition, a direct calculation mode, an automatic detection mode and a manual evaluation mode can be comprehensively used.
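As a small illustration of the direct calculation mode, the number of tables and the number of records per table can be counted straight from a relational database. The sketch below uses an in-memory SQLite database with hypothetical tables; the catalog query against `sqlite_master` is specific to SQLite, and other database systems expose their catalogs differently.

```python
import sqlite3

# Illustrative data set with two tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE address (id INTEGER, city TEXT)")
conn.executemany("INSERT INTO person VALUES (?, ?)", [(1, "A"), (2, "B"), (3, "C")])
conn.execute("INSERT INTO address VALUES (1, 'Shanghai')")

# Number of tables: read the table names from the SQLite catalog.
tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]

# Number of records per table (table names come from the catalog, so quoting them is safe here).
record_counts = {
    t: conn.execute(f'SELECT COUNT(*) FROM "{t}"').fetchone()[0] for t in tables
}
table_count = len(tables)
```

These counts correspond to the data size sub-dimension of richness mentioned earlier and need neither templates nor human evaluators.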

Referring to fig. 4, in this embodiment, an apparatus for data quality assessment of a data set based on context includes a presentation module, a selection module, a sampling module, a calculation module, and an aggregation ranking module.

The presentation module and the selection module can be arranged independently or integrated in the same functional module, and are used for acquiring a data set to be evaluated and a domain context corresponding to the data set, and then selecting an evaluation metric for evaluating the data quality according to the data set and the domain context. In addition, as shown in FIG. 4, the presentation module may also provide an interface for inputting the data set to be evaluated, the evaluation metric, and user configuration parameters, display the received evaluation results and analysis charts, and provide an interface for inputting and displaying the domain context. Here, the domain context includes a context name, a reference schema, a reference data set, a dictionary data set, a use-case set, and a metric aggregation library.

The sampling module samples the data set to be evaluated obtained by the presentation module so as to obtain a data subset to be evaluated. The calculation module is connected with the sampling module and is used for calculating an evaluation result based on the evaluation metric according to the data subset to be evaluated, the evaluation metric and the domain context. The aggregation ranking module is used for aggregating and ranking the evaluation results to obtain the evaluation result of the data set.

Hereinbefore, specific embodiments of the present invention are described with reference to the drawings. However, those skilled in the art will appreciate that various modifications and substitutions can be made to the specific embodiments of the present invention without departing from the spirit and scope of the invention. Such modifications and substitutions are intended to be included within the scope of the present invention as defined by the appended claims.

Claims

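1. A computer-implemented method for data quality assessment of a data set based on context, the method comprising the steps of:

acquiring a data set to be evaluated and a domain context corresponding to the data set;

selecting an evaluation metric for evaluating data quality based on the data set and the domain context;

sampling the data set and determining a data subset to be evaluated;

calculating an evaluation result based on the evaluation metric according to the data subset to be evaluated, the evaluation metric and the domain context; and

aggregating and ranking the evaluation results to obtain an evaluation result of the data set.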

2. The method of claim 1, wherein between the step of sampling the data set and the step of calculating an assessment result, the method further comprises the steps of:

performing schema alignment between the data subset to be evaluated and the domain context by using a schema alignment library.

3. The method of claim 1, wherein the data set is sampled using stratified sampling, systematic sampling, or random sampling to determine the subset of data to be evaluated.

4. The method of claim 1, wherein computing the evaluation result from the subset of data to be evaluated, the evaluation metric, and the domain context is performed in at least one of:

-directly calculating from the definition of the evaluation metric;

-automatically detecting according to a metric formula of said evaluation metric;

-manual evaluation.

5. The method of claim 4, wherein automatically detecting according to the metric formula of the evaluation metric comprises:

defining templates for field constraints or inter-field constraints of the data subset to be evaluated;

instantiating the defined templates according to the specific data of the data subset to be evaluated to generate test cases for querying the data subset to be evaluated;

executing the test cases to obtain query results, the query results returning error data; and

calculating the evaluation result according to the error data and the metric formula of the evaluation metric.

6. The method of claim 4, wherein computing the evaluation result using a manual evaluation comprises:

according to the data subset to be evaluated and the evaluation metric, randomly distributing evaluation tasks to N evaluators, wherein N is an odd number which is greater than or equal to 3;

setting an evaluation period according to the size of the data subset to be evaluated, and acquiring respective evaluation results of the evaluators in the evaluation period;

correcting deviations in the evaluation results according to the respective evaluation results to obtain corrected evaluation results; and

calculating an average value from the corrected evaluation result to obtain an evaluation result based on the evaluation metric.

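7. An apparatus for data quality assessment of a data set based on context, the apparatus comprising:

a presentation module for acquiring a data set to be evaluated and a domain context corresponding to the data set;

a selection module for selecting an evaluation metric for evaluating data quality based on the data set and the domain context;

a sampling module for sampling the data set and determining a data subset to be evaluated;

a calculation module for calculating an evaluation result based on the evaluation metric according to the data subset to be evaluated, the evaluation metric and the domain context; and

an aggregation ranking module for aggregating and ranking the evaluation results to obtain an evaluation result of the data set.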

8. The apparatus of claim 7, further comprising a schema alignment module configured to perform schema alignment between the subset of data to be evaluated and the domain context according to a schema alignment library to obtain an aligned subset of data to be evaluated.

9. The apparatus of claim 7 or 8, wherein the calculation module calculates the evaluation result in at least one of:

-directly calculating from the definition of the evaluation metric;

-manual evaluation.

10. The apparatus of claim 7, wherein the domain context comprises a context name, a reference schema, a reference dataset, a dictionary dataset, a use-case set, and a metrics aggregation library.