CN106056287A - Equipment and method for carrying out data quality evaluation on data set based on context - Google Patents

Equipment and method for carrying out data quality evaluation on data set based on context

Info

Publication number
CN106056287A
Authority
CN
China
Prior art keywords
evaluation
data
evaluated
data set
metric
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610388931.2A
Other languages
Chinese (zh)
Inventor
阮彤
申翔宇
叶琪
李阳
赵亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Data Trading Center Ltd
East China University of Science and Technology
Original Assignee
Shanghai Data Trading Center Ltd
East China University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Data Trading Center Ltd, East China University of Science and Technology filed Critical Shanghai Data Trading Center Ltd
Priority to CN201610388931.2A priority Critical patent/CN106056287A/en
Publication of CN106056287A publication Critical patent/CN106056287A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Economics (AREA)
  • Strategic Management (AREA)
  • Development Economics (AREA)
  • Educational Administration (AREA)
  • Physics & Mathematics (AREA)
  • Tourism & Hospitality (AREA)
  • Quality & Reliability (AREA)
  • General Business, Economics & Management (AREA)
  • Operations Research (AREA)
  • Marketing (AREA)
  • Game Theory and Decision Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides equipment and a method for carrying out data quality evaluation on a data set based on a context. The method comprises the steps of: acquiring the data set to be evaluated and a domain context corresponding to the data set; selecting an evaluation metric used for evaluating data quality according to the data set and the domain context; sampling the data set, and determining data subsets to be evaluated; calculating evaluation results obtained based on the evaluation metric according to the data subsets to be evaluated, the evaluation metric and the domain context; and aggregating and ranking the evaluation results to obtain an evaluation result of the data set. Compared with the prior art, the equipment and the method evaluate the data subsets obtained by sampling the data set according to the obtained domain context and the evaluation metric selected by the user; they fully reflect the user's requirements, evaluate the data set objectively and comprehensively, and yield an intuitive and comparable evaluation result.

Description

Device and method for evaluating data quality of data set based on context
Technical Field
The present invention relates to data quality assessment technologies, and in particular, to an apparatus and method for performing data quality assessment on a data set based on context.
Background
As big data technology matures, big data is applied ever more widely in business, and the interaction, integration, exchange and even trading of big data are increasing. Although big data storage and mining technologies have gradually matured, the existence of a large number of "data islands" restricts the circulation and emergence of data. Only by evaluating data quality, pricing data reasonably and enabling big data transactions can industry information barriers be broken down, production efficiency be optimized and improved, and industrial innovation be promoted in depth.
In the field of data transactions, data is bought and sold as a commodity. Data is an abstract logical entity whose function, performance and other characteristics can be understood only through operation, observation, analysis, reasoning and judgment. Data is also markedly non-visual. Therefore, the most important criterion for evaluating data in the field of data transactions is data quality. Existing studies on data quality assessment generally fall into three categories: (1) quality evaluation of data from a specific field or a specific source, aimed at a particular enterprise or organization; for example, Chinese patent application No. 201310714474.8, entitled "Fire risk data evaluation method for an electric vehicle charging and battery replacing station", discloses data quality evaluation for data from the specific source of electric vehicle charging and battery replacing stations; (2) research on specific problems in the general field, which focuses either on finding new metrics, such as metrics related to data complexity, or on automatic calculation methods for particular metrics, such as error rate; (3) research oriented towards a general data quality framework, such as the ISO 8000 data quality standard. Because the data sources faced by a big data transaction platform are complex, existing research cannot solve the problem of evaluating data quality across a wide range of fields.
In addition, data quality evaluation is closely tied to the application scenario; a quality evaluation detached from the application scenario cannot meet the needs of future data buyers on a transaction platform. However, a quality assessment that depends entirely on specific requirements and user preferences is highly subjective and loses the objectivity of quality. As to the definition of quality, ISO 8000 refers to the ISO 9000:2005 definition: "degree to which a set of inherent characteristics fulfils requirements". The academic community also generally accepts the notion that "high-quality data should be data that can sufficiently meet the user's usage requirements". In the prior art, the paper "Enterprise-oriented informatization data quality assessment research", published in "Computer Technology and Development", 2011, No. 1, designs a service framework for data quality assessment by introducing the reusable-service idea of service-oriented architecture (SOA), explains services such as input and output, process management and automatic assessment based on that framework, and realizes all functional requirements in the form of Web Services components. In addition, the paper "Quality evaluation of Web document content data extracted based on facts", in "Computer Science", 2014, No. 11, proposes a fact-based quality assessment method (FQA), which constructs a target document context on the Web and extracts the facts of the Web document content; then establishes references for the accuracy and completeness dimensions using voting and graph iteration strategies, respectively; and finally compares the facts of the target document with the dimension references to quantify accuracy and completeness.
However, the existing data quality evaluation techniques described above still leave room for improvement.
Disclosure of Invention
In view of the above-mentioned drawbacks of the data quality assessment apparatus in the prior art, the present invention provides an apparatus and method for performing data quality assessment on a data set based on context.
According to one aspect of the present invention, there is provided a computer-implemented method for data quality assessment of a data set based on context, comprising the steps of:
acquiring a data set to be evaluated and a domain context corresponding to the data set;
selecting an evaluation metric for evaluating data quality based on the data set and the domain context;
sampling the data set and determining a data subset to be evaluated;
calculating an evaluation result obtained based on the evaluation metric according to the to-be-evaluated data subset, the evaluation metric and the domain context; and
aggregating and ranking the evaluation results to obtain the evaluation result of the data set.
In one embodiment, between the step of sampling the data set and the step of calculating the evaluation result, the method further comprises the step of: performing schema alignment between the data subset to be evaluated and the domain context using a schema alignment library.
In one embodiment, the data set is sampled using stratified sampling, systematic sampling, or random sampling to determine the subset of data to be evaluated.
In one embodiment, the evaluation result is calculated according to the subset of data to be evaluated, the evaluation metric and the domain context by at least one of:
-directly calculating from the definition of the evaluation metric;
-automatically detecting according to a metric formula of said evaluation metric;
-manual evaluation.
In one embodiment, automatically detecting according to the metric formula of the evaluation metric comprises: defining templates for field constraints, or for constraints among fields, of the data subset to be evaluated; instantiating the defined template according to the specific data of the data subset to be evaluated to generate a test case for querying the data subset to be evaluated; executing the test case to obtain a query result, the query result returning error data; and calculating the evaluation result according to the error data and the metric formula of the evaluation metric.
In one embodiment, calculating the evaluation result by manual evaluation comprises: randomly distributing evaluation tasks to N evaluators according to the data subset to be evaluated and the evaluation metric, wherein N is an odd number greater than or equal to 3; setting an evaluation period according to the size of the data subset to be evaluated, and acquiring the respective evaluation results of the evaluators within the evaluation period; correcting deviations in the evaluation results according to the respective evaluation results to obtain corrected evaluation results; and calculating an average value from the corrected evaluation results to obtain an evaluation result based on the evaluation metric.
In accordance with another aspect of the present invention, there is provided an apparatus for data quality assessment of a data set based on context, comprising:
the display module is used for acquiring a data set to be evaluated and a field context corresponding to the data set;
a selection module to select an evaluation metric for evaluating data quality based on the data set and the domain context;
the sampling module is used for sampling the data set and determining a data subset to be evaluated;
the calculation module is used for calculating an evaluation result obtained based on the evaluation metric according to the to-be-evaluated data subset, the evaluation metric and the domain context; and
the aggregation and ranking module is used for aggregating and ranking the evaluation results to obtain the evaluation result of the data set.
In one embodiment, the apparatus further includes a schema alignment module, configured to perform schema alignment between the to-be-evaluated data subset and the domain context according to a schema alignment library, so as to obtain an aligned to-be-evaluated data subset.
In one embodiment, the calculation module calculates the evaluation result by at least one of the following methods:
-directly calculating from the definition of the evaluation metric;
-automatically detecting according to a metric formula of said evaluation metric;
-manual evaluation.
In one embodiment, the domain context includes a context name, a reference schema, a reference dataset, a dictionary dataset, a use-case set, and a metrics aggregation library.
Compared with the prior art, the context-based data quality evaluation device and method provided by the invention can evaluate the data subset obtained by sampling the data set according to the obtained domain context and the evaluation metric selected by the user, so that the user requirements are fully reflected, the data set can be evaluated comprehensively and objectively, and an intuitive and comparable evaluation result is obtained.
Other features and aspects of the present invention will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The various aspects of the present invention will become more apparent to the reader after reading the detailed description of the invention with reference to the attached drawings, in which:
FIG. 1 illustrates a block flow diagram of a method for data quality assessment of a data set based on context, in accordance with an embodiment of the present invention;
FIG. 2 illustrates a preferred embodiment of the data quality assessment method of FIG. 1;
FIG. 3A shows a first embodiment of calculating an evaluation result from a subset of data to be evaluated, an evaluation metric, and a domain context in the data quality evaluation method of FIG. 1;
FIG. 3B illustrates a second embodiment of computing an evaluation result based on a subset of data to be evaluated, an evaluation metric, and a domain context in the data quality evaluation method of FIG. 1; and
FIG. 4 shows a block diagram of an apparatus for data quality assessment of a data set based on context, in accordance with another embodiment of the present invention.
Detailed Description
In order to make the present disclosure more thorough and complete, reference is made to the accompanying drawings, in which like reference numerals denote identical or similar elements, and to the various embodiments of the invention described below. However, it will be understood by those of ordinary skill in the art that the examples provided below are not intended to limit the scope of the present invention. In addition, the drawings are only for illustrative purposes and are not drawn to scale.
Specific embodiments of various aspects of the present invention are described in further detail below with reference to the accompanying drawings.
FIG. 1 illustrates a block flow diagram of a method for data quality assessment of a data set based on context, in accordance with an embodiment of the present invention.
Referring to fig. 1, in this embodiment, the method for evaluating data quality of a data set based on context according to the present invention is implemented through steps S110 to S150.
In detail, steps S110 and S120 are performed first: a data set to be evaluated and a domain context corresponding to the data set are obtained, and an evaluation metric for evaluating data quality is then selected according to the data set and the domain context. For example, when obtaining the corresponding domain context (application context), if the system has no context that matches the evaluator's requirements, a new context is customized; if an existing context in the system substantially matches the evaluator's requirements, it is customized further according to the user's requirements. An evaluation metric is then selected based on the domain context and the data set to be evaluated.
In the embodiment of the present invention, the data set to be evaluated includes, but is not limited to, a relational database; for example, it may also be a knowledge base. The evaluation metric is a measure of data quality against which the user intends to evaluate the data set. The evaluation metrics comprise specific metric indicators for each quality dimension. For example, the assessment dimensions may be richness, accuracy, completeness, consistency, timeliness, availability, data service access performance, queryability, informativeness, and the like. Richness can be further divided into sub-dimensions such as data size, schema size and class hierarchy depth, and the metric indicators of the data size sub-dimension include the number of tables, the number of instances, the number of main entity records, the number of facts, and the like. The domain context includes a context name, a reference schema, a reference dataset, a dictionary dataset, a use-case set, and a metric aggregation library. The context name describes the domain to which the context belongs; the reference schema is the standard schema of data in the domain, specifying which fields the data should include and which constraints apply to those fields; the reference dataset is a sample data set of relatively good quality from the domain; the dictionary dataset is the standard dictionary library of the domain; the use-case set comprises the test cases used for calculating quality in use; and the metric aggregation library holds the weight of each metric, indicating the relative importance of the metrics.
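By way of illustration only, the six components of the domain context could be held in a simple record type. The sketch below is an assumption, not the structure used by the invention: the class name `DomainContext`, its attribute names and the sample values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DomainContext:
    """Hypothetical container for the six components of a domain context."""
    name: str                                   # domain the context belongs to
    reference_schema: dict[str, dict]           # field name -> constraints on that field
    reference_dataset: list[dict] = field(default_factory=list)       # good-quality sample records
    dictionary_dataset: dict[str, set] = field(default_factory=dict)  # standard vocabularies
    use_case_set: list[str] = field(default_factory=list)             # test cases for quality in use
    metric_weights: dict[str, float] = field(default_factory=dict)    # metric aggregation library


# Example values are invented for illustration.
ctx = DomainContext(
    name="population records",
    reference_schema={
        "gender": {"type": "str", "allowed": {"male", "female"}},
        "birth_date": {"type": "date"},
    },
    dictionary_dataset={"gender": {"male", "female"}},
    metric_weights={"accuracy": 0.5, "completeness": 0.3, "richness": 0.2},
)
```

Keeping the metric weights inside the context is one possible reading of the "metric aggregation library"; the weights could equally be stored elsewhere.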
Next, step S130 is performed to sample the data set and determine the data subset to be evaluated. That is, a data sampling method is applied to the large data set to construct a data subset suitable for evaluation, and the metric calculation is then performed on that subset. Preferably, the data set is sampled using stratified sampling, systematic sampling, or random sampling. Stratified sampling divides the data set into several strata according to certain characteristics, determines the total amount of data in each stratum, draws a certain number of observations from each stratum, and combines the observations drawn from all strata to form the sample. Systematic sampling divides the data into as many equal parts as the required sample size, so that each part contains (total data amount / sample size) observation units; it then randomly selects the k-th observation unit in the first part and takes one observation unit from each subsequent part at the same equal interval to form the sample. Random sampling draws the required number of observation units from the population at random and without replacement to form the sample.
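The three sampling methods can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the `region` stratification key and the fixed sampling fraction per stratum are hypothetical choices, and the description does not prescribe how many observations to draw from each stratum.

```python
import random
from collections import defaultdict


def simple_random_sample(records, n, rng):
    """Random sampling: draw n records without replacement."""
    return rng.sample(records, n)


def systematic_sample(records, n, rng):
    """Systematic sampling: random start in the first part, then every k-th record."""
    k = len(records) // n          # sampling interval, i.e. the size of each part
    start = rng.randrange(k)       # position chosen at random within the first part
    return [records[start + i * k] for i in range(n)]


def stratified_sample(records, key, fraction, rng):
    """Stratified sampling: draw the same fraction from every stratum and merge."""
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for members in strata.values():
        size = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, size))
    return sample


rng = random.Random(0)             # seeded only to make the example repeatable
data = [{"id": i, "region": "east" if i % 4 else "west"} for i in range(100)]
random_subset = simple_random_sample(data, 10, rng)
systematic_subset = systematic_sample(data, 10, rng)
stratified_subset = stratified_sample(data, lambda r: r["region"], 0.2, rng)
```

Here `systematic_sample` assumes the data set holds at least `n` records; any remainder beyond `n * k` records is simply never selected.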
Then, step S140 is executed to calculate an evaluation result based on the evaluation metric according to the data subset to be evaluated, the evaluation metric and the domain context. In different embodiments of the present invention, the evaluation result may be calculated directly from the definition of the evaluation metric, detected automatically according to the metric formula of the evaluation metric, or obtained by manual evaluation. A detailed description is given below in conjunction with FIGS. 3A and 3B.
Finally, step S150 is executed: the evaluation results of the individual evaluation metrics are aggregated and ranked to obtain the evaluation result of the data set. For example, after all selected evaluation metrics have been calculated, the data subset has a score on a 100-point scale for each evaluation metric; all scores are then aggregated into a final data quality score, and data sets are ranked by that score. Preferably, the weights used for aggregation and ranking are obtained in one of three ways: first, from scoring criteria set by domain experts in the domain context; second, from weights that the data evaluator sets for each evaluation metric; and third, by using a machine learning method to learn the weights from the importance of each dimension in the context.
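As a hedged sketch of step S150, the per-metric scores can be combined with a weighted average and the resulting final scores sorted. The weighted average is an assumption (the description only says the scores are "aggregated"), and the function names, metric names, weights and scores below are invented for illustration.

```python
def aggregate_score(metric_scores, weights):
    """Weighted average of per-metric scores, each on a 100-point scale."""
    total_weight = sum(weights[m] for m in metric_scores)
    return sum(metric_scores[m] * weights[m] for m in metric_scores) / total_weight


def rank_datasets(scores_by_dataset, weights):
    """Return (data set name, final score) pairs, highest score first."""
    finals = {name: aggregate_score(scores, weights)
              for name, scores in scores_by_dataset.items()}
    return sorted(finals.items(), key=lambda item: item[1], reverse=True)


# Weights as they might be set by a domain expert or by the data evaluator.
weights = {"accuracy": 0.5, "completeness": 0.3, "richness": 0.2}
scores = {
    "dataset_a": {"accuracy": 90, "completeness": 80, "richness": 60},
    "dataset_b": {"accuracy": 70, "completeness": 95, "richness": 85},
}
ranking = rank_datasets(scores, weights)
```

With these invented numbers, `dataset_a` scores 81 and `dataset_b` scores 80.5, so `dataset_a` is ranked first. Dividing by the total weight keeps the result on the 100-point scale even when the weights do not sum to one.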
FIG. 2 illustrates a preferred embodiment of the data quality assessment method of FIG. 1. Compared with FIG. 1, the main difference in this embodiment is that a step S160 is added between step S130 and step S140, in which a schema alignment library is used to perform schema alignment between the data subset to be evaluated and the domain context. That is, the fields of the data subset to be evaluated and of the domain context are looked up in the field mapping relations of the schema alignment library, and fields that have a mapping relation are treated as the same field.
Here, the schema alignment library is constructed as follows: (1) a synonym library is constructed, comprising a Chinese synonym library, an English synonym library and a Chinese-English comparison library; when the data schema provided by the data supplier contains pinyin or pinyin initials, the data supplier is required to provide the corresponding full Chinese names, which are added to the synonym library; (2) using the synonym library, the fields in the two data set schemas are expressed in unified Chinese terms, and the character similarity of the fields in the two schemas and the similarity of the constraints on those fields (such as value-range similarity and data-type similarity) are then calculated; (3) according to the calculated similarities, field pairs with high similarity between the two schemas are identified and mappings between the fields of the two schemas are constructed; and (4) a domain expert reviews and supplements the constructed mappings and removes incorrect ones, yielding the schema alignment library.
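The automatic part of this construction (synonym normalization, similarity calculation and candidate mapping) might be sketched as below. Everything here is an illustrative assumption: the synonym entries, the use of `difflib.SequenceMatcher` for character similarity, the 0.7/0.3 blend of name and data-type similarity, and the 0.8 threshold are not taken from the description, and the expert review of step (4) is not modelled.

```python
from difflib import SequenceMatcher

# Hypothetical synonym library mapping raw field names to one canonical name.
SYNONYMS = {"xb": "gender", "sex": "gender", "dob": "birth_date", "birthday": "birth_date"}


def canonical(name):
    """Express a field name in the unified vocabulary of the synonym library."""
    name = name.lower()
    return SYNONYMS.get(name, name)


def field_similarity(a, b, type_a, type_b, name_weight=0.7):
    """Blend character similarity of the names with a simple data-type match."""
    name_sim = SequenceMatcher(None, canonical(a), canonical(b)).ratio()
    type_sim = 1.0 if type_a == type_b else 0.0
    return name_weight * name_sim + (1 - name_weight) * type_sim


def propose_mappings(source_schema, reference_schema, threshold=0.8):
    """For each source field, keep the most similar reference field above the threshold."""
    mappings = {}
    for src, src_type in source_schema.items():
        best = max(
            reference_schema,
            key=lambda ref: field_similarity(src, ref, src_type, reference_schema[ref]),
        )
        if field_similarity(src, best, src_type, reference_schema[best]) >= threshold:
            mappings[src] = best
    return mappings


source = {"XB": "str", "dob": "date", "remarks": "str"}   # "XB" stands for pinyin initials
reference = {"gender": "str", "birth_date": "date"}
candidate_mappings = propose_mappings(source, reference)
```

In this example `XB` and `dob` are mapped through the synonym library, while `remarks` stays unmapped because no reference field is similar enough. The candidate mappings would still need review by a domain expert before entering the alignment library.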
Aligning the data subset to be evaluated with the domain context according to the established schema alignment library thus yields a more accurately matched data set and domain context, so that data in fields with the same meaning but different names can be evaluated effectively and accurately.
FIG. 3A shows a first embodiment of calculating an evaluation result from a subset of data to be evaluated, an evaluation metric, and a domain context in the data quality evaluation method of FIG. 1.
Referring to FIG. 3A, when error data is detected automatically according to the metric formula of the evaluation metric, the data in the to-be-evaluated data subset that does not meet the metric requirement is counted, and the evaluation result of the to-be-evaluated data subset is calculated from that count. The automatic detection method is implemented through steps S210 to S240 and mainly comprises template definition, template instantiation, query execution to obtain error data, and calculation of the evaluation result. Specifically:
in step S210, templates are defined, that is, templates for constraints on a field, or between fields, of the data subset to be evaluated, such as a value-range template, a comparison template, a regular-expression template, etc. A value-range template states that the value of a field should lie within a certain range (for example, the gender of a person is male or female); a comparison template states the relationship between the value of one field and the value of another field in the same record (for example, the death date of a person is later than the birth date of that person);
in step S220, a template is instantiated: the defined template is instantiated according to the specific data in the to-be-evaluated data subset, generating a test case (an SQL query) that can be run against the to-be-evaluated data subset. A template can be instantiated into a test case in several ways: the test case is generated automatically from the schema of the data set; or a domain expert selects a suitable template and instantiates it based on knowledge of the data set; or a domain expert writes a template, which is then instantiated to obtain a test case;
in step S230, the query is executed to obtain error data: the test case is executed against the data subset to be evaluated to obtain a query result, and the query result returns the error data. Executing a test case has two possible outcomes: either the returned result contains no error data, or it shows that error data exists, in which case the error data is obtained;
in step S240, an evaluation result is calculated — an evaluation result based on the evaluation metric is calculated according to the error data and the metric formula of the evaluation metric.
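Steps S210 to S240 can be illustrated with a small, self-contained sketch using an in-memory SQLite database. The template strings, the `person` table, the sample rows and the metric formula (percentage of records without errors) are all assumptions made for illustration; the description does not fix any of them.

```python
import sqlite3

# S210: hypothetical templates; the names in braces are placeholders to be filled in.
VALUE_RANGE_TEMPLATE = "SELECT * FROM {table} WHERE {field} NOT IN ({values})"
COMPARISON_TEMPLATE = "SELECT * FROM {table} WHERE {later} < {earlier}"


def instantiate(template, **params):
    """S220: turn a template into a concrete SQL test case.
    Parameters are assumed to come from a trusted schema, not from user input."""
    return template.format(**params)


def run_test_case(conn, sql):
    """S230: execute the test case; any rows returned are the error data."""
    return conn.execute(sql).fetchall()


def accuracy_score(total_rows, error_rows):
    """S240: assumed metric formula, the percentage of rows without errors."""
    return 100.0 * (total_rows - error_rows) / total_rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER, gender TEXT, birth_date TEXT, death_date TEXT)")
conn.executemany("INSERT INTO person VALUES (?, ?, ?, ?)", [
    (1, "male", "1950-01-01", "2010-05-05"),
    (2, "female", "1960-02-02", None),
    (3, "unknown", "1970-03-03", None),        # violates the value-range constraint
    (4, "male", "1980-04-04", "1975-01-01"),   # death date earlier than birth date
])

test_cases = [
    instantiate(VALUE_RANGE_TEMPLATE, table="person", field="gender",
                values="'male', 'female'"),
    instantiate(COMPARISON_TEMPLATE, table="person",
                earlier="birth_date", later="death_date"),
]
error_ids = {row[0] for sql in test_cases for row in run_test_case(conn, sql)}
total = conn.execute("SELECT COUNT(*) FROM person").fetchone()[0]
score = accuracy_score(total, len(error_ids))
```

Collecting the identifiers of erroneous rows in a set means a record that violates several constraints is counted only once, which is one possible design choice among several.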
Fig. 3B shows a second embodiment of calculating an evaluation result according to the data subset to be evaluated, the evaluation metric, and the domain context in the data quality evaluation method of fig. 1.
Referring to FIG. 3B, in the manual evaluation mode, the data subset to be evaluated is evaluated manually against the evaluation metric, and the evaluation result based on the evaluation metric is calculated from the results of that manual evaluation. The manual evaluation mode is implemented through steps S310 to S340 and mainly comprises task allocation, acquisition of manual evaluation results, correction of evaluation result deviations, and calculation of the evaluation result. Specifically:
in step S310, task allocation, namely, the evaluation tasks are randomly allocated to a plurality of evaluators according to the data subsets to be evaluated and the evaluation metrics. For example, the number of evaluators is greater than or equal to 3, and the number of evaluators is odd;
in step S320, manual evaluation results are obtained — an evaluation period, such as a limited time period of 4 hours, 8 hours, 24 hours, or 48 hours, is set according to the size of the data subset to be evaluated, and the respective evaluation results of the evaluators are obtained within this time period. If the time range is exceeded, the evaluation task of the evaluator is cancelled;
in step S330, the evaluation result deviation is corrected, that is, the deviation in the evaluation result is corrected according to the evaluation result, resulting in a corrected evaluation result. If the evaluation results are inconsistent, that is, the deviation of each evaluation result is greater than or equal to 0.15, returning to step S310 to re-distribute the tasks;
in step S340, an evaluation result is calculated — from the corrected evaluation result, an average value is calculated to obtain an evaluation result based on the evaluation metric.
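The control flow of steps S310 to S340 might be sketched as follows. The sketch makes several assumptions that the description leaves open: evaluator scores lie on a 0 to 1 scale, "deviation" is taken to mean the distance of a score from the mean score, and the evaluation period of step S320 is not modelled. All function and evaluator names are hypothetical.

```python
import random
from statistics import mean

DEVIATION_THRESHOLD = 0.15   # deviations at or above this value count as inconsistent


def assign_evaluators(pool, n, rng):
    """S310: randomly pick N evaluators, where N is odd and at least 3."""
    if n < 3 or n % 2 == 0:
        raise ValueError("N must be an odd number greater than or equal to 3")
    return rng.sample(pool, n)


def is_consistent(scores):
    """S330 check, assuming deviation means distance from the mean score."""
    centre = mean(scores)
    return all(abs(s - centre) < DEVIATION_THRESHOLD for s in scores)


def manual_result(scores):
    """S340: average the scores, or return None to signal a return to S310."""
    return mean(scores) if is_consistent(scores) else None


rng = random.Random(1)       # seeded only to make the example repeatable
evaluators = assign_evaluators(["ann", "bo", "cy", "di", "ed"], 3, rng)
agreed = manual_result([0.80, 0.85, 0.90])     # consistent: averaged
disputed = manual_result([0.40, 0.85, 0.90])   # inconsistent: tasks are re-assigned
```

Returning `None` for inconsistent results leaves the decision to re-assign tasks to the caller, which keeps the sketch free of any assumption about how often re-assignment may be repeated.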
Those skilled in the art will appreciate that in some embodiments, other ways of calculating the evaluation result than that shown in fig. 3A and 3B may be used, such as a direct calculation method, in which the evaluation result of the metric is obtained by direct calculation according to the definition of the evaluation metric and the data subset to be evaluated. In the process of calculating the metrics, some metrics can be directly calculated, for example, the number of tables and the number of entities can be directly counted by a computer. In addition, a direct calculation mode, an automatic detection mode and a manual evaluation mode can be comprehensively used.
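For the direct calculation mode, metrics such as the number of tables and the number of instances can simply be counted. The sketch below assumes the data set is an SQLite database and reads the counts from its catalogue; the table names and rows are invented for illustration.

```python
import sqlite3


def table_count(conn):
    """Number of tables, read from SQLite's catalogue table."""
    return conn.execute(
        "SELECT COUNT(*) FROM sqlite_master WHERE type = 'table'").fetchone()[0]


def instance_count(conn):
    """Total number of rows across all tables."""
    names = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    # Table names come from the catalogue itself, so quoting them is sufficient here.
    return sum(conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
               for name in names)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE address (id INTEGER, city TEXT)")
conn.executemany("INSERT INTO person VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])
conn.execute("INSERT INTO address VALUES (1, 'Shanghai')")
metrics = {"tables": table_count(conn), "instances": instance_count(conn)}
```

Other relational databases expose a comparable catalogue, so the same idea carries over with different catalogue queries.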
FIG. 4 shows a block diagram of an apparatus for data quality assessment of a data set based on context, in accordance with another embodiment of the present invention.
Referring to fig. 4, in this embodiment, an apparatus for data quality assessment of a data set based on context includes a presentation module, a selection module, a sampling module, a calculation module, and an aggregation ranking module.
The presentation module and the selection module may be provided separately or integrated in the same functional module; they are used for acquiring the data set to be evaluated and the domain context corresponding to the data set, and then selecting an evaluation metric for evaluating data quality according to the data set and the domain context. In addition, as shown in FIG. 4, the presentation module may also provide an interface for inputting the data set to be evaluated, the evaluation metric and user configuration parameters, display the received evaluation results and analysis charts, and provide an input and display interface for the domain context. Here, the domain context includes a context name, a reference schema, a reference dataset, a dictionary dataset, a metric aggregation library, and a use-case set.
The sampling module samples the data set to be evaluated obtained by the presentation module so as to obtain the data subset to be evaluated. The calculation module is connected to the sampling module and calculates the evaluation result based on the evaluation metric according to the data subset to be evaluated, the evaluation metric and the domain context. The aggregation and ranking module aggregates and ranks the evaluation results to obtain the evaluation result of the data set.
Compared with the prior art, the context-based data quality evaluation device and method provided by the invention can evaluate the data subset obtained by sampling the data set according to the obtained domain context and the evaluation metric selected by the user, so that the user requirements are fully reflected, the data set can be evaluated comprehensively and objectively, and an intuitive and comparable evaluation result is obtained.
Hereinbefore, specific embodiments of the present invention are described with reference to the drawings. However, those skilled in the art will appreciate that various modifications and substitutions can be made to the specific embodiments of the present invention without departing from the spirit and scope of the invention. Such modifications and substitutions are intended to be included within the scope of the present invention as defined by the appended claims.

Claims (10)

1. A computer-implemented method for data quality assessment of a data set based on context, the method comprising the steps of:
acquiring a data set to be evaluated and a domain context corresponding to the data set;
selecting an evaluation metric for evaluating data quality based on the data set and the domain context;
sampling the data set and determining a data subset to be evaluated;
calculating an evaluation result for the evaluation metric according to the data subset to be evaluated, the evaluation metric, and the domain context; and
aggregating and sorting the evaluation results to obtain the evaluation result of the data set.
2. The method of claim 1, wherein between the step of sampling the data set and the step of calculating an evaluation result, the method further comprises the step of:
performing schema alignment on the data subset to be evaluated and the domain context using a schema alignment library.
3. The method of claim 1, wherein the data set is sampled using stratified sampling, systematic sampling, or random sampling to determine the data subset to be evaluated.
4. The method of claim 1, wherein computing the evaluation result from the data subset to be evaluated, the evaluation metric, and the domain context is performed in at least one of the following ways:
- direct calculation from the definition of the evaluation metric;
- automatic detection according to a metric formula of the evaluation metric;
- manual evaluation.
5. The method of claim 4, wherein automatically detecting according to the metric formula of the evaluation metric comprises:
defining templates for field constraints, or for constraints among fields, of the data subset to be evaluated;
instantiating the defined templates according to the specific data of the data subset to be evaluated to generate test cases for querying the data subset to be evaluated;
executing the test cases to obtain query results, the query results returning the erroneous data; and
calculating the evaluation result according to the erroneous data and the metric formula of the evaluation metric.
6. The method of claim 4, wherein computing the evaluation result using a manual evaluation comprises:
according to the data subset to be evaluated and the evaluation metric, randomly distributing evaluation tasks to N evaluators, wherein N is an odd number which is greater than or equal to 3;
setting an evaluation period according to the size of the data subset to be evaluated, and acquiring respective evaluation results of the evaluators in the evaluation period;
correcting deviations among the respective evaluation results to obtain corrected evaluation results; and
calculating an average value of the corrected evaluation results to obtain the evaluation result based on the evaluation metric.
7. An apparatus for data quality assessment of a data set based on context, the apparatus comprising:
a display module configured to acquire a data set to be evaluated and a domain context corresponding to the data set;
a selection module configured to select an evaluation metric for evaluating data quality based on the data set and the domain context;
a sampling module configured to sample the data set and determine a data subset to be evaluated;
a calculation module configured to calculate an evaluation result for the evaluation metric according to the data subset to be evaluated, the evaluation metric, and the domain context; and
an aggregation and sorting module configured to aggregate and sort the evaluation results to obtain the evaluation result of the data set.
8. The apparatus of claim 7, further comprising a schema alignment module configured to perform schema alignment on the data subset to be evaluated and the domain context according to a schema alignment library to obtain an aligned data subset to be evaluated.
9. The apparatus of claim 7 or 8, wherein the calculation module calculates the evaluation result in at least one of the following ways:
- direct calculation from the definition of the evaluation metric;
- automatic detection according to a metric formula of the evaluation metric;
- manual evaluation.
10. The apparatus of claim 7, wherein the domain context comprises a context name, a reference schema, a reference dataset, a dictionary dataset, a use-case set, and a metrics aggregation library.
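The automatic-detection route of claim 5 and the scoring part of the manual route of claim 6 can be illustrated with a short sketch. Everything concrete here is an assumption made for illustration: the SQL templates, the `people` table, the metric formula (1 minus the share of rows flagged as erroneous), and the deviation-correction rule (dropping scores that lie further than a tolerance from the median). The claims do not specify any of these.

```python
import sqlite3
from statistics import mean, median

# Hypothetical constraint templates; placeholders are filled in per data subset.
TEMPLATES = {
    "not_null": "SELECT rowid FROM {table} WHERE {field} IS NULL",
    "range": "SELECT rowid FROM {table} WHERE {field} < {low} OR {field} > {high}",
    "order": "SELECT rowid FROM {table} WHERE {left} > {right}",  # constraint among fields
}


def instantiate(template_name, **params):
    """Turn a template into a concrete test case (a SQL query returning erroneous rows)."""
    # Identifiers are interpolated directly, so they must come from a trusted schema.
    return TEMPLATES[template_name].format(**params)


def automatic_score(conn, table, test_cases):
    """Run every test case and apply the assumed formula 1 - erroneous rows / total rows."""
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    erroneous = set()
    for sql in test_cases:
        erroneous.update(row[0] for row in conn.execute(sql))
    return 1.0 if total == 0 else 1 - len(erroneous) / total


def manual_score(scores, tolerance=0.2):
    """Average the evaluators' scores after dropping those far from the median."""
    if len(scores) < 3 or len(scores) % 2 == 0:
        raise ValueError("N must be an odd number greater than or equal to 3")
    mid = median(scores)  # with odd N the median is one of the submitted scores
    kept = [s for s in scores if abs(s - mid) <= tolerance]
    return mean(kept)


# Invented toy subset: one missing age, one out-of-range age, one inverted year pair.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (age INTEGER, birth_year INTEGER, death_year INTEGER)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [(30, 1990, None), (None, 1980, None), (150, 1870, 1950), (40, 1950, 1940), (25, 1999, None)],
)
cases = [
    instantiate("not_null", table="people", field="age"),
    instantiate("range", table="people", field="age", low=0, high=120),
    instantiate("order", table="people", left="birth_year", right="death_year"),
]
auto = automatic_score(conn, "people", cases)
manual = manual_score([0.8, 0.7, 0.1])
```

Collecting erroneous row ids in a set keeps a row that violates several constraints from being counted more than once. Random task assignment and the evaluation period from claim 6 are left out of the sketch.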
CN201610388931.2A 2016-06-03 2016-06-03 Equipment and method for carrying out data quality evaluation on data set based on context Pending CN106056287A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610388931.2A CN106056287A (en) 2016-06-03 2016-06-03 Equipment and method for carrying out data quality evaluation on data set based on context

Publications (1)

Publication Number Publication Date
CN106056287A true CN106056287A (en) 2016-10-26

Family

ID=57170004

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610388931.2A Pending CN106056287A (en) 2016-06-03 2016-06-03 Equipment and method for carrying out data quality evaluation on data set based on context

Country Status (1)

Country Link
CN (1) CN106056287A (en)

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
TONG RUAN 等: ""From Queriability to Informativity, Assessing "Quality in Use" of DBpedia and YAGO"", 《THE SEMANTIC WEB》 *
韩京宇等: "基于事实抽取的Web文档内容数据质量评估", 《计算机科学》 *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107633257A (en) * 2017-08-15 2018-01-26 上海数据交易中心有限公司 Data Quality Assessment Methodology and device, computer-readable recording medium, terminal
CN107633257B (en) * 2017-08-15 2020-04-17 上海数据交易中心有限公司 Data quality evaluation method and device, computer readable storage medium and terminal
CN107562725B (en) * 2017-08-31 2020-10-09 新华三大数据技术有限公司 Index extraction verification method and device
CN107562725A (en) * 2017-08-31 2018-01-09 新华三大数据技术有限公司 The method of calibration and device of index extraction
CN107608888A (en) * 2017-09-15 2018-01-19 郑州云海信息技术有限公司 A kind of software test case reviewing method and system
CN109951374A (en) * 2019-02-22 2019-06-28 上海掌门科技有限公司 A kind of method and apparatus of virtual resource object distribution
CN109951374B (en) * 2019-02-22 2021-06-08 上海掌门科技有限公司 Virtual resource object allocation method and equipment
US11587012B2 (en) 2019-03-08 2023-02-21 Walmart Apollo, Llc Continuous data quality assessment and monitoring for big data
CN111612783A (en) * 2020-05-28 2020-09-01 中国科学技术大学 Data quality evaluation method and system
CN111612783B (en) * 2020-05-28 2023-10-24 中国科学技术大学 Data quality assessment method and system
CN112395279A (en) * 2021-01-18 2021-02-23 浙江口碑网络技术有限公司 Quality guarantee data obtaining method and device and electronic equipment
CN112395279B (en) * 2021-01-18 2021-11-02 浙江口碑网络技术有限公司 Quality guarantee data obtaining method and device and electronic equipment
CN113779150A (en) * 2021-09-14 2021-12-10 杭州数梦工场科技有限公司 Data quality evaluation method and device
CN117114819A (en) * 2023-10-23 2023-11-24 临沂大学 Evaluation body-based data transaction reputation evaluation method

Similar Documents

Publication Publication Date Title
CN106056287A (en) Equipment and method for carrying out data quality evaluation on data set based on context
Sugimoto et al. Measuring research: What everyone needs to know
Ding et al. Quickinsights: Quick and automatic discovery of insights from multi-dimensional data
Shahbazi et al. Representation bias in data: A survey on identification and resolution techniques
Tang et al. Extracting top-k insights from multi-dimensional data
US9881080B2 (en) System for enhancing expert-based computerized analysis of a set of digital documents and methods useful in conjunction therewith
Torvik et al. Author name disambiguation in MEDLINE
Mingers et al. Counting the citations: A comparison of Web of Science and Google Scholar in the field of business and management
US8706742B1 (en) System for enhancing expert-based computerized analysis of a set of digital documents and methods useful in conjunction therewith
US10095766B2 (en) Automated refinement and validation of data warehouse star schemas
AU2017250467B2 (en) Query optimizer for combined structured and unstructured data records
EP2645309B1 (en) Automatic combination and mapping of text-mining services
CN102160066A (en) Search engine and method, particularly applicable to patent literature
US10360239B2 (en) Automated definition of data warehouse star schemas
Feng et al. Practical duplicate bug reports detection in a large web-based development community
Shahbazi et al. A survey on techniques for identifying and resolving representation bias in data
Visengeriyeva et al. Anatomy of metadata for data curation
Wildgaard A critical cluster analysis of 44 indicators of author-level performance
CN114840531B (en) Data model reconstruction method, device, equipment and medium based on blood edge relation
US10803124B2 (en) Technological emergence scoring and analysis platform
US8650180B2 (en) Efficient optimization over uncertain data
Li et al. Crowdsourced top-k queries by pairwise preference judgments with confidence and budget control
CN113505117A (en) Data quality evaluation method, device, equipment and medium based on data indexes
CN116933130A (en) Enterprise industry classification method, system, equipment and medium based on big data
Aljumaili et al. Data quality assessment using multi-attribute maintenance perspective

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20161026