CN111612783B - Data quality assessment method and system

Data quality assessment method and system

Info

Publication number
CN111612783B
CN111612783B
Authority
CN
China
Prior art keywords
data set
data
quality
quality requirement
minimum
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010472680.2A
Other languages
Chinese (zh)
Other versions
CN111612783A (en)
Inventor
李安然
张兰
李向阳
谢筠庭
Current Assignee
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date
Filing date
Publication date
Application filed by University of Science and Technology of China (USTC)
Priority to CN202010472680.2A
Publication of CN111612783A
Application granted
Publication of CN111612783B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/0002 Inspection of images, e.g. flaw detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30168 Image quality inspection
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P 90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P 90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a data quality evaluation method and system. The method comprises the following steps: evaluating task-independent internal characteristics of the data in a data set to obtain a data set meeting the minimum internal quality requirement; performing feature extraction on each data item in the qualifying data set and in a sample data set to obtain a feature vector for each item; performing context quality assessment on the feature vector of each data item in the qualifying data set and the sample data set to obtain a quality assessment result; and ranking the quality assessment results. When evaluating data quality, the invention jointly considers task-independent internal quality, task-dependent context quality and the demands of large-scale data quality evaluation, thereby effectively improving the comprehensiveness, accuracy and efficiency of data quality evaluation.

Description

Data quality assessment method and system
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a data quality evaluation method and system.
Background
Today, with the rapid development of mobile networks, sensor networks and crowd-sensing technologies, a wide variety of data is being generated in large quantities. At the same time, a large number of data-based information services are emerging, and the quality of the data plays a vital role in them. 1) High-quality data can provide sufficient and accurate information to accomplish a particular task, such as training a high-quality machine learning model or helping a smart city system make informed decisions. 2) Many services provide the data itself as a product to users on demand, for example crowd-sourcing services; for these services, the quality of the data determines user satisfaction. 3) High data quality helps optimize system resource utilization: limited resources (e.g., bandwidth, storage and computing resources) should be preferentially allocated to high-quality data to ensure system performance and quality of service. Taking a crowd-sensing application as an example, where many participants upload images from their mobile phones, effective data quality assessment, especially of large image sets, can noticeably raise the quality of the uploaded images and avoid the bandwidth wasted on transmitting low-quality images.
Data quality assessment has attracted the attention of researchers; however, existing assessment methods suffer from the following drawbacks when faced with specific tasks and large amounts of data. First, existing work mostly focuses on the inherent quality of data, while the important context quality is ignored: with the same data, one task may perform well while another performs poorly. For example, a high-quality image data set for training face recognition may be a poor-quality data set for an object detection task. Second, existing work mostly targets single data units (such as one picture or one text) when evaluating data quality and lacks a method for evaluating the overall quality of a data set. If the overall quality of the data set is obtained simply from statistics over the quality of individual data units, such as the minimum or average quality of all data units, the influence of the relationships between data units on the quality of the data set is ignored. Finally, although many dimensions of data quality have been proposed, it remains a challenge to fuse these dimensions into a comprehensive overall quality result.
Therefore, how to evaluate the quality of data more comprehensively and accurately is a problem to be solved urgently.
Disclosure of Invention
In view of the above, the present invention provides a data quality evaluation method which, when evaluating data quality, jointly considers task-independent internal quality, task-dependent context quality and the demands of large-scale data quality evaluation, thereby effectively improving the comprehensiveness, accuracy and efficiency of data quality evaluation.
The invention provides a data quality assessment method, which comprises the following steps:
evaluating task-independent internal characteristics of the data in a data set to obtain a data set meeting the minimum internal quality requirement;
performing feature extraction on each data item in the data set meeting the minimum internal quality requirement and in the sample data set to obtain a feature vector for each item;
performing context quality assessment on the feature vector of each data item in the data set meeting the minimum internal quality requirement and the sample data set to obtain a quality assessment result;
and ranking the quality assessment results by quality.
Preferably, evaluating the task-independent internal characteristics of the data to obtain a data set meeting the minimum internal quality requirement comprises:
evaluating the degree to which the data set is correct, dependable and error-free by a pattern matching method to obtain an accuracy quantized value;
evaluating the data acquisition and storage precision of the data set to obtain a precision quantized value;
evaluating the unbiased degree of the data set to obtain an objectivity quantized value;
evaluating the trustworthiness of the data sources of the data set to obtain a reliability quantized value;
and obtaining a data set meeting the minimum internal quality requirement based on the accuracy, precision, objectivity and reliability quantized values and the corresponding minimum quality requirements for accuracy, precision, objectivity and reliability.
Preferably, the feature extraction is performed on each data in the data set and the sample data set meeting the minimum intrinsic quality requirement, so as to obtain a feature vector of each data, which includes:
and extracting the eighth layer of features from each picture data in the data set and the sample data set meeting the minimum intrinsic quality requirement by using a VGG-16 model as feature vectors of the picture data.
Preferably, the feature extraction is performed on each data in the data set and the sample data set meeting the minimum intrinsic quality requirement, so as to obtain a feature vector of each data, which includes:
and extracting the penultimate layer of features from each text data in the data set and the sample data set meeting the minimum intrinsic quality requirement by using a BERT model as feature vectors of the text data.
Preferably, performing context quality assessment on the feature vector of each data item in the data set meeting the minimum internal quality requirement and the sample data set to obtain a quality assessment result comprises:
calculating, with a method based on locality-sensitive hashing, the ratio of the number of similar point pairs between the data set meeting the minimum internal quality requirement and the sample data set to the data set size, to obtain a task relevance evaluation result;
calculating, with a method based on locality-sensitive hashing, the average distance between the data set meeting the minimum internal quality requirement and the sample data set, to obtain a content diversity evaluation result;
calculating the ratio of the number of non-empty data items in the data set meeting the minimum internal quality requirement and the sample data set to the total amount of data, to obtain an integrity evaluation result;
evaluating whether the data volume in the data set meeting the minimum internal quality requirement and the sample data set meets the requirements of the given task, to obtain a data-volume appropriateness evaluation result;
and evaluating whether the service periods of the data set meeting the minimum internal quality requirement and the sample data set meet the requirements of the given task, to obtain a timeliness evaluation result.
A data quality assessment system, comprising:
the intrinsic quality evaluation module, used for evaluating the task-independent internal characteristics of the data in the data set to obtain a data set meeting the minimum intrinsic quality requirement;
the feature extraction module is used for carrying out feature extraction on each data in the data set and the sample data set meeting the minimum internal quality requirement to obtain a feature vector of each data;
the context quality evaluation module is used for performing context quality evaluation on the feature vector of each data in the data set and the sample data set meeting the minimum internal quality requirement to obtain a quality evaluation result;
and the quality sorting module is used for sorting the quality of the quality evaluation result.
Preferably, the intrinsic quality assessment module comprises:
the accuracy evaluation unit, used for evaluating the degree to which the data set is correct, dependable and error-free by a pattern matching method to obtain an accuracy quantized value;
the precision evaluation unit, used for evaluating the data acquisition and storage precision of the data set to obtain a precision quantized value;
the objectivity evaluation unit, used for evaluating the unbiased degree of the data set to obtain an objectivity quantized value;
the reliability evaluation unit, used for evaluating the trustworthiness of the data sources of the data set to obtain a reliability quantized value;
and the determining unit, used for obtaining a data set meeting the minimum internal quality requirement based on the accuracy, precision, objectivity and reliability quantized values and the corresponding minimum quality requirement values for accuracy, precision, objectivity and reliability.
Preferably, the feature extraction module is specifically configured to:
and extracting the eighth layer of features from each picture data in the data set and the sample data set meeting the minimum intrinsic quality requirement by using a VGG-16 model as feature vectors of the picture data.
Preferably, the feature extraction module is specifically configured to:
and extracting the penultimate layer of features from each text data in the data set and the sample data set meeting the minimum intrinsic quality requirement by using a BERT model as feature vectors of the text data.
Preferably, the context quality assessment module comprises:
the task relevance evaluation unit, used for calculating, with a method based on locality-sensitive hashing, the ratio of the number of similar point pairs between the data set meeting the minimum internal quality requirement and the sample data set to the data set size, obtaining a task relevance evaluation result;
the content diversity evaluation unit, used for calculating, with a method based on locality-sensitive hashing, the average distance between the data set meeting the minimum internal quality requirement and the sample data set, obtaining a content diversity evaluation result;
the integrity evaluation unit, used for calculating the ratio of the number of non-empty data items in the data set meeting the minimum internal quality requirement and the sample data set to the total amount of data, obtaining an integrity evaluation result;
the data-volume appropriateness evaluation unit, used for evaluating whether the data volume in the data set meeting the minimum internal quality requirement and the sample data set meets the requirements of the given task, obtaining a data-volume appropriateness evaluation result;
and the timeliness evaluation unit, used for evaluating whether the service periods of the data set meeting the minimum internal quality requirement and the sample data set meet the requirements of the given task, obtaining a timeliness evaluation result.
In summary, the invention discloses a data quality evaluation method. When data quality needs to be evaluated, the task-independent internal characteristics of the data in a data set are first evaluated to obtain a data set meeting the minimum internal quality requirement; feature extraction is then performed on each data item in the qualifying data set and in the sample data set to obtain a feature vector for each item; context quality assessment is performed on these feature vectors to obtain a quality assessment result, and the quality assessment results are ranked by quality. When evaluating data quality, the invention jointly considers task-independent internal quality, task-dependent context quality and the demands of large-scale data quality evaluation, thereby effectively improving the comprehensiveness, accuracy and efficiency of data quality evaluation.
Drawings
In order to illustrate the embodiments of the invention or the technical solutions in the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings described below show only some embodiments of the invention, and that other drawings may be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a flow chart of a method of embodiment 1 of a data quality assessment method of the present disclosure;
FIG. 2 is a flow chart of a method of embodiment 2 of a data quality assessment method of the present disclosure;
FIG. 3 is a schematic diagram of a data quality evaluation system according to an embodiment 1 of the present disclosure;
fig. 4 is a schematic structural diagram of an embodiment 2 of a data quality evaluation system according to the present disclosure.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As shown in fig. 1, a method flowchart of an embodiment 1 of a data quality evaluation method disclosed in the present invention may include:
s101, evaluating internal characteristics of data irrelevant to tasks on the data set to obtain the data set meeting the minimum internal quality requirement;
when the data quality needs to be evaluated, the accuracy, the precision, the objectivity and the reliability are firstly in four dimensionsThe data set is evaluated for internal features of the data that are not related to the task. For data set D, the quantitative values for the four dimensions of accuracy, precision, objectivity and reliability are respectively Minimum quality requirements for four dimensions of accuracy, precision, objectivity and reliability are θ respectively c ,θ p ,θ o ,θ r . The data set D must meet the minimum intrinsic quality requirementInferior datasets that do not meet the lowest intrinsic quality requirement R will be placed directly at the bottom of the ordered list without further evaluation.
S102, extracting the characteristics of each data in the data set and the sample data set meeting the minimum internal quality requirement to obtain the characteristic vector of each data;
for the data set M meeting the minimum internal quality requirement R, the quality evaluator extracts the characteristics of each data in the data set M and the sample data set S to obtain the characteristic vector of each data.
S103, carrying out context quality assessment on the feature vector of each data in the data set and the sample data set meeting the minimum internal quality requirement to obtain a quality assessment result;
Context quality assessment measures the extent to which a data set suits a given task. In this embodiment, the data consumer expresses the task's need for data by providing a small sample data set S. Context quality assessment is then performed on the feature vector of each data item in the data set meeting the minimum internal quality requirement and the sample data set, in the five dimensions of task relevance, content diversity, integrity, data-volume appropriateness and timeliness, to obtain a quality assessment result.
S104, quality sorting is carried out on the quality evaluation results.
Finally, using a rank aggregation method that minimizes the Kendall tau distance, a best overall quality ranking of the data sets is computed from the quality-assessment result rankings of the multiple input data sets; a data set ranked higher has higher data quality for the given task.
In summary, in the above embodiment, when data quality needs to be evaluated, the task-independent internal characteristics of the data in the data set are first evaluated to obtain a data set meeting the minimum internal quality requirement; feature extraction is then performed on each data item in the qualifying data set and in the sample data set to obtain a feature vector for each item; context quality assessment is performed on these feature vectors to obtain a quality assessment result, and the quality assessment results are ranked by quality. The invention jointly considers task-independent internal quality, task-dependent context quality and the demands of large-scale data quality evaluation, thereby effectively improving the comprehensiveness, accuracy and efficiency of data quality evaluation.
As shown in fig. 2, a method flowchart of an embodiment 2 of a data quality evaluation method disclosed in the present invention may include:
s201, evaluating the correctness, reliability and error-free degree of a data set by a pattern matching method to obtain an accuracy quantized value;
when the data quality needs to be evaluated, the data set is firstly evaluated for the internal characteristics of the data irrelevant to tasks from four dimensions of accuracy, precision, objectivity and reliability.
Specifically, the degree to which the data are correct, dependable and error-free is evaluated using a pattern matching method to obtain an accuracy quantized value. For example, for text data, its spelling and grammar are checked for correctness.
S202, evaluating the data acquisition and storage precision of the data set to obtain a precision quantized value;
Meanwhile, the data acquisition and storage precision of the data set is evaluated to obtain a precision quantized value. For example, a pre-trained convolutional neural network is used to estimate the precision of an image, including the JPEG compression rate of the picture, its degree of blurring, and so on.
S203, evaluating the unbiased degree of the data set to obtain an objectivity quantification value;
meanwhile, the unbiased degree of the data set is evaluated by adopting a method for checking history objective records and questionnaire investigation, so as to obtain an objectivity quantification value
S204, evaluating the trusted degree of the data source of the data set to obtain a reliability quantification value;
meanwhile, the method for checking the history objective record and questionnaire is adopted to evaluate the trusted degree of the data source of the data set, and a reliability quantification value is obtained
S205, obtaining a data set meeting the minimum internal quality requirement based on the accuracy, precision, objectivity and reliability quantized values and the corresponding minimum quality requirements;
Then, according to the accuracy quantized value, the precision quantized value, the objectivity quantized value, the reliability quantized value, and the minimum quality requirement values θ_c, θ_p, θ_o and θ_r for accuracy, precision, objectivity and reliability, the data sets meeting the minimum internal quality requirement R are obtained.
S206, extracting the characteristics of each data in the data set and the sample data set meeting the minimum internal quality requirement to obtain the characteristic vector of each data;
for the data set M meeting the minimum internal quality requirement R, the quality evaluator extracts the characteristics of each data in the data set M and the sample data set S to obtain the characteristic vector of each data.
Specifically, for picture data, the eighth-layer features are extracted with a VGG-16 model as the feature vector of the picture; for text data, the penultimate-layer features are extracted with the BERT (Bidirectional Encoder Representations from Transformers) model as the feature representation of the text.
S207, calculating, with a method based on locality-sensitive hashing, the ratio of the number of similar point pairs between the data set meeting the minimum internal quality requirement and the sample data set to the data set size, to obtain a task relevance evaluation result;
the context quality assessment dataset is adapted to the extent of a given task. In this embodiment, the data consumer expresses the task' S need for data by providing a small sample data set S. And then carrying out context quality assessment on the feature vector of each data in the data set and the sample data set meeting the minimum internal quality requirement from five dimensions of task relativity, content diversity, integrity, appropriateness of data volume and timeliness to obtain a quality assessment result.
A method based on locality-sensitive hashing is used to calculate the ratio of the number X(M, S) of similar point pairs between the data set M and the sample data set S to |D|, and this ratio is used to approximate the task relevance. Specifically:
The feature vectors are hashed with locality-sensitive hashing, so that with high probability similar data points are mapped into the same bucket and dissimilar data points are mapped into different buckets.
For data points d_i ∈ M and d_j ∈ S that fall into the same bucket, the distance Dis(d_i, d_j) is computed (e.g., the Euclidean distance or cosine distance of their feature vectors); when the distance is less than a threshold δ, the data points d_i and d_j are considered a similar point pair.
X(M, S)/|D|, where X(M, S) is the number of similar point pairs, is computed and used as the approximate value of task relevance.
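The similar-pair computation of step S207 can be sketched with random-hyperplane (sign-based) locality-sensitive hashing. Everything below is a simplified illustration: the single-table bucket construction, the Euclidean distance choice, and normalizing by |M|·|S| instead of |D| are assumptions, not the patent's exact scheme.

```python
import math
import random

def lsh_bucket(vec, hyperplanes):
    # The sign pattern of the vector w.r.t. each random hyperplane is the bucket key.
    return tuple(1 if sum(h * x for h, x in zip(hp, vec)) >= 0 else 0
                 for hp in hyperplanes)

def task_relevance(M, S, n_planes=8, delta=1.0, seed=0):
    """Fraction of (m, s) pairs that share an LSH bucket and lie closer than delta.
    Normalizing by |M|*|S| (rather than the patent's |D|) is an assumption."""
    rng = random.Random(seed)
    dim = len(M[0])
    hyperplanes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]
    buckets = {}
    for s in S:                        # index the sample data set by bucket
        buckets.setdefault(lsh_bucket(s, hyperplanes), []).append(s)
    similar = 0
    for m in M:                        # count similar point pairs bucket-locally
        for s in buckets.get(lsh_bucket(m, hyperplanes), []):
            if math.dist(m, s) < delta:    # Euclidean distance; cosine also possible
                similar += 1
    return similar / (len(M) * len(S))
```

With M identical to S, every point collides with its own copy at distance 0, so the score is positive; far-apart, dissimilar sets score near zero because they rarely share buckets.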
S208, calculating, with a method based on locality-sensitive hashing, the average distance between the data set meeting the minimum internal quality requirement and the sample data set, to obtain a content diversity evaluation result;
Meanwhile, the average distance between the data set M and the sample data set S is calculated with a method based on locality-sensitive-hashing sampling, and this distance is used to approximate the content diversity. Specifically:
The feature vectors are hashed with locality-sensitive hashing, so that with high probability similar data points are mapped into the same bucket and dissimilar data points are mapped into different buckets.
The data in all buckets are uniformly sampled to obtain a set G; the pairwise distances Dis(d_i, d_j) for d_i, d_j ∈ G are computed, and the content diversity is approximated by their mean (the higher the sampling rate, the closer the computed mean is to the true content diversity).
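Step S208 can be illustrated with a simplified sketch in which the bucket-wise LSH sampling is replaced by plain uniform sampling over the feature vectors (an assumption made only to keep the example short):

```python
import math
import random

def content_diversity(data, sample_rate=0.5, seed=0):
    """Mean pairwise distance over a uniform sample of the feature vectors.
    Plain uniform sampling stands in for the patent's bucket-wise LSH sampling."""
    rng = random.Random(seed)
    k = max(2, int(len(data) * sample_rate))   # always sample at least one pair
    G = rng.sample(data, k)                    # the sampled set G
    dists = [math.dist(G[i], G[j])
             for i in range(len(G)) for j in range(i + 1, len(G))]
    return sum(dists) / len(dists)             # higher sample_rate -> closer to true mean
```

A data set of identical vectors has diversity 0; spreading the vectors apart raises the mean pairwise distance and thus the diversity score.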
S209, calculating the ratio of the number of non-empty data in the data set and the sample data set meeting the minimum internal quality requirement to the total data amount to obtain an integrity evaluation result;
and meanwhile, calculating the ratio of the number of non-null data in the data set and the sample data set meeting the minimum internal quality requirement to the total data amount to obtain an integrity evaluation result.
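The integrity ratio of step S209 reduces to counting non-empty items. A minimal sketch follows, in which what counts as "empty" (None, empty string, empty container) is an illustrative assumption:

```python
def completeness(records):
    """Ratio of non-empty records to the total number of records.
    Treating None and empty strings/containers as 'empty' is an assumption."""
    if not records:
        return 0.0
    non_empty = sum(1 for r in records if r not in (None, "", [], {}))
    return non_empty / len(records)
```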
S210, evaluating whether the data volume in the data set meeting the minimum internal quality requirement and the sample data set meets the requirements of the given task, to obtain a data-volume appropriateness evaluation result;
Meanwhile, whether the data volume in the data set meeting the minimum internal quality requirement and the sample data set meets the requirements of the given task is evaluated, to obtain a data-volume appropriateness evaluation result.
S211, evaluating whether the service periods of the data set meeting the minimum internal quality requirement and the sample data set meet the requirements of the given task, to obtain a timeliness evaluation result;
Meanwhile, whether the service periods of the data set meeting the minimum internal quality requirement and the sample data set meet the requirements of the given task is evaluated, to obtain a timeliness evaluation result.
S212, quality sorting is carried out on the quality evaluation results.
Finally, using a rank aggregation method that minimizes the Kendall tau distance, a best overall quality ranking of the data sets is computed from the quality-assessment result rankings of the multiple input data sets; a data set ranked higher has higher data quality for the given task.
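The final ranking step can be illustrated by brute-force Kemeny rank aggregation, which returns the candidate ranking minimizing the total Kendall tau distance to all input rankings. The patent does not specify how the aggregation is implemented, so the exhaustive search below is only a small-scale illustration (feasible for a handful of data sets only):

```python
from itertools import combinations, permutations

def kendall_tau_distance(r1, r2):
    """Number of item pairs that the two rankings order differently."""
    pos1 = {x: i for i, x in enumerate(r1)}
    pos2 = {x: i for i, x in enumerate(r2)}
    return sum(1 for a, b in combinations(r1, 2)
               if (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0)

def aggregate_rankings(rankings):
    """Exhaustive Kemeny aggregation: the candidate ranking minimizing the total
    Kendall tau distance to all input rankings (brute force over permutations)."""
    items = rankings[0]
    return min(permutations(items),
               key=lambda cand: sum(kendall_tau_distance(list(cand), r)
                                    for r in rankings))
```

For example, if two of three per-dimension rankings agree on the order of data sets A, B, C, the aggregated ranking follows the majority order.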
In summary, in the data quality evaluation method disclosed by the invention, the evaluation indices are comprehensive, jointly considering task-independent internal quality and task-dependent context quality; the evaluation process is efficient and suitable for large-scale data collections; the evaluation method is highly general and applicable to various types of data; and the evaluation results are interpretable.
As shown in fig. 3, a schematic structural diagram of an embodiment 1 of a data quality evaluation system disclosed in the present invention may include:
the intrinsic quality evaluation module 301 is configured to evaluate the internal features of the data set that are not related to the task, so as to obtain a data set that meets the minimum intrinsic quality requirement;
When the data quality needs to be evaluated, the data set is first evaluated for the task-independent internal characteristics of the data from four dimensions of accuracy, precision, objectivity and reliability. For a data set D, a quantized value is computed for each of the four dimensions of accuracy, precision, objectivity and reliability, and the minimum quality requirements for the four dimensions are θ_c, θ_p, θ_o and θ_r respectively. The data set D must meet the minimum intrinsic quality requirement R, namely each of its four quantized values must reach the corresponding minimum quality requirement. Inferior data sets that do not meet the lowest intrinsic quality requirement R are placed directly at the bottom of the ordered list without further evaluation.
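One plausible reading of the gate R can be sketched in Python as follows (the dictionary layout and the threshold values are illustrative assumptions, not the patent's concrete representation):

```python
def meets_min_intrinsic_quality(q, theta):
    """Data set D passes the minimum intrinsic quality requirement R only
    if every per-dimension quantized value reaches its threshold.
    q, theta: dicts keyed by the four intrinsic-quality dimensions."""
    dims = ("accuracy", "precision", "objectivity", "reliability")
    return all(q[d] >= theta[d] for d in dims)

# Illustrative thresholds theta_c, theta_p, theta_o, theta_r:
theta = {"accuracy": 0.8, "precision": 0.7, "objectivity": 0.6, "reliability": 0.6}
passing = meets_min_intrinsic_quality(
    {"accuracy": 0.9, "precision": 0.8, "objectivity": 0.7, "reliability": 0.9}, theta)
failing = meets_min_intrinsic_quality(
    {"accuracy": 0.9, "precision": 0.5, "objectivity": 0.7, "reliability": 0.9}, theta)
```

A data set for which the check fails would be placed at the bottom of the ordered list without further context-quality evaluation.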
The feature extraction module 302 is configured to perform feature extraction on each data in the data set and the sample data set that meet the minimum intrinsic quality requirement, so as to obtain a feature vector of each data;
For the data set M meeting the minimum internal quality requirement R, the quality evaluator performs feature extraction on each data in the data set M and the sample data set S to obtain the feature vector of each data.
A context quality evaluation module 303, configured to perform context quality evaluation on the feature vector of each data in the data set and the sample data set that meet the minimum intrinsic quality requirement, to obtain a quality evaluation result;
Context quality measures the degree to which a data set suits a given task. In this embodiment, the data consumer expresses the task's need for data by providing a small sample data set S. Context quality assessment is then carried out on the feature vector of each data in the data set and the sample data set meeting the minimum internal quality requirement from five dimensions of task relevance, content diversity, integrity, fitness of data volume and timeliness, to obtain a quality evaluation result.
The quality sorting module 304 is configured to sort the quality of the quality evaluation result.
Finally, a rank-aggregation quality ordering method is used: given the multiple quality evaluation result sequences of the input data sets, the data set ordering that minimizes the total Kendall tau distance to those sequences is computed, and a data set ranked higher in this best-quality ordering has higher data quality on the given task.
In summary, in the above embodiment, when data quality needs to be evaluated, the data set is first evaluated for the task-independent internal characteristics of the data, to obtain the data sets meeting the minimum internal quality requirement; then, feature extraction is performed on each data in the data sets and the sample data set meeting the minimum internal quality requirement, to obtain the feature vector of each data; context quality assessment is carried out on these feature vectors to obtain quality evaluation results, and the quality evaluation results are quality-sorted. When evaluating data quality, the invention comprehensively considers the task-independent internal quality, the task-dependent context quality and the demands of large-scale data quality evaluation, thereby effectively improving the comprehensiveness, accuracy and efficiency of data quality evaluation.
As shown in fig. 4, which is a schematic structural diagram of embodiment 2 of the data quality evaluation system disclosed in the present invention, the system may include:
an accuracy evaluation unit 401, configured to evaluate the correctness, reliability and error-free degree of the data set by a pattern matching method, to obtain an accuracy quantized value;
when the data quality needs to be evaluated, the data set is firstly evaluated for the internal characteristics of the data irrelevant to tasks from four dimensions of accuracy, precision, objectivity and reliability.
Specifically, the correctness, reliability and error-free degree of the data are evaluated by a pattern matching method to obtain an accuracy quantized value. For example, for text data, its spelling and grammar are evaluated for correctness.
A precision evaluation unit 402, configured to evaluate the data acquisition and storage precision of the data set, to obtain a precision quantized value;
Meanwhile, the data acquisition and storage precision of the data set is evaluated to obtain a precision quantized value. For example, a pre-trained convolutional neural network is used to estimate the quality of an image, including the JPEG compression rate of the picture, the degree of blurring, and so on.
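The patent performs this estimate with a pre-trained convolutional neural network; as a hedged stand-in, the classical variance-of-Laplacian sharpness score below illustrates the kind of blur quantification involved (the function name and toy arrays are illustrative, not the patent's method):

```python
import numpy as np

def laplacian_blur_score(gray):
    """Variance of the discrete Laplacian of a grayscale image:
    higher means sharper, near zero means heavily blurred.
    gray: 2-D float array of pixel intensities."""
    lap = (
        -4 * gray[1:-1, 1:-1]
        + gray[:-2, 1:-1] + gray[2:, 1:-1]
        + gray[1:-1, :-2] + gray[1:-1, 2:]
    )
    return float(lap.var())

rng = np.random.default_rng(0)
sharp = rng.random((64, 64))       # high-frequency content everywhere
blurry = np.full((64, 64), 0.5)    # no detail at all
```

A learned model additionally captures compression artifacts that this purely local statistic misses.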
An objectivity evaluation unit 403, configured to evaluate the unbiased degree of the data set, so as to obtain an objectivity quantification value;
Meanwhile, the unbiased degree of the data set is evaluated by checking historical objective records and by questionnaire surveys, to obtain an objectivity quantified value.
A dependability evaluation unit 404, configured to evaluate the degree of trust of the data source of the data set, and obtain a quantized reliability value;
Meanwhile, the trusted degree of the data source of the data set is evaluated by checking historical objective records and by questionnaire surveys, to obtain a reliability quantified value.
A determining unit 405, configured to obtain the data sets meeting the minimum intrinsic quality requirement based on the accuracy quantized value, the precision quantized value, the objectivity quantized value, the reliability quantized value and the minimum quality requirements for accuracy, precision, objectivity and reliability;
Then, based on the accuracy quantized value, the precision quantized value, the objectivity quantized value and the reliability quantized value, together with the accuracy minimum quality requirement value θ_c, the precision minimum quality requirement value θ_p, the objectivity minimum quality requirement value θ_o and the reliability minimum quality requirement value θ_r, the data sets meeting the minimum intrinsic quality requirement R are obtained.
A feature extraction module 406, configured to perform feature extraction on each data in the data set and the sample data set that meet the minimum intrinsic quality requirement, so as to obtain a feature vector of each data;
For the data set M meeting the minimum internal quality requirement R, the quality evaluator performs feature extraction on each data in the data set M and the sample data set S to obtain the feature vector of each data.
Specifically, for picture data, the eighth-layer features are extracted with a VGG-16 model as the feature vector of the picture; for text data, the penultimate-layer features are extracted with the BERT (Bidirectional Encoder Representations from Transformers) model as the feature expression of the text.
A task relevance evaluation unit 407, configured to calculate, by a method based on locality-sensitive hashing, the proportion of similar point pairs (point pairs whose distance is below a threshold) between the data set meeting the minimum internal quality requirement and the sample data set, to obtain a task relevance evaluation result;
the context quality assessment dataset is adapted to the extent of a given task. In this embodiment, the data consumer expresses the task' S need for data by providing a small sample data set S. And then carrying out context quality assessment on the feature vector of each data in the data set and the sample data set meeting the minimum internal quality requirement from five dimensions of task relativity, content diversity, integrity, appropriateness of data volume and timeliness to obtain a quality assessment result.
A method based on locality-sensitive hashing is adopted to calculate the ratio of the number X(M, S) of similar point pairs between the data set M and the sample data set S to |D|, and this ratio is used to approximate the value of task relevance. Specifically:
The feature vectors are hashed with a locality-sensitive hash so that, with high probability, similar data points are mapped into the same bucket and dissimilar data points into different buckets.
The distance Dis(d_i, d_j) between data points d_i ∈ M and d_j ∈ S in the same bucket is calculated (e.g., based on the Euclidean distance or cosine distance of the feature vectors); when the distance is less than the threshold δ, the data points d_i and d_j are considered a similar point pair.
X(M, S)/|D| is then calculated, where X(M, S) is the number of similar point pairs, and this ratio is used to approximate the value of task relevance.
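A minimal sketch of these three steps, assuming random-hyperplane LSH over the feature vectors (the function names, number of hyperplanes, and the choice of |M| as the normalizer are illustrative assumptions, since the patent leaves the LSH family and the denominator |D| unspecified):

```python
import numpy as np

def lsh_buckets(vectors, n_planes=8, seed=0):
    """Random-hyperplane LSH: the sign pattern of each vector against
    n_planes shared random hyperplanes is its bucket key.  A fixed seed
    means every call uses the same hyperplanes."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((n_planes, vectors.shape[1]))
    bits = vectors @ planes.T > 0
    return [tuple(row) for row in bits]

def task_relevance(M, S, delta=0.5, n_planes=8):
    """Count similar pairs (d_i in M, d_j in S) that share a bucket and
    lie within Euclidean distance delta, normalized by |M| (an assumed
    reading of the |D| denominator)."""
    keys_m = lsh_buckets(M, n_planes)
    keys_s = lsh_buckets(S, n_planes)
    x = sum(
        1
        for i, km in enumerate(keys_m)
        for j, ks in enumerate(keys_s)
        if km == ks and np.linalg.norm(M[i] - S[j]) < delta
    )
    return x / len(M)

rng = np.random.default_rng(1)
S = rng.standard_normal((5, 16))
rel_same = task_relevance(S, S)         # every point matches itself
rel_far = task_relevance(S, S + 100.0)  # shifted copies are never similar
```

Bucketing first means distances are only computed for candidate pairs sharing a key, which is what makes the evaluation scale to large collections.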
A content diversity evaluation unit 408, configured to calculate an average distance between the data set and the sample data set that meets the minimum internal quality requirement by using a method based on local sensitive hash, so as to obtain a content diversity evaluation result;
Meanwhile, the average distance between the data set M and the sample data set S is calculated by a method based on locality-sensitive-hash sampling, and this distance is used to approximate the value of content diversity. Specifically:
The feature vectors are hashed with a locality-sensitive hash so that, with high probability, similar data points are mapped into the same bucket and dissimilar data points into different buckets.
The data in all buckets are uniformly sampled to obtain a set G, and the distances Dis(d_i, d_j) of all data point pairs d_i, d_j ∈ G are calculated; content diversity is approximated by their mean (the higher the sampling rate, the closer the calculated mean is to the true content diversity).
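A hedged sketch of this diversity estimate; for brevity the uniform sample G is drawn directly from the data rather than from LSH buckets, so the bucketing step is elided (the function name and sample size are illustrative):

```python
import numpy as np

def content_diversity(vectors, sample_size=32, seed=0):
    """Mean pairwise Euclidean distance over a uniform sample G of the
    data, used as the content-diversity approximation; a higher sampling
    rate brings the mean closer to the true diversity."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(vectors), size=min(sample_size, len(vectors)), replace=False)
    G = vectors[idx]
    dists = [
        np.linalg.norm(G[i] - G[j])
        for i in range(len(G))
        for j in range(i + 1, len(G))
    ]
    return float(np.mean(dists)) if dists else 0.0

tight = content_diversity(np.zeros((10, 4)))   # identical points: no diversity
spread = content_diversity(np.eye(8))          # mutually distant points
```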
An integrity evaluation unit 409, configured to calculate a ratio of the number of non-null data in the data set and the sample data set that satisfy the minimum internal quality requirement to the total data amount, to obtain an integrity evaluation result;
and meanwhile, calculating the ratio of the number of non-null data in the data set and the sample data set meeting the minimum internal quality requirement to the total data amount to obtain an integrity evaluation result.
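The integrity ratio is straightforward; a minimal sketch treating each record as a field dictionary with None marking a missing value (the record layout is an illustrative assumption):

```python
def integrity(records):
    """Ratio of non-null field values to the total number of field
    values across all records (None models a missing entry)."""
    total = sum(len(r) for r in records)
    non_null = sum(v is not None for r in records for v in r.values())
    return non_null / total if total else 0.0

score = integrity([
    {"id": 1, "label": "cat"},
    {"id": 2, "label": None},  # one of four field values missing
])
```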
A fitness evaluation unit 410, configured to evaluate whether the data volume in the data set and the sample data set that meet the minimum intrinsic quality requirement meets the requirement of the given task, and obtain a fitness evaluation result of the data volume;
Meanwhile, whether the data volume in the data set and the sample data set meeting the minimum intrinsic quality requirement meets the requirement of the given task is evaluated, and a fitness evaluation result of the data volume is obtained.
A timeliness evaluation unit 411, configured to evaluate whether a service cycle of a data set and a sample data set that meet a minimum intrinsic quality requirement meets a requirement of a given task, to obtain a timeliness evaluation result;
Meanwhile, whether the service cycle of the data set and the sample data set meeting the minimum internal quality requirement meets the requirement of the given task is evaluated, and a timeliness evaluation result is obtained.
A quality ranking module 412, configured to rank the quality evaluation results.
Finally, a rank-aggregation quality ordering method is used: given the multiple quality evaluation result sequences of the input data sets, the data set ordering that minimizes the total Kendall tau distance to those sequences is computed, and a data set ranked higher in this best-quality ordering has higher data quality on the given task.
In summary, in the data quality evaluation method disclosed by the invention, the evaluation indexes are comprehensive, taking both the task-independent intrinsic quality and the task-dependent context quality into account; the evaluation process is efficient and suitable for large-scale data collections; the evaluation method is highly universal and applicable to various types of data; and the evaluation results are interpretable.
In the present specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and identical or similar parts among the embodiments may be referred to one another. Since the device disclosed in an embodiment corresponds to the method disclosed in that embodiment, its description is relatively brief, and for the relevant points reference may be made to the description of the method section.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software modules may reside in random access memory (RAM), flash memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (6)

1. A method of data quality assessment, comprising:
evaluating the correctness, reliability and error-free degree of the data set by a pattern matching method to obtain an accuracy quantized value;
evaluating the data acquisition and storage precision of the data set to obtain a precision quantization value;
evaluating the unbiased degree of the data set to obtain an objectivity quantification value;
evaluating the trusted degree of the data source of the data set to obtain a reliability quantification value;
obtaining a data set meeting the minimum internal quality requirement based on the accuracy quantized value, the precision quantized value, the objectivity quantized value, the reliability quantized value and the minimum quality requirements for accuracy, precision, objectivity and reliability;
extracting the characteristics of each data in the data set and the sample data set meeting the minimum internal quality requirement to obtain the characteristic vector of each data;
calculating, by a method based on locality-sensitive hashing, the proportion of similar point pairs, i.e. point pairs whose distance is below a threshold, between the data set meeting the minimum internal quality requirement and the sample data set, to obtain a task relevance evaluation result;
calculating the average distance between the data set meeting the minimum internal quality requirement and the sample data set by adopting a method based on local sensitive hash to obtain a content diversity evaluation result;
calculating the ratio of the number of non-empty data in the data set meeting the minimum internal quality requirement and the sample data set to the total data amount to obtain an integrity evaluation result;
evaluating whether the data volume in the data set and the sample data set meeting the minimum internal quality requirement meets the requirement of a given task or not, and obtaining a suitable degree evaluation result of the data volume;
evaluating whether the service cycle of the data set and the sample data set meeting the minimum internal quality requirement meets the requirement of a given task or not, and obtaining a timeliness evaluation result;
and carrying out quality sorting on the quality evaluation results.
2. The method of claim 1, wherein the feature extraction of each data in the data set and the sample data set meeting the minimum intrinsic quality requirement to obtain a feature vector for each data comprises:
and extracting the eighth layer of features from each picture data in the data set and the sample data set meeting the minimum intrinsic quality requirement by using a VGG-16 model as feature vectors of the picture data.
3. The method of claim 1, wherein the feature extraction of each data in the data set and the sample data set meeting the minimum intrinsic quality requirement to obtain a feature vector for each data comprises:
and extracting the penultimate layer of features from each text data in the data set and the sample data set meeting the minimum intrinsic quality requirement by using a BERT model as feature vectors of the text data.
4. A data quality assessment system, comprising:
the intrinsic quality evaluation module is used for evaluating the internal characteristics of the data irrelevant to the task on the data set to obtain the data set meeting the minimum intrinsic quality requirement;
the feature extraction module is used for carrying out feature extraction on each data in the data set and the sample data set meeting the minimum internal quality requirement to obtain a feature vector of each data;
the context quality evaluation module is used for performing context quality evaluation on the feature vector of each data in the data set and the sample data set meeting the minimum internal quality requirement to obtain a quality evaluation result;
the quality sorting module is used for sorting the quality of the quality evaluation result;
the intrinsic quality assessment module comprises:
the accuracy evaluation unit is used for evaluating the correctness, reliability and error-free degree of the data set through a pattern matching method to obtain an accuracy quantized value;
the precision evaluation unit is used for evaluating the data acquisition and storage precision of the data set to obtain a precision quantized value;
the objectivity evaluation unit is used for evaluating the unbiased degree of the data set to obtain an objectivity quantification value;
the dependability evaluation unit is used for evaluating the trusted degree of the data source of the data set to obtain a dependability quantized value;
a determining unit, configured to obtain a data set meeting the minimum intrinsic quality requirement based on the accuracy quantized value, the precision quantized value, the objectivity quantized value, the reliability quantized value and the minimum quality requirements for accuracy, precision, objectivity and reliability;
the context quality assessment module comprises:
the task relevance evaluation unit is used for calculating, by a method based on locality-sensitive hashing, the proportion of similar point pairs, i.e. point pairs whose distance is below a threshold, between the data set meeting the minimum internal quality requirement and the sample data set, to obtain a task relevance evaluation result;
the content diversity evaluation unit is used for calculating the average distance between the data set meeting the minimum internal quality requirement and the sample data set by adopting a method based on local sensitive hash to obtain a content diversity evaluation result;
the integrity evaluation unit is used for calculating the ratio of the number of non-empty data in the data set meeting the minimum internal quality requirement and the sample data set to the total data amount to obtain an integrity evaluation result;
the data volume fitness evaluation unit is used for evaluating whether the data volume in the data set meeting the minimum internal quality requirement and the sample data set meets the requirement of a given task or not, and obtaining a data volume fitness evaluation result;
and the timeliness evaluation unit is used for evaluating whether the service cycle of the data set meeting the minimum internal quality requirement and the sample data set meets the requirement of a given task or not, so as to obtain a timeliness evaluation result.
5. The system of claim 4, wherein the feature extraction module is specifically configured to:
and extracting the eighth layer of features from each picture data in the data set and the sample data set meeting the minimum intrinsic quality requirement by using a VGG-16 model as feature vectors of the picture data.
6. The system of claim 4, wherein the feature extraction module is specifically configured to:
and extracting the penultimate layer of features from each text data in the data set and the sample data set meeting the minimum intrinsic quality requirement by using a BERT model as feature vectors of the text data.
CN202010472680.2A 2020-05-28 2020-05-28 Data quality assessment method and system Active CN111612783B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010472680.2A CN111612783B (en) 2020-05-28 2020-05-28 Data quality assessment method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010472680.2A CN111612783B (en) 2020-05-28 2020-05-28 Data quality assessment method and system

Publications (2)

Publication Number Publication Date
CN111612783A CN111612783A (en) 2020-09-01
CN111612783B true CN111612783B (en) 2023-10-24

Family

ID=72200233

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010472680.2A Active CN111612783B (en) 2020-05-28 2020-05-28 Data quality assessment method and system

Country Status (1)

Country Link
CN (1) CN111612783B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114782731B (en) * 2022-05-19 2024-08-27 中科南京软件技术研究院 Image data set validity evaluation method, device, equipment and storage medium
CN117556268A (en) * 2022-07-31 2024-02-13 华为技术有限公司 Data quality measurement method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106056287A (en) * 2016-06-03 2016-10-26 华东理工大学 Equipment and method for carrying out data quality evaluation on data set based on context
WO2017162835A1 (en) * 2016-03-24 2017-09-28 Universität Stuttgart Data compression by means of adaptive subsampling
CN109800812A (en) * 2019-01-24 2019-05-24 山东大学第二医院 CT image classification feature selection approach and system based on counterfeit filter
CN110728437A (en) * 2019-09-26 2020-01-24 华南师范大学 Quality evaluation method and system for open data

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017162835A1 (en) * 2016-03-24 2017-09-28 Universität Stuttgart Data compression by means of adaptive subsampling
CN106056287A (en) * 2016-06-03 2016-10-26 华东理工大学 Equipment and method for carrying out data quality evaluation on data set based on context
CN109800812A (en) * 2019-01-24 2019-05-24 山东大学第二医院 CT image classification feature selection approach and system based on counterfeit filter
CN110728437A (en) * 2019-09-26 2020-01-24 华南师范大学 Quality evaluation method and system for open data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Quality Management of Education Network Security Data Based on the Analytic Hierarchy Process; Wang Mingzheng et al.; Netinfo Security; 2019-12-10 (No. 12); full text *

Also Published As

Publication number Publication date
CN111612783A (en) 2020-09-01

Similar Documents

Publication Publication Date Title
CN109492772B (en) Method and device for generating information
US20220121906A1 (en) Task-aware neural network architecture search
CN111898578B (en) Crowd density acquisition method and device and electronic equipment
CN111159564A (en) Information recommendation method and device, storage medium and computer equipment
CN109033408B (en) Information pushing method and device, computer readable storage medium and electronic equipment
CN105187237B (en) The method and apparatus for searching associated user identifier
CN110245132B (en) Data anomaly detection method, device, computer readable storage medium and computer equipment
CN111612783B (en) Data quality assessment method and system
CN102402594A (en) Rich media individualized recommending method
CN113128305B (en) Portrait archive aggregation evaluation method and device, electronic equipment and storage medium
US20230004776A1 (en) Moderator for identifying deficient nodes in federated learning
CN111708942A (en) Multimedia resource pushing method, device, server and storage medium
CN110895706A (en) Method and device for acquiring target cluster number and computer system
WO2022017082A1 (en) Method and apparatus for detecting false transaction orders
CN115858911A (en) Information recommendation method and device, electronic equipment and computer-readable storage medium
CN113239879A (en) Federal model training and certificate detection method, device, equipment and medium
CN111291694A (en) Dish image identification method and device
CN113704566B (en) Identification number body identification method, storage medium and electronic equipment
CN111966851B (en) Image recognition method and system based on small number of samples
CN113407808A (en) Method and device for judging applicability of graph neural network model and computer equipment
CN114840742A (en) User portrait construction device, method and computer readable medium
CN112532692A (en) Information pushing method and device and storage medium
CN112182382A (en) Data processing method, electronic device, and medium
CN105872268B (en) A kind of call center user incoming call purpose prediction technique and device
CN113672783B (en) Feature processing method, model training method and media resource processing method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CB03 Change of inventor or designer information

Inventor after: Li Xiangyang

Inventor after: Li Anran

Inventor after: Zhang Lan

Inventor after: Xie Junting

Inventor before: Li Anran

Inventor before: Zhang Lan

Inventor before: Li Xiangyang

Inventor before: Xie Junting
