CN111612783B - Data quality assessment method and system - Google Patents
- Publication number
- CN111612783B (application number CN202010472680.2A)
- Authority
- CN
- China
- Prior art keywords
- data set
- data
- quality
- quality requirement
- minimum
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/0002—Inspection of images, e.g. flaw detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/232—Orthographic correction, e.g. spell checking or vowelisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/253—Grammatical analysis; Style critique
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30168—Image quality inspection
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Abstract
The invention discloses a data quality assessment method and system. The method comprises: evaluating the task-independent intrinsic characteristics of a data set to obtain a data set that meets the minimum intrinsic quality requirement; extracting features from each data item in that data set and in a sample data set to obtain a feature vector for each data item; performing context quality assessment on those feature vectors to obtain quality assessment results; and ranking the quality assessment results. When assessing data quality, the invention jointly considers the task-independent intrinsic quality, the task-dependent context quality, and the need to assess large-scale data, thereby improving the comprehensiveness, accuracy and efficiency of data quality assessment.
Description
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a data quality evaluation method and system.
Background
Today, with the rapid development of mobile networks, sensor networks and crowd sensing technologies, a wide variety of data is being generated in large quantities. At the same time, a large number of data-based information services are emerging, in which data quality plays a vital role. 1) High-quality data provides sufficient and accurate information for accomplishing a particular task, such as training a high-quality machine learning model or helping smart city systems make informed decisions. 2) Many services, such as crowdsourcing services, provide the data itself as a product to users on demand. For these services, data quality determines user satisfaction. 3) High data quality helps optimize system resource utilization. Limited resources (e.g., bandwidth, storage and computing resources) should be allocated preferentially to high-quality data to ensure system performance and quality of service. Taking a crowd sensing application as an example, in which a large number of participants upload images from their mobile phones, effective data quality assessment, especially of large image sets, can markedly raise the quality of the uploaded images and thus avoid the bandwidth wasted on transmitting low-quality images.
Data quality assessment has attracted the attention of researchers; however, existing assessment methods have the following drawbacks when faced with specific tasks and large amounts of data. First, existing work mostly focuses on the intrinsic quality of data and ignores the equally important context quality. The same data may serve one task well and another poorly. For example, a high-quality image data set for training face recognition may be a poor data set for an object detection task. Second, existing work mostly assesses single data units (such as one picture or one text) and lacks a method for assessing the overall quality of a data set. If the overall quality is derived simply from statistics over the individual data units, such as the minimum or average unit quality, the influence of the relationships between data units on the quality of the data set is ignored. Finally, although various dimensions of data quality have been proposed, fusing these dimensions into a comprehensive overall quality result remains a challenge.
Therefore, how to evaluate the quality of data more comprehensively and accurately is a problem to be solved urgently.
Disclosure of Invention
In view of the above, the present invention provides a data quality evaluation method, which can comprehensively consider the internal quality irrelevant to the task, the context quality relevant to the task, and the requirement for large-scale data quality evaluation during data quality evaluation, thereby effectively improving the comprehensiveness, accuracy and efficiency of data quality evaluation.
The invention provides a data quality assessment method, which comprises the following steps:
evaluating internal characteristics of the data irrelevant to tasks on the data set to obtain the data set meeting the minimum internal quality requirement;
extracting the characteristics of each data in the data set and the sample data set meeting the minimum internal quality requirement to obtain the characteristic vector of each data;
performing context quality assessment on the feature vector of each data in the data set and the sample data set meeting the minimum internal quality requirement to obtain a quality assessment result;
and carrying out quality sorting on the quality evaluation results.
Preferably, evaluating the task-independent intrinsic characteristics of the data set to obtain a data set that meets the minimum intrinsic quality requirement includes:
evaluating the correctness, reliability and freedom from error of the data set by a pattern matching method to obtain an accuracy quantized value;
evaluating the data acquisition and storage precision of the data set to obtain a precision quantized value;
evaluating the degree to which the data set is unbiased to obtain an objectivity quantized value;
evaluating the trustworthiness of the data source of the data set to obtain a reliability quantized value;
and obtaining a data set that meets the minimum intrinsic quality requirement based on the accuracy quantized value, the precision quantized value, the objectivity quantized value, the reliability quantized value, the accuracy minimum quality requirement, the precision minimum quality requirement, the objectivity minimum quality requirement and the reliability minimum quality requirement.
Preferably, the feature extraction is performed on each data in the data set and the sample data set meeting the minimum intrinsic quality requirement, so as to obtain a feature vector of each data, which includes:
and extracting the eighth layer of features from each picture data in the data set and the sample data set meeting the minimum intrinsic quality requirement by using a VGG-16 model as feature vectors of the picture data.
Preferably, the feature extraction is performed on each data in the data set and the sample data set meeting the minimum intrinsic quality requirement, so as to obtain a feature vector of each data, which includes:
and extracting the penultimate layer of features from each text data in the data set and the sample data set meeting the minimum intrinsic quality requirement by using a BERT model as feature vectors of the text data.
Preferably, performing context quality assessment on the feature vector of each data item in the data set meeting the minimum intrinsic quality requirement and in the sample data set to obtain quality assessment results includes:
calculating, by a method based on locality-sensitive hashing, the ratio of the number of similar point pairs between the data set meeting the minimum intrinsic quality requirement and the sample data set to the size of the data set, to obtain a task relevance evaluation result;
calculating, by a method based on locality-sensitive hashing, the average distance between the data set meeting the minimum intrinsic quality requirement and the sample data set, to obtain a content diversity evaluation result;
calculating the ratio of the number of non-empty data items in the data set meeting the minimum intrinsic quality requirement and the sample data set to the total number of data items, to obtain an integrity evaluation result;
evaluating whether the data volume of the data set meeting the minimum intrinsic quality requirement and the sample data set meets the requirement of a given task, to obtain a data volume suitability evaluation result;
and evaluating whether the service period of the data set meeting the minimum intrinsic quality requirement and the sample data set meets the requirement of a given task, to obtain a timeliness evaluation result.
A data quality assessment system, comprising:
the intrinsic quality evaluation module is used for evaluating the internal characteristics of the data irrelevant to the task on the data set to obtain the data set meeting the minimum intrinsic quality requirement;
the feature extraction module is used for carrying out feature extraction on each data in the data set and the sample data set meeting the minimum internal quality requirement to obtain a feature vector of each data;
the context quality evaluation module is used for performing context quality evaluation on the feature vector of each data in the data set and the sample data set meeting the minimum internal quality requirement to obtain a quality evaluation result;
and the quality sorting module is used for sorting the quality of the quality evaluation result.
Preferably, the intrinsic quality assessment module comprises:
the accuracy evaluation unit, used for evaluating the correctness, reliability and freedom from error of the data set by a pattern matching method to obtain an accuracy quantized value;
the precision evaluation unit, used for evaluating the data acquisition and storage precision of the data set to obtain a precision quantized value;
the objectivity evaluation unit, used for evaluating the degree to which the data set is unbiased to obtain an objectivity quantized value;
the reliability evaluation unit, used for evaluating the trustworthiness of the data source of the data set to obtain a reliability quantized value;
and the determining unit, used for obtaining a data set that meets the minimum intrinsic quality requirement based on the accuracy quantized value, the precision quantized value, the objectivity quantized value, the reliability quantized value, the accuracy minimum quality requirement value, the precision minimum quality requirement value, the objectivity minimum quality requirement value and the reliability minimum quality requirement value.
Preferably, the feature extraction module is specifically configured to:
and extracting the eighth layer of features from each picture data in the data set and the sample data set meeting the minimum intrinsic quality requirement by using a VGG-16 model as feature vectors of the picture data.
Preferably, the feature extraction module is specifically configured to:
and extracting the penultimate layer of features from each text data in the data set and the sample data set meeting the minimum intrinsic quality requirement by using a BERT model as feature vectors of the text data.
Preferably, the context quality assessment module comprises:
the task relevance evaluation unit, used for calculating, by a method based on locality-sensitive hashing, the ratio of the number of similar point pairs between the data set meeting the minimum intrinsic quality requirement and the sample data set to the size of the data set, to obtain a task relevance evaluation result;
the content diversity evaluation unit, used for calculating, by a method based on locality-sensitive hashing, the average distance between the data set meeting the minimum intrinsic quality requirement and the sample data set, to obtain a content diversity evaluation result;
the integrity evaluation unit, used for calculating the ratio of the number of non-empty data items in the data set meeting the minimum intrinsic quality requirement and the sample data set to the total number of data items, to obtain an integrity evaluation result;
the data volume suitability evaluation unit, used for evaluating whether the data volume of the data set meeting the minimum intrinsic quality requirement and the sample data set meets the requirement of a given task, to obtain a data volume suitability evaluation result;
and the timeliness evaluation unit, used for evaluating whether the service period of the data set meeting the minimum intrinsic quality requirement and the sample data set meets the requirement of a given task, to obtain a timeliness evaluation result.
In summary, the invention discloses a data quality evaluation method, when the data quality is required to be evaluated, firstly evaluating the internal characteristics of data irrelevant to tasks on a data set to obtain the data set meeting the minimum internal quality requirement; then, extracting the characteristics of each data in the data set and the sample data set meeting the minimum internal quality requirement to obtain the characteristic vector of each data; and carrying out context quality assessment on the feature vector of each data in the data set and the sample data set meeting the minimum internal quality requirement to obtain a quality assessment result, and carrying out quality sequencing on the quality assessment result. The invention can comprehensively consider the internal quality irrelevant to the task, the context quality relevant to the task and the requirement on large-scale data quality evaluation when evaluating the data quality, thereby effectively improving the comprehensiveness, accuracy and efficiency of the data quality evaluation.
Drawings
To illustrate the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the invention; a person skilled in the art can derive other drawings from them without inventive effort.
FIG. 1 is a flowchart of embodiment 1 of the data quality assessment method disclosed herein;
FIG. 2 is a flowchart of embodiment 2 of the data quality assessment method disclosed herein;
FIG. 3 is a schematic structural diagram of embodiment 1 of the data quality assessment system disclosed herein;
FIG. 4 is a schematic structural diagram of embodiment 2 of the data quality assessment system disclosed herein.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As shown in FIG. 1, which is a flowchart of embodiment 1 of the data quality assessment method disclosed herein, the method may include:
S101, evaluating the task-independent intrinsic characteristics of a data set to obtain a data set that meets the minimum intrinsic quality requirement;
When data quality needs to be assessed, the task-independent intrinsic characteristics of the data set are first evaluated along four dimensions: accuracy, precision, objectivity and reliability. For a data set D, a quantized value is obtained for each of the four dimensions, and the minimum quality requirements for accuracy, precision, objectivity and reliability are $\theta_c$, $\theta_p$, $\theta_o$ and $\theta_r$, respectively. The data set D must meet the minimum intrinsic quality requirement R, that is, its quantized value in each dimension must be no lower than the corresponding minimum. Inferior data sets that do not meet the minimum intrinsic quality requirement R are placed directly at the bottom of the ranked list without further evaluation.
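The gating step described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the class name `IntrinsicScores`, the helper names, the 0 to 1 score scale and the example numbers are all assumptions introduced here, and the comparison is taken to be "no lower than the minimum".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntrinsicScores:
    """Quantized values for the four intrinsic dimensions (0..1 scale is an assumption)."""
    accuracy: float
    precision: float
    objectivity: float
    reliability: float


def meets_minimum_intrinsic_quality(scores: IntrinsicScores, thresholds: IntrinsicScores) -> bool:
    """Return True when every quantized value reaches its minimum requirement."""
    return (
        scores.accuracy >= thresholds.accuracy
        and scores.precision >= thresholds.precision
        and scores.objectivity >= thresholds.objectivity
        and scores.reliability >= thresholds.reliability
    )


def split_by_intrinsic_quality(datasets, thresholds):
    """Partition {name: IntrinsicScores} into (passing, failing) lists of names.

    Failing data sets would be placed at the bottom of the ranking unevaluated.
    """
    passing = [n for n, s in datasets.items() if meets_minimum_intrinsic_quality(s, thresholds)]
    failing = [n for n, s in datasets.items() if not meets_minimum_intrinsic_quality(s, thresholds)]
    return passing, failing


# Hypothetical thresholds standing in for theta_c, theta_p, theta_o, theta_r.
thresholds = IntrinsicScores(0.8, 0.7, 0.6, 0.6)
candidates = {
    "D1": IntrinsicScores(0.9, 0.8, 0.7, 0.9),
    "D2": IntrinsicScores(0.9, 0.5, 0.7, 0.9),  # precision is below theta_p
}
passing, failing = split_by_intrinsic_quality(candidates, thresholds)
```

Only the data sets in `passing` would proceed to feature extraction and context quality assessment.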
S102, extracting the characteristics of each data in the data set and the sample data set meeting the minimum internal quality requirement to obtain the characteristic vector of each data;
For a data set M that meets the minimum intrinsic quality requirement R, the quality evaluator extracts features from each data item in the data set M and in the sample data set S to obtain a feature vector for each data item.
S103, carrying out context quality assessment on the feature vector of each data in the data set and the sample data set meeting the minimum internal quality requirement to obtain a quality assessment result;
Context quality assessment evaluates the extent to which a data set suits a given task. In this embodiment, the data consumer expresses the task's need for data by providing a small sample data set S. Context quality assessment is then performed on the feature vector of each data item in the data set meeting the minimum intrinsic quality requirement and in the sample data set along five dimensions, namely task relevance, content diversity, integrity, data volume suitability and timeliness, to obtain quality assessment results.
S104, quality sorting is carried out on the quality evaluation results.
Finally, a rank aggregation method is used: given the multiple input sequences of data set quality assessment results, the data set ranking that minimizes the Kendall tau distance to those sequences is computed as the best quality ranking. A data set ranked higher has higher data quality for the given task.
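As an illustration of rank aggregation under the Kendall tau distance, the sketch below searches all permutations for the ranking with the smallest total distance to the per-dimension rankings. The source does not say which aggregation algorithm is used; exhaustive search is an assumption chosen for clarity and is only practical for a handful of data sets. The function names and example rankings are hypothetical.

```python
from itertools import combinations, permutations


def kendall_tau_distance(rank_a, rank_b):
    """Count the item pairs that the two rankings order differently."""
    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}
    return sum(
        1
        for x, y in combinations(rank_a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )


def aggregate_rankings(rankings):
    """Exhaustively find a ranking that minimizes the total Kendall tau distance."""
    items = sorted(rankings[0])
    return list(min(
        permutations(items),
        key=lambda cand: sum(kendall_tau_distance(cand, r) for r in rankings),
    ))


# One hypothetical ranking of three data sets per quality dimension, best first.
per_dimension = [
    ["D1", "D2", "D3"],
    ["D1", "D3", "D2"],
    ["D2", "D1", "D3"],
]
best = aggregate_rankings(per_dimension)
```

For larger collections an approximate aggregation scheme would be needed, since the number of candidate rankings grows factorially.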
In summary, in the above embodiment, when the data quality needs to be evaluated, firstly, evaluating the internal features of the data unrelated to the task on the data set to obtain the data set meeting the minimum internal quality requirement; then, extracting the characteristics of each data in the data set and the sample data set meeting the minimum internal quality requirement to obtain the characteristic vector of each data; and carrying out context quality assessment on the feature vector of each data in the data set and the sample data set meeting the minimum internal quality requirement to obtain a quality assessment result, and carrying out quality sequencing on the quality assessment result. The invention can comprehensively consider the internal quality irrelevant to the task, the context quality relevant to the task and the requirement on large-scale data quality evaluation when evaluating the data quality, thereby effectively improving the comprehensiveness, accuracy and efficiency of the data quality evaluation.
As shown in FIG. 2, which is a flowchart of embodiment 2 of the data quality assessment method disclosed herein, the method may include:
S201, evaluating the correctness, reliability and freedom from error of a data set by a pattern matching method to obtain an accuracy quantized value;
When data quality needs to be assessed, the task-independent intrinsic characteristics of the data set are first evaluated along four dimensions: accuracy, precision, objectivity and reliability.
Specifically, the correctness, reliability and freedom from error of the data are evaluated using a pattern matching method to obtain an accuracy quantized value. For example, for text data, the correctness of its spelling and grammar is evaluated.
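One simple way to turn pattern matching into a quantized value is to report the fraction of records that match an expected pattern. The sketch below is an assumption for illustration only: the source gives spelling and grammar checking as its example, whereas this stand-in uses a regular expression for ISO-style dates, and the function name `accuracy_by_pattern` is hypothetical.

```python
import re


def accuracy_by_pattern(records, pattern):
    """Return the fraction of records that fully match the expected pattern."""
    if not records:
        return 0.0
    compiled = re.compile(pattern)
    matches = sum(1 for record in records if compiled.fullmatch(record))
    return matches / len(records)


# Hypothetical field values; two of the four are well-formed YYYY-MM-DD dates.
dates = ["2020-05-29", "2020-13-01", "29/05/2020", "2019-12-31"]
iso_date = r"\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])"
score = accuracy_by_pattern(dates, iso_date)
```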
S202, evaluating the data acquisition and storage precision of the data set to obtain a precision quantized value;
At the same time, the data acquisition and storage precision of the data set is evaluated to obtain a precision quantized value. For example, a pre-trained convolutional neural network is used to estimate the precision of an image, including its JPEG compression rate, degree of blurring, etc.
S203, evaluating the degree to which the data set is unbiased to obtain an objectivity quantized value;
At the same time, the degree to which the data set is unbiased is evaluated by checking historical objective records and conducting questionnaire surveys, to obtain an objectivity quantized value.
S204, evaluating the trustworthiness of the data source of the data set to obtain a reliability quantized value;
At the same time, the trustworthiness of the data source of the data set is evaluated by checking historical objective records and conducting questionnaire surveys, to obtain a reliability quantized value.
S205, obtaining a data set that meets the minimum intrinsic quality requirement based on the accuracy quantized value, the precision quantized value, the objectivity quantized value, the reliability quantized value, the accuracy minimum quality requirement, the precision minimum quality requirement, the objectivity minimum quality requirement and the reliability minimum quality requirement;
Then, according to the accuracy quantized value, the precision quantized value, the objectivity quantized value, the reliability quantized value, the accuracy minimum quality requirement value $\theta_c$, the precision minimum quality requirement value $\theta_p$, the objectivity minimum quality requirement value $\theta_o$ and the reliability minimum quality requirement value $\theta_r$, the data sets that meet the minimum intrinsic quality requirement R are obtained.
S206, extracting the characteristics of each data in the data set and the sample data set meeting the minimum internal quality requirement to obtain the characteristic vector of each data;
For a data set M that meets the minimum intrinsic quality requirement R, the quality evaluator extracts features from each data item in the data set M and in the sample data set S to obtain a feature vector for each data item.
Specifically, for picture data, the eighth-layer features are extracted using a VGG-16 model as the feature vector of the picture; for text data, the penultimate-layer features are extracted using the BERT (Bidirectional Encoder Representations from Transformers) model as the feature representation of the text.
S207, calculating, by a method based on locality-sensitive hashing, the ratio of the number of similar point pairs between the data set meeting the minimum intrinsic quality requirement and the sample data set to the size of the data set, to obtain a task relevance evaluation result;
Context quality assessment evaluates the extent to which a data set suits a given task. In this embodiment, the data consumer expresses the task's need for data by providing a small sample data set S. Context quality assessment is then performed on the feature vector of each data item in the data set meeting the minimum intrinsic quality requirement and in the sample data set along five dimensions, namely task relevance, content diversity, integrity, data volume suitability and timeliness, to obtain quality assessment results.
Here, a method based on locality-sensitive hashing (LSH) is used to calculate the ratio of the number $X(M, S)$ of similar point pairs between the data set M and the sample data set S to $|D|$, and this ratio is used to approximate the task relevance. Specifically:
Locality-sensitive hashing is applied to the feature vectors so that, with high probability, similar data points are mapped into the same bucket and dissimilar data points are mapped into different buckets.
For a data point $d_i \in M$ and a data point $d_j \in S$ in the same bucket, the distance $\mathrm{Dis}(d_i, d_j)$ is calculated (e.g., the Euclidean distance or cosine distance between the feature vectors); when the distance is less than the threshold $\delta$, the data points $d_i$ and $d_j$ are considered a similar point pair.
$X(M, S)/|D|$ is calculated, where $X(M, S)$ is the number of similar point pairs, and is used to approximate the task relevance.
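The three steps above can be sketched with a random-hyperplane hash, one common locality-sensitive hashing scheme. The source does not specify the hash family, so that choice is an assumption, as are the function names, the number of hash bits, and the use of `len(M)` as the denominator (M is treated here as the data set D after it has passed the intrinsic quality gate).

```python
import math
import random
from collections import defaultdict


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplanes (assumed hash family: random projections)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_key(vec, hyperplanes):
    """Bucket key: the side of each hyperplane on which the vector falls."""
    return tuple(sum(v * h for v, h in zip(vec, plane)) >= 0 for plane in hyperplanes)


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def task_relevance(M, S, delta, n_bits=4, seed=0):
    """Approximate X(M, S) / |M| by comparing only points that share a bucket."""
    planes = make_hyperplanes(len(M[0]), n_bits, seed)
    buckets = defaultdict(lambda: ([], []))  # key -> (points from M, points from S)
    for d in M:
        buckets[lsh_key(d, planes)][0].append(d)
    for s in S:
        buckets[lsh_key(s, planes)][1].append(s)
    similar_pairs = sum(
        1
        for m_points, s_points in buckets.values()
        for d_i in m_points
        for d_j in s_points
        if euclidean(d_i, d_j) < delta
    )
    return similar_pairs / len(M)


# Toy feature vectors: one point of M coincides with the single sample point.
M = [[1.0, 0.0], [0.0, 1.0], [-5.0, -5.0]]
S = [[1.0, 0.0]]
relevance = task_relevance(M, S, delta=0.1)
```

Because distances are computed only within buckets, the cost depends on bucket sizes rather than on the full product of the two set sizes, at the price of missing similar pairs that happen to land in different buckets.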
S208, calculating, by a method based on locality-sensitive hashing, the average distance between the data set meeting the minimum intrinsic quality requirement and the sample data set, to obtain a content diversity evaluation result;
At the same time, the average distance between the data set M and the sample data set S is calculated by a sampling method based on locality-sensitive hashing, and this distance is used to approximate the content diversity. Specifically:
Locality-sensitive hashing is applied to the feature vectors so that, with high probability, similar data points are mapped into the same bucket and dissimilar data points are mapped into different buckets.
The data in all buckets are sampled uniformly to obtain a set G, the distance $\mathrm{Dis}(d_i, d_j)$ is calculated for all pairs of data points $d_i, d_j \in G$, and the mean of these distances is used to approximate the content diversity (the higher the sampling rate, the closer the calculated mean is to the true content diversity).
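A minimal sketch of this sampling estimate follows. It assumes a random-hyperplane hash, the same sampling rate in every bucket, and a pooled list of feature vectors as input; none of these details are fixed by the source, and the function names are hypothetical.

```python
import math
import random
from collections import defaultdict
from itertools import combinations


def lsh_buckets(points, n_bits=4, seed=0):
    """Group points by a random-hyperplane hash key (assumed hash family)."""
    rng = random.Random(seed)
    dim = len(points[0])
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]
    buckets = defaultdict(list)
    for p in points:
        key = tuple(sum(x * h for x, h in zip(p, plane)) >= 0 for plane in planes)
        buckets[key].append(p)
    return buckets


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def content_diversity(points, sample_rate=0.5, n_bits=4, seed=0):
    """Mean pairwise distance over a sample G drawn at the same rate from every bucket."""
    rng = random.Random(seed)
    sample = []
    for bucket in lsh_buckets(points, n_bits, seed).values():
        k = max(1, round(sample_rate * len(bucket)))  # keep at least one point per bucket
        sample.extend(rng.sample(bucket, k))
    pairs = list(combinations(sample, 2))
    if not pairs:
        return 0.0
    return sum(euclidean(a, b) for a, b in pairs) / len(pairs)


# With sample_rate=1.0 every point is kept, so the estimate is the exact mean.
points = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
diversity = content_diversity(points, sample_rate=1.0)
```

Sampling per bucket keeps every region of the feature space represented in G, which is why a modest sampling rate can still give a usable estimate.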
S209, calculating the ratio of the number of non-empty data items in the data set meeting the minimum intrinsic quality requirement and the sample data set to the total number of data items, to obtain an integrity evaluation result;
At the same time, the ratio of the number of non-empty data items in the data set meeting the minimum intrinsic quality requirement and the sample data set to the total number of data items is calculated to obtain an integrity evaluation result.
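The integrity ratio is straightforward to compute once "empty" is defined. The sketch below assumes that `None`, empty containers and whitespace-only strings count as empty; the source does not define emptiness, so this rule and the name `integrity_score` are illustrative assumptions.

```python
def is_empty(record):
    """Assumed emptiness rule: None, empty containers, or whitespace-only strings."""
    if record is None:
        return True
    if isinstance(record, str):
        return not record.strip()
    if isinstance(record, (bytes, list, tuple, dict, set)):
        return len(record) == 0
    return False


def integrity_score(records):
    """Return the ratio of non-empty records to the total number of records."""
    if not records:
        return 0.0
    non_empty = sum(1 for record in records if not is_empty(record))
    return non_empty / len(records)


# Two of the five hypothetical records carry content.
integrity = integrity_score(["a", "", None, "b", "  "])
```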
S210, evaluating whether the data volume of the data set meeting the minimum intrinsic quality requirement and the sample data set meets the requirement of a given task, to obtain a data volume suitability evaluation result;
At the same time, whether the data volume of the data set meeting the minimum intrinsic quality requirement and the sample data set meets the requirement of the given task is evaluated, to obtain a data volume suitability evaluation result.
S211, evaluating whether the service period of the data set meeting the minimum intrinsic quality requirement and the sample data set meets the requirement of a given task, to obtain a timeliness evaluation result;
At the same time, whether the service period of the data set meeting the minimum intrinsic quality requirement and the sample data set meets the requirement of the given task is evaluated, to obtain a timeliness evaluation result.
S212, quality sorting is carried out on the quality evaluation results.
Finally, a rank aggregation method is used: given the multiple input sequences of data set quality assessment results, the data set ranking that minimizes the Kendall tau distance to those sequences is computed as the best quality ranking. A data set ranked higher has higher data quality for the given task.
In summary, in the data quality evaluation method disclosed by the invention, the evaluation index is comprehensive, and the internal quality irrelevant to the task and the context quality relevant to the task are comprehensively considered; the evaluation process is efficient and is suitable for large-scale data collection; the evaluation method is high in universality and suitable for various types of data; the evaluation result has interpretability.
Fig. 3 is a schematic structural diagram of embodiment 1 of the data quality evaluation system disclosed in the present invention. The system may include:
the intrinsic quality evaluation module 301 is configured to evaluate the internal features of the data set that are not related to the task, so as to obtain a data set that meets the minimum intrinsic quality requirement;
When data quality needs to be evaluated, the data set is first evaluated on the task-independent intrinsic characteristics of the data along four dimensions: accuracy, precision, objectivity and reliability. For a data set D, a quantized value is obtained for each of the four dimensions, and the corresponding minimum quality requirements are $\theta_c$, $\theta_p$, $\theta_o$ and $\theta_r$, respectively. The data set D must meet the minimum intrinsic quality requirement R. Inferior data sets that do not meet R are placed directly at the bottom of the ordered list without further evaluation.
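The intrinsic quality gate can be sketched in Python as follows. This is a minimal sketch under a stated assumption: the requirement R is taken to mean that every quantized value is at least its threshold ($\theta_c$, $\theta_p$, $\theta_o$, $\theta_r$). The exact comparison is not recoverable from the text above, and the names `DIMENSIONS` and `split_by_intrinsic_quality` are hypothetical.

```python
from typing import Dict, List, Tuple

# The four intrinsic dimensions named in the description.
DIMENSIONS = ("accuracy", "precision", "objectivity", "reliability")


def split_by_intrinsic_quality(
    scores: Dict[str, Dict[str, float]],
    thresholds: Dict[str, float],
) -> Tuple[List[str], List[str]]:
    """Split data sets into those that pass the intrinsic gate and those that fail.

    Assumption: a data set passes when each of its four quantized values
    is greater than or equal to the matching minimum requirement.
    """
    passed: List[str] = []
    rejected: List[str] = []
    for name, values in scores.items():
        ok = all(values[dim] >= thresholds[dim] for dim in DIMENSIONS)
        (passed if ok else rejected).append(name)
    return passed, rejected


if __name__ == "__main__":
    thresholds = {"accuracy": 0.8, "precision": 0.7, "objectivity": 0.6, "reliability": 0.6}
    scores = {
        "D1": {"accuracy": 0.9, "precision": 0.8, "objectivity": 0.7, "reliability": 0.9},
        "D2": {"accuracy": 0.5, "precision": 0.9, "objectivity": 0.9, "reliability": 0.9},
    }
    # D1 continues to context quality evaluation; D2 goes to the bottom of the list.
    print(split_by_intrinsic_quality(scores, thresholds))
```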
The feature extraction module 302 is configured to perform feature extraction on each data in the data set and the sample data set that meet the minimum intrinsic quality requirement, so as to obtain a feature vector of each data;
For the data set M meeting the minimum intrinsic quality requirement R, the quality evaluator performs feature extraction on each data item in the data set M and the sample data set S to obtain the feature vector of each data item.
A context quality evaluation module 303, configured to perform context quality evaluation on the feature vector of each data in the data set and the sample data set that meet the minimum intrinsic quality requirement, to obtain a quality evaluation result;
Context quality assesses the extent to which a data set suits a given task. In this embodiment, the data consumer expresses the task's need for data by providing a small sample data set S. Context quality assessment is then performed on the feature vector of each data item in the data set meeting the minimum intrinsic quality requirement and in the sample data set along five dimensions: task relevance, content diversity, integrity, suitability of data volume, and timeliness, to obtain a quality assessment result.
The quality sorting module 304 is configured to sort the quality of the quality evaluation result.
Finally, a quality ordering method (rank aggregation) is used: given multiple input sequences of data set quality assessment results, the data set ordering that minimizes the Kendall tau distance to those sequences is calculated as the best quality ordering, in which a higher-ranked data set has higher data quality on the given task.
In summary, in the above embodiment, when the data quality needs to be evaluated, firstly, evaluating the internal features of the data unrelated to the task on the data set to obtain the data set meeting the minimum internal quality requirement; then, extracting the characteristics of each data in the data set and the sample data set meeting the minimum internal quality requirement to obtain the characteristic vector of each data; and carrying out context quality assessment on the feature vector of each data in the data set and the sample data set meeting the minimum internal quality requirement to obtain a quality assessment result, and carrying out quality sequencing on the quality assessment result. The invention can comprehensively consider the internal quality irrelevant to the task, the context quality relevant to the task and the requirement on large-scale data quality evaluation when evaluating the data quality, thereby effectively improving the comprehensiveness, accuracy and efficiency of the data quality evaluation.
Fig. 4 is a schematic structural diagram of embodiment 2 of the data quality evaluation system disclosed in the present invention. The system may include:
an accuracy evaluation unit 401, configured to evaluate accuracy, reliability and error-free degree of the data set by using a pattern matching method, so as to obtain an accuracy quantized value;
when the data quality needs to be evaluated, the data set is firstly evaluated for the internal characteristics of the data irrelevant to tasks from four dimensions of accuracy, precision, objectivity and reliability.
Specifically, the correctness, reliability and error-free degree of the data are evaluated using a pattern matching method to obtain an accuracy quantized value. For example, for text data, its spelling and grammar are evaluated for correctness.
A precision evaluation unit 402, configured to evaluate the precision of data collection and storage of the data set, and obtain a precision quantized value;
Meanwhile, the data acquisition and storage precision of the data set is evaluated to obtain a precision quantized value. For example, a pre-trained convolutional neural network is used to estimate the precision of an image, including its JPEG compression rate, degree of blurring, etc.
An objectivity evaluation unit 403, configured to evaluate the unbiased degree of the data set, so as to obtain an objectivity quantification value;
Meanwhile, the unbiased degree of the data set is evaluated by checking historical objective records and by questionnaire survey, so as to obtain an objectivity quantized value.
A dependability evaluation unit 404, configured to evaluate the degree of trust of the data source of the data set, and obtain a quantized reliability value;
Meanwhile, the degree of trust of the data source of the data set is evaluated by checking historical objective records and by questionnaire survey, so as to obtain a reliability quantized value.
A determining unit 405, configured to obtain a dataset that meets a minimum intrinsic quality requirement based on the accuracy quantization value, the objectivity quantization value, the reliability quantization value, the accuracy minimum quality requirement, the objectivity minimum quality requirement, and the reliability minimum quality requirement;
Then, based on the accuracy quantized value, the precision quantized value, the objectivity quantized value, the reliability quantized value, the accuracy minimum quality requirement value $\theta_c$, the precision minimum quality requirement value $\theta_p$, the objectivity minimum quality requirement value $\theta_o$ and the reliability minimum quality requirement value $\theta_r$, a data set meeting the minimum intrinsic quality requirement is obtained.
A feature extraction module 406, configured to perform feature extraction on each data in the data set and the sample data set that meet the minimum intrinsic quality requirement, so as to obtain a feature vector of each data;
For the data set M meeting the minimum intrinsic quality requirement R, the quality evaluator performs feature extraction on each data item in the data set M and the sample data set S to obtain the feature vector of each data item.
Specifically, for picture data, the eighth-layer features are extracted using a VGG-16 model as the feature vector of the picture; for text data, the penultimate-layer features are extracted using the BERT (Bidirectional Encoder Representations from Transformers) model as the feature representation of the text.
A task relevance evaluation unit 407, configured to calculate a ratio of the number of similar point pairs to the distance in the data set and the sample data set that satisfy the minimum internal quality requirement by using a method based on local sensitive hashing, to obtain a task relevance evaluation result;
Context quality assesses the extent to which a data set suits a given task. In this embodiment, the data consumer expresses the task's need for data by providing a small sample data set S. Context quality assessment is then performed on the feature vector of each data item in the data set meeting the minimum intrinsic quality requirement and in the sample data set along five dimensions: task relevance, content diversity, integrity, suitability of data volume, and timeliness, to obtain a quality assessment result.
Specifically, a method based on locality-sensitive hashing is used to calculate the ratio of the number $X(M, S)$ of similar point pairs between the data set M and the sample data set S to $|D|$, and this ratio is used to approximate the value of task relevance:
Locality-sensitive hashing is applied to the feature vectors so that, with high probability, similar data points are mapped into the same bucket and dissimilar data points are mapped into different buckets.
For data points $d_i \in M$ and $d_j \in S$ in the same bucket, the distance $\mathrm{Dis}(d_i, d_j)$ is calculated (e.g., the Euclidean distance or cosine distance between their feature vectors); when the distance is less than a threshold $\delta$, data points $d_i$ and $d_j$ are considered a similar point pair.
$X(M, S)/|D|$ is calculated, where $X(M, S)$ is the number of similar point pairs, and is used to approximate the value of task relevance.
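The three steps above can be sketched in pure Python as follows. This is an illustrative sketch only. The source does not specify the hash family, so random-hyperplane (sign) hashing is used here as an assumption; the function names, the `n_bits` and `seed` parameters, and the use of Euclidean distance via `math.dist` are likewise illustrative choices. The `total_size` argument stands for $|D|$.

```python
import math
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def make_hyperplanes(dim: int, n_bits: int, seed: int = 0) -> List[List[float]]:
    """Draw n_bits random hyperplanes (assumed hash family: sign of projection)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def bucket_key(v: Vector, planes: List[List[float]]) -> Tuple[int, ...]:
    """Hash a vector to a bucket: one bit per hyperplane side."""
    return tuple(int(sum(p * x for p, x in zip(plane, v)) >= 0.0) for plane in planes)


def task_relevance(
    m_vectors: List[Vector],
    s_vectors: List[Vector],
    total_size: int,
    delta: float,
    n_bits: int = 4,
    seed: int = 0,
) -> float:
    """Approximate task relevance as X(M, S) / |D|.

    Only pairs (d_i in M, d_j in S) that share a bucket are compared,
    and a pair is similar when its Euclidean distance is below delta.
    """
    if total_size <= 0:
        raise ValueError("total_size must be positive")
    if not m_vectors or not s_vectors:
        return 0.0
    planes = make_hyperplanes(len(m_vectors[0]), n_bits, seed)
    buckets: Dict[Tuple[int, ...], Tuple[List[Vector], List[Vector]]] = defaultdict(
        lambda: ([], [])
    )
    for v in m_vectors:
        buckets[bucket_key(v, planes)][0].append(v)
    for v in s_vectors:
        buckets[bucket_key(v, planes)][1].append(v)
    similar_pairs = sum(
        1
        for m_pts, s_pts in buckets.values()
        for a in m_pts
        for b in s_pts
        if math.dist(a, b) < delta
    )
    return similar_pairs / total_size


if __name__ == "__main__":
    M = [(1.0, 0.0), (0.0, 1.0)]
    S = [(1.0, 0.0)]
    print(task_relevance(M, S, total_size=2, delta=0.1))
```

Because only points that land in the same bucket are compared, the cost is far below that of comparing every pair, at the price of occasionally missing a similar pair that was hashed into different buckets.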
A content diversity evaluation unit 408, configured to calculate an average distance between the data set and the sample data set that meets the minimum internal quality requirement by using a method based on local sensitive hash, so as to obtain a content diversity evaluation result;
Meanwhile, a method based on locality-sensitive hash sampling is used to calculate an average distance between the data set M and the sample data set S, and this distance is used to approximate the value of content diversity. Specifically:
Locality-sensitive hashing is applied to the feature vectors so that, with high probability, similar data points are mapped into the same bucket and dissimilar data points are mapped into different buckets.
The data in all buckets are sampled uniformly to obtain a set G; the distances $\mathrm{Dis}(d_i, d_j)$, $d_i, d_j \in G$, between all pairs of data points in G are calculated, and their mean is used to approximate the value of content diversity (the higher the sampling rate, the closer the calculated mean is to the true content diversity).
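The sampling step above can be illustrated with a short Python sketch that takes already-built hash buckets as input. Several details are assumptions rather than statements from the source: "uniform sampling" is interpreted as drawing the same fraction `rate` from every bucket (at least one point per non-empty bucket), the distance is Euclidean, and the name `content_diversity` is hypothetical.

```python
import math
import random
from itertools import combinations
from typing import Dict, Hashable, List, Sequence

Vector = Sequence[float]


def content_diversity(
    buckets: Dict[Hashable, List[Vector]],
    rate: float,
    seed: int = 0,
) -> float:
    """Approximate content diversity as the mean pairwise distance in the sample G.

    Assumption: the same fraction `rate` (0 < rate <= 1) is drawn from each
    bucket, with at least one point taken from every non-empty bucket.
    """
    if not 0.0 < rate <= 1.0:
        raise ValueError("rate must be in (0, 1]")
    rng = random.Random(seed)
    sample: List[Vector] = []
    for points in buckets.values():
        if not points:
            continue
        k = max(1, round(rate * len(points)))
        sample.extend(rng.sample(points, k))
    pairs = list(combinations(sample, 2))
    if not pairs:
        return 0.0
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    buckets = {"b0": [(0.0, 0.0)], "b1": [(3.0, 4.0)]}
    # With rate=1.0 every point is kept, so the mean is the single distance.
    print(content_diversity(buckets, rate=1.0))
```

A lower `rate` reduces the number of pairwise distances computed, which is the efficiency gain the description refers to, while a higher `rate` brings the estimate closer to the exact mean.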
An integrity evaluation unit 409, configured to calculate a ratio of the number of non-null data in the data set and the sample data set that satisfy the minimum internal quality requirement to the total data amount, to obtain an integrity evaluation result;
Meanwhile, the ratio of the number of non-null data items to the total amount of data is calculated for the data set meeting the minimum intrinsic quality requirement and the sample data set, yielding an integrity evaluation result.
A fitness evaluation unit 410, configured to evaluate whether the data volume in the data set and the sample data set that meet the minimum intrinsic quality requirement meets the requirement of the given task, and obtain a fitness evaluation result of the data volume;
Meanwhile, whether the data volume of the data set meeting the minimum intrinsic quality requirement and of the sample data set meets the requirement of the given task is evaluated, yielding a data volume suitability evaluation result.
A timeliness evaluation unit 411, configured to evaluate whether a service cycle of a data set and a sample data set that meet a minimum intrinsic quality requirement meets a requirement of a given task, to obtain a timeliness evaluation result;
Meanwhile, whether the service period of the data set meeting the minimum intrinsic quality requirement and of the sample data set meets the requirement of the given task is evaluated, yielding a timeliness evaluation result.
A quality ranking module 412, configured to rank the quality evaluation results.
Finally, a quality ordering method (rank aggregation) is used: given multiple input sequences of data set quality assessment results, the data set ordering that minimizes the Kendall tau distance to those sequences is calculated as the best quality ordering, in which a higher-ranked data set has higher data quality on the given task.
In summary, in the data quality evaluation method disclosed by the invention, the evaluation index is comprehensive, and the internal quality irrelevant to the task and the context quality relevant to the task are comprehensively considered; the evaluation process is efficient and is suitable for large-scale data collection; the evaluation method is high in universality and suitable for various types of data; the evaluation result has interpretability.
In this specification, the embodiments are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and identical or similar parts of the embodiments may be understood by reference to one another. Since the device disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is relatively brief; for relevant details, refer to the description of the method.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (6)
1. A method of data quality assessment, comprising:
evaluating the correctness, reliability and error-free degree of the data set by a pattern matching method to obtain an accuracy quantized value;
evaluating the data acquisition and storage precision of the data set to obtain an accuracy quantization value;
evaluating the unbiased degree of the data set to obtain an objectivity quantification value;
evaluating the trusted degree of the data source of the data set to obtain a reliability quantification value;
obtaining a data set meeting the minimum internal quality requirement based on the accuracy quantized value, the objectivity quantized value, the reliability quantized value, the accuracy minimum quality requirement, the objectivity minimum quality requirement and the reliability minimum quality requirement;
extracting the characteristics of each data in the data set and the sample data set meeting the minimum internal quality requirement to obtain the characteristic vector of each data;
calculating the ratio of the number of similar point pairs to the distance in the data set meeting the minimum internal quality requirement and the sample data set by adopting a method based on local sensitive hash to obtain a task correlation evaluation result;
calculating the average distance between the data set meeting the minimum internal quality requirement and the sample data set by adopting a method based on local sensitive hash to obtain a content diversity evaluation result;
calculating the ratio of the number of non-empty data in the data set meeting the minimum internal quality requirement and the sample data set to the total data amount to obtain an integrity evaluation result;
evaluating whether the data volume in the data set and the sample data set meeting the minimum internal quality requirement meets the requirement of a given task or not, and obtaining a suitable degree evaluation result of the data volume;
evaluating whether the service cycle of the data set and the sample data set meeting the minimum internal quality requirement meets the requirement of a given task or not, and obtaining a timeliness evaluation result;
and carrying out quality sorting on the quality evaluation results.
2. The method of claim 1, wherein the feature extraction of each data in the data set and the sample data set meeting the minimum intrinsic quality requirement to obtain a feature vector for each data comprises:
and extracting the eighth layer of features from each picture data in the data set and the sample data set meeting the minimum intrinsic quality requirement by using a VGG-16 model as feature vectors of the picture data.
3. The method of claim 1, wherein the feature extraction of each data in the data set and the sample data set meeting the minimum intrinsic quality requirement to obtain a feature vector for each data comprises:
and extracting the penultimate layer of features from each text data in the data set and the sample data set meeting the minimum intrinsic quality requirement by using a BERT model as feature vectors of the text data.
4. A data quality assessment system, comprising:
the intrinsic quality evaluation module is used for evaluating the internal characteristics of the data irrelevant to the task on the data set to obtain the data set meeting the minimum intrinsic quality requirement;
the feature extraction module is used for carrying out feature extraction on each data in the data set and the sample data set meeting the minimum internal quality requirement to obtain a feature vector of each data;
the context quality evaluation module is used for performing context quality evaluation on the feature vector of each data in the data set and the sample data set meeting the minimum internal quality requirement to obtain a quality evaluation result;
the quality sorting module is used for sorting the quality of the quality evaluation result;
the intrinsic quality assessment module comprises:
the accuracy evaluation unit is used for evaluating the accuracy, the reliability and the degree of no error of the data set through a pattern matching method to obtain an accuracy quantized value;
the accuracy evaluation unit is used for evaluating the data acquisition and storage accuracy of the data set to obtain an accuracy quantized value;
the objectivity evaluation unit is used for evaluating the unbiased degree of the data set to obtain an objectivity quantification value;
the dependability evaluation unit is used for evaluating the trusted degree of the data source of the data set to obtain a dependability quantized value;
a determining unit, configured to obtain a data set that meets a minimum intrinsic quality requirement based on the accuracy quantization value, the objectivity quantization value, the reliability quantization value, the accuracy minimum quality requirement, the objectivity minimum quality requirement, and the reliability minimum quality requirement;
the context quality assessment module comprises:
the task relevance evaluation unit is used for calculating the ratio of the number of similar point pairs to the distance in the data set and the sample data set meeting the minimum internal quality requirement by adopting a method based on local sensitive hash to obtain a task relevance evaluation result;
the content diversity evaluation unit is used for calculating the average distance between the data set meeting the minimum internal quality requirement and the sample data set by adopting a method based on local sensitive hash to obtain a content diversity evaluation result;
the integrity evaluation unit is used for calculating the ratio of the number of non-empty data in the data set meeting the minimum internal quality requirement and the sample data set to the total data amount to obtain an integrity evaluation result;
the data volume fitness evaluation unit is used for evaluating whether the data volume in the data set meeting the minimum internal quality requirement and the sample data set meets the requirement of a given task or not, and obtaining a data volume fitness evaluation result;
and the timeliness evaluation unit is used for evaluating whether the service cycle of the data set meeting the minimum internal quality requirement and the sample data set meets the requirement of a given task or not, so as to obtain a timeliness evaluation result.
5. The system of claim 4, wherein the feature extraction module is specifically configured to:
and extracting the eighth layer of features from each picture data in the data set and the sample data set meeting the minimum intrinsic quality requirement by using a VGG-16 model as feature vectors of the picture data.
6. The system of claim 4, wherein the feature extraction module is specifically configured to:
and extracting the penultimate layer of features from each text data in the data set and the sample data set meeting the minimum intrinsic quality requirement by using a BERT model as feature vectors of the text data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010472680.2A CN111612783B (en) | 2020-05-28 | 2020-05-28 | Data quality assessment method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111612783A (en) | 2020-09-01 |
CN111612783B (en) | 2023-10-24 |
Family
ID=72200233
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010472680.2A Active CN111612783B (en) | 2020-05-28 | 2020-05-28 | Data quality assessment method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111612783B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CB03 | Change of inventor or designer information | Inventor after: Li Xiangyang, Li Anran, Zhang Lan, Xie Junting. Inventor before: Li Anran, Zhang Lan, Li Xiangyang, Xie Junting |