CN116303387A - Scientific data center data quality assessment method
- Publication number
- CN116303387A (application number CN202310167738.6A)
- Authority
- CN
- China
- Prior art keywords
- data
- evaluation
- feedback
- scientific
- quality
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/23—Updating
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Quality & Reliability (AREA)
- Computational Linguistics (AREA)
- General Factory Administration (AREA)
Abstract
The invention discloses a data quality assessment method for a scientific data center, comprising the following steps: step one, importing data; step two, quality analysis; step three, formulating weighting rules; step four, quality assessment; step five, setting feedback thresholds; step six, collecting feedback; step seven, feedback correction; step eight, iterative updating. In step two, the relevance of the data is determined from the quality of the content associated with the data, and the repeatability of the data is analyzed against similar data from the same background. By formulating evaluation rules with several different emphases, the invention broadens the applicability of the assessment method; by collecting user feedback and using it to correct the evaluation results, it allows evaluation errors to be found promptly.
Description
Technical Field
The invention relates to the technical field of data quality assessment, and in particular to a data quality assessment method for a scientific data center.
Background
With the continuous development of science and technology, information spreads ever faster and system applications multiply. Data processing plays a central role in these applications, and the quality of the data determines whether an application can earn the trust of its users. Existing data quality assessment methods have three shortcomings. First, they take multidimensional analysis of the data as the basis for evaluation, but they judge data validity mainly from the content and quantity of the data without examining its source, so their accuracy leaves room for improvement. Second, when formulating evaluation rules they combine the analysis results of all dimensions and evaluate them uniformly; the result is comprehensive but cannot adapt to the different lookup preferences of users. Third, they do not collect user feedback, so evaluation errors cannot be found and corrected promptly.
Disclosure of Invention
The invention aims to provide a data quality assessment method for a scientific data center that solves the problems described in the Background.
To achieve this purpose, the invention provides the following technical solution: a data quality assessment method for a scientific data center, comprising the following steps: step one, importing data; step two, quality analysis; step three, formulating weighting rules; step four, quality assessment; step five, setting feedback thresholds; step six, collecting feedback; step seven, feedback correction; step eight, iterative updating;
in step one, data in the scientific data center database is imported into an evaluation system;
in step two, the data is analyzed separately along six [sic] dimensions: timeliness, validity, relevance, integrity, and repeatability;
in step three, using the dimension classification of step two, evaluation weighting rules are formulated, each taking an arbitrarily chosen dimension as its primary dimension;
in step four, data quality is evaluated under each rule formulated in step three, yielding different evaluation results that are used to order the displayed data for different search preferences;
in step five, feedback results are divided into positive feedback and negative feedback, and thresholds of different levels are set for each;
in step six, after users search for and consult the data, their feedback is collected and summarized;
in step seven, according to the feedback results, the feedback is incorporated into the evaluation weighting rules to correct the evaluation results;
in step eight, a data quality evaluation period is set; if corrections within a period cause the evaluated quality to fluctuate substantially, the data is re-evaluated immediately; otherwise it is re-evaluated when the period ends.
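Steps three and four can be illustrated with a minimal Python sketch. It is not the patented implementation: the weight of 0.6 for the primary dimension, the even split of the remainder, the score range of 0 to 1, and the function names `make_rule`, `evaluate`, and `rank` are all assumptions made for illustration, and only the five dimensions the text names are used.

```python
# The five dimensions named in step two; everything numeric below is an
# illustrative assumption, not a value taken from the patent.
DIMENSIONS = ("timeliness", "validity", "relevance", "integrity", "repeatability")


def make_rule(primary: str, primary_weight: float = 0.6) -> dict[str, float]:
    """Build a weighting rule in which `primary` dominates.

    The remaining weight is shared equally by the other dimensions.
    """
    if primary not in DIMENSIONS:
        raise ValueError(f"unknown dimension: {primary}")
    rest = (1.0 - primary_weight) / (len(DIMENSIONS) - 1)
    return {d: (primary_weight if d == primary else rest) for d in DIMENSIONS}


def evaluate(scores: dict[str, float], rule: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores, each assumed to lie in [0, 1]."""
    return sum(rule[d] * scores[d] for d in DIMENSIONS)


def rank(datasets: dict[str, dict[str, float]], primary: str) -> list[str]:
    """Order dataset ids from best to worst under the rule led by `primary`."""
    rule = make_rule(primary)
    return sorted(datasets, key=lambda name: evaluate(datasets[name], rule),
                  reverse=True)
```

Calling `rank(datasets, "timeliness")` and `rank(datasets, "validity")` on the same data can produce different orderings, which is the sense in which each rule serves a different search preference.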
Preferably, in step two, the timeliness of the data is analyzed according to the submission time of the data and whether the data has been updated.
Preferably, in step two, the validity of the data is analyzed according to the content and quantity of the data and the reliability of its source.
Preferably, in step two, the relevance of the data is determined according to the quality of the content associated with the data.
Preferably, in step two, the integrity of the data covers the source background of the data, the process by which it was obtained, and the final result.
Preferably, in step two, the repeatability of the data is analyzed against similar data from the same background.
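As one possible reading of the first of these analyses, the sketch below scores timeliness (the real-time dimension) from the submission time and the latest update. The text does not give a formula; the exponential decay, the one-year half-life, and the name `timeliness_score` are assumptions chosen only to make the idea concrete.

```python
from datetime import datetime
from typing import Optional


def timeliness_score(submitted: datetime, last_update: Optional[datetime],
                     now: datetime, half_life_days: float = 365.0) -> float:
    """Return a score in (0, 1] that halves every `half_life_days`.

    Age is measured from the most recent activity: the last update if the
    data has been updated, otherwise the submission time.
    """
    latest = submitted if last_update is None else max(submitted, last_update)
    age_days = max((now - latest).total_seconds() / 86400.0, 0.0)
    return 0.5 ** (age_days / half_life_days)
```

Under these assumptions a record that has just been updated scores close to 1 regardless of when it was first submitted, while a record left untouched loses half its score per half-life.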
Preferably, in step five, when thresholds are set for the feedback results, the thresholds for both positive and negative feedback can be divided into a correction threshold and a re-evaluation threshold.
Preferably, in step seven, when the correction threshold is reached, the data evaluation is corrected upward or downward according to the feedback result; when the re-evaluation threshold is reached, that is, when the feedback result deviates greatly from the initial evaluation, the evaluation is recalculated with the feedback included in the weighting.
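The two-level threshold logic of steps five and seven can be sketched as follows. The text does not say how feedback is counted, so this sketch assumes a simple net count (positive minus negative); the threshold values, the correction step of 0.05, and the name `apply_feedback` are likewise hypothetical.

```python
def apply_feedback(score: float, positive: int, negative: int,
                   correction_threshold: int = 10,
                   reevaluation_threshold: int = 50,
                   step: float = 0.05) -> tuple[float, bool]:
    """Return (score after any correction, whether full re-evaluation is needed).

    Assumption: feedback strength is the net count positive - negative.
    """
    net = positive - negative
    if abs(net) >= reevaluation_threshold:
        # Feedback deviates strongly from the initial evaluation:
        # leave the score untouched and signal a full recalculation.
        return score, True
    if abs(net) >= correction_threshold:
        # Nudge the score up for net positive feedback, down for net negative,
        # keeping it inside the assumed [0, 1] range.
        corrected = score + step if net > 0 else score - step
        return min(max(corrected, 0.0), 1.0), False
    # Too little feedback to act on.
    return score, False
```

Separating the two thresholds keeps small, frequent adjustments cheap while reserving the full recalculation for cases where users clearly disagree with the initial evaluation.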
Compared with the prior art, the invention has the following beneficial effects: it analyzes data validity by examining the data source, which improves the accuracy of the quality assessment; it broadens the applicability of the assessment method by formulating evaluation rules with several different emphases; and it corrects evaluation results using collected user feedback, so that evaluation errors can be found promptly.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art on the basis of these embodiments without inventive effort fall within the scope of the invention.
Referring to FIG. 1, the invention provides the following embodiment: a data quality assessment method for a scientific data center, comprising the following steps: step one, importing data; step two, quality analysis; step three, formulating weighting rules; step four, quality assessment; step five, setting feedback thresholds; step six, collecting feedback; step seven, feedback correction; step eight, iterative updating;
in step one, data in the scientific data center database is imported into an evaluation system;
in step two, the data is analyzed separately along six [sic] dimensions: timeliness, validity, relevance, integrity, and repeatability, wherein the timeliness of the data is analyzed according to the submission time of the data and whether the data has been updated; the validity of the data is analyzed according to the content and quantity of the data and the reliability of its source; the relevance of the data is determined according to the quality of the content associated with the data; the integrity of the data covers the source background of the data, the process by which it was obtained, and the final result; and the repeatability of the data is analyzed against similar data from the same background;
in step three, using the dimension classification of step two, evaluation weighting rules are formulated, each taking an arbitrarily chosen dimension as its primary dimension;
in step four, data quality is evaluated under each rule formulated in step three, yielding different evaluation results that are used to order the displayed data for different search preferences;
in step five, feedback results are divided into positive feedback and negative feedback, and thresholds of different levels are set for each; the thresholds for both positive and negative feedback can be divided into a correction threshold and a re-evaluation threshold;
in step six, after users search for and consult the data, their feedback is collected and summarized;
in step seven, according to the feedback results, the feedback is incorporated into the evaluation weighting rules to correct the evaluation results; when the correction threshold is reached, the data evaluation is corrected upward or downward according to the feedback result; when the re-evaluation threshold is reached, that is, when the feedback result deviates greatly from the initial evaluation, the evaluation is recalculated with the feedback included in the weighting;
in step eight, a data quality evaluation period is set; if corrections within a period cause the evaluated quality to fluctuate substantially, the data is re-evaluated immediately; otherwise it is re-evaluated when the period ends.
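The update rule of step eight reduces to a two-condition check, sketched below. The 30-day period, the fluctuation limit of 0.2, and the name `needs_reevaluation` are assumptions; the text only states that a period is set and that a large fluctuation triggers immediate re-evaluation.

```python
from datetime import datetime, timedelta


def needs_reevaluation(last_evaluated: datetime, now: datetime,
                       baseline_score: float, corrected_score: float,
                       period: timedelta = timedelta(days=30),
                       fluctuation_limit: float = 0.2) -> bool:
    """Decide whether a data item should be re-evaluated now.

    True if feedback corrections have moved the score by more than
    `fluctuation_limit` since the last evaluation, or if a full
    evaluation period has elapsed.
    """
    large_fluctuation = abs(corrected_score - baseline_score) > fluctuation_limit
    period_elapsed = now - last_evaluated >= period
    return large_fluctuation or period_elapsed
```

A scheduler could call this check whenever a correction is applied and once more at each period boundary.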
In summary, the invention has the following advantages in use: it combines an examination of the data source with the traditional validity analysis, which improves the accuracy of the data analysis; it uses several different data quality evaluation rules, which avoids the problem of being unable to adapt to users' different lookup preferences; and it sets feedback thresholds, collects feedback results, and corrects or re-evaluates the data quality accordingly, so that the data quality assessment stays in line with users' perception.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.
Claims (8)
1. A data quality assessment method for a scientific data center, comprising the following steps: step one, importing data; step two, quality analysis; step three, formulating weighting rules; step four, quality assessment; step five, setting feedback thresholds; step six, collecting feedback; step seven, feedback correction; step eight, iterative updating; characterized in that:
in step one, data in the scientific data center database is imported into an evaluation system;
in step two, the data is analyzed separately along six [sic] dimensions: timeliness, validity, relevance, integrity, and repeatability;
in step three, using the dimension classification of step two, evaluation weighting rules are formulated, each taking an arbitrarily chosen dimension as its primary dimension;
in step four, data quality is evaluated under each rule formulated in step three, yielding different evaluation results that are used to order the displayed data for different search preferences;
in step five, feedback results are divided into positive feedback and negative feedback, and thresholds of different levels are set for each;
in step six, after users search for and consult the data, their feedback is collected and summarized;
in step seven, according to the feedback results, the feedback is incorporated into the evaluation weighting rules to correct the evaluation results;
in step eight, a data quality evaluation period is set; if corrections within a period cause the evaluated quality to fluctuate substantially, the data is re-evaluated immediately; otherwise it is re-evaluated when the period ends.
2. The scientific data center data quality assessment method according to claim 1, characterized in that: in step two, the timeliness of the data is analyzed according to the submission time of the data and whether the data has been updated.
3. The scientific data center data quality assessment method according to claim 1, characterized in that: in step two, the validity of the data is analyzed according to the content and quantity of the data and the reliability of its source.
4. The scientific data center data quality assessment method according to claim 1, characterized in that: in step two, the relevance of the data is determined according to the quality of the content associated with the data.
5. The scientific data center data quality assessment method according to claim 1, characterized in that: in step two, the integrity of the data covers the source background of the data, the process by which it was obtained, and the final result.
6. The scientific data center data quality assessment method according to claim 1, characterized in that: in step two, the repeatability of the data is analyzed against similar data from the same background.
7. The scientific data center data quality assessment method according to claim 1, characterized in that: in step five, when thresholds are set for the feedback results, the thresholds for both positive and negative feedback can be divided into a correction threshold and a re-evaluation threshold.
8. The scientific data center data quality assessment method according to claim 1, characterized in that: in step seven, when the correction threshold is reached, the data evaluation is corrected upward or downward according to the feedback result; when the re-evaluation threshold is reached, that is, when the feedback result deviates greatly from the initial evaluation, the evaluation is recalculated with the feedback included in the weighting.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310167738.6A CN116303387A (en) | 2023-02-27 | 2023-02-27 | Scientific data center data quality assessment method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310167738.6A CN116303387A (en) | 2023-02-27 | 2023-02-27 | Scientific data center data quality assessment method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116303387A (en) | 2023-06-23 |
Family
ID=86837179
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310167738.6A Pending CN116303387A (en) | 2023-02-27 | 2023-02-27 | Scientific data center data quality assessment method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116303387A (en) |
- 2023-02-27: CN application CN202310167738.6A (CN116303387A), status: active, Pending
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication |