CN116303387A - Scientific data center data quality assessment method

Info

Publication number
CN116303387A
Authority
CN
China
Prior art keywords
data
evaluation
feedback
scientific
quality
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310167738.6A
Other languages
Chinese (zh)
Inventor
伍观娣
陶玉柱
李一凡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Academy of Forestry
Original Assignee
Guangdong Academy of Forestry
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2023-02-27
Filing date
2023-02-27
Publication date
2023-06-23
Application filed by Guangdong Academy of Forestry
Priority to CN202310167738.6A
Publication of CN116303387A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • General Factory Administration (AREA)

Abstract

The invention discloses a data quality assessment method for a scientific data center, comprising the following steps: step one, importing data; step two, quality analysis; step three, formulating weighting rules; step four, quality assessment; step five, setting feedback thresholds; step six, collecting feedback; step seven, feedback correction; step eight, update iteration. In step two, the relevance of the data is determined according to the quality of the content associated with the data, and the repeatability of the data is analyzed by comparison with similar data from the same background. By formulating evaluation rules with several different emphases, the invention broadens the applicability of the evaluation method; by collecting user feedback and using it to correct the evaluation results, it allows evaluation errors to be found in time.

Description

Scientific data center data quality assessment method
Technical Field
The invention relates to the technical field of data quality evaluation, and in particular to a data quality evaluation method for a scientific data center.
Background
With the continuous development of science and technology, information spreads ever faster and system applications multiply. Data processing plays an extremely important role in these applications, and data quality determines whether an application can win the trust of its users. Existing data quality evaluation methods have three shortcomings. First, they take multidimensional analysis of the data as the basis for evaluation, but judge data validity mainly from the content and quantity of the data without examining its source, so their accuracy leaves room for improvement. Second, when formulating an evaluation rule they combine the analysis results of all dimensions into a single uniform evaluation; the result is comprehensive but cannot adapt to the different search preferences of users. Third, because they do not collect feedback from users, they cannot find and correct evaluation errors in time.
Disclosure of Invention
The invention aims to provide a data quality evaluation method for a scientific data center that solves the problems described in the Background section.
To achieve this purpose, the invention provides the following technical solution: a data quality evaluation method for a scientific data center, comprising the following steps: step one, importing data; step two, quality analysis; step three, formulating weighting rules; step four, quality assessment; step five, setting feedback thresholds; step six, collecting feedback; step seven, feedback correction; step eight, update iteration;
in step one, data in the scientific data center database is imported into an evaluation system;
in step two, the data is analyzed separately along six dimensions, including timeliness, validity, relevance, integrity and repeatability;
in step three, evaluation weighting rules are formulated from the dimensions of step two, each rule taking any one of the dimensions as its dominant dimension;
in step four, data quality is evaluated under each rule formulated in step three, yielding a different evaluation result per rule; these results are used to order the displayed data according to different search preferences;
in step five, feedback results are divided into positive feedback and negative feedback, and thresholds of different levels are set for each;
in step six, after users search for and consult the data, their feedback is collected and summarized;
in step seven, the feedback is incorporated into the evaluation weighting rules according to the feedback results, so as to correct the evaluation results;
in step eight, a data quality evaluation period is set; if corrections within a period cause the evaluated quality to fluctuate widely, the data is re-evaluated immediately; otherwise, it is re-evaluated when the period ends.
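Steps three and four can be illustrated with a minimal Python sketch. It is not the patented implementation: the weight of 0.5 for the dominant dimension, the even split of the remaining weight, the [0, 1] score range and all function names are assumptions made for illustration, and only the five dimensions that the text names are used.

```python
from typing import Dict, List

# The five dimensions named in the text (a sixth is stated but not named).
DIMENSIONS = ["timeliness", "validity", "relevance", "integrity", "repeatability"]


def make_rule(primary: str, primary_weight: float = 0.5) -> Dict[str, float]:
    """Build one weighting rule in which `primary` is the dominant dimension.

    The default weight and the even split of the remainder are assumptions;
    the source does not give concrete weights.
    """
    if primary not in DIMENSIONS:
        raise ValueError(f"unknown dimension: {primary}")
    rest = (1.0 - primary_weight) / (len(DIMENSIONS) - 1)
    return {d: (primary_weight if d == primary else rest) for d in DIMENSIONS}


def evaluate(scores: Dict[str, float], rule: Dict[str, float]) -> float:
    """Weighted sum of per-dimension scores, each assumed to lie in [0, 1]."""
    return sum(rule[d] * scores[d] for d in DIMENSIONS)


def rank(datasets: Dict[str, Dict[str, float]], rule: Dict[str, float]) -> List[str]:
    """Order dataset ids from highest to lowest quality under one rule."""
    return sorted(datasets, key=lambda k: evaluate(datasets[k], rule), reverse=True)


# One rule per dimension, so each search preference gets its own ordering.
RULES = {d: make_rule(d) for d in DIMENSIONS}

datasets = {
    "A": {"timeliness": 0.9, "validity": 0.4, "relevance": 0.5,
          "integrity": 0.5, "repeatability": 0.5},
    "B": {"timeliness": 0.3, "validity": 0.9, "relevance": 0.5,
          "integrity": 0.5, "repeatability": 0.5},
}

print(rank(datasets, RULES["timeliness"]))  # ['A', 'B']
print(rank(datasets, RULES["validity"]))    # ['B', 'A']
```

The same two data sets are ordered differently depending on which rule is applied, which is the effect step four relies on when it presents results for different search preferences.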
Preferably, in step two, the timeliness of the data is analyzed according to its submission time and whether it has since been updated.
Preferably, in step two, the validity of the data is analyzed according to its content, its quantity and the reliability of its source.
Preferably, in step two, the relevance of the data is determined according to the quality of the content associated with the data.
Preferably, in step two, the integrity of the data covers the source background of the data, the process by which it was obtained and the final result.
Preferably, in step two, the repeatability of the data is analyzed by comparison with similar data from the same background.
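Two of these per-dimension analyses might be scored as in the following Python sketch. The text says only which inputs each dimension considers; the exponential decay, the one-year half-life, the field names and the fraction-of-fields measure are all assumptions chosen for illustration.

```python
from datetime import datetime
from typing import Optional

# Hypothetical field names for the three parts that integrity covers.
INTEGRITY_FIELDS = ("source_background", "acquisition_process", "final_result")


def timeliness_score(submitted: datetime,
                     updated: Optional[datetime],
                     now: datetime,
                     half_life_days: float = 365.0) -> float:
    """Score in (0, 1] that halves every `half_life_days` since the latest change.

    Uses the update time when one exists, otherwise the submission time.
    """
    latest = max(submitted, updated) if updated is not None else submitted
    age_days = max((now - latest).total_seconds() / 86400.0, 0.0)
    return 0.5 ** (age_days / half_life_days)


def integrity_score(record: dict) -> float:
    """Fraction of the three integrity parts that are present and non-empty."""
    present = sum(1 for field in INTEGRITY_FIELDS if record.get(field))
    return present / len(INTEGRITY_FIELDS)


now = datetime(2024, 1, 1)
print(timeliness_score(datetime(2023, 1, 1), None, now))  # 0.5
print(timeliness_score(datetime(2023, 1, 1), now, now))   # 1.0
```

Scores produced this way lie in [0, 1], so they can be combined directly by a weighted sum over dimensions.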
Preferably, in step five, when thresholds are set for the feedback results, the thresholds for positive feedback and for negative feedback may each be divided into a correction threshold and a re-evaluation threshold.
Preferably, in step seven, when the correction threshold is reached, the data evaluation is corrected upward or downward according to the feedback result; when the re-evaluation threshold is reached, that is, when the feedback result deviates greatly from the initial evaluation, the evaluation is recalculated with the feedback included in the weighting.
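The two-level thresholds of steps five and seven can be sketched in Python as follows. The threshold values, the use of net feedback counts, the fixed correction step of 0.05 and all names are illustrative assumptions; the text does not specify how feedback is quantified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    correction: int      # net feedback count that triggers a small correction
    re_evaluation: int   # net feedback count that triggers a full re-evaluation


# Assumed values; positive and negative feedback get different levels.
POSITIVE = Thresholds(correction=10, re_evaluation=50)
NEGATIVE = Thresholds(correction=5, re_evaluation=20)


def decide(positive: int, negative: int) -> str:
    """Map collected feedback counts to an action for step seven."""
    net = positive - negative
    limits = POSITIVE if net >= 0 else NEGATIVE
    size = abs(net)
    if size >= limits.re_evaluation:
        return "re-evaluate"
    if size >= limits.correction:
        return "correct-up" if net > 0 else "correct-down"
    return "keep"


def correct(score: float, action: str, step: float = 0.05) -> float:
    """Apply a positive or negative correction, clamped to [0, 1]."""
    if action == "correct-up":
        score += step
    elif action == "correct-down":
        score -= step
    return min(1.0, max(0.0, score))


print(decide(12, 1))   # correct-up
print(decide(2, 9))    # correct-down
print(decide(0, 25))   # re-evaluate
```

In this sketch a "re-evaluate" result would hand the data back to the quality evaluation of step four with the feedback folded into the weights, rather than adjusting the existing score.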
Compared with the prior art, the invention has the following beneficial effects: it analyzes data validity by examining the data source, which improves the accuracy of quality assessment; it broadens the applicability of the evaluation method by formulating evaluation rules with several different emphases; and it corrects the evaluation results using collected user feedback, so that evaluation errors can be found in time.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to FIG. 1, the invention provides the following embodiment: a data quality evaluation method for a scientific data center, comprising the following steps: step one, importing data; step two, quality analysis; step three, formulating weighting rules; step four, quality assessment; step five, setting feedback thresholds; step six, collecting feedback; step seven, feedback correction; step eight, update iteration;
in step one, data in the scientific data center database is imported into an evaluation system;
in step two, the data is analyzed along six dimensions, including timeliness, validity, relevance, integrity and repeatability, wherein the timeliness of the data is analyzed according to its submission time and whether it has since been updated, the validity of the data is analyzed according to its content, its quantity and the reliability of its source, the relevance of the data is determined according to the quality of the content associated with the data, the integrity of the data covers the source background of the data, the process by which it was obtained and the final result, and the repeatability of the data is analyzed by comparison with similar data from the same background;
in step three, evaluation weighting rules are formulated from the dimensions of step two, each rule taking any one of the dimensions as its dominant dimension;
in step four, data quality is evaluated under each rule formulated in step three, yielding a different evaluation result per rule; these results are used to order the displayed data according to different search preferences;
in step five, feedback results are divided into positive feedback and negative feedback, and thresholds of different levels are set for each; the thresholds for positive feedback and for negative feedback may each be divided into a correction threshold and a re-evaluation threshold;
in step six, after users search for and consult the data, their feedback is collected and summarized;
in step seven, the feedback is incorporated into the evaluation weighting rules according to the feedback results, so as to correct the evaluation results; when the correction threshold is reached, the data evaluation is corrected upward or downward according to the feedback result; when the re-evaluation threshold is reached, that is, when the feedback result deviates greatly from the initial evaluation, the evaluation is recalculated with the feedback included in the weighting;
in step eight, a data quality evaluation period is set; if corrections within a period cause the evaluated quality to fluctuate widely, the data is re-evaluated immediately; otherwise, it is re-evaluated when the period ends.
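The trigger logic of step eight can be sketched in Python as below. The 30-day period, the fluctuation limit of 0.2 and the function and parameter names are assumptions for illustration; the embodiment does not state how large a fluctuation counts as "large".

```python
from datetime import datetime, timedelta


def should_re_evaluate(last_evaluated: datetime,
                       now: datetime,
                       baseline_score: float,
                       current_score: float,
                       period: timedelta = timedelta(days=30),
                       max_fluctuation: float = 0.2) -> bool:
    """Decide whether a data set is due for re-evaluation.

    Returns True immediately when corrections have moved the score by more
    than `max_fluctuation` since the last evaluation, and otherwise only
    once a full evaluation period has elapsed.
    """
    if abs(current_score - baseline_score) > max_fluctuation:
        return True
    return now - last_evaluated >= period


start = datetime(2024, 1, 1)
# Large swing after ten days: re-evaluate at once.
print(should_re_evaluate(start, start + timedelta(days=10), 0.8, 0.5))   # True
# Small swing after ten days: wait for the period to end.
print(should_re_evaluate(start, start + timedelta(days=10), 0.8, 0.75))  # False
# Small swing, but the period has ended.
print(should_re_evaluate(start, start + timedelta(days=30), 0.8, 0.75))  # True
```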
Based on the above, the advantages of the invention in use are as follows: examination of the data source is combined with conventional validity analysis, which improves the accuracy of data analysis; several different data quality evaluation rules are used, which avoids the inability to adapt to users' different search preferences; and feedback thresholds are set, feedback results are collected, and the data quality is corrected or re-evaluated according to both, which keeps the data quality evaluation aligned with users' perception.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.

Claims (8)

1. A data quality evaluation method for a scientific data center, comprising the following steps: step one, importing data; step two, quality analysis; step three, formulating weighting rules; step four, quality assessment; step five, setting feedback thresholds; step six, collecting feedback; step seven, feedback correction; step eight, update iteration; characterized in that:
in step one, data in the scientific data center database is imported into an evaluation system;
in step two, the data is analyzed separately along six dimensions, including timeliness, validity, relevance, integrity and repeatability;
in step three, evaluation weighting rules are formulated from the dimensions of step two, each rule taking any one of the dimensions as its dominant dimension;
in step four, data quality is evaluated under each rule formulated in step three, yielding a different evaluation result per rule; these results are used to order the displayed data according to different search preferences;
in step five, feedback results are divided into positive feedback and negative feedback, and thresholds of different levels are set for each;
in step six, after users search for and consult the data, their feedback is collected and summarized;
in step seven, the feedback is incorporated into the evaluation weighting rules according to the feedback results, so as to correct the evaluation results;
in step eight, a data quality evaluation period is set; if corrections within a period cause the evaluated quality to fluctuate widely, the data is re-evaluated immediately; otherwise, it is re-evaluated when the period ends.
2. The scientific data center data quality assessment method according to claim 1, characterized in that: in step two, the timeliness of the data is analyzed according to its submission time and whether it has since been updated.
3. The scientific data center data quality assessment method according to claim 1, characterized in that: in step two, the validity of the data is analyzed according to its content, its quantity and the reliability of its source.
4. The scientific data center data quality assessment method according to claim 1, characterized in that: in step two, the relevance of the data is determined according to the quality of the content associated with the data.
5. The scientific data center data quality assessment method according to claim 1, characterized in that: in step two, the integrity of the data covers the source background of the data, the process by which it was obtained and the final result.
6. The scientific data center data quality assessment method according to claim 1, characterized in that: in step two, the repeatability of the data is analyzed by comparison with similar data from the same background.
7. The scientific data center data quality assessment method according to claim 1, characterized in that: in step five, when thresholds are set for the feedback results, the thresholds for positive feedback and for negative feedback may each be divided into a correction threshold and a re-evaluation threshold.
8. The scientific data center data quality assessment method according to claim 1, characterized in that: in step seven, when the correction threshold is reached, the data evaluation is corrected upward or downward according to the feedback result; when the re-evaluation threshold is reached, that is, when the feedback result deviates greatly from the initial evaluation, the evaluation is recalculated with the feedback included in the weighting.
CN202310167738.6A 2023-02-27 2023-02-27 Scientific data center data quality assessment method Pending CN116303387A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310167738.6A CN116303387A (en) 2023-02-27 2023-02-27 Scientific data center data quality assessment method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310167738.6A CN116303387A (en) 2023-02-27 2023-02-27 Scientific data center data quality assessment method

Publications (1)

Publication Number Publication Date
CN116303387A (en) 2023-06-23

Family

ID=86837179

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310167738.6A Pending CN116303387A (en) 2023-02-27 2023-02-27 Scientific data center data quality assessment method

Country Status (1)

Country Link
CN (1) CN116303387A (en)

Legal Events

Date Code Title Description
PB01 Publication