CN110737640A - data quality improving method and system based on distributed system - Google Patents

Data quality improving method and system based on distributed system

Info

Publication number
CN110737640A
CN110737640A CN201910969243.9A CN201910969243A CN110737640A CN 110737640 A CN110737640 A CN 110737640A CN 201910969243 A CN201910969243 A CN 201910969243A CN 110737640 A CN110737640 A CN 110737640A
Authority
CN
China
Prior art keywords
data
cleaning
cleaned
data quality
rule
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910969243.9A
Other languages
Chinese (zh)
Inventor
孙涛
刘秀源
郭爱章
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qilu University of Technology
Original Assignee
Qilu University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qilu University of Technology
Priority to CN201910969243.9A
Publication of CN110737640A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides a data quality improving method and system based on a distributed system. The method comprises: acquiring data from each data source and loading the data into a distributed file system; preprocessing the loaded data in the distributed file system, where the preprocessing mainly comprises filling incomplete fields and merging duplicate records; cleaning the data to be cleaned; and, after the data cleaning is finished, evaluating data quality with a constructed data quality evaluation model.

Description

data quality improving method and system based on distributed system
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular to a method and system for improving data quality based on a distributed system.
Background
In recent years, with the continuous development of Industry 4.0, Internet of Things technology and enterprise internal management systems, massive amounts of data have been accumulated. These data have immeasurable value for the development of an enterprise. However, the quality of the data is often low, which affects the accuracy of data analysis results, and decisions based on problematic data are likely to cause losses to the enterprise.
Industrial big data falls into three categories: (1) business data of a manufacturing enterprise, including production data, sales data and the like; (2) data generated by machine equipment, such as the operation information and operating states of the equipment; and (3) data from outside the enterprise, such as customer data and after-sales service data. Because industrial big data covers such a wide range, its characteristics differ from those of data in other industries. Compared with internet data, industrial big data not only has the general characteristics of big data, but is also characterized by high dimensionality, low value density, low timeliness and a high degree of non-uniformity.
Disclosure of Invention
The invention aims to provide a data quality improving method based on a distributed system in which, through data cleaning, the problems of non-uniform data formats, duplicate data records and missing fields in the data are all effectively handled.
An implementation of the present specification provides a method for improving data quality based on a distributed system, which is implemented by the following technical solution.
The method comprises the following steps:
acquiring data of each data source and loading the data to a distributed file system;
the distributed file system carries out preprocessing on the loaded data, and the preprocessing process mainly comprises filling incomplete fields and combining repeated records;
carrying out data cleaning on data to be cleaned;
and after the data cleaning is finished, carrying out data quality evaluation through the constructed data quality evaluation model.
An implementation of the present specification provides a data quality improvement system based on a distributed system, which is implemented by the following technical solution.
The system comprises:
a data acquisition module configured to: acquiring data of each data source and loading the data to a distributed file system;
a data pre-processing module configured to: the distributed file system carries out preprocessing on the loaded data, and the preprocessing process mainly comprises filling incomplete fields and combining repeated records;
a data cleansing module configured to: carrying out data cleaning on data to be cleaned;
a data quality assessment module configured to: and after the data cleaning is finished, carrying out data quality evaluation through the constructed data quality evaluation model.
Compared with the prior art, the beneficial effects of this disclosure are:
through data cleaning, the problems caused by non-uniform data formats, duplicate data records and missing fields in the data are effectively handled; most erroneous data are cleaned by the data cleaning method, and the data quality is greatly improved.
Drawings
The accompanying drawings, which form a part hereof, are included to provide a further understanding of the disclosure. The exemplary embodiments and their description are intended to explain the disclosure and not to limit it.
FIG. 1 is a flow chart of a data cleansing scheme of an embodiment of the present disclosure;
FIG. 2 is a diagram illustrating a sliding window scanning the sorted data set according to an exemplary embodiment of the present disclosure;
FIG. 3 is a diagram of a data quality evaluation model according to an embodiment of the disclosure.
Detailed Description
It is noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless otherwise indicated, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.
Example I
This embodiment discloses a data quality improving method based on a distributed system, which comprises the following.
Various types of data, including structured data and unstructured data, are loaded into a distributed file system. The distributed file system can preprocess the loaded data, merging or aggregating it on demand in a way that is difficult for any single system to accomplish.
The preprocessing mainly comprises filling incomplete fields and merging duplicate records. Merging duplicate records mainly comprises creating a sort key, detecting duplicate records with a sliding window, and merging the duplicate records.
An empty field needs to be filled with a value, and incomplete fields are supplemented with the necessary information to make them complete. For example, for the identifier URL in the diary data resource, some records define only the parameter value and do not provide the complete URL address; the missing part needs to be filled in or corrected.
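The field-filling step can be illustrated with a minimal Python sketch. The base address `BASE_URL`, the `view?` path, the field names and the `"unknown"` default are all assumptions made for illustration; the patent does not specify them.

```python
from urllib.parse import urljoin

# Hypothetical base address; the patent does not name one.
BASE_URL = "https://example.org/journal/"


def fill_record(record, base_url=BASE_URL, default="unknown"):
    """Return a copy of the record with empty fields filled and the URL completed."""
    filled = dict(record)
    # Fill empty fields with a default value (assumed policy).
    for field, value in list(filled.items()):
        if value is None or str(value).strip() == "":
            filled[field] = default
    url = filled.get("url", default)
    if url != default and not url.startswith(("http://", "https://")):
        # Only a parameter value such as "id=42" was stored: prepend the base address.
        filled["url"] = urljoin(base_url, "view?" + url)
    return filled


# Invented sample records, for illustration only.
records = [
    {"title": "Issue A", "url": "id=42"},
    {"title": "", "url": "https://example.org/journal/view?id=7"},
]
completed = [fill_record(r) for r in records]
```

Records that already carry a full address are left untouched, so the step can be re-run safely on data that was partly completed earlier.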
For existing duplicate records, the preprocessing scheme mainly comprises the following steps:
First, create a sort key: extract a subset (sequence) of the record attributes, compute the corresponding key value for each record in the data to be processed, and sort the data by key value, so that potentially duplicate records are moved into adjacent positions and the matching of a given record is limited to a specific range around it.
Second, detect and merge duplicates: as shown in FIG. 2, a fixed-size window is slid over the sorted data set, and each record is compared only with the records in the window. Assuming the window can hold K records, every record that newly enters the window is compared with the K-1 records that entered before it, until the end of the record set is reached.
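The two steps above resemble what is commonly called the sorted-neighbourhood approach, and can be sketched in Python as follows. The choice of key fields (`issn`, `title`), the duplicate test and the merge policy are assumptions for illustration. As a simplification, the window here covers the last K-1 records kept in the output rather than the last K-1 raw records.

```python
def sort_key(record):
    """Build a key from a subset of attributes (an assumed choice of fields)."""
    return record["issn"].strip().replace("/", "-") + "|" + record["title"].strip().lower()


def is_duplicate(a, b):
    """Assumed duplicate test: identical normalised keys."""
    return sort_key(a) == sort_key(b)


def merge(a, b):
    """Merge two records, preferring non-empty values from the first."""
    keys = list(a) + [k for k in b if k not in a]
    return {k: a.get(k) or b.get(k) for k in keys}


def sorted_neighborhood_merge(records, k):
    """Sort by key, then compare each record only with the last k-1 kept records."""
    ordered = sorted(records, key=sort_key)
    kept = []
    for rec in ordered:
        start = max(0, len(kept) - (k - 1))
        for i in range(start, len(kept)):
            if is_duplicate(kept[i], rec):
                kept[i] = merge(kept[i], rec)  # fold the duplicate into the kept record
                break
        else:
            kept.append(rec)  # no duplicate found inside the window
    return kept


# Invented sample records, for illustration only.
sample = [
    {"issn": "1006-6401", "title": "Journal A", "publisher": ""},
    {"issn": "1003-0001", "title": "Journal B", "publisher": "P2"},
    {"issn": "1006/6401", "title": "journal a ", "publisher": "P1"},
]
deduplicated = sorted_neighborhood_merge(sample, k=3)
```

Because comparisons are limited to the window, the cost is roughly proportional to the number of records times K, instead of comparing every pair of records.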
After the data preprocessing is finished, the data to be cleaned is pulled from HDFS (Hadoop Distributed File System) for data cleaning:
First, a domain expert formulates the corresponding cleaning rules for the data to be cleaned. It is then judged whether the data to be cleaned satisfies the cleaning rules. If it does, the data is cleaned and the cleaned data is written into HDFS. If it does not, only the data satisfying the cleaning rules is cleaned, and the cleaning program ends after that cleaning; the data that does not satisfy the cleaning rules is analyzed and processed, and the cleaning rules are reformulated for it.
Specifically, a data cleansing engine is called to perform the data cleansing operation. The cleansing engine is the core module of the whole data cleansing process, and its main flow comprises two parts, data detection and data repair, where the repair of erroneous data in turn comprises locating the erroneous data and repairing it. Because data detection and data repair are interrelated, the data cleaning method links the two and executes them automatically until a correct repair result appears. It mainly comprises the following steps:
(1) To clean the data in HDFS, the parameters n and k are input to the data cleaning engine algorithm. Based on this input, the data cleaning engine determines n isolated points (outliers) to be cleaned, where isolated points are the small portion of objects that differ from the other data in the data set.
(2) For the n isolated points that have been determined, it is necessary to judge whether they satisfy the cleaning rules. If all n points satisfy the cleaning rules, the n isolated points are cleaned according to the rules and the cleaned data is written back into HDFS.
(3) If only r of the n points satisfy the cleaning rules and the other n-r points do not, only the r points satisfying the cleaning rules are cleaned, and the data cleaning program ends after they have been cleaned.
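Steps (1) to (3) can be sketched as follows. The patent does not define how n and k are used, so this sketch assumes a common distance-based reading: each point is scored by the distance to its k-th nearest neighbour and the n highest-scoring points are treated as isolated points. The cleaning rule and repair function are hypothetical stand-ins for rules written by a domain expert.

```python
def top_n_outliers(values, n, k):
    """Indices of the n points with the largest distance to their k-th nearest neighbour."""
    scores = []
    for i, v in enumerate(values):
        dists = sorted(abs(v - w) for j, w in enumerate(values) if j != i)
        scores.append((dists[k - 1], i))
    scores.sort(reverse=True)
    return [i for _, i in scores[:n]]


def clean_outliers(values, n, k, rule_applies, repair):
    """Repair the outliers covered by the rule; report the others for rule revision."""
    cleaned = list(values)
    unresolved = []
    for i in top_n_outliers(values, n, k):
        if rule_applies(values[i]):
            cleaned[i] = repair(values[i])
        else:
            unresolved.append(i)  # left for analysis and reformulated rules
    return cleaned, unresolved


# Hypothetical rule: readings above 100 were entered in the wrong unit (x100).
def wrong_unit(value):
    return value > 100


def fix_unit(value):
    return value / 100


readings = [10.0, 11.0, 9.5, 10.5, 1050.0, -3.0]  # invented sample data
cleaned, unresolved = clean_outliers(readings, n=2, k=2,
                                     rule_applies=wrong_unit, repair=fix_unit)
```

Here r = 1 of the n = 2 isolated points matches the rule and is repaired, while the remaining point is only reported, mirroring case (3).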
After the data cleaning is finished, the data quality needs to be evaluated. The evaluation is carried out by establishing a data quality evaluation model, and the process is as follows:
(1) Determining the application view
Before data quality evaluation is carried out, the evaluation requirements are first defined, that is, which data the user is interested in, and a corresponding user view is established over that data. For example, if the data quality of the gender and identification number columns of a user table needs to be evaluated, a corresponding view is generated.
(2) Selecting the evaluation indices of data quality
The corresponding data quality evaluation indices are selected according to the content under study.
(3) Formulating a rule set
According to the selected evaluation indices, a corresponding rule set is determined and the corresponding expected values and weights are established; for example, rule sets are formulated for the consistency and completeness of the data.
(4) Calculating a score according to the rules
A corresponding SQL statement is written for each rule.
The query result of the SQL statement is computed, and its percentage of the total number of records in the table gives the final result.
(5) According to the above content and the calculated results, an evaluation model of the data quality is established: M = <D, I, R, W, E, S>
D: the data set to be evaluated.
I: the indices selected for evaluating the data set, such as integrity, consistency and validity.
R: the evaluation rules selected for evaluating the data quality of the data set.
W: the weight of each evaluation rule in the overall data quality evaluation.
E: the expected value given in advance for the evaluation result of each data quality evaluation index.
S: the actually calculated evaluation result of each rule.
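Steps (1) and (4) can be illustrated with Python's built-in `sqlite3` module. The table, view and column names, the two rules and the sample rows are invented for illustration; an actual deployment would run equivalent SQL on its own query engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, gender TEXT, id_number TEXT)")
conn.executemany(
    "INSERT INTO users (gender, id_number) VALUES (?, ?)",
    [("F", "ID-0001"), ("M", None), ("X", "ID-0003"), ("", "ID-0004")],  # invented rows
)
# Step (1): a view restricted to the columns whose quality is evaluated.
conn.execute("CREATE VIEW user_quality AS SELECT gender, id_number FROM users")


def rule_score(connection, view, condition):
    """Percentage of rows in the view that satisfy one rule.

    `view` and `condition` come from the trusted rule set, not from user input.
    """
    total = connection.execute(f"SELECT COUNT(*) FROM {view}").fetchone()[0]
    passed = connection.execute(
        f"SELECT COUNT(*) FROM {view} WHERE {condition}"
    ).fetchone()[0]
    return 100.0 * passed / total if total else 0.0


# Step (4): one SQL condition per rule (hypothetical rules).
completeness = rule_score(conn, "user_quality", "id_number IS NOT NULL AND id_number <> ''")
validity = rule_score(conn, "user_quality", "gender IN ('F', 'M')")
```

Each score corresponds to one entry of S in the model, and can then be combined with the weights W and expected values E.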
The evaluation model of the data quality is shown in FIG. 3. After the data cleaning is finished, data quality evaluation is performed with the constructed data quality evaluation model, and the evaluation result is calculated according to the model. The formulas for data quality assessment include:
Absolute quantized value of data quality (SA):
[Formula image BDA0002231536840000071 is not reproduced in this text.]
where W represents the weight of each evaluation rule in the overall data quality evaluation, and S represents the actually calculated evaluation result of each rule.
Relative quantized value of data quality (SR):
[Formula image BDA0002231536840000072 is not reproduced in this text.]
where E represents the expected value given in advance for the evaluation result of each data quality evaluation index.
SA is the weighted average calculated over the data quality rules and reflects the actual data quality of the data set.
SR is obtained by subtracting the expected value from SA. The larger the value of SR, the better the data quality of the data set; conversely, the smaller the value of SR, the worse the data quality of the data set.
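The two formulas are available here only as image references, so the following Python sketch is one reading consistent with the surrounding description, not the patent's exact formulas. It assumes that SA is the weight-normalised average of the rule scores S, and that SR is SA minus the equally weighted average of the expected values E.

```python
def absolute_score(weights, scores):
    """Assumed SA: weighted average of the per-rule scores."""
    total_weight = sum(weights)
    return sum(w * s for w, s in zip(weights, scores)) / total_weight


def relative_score(weights, scores, expected):
    """Assumed SR: SA minus the weighted average of the expected values."""
    return absolute_score(weights, scores) - absolute_score(weights, expected)


# Invented example values for three rules.
W = [0.5, 0.3, 0.2]      # weights of the rules
S = [75.0, 50.0, 100.0]  # actually calculated scores (percent)
E = [90.0, 80.0, 95.0]   # expected scores (percent)

sa = absolute_score(W, S)
sr = relative_score(W, S, E)
```

Under this reading a negative SR means the data set falls short of expectations, and a larger SR means better data quality, which matches the interpretation given in the text.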
The data quality model can effectively evaluate the data quality condition, and improves the accuracy of data quality evaluation.
Example II
An implementation of the present specification provides a data quality improvement system based on a distributed system, which is implemented by the following technical solution.
The system comprises:
a data acquisition module configured to: acquiring data of each data source and loading the data to a distributed file system;
a data pre-processing module configured to: the distributed file system carries out preprocessing on the loaded data, and the preprocessing process mainly comprises filling incomplete fields and combining repeated records;
a data cleansing module configured to: carrying out data cleaning on data to be cleaned;
a data quality assessment module configured to: and after the data cleaning is finished, carrying out data quality evaluation through the constructed data quality evaluation model.
For the specific content of the modules in this embodiment, refer to the related implementation steps in Example I; they are not described in detail here.
Example III
Embodiments of the present description provide a distributed file system, including a server configured to receive data of data sources and pre-process the loaded data, where the pre-process mainly includes filling incomplete fields and merging duplicate records;
carrying out data cleaning on data to be cleaned;
and after the data cleaning is finished, carrying out data quality evaluation through the constructed data quality evaluation model.
For the specific steps of this embodiment, refer to the related implementation steps in Example I; they are not described in detail here.
Example IV
An embodiment of the specification provides a computer device, which comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor executes the program to implement the steps of the data quality improvement method based on the distributed system.
Example V
The present specification provides a computer-readable storage medium having stored thereon a computer program that, when executed by a processor, performs the steps of the distributed system-based data quality improvement method.
In the present embodiments, a computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for performing various aspects of the present disclosure. The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device.
Experimental examples
Experiment and analysis of results
The experiment takes different journal resource data as an example to verify the effect of the system on improving data quality.
Experimental configuration
The experiment used a host with 16 GB of memory and an Intel Core i7-2600 (3.4 GHz) CPU. The method was implemented in Java; each group of experiments was run 4 times, and the conclusions were obtained by analyzing the experimental results.
TABLE 1 Data before cleaning
[Table image BDA0002231536840000101 is not reproduced in this text.]
Through the analysis of the data source, the data of the data source has the following problems:
(1) Problems caused by non-uniform data formats
For example, the format of the journal number is not uniform: there are both xxx-yyy type data and xxx/yy type data. The issue number has the same problem: for the same second issue, some records use the serial number 2 and others use 02. The publication date also appears in two formats, aa-bb-cc and aa.bb.cc. Unifying the representation of data of the same type is therefore an important aspect of improving data quality.
(2) Problem of duplicate data records
For publication 1006-6401, there are two identical records representing the same data object, so the duplicate records need to be merged.
(3) Problem of missing fields in the data
Data records such as 1003-.
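The format problems described in item (1) can be handled with small normalisation functions such as the following Python sketch. The canonical forms chosen here (hyphen as the separator, issue numbers without leading zeros) are assumptions; the patent does not state which form the cleaned data uses.

```python
import re


def normalize_journal_number(value):
    """Use '-' as the only separator, so 'xxx/yyy' becomes 'xxx-yyy'."""
    return value.strip().replace("/", "-")


def normalize_issue(value):
    """Drop leading zeros so that '02' and '2' compare equal."""
    return str(int(value.strip()))


def normalize_date(value):
    """Rewrite 'aa.bb.cc' (or 'aa/bb/cc') as 'aa-bb-cc'."""
    return re.sub(r"[./]", "-", value.strip())


# Invented sample record, for illustration only.
raw = {"journal_number": "1006/6401", "issue": "02", "date": "2019.10.12"}
normalized = {
    "journal_number": normalize_journal_number(raw["journal_number"]),
    "issue": normalize_issue(raw["issue"]),
    "date": normalize_date(raw["date"]),
}
```

Normalising formats before duplicate detection also helps the sort key place records such as 1006-6401 and 1006/6401 next to each other.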
TABLE 2 Data after cleaning
[Table image BDA0002231536840000111 is not reproduced in this text.]
The result data is shown in Table 2. Through data cleaning, the problems caused by non-uniform data formats, duplicate data records and missing fields in the data are all effectively handled; most erroneous data are cleaned by the data cleaning method, and the data quality is greatly improved.
It is to be understood that, throughout this specification, references to the terms "one embodiment," "another embodiment," "other embodiments," or "first through nth embodiments," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention.
The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims (10)

1. A data quality improving method based on a distributed system, characterized by comprising:
acquiring data of each data source and loading the data to a distributed file system;
the distributed file system carries out preprocessing on the loaded data, and the preprocessing process mainly comprises filling incomplete fields and combining repeated records;
carrying out data cleaning on data to be cleaned;
and after the data cleaning is finished, carrying out data quality evaluation through the constructed data quality evaluation model.
2. The method for improving data quality based on a distributed system as claimed in claim 1, wherein the preprocessing of the loaded data by the distributed file system includes filling incomplete fields and merging duplicate records.
3. The method for improving data quality based on a distributed system as claimed in claim 2, wherein the process of merging duplicate records mainly comprises:
creating a sort key: extracting character strings or character string attribute values, calculating their key values for each data set, sorting each data set by key, and moving potentially duplicate records into adjacent positions so that the matching of a given record is limited to a specific range around it;
detecting duplicate records with a sliding window: sliding a fixed-size window over the sorted data set, each record in the data set being compared only with the records in the window;
merging the duplicate records.
4. The method for improving data quality based on distributed system as claimed in claim 1, wherein, the data to be cleaned is cleaned, it is determined whether the data to be cleaned satisfies the cleaning rule, if so, the data is cleaned and the cleaned data is written into the HDFS, if not, only the part of the data satisfying the cleaning rule is cleaned.
5. The method for improving data quality based on distributed system, as claimed in claim 4, wherein the data to be cleaned is cleaned by the method comprising:
cleaning the data of the HDFS, wherein the parameters n and k are input into a data cleaning engine algorithm, and the data cleaning engine determines n isolated points to be cleaned according to the input;
for the n determined isolated points, judging whether the isolated points satisfy the cleaning rule; if all n points satisfy the cleaning rule, cleaning the n isolated points according to the cleaning rule and writing the cleaned data back into the HDFS;
if only r points in the n points meet the cleaning rule and the n-r points do not meet the cleaning rule, only the r points meeting the cleaning rule need to be cleaned, and the data cleaning program is finished after the data is cleaned.
6. The method for improving data quality based on a distributed system as claimed in claim 4, wherein the data quality evaluation model is M = <D, I, R, W, E, S>
D: the data set to be evaluated;
I: the indices selected for evaluating the data set;
R: the evaluation rules selected for evaluating the data quality of the data set;
W: the weight of each evaluation rule in the overall data quality evaluation;
E: the expected value given in advance for the evaluation result of each data quality evaluation index;
S: the actually calculated evaluation result of each rule.
7. A data quality improving system based on a distributed system, characterized by comprising:
a data acquisition module configured to: acquiring data of each data source and loading the data to a distributed file system;
a data pre-processing module configured to: the distributed file system carries out preprocessing on the loaded data, and the preprocessing process mainly comprises filling incomplete fields and combining repeated records;
a data cleansing module configured to: carrying out data cleaning on data to be cleaned;
a data quality assessment module configured to: and after the data cleaning is finished, carrying out data quality evaluation through the constructed data quality evaluation model.
8. A distributed file system, characterized by comprising a server, wherein the server is configured to receive data from each data source and preprocess the loaded data, the preprocessing mainly comprising filling incomplete fields and merging duplicate records;
carrying out data cleaning on data to be cleaned;
and after the data cleaning is finished, carrying out data quality evaluation through the constructed data quality evaluation model.
9. A computer device, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the distributed system based data quality improvement method of any one of claims 1-6.
10. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the distributed system based data quality improvement method of any one of claims 1-6.
CN201910969243.9A 2019-10-12 2019-10-12 data quality improving method and system based on distributed system Pending CN110737640A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910969243.9A CN110737640A (en) 2019-10-12 2019-10-12 data quality improving method and system based on distributed system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910969243.9A CN110737640A (en) 2019-10-12 2019-10-12 data quality improving method and system based on distributed system

Publications (1)

Publication Number Publication Date
CN110737640A true CN110737640A (en) 2020-01-31

Family

ID=69268826

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910969243.9A Pending CN110737640A (en) 2019-10-12 2019-10-12 data quality improving method and system based on distributed system

Country Status (1)

Country Link
CN (1) CN110737640A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150088827A1 (en) * 2013-09-26 2015-03-26 Cygnus Broadband, Inc. File block placement in a distributed file system network
CN105138650A (en) * 2015-08-28 2015-12-09 成都康赛信息技术有限公司 Hadoop data cleaning method and system based on outlier mining
CN106651188A (en) * 2016-12-27 2017-05-10 贵州电网有限责任公司贵阳供电局 Electric transmission and transformation device multi-source state assessment data processing method and application thereof
CN107025301A (en) * 2017-04-25 2017-08-08 西安理工大学 Flight ensures the method for cleaning of data
CN107463532A (en) * 2017-06-28 2017-12-12 国网上海市电力公司 A kind of mass analysis method of electric power statistics
CN109254961A (en) * 2018-09-27 2019-01-22 广东电网有限责任公司信息中心 A kind of distribution multi engine data quality management system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
苏艳玲等: "《电子商务基础与实务 第2版》", 31 August 2012 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination