CN110737640A - data quality improving method and system based on distributed system - Google Patents
- Publication number
- CN110737640A
- Authority
- CN
- China
- Prior art keywords
- data
- cleaning
- cleaned
- data quality
- rule
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Quality & Reliability (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The disclosure provides a data quality improvement method and system based on a distributed system. The method comprises: obtaining data from each data source and loading it into a distributed file system; preprocessing the loaded data in the distributed file system, where preprocessing mainly comprises filling incomplete fields and merging duplicate records; cleaning the data to be cleaned; and, after data cleaning is finished, evaluating data quality with a constructed data quality evaluation model.
Description
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular to a method and system for improving data quality based on a distributed system.
Background
In recent years, with the continuous development of Industry 4.0, Internet of Things technology, and enterprise internal management systems, massive amounts of data have accumulated. These data are of great value to the development of an enterprise. However, the quality of the data is often low, which affects the accuracy of data analysis results, and decisions based on problematic data are likely to cause losses to the enterprise.
Industrial big data falls into three categories: (1) business data related to a manufacturing enterprise, including production data, sales data, and the like; (2) data generated by machines and equipment, such as their operation information and operating states; and (3) data from outside the enterprise, such as customer data and after-sales service data. Because industrial big data covers such a wide range, its characteristics differ from those of data in other industries. Compared with internet data, it not only has the general characteristics of big data but is also characterized by high dimensionality, low value density, low timeliness, and a high degree of non-uniformity.
Disclosure of Invention
The invention aims to provide a data quality improvement method based on a distributed system in which data cleaning effectively handles the problems of non-uniform data formats, duplicate data records, and missing fields.
Embodiments of this specification provide a data quality improvement method based on a distributed system, implemented by the following technical solution.
The method comprises the following steps:
acquiring data of each data source and loading the data to a distributed file system;
the distributed file system carries out preprocessing on the loaded data, and the preprocessing process mainly comprises filling incomplete fields and combining repeated records;
carrying out data cleaning on data to be cleaned;
and after the data cleaning is finished, carrying out data quality evaluation through the constructed data quality evaluation model.
Embodiments of this specification provide a data quality improvement system based on a distributed system, implemented by the following technical solution.
The system comprises:
a data acquisition module configured to: acquiring data of each data source and loading the data to a distributed file system;
a data pre-processing module configured to: the distributed file system carries out preprocessing on the loaded data, and the preprocessing process mainly comprises filling incomplete fields and combining repeated records;
a data cleansing module configured to: carrying out data cleaning on data to be cleaned;
a data quality assessment module configured to: and after the data cleaning is finished, carrying out data quality evaluation through the constructed data quality evaluation model.
Compared with the prior art, the beneficial effects of this disclosure are as follows:
Through data cleaning, the problems caused by non-uniform data formats, duplicate data records, and missing fields are all handled effectively. Most erroneous data is cleaned by the data cleaning method, and data quality is greatly improved.
Drawings
The accompanying drawings, which form a part of this disclosure, are included to provide a further understanding of the disclosure; the exemplary embodiments and their description serve to explain the disclosure and do not limit it.
FIG. 1 is a flow chart of a data cleansing scheme of an embodiment of the present disclosure;
FIG. 2 is a diagram illustrating a sliding window scanning the sorted data set according to an exemplary embodiment of the present disclosure;
FIG. 3 is a diagram of a data quality evaluation model according to an embodiment of the present disclosure.
Detailed Description
It is noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless otherwise indicated, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit example embodiments according to the present disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
Example I
This embodiment discloses a data quality improvement method based on a distributed system, comprising the following.
Various types of data, including structured and unstructured data, are loaded into a distributed file system. The distributed file system can preprocess the loaded data and merge or aggregate it on demand in a manner that is difficult for any single system to accomplish.
The preprocessing process mainly comprises filling incomplete fields and merging duplicate records. Merging duplicate records in turn mainly comprises creating a sort key, detecting duplicate records with a sliding window, and merging the duplicates.
An empty field needs to be filled with a value, and the necessary information is added to incomplete fields so that they become complete. For example, for the URL identifier in the diary data resource, some records define only the parameter value and do not provide the complete URL address; the missing part needs to be filled in or corrected.
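The URL case described above can be illustrated with a minimal sketch. The base address `BASE_URL`, the parameter name `id`, and the record layout are all hypothetical; the source does not specify how the missing part of the address is determined.

```python
from urllib.parse import urlencode, urlparse

# Hypothetical base address; the source does not name the real one.
BASE_URL = "https://example.org/resource"


def complete_url(value, base_url=BASE_URL, param="id"):
    """Return a full URL for a field that may hold only a parameter value."""
    if value is None or not value.strip():
        return None  # nothing to rebuild from; leave for manual repair
    value = value.strip()
    parsed = urlparse(value)
    if parsed.scheme and parsed.netloc:
        return value  # already a complete URL
    # Only the parameter value was recorded: prepend the assumed base address.
    return f"{base_url}?{urlencode({param: value})}"


# Made-up sample records for illustration only.
records = [
    {"title": "A", "url": "https://example.org/resource?id=17"},
    {"title": "B", "url": "42"},
    {"title": "C", "url": ""},
]
filled = [dict(r, url=complete_url(r["url"])) for r in records]
```

Records whose field is entirely empty are left as `None` here so that they can be handled by a separate filling rule rather than guessed.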
For existing duplicate records, the preprocessing scheme mainly comprises the following steps:
First, create the sort key: a subset of the record attributes is extracted, and a key value is computed for each record in the data to be processed. The data is then sorted by key value so that records that may be duplicates are moved next to each other, which confines the candidate matches for a given record to a specific range around it.
Second, merge: as shown in FIG. 2, a fixed-size window is slid over the sorted data set, and each record is compared only with the records in the window. Assuming the window holds K records, each record that newly enters the window is compared with the K-1 records already in the window, and this continues until the end of the record set is reached.
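The two steps above can be sketched as follows. This is an illustrative sketch, not the patent's implementation: the key fields (`issn`, `title`), the equality-based duplicate test, and the merge policy are assumptions, and the second sample record number is made up.

```python
def sort_key(record, fields=("issn", "title")):
    """Build a sort key from a subset of attributes (field choice is assumed)."""
    return "|".join(str(record.get(f, "")).strip().lower() for f in fields)


def is_duplicate(a, b):
    """Hypothetical match test: the normalised keys are equal."""
    return sort_key(a) == sort_key(b)


def merge(a, b):
    """Keep a's values and fill its empty fields from b."""
    merged = dict(a)
    for name, value in b.items():
        if merged.get(name) in (None, ""):
            merged[name] = value
    return merged


def sorted_neighborhood(records, window=3):
    """Sort by key, then compare each record with at most window-1 predecessors."""
    ordered = sorted(records, key=sort_key)
    result = []  # records kept so far, in sorted order
    for rec in ordered:
        start = max(0, len(result) - (window - 1))
        for i in range(start, len(result)):
            if is_duplicate(result[i], rec):
                result[i] = merge(result[i], rec)
                break
        else:
            result.append(rec)  # no match inside the window
    return result


records = [
    {"issn": "1006-6401", "title": "Journal A", "publisher": ""},
    {"issn": "0000-0001", "title": "Journal B", "publisher": "P2"},
    {"issn": "1006-6401", "title": "journal a", "publisher": "P1"},
]
deduped = sorted_neighborhood(records, window=3)
```

In this sketch each incoming record is compared with the most recently kept records rather than with the raw window contents, so already merged duplicates do not occupy window slots. With a window of K records, each record takes part in at most K-1 comparisons, so the comparison cost grows linearly with the number of records once they are sorted.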
After data preprocessing is finished, the data to be cleaned is pulled from HDFS for data cleaning:
First, a domain expert formulates the cleaning rules for the data to be cleaned. It is then determined whether the data to be cleaned satisfies the cleaning rules. If it does, the data is cleaned and the cleaned data is written into HDFS (Hadoop Distributed File System). If it does not, only the data that satisfies the cleaning rules is cleaned, and the cleaning program ends once that cleaning is done. The data that does not satisfy the cleaning rules is then analyzed and processed, and the cleaning rules are reformulated for a further round of cleaning.
Specifically, a data cleansing engine is called to perform the data cleaning operation. The cleansing engine is the core module of the whole data cleaning process, which consists of two parts: data detection and data repair. Repairing erroneous data in turn comprises locating the erroneous data and repairing it. Because data detection and data repair are interrelated, the data cleaning method links them so that both are executed automatically until a correct repair result is obtained. The process mainly comprises the following steps:
(1) To clean the data in HDFS, the parameters n and k are input to the data cleaning engine algorithm. Based on this input, the data cleaning engine determines the n isolated points to be cleaned, where isolated points (outliers) are the small number of objects in the data set that differ from the rest of the data.
(2) For the n isolated points that have been determined, it is necessary to determine whether they satisfy the cleaning rule. If all n points satisfy the cleaning rule, the n isolated points are cleaned according to the rule and the cleaned data is written back to HDFS.
(3) If only r of the n points satisfy the cleaning rule and the other n-r points do not, only the r points that satisfy the rule need to be cleaned, and the data cleaning program ends once they have been cleaned.
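The three steps above can be sketched as follows. The source does not define how n and k are used, so this sketch assumes a common reading: k is a number of nearest neighbours and the n points with the largest distance to their k-th nearest neighbour are treated as isolated points. The rule and repair functions are hypothetical stand-ins for expert-defined cleaning rules, and the values are made up.

```python
def knn_distance(values, i, k):
    """Distance from values[i] to its k-th nearest neighbour (1-D, brute force)."""
    dists = sorted(abs(values[i] - v) for j, v in enumerate(values) if j != i)
    return dists[k - 1]


def top_n_outliers(values, n, k):
    """Indices of the n points with the largest k-th nearest neighbour distance."""
    ranked = sorted(
        range(len(values)),
        key=lambda i: knn_distance(values, i, k),
        reverse=True,
    )
    return ranked[:n]


def clean_outliers(values, n, k, rule_applies, repair):
    """Repair the outliers covered by the rule; report the ones that are not."""
    cleaned = list(values)
    unresolved = []
    for i in top_n_outliers(values, n, k):
        if rule_applies(values[i]):
            cleaned[i] = repair(values[i])  # one of the r points
        else:
            unresolved.append(i)  # one of the n - r points left for rule revision
    return cleaned, unresolved


values = [10.0, 11.0, 9.5, 10.5, 1000.0, -50.0, 10.2]
cleaned, unresolved = clean_outliers(
    values,
    n=2,
    k=2,
    rule_applies=lambda v: v > 100,  # hypothetical rule: value scaled by 100
    repair=lambda v: v / 100,
)
```

The indices returned in `unresolved` correspond to the points that the source says are analyzed separately before the cleaning rules are reformulated.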
After data cleaning is finished, the data quality needs to be evaluated. The evaluation is carried out by establishing a data quality evaluation model, and the process is as follows:
(1) Determining the application view
Before data quality evaluation is carried out, the evaluation requirements are first defined: the data the user is interested in is identified, and a corresponding user view is built over that data. For example, if the data quality of the gender and identification number fields of a user table needs to be evaluated, a corresponding view is generated.
(2) Selecting the evaluation indices of data quality
The data quality evaluation indices are selected according to the content under study.
(3) Formulating a rule set
According to the selected evaluation indices, a corresponding rule set is determined, and the corresponding expected values and weights are established. Rule sets are formulated for the consistency and completeness of the data.
(4) Calculating a score according to the rules
A corresponding SQL statement is written for each rule.
The SQL statement is executed, and its query result is expressed as a percentage of the total number of rows in the table to obtain the final result.
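Step (4) can be sketched with an in-memory SQLite table. The table name, columns, sample rows, and the two rules are made up for illustration, following the gender and identification number example in step (1); the source does not give the actual SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, gender TEXT, id_number TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [
        (1, "F", "0" * 17 + "1"),  # placeholder 18-character identifiers
        (2, "M", None),            # missing identification number
        (3, "X", "0" * 17 + "3"),  # invalid gender code
        (4, "M", "0" * 17 + "4"),
    ],
)

# Hypothetical rule set: each rule is a WHERE clause selecting rows that pass.
rules = {
    "gender_valid": "gender IN ('F', 'M')",
    "id_number_present": "id_number IS NOT NULL AND length(id_number) = 18",
}


def rule_score(conn, table, condition):
    """Percentage of rows in the table that satisfy the rule condition."""
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    passed = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {condition}"
    ).fetchone()[0]
    return 100.0 * passed / total if total else 0.0


scores = {name: rule_score(conn, "users", cond) for name, cond in rules.items()}
```

The table name and conditions are interpolated directly into the SQL text, which is acceptable only because they come from a trusted, expert-maintained rule set; user-supplied values should be passed as bound parameters instead.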
(5) Based on the above content and the calculated results, an evaluation model of the data quality is established: M = <D, I, R, W, E, S>
D: the data set to be evaluated.
I: the indices selected for evaluating the data set, such as integrity, consistency, and validity.
R: the evaluation rules used to evaluate the data quality of the data set.
W: the proportion (weight) of each evaluation rule in the overall data quality evaluation.
E: the expected value given in advance for the evaluation result of each data quality evaluation index.
S: the evaluation result actually calculated for each rule.
The evaluation model of the data quality is shown in FIG. 3. After data cleaning is finished, data quality evaluation is performed with the constructed data quality evaluation model, and the data quality evaluation result is calculated according to the model. The evaluation uses two quantities:
Absolute quantized value of data quality, SA: W represents the proportion (weight) of each evaluation rule in the overall data quality evaluation, and S represents the evaluation result actually calculated for each rule.
Relative quantized value of data quality, SR: E denotes the expected value given in advance for the evaluation result of each data quality evaluation index.
SA is a weighted average calculated over the data quality rules and reflects the actual data quality of the data set.
SR is obtained as the difference between SA and the expected value. The larger the value of SR, the better the data quality of the data set; conversely, the smaller the value of SR, the worse the data quality.
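The formulas themselves are not present in this text. One form consistent with the description (SA as a weighted average of the rule scores, SR as its difference from the expected value, with larger SR meaning better quality) is sketched below. The index i over the m rules, the normalisation by the sum of weights, the aggregated expectation E_A, and the sign of S_R are all assumptions rather than the patent's stated formulas.

```latex
% Assumed reconstruction: m rules with weights W_i, scores S_i, expectations E_i
S_A = \frac{\sum_{i=1}^{m} W_i \, S_i}{\sum_{i=1}^{m} W_i},
\qquad
E_A = \frac{\sum_{i=1}^{m} W_i \, E_i}{\sum_{i=1}^{m} W_i},
\qquad
S_R = S_A - E_A
```

Under this reading, two rules with equal weights, scores of 75 and 90, and an expected value of 80 for each would give S_A = 82.5 and S_R = 2.5, meaning the data set slightly exceeds expectations.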
The data quality model can effectively evaluate the data quality condition, and improves the accuracy of data quality evaluation.
Example II
Embodiments of this specification provide a data quality improvement system based on a distributed system, implemented by the following technical solution.
The system comprises:
a data acquisition module configured to: acquiring data of each data source and loading the data to a distributed file system;
a data pre-processing module configured to: the distributed file system carries out preprocessing on the loaded data, and the preprocessing process mainly comprises filling incomplete fields and combining repeated records;
a data cleansing module configured to: carrying out data cleaning on data to be cleaned;
a data quality assessment module configured to: and after the data cleaning is finished, carrying out data quality evaluation through the constructed data quality evaluation model.
For the specific content of the relevant modules in this embodiment, refer to the corresponding implementation steps in Example I; they are not described in detail here.
Example III
Embodiments of the present description provide a distributed file system, including a server configured to receive data of data sources and pre-process the loaded data, where the pre-process mainly includes filling incomplete fields and merging duplicate records;
carrying out data cleaning on data to be cleaned;
and after the data cleaning is finished, carrying out data quality evaluation through the constructed data quality evaluation model.
For the specific steps of this embodiment, refer to the corresponding implementation steps in Example I; they are not described in detail here.
Example IV
This embodiment of the specification provides a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the steps of the data quality improvement method based on the distributed system.
Example V
This embodiment of the specification provides a computer-readable storage medium having stored thereon a computer program that, when executed by a processor, performs the steps of the data quality improvement method based on the distributed system.
In the present embodiments, a computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for performing various aspects of the present disclosure. The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device.
Experimental examples
Experiment and analysis of results
The experiment takes different journal resource data as an example to verify the effect of improving the data quality of the system.
Experimental configuration
This experiment used a host with 16 GB of memory and an Intel Core i7-2600 (3.4 GHz) CPU. The method was implemented in Java. Each group of experiments was run 4 times, and the conclusions were drawn by analyzing the experimental results.
Table 1: Data before cleaning
Analysis of the data source shows that its data has the following problems:
(1) Problems caused by non-uniform data formats
For example, journal numbers are recorded in non-uniform formats, with both xxx-yyy and xxx/yy forms present. Issue numbers have the same problem: for the same second issue, some records use the serial number 2 and others use 02. Publication dates likewise appear in two formats, aa-bb-cc and aa.bb.cc. Unifying the representation of data of the same type is therefore an important aspect of improving data quality.
(2) Problem of data record duplication
For publication 1006-6401, there are two identical records representing the same data object, so the duplicate records need to be merged.
(3) Problem of field missing in data
Data records such as 1003-.
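The format problems listed under (1) can be handled by normalising each field to a single representation. The sketch below is illustrative only: which form is treated as canonical (hyphen separators, issue numbers without leading zeros) is an assumption, since the source does not say which format the cleaned data uses.

```python
def normalize_journal_number(value):
    """Use '-' as the only separator, so 'xxx/yy' becomes 'xxx-yy'."""
    return value.strip().replace("/", "-")


def normalize_issue(value):
    """Drop leading zeros so that '02' and '2' compare equal."""
    return str(int(value.strip()))


def normalize_date(value):
    """Use '-' between date parts, so 'aa.bb.cc' becomes 'aa-bb-cc'."""
    return value.strip().replace(".", "-")


# Made-up sample record for illustration only.
record = {"number": "1006/6401", "issue": "02", "date": "19.10.12"}
normalized = {
    "number": normalize_journal_number(record["number"]),
    "issue": normalize_issue(record["issue"]),
    "date": normalize_date(record["date"]),
}
```

Normalising formats before duplicate detection also helps the sort key group records that differ only in notation.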
Table 2: Data after cleaning
The resulting data is shown in Table 2. Through data cleaning, the problems caused by non-uniform data formats, duplicate data records, and missing fields are all handled effectively. Most erroneous data is cleaned by the data cleaning method, and data quality is greatly improved.
It is to be understood that, throughout this specification, references to the terms "one embodiment," "another embodiment," "other embodiments," or "first through nth embodiments," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention.
The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.
Claims (10)
1. A data quality improvement method based on a distributed system, characterized by comprising:
acquiring data of each data source and loading the data to a distributed file system;
the distributed file system carries out preprocessing on the loaded data, and the preprocessing process mainly comprises filling incomplete fields and combining repeated records;
carrying out data cleaning on data to be cleaned;
and after the data cleaning is finished, carrying out data quality evaluation through the constructed data quality evaluation model.
2. The method of data quality enhancement over a distributed system as claimed in claim 1 wherein the preprocessing of the loaded data by the distributed file system includes padding incomplete fields and merging duplicate records.
3. The method for improving data quality based on distributed system as claimed in claim 2, wherein the process of merging duplicate records mainly comprises:
creating a sort key: extracting character strings or character string attribute values, calculating their key values for each data set, sorting each data set by key, and moving possibly duplicate records into adjacent positions so that candidate matches for a given record are confined to a specific range around that record;
detecting duplicate records with a sliding window: sliding a fixed-size window over the sorted data set, each record in the data set being compared only with the records in the window;
merging the duplicate records.
4. The method for improving data quality based on a distributed system as claimed in claim 1, wherein cleaning the data to be cleaned comprises: determining whether the data to be cleaned satisfies the cleaning rule; if so, cleaning the data and writing the cleaned data into HDFS; if not, cleaning only the part of the data that satisfies the cleaning rule.
5. The method for improving data quality based on a distributed system as claimed in claim 4, wherein cleaning the data to be cleaned comprises:
to clean the data in HDFS, inputting the parameters n and k to a data cleaning engine algorithm, the data cleaning engine determining n isolated points to be cleaned according to the input;
for the n determined isolated points, determining whether the isolated points satisfy the cleaning rule; if all n points satisfy the cleaning rule, cleaning the n isolated points according to the cleaning rule and writing the cleaned data back to HDFS;
if only r of the n points satisfy the cleaning rule and the other n-r points do not, cleaning only the r points that satisfy the cleaning rule, the data cleaning program ending after they are cleaned.
6. The method for improving data quality based on a distributed system as claimed in claim 4, wherein the data quality evaluation model is M = <D, I, R, W, E, S>, where:
D: the data set to be evaluated;
I: the indices selected for evaluating the data set;
R: the evaluation rules selected for evaluating the data quality of the data set;
W: the proportion of each evaluation rule in the overall data quality evaluation;
E: the expected value given in advance for the evaluation result of each data quality evaluation index;
S: the evaluation result actually calculated for each rule.
7. A data quality improvement system based on a distributed system, characterized by comprising:
a data acquisition module configured to: acquiring data of each data source and loading the data to a distributed file system;
a data pre-processing module configured to: the distributed file system carries out preprocessing on the loaded data, and the preprocessing process mainly comprises filling incomplete fields and combining repeated records;
a data cleansing module configured to: carrying out data cleaning on data to be cleaned;
a data quality assessment module configured to: and after the data cleaning is finished, carrying out data quality evaluation through the constructed data quality evaluation model.
8. A distributed file system, characterized by comprising a server, wherein the server is configured to receive data from each data source and preprocess the loaded data, the preprocessing mainly comprising filling incomplete fields and merging duplicate records;
carrying out data cleaning on data to be cleaned;
and after the data cleaning is finished, carrying out data quality evaluation through the constructed data quality evaluation model.
9. A computer device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the distributed system-based data quality improvement method of any one of claims 1-6.
10. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the distributed system-based data quality improvement method of any one of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910969243.9A CN110737640A (en) | 2019-10-12 | 2019-10-12 | data quality improving method and system based on distributed system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910969243.9A CN110737640A (en) | 2019-10-12 | 2019-10-12 | data quality improving method and system based on distributed system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110737640A true CN110737640A (en) | 2020-01-31 |
Family
ID=69268826
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910969243.9A Pending CN110737640A (en) | 2019-10-12 | 2019-10-12 | data quality improving method and system based on distributed system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110737640A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150088827A1 (en) * | 2013-09-26 | 2015-03-26 | Cygnus Broadband, Inc. | File block placement in a distributed file system network |
CN105138650A (en) * | 2015-08-28 | 2015-12-09 | 成都康赛信息技术有限公司 | Hadoop data cleaning method and system based on outlier mining |
CN106651188A (en) * | 2016-12-27 | 2017-05-10 | 贵州电网有限责任公司贵阳供电局 | Electric transmission and transformation device multi-source state assessment data processing method and application thereof |
CN107025301A (en) * | 2017-04-25 | 2017-08-08 | 西安理工大学 | Flight ensures the method for cleaning of data |
CN107463532A (en) * | 2017-06-28 | 2017-12-12 | 国网上海市电力公司 | A kind of mass analysis method of electric power statistics |
CN109254961A (en) * | 2018-09-27 | 2019-01-22 | 广东电网有限责任公司信息中心 | A kind of distribution multi engine data quality management system |
Non-Patent Citations (1)
Title |
---|
苏艳玲等: "《电子商务基础与实务 第2版》", 31 August 2012 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||