CN110737640A

CN110737640A - Data quality improvement method and system based on a distributed system

Info

Publication number: CN110737640A
Application number: CN201910969243.9A
Authority: CN
Inventors: 孙涛; 刘秀源; 郭爱章
Original assignee: Qilu University of Technology
Current assignee: Qilu University of Technology
Priority date: 2019-10-12
Filing date: 2019-10-12
Publication date: 2020-01-31

Abstract

The disclosure provides a data quality improvement method and system based on a distributed system. The method comprises: acquiring data from each data source and loading it into a distributed file system; preprocessing the loaded data in the distributed file system, where the preprocessing mainly comprises filling incomplete fields and merging duplicate records; cleaning the data to be cleaned; and, after the data cleaning is finished, evaluating data quality with a constructed data quality evaluation model.

Description

Data quality improvement method and system based on a distributed system

Technical Field

The present disclosure relates to the field of data processing technologies, and in particular to a method and system for improving data quality based on a distributed system.

Background

In recent years, with the continuous development of Industry 4.0, Internet of Things technology and enterprise internal management systems, massive amounts of data have been accumulated. These data are of immeasurable value for the development of an enterprise. However, the quality of the data is often low, which reduces the accuracy of data analysis results, and decisions based on problematic data are likely to cause losses to the enterprise.

Industrial big data falls into three categories: ⑴ business data related to the manufacturing enterprise, including production data, sales data and the like; ⑵ data generated by machines and devices, such as their operation information and operating states; and ⑶ data from outside the enterprise, such as customer data and after-sales service data. Because industrial big data covers such a wide range, its characteristics differ from those of data in other industries. Compared with internet data, industrial big data not only shares the general characteristics of big data but is also characterized by high dimensionality, low value density, low timeliness, a high degree of non-uniformity, and the like.

Disclosure of Invention

The invention aims to provide a data quality improvement method based on a distributed system in which data cleaning effectively handles the problems of non-uniform data formats, duplicate data records, and missing fields in the data.

An implementation of the present specification provides a method for improving data quality based on a distributed system, which is implemented by the following technical solution:

the method comprises the following steps:

acquiring data from each data source and loading the data into a distributed file system;

preprocessing the loaded data in the distributed file system, where the preprocessing mainly comprises filling incomplete fields and merging duplicate records;

cleaning the data to be cleaned;

and, after the data cleaning is finished, evaluating data quality with the constructed data quality evaluation model.

An implementation of the present specification provides a data quality improvement system based on a distributed system, which is implemented by the following technical solution:

the system comprises:

a data acquisition module configured to acquire data from each data source and load the data into a distributed file system;

a data preprocessing module configured to preprocess the loaded data in the distributed file system, where the preprocessing mainly comprises filling incomplete fields and merging duplicate records;

a data cleaning module configured to clean the data to be cleaned;

a data quality assessment module configured to evaluate data quality with the constructed data quality evaluation model after the data cleaning is finished.

Compared with the prior art, the beneficial effect of this disclosure is:

Through data cleaning, the problems caused by non-uniform data formats, duplicate data records and missing fields in the data are effectively handled; most erroneous data is cleaned, and the data quality is greatly improved.

Drawings

The accompanying drawings, which form a part hereof, are included to provide a further understanding of the disclosure. The exemplary embodiments and their description serve to explain the disclosure and do not limit it.

FIG. 1 is a flow chart of a data cleansing scheme of an embodiment of the present disclosure;

FIG. 2 is a diagram illustrating a sliding window scanning a sorted data set according to an exemplary embodiment of the present disclosure;

FIG. 3 is a diagram of a data quality evaluation model according to an embodiment of the disclosure.

Detailed Description

It is noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless otherwise indicated, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the example embodiments according to the present disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should further be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.

Example I

This embodiment discloses a data quality improvement method based on a distributed system, which comprises the following.

Various types of data, including structured and unstructured data, are loaded into a distributed file system, which can preprocess the loaded data and merge or aggregate it on demand in a way that is difficult for any single system to accomplish.

The preprocessing mainly comprises filling incomplete fields and merging duplicate records. Merging duplicate records in turn comprises creating a sort key, detecting duplicate records with a sliding window, and merging the duplicates.

An empty field must be filled with a value, and the necessary information is added to incomplete fields so that they become complete. For example, for the identifier URL in a journal data resource, some records define only the parameter value and do not provide the complete URL address; the missing part needs to be filled in or corrected.
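The field-filling step can be sketched as follows. This is a minimal illustration under stated assumptions: the base address `BASE_URL`, the query parameter name `id`, the record field names and the empty-string default are all hypothetical, since the source does not specify them.

```python
from urllib.parse import urlencode

# Hypothetical base address; the source does not give the real one.
BASE_URL = "https://example.org/journal"


def fill_record(record, base_url=BASE_URL, default=""):
    """Return a copy of the record with empty fields filled and the URL completed."""
    filled = dict(record)
    # Replace missing values with a default so that no field is left empty.
    for key, value in filled.items():
        if value is None:
            filled[key] = default
    url = filled.get("url", "")
    # A value without a scheme is treated as a bare parameter value
    # and expanded into a full address (assumed convention).
    if url and not url.startswith(("http://", "https://")):
        filled["url"] = base_url + "?" + urlencode({"id": url})
    return filled


example = fill_record({"title": "A", "url": "12345", "issn": None})
```

A record whose URL is already complete passes through unchanged; only bare parameter values are expanded.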

For existing duplicate records, the preprocessing scheme mainly comprises the following steps:

First, create a sort key. A subset of the record attributes is extracted, and a key value is computed from it for each record in the data to be processed. The data is then sorted by key value, which moves potentially duplicate records next to each other, so that the candidate matches for a given record are confined to a limited range around that record.

Second, detect and merge. As shown in FIG. 2, a fixed-size window is slid over the sorted data set, and each record is compared only with the records in the window. Assuming the window can hold K records, each record that newly enters the window is compared with the K-1 records already in the window, and this continues until the end of the record set. Records found to be duplicates are merged.
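The sort-key and sliding-window steps can be sketched as below. The key fields (`issn`, `title`), the equality test on keys and the merge policy (keep the first non-empty value of each field) are assumptions made for illustration; the source does not fix them.

```python
def make_key(record, fields=("issn", "title")):
    """Build a sort key from a subset of attributes (the field choice is assumed)."""
    return "|".join(str(record.get(f, "")).strip().lower() for f in fields)


def merge_duplicates(records, window=3):
    """Sort by key, compare each record with at most window-1 predecessors, merge matches."""
    ordered = sorted(records, key=make_key)
    merged = []
    for rec in ordered:
        # The records already in the window (at most window-1 of them).
        candidates = merged[-(window - 1):] if window > 1 else []
        target = None
        for prev in candidates:
            if make_key(prev) == make_key(rec):
                target = prev
                break
        if target is None:
            merged.append(dict(rec))
        else:
            # Merge policy (assumed): fill fields that are still empty in the kept record.
            for field, value in rec.items():
                if not target.get(field):
                    target[field] = value
    return merged


rows = [
    {"issn": "1006-6401", "title": "Journal A", "publisher": ""},
    {"issn": "0000-0001", "title": "Journal B", "publisher": "P"},
    {"issn": "1006-6401", "title": "journal a ", "publisher": "Q"},
]
result = merge_duplicates(rows, window=3)
```

Because sorting places likely duplicates side by side, each record needs only K-1 comparisons rather than a comparison with every other record. The trade-off is that duplicates whose keys sort far apart are missed, so the choice of key fields matters.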

After the data preprocessing is finished, the data to be cleaned is pulled from HDFS for data cleaning:

First, a domain expert formulates cleaning rules for the data to be cleaned. It is then determined whether the data to be cleaned satisfies the cleaning rules. If so, the data is cleaned and the cleaned data is written to HDFS (Hadoop Distributed File System). If not, only the data that satisfies the cleaning rules is cleaned, and the cleaning program ends once that cleaning is complete; the data that does not satisfy the cleaning rules is then analyzed and processed, and the cleaning rules are reformulated for further cleaning.

Specifically, a data cleaning engine is called to perform the cleaning operation. The cleaning engine is the core module of the whole data cleaning process, and its main flow consists of two parts, data detection and data repair, where repairing erroneous data in turn consists of locating the erroneous data and repairing it. Because data detection and data repair are interrelated, the cleaning method links the two and executes them automatically until a correct repair result is obtained. It mainly comprises the following steps:

(1) To clean the data in HDFS, the parameters n and k are input to the data cleaning engine algorithm. Based on this input, the engine determines the n isolated points (outliers) to be cleaned, where isolated points are the small portion of objects in the data set that differ from the rest of the data.

(2) For the n isolated points that have been determined, it is necessary to judge whether they satisfy the cleaning rule. If all n points satisfy the cleaning rule, the n isolated points are cleaned according to the rule and the cleaned data is written back to HDFS.

(3) If only r of the n points satisfy the cleaning rule and the other n-r points do not, only the r points that satisfy the rule are cleaned, and the data cleaning program ends once they have been cleaned.
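The detect-then-repair flow of steps (1) to (3) can be sketched as below. The source does not say how n and k are used, so this sketch assumes a common distance-based reading: each value is scored by its distance to its k-th nearest neighbor, and the n highest-scoring values are taken as isolated points. The one-dimensional data, the rule predicate and the repair function are likewise hypothetical.

```python
def top_n_outliers(values, n, k):
    """Indices of the n values farthest from their k-th nearest neighbor (assumed method)."""
    scores = []
    for i, v in enumerate(values):
        dists = sorted(abs(v - w) for j, w in enumerate(values) if j != i)
        scores.append((dists[k - 1], i))
    scores.sort(reverse=True)
    return [i for _, i in scores[:n]]


def clean(values, n, k, rule_applies, repair):
    """Repair the outliers a rule covers; return the cleaned data and the deferred indices."""
    cleaned = list(values)
    deferred = []
    for i in top_n_outliers(values, n, k):
        if rule_applies(values[i]):
            cleaned[i] = repair(values[i])  # one of the r points covered by the rule
        else:
            deferred.append(i)              # one of the n-r points left for rule revision
    return cleaned, deferred


# Hypothetical rule: readings of 1000 or more were recorded in the wrong unit.
data = [10, 11, 12, 11, 10, 1100, -5]
cleaned, deferred = clean(
    data, n=2, k=2,
    rule_applies=lambda v: v >= 1000,
    repair=lambda v: v // 100,
)
```

Here the two isolated points are 1100 and -5. The rule covers only the first, so it is repaired, while the index of -5 is returned for analysis and a revised rule.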

After the data cleaning is finished, the data quality needs to be evaluated. The evaluation is carried out by establishing a data quality evaluation model, and the process is as follows:

(1) Determining the application view

Before data quality evaluation is carried out, the evaluation requirements are first defined, that is, which data the user is interested in, and a corresponding user view is established over that data. For example, if the data quality of the gender and identification number columns of a user table needs to be evaluated, a corresponding view is generated.

(2) Selecting an evaluation index of data quality

A corresponding data quality evaluation index is selected according to the content under study.

(3) Formulating a rule set

A rule set is formulated according to the selected evaluation indices, for example for the consistency and completeness of the data, and a corresponding expected value and weight are established for each rule.

(4) Calculating a score according to a rule

A corresponding SQL statement is written for each rule.

The SQL statement is executed, and the number of rows it returns is divided by the total number of rows in the table; this percentage is the final score for the rule.
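Rule scoring with SQL can be sketched with Python's built-in `sqlite3` module. The table `users`, the view `user_view`, the sample rows and the two rule conditions are hypothetical, and the sketch assumes each rule counts the rows that pass it.

```python
import sqlite3


def rule_score(conn, table, condition):
    """Share of rows in `table` that satisfy the SQL `condition` (0.0 to 1.0)."""
    # The table name and condition are trusted strings here, not user input.
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    passed = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {condition}"
    ).fetchone()[0]
    return passed / total if total else 0.0


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id_number TEXT, gender TEXT, note TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [("A001", "F", ""), ("A002", "M", ""), (None, "M", ""), ("A004", "X", "")],
)
# View restricted to the columns under evaluation, as in step (1).
conn.execute("CREATE VIEW user_view AS SELECT id_number, gender FROM users")

completeness = rule_score(conn, "user_view", "id_number IS NOT NULL")
validity = rule_score(conn, "user_view", "gender IN ('F', 'M')")
```

Each rule becomes one `WHERE` condition, so adding a rule to the rule set amounts to adding one condition string together with its weight and expected value.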

(5) Based on the above content and the calculated results, an evaluation model of data quality is established: M = <D, I, R, W, E, S>

D: the data set to be evaluated.

I: the indices selected for evaluating the data set, such as integrity, consistency and validity.

R: the evaluation rules selected for evaluating the data quality of the data set.

W: the weight of each evaluation rule in the overall data quality evaluation.

E: the expected value given in advance for the evaluation result of each data quality evaluation index.

S: the actually calculated evaluation result of each rule.

The evaluation model of the data quality is shown in fig. 3: and after the data cleaning is finished, performing data quality evaluation through the constructed data quality evaluation model, and calculating a data quality evaluation result according to the data quality model. The formula for data quality assessment includes:

Absolute quantized value of data quality (SA):

where W represents the weight of each evaluation rule in the overall data quality evaluation, and S represents the actually calculated evaluation result of each rule.

Relative quantized value of data quality (SR):

where E denotes the expected value given in advance for the evaluation result of each data quality evaluation index.

SA is the weighted average of the scores calculated from the data quality rules and reflects the actual data quality of the data set.

SR is obtained as the difference between SA and the expected value. The larger the value of SR, the better the data quality of the data set; conversely, the smaller the value of SR, the worse the data quality of the data set.
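The two formulas themselves are not reproduced in the text above. One reconstruction consistent with the definitions of W, S and E, offered as an assumption rather than as the original formulas, takes the absolute value as the weight-normalized sum of the rule scores and the relative value as the same average applied to the deviation of each score from its expected value. The sign is chosen so that a larger relative value means better quality, as the text states; the index i and the rule count m are notation introduced here.

```latex
% Assumed reconstruction; m is the number of evaluation rules.
S_A = \frac{\sum_{i=1}^{m} W_i \, S_i}{\sum_{i=1}^{m} W_i}
\qquad
S_R = \frac{\sum_{i=1}^{m} W_i \, (S_i - E_i)}{\sum_{i=1}^{m} W_i}
```

When the weights sum to 1, both denominators equal 1 and the relative value reduces to the absolute value minus the weighted expected value.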

The data quality model can effectively evaluate the data quality condition, and improves the accuracy of data quality evaluation.
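A small numeric sketch shows how the elements of the model fit together. It assumes that the absolute value is the weight-normalized average of the rule scores and that the relative value is the weighted average of each score minus its expected value; the class name, field names and sample numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    name: str        # R: the evaluation rule
    index: str       # I: the index it belongs to, e.g. "completeness"
    weight: float    # W: weight of the rule in the overall evaluation
    expected: float  # E: expected value given in advance
    score: float     # S: actually calculated result


def absolute_score(rules):
    """Weighted average of the rule scores (assumed form of the absolute value)."""
    total = sum(r.weight for r in rules)
    return sum(r.weight * r.score for r in rules) / total


def relative_score(rules):
    """Weighted average of score minus expectation (assumed form of the relative value)."""
    total = sum(r.weight for r in rules)
    return sum(r.weight * (r.score - r.expected) for r in rules) / total


rules = [
    Rule("id_number not null", "completeness", 0.6, 0.90, 0.75),
    Rule("gender in {F, M}", "validity", 0.4, 0.80, 0.75),
]
sa = absolute_score(rules)
sr = relative_score(rules)
```

With these sample numbers the relative value is negative, which under the assumed sign convention means the data set falls short of its expected quality.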

Example II

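This embodiment provides a data quality improvement system based on a distributed system, comprising the data acquisition module, data preprocessing module, data cleaning module and data quality assessment module described above.

For the specific content of the relevant modules in this embodiment, refer to the corresponding implementation steps in Example I; they are not described in detail here.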

Example III

This embodiment provides a distributed file system, including a server configured to receive data from each data source and preprocess the loaded data, where the preprocessing mainly includes filling incomplete fields and merging duplicate records;

and to clean the data to be cleaned.

For the specific steps of this embodiment, refer to the corresponding implementation steps in Example I; they are not described in detail here.

Example IV

This embodiment provides a computer device, which comprises a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the data quality improvement method based on a distributed system.

Example V

This embodiment provides a computer-readable storage medium having stored thereon a computer program that, when executed by a processor, performs the steps of the data quality improvement method based on a distributed system.

In the present embodiments, a computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for performing various aspects of the present disclosure. The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device.

Experimental examples

Experiment and analysis of results

The experiment takes different journal resource data as an example to verify the effect of improving the data quality of the system.

Experimental configuration

In this experiment, a host with 16 GB of memory and an Intel Core i7-2600 (3.4 GHz) CPU was used. The method was implemented in Java, each group of experiments was run 4 times, and the conclusions were obtained by analyzing the experimental results.

TABLE 1 Data before cleaning

Through the analysis of the data source, the data of the data source has the following problems:

(1) Problems caused by non-uniform data formats

For example, journal numbers are recorded in non-uniform formats: some are of the form xxx-yyy and others of the form xxx/yy. Issue numbers have the same problem: for the second issue of the same journal, some records use the number 2 and others use 02. Publication dates also appear in two formats, aa-bb-cc and aa.bb.cc. Using a uniform representation for data of the same type is therefore an important aspect of improving data quality.
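Format unification of this kind can be sketched as a set of small normalizers. Which variant is treated as canonical (hyphen separators, issue numbers without leading zeros) is an assumption made here; the source only says the formats should be unified, and the function names are hypothetical.

```python
def normalize_journal_number(value):
    """Use '-' as the only separator, e.g. 'xxx/yy' becomes 'xxx-yy'."""
    return value.strip().replace("/", "-")


def normalize_issue(value):
    """Drop leading zeros so that '02' and '2' compare equal."""
    text = value.strip()
    return str(int(text)) if text.isdigit() else text


def normalize_date(value):
    """Rewrite the dotted form 'aa.bb.cc' as 'aa-bb-cc'."""
    return value.strip().replace(".", "-")


number = normalize_journal_number("1006/6401")
issue = normalize_issue("02")
date = normalize_date("2019.10.12")
```

Running the normalizers before duplicate detection also helps the sort key, since records that differ only in formatting then produce the same key.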

(2) Problem of data record duplication

For publication 1006-6401, there are two identical records representing the same data object, so the duplicate records need to be merged.

(3) Problem of field missing in data

Data records such as 1003-.

TABLE 2 Data after cleaning

The resulting data is shown in Table 2. Through data cleaning, the problems caused by non-uniform data formats, duplicate data records and missing fields in the data are all effectively handled; most erroneous data is cleaned, and the data quality is greatly improved.

It is to be understood that throughout this specification, references to the terms "one embodiment," "another embodiment," "other embodiments," or "first through nth embodiments," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention.

The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims

1. A data quality improvement method based on a distributed system, characterized by comprising:

carrying out data cleaning on data to be cleaned;

2. The data quality improvement method based on a distributed system as claimed in claim 1, wherein the preprocessing of the loaded data by the distributed file system includes filling incomplete fields and merging duplicate records.

3. The method for improving data quality based on distributed system as claimed in claim 2, wherein the process of merging duplicate records mainly comprises:

creating a sort key: extracting character strings or character string attribute values, calculating a key value from them for each data set, sorting each data set by key value, and moving potentially duplicate records into adjacent positions so as to confine the candidate matching records to a specific range around a given record;

detecting duplicate records with a sliding window: sliding a fixed-size window over the sorted data set, each record in the data set being compared only with the records in the window;

duplicate records are merged.

4. The method for improving data quality based on distributed system as claimed in claim 1, wherein, when the data to be cleaned is cleaned, it is determined whether the data to be cleaned satisfies the cleaning rule; if so, the data is cleaned and the cleaned data is written into the HDFS; if not, only the part of the data satisfying the cleaning rule is cleaned.

5. The method for improving data quality based on distributed system as claimed in claim 4, wherein cleaning the data to be cleaned comprises:

cleaning the data in the HDFS, wherein the parameters n and k are input to a data cleaning engine algorithm, and the data cleaning engine determines the n isolated points to be cleaned according to the input;

for the n determined isolated points, judging whether the isolated points satisfy the cleaning rule; if all n points satisfy the cleaning rule, cleaning the n isolated points according to the cleaning rule and writing the cleaned data back into the HDFS;

if only r of the n points satisfy the cleaning rule and the other n-r points do not, cleaning only the r points that satisfy the cleaning rule, the data cleaning program ending after these points have been cleaned.

6. The method for improving data quality based on distributed system as claimed in claim 4, wherein the data quality evaluation model is M = <D, I, R, W, E, S>

D: the data set to be evaluated;

I: the indices selected for evaluating the data set;

R: the evaluation rules selected for evaluating the data quality of the data set;

W: the weight of each evaluation rule in the overall data quality evaluation;

E: the expected value given in advance for the evaluation result of each data quality evaluation index;

S: the actually calculated evaluation result of each rule.

7. A data quality improvement system based on a distributed system, characterized by comprising:

8. A distributed file system, characterized by comprising a server, wherein the server is configured to receive data from each data source and preprocess the loaded data, and the preprocessing mainly comprises filling incomplete fields and merging duplicate records;

carrying out data cleaning on data to be cleaned;

9. A computer device, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor, when executing the program, implementing the steps of the data quality improvement method based on a distributed system of any one of claims 1-6.

10. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the data quality improvement method based on a distributed system of any one of claims 1-6.