CN106294745A - Big data cleaning method and device

Info

Publication number: CN106294745A
Application number: CN201610652750.6A
Authority: CN
Inventors: 赵伟伟; 张丛喆
Original assignee: Netposa Technologies Ltd
Current assignee: Netposa Technologies Ltd
Priority date: 2016-08-10
Filing date: 2016-08-10
Publication date: 2017-01-04

Abstract

The invention provides a big data cleaning method and device, belonging to the technical field of big data processing, which can significantly improve the speed and efficiency of data cleaning. The big data cleaning method includes: defining a cleaning process by configuration; parsing the cleaning process and converting it into Spark atomic operations; submitting the cleaning task to a Spark cluster; and performing data cleaning by the Spark cluster.

Description

Big data cleaning method and device

Technical field

The present invention relates to the technical field of big data processing, and in particular to a big data cleaning method and device.

Background art

With the arrival of the big data era, data has become huge in scale, its growth rate has accelerated, and its types and structures have become increasingly diverse. How to turn big data into useful data, and how to mine the value hidden in massive data, has become more and more urgent and important.

Data cleaning is the most basic of these tasks. It reduces the noise in big data, mainly by removing incomplete data, erroneous data, and duplicate data, so as to obtain data with higher consistency.

In existing data cleaning technology, most cleaning programs are single-machine (stand-alone) programs, and their cleaning speed and cleaning efficiency are low. At a certain data magnitude, automatic data cleaning can be achieved with computer technology. In the big data era, however, as data volume and the number of data types grow, existing data cleaning technology can no longer meet current data cleaning needs.

Summary of the invention

In view of this, the object of the present invention is to provide a big data cleaning method and device that can significantly improve the speed and efficiency of data cleaning.

In a first aspect, an embodiment of the present invention provides a big data cleaning method, including:

defining a cleaning process by configuration;

parsing the cleaning process and converting the cleaning process into Spark atomic operations;

submitting the cleaning task to a Spark cluster;

performing data cleaning by the Spark cluster.

With reference to the first aspect, an embodiment of the present invention provides a first possible implementation of the first aspect, wherein performing data cleaning by the Spark cluster specifically includes:

loading data from a data source;

cleaning the data using distributed parallel cleaning algorithms;

storing the result of the data cleaning.

With reference to the first aspect, an embodiment of the present invention provides a second possible implementation of the first aspect, wherein the cleaning algorithms include at least one of null value processing, deduplication processing, and sorting processing.

With reference to the first aspect, an embodiment of the present invention provides a third possible implementation of the first aspect, wherein data flowing between multiple cleaning algorithms is passed through resilient distributed datasets.

With reference to the first aspect, an embodiment of the present invention provides a fourth possible implementation of the first aspect, wherein the data source is a database or a distributed file system.

With reference to the first aspect, an embodiment of the present invention provides a fifth possible implementation of the first aspect, wherein defining the cleaning process by configuration is specifically:

defining the cleaning process by configuration based on the JSON format.

In a second aspect, an embodiment of the present invention further provides a big data cleaning device, including:

a big data cleaning engine, configured to define a cleaning process by configuration, parse the cleaning process, convert the cleaning process into Spark atomic operations, and submit the cleaning task to a Spark cluster;

a Spark cluster, configured to perform data cleaning.

With reference to the second aspect, an embodiment of the present invention provides a first possible implementation of the second aspect, wherein the Spark cluster is specifically configured to:

load data from a data source;

clean the data using distributed parallel cleaning algorithms;

store the result of the data cleaning.

With reference to the second aspect, an embodiment of the present invention provides a second possible implementation of the second aspect, wherein the device further includes a storage component for storing the result of the data cleaning.

With reference to the second aspect, an embodiment of the present invention provides a third possible implementation of the second aspect, wherein the cleaning algorithms include at least one of null value processing, deduplication processing, and sorting processing.

The embodiments of the present invention bring the following beneficial effects. With the big data cleaning method and device provided by the embodiments of the present invention, the cleaning process is first defined by configuration, and is then parsed and converted into Spark atomic operations. After the cleaning task is submitted to a cluster of the big data analysis framework Spark, the Spark cluster performs the data cleaning. Because every step of each cleaning process has been converted into a Spark atomic operation, every cleaning step carried out in the Spark cluster can be executed in a distributed, parallel manner. This significantly improves the speed of data cleaning, achieves high-speed and high-efficiency data cleaning, and is better suited to the current big data environment.

Other features and advantages of the present invention will be set forth in the following description, and in part will become apparent from the description or be understood by implementing the present invention. The objects and other advantages of the present invention are realized and obtained by the structures particularly pointed out in the description, the claims, and the accompanying drawings.

To make the above objects, features, and advantages of the present invention more apparent and easier to understand, preferred embodiments are described in detail below with reference to the accompanying drawings.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings required in the embodiments are briefly introduced below. It should be understood that the following drawings show only certain embodiments of the present invention and should therefore not be regarded as limiting its scope. Those of ordinary skill in the art can obtain other related drawings from these drawings without creative effort.

Fig. 1 shows a flowchart of a big data cleaning method provided by embodiment one of the present invention;

Fig. 2 shows a schematic diagram of a big data cleaning device provided by embodiment two of the present invention.

Detailed description of the invention

To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Therefore, the following detailed description of the embodiments provided in the drawings is not intended to limit the scope of the claimed invention, but merely represents selected embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

In current data cleaning technology, most cleaning programs are single-machine (stand-alone) programs with low cleaning speed and efficiency, which makes it difficult to meet the data cleaning needs of the current big data environment.

Based on this, the big data cleaning method and device provided by the embodiments of the present invention can significantly improve the speed of data cleaning, achieve high-speed and high-efficiency data cleaning, and are better suited to the current big data environment.

Embodiment one:

As shown in Fig. 1, an embodiment of the present invention provides a big data cleaning method, which mainly includes the following steps:

S1: Define the cleaning process by configuration.

Specifically, the data cleaning engine is started and first loads the cleaning process configuration file; the cleaning process is defined by configuration based on the JSON (JavaScript Object Notation) format. An example of the configuration items is as follows:

JSON is a lightweight data interchange format based on a subset of ECMAScript. JSON uses a text format that is completely independent of any programming language, which makes it an ideal data interchange language: it is easy for people to read and write, easy for machines to parse and generate, and has a considerable advantage in cross-platform data transmission.
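
The configuration items referred to in step S1 are not reproduced in this text. As a non-limiting illustration only, the following Python sketch shows what a JSON-based cleaning process definition could look like and how an engine might load and check it with the standard `json` module. Every key name (`source`, `steps`, `result`, `op`) and every value is a hypothetical assumption, not the configuration format defined by the invention.

```python
import json

# Hypothetical cleaning process definition. All keys and values are
# illustrative assumptions, not the configuration format of the invention.
CONFIG_TEXT = """
{
  "source": {"type": "hdfs", "path": "/data/raw/records"},
  "steps": [
    {"op": "null_handling", "fields": ["name", "age"]},
    {"op": "deduplicate"},
    {"op": "sort", "field": "age"}
  ],
  "result": {"type": "database", "table": "clean_records"}
}
"""


def load_cleaning_config(text):
    """Parse the JSON text and check that the assumed top-level keys exist."""
    config = json.loads(text)
    for key in ("source", "steps", "result"):
        if key not in config:
            raise ValueError(f"missing configuration item: {key}")
    return config


config = load_cleaning_config(CONFIG_TEXT)
# The order of the list is the order in which the steps would be run.
step_names = [step["op"] for step in config["steps"]]
print(step_names)
```

Keeping the steps in a JSON array preserves their order, which matters because the cleaning jobs are later submitted in the order defined by the cleaning process.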

S2: Parse the cleaning process and convert it into Spark atomic operations.

The big data cleaning engine parses the cleaning process according to the definition information in the configuration file and converts each cleaning step into a Spark atomic operation.
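
One way to picture this conversion, offered only as a sketch, is a lookup table from configured step names to the Spark operation that could carry out each step. The step names, the `STEP_TO_SPARK_OPERATION` table, and the `parse_cleaning_process` helper below are hypothetical; `filter`, `distinct`, and `sortBy` are existing Spark RDD operations, but the choice of which operation implements which step is an assumption. The code only builds an ordered plan and does not call Spark.

```python
# Hypothetical mapping from configured step names to the name of a Spark
# RDD operation that could implement each step. The step names are assumed.
STEP_TO_SPARK_OPERATION = {
    "null_handling": "filter",
    "deduplicate": "distinct",
    "sort": "sortBy",
}


def parse_cleaning_process(steps):
    """Convert configured steps into an ordered plan of (operation, params)."""
    plan = []
    for step in steps:
        name = step["op"]
        if name not in STEP_TO_SPARK_OPERATION:
            raise ValueError(f"unknown cleaning step: {name}")
        # Everything except the step name is passed on as parameters.
        params = {key: value for key, value in step.items() if key != "op"}
        plan.append((STEP_TO_SPARK_OPERATION[name], params))
    return plan


steps = [
    {"op": "null_handling", "fields": ["name"]},
    {"op": "deduplicate"},
    {"op": "sort", "field": "age"},
]
plan = parse_cleaning_process(steps)
for operation, params in plan:
    print(operation, params)
```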

Spark is a general-purpose parallel computing framework that became popular in the cloud computing field after Hadoop. It is a scalable cluster data analysis platform based on in-memory computing, and has a performance advantage over Hadoop's cluster storage approach. Spark is based on in-memory distributed datasets and optimizes iterative workloads and interactive queries, thereby improving the speed and efficiency of big data computation.

S3: Submit the cleaning task to the Spark cluster.

First, the Spark cluster is initialized and the Spark context is loaded in preparation for submitting cleaning jobs. Then, in the order defined by the cleaning process, the specific data cleaning jobs are submitted to the Spark cluster.

S4: Perform data cleaning by the Spark cluster.

S41: Load data from the data source.

The data source may be of different types. In this embodiment, the data source is a database or a distributed file system (Hadoop Distributed File System, HDFS for short).

In other embodiments, the data source types can also be extended according to specific business needs. Extending the data source types only requires adding a corresponding atomic data loading operation, and the loading of the data source is also processed in a distributed, parallel manner.
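
The extension mechanism described above can be sketched as a table of loaders keyed by data source type, so that supporting a new type means adding one loader. The sketch below is single-process Python and does not reproduce the distributed, parallel loading of the embodiment; SQLite stands in for the database and a local text file stands in for HDFS. The names `LOADERS`, `load_data`, and the keys of the `source` dictionaries are hypothetical.

```python
import os
import sqlite3
import tempfile


def load_from_database(source):
    """Read rows from a SQLite database file (stand-in for a real database)."""
    connection = sqlite3.connect(source["path"])
    try:
        connection.row_factory = sqlite3.Row
        return [dict(row) for row in connection.execute(source["query"])]
    finally:
        connection.close()


def load_from_file(source):
    """Read non-empty lines from a local text file (stand-in for HDFS)."""
    with open(source["path"], encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle if line.strip()]


# Supporting a new data source type means adding one loader to this table.
LOADERS = {"database": load_from_database, "file": load_from_file}


def load_data(source):
    """Dispatch to the loader registered for the configured source type."""
    return LOADERS[source["type"]](source)


# Usage: build a small database and a small file, then load both.
workdir = tempfile.mkdtemp()
db_path = os.path.join(workdir, "raw.db")
connection = sqlite3.connect(db_path)
connection.execute("CREATE TABLE records (id INTEGER, name TEXT)")
connection.executemany("INSERT INTO records VALUES (?, ?)", [(1, "a"), (2, None)])
connection.commit()
connection.close()

file_path = os.path.join(workdir, "raw.txt")
with open(file_path, "w", encoding="utf-8") as handle:
    handle.write("alpha\n\nbeta\n")

db_rows = load_data({"type": "database", "path": db_path,
                     "query": "SELECT id, name FROM records ORDER BY id"})
file_rows = load_data({"type": "file", "path": file_path})
print(db_rows, file_rows)
```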

S42: Clean the data using distributed parallel cleaning algorithms.

This embodiment provides three cleaning algorithms as examples: null value processing, deduplication processing, and sorting processing.
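
The three cleaning algorithms can be illustrated on a small in-memory list of records. This is a plain Python sketch rather than distributed Spark code, and the exact policies are assumptions: null value processing is taken here to mean dropping rows with a missing field, deduplication to mean dropping exact duplicate rows, and sorting to mean ascending order on one field. The function names are hypothetical.

```python
def handle_nulls(rows, fields):
    """Null value processing: drop rows where a listed field is missing or None."""
    return [row for row in rows
            if all(row.get(field) is not None for field in fields)]


def remove_duplicates(rows):
    """Deduplication: keep the first occurrence of each identical row."""
    seen = set()
    unique = []
    for row in rows:
        # A sorted tuple of items gives a hashable, order-independent key.
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def sort_rows(rows, field):
    """Sorting: order rows by one field, ascending."""
    return sorted(rows, key=lambda row: row[field])


raw = [
    {"id": 2, "age": 40},
    {"id": 1, "age": None},
    {"id": 2, "age": 40},
    {"id": 3, "age": 25},
]
cleaned = sort_rows(remove_duplicates(handle_nulls(raw, ["age"])), "age")
print(cleaned)
```

Each function takes a list of rows and returns a new list, so the algorithms can be chained in any configured order.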

As a preferred scheme, data flowing between multiple cleaning algorithms is passed through resilient distributed datasets (Resilient Distributed Datasets, RDD for short). Because Spark is built on the unified RDD abstraction, a Spark cluster can handle different big data processing scenarios in a basically consistent way, including MapReduce, Streaming, SQL, Machine Learning, Graph, and so on. An RDD is a fault-tolerant, parallel data structure that lets users explicitly store data on disk or in memory and control how the data is partitioned. RDDs also provide a rich set of operations for working with the data, such as map, flatMap, filter, join, groupBy, and reduceByKey.
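
To illustrate how a single dataset abstraction lets the output of one step become the input of the next, the sketch below defines a tiny in-memory class whose methods mimic the meaning of the RDD operations map, filter, and reduceByKey. `LocalDataset` is a hypothetical teaching stand-in: it is not distributed, not fault-tolerant, and not part of Spark.

```python
class LocalDataset:
    """In-memory stand-in for an RDD: every operation returns a new dataset."""

    def __init__(self, items):
        self._items = tuple(items)

    def map(self, func):
        """Apply func to every element."""
        return LocalDataset(func(item) for item in self._items)

    def filter(self, predicate):
        """Keep only the elements for which predicate is true."""
        return LocalDataset(item for item in self._items if predicate(item))

    def reduce_by_key(self, func):
        """Merge the values of (key, value) pairs that share a key."""
        merged = {}
        for key, value in self._items:
            merged[key] = func(merged[key], value) if key in merged else value
        return LocalDataset(merged.items())

    def collect(self):
        """Return the elements as a list."""
        return list(self._items)


records = LocalDataset(["1,ann", "", "2,bob", "1,ann"])
cleaned = (records.filter(lambda line: line != "")          # drop empty lines
                  .map(lambda line: tuple(line.split(",")))  # (id, name) pairs
                  .reduce_by_key(lambda first, _dup: first)  # one row per id
                  .collect())
print(cleaned)
```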

In other embodiments, the cleaning algorithms are not limited to the above three and can be extended according to actual business needs. Adding a new algorithm only requires adding a new method and specifying it in the configuration definition, after which it can be applied.
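
The extension path described here, one new method plus one configuration entry, can be sketched with a small registry. The decorator `cleaning_algorithm`, the registry `CLEANING_ALGORITHMS`, and the example algorithm `trim_whitespace` are all hypothetical names chosen for illustration.

```python
CLEANING_ALGORITHMS = {}


def cleaning_algorithm(name):
    """Register a function under the name used in the configuration."""
    def register(func):
        CLEANING_ALGORITHMS[name] = func
        return func
    return register


@cleaning_algorithm("deduplicate")
def deduplicate(values):
    """Existing algorithm: drop repeated values, keeping first occurrences."""
    return list(dict.fromkeys(values))


# Newly added algorithm: one new function plus one new configuration entry.
@cleaning_algorithm("trim_whitespace")
def trim_whitespace(values):
    """Strip leading and trailing whitespace from every string value."""
    return [value.strip() for value in values]


def run_configured_steps(values, step_names):
    """Apply the algorithms named in the configuration, in order."""
    for name in step_names:
        values = CLEANING_ALGORITHMS[name](values)
    return values


result = run_configured_steps([" a", "a ", "b"], ["trim_whitespace", "deduplicate"])
print(result)
```

The engine code in `run_configured_steps` does not change when an algorithm is added, which is the low-coupling property the configuration-based approach aims for.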

S43: Store the result of the data cleaning.

The data storage engine is started, that is, the program that stores the cleaned result data. The data storage engine selects the storage mode according to the result definition in the configuration definition. In this embodiment, the result can be stored in a database or in HDFS. In other embodiments, other storage modes can be added.
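
Selecting the storage mode from the result definition can be sketched as a dispatch on a `type` field. In the Python sketch below, SQLite stands in for the database and a local JSON-lines file stands in for HDFS; the names `STORAGE_BACKENDS` and `store_result`, the keys of the `result` dictionaries, and the fixed table name `cleaned` are hypothetical.

```python
import json
import os
import sqlite3
import tempfile


def store_to_database(rows, result):
    """Write rows to a SQLite table (stand-in for a real database)."""
    connection = sqlite3.connect(result["path"])
    try:
        connection.execute(
            "CREATE TABLE IF NOT EXISTS cleaned (id INTEGER, name TEXT)")
        connection.executemany("INSERT INTO cleaned VALUES (:id, :name)", rows)
        connection.commit()
    finally:
        connection.close()


def store_to_file(rows, result):
    """Write rows as JSON lines to a local file (stand-in for HDFS)."""
    with open(result["path"], "w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row) + "\n")


# Adding another storage mode means adding one entry to this table.
STORAGE_BACKENDS = {"database": store_to_database, "file": store_to_file}


def store_result(rows, result):
    """Pick the storage mode named in the result definition."""
    STORAGE_BACKENDS[result["type"]](rows, result)


rows = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}]
workdir = tempfile.mkdtemp()
file_result = {"type": "file", "path": os.path.join(workdir, "clean.jsonl")}
db_result = {"type": "database", "path": os.path.join(workdir, "clean.db")}
store_result(rows, file_result)
store_result(rows, db_result)

# Read both outputs back to confirm what was stored.
with open(file_result["path"], encoding="utf-8") as handle:
    stored_lines = [json.loads(line) for line in handle]
connection = sqlite3.connect(db_result["path"])
stored_count = connection.execute("SELECT COUNT(*) FROM cleaned").fetchone()[0]
connection.close()
print(stored_lines, stored_count)
```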

In the big data cleaning method provided by the embodiment of the present invention, the cleaning process is first defined by configuration, and is then parsed and converted into Spark atomic operations. After the cleaning task is submitted to a cluster of the big data analysis framework Spark, the open-source Spark cluster performs the data cleaning, and finally the result of the data cleaning is stored. Because every step of each cleaning process has been converted into a Spark atomic operation, every cleaning step carried out in the Spark cluster can be executed in a distributed, parallel manner. This significantly improves the speed of data cleaning, achieves high-speed and high-efficiency data cleaning, and is better suited to the current big data environment.

In addition, the Spark cluster supports extension well. Defining the process by configuration reduces the coupling of the program, so that adding or removing a cleaning algorithm can be achieved with minimal changes.

Embodiment two:

As shown in Fig. 2, an embodiment of the present invention provides a big data cleaning device, including a big data cleaning engine 1 and a Spark cluster 2.

The big data cleaning engine 1 is configured to define the cleaning process by configuration, parse the cleaning process, convert the cleaning process into Spark atomic operations, and submit the cleaning task to the Spark cluster; the Spark cluster 2 is configured to perform data cleaning.

Specifically, after the big data cleaning engine 1 is started, it first loads the cleaning process configuration file, and the cleaning process is defined by configuration based on the JSON format. The big data cleaning engine 1 then parses the cleaning process according to the definition information in the configuration file and converts each cleaning step into a Spark atomic operation.

After the Spark cluster 2 is initialized, the big data cleaning engine 1 submits the specific data cleaning jobs to the Spark cluster in the order defined by the cleaning process.

After receiving a data cleaning job, the Spark cluster 2 first loads data from the data source 4. The data source 4 may be of different types; in this embodiment, the data source 4 includes a database or HDFS. In other embodiments, the data source types can also be extended according to specific business needs. Extending the data source types only requires adding a corresponding atomic data loading operation, and the loading of the data source is also processed in a distributed, parallel manner.

The Spark cluster 2 then cleans the data using distributed parallel cleaning algorithms. The cleaning algorithms in this embodiment include null value processing, deduplication processing, and sorting processing. In other embodiments, the cleaning algorithms are not limited to the above three and can be extended according to actual business needs. Adding a new algorithm only requires adding a new method and specifying it in the configuration definition, after which it can be applied.

As a preferred scheme, data flowing between multiple cleaning algorithms is passed through RDDs. Because Spark is built on the unified RDD abstraction, a Spark cluster can handle different big data processing scenarios in a basically consistent way, including MapReduce, Streaming, SQL, Machine Learning, Graph, and so on. An RDD is a fault-tolerant, parallel data structure that lets users explicitly store data on disk or in memory and control how the data is partitioned. RDDs also provide a rich set of operations for working with the data, such as map, flatMap, filter, join, groupBy, and reduceByKey.

Finally, the Spark cluster 2 starts the data storage engine and stores the result of the data cleaning. The big data cleaning device provided by the embodiment of the present invention further includes a storage component 3 for storing the result of the data cleaning.

The data storage engine selects the storage mode according to the result definition in the configuration definition. In this embodiment, the result can be stored in a database or in HDFS. In other embodiments, other storage modes can be added.

In the big data cleaning device provided by the embodiment of the present invention, the big data cleaning engine 1 defines the cleaning process by configuration, and then parses the cleaning process and converts it into Spark atomic operations. After the cleaning task is submitted to the cluster 2 of the big data analysis framework Spark, the open-source Spark cluster 2 performs the data cleaning, and finally the result of the data cleaning is stored in the storage component 3. Because every step of each cleaning process has been converted into a Spark atomic operation, every cleaning step carried out in the Spark cluster 2 can be executed in a distributed, parallel manner. This significantly improves the speed of data cleaning, achieves high-speed and high-efficiency data cleaning, and is better suited to the current big data environment.

In addition, the Spark cluster 2 supports extension well. Defining the process by configuration reduces the coupling of the program, so that adding or removing a cleaning algorithm can be achieved with minimal changes.

If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a portable hard disk, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a magnetic disk, or an optical disc.

The above is only a specific implementation of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A big data cleaning method, characterized by including:

defining a cleaning process by configuration;

parsing the cleaning process and converting the cleaning process into Spark atomic operations;

submitting the cleaning task to a Spark cluster;

performing data cleaning by the Spark cluster.

2. The method according to claim 1, characterized in that performing data cleaning by the Spark cluster specifically includes:

loading data from a data source;

cleaning the data using distributed parallel cleaning algorithms;

storing the result of the data cleaning.

3. The method according to claim 2, characterized in that the cleaning algorithms include at least one of null value processing, deduplication processing, and sorting processing.

4. The method according to claim 3, characterized in that data flowing between multiple cleaning algorithms is passed through resilient distributed datasets.

5. The method according to claim 2, characterized in that the data source is a database or a distributed file system.

6. The method according to claim 1, characterized in that defining the cleaning process by configuration is specifically:

defining the cleaning process by configuration based on the JSON format.

7. A big data cleaning device, characterized by including:

a big data cleaning engine, configured to define a cleaning process by configuration, parse the cleaning process, convert the cleaning process into Spark atomic operations, and submit the cleaning task to a Spark cluster;

a Spark cluster, configured to perform data cleaning.

8. The device according to claim 7, characterized in that the Spark cluster is specifically configured to:

load data from a data source;

clean the data using distributed parallel cleaning algorithms;

store the result of the data cleaning.

9. The device according to claim 8, characterized by further including a storage component for storing the result of the data cleaning.

10. The device according to claim 8, characterized in that the cleaning algorithms include at least one of null value processing, deduplication processing, and sorting processing.