CN106202569A

CN106202569A - A cleaning method for large data volumes

Info

Publication number: CN106202569A
Application number: CN201610647894.2A
Authority: CN
Inventors: 蒙进财; 李鹏; 白志凌
Original assignee: Beijing VRV Software Corp Ltd
Current assignee: Beijing VRV Software Corp Ltd
Priority date: 2016-08-09
Filing date: 2016-08-09
Publication date: 2016-12-07

Abstract

The present invention provides a cleaning method for large data volumes. The method comprises the following steps: configuring cleaning rules, configuring the storage mode of the cleaned data, configuring the Spark cluster server resources for the cleaning program, deploying the cleaning program task, and evaluating the cleaned data. The invention reduces data storage volume, improves data retrieval accuracy and retrieval speed, shortens the response time of the web display end, and meets different business needs.

Description

A cleaning method for large data volumes

Technical field

The present invention relates to the field of data preprocessing, and more specifically to a cleaning method for large data volumes.

Background art

With the development of Internet technology, the volume of data that enterprises generate and mine is growing substantially. As it grows, the accumulation of data produces a large amount of duplicate data, and there is also much junk data, that is, useless data. In addition, incomplete information present in the data needs to be completed. To reduce the burden of progressively growing business requirements and to improve efficiency and response speed, the corresponding data must be cleaned out of the existing large data volume according to the business direction and type.

For an enterprise whose business involves large data volumes, customer satisfaction depends on the completeness of the data and on the response speed when viewing the required information. To meet these requirements, data rules are analyzed and cleaning rules are formulated for different business types to serve each functional area. The various existing data mining systems each perform data cleaning for a specific application, which specifically includes: detecting and eliminating data anomalies, detecting and eliminating approximately duplicate records, integrating data, and cleaning data of a specific domain. For attributes with a large number of missing values, the usual measure is to delete them directly; however, some systems cannot directly handle a large number of missing values when performing extract-transform-load (ETL) processing. Important attributes may likewise have a small number of missing values, and the data must be filled in completely before a series of data mining operations can be carried out. For such incomplete data, the following two ways are usually adopted to fill in data during data cleaning:

First, replace the missing attribute values with the same constant, such as "Unknown". This approach is generally used for attributes with a large number of missing values: the null values are first replaced with a substitute value; then, if the processed data has no value for later data mining, it may be deleted.

Second, fill the missing values with the most likely value of the attribute. For data lacking important attributes, statistics are computed for each attribute in advance, covering the distribution and frequency of its values, and every missing value of the attribute is filled with the value that occurs most frequently.
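The two filling strategies above can be sketched as follows. This is a minimal in-memory illustration in plain Python, not the patented implementation; the function names, the row-as-dictionary representation, and the choice to treat `None` and the empty string as missing are all assumptions made for the example.

```python
from collections import Counter

# Assumption: None and the empty string both count as "missing".
MISSING = (None, "")


def fill_with_constant(rows, field, constant="Unknown"):
    """Strategy 1: replace every missing value of `field` with one constant."""
    return [
        {**row, field: constant} if row.get(field) in MISSING else dict(row)
        for row in rows
    ]


def fill_with_mode(rows, field):
    """Strategy 2: replace missing values with the most frequent observed value."""
    counts = Counter(row[field] for row in rows if row.get(field) not in MISSING)
    if not counts:
        # Nothing observed, so there is no "most likely" value to use.
        return [dict(row) for row in rows]
    most_common_value = counts.most_common(1)[0][0]
    return fill_with_constant(rows, field, most_common_value)
```

Both functions return new rows and leave the input untouched, which makes it easy to compare the data before and after filling.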

In general, the final purpose of data cleaning is to handle each kind of dirty data in a corresponding way, so as to obtain the standard, clean, continuous data required for data statistics, data mining, and similar uses. In conventional data cleaning processes, the web side and most data cleaning programs have to aggregate and analyze large volumes of data that have not been cleaned, which not only consumes substantial server resources but also greatly reduces the response speed of the server.

Summary of the invention

In view of the above problems in the prior art, the object of the present invention is to provide a cleaning method for large data volumes. It uses Spark, operates on data in the Hadoop Distributed File System (HDFS), and stores data in HDFS, Hive, or Hadoop Database (HBase) according to the business direction, so that it can reduce data storage volume, reduce server resource consumption, improve retrieval speed and data accuracy, shorten the response time of the web display end, improve server response speed, and meet different business needs.

To achieve these goals, the technical solution used in the present invention is as follows:

A cleaning method for large data volumes, comprising the following steps:

Step 1: configure cleaning rules according to the business type, based on the data in HDFS or the Hive database;

Step 2: configure the storage mode of the cleaned data according to the purpose of the data;

Step 3: configure Spark cluster server resources according to the size of the data to be cleaned;

Step 4: deploy the cleaning program task;

Step 5: evaluate the cleaned data.

Further, the cleaning rules in Step 1 are: configuring the fields on which removal of duplicate data in a single table is based, configuring the fields on which content completion in a single table is based, configuring the fields on which junk data in a single table is judged, configuring the fields on which multiple tables are associated, configuring the conditions for filtering the data after multi-table association, and/or configuring the fields of the required data after multi-table association.

Further, the above association is a left association (left join), a right association (right join), or a coupling.
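The association types can be illustrated with a small in-memory join. This is only a sketch in plain Python: the function name `join_tables` and the row-as-dictionary representation are hypothetical, and reading "coupling" as an inner join is an assumption, since the text does not define the term.

```python
def join_tables(left, right, key, how="inner"):
    """Join two lists of dict rows on `key`.

    how="left"  keeps unmatched rows of `left`,
    how="right" keeps unmatched rows of `right`,
    how="inner" keeps matched rows only (assumed meaning of "coupling").
    Columns absent from an unmatched row are simply left out.
    """
    if how not in ("left", "right", "inner"):
        raise ValueError(f"unsupported join type: {how}")

    # Index the right table by key so each left row is matched in O(1).
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)

    joined, matched_keys = [], set()
    for row in left:
        matches = index.get(row[key], [])
        if matches:
            matched_keys.add(row[key])
            joined.extend({**row, **match} for match in matches)
        elif how == "left":
            joined.append(dict(row))

    if how == "right":
        for value, rows in index.items():
            if value not in matched_keys:
                joined.extend(dict(r) for r in rows)
    return joined
```

On a Spark cluster the same choice would normally be expressed through the join type of the table API rather than hand-written loops; the sketch only shows which rows each association type keeps.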

Further, the storage mode in Step 2 is HDFS, Hive, or HBase.

Further, the Spark cluster server resources in Step 3 include the memory size of the server, the shard (partition) size for the cleaning program, the maximum number of CPU cores of the server, and/or the log directory of the cleaning program.
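One way these four resources might be passed to Spark is through `spark-submit` options, sketched below in Python. The mapping is an assumption, not something the text specifies: memory is mapped to `--executor-memory`, the core limit to `spark.cores.max` (which applies to standalone and Mesos deployments), the shard size to `spark.sql.files.maxPartitionBytes` (which applies to file-based Spark SQL sources), and the log directory to Spark's event log directory. The function name and all argument values are hypothetical.

```python
def spark_submit_args(app_jar, main_class, executor_memory,
                      max_cores, partition_bytes, log_dir):
    """Build a spark-submit argument list from the configured resources.

    Assumed mapping (not stated in the source text):
      memory size    -> --executor-memory
      max CPU cores  -> spark.cores.max
      shard size     -> spark.sql.files.maxPartitionBytes
      log directory  -> spark.eventLog.dir (with event logging enabled)
    """
    return [
        "spark-submit",
        "--class", main_class,
        "--executor-memory", executor_memory,
        "--conf", f"spark.cores.max={max_cores}",
        "--conf", f"spark.sql.files.maxPartitionBytes={partition_bytes}",
        "--conf", "spark.eventLog.enabled=true",
        "--conf", f"spark.eventLog.dir={log_dir}",
        app_jar,
    ]


# Hypothetical values, for illustration only.
args = spark_submit_args(
    app_jar="cleaning-job.jar",
    main_class="example.CleaningJob",
    executor_memory="4g",
    max_cores=8,
    partition_bytes=64 * 1024 * 1024,
    log_dir="hdfs:///logs/cleaning",
)
```

Keeping the resources in one function makes it straightforward to scale them with the size of the data to be cleaned, as Step 3 requires.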

Further, deploying the cleaning program task in Step 4 includes: uploading the package to be cleaned to the task scheduling server, configuring the task schedule and submitting it to the Spark cluster servers, and monitoring the running of the cleaning program.
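The submit-and-monitor part of this step could look roughly like the sketch below. It is an assumption-laden illustration: the text does not name a scheduler, so the example simply runs a prepared command, inspects its exit code, and retries a fixed number of times. The function name, the retry policy, and the injectable `runner` parameter are all hypothetical.

```python
import subprocess


def run_cleaning_job(command, max_attempts=2, runner=subprocess.run):
    """Run the submit command and report the outcome.

    `command` is an argument list such as a spark-submit invocation.
    `max_attempts` must be at least 1. `runner` defaults to
    subprocess.run and can be replaced for dry runs.
    """
    result = None
    for attempt in range(1, max_attempts + 1):
        result = runner(command, capture_output=True, text=True)
        if result.returncode == 0:
            # Success: return the captured standard output as the job log.
            return {"status": "succeeded", "attempts": attempt, "log": result.stdout}
    # Every attempt failed: surface the last error output for diagnosis.
    return {"status": "failed", "attempts": max_attempts, "log": result.stderr}
```

A scheduler on the task scheduling server would call such a function once per configured run, for example daily, and store the returned status for monitoring.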

Further, the evaluation indicators in Step 5 include the credibility of the data and the availability of the data.

Further, the content of the above evaluation includes: the storage mode of the cleaned data, the accuracy of the cleaned data, whether the data contains redundancy, whether web access to the data meets the specified response time, and/or the format and content of the data after multi-table association.

Brief description of the drawings

Fig. 1 is a flow chart of the cleaning method for large data volumes according to the present invention.

Detailed description

To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawing. It should be understood that the specific embodiments described herein are only intended to explain the invention and are not intended to limit it.

The present invention addresses the duplicate data, junk data, and fields needing content completion (based on the required fields) that exist in the large volumes of data stored during data mining and web crawling. Using the resources of a big data cluster and the performance of a Spark cluster, it comprehensively assesses the data accuracy and cleaning speed achieved in the cleaning process, then cleans the data based on the distributed processing capability of the big data platform. The processed data is stored in HDFS and HBase respectively, so that data can be extracted according to different business types and directions or presented through web pages.

According to one embodiment of the present invention, a cleaning method for large data volumes is provided. The method uses the Scala language to write the cleaning program and uses HBase key-value storage for non-relational data. After the program has been fully developed and has passed testing, the cleaning program is deployed through task scheduling to take advantage of Spark distributed computing, ensuring that the data produced each day is cleaned. As shown in Fig. 1, first, cleaning rules are configured according to the business type, based on the data in HDFS, Hive, or HBase. Next, the storage mode of the cleaned data, such as HDFS, a Hive warehouse, or HBase, is configured according to the business type and direction that the cleaned data will serve. Next, the Spark cluster server resources for the cleaning program are configured according to the volume of data to be cleaned, including the server memory size required by the cleaning program, the shard (partition) size for the cleaning program, the maximum number of server CPU cores required by the cleaning program, and the log directory of the cleaning program, so that errors can be caught promptly. Next, once the cleaning rules, the data storage mode, and the server resources have been configured, the cleaning program is deployed to task scheduling. Finally, the cleaned data is evaluated: because the purpose of cleaning data is to meet the different needs of customers and improve customer approval, evaluating the cleaned data is particularly important.

According to one aspect of the embodiment, configuring the cleaning rules includes: configuring which fields' values are used to remove duplicate data in a single table; configuring which fields' values are used to complete data content in a single table; configuring which fields' values are used to judge whether data in a single table is junk data; configuring which fields are used to associate multiple tables (for example, left association, right association, and coupling); configuring the conditions for filtering the data after multi-table association; and configuring the fields of the required data after multi-table association.
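The single-table part of such a rule configuration might be represented and applied as in the sketch below. It is a plain Python illustration under stated assumptions: `CleaningRules` and `clean_single_table` are hypothetical names, a row is treated as junk when any judging field is missing or empty, and content completion is simplified to filling a default value per field.

```python
from dataclasses import dataclass, field


@dataclass
class CleaningRules:
    dedup_fields: tuple          # fields whose values identify duplicate rows
    junk_fields: tuple           # fields used to judge whether a row is junk
    fill_defaults: dict = field(default_factory=dict)  # field -> completion value


def _is_missing(value):
    return value in (None, "")


def clean_single_table(rows, rules):
    """Apply junk filtering, de-duplication, and completion, in that order."""
    seen, cleaned = set(), []
    for row in rows:
        # Assumed junk criterion: a judging field is missing or empty.
        if any(_is_missing(row.get(f)) for f in rules.junk_fields):
            continue
        # Keep only the first row for each combination of dedup field values.
        key = tuple(row.get(f) for f in rules.dedup_fields)
        if key in seen:
            continue
        seen.add(key)
        # Simplified completion: fill missing fields with configured defaults.
        completed = dict(row)
        for name, default in rules.fill_defaults.items():
            if _is_missing(completed.get(name)):
                completed[name] = default
        cleaned.append(completed)
    return cleaned
```

Holding the rules in a separate object from the cleaning logic reflects the idea that different business types reuse the same program with different configurations.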

According to another aspect of the embodiment, the evaluation of data cleaning is essentially an evaluation of the quality of the cleaned data, and specifically includes: evaluating whether the data is stored according to the configured business type; evaluating the accuracy of the data; evaluating whether the data still contains redundancy; evaluating whether web access to the data meets the specified response time; evaluating whether the data conforms to the expected format after multi-table association; and evaluating whether the data content is consistent after multi-table association. Data quality evaluation is a process of optimizing the value of data by measuring and improving its overall characteristics. The difficulty in researching data quality evaluation indicators and methods lies in defining the meaning, content, and classification of data quality and its evaluation indicators. Data quality evaluation should include at least the following two basic evaluation indicators:

First, the data must be credible to the user. Credibility includes indicators such as accuracy, completeness, consistency, validity, and uniqueness, as follows:

1. Accuracy: whether the data is consistent with the characteristics of the objective entity it corresponds to.

2. Completeness: whether the data has missing records or missing fields.

3. Consistency: whether the value of the same attribute of the same entity is consistent across different systems.

4. Validity: whether the data satisfies user-defined conditions or falls within a certain value range.

5. Uniqueness: whether the data contains duplicate records.
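Three of these indicators lend themselves to simple ratios, sketched below in plain Python. The text does not define how the indicators are computed, so the formulas, function names, and the convention that an empty table scores 1.0 are assumptions for illustration. Accuracy and consistency are omitted because they require a reference source or a second system to compare against.

```python
def _present(value):
    return value not in (None, "")


def completeness(rows, required_fields):
    """Share of rows in which every required field is present."""
    if not rows:
        return 1.0
    complete = sum(
        1 for row in rows if all(_present(row.get(f)) for f in required_fields)
    )
    return complete / len(rows)


def validity(rows, field, low, high):
    """Share of present values of `field` that fall within [low, high]."""
    values = [row[field] for row in rows if _present(row.get(field))]
    if not values:
        return 1.0
    return sum(1 for v in values if low <= v <= high) / len(values)


def uniqueness(rows, key_fields):
    """Share of rows that are distinct with respect to `key_fields`."""
    if not rows:
        return 1.0
    distinct = {tuple(row.get(f) for f in key_fields) for row in rows}
    return len(distinct) / len(rows)
```

Each function returns a value between 0 and 1, so the scores can be compared against thresholds chosen per business type.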

Second, the data must be available to the user. Availability includes indicators such as timeliness and stability, as follows:

1. Timeliness: whether the data is current data or historical data.

2. Stability: whether the data is stable and whether it is within its validity period.
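The two availability indicators can be expressed as time checks, as in the sketch below. The text gives no thresholds, so the notion of a "current window" and a per-record validity period, along with the function names, are assumptions made for this plain Python illustration.

```python
from datetime import datetime, timedelta


def timeliness(record_time, now, current_window):
    """Classify a record as current or historical by its age."""
    return "current" if now - record_time <= current_window else "historical"


def within_validity_period(record_time, now, validity_period):
    """Check whether `now` still falls inside the record's validity period."""
    return record_time <= now <= record_time + validity_period


# Hypothetical timestamps, for illustration only.
now = datetime(2016, 8, 9, 12, 0)
label = timeliness(datetime(2016, 8, 9, 6, 0), now, timedelta(days=1))
still_valid = within_validity_period(datetime(2016, 8, 1), now, timedelta(days=30))
```

Passing `now` explicitly, rather than reading the system clock inside the functions, keeps the checks reproducible when evaluating a batch of cleaned data.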

The embodiments described above express only embodiments of the present invention. Although their description is relatively specific and detailed, they should not therefore be construed as limiting the scope of the patent. It should be noted that those of ordinary skill in the art may make a number of variations and improvements without departing from the inventive concept, all of which fall within the protection scope of the invention. Therefore, the protection scope of this patent shall be determined by the appended claims.

Claims

1. A cleaning method for large data volumes, characterized in that the method comprises the following steps:

Step 1: configuring cleaning rules according to the business type, based on the data in HDFS or the Hive database;

Step 2: configuring the storage mode of the cleaned data according to the purpose of the data;

Step 3: configuring Spark cluster server resources according to the size of the data to be cleaned;

Step 4: deploying the cleaning program task;

Step 5: evaluating the cleaned data.

2. The method according to claim 1, characterized in that the cleaning rules in Step 1 are: configuring the fields on which removal of duplicate data in a single table is based, configuring the fields on which content completion in the single table is based, configuring the fields on which junk data in a single table is judged, configuring the fields on which multiple tables are associated, configuring the conditions for filtering the data after multi-table association, and/or configuring the fields of the required data after multi-table association.

3. The method according to claim 2, characterized in that the association is a left association, a right association, or a coupling.

4. The method according to claim 1, characterized in that the storage mode in Step 2 is HDFS, Hive, or HBase.

5. The method according to claim 1, characterized in that the Spark cluster server resources in Step 3 include the memory size of the server, the shard (partition) size for the cleaning program, the maximum number of CPU cores of the server, and/or the log directory of the cleaning program.

6. The method according to claim 1, characterized in that deploying the cleaning program task in Step 4 includes: uploading the package to be cleaned to the task scheduling server, configuring the task schedule and submitting it to the Spark cluster servers, and monitoring the running of the cleaning program.

7. The method according to claim 1, characterized in that the evaluation indicators in Step 5 include the credibility of the data and the availability of the data.

8. The method according to claim 7, characterized in that the content of the evaluation includes: the storage mode of the cleaned data, the accuracy of the cleaned data, whether the data contains redundancy, whether web access to the data meets the specified response time, and/or the format and content of the data after multi-table association.