CN106202569A - A kind of cleaning method based on big data quantity - Google Patents

Publication number
CN106202569A
Authority
CN
China
Prior art keywords
data
cleaning
configuration
described data
field
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610647894.2A
Other languages
Chinese (zh)
Inventor
蒙进财
李鹏
白志凌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing VRV Software Corp Ltd
Original Assignee
Beijing VRV Software Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2016-08-09
Filing date
2016-08-09
Publication date
2016-12-07
Application filed by Beijing VRV Software Corp Ltd
Priority to CN201610647894.2A
Publication of CN106202569A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Abstract

The present invention provides a data cleaning method for large data volumes. The method comprises the following steps: configuring cleaning rules, configuring the storage mode for the cleaned data, configuring the Spark cluster server resources for the cleaning program, deploying the cleaning program task, and evaluating the cleaned data. The invention reduces data storage volume, improves data retrieval accuracy and retrieval speed, shortens the response time of the web display end, and meets different business needs.

Description

A data cleaning method for large data volumes
Technical field
The present invention relates to the field of data preprocessing, and more specifically to a data cleaning method for large data volumes.
Background
With the development of Internet technology, the volume of data that enterprises generate and mine has grown substantially. As data accumulates, large amounts of it become duplicated, and much of it is junk data, that is, data of no use. In addition, incomplete records need to be completed. To ease the business demands that build up over time and to improve efficiency and response speed, the data relevant to each business direction and type needs to be cleaned out of the existing large data volume.
For an enterprise handling large data volumes, customer satisfaction depends on how complete the data is and on how quickly the required information can be viewed. To meet these needs, the rules governing the data are analyzed, and cleaning rules are formulated for each business type so that every functional area is served. Existing data mining systems each perform data cleaning for a specific application, which typically involves detecting and eliminating data anomalies, detecting and eliminating approximately duplicate records, integrating data, and cleaning domain-specific data. For attributes with a large number of missing values, the usual measure is to delete them outright; however, some systems cannot directly handle large numbers of missing values during extract-transform-load (ETL) processing. Important attributes may also have a small number of missing values, and these must be filled in before further data mining can take place. For such incomplete data, the following two approaches are commonly used to fill in values during cleaning:
First, replace the missing attribute values with a single constant, such as "Unknown". This approach is generally used for attributes with a large number of missing values: null values are first replaced with the placeholder, and if the processed data then proves to be of no value to later data mining, it can be deleted.
Second, fill each missing value with the most likely value of the attribute. For records that lack an important attribute, the values of each attribute are first tallied to obtain their distribution and frequency, and every missing value of the attribute is then filled with the most frequently occurring value.
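The two filling approaches can be illustrated with a minimal sketch in plain Python. It is not part of the patent, which describes a Scala program on Spark; the function names, the sample records, and the choice to treat `None` and the empty string as "missing" are assumptions made for illustration.

```python
from collections import Counter

# Assumption: a value counts as missing when it is None or an empty string.
MISSING = (None, "")

def fill_constant(rows, field, placeholder="Unknown"):
    """Approach 1: replace missing values of `field` with one fixed placeholder."""
    return [
        {**row, field: placeholder} if row.get(field) in MISSING else dict(row)
        for row in rows
    ]

def fill_most_frequent(rows, field):
    """Approach 2: replace missing values of `field` with its most frequent value."""
    observed = [row[field] for row in rows if row.get(field) not in MISSING]
    if not observed:
        return [dict(row) for row in rows]  # no observed values to learn from
    most_common_value = Counter(observed).most_common(1)[0][0]
    return [
        {**row, field: most_common_value} if row.get(field) in MISSING else dict(row)
        for row in rows
    ]

# Hypothetical sample records.
records = [
    {"id": 1, "city": "Beijing"},
    {"id": 2, "city": None},
    {"id": 3, "city": "Beijing"},
    {"id": 4, "city": "Shanghai"},
]
by_constant = fill_constant(records, "city")
by_mode = fill_most_frequent(records, "city")
```

Both functions return new lists and leave the input untouched, so the two strategies can be compared on the same records before deciding whether an attribute is worth keeping.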
In general, the ultimate goal of data cleaning is to handle each kind of dirty data in an appropriate way so as to obtain standard, clean, continuous data for statistics, data mining, and similar uses. In conventional practice, the web front end and most data cleaning programs have to aggregate and analyze large volumes of data that have not been cleaned, which not only consumes substantial server resources but also greatly reduces server response speed.
Summary of the invention
In view of the above problems in the prior art, the object of the present invention is to provide a data cleaning method for large data volumes. It uses Spark to process data held in the Hadoop Distributed File System (HDFS) and, depending on the business direction, stores the cleaned data in HDFS, Hive, or HBase (Hadoop Database). This reduces data storage volume and server resource consumption, improves retrieval speed and data accuracy, shortens the response time of the web display end, improves server response speed, and meets different business needs.
To achieve the above object, the present invention adopts the following technical solution:
A data cleaning method for large data volumes, comprising the following steps:
Step 1: based on the data in HDFS or a Hive database, configure cleaning rules according to the business type;
Step 2: according to the intended use of the data, configure the storage mode for the cleaned data;
Step 3: according to the size of the data to be cleaned, configure the Spark cluster server resources;
Step 4: deploy the cleaning program task;
Step 5: evaluate the cleaned data.
Further, the cleaning rules in step 1 comprise configuring: the fields on which duplicate removal in a single table is based; the fields on which content completion in a single table is based; the fields used to identify junk data in a single table; the fields on which joins across multiple tables are based; the conditions for filtering the joined data; and/or the fields required from the joined data.
Further, the join is a left join, a right join, or a match.
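The three join types can be sketched on small in-memory tables as follows. This is an illustration only, not the patent's Spark implementation. Reading the third type ("match") as an inner join is an assumption, and the `join` function, table contents, and field names are hypothetical.

```python
from collections import defaultdict

def join(left, right, key, how="inner"):
    """Join two lists of dicts on `key`; `how` is 'left', 'right' or 'inner'.

    Assumption: 'inner' stands in for the patent's "match" join type.
    On shared column names the left table's value wins.
    """
    if how not in ("left", "right", "inner"):
        raise ValueError(f"unsupported join type: {how}")
    right_index = defaultdict(list)
    for r in right:
        right_index[r[key]].append(r)

    result, matched_keys = [], set()
    for l in left:
        partners = right_index.get(l[key], [])
        for r in partners:
            result.append({**r, **l})
        if partners:
            matched_keys.add(l[key])
        elif how == "left":
            result.append(dict(l))  # keep the unmatched left row as-is
    if how == "right":
        # Keep right rows that found no partner in the left table.
        result.extend(dict(r) for r in right if r[key] not in matched_keys)
    return result

# Hypothetical tables.
users = [{"uid": 1, "name": "a"}, {"uid": 2, "name": "b"}]
orders = [{"uid": 2, "amount": 30}, {"uid": 3, "amount": 5}]

inner_rows = join(users, orders, "uid", how="inner")
left_rows = join(users, orders, "uid", how="left")
right_rows = join(users, orders, "uid", how="right")
```

In this sketch an unmatched row simply lacks the other table's columns, whereas a SQL engine would fill them with NULL; a filter condition and a list of required fields could then be applied to the joined rows.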
Further, the storage mode in step 2 is HDFS, Hive, or HBase.
Further, the Spark cluster server resources in step 3 include the server memory size, the shard (partition) size for the cleaning program, the maximum number of CPU cores of the server, and/or the log directory of the cleaning program.
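One way to picture the four resource settings is as standard Spark configuration properties. The mapping below is an assumption made for illustration: the patent does not name specific properties, `spark.default.parallelism` sets a partition count rather than a shard size, and `spark.eventLog.dir` is Spark's event log location rather than an application log directory. The function name and sample values are hypothetical.

```python
def spark_conf(executor_memory, max_cores, partitions, event_log_dir):
    """Map the four resource settings onto Spark configuration properties.

    Assumed mapping (not stated in the patent):
      memory size    -> spark.executor.memory
      max CPU cores  -> spark.cores.max (standalone / coarse-grained Mesos)
      shard setting  -> spark.default.parallelism (a partition count)
      log directory  -> spark.eventLog.dir
    """
    if max_cores < 1 or partitions < 1:
        raise ValueError("max_cores and partitions must be positive")
    return {
        "spark.executor.memory": executor_memory,
        "spark.cores.max": str(max_cores),
        "spark.default.parallelism": str(partitions),
        "spark.eventLog.enabled": "true",
        "spark.eventLog.dir": event_log_dir,
    }

# Hypothetical values for a mid-sized daily cleaning run.
conf = spark_conf("4g", 8, 64, "hdfs:///spark-logs")
```

Keeping these values in one dictionary makes it straightforward to scale them with the volume of data to be cleaned, which is the stated purpose of step 3.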
Further, deploying the cleaning program task in step 4 comprises: uploading the cleaning program package to the task scheduling server, configuring the task schedule and submitting the task to the Spark cluster servers, and monitoring the running of the cleaning program.
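The submission part of this step might look like the following sketch, which only assembles a `spark-submit` command line and a crontab entry and does not run anything. The patent does not name a scheduler, so cron is an assumption; the master URL, class name, jar path, and application arguments are hypothetical placeholders.

```python
import shlex

def build_submit_command(master, main_class, jar_path, conf, app_args=()):
    """Assemble a spark-submit invocation as an argument list."""
    cmd = ["spark-submit", "--master", master, "--class", main_class]
    for key, value in sorted(conf.items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(jar_path)        # the uploaded cleaning program package
    cmd.extend(app_args)        # arguments passed through to the program
    return cmd

# All names below are hypothetical.
command = build_submit_command(
    master="spark://master-host:7077",
    main_class="com.example.CleaningJob",
    jar_path="/opt/jobs/cleaning-job.jar",
    conf={"spark.executor.memory": "4g", "spark.cores.max": "8"},
    app_args=["--input", "/data/raw"],
)

# Assumption: cron on the task scheduling server runs the job daily at 02:00.
crontab_line = "0 2 * * * " + shlex.join(command)
```

Monitoring is left out of the sketch; in practice the scheduler's own job status and the Spark logs would be the places to watch the running program.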
Further, the evaluation metrics in step 5 include the credibility of the data and the availability of the data.
Further, the evaluation covers: the storage mode of the cleaned data, the accuracy of the cleaned data, whether the data contains redundancy, whether web access to the data meets the specified response time, and/or the format and content of the data after multi-table joins.
Brief description of the drawings
Fig. 1 is a flow chart of the data cleaning method for large data volumes according to the present invention.
Detailed description
To make the object, technical solution, and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawing. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.
The present invention addresses the duplicate data, junk data, and fields requiring content completion that arise when large data volumes are stored during data mining and web crawling. Drawing on the resources of a big data cluster and the performance of a Spark cluster, it comprehensively assesses the data accuracy and cleaning speed achieved during cleaning. It then uses the distributed processing capability of the big data platform to clean the data and stores the processed data in HDFS and HBase respectively, so that data can be extracted for different business types and directions or served to web pages for display.
According to one embodiment of the present invention, a data cleaning method for large data volumes is provided. The cleaning program is written in Scala, and non-relational data is stored using HBase's key-value model. Once the program has been fully developed and has passed testing, it is deployed through task scheduling so as to exploit Spark's distributed computing, ensuring that the data produced each day is cleaned. As shown in Fig. 1, first, cleaning rules are configured by business type based on the data in HDFS, Hive, or HBase. Next, the storage mode for the cleaned data, such as HDFS, a Hive warehouse, or HBase, is configured according to the business type and direction the cleaned data will serve. Then, according to the volume of data to be cleaned, the Spark cluster server resources for the cleaning program are configured, including the server memory the program needs, the shard (partition) size for the program, the maximum number of CPU cores on the servers the program needs, and the program's log directory, so that errors can be caught promptly. After the cleaning rules, storage mode, and server resources have been configured, the cleaning program is deployed to the task scheduler. Finally, the cleaned data is evaluated: because the purpose of cleaning is to meet customers' differing needs and earn their approval, evaluating the cleaned data is especially important.
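The three configuration steps of this embodiment can be gathered into a single job description, sketched below in plain Python rather than the Scala of the embodiment. The class, its field names, and the resource keys are hypothetical; only the three storage modes come from the text. Deployment and evaluation (steps 4 and 5) act on such a description and are not modeled here.

```python
from dataclasses import dataclass, field

STORAGE_MODES = ("HDFS", "Hive", "HBase")
# Hypothetical key names for the four resource settings listed in the text.
RESOURCE_KEYS = ("memory", "max_cores", "partition_size", "log_dir")

@dataclass
class CleaningJob:
    business_type: str                              # drives rule selection
    rules: dict                                     # step 1: cleaning rules
    storage_mode: str                               # step 2: where cleaned data goes
    resources: dict = field(default_factory=dict)   # step 3: cluster resources

    def problems(self):
        """Return a list of configuration problems; empty means ready to deploy."""
        found = []
        if not self.rules:
            found.append("no cleaning rules configured")
        if self.storage_mode not in STORAGE_MODES:
            found.append(f"unknown storage mode: {self.storage_mode}")
        for key in RESOURCE_KEYS:
            if key not in self.resources:
                found.append(f"missing resource setting: {key}")
        return found

# Hypothetical job serving a web display use case.
job = CleaningJob(
    business_type="web_display",
    rules={"dedup_fields": ["user", "email"]},
    storage_mode="HBase",
    resources={"memory": "4g", "max_cores": 8,
               "partition_size": "128m", "log_dir": "/var/log/cleaning"},
)
```

Checking the configuration before deployment reflects the order in the embodiment, where the program is handed to the task scheduler only after rules, storage mode, and resources are all in place.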
According to one aspect of the embodiment, configuring the cleaning rules includes specifying: which fields' values are used to remove duplicate data in a single table; which fields' values are used to complete data content in a single table; which fields' values are used to judge whether a record in a single table is junk data; which fields are used to join multiple tables (for example by left join, right join, or match); the conditions for filtering the joined data; and the fields required from the joined data.
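The single-table rules can be sketched as a small rule-driven function. This is an illustration under stated assumptions, not the patent's implementation: the rule keys, the lookup-based completion, the junk test, and the order of operations (junk filtering, then completion, then de-duplication) are all choices made for the example.

```python
def clean_single_table(rows, rules):
    """Apply junk filtering, content completion and de-duplication to one table.

    Assumed rule keys (hypothetical):
      junk_field / is_junk   -> field inspected and predicate marking junk rows
      completion_key         -> field whose value drives completion
      completion_target      -> field to fill when it is missing
      completion_lookup      -> mapping from key value to fill value
      dedup_fields           -> fields that together identify a duplicate
    """
    seen, cleaned = set(), []
    for row in rows:
        if rules["is_junk"](row.get(rules["junk_field"])):
            continue  # drop junk data
        row = dict(row)
        target = rules["completion_target"]
        if row.get(target) in (None, ""):
            key_value = row.get(rules["completion_key"])
            row[target] = rules["completion_lookup"].get(key_value, row.get(target))
        identity = tuple(row.get(f) for f in rules["dedup_fields"])
        if identity in seen:
            continue  # drop repeated data, keeping the first occurrence
        seen.add(identity)
        cleaned.append(row)
    return cleaned

# Hypothetical rules and records.
rules = {
    "junk_field": "email",
    "is_junk": lambda value: not value,   # a row without an email is junk
    "completion_key": "dept_id",
    "completion_target": "dept_name",
    "completion_lookup": {"D1": "Sales", "D2": "Ops"},
    "dedup_fields": ["user", "email"],
}
rows = [
    {"user": "alice", "email": "a@example.com", "dept_id": "D1", "dept_name": None},
    {"user": "alice", "email": "a@example.com", "dept_id": "D1", "dept_name": "Sales"},
    {"user": "bob", "email": "", "dept_id": "D2", "dept_name": "Ops"},
    {"user": "carol", "email": "c@example.com", "dept_id": "D2", "dept_name": ""},
]
cleaned = clean_single_table(rows, rules)
```

Because every decision is read from `rules`, the same function can serve different business types by swapping the rule dictionary, which is the point of configuring rules per business type.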
According to another aspect of the embodiment, evaluating the data cleaning is in essence evaluating the quality of the cleaned data. Specifically, it assesses whether the data is stored according to the configured business type, how accurate the data is, whether the data still contains redundancy, whether web access to the data meets the specified response time, whether the data has the expected format after multi-table joins, and whether its content is consistent after multi-table joins. Data quality evaluation is a process of optimizing the value of data by measuring and improving its overall characteristics. The difficulty in studying quality metrics and methods lies in defining what data quality means, what it covers, how it is classified, and which metrics to use. A data quality evaluation should include at least the following two groups of basic metrics:
First, the data must be credible to the user. Credibility includes metrics such as accuracy, completeness, consistency, validity, and uniqueness, as follows:
1. Accuracy: whether the data agrees with the characteristics of the real-world object it describes.
2. Completeness: whether the data has missing records or missing fields.
3. Consistency: whether the value of the same attribute of the same entity is the same across different systems.
4. Validity: whether the data satisfies user-defined conditions or falls within a given value range.
5. Uniqueness: whether the data contains duplicate records.
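Three of these credibility metrics can be computed from the cleaned data alone, as the sketch below shows. Expressing each metric as a ratio between 0 and 1 is an assumption, as are the function names and sample records; the patent names the metrics but gives no formulas. Accuracy and consistency are omitted because they need reference data or a second system to compare against.

```python
def completeness(rows, required_fields):
    """Share of rows in which every required field is present and non-empty."""
    if not rows:
        return 1.0
    complete = sum(
        all(row.get(f) not in (None, "") for f in required_fields) for row in rows
    )
    return complete / len(rows)

def uniqueness(rows, key_fields):
    """Number of distinct keys divided by the number of rows."""
    if not rows:
        return 1.0
    distinct = {tuple(row.get(f) for f in key_fields) for row in rows}
    return len(distinct) / len(rows)

def validity(rows, field, low, high):
    """Share of rows whose numeric `field` lies within [low, high]."""
    if not rows:
        return 1.0
    valid = sum(
        1 for row in rows
        if isinstance(row.get(field), (int, float)) and low <= row[field] <= high
    )
    return valid / len(rows)

# Hypothetical cleaned records with one missing value, one repeat, one outlier.
sample = [
    {"id": 1, "age": 30},
    {"id": 2, "age": None},
    {"id": 2, "age": 150},
    {"id": 4, "age": 41},
]
scores = {
    "completeness": completeness(sample, ["id", "age"]),
    "uniqueness": uniqueness(sample, ["id"]),
    "validity": validity(sample, "age", 0, 120),
}
```

Ratios like these can be tracked per cleaning run, so a drop in any one of them points to the rule (completion, de-duplication, or junk filtering) that needs attention.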
Second, the data must be available to the user. Availability includes metrics such as timeliness and stability, as follows:
1. Timeliness: whether the data is current or historical.
2. Stability: whether the data is stable and within its validity period.
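The availability checks reduce to simple time comparisons, sketched below. The patent does not define how "current" or "validity period" are measured, so the age threshold, the inclusive bounds, and all names and timestamps here are assumptions.

```python
from datetime import datetime, timedelta

def is_current(updated_at, now, max_age):
    """Timeliness: treat a record as current if it was updated within `max_age`."""
    return now - updated_at <= max_age

def within_validity(now, valid_from, valid_until):
    """Stability: check that `now` falls inside the record's validity period."""
    return valid_from <= now <= valid_until

# Hypothetical timestamps; a fixed `now` keeps the example reproducible.
now = datetime(2016, 8, 9, 12, 0)
fresh = is_current(datetime(2016, 8, 9, 3, 0), now, max_age=timedelta(days=1))
stale = is_current(datetime(2016, 7, 1, 0, 0), now, max_age=timedelta(days=1))
in_period = within_validity(now, datetime(2016, 1, 1), datetime(2016, 12, 31))
```

Passing `now` in explicitly, rather than reading the clock inside the functions, keeps the checks deterministic when they are run as part of an evaluation report.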
The embodiments described above express only some implementations of the present invention; although they are described specifically and in detail, they should not be construed as limiting the scope of the patent. It should be noted that a person of ordinary skill in the art may make various modifications and improvements without departing from the inventive concept, all of which fall within the scope of protection of the present invention. The scope of protection of this patent shall therefore be determined by the appended claims.

Claims (8)

1. A data cleaning method for large data volumes, characterized in that the method comprises the following steps:
Step 1: based on the data in HDFS or a Hive database, configure cleaning rules according to the business type;
Step 2: according to the intended use of said data, configure the storage mode for said data after cleaning;
Step 3: according to the size of said data to be cleaned, configure the Spark cluster server resources;
Step 4: deploy the cleaning program task;
Step 5: evaluate said data after cleaning.
2. The method according to claim 1, characterized in that said cleaning rules in step 1 comprise configuring: the fields on which duplicate removal in a single table is based; the fields on which content completion in said single table is based; the fields used to identify junk data in a single table; the fields on which joins across multiple tables are based; the conditions for filtering the joined data; and/or the fields required from the joined data.
3. The method according to claim 2, characterized in that said join is a left join, a right join, or a match.
4. The method according to claim 1, characterized in that said storage mode in step 2 is HDFS, Hive, or HBase.
5. The method according to claim 1, characterized in that said Spark cluster server resources in step 3 include the memory size of said server, the shard (partition) size for said cleaning program, the maximum number of CPU cores of said server, and/or the log directory of said cleaning program.
6. The method according to claim 1, characterized in that said deploying of the cleaning program task in step 4 comprises: uploading the cleaning program package to the task scheduling server, configuring the task schedule and submitting the task to said Spark cluster servers, and monitoring the running of the cleaning program.
7. The method according to claim 1, characterized in that the metrics of said evaluation in step 5 include the credibility of said data and the availability of said data.
8. The method according to claim 7, characterized in that said evaluation covers: the storage mode of said data after cleaning, the accuracy of said data after cleaning, whether said data contains redundancy, whether web access to said data meets the specified response time, and/or the format and content of said data after multi-table joins.
CN201610647894.2A 2016-08-09 2016-08-09 A kind of cleaning method based on big data quantity Pending CN106202569A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610647894.2A CN106202569A (en) 2016-08-09 2016-08-09 A kind of cleaning method based on big data quantity

Publications (1)

Publication Number Publication Date
CN106202569A 2016-12-07

Family

ID=57514771

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610647894.2A Pending CN106202569A (en) 2016-08-09 2016-08-09 A kind of cleaning method based on big data quantity

Country Status (1)

Country Link
CN (1) CN106202569A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013184677A1 (en) * 2012-06-04 2013-12-12 Google Inc. Method and system for deleting obsolete files from a file system
CN104317801A (en) * 2014-09-19 2015-01-28 东北大学 Data cleaning system and method for aiming at big data
CN104680328A (en) * 2015-03-16 2015-06-03 朗新科技股份有限公司 Power grid construction quality monitoring method based on client perception values
CN105138650A (en) * 2015-08-28 2015-12-09 成都康赛信息技术有限公司 Hadoop data cleaning method and system based on outlier mining

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
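金翰伟 (Jin Hanwei): "Design and Implementation of a Spark-Based Big Data Cleaning Framework" (基于Spark的大数据清洗框架设计与实现), China Master's Theses Full-text Database, Information Science and Technology series *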

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106897411A (en) * 2017-02-20 2017-06-27 广东奡风科技股份有限公司 ETL system and its method based on Spark technologies
CN110019152A (en) * 2017-07-27 2019-07-16 润泽科技发展有限公司 A kind of big data cleaning method
CN107633257A (en) * 2017-08-15 2018-01-26 上海数据交易中心有限公司 Data Quality Assessment Methodology and device, computer-readable recording medium, terminal
CN107633257B (en) * 2017-08-15 2020-04-17 上海数据交易中心有限公司 Data quality evaluation method and device, computer readable storage medium and terminal
CN109783341A (en) * 2017-11-10 2019-05-21 阿里巴巴集团控股有限公司 Regression testing method and device
CN107908720A (en) * 2017-11-14 2018-04-13 河北工程大学 A kind of patent data cleaning method and system based on AdaBoost algorithms
CN107832451A (en) * 2017-11-23 2018-03-23 安徽科创智慧知识产权服务有限公司 A kind of big data cleaning way of simplification
CN107895032A (en) * 2017-11-23 2018-04-10 安徽科创智慧知识产权服务有限公司 Carry out the network data acquisition method that data are tentatively cleaned
CN107943973A (en) * 2017-11-28 2018-04-20 上海云信留客信息科技有限公司 A kind of big data system for washing intelligently and cloud intelligent robot clean service platform
CN110427356A (en) * 2018-04-26 2019-11-08 中移(苏州)软件技术有限公司 One parameter configuration method and equipment
CN110427356B (en) * 2018-04-26 2021-08-13 中移(苏州)软件技术有限公司 Parameter configuration method and equipment
CN109104732A (en) * 2018-06-13 2018-12-28 珠海格力电器股份有限公司 Data transmission method for uplink, device and intelligent electric appliance
CN109033174A (en) * 2018-06-21 2018-12-18 北京国网信通埃森哲信息技术有限公司 A kind of power quality data cleaning method and device
CN109753496A (en) * 2018-11-27 2019-05-14 天聚地合(苏州)数据股份有限公司 A kind of data cleaning method for big data
CN113127460A (en) * 2019-12-31 2021-07-16 北京懿医云科技有限公司 Evaluation method of data cleaning frame, device, equipment and storage medium thereof
CN113127460B (en) * 2019-12-31 2023-11-17 北京懿医云科技有限公司 Evaluation method of data cleaning frame, device, equipment and storage medium thereof
CN112000656A (en) * 2020-09-01 2020-11-27 北京天源迪科信息技术有限公司 Intelligent data cleaning method and device based on metadata
CN112486969A (en) * 2020-12-01 2021-03-12 李孔雀 Data cleaning method applied to big data and deep learning and cloud server
CN112486969B (en) * 2020-12-01 2021-08-03 罗嗣扬 Data cleaning method applied to big data and deep learning and cloud server
CN117056576A (en) * 2023-10-13 2023-11-14 太极计算机股份有限公司 Data quality flexible verification method based on big data platform
CN117056576B (en) * 2023-10-13 2024-04-05 太极计算机股份有限公司 Data quality flexible verification method based on big data platform

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20161207
