CN106202569A - Cleaning method based on large data volumes - Google Patents
Cleaning method based on large data volumes
- Publication number
- CN106202569A CN106202569A CN201610647894.2A CN201610647894A CN106202569A CN 106202569 A CN106202569 A CN 106202569A CN 201610647894 A CN201610647894 A CN 201610647894A CN 106202569 A CN106202569 A CN 106202569A
- Authority
- CN
- China
- Prior art keywords
- data
- cleaning
- configuration
- described data
- field
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
Abstract
The present invention provides a cleaning method based on large data volumes, comprising the following steps: configuring cleaning rules, configuring the storage mode of the cleaned data, configuring the Spark cluster server resources for the cleaning program, deploying the cleaning program task, and evaluating the cleaned data. The invention reduces data storage capacity, improves data retrieval accuracy and speed, shortens web front-end response time, and meets different business needs.
Description
Technical field
The present invention relates to the field of data preprocessing, and more specifically to a cleaning method based on large data volumes.
Background technology
With the development of Internet technology, the volume of data that enterprises produce and mine has grown substantially. As data accumulates, large amounts of it become duplicated, and much of it is junk or otherwise useless; in addition, incomplete records in the data need to be completed. To meet progressively growing business demands and to improve efficiency and response speed, the data relevant to each business direction and type must be cleaned out of the existing large data set.
For an enterprise serving business demands over large data volumes, customer satisfaction depends on the completeness of the data and on the response speed when looking up needed information. To improve both, data rules are analyzed and cleaning rules are formulated for different business types to serve each functional area. Existing data-mining systems clean data for specific applications: detecting and eliminating data anomalies, detecting and eliminating approximately duplicate records, integrating data, and cleaning data in specific domains. For attributes with a large number of missing values, the common measure is direct deletion; however, some systems cannot directly handle large numbers of missing values during extract-transform-load (ETL) processing, and important attributes may likewise contain a small number of missing values that must be filled before subsequent data mining can proceed. For such incomplete data, the following two approaches are usually taken to fill values during cleaning:
First, replace missing attribute values with a constant such as "Unknown". This is generally used for attributes with a large number of missing values: nulls are first constrained and replaced with the constant, and if the data afterwards contributes no value to the mining work, it is deleted.
Second, fill a missing value with the attribute's most likely value. For data missing an important attribute, statistics on each attribute's value distribution and frequency are gathered in advance, and every missing value of the attribute is filled with its most frequently occurring value.
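The two fill strategies above can be sketched as follows. This is a plain-Python illustration (the patent's own program is written in Scala on Spark); all function and field names are chosen for the example:

```python
from collections import Counter

def fill_constant(records, field, placeholder="Unknown"):
    """Strategy 1: replace missing values of `field` with a constant."""
    return [
        {**r, field: placeholder} if r.get(field) is None else r
        for r in records
    ]

def fill_mode(records, field):
    """Strategy 2: replace missing values of `field` with the attribute's
    most frequently occurring value."""
    counts = Counter(r[field] for r in records if r.get(field) is not None)
    mode = counts.most_common(1)[0][0]
    return [
        {**r, field: mode} if r.get(field) is None else r
        for r in records
    ]
```

For example, given three records where `city` is `"Beijing"`, `None`, `"Beijing"`, `fill_mode` fills the gap with `"Beijing"`, while `fill_constant` would fill it with `"Unknown"`.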
In general, the final purpose of data cleansing is to process the various kinds of dirty data in the appropriate way, yielding standard, clean, continuous data for use in statistics, data mining, and so on. In conventional cleaning workflows, web front-ends and most cleaning programs collect and analyze large volumes of data that have not been cleaned; the consequence is not only the consumption of substantial server resources but also a marked reduction in server response speed.
Summary of the invention
In view of the above problems in the prior art, the object of the present invention is to provide a cleaning method based on large data volumes. Using Spark, operating on data in the Hadoop Distributed File System (HDFS), and storing results in HDFS, Hive, or the Hadoop Database (HBase) according to business direction, the method reduces data storage capacity and server resource consumption, improves retrieval speed and data accuracy, shortens web front-end response time, improves server response speed, and meets different business needs.
To achieve these goals, the technical solution adopted by the present invention is as follows:
A cleaning method based on large data volumes, comprising the following steps:
Step 1: configure cleaning rules by business type, based on the data in HDFS or the Hive database;
Step 2: configure the storage mode of the cleaned data according to its intended use;
Step 3: configure the Spark cluster server resources according to the size of the data to be cleaned;
Step 4: deploy the cleaning program task;
Step 5: evaluate the cleaned data.
Further, the cleaning rules in Step 1 are: configuring the fields by which duplicate data is removed within a single table, configuring the fields by which junk data is judged within a single table, configuring the fields by which content is completed within a single table, configuring the fields on which multiple tables are joined, configuring the conditions by which the joined data is filtered, and/or configuring the fields of the desired data after the join.
Further, the above join includes a left join, a right join, or a match.
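The three join modes a rule may configure can be sketched in plain Python over lists of dicts. Treating "match" as an inner join is an assumption about the translated term, and the helper below is illustrative, not the patent's implementation:

```python
def join(left, right, key, how="inner"):
    """Join two lists of dicts on `key`; how = 'inner', 'left', or 'right'."""
    right_index = {}
    for r in right:
        right_index.setdefault(r[key], []).append(r)
    left_keys = {l[key] for l in left}

    rows = []
    for l in left:
        matches = right_index.get(l[key], [])
        if matches:
            rows.extend({**l, **r} for r in matches)  # matched pairs
        elif how == "left":
            rows.append(dict(l))  # keep unmatched left rows
    if how == "right":
        rows.extend(dict(r) for r in right if r[key] not in left_keys)
    return rows
```

With two users (ids 1 and 2) and two orders (ids 1 and 3), an inner join yields one row, while the left and right joins each yield two.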
Further, the storage mode in Step 2 is HDFS, Hive, or HBase.
Further, the Spark cluster server resources in Step 3 include the memory size of the server, the partition (shard) size used by the cleaning program, the maximum number of server CPU cores, and/or the log directory of the cleaning program.
Further, deploying the cleaning program task in Step 4 includes: uploading the data package to be cleaned to the task scheduling server, configuring the task schedule, submitting it to the Spark cluster server, and monitoring the running state of the cleaning program.
Further, the evaluation indicators in Step 5 include the credibility of the data and the availability of the data.
Further, the content of the evaluation includes: the storage mode of the cleaned data, the accuracy of the cleaned data, whether the data contains redundancy, whether web access to the data meets the specified response time, and/or the format and content of the data after multi-table joins.
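A hypothetical shape for such a rule configuration, together with the single-table part of applying it (the schema, field names, and junk criterion are illustrative assumptions, not specified by the patent):

```python
# Hypothetical rule schema mirroring the configurable items above.
rules = {
    "dedupe_fields": ["user_id", "event_time"],   # duplicate-removal key
    "junk_field": "url",                          # field used to judge junk
    "junk_values": {"", None},                    # values marking a junk row
    "join": {"on": "user_id", "how": "left"},     # multi-table join config
    "select_fields": ["user_id", "event_time", "name"],
}

def apply_single_table_rules(records, rules):
    """Drop junk rows, then drop duplicates on the configured key fields."""
    seen, cleaned = set(), []
    for r in records:
        if r.get(rules["junk_field"]) in rules["junk_values"]:
            continue  # junk row
        key = tuple(r.get(f) for f in rules["dedupe_fields"])
        if key in seen:
            continue  # duplicate of an earlier row
        seen.add(key)
        cleaned.append(r)
    return cleaned
```

In a real deployment these rules would drive Spark transformations; here they are applied to plain dicts only to make the rule semantics concrete.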
Description of the drawings
Fig. 1 is a flow chart of the cleaning method based on large data volumes of the present invention.
Detailed description of the invention
To make the purpose, technical solution, and advantages of the present invention clearer, the invention is further described below with reference to the accompanying drawings. It should be understood that the specific embodiments described herein serve only to explain the invention and are not intended to limit it.
The present invention operates on data stored at large volume during data mining and web crawling, which contains duplicate data, junk data, and fields whose content must be completed from required fields. It uses the resources of the big-data cluster and the performance of the Spark cluster to comprehensively evaluate the accuracy and speed of the completed cleaning, then, relying on the distributed processing capability of the big-data platform, cleans the data and stores the processed data in HDFS and HBase respectively, so that data can be extracted according to different business types and directions, or served to web pages for display.
According to one embodiment of the present invention, a cleaning method based on large data volumes is provided. The cleaning program is written in Scala, and HBase key-value storage is used for non-relational data. After the program has been fully developed and has passed testing, it is deployed via task scheduling, exploiting the advantages of Spark distributed computation, to ensure that each day's data volume is cleaned. As shown in Fig. 1: first, cleaning rules are configured by business type based on the data in HDFS, Hive, or HBase; then, according to which business type and direction the cleaned data will serve, its storage mode is configured, for example HDFS, a Hive warehouse, or HBase; then, according to the size of the data to be cleaned, the Spark cluster server resources for the cleaning program are configured, including the server memory the program requires, the partition size it uses, the maximum number of CPU cores it requires, and its log directory, so that errors can be caught promptly; then, once the cleaning rules, data storage mode, and server resources have been configured, the cleaning program is deployed into the task schedule; finally, the cleaned data is evaluated, because the purpose of cleaning data is to meet customers' different demands and improve their approval, so evaluating the cleaned data becomes increasingly important.
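The resource configuration of Step 3 might map onto standard spark-submit options roughly as follows. The flags and configuration keys (`--executor-memory`, `spark.cores.max`, `spark.default.parallelism`, `spark.eventLog.dir`) are real Spark options, while the entry class, jar name, and values are illustrative assumptions:

```python
# Sketch: assemble a spark-submit command from the Step-3 resource settings.
def build_submit_command(jar, memory, max_cores, partitions, log_dir):
    return [
        "spark-submit",
        "--class", "com.example.CleaningJob",                 # hypothetical entry class
        "--executor-memory", memory,                          # server memory for the job
        "--conf", f"spark.cores.max={max_cores}",             # maximum CPU cores
        "--conf", f"spark.default.parallelism={partitions}",  # partition count
        "--conf", f"spark.eventLog.dir={log_dir}",            # log directory
        jar,
    ]

cmd = build_submit_command("cleaning.jar", "8g", 16, 200, "hdfs:///logs/clean")
```

The task scheduling server of Step 4 would invoke such a command on a timer and monitor the resulting application's state.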
According to one aspect of this embodiment, configuring the cleaning rules includes: configuring, for a single table, the field values by which duplicate data is removed; configuring, for a single table, the fields whose values are used to complete data content; configuring, for a single table, the field values by which data is judged to be junk; configuring, for multiple tables, the fields on which to join (for example a left join, right join, or match); configuring, for multiple tables, the conditions by which the joined data is filtered; and configuring, for multiple tables, the fields of the desired data after the join.
According to another aspect of this embodiment, evaluating the data cleansing essentially means evaluating the quality of the cleaned data, specifically: whether the data is stored according to the configured business class, the accuracy of the data, whether the data still contains redundancy, whether web access to the data meets the specified response time, and whether the format and content of the data after multi-table joins are consistent. Data quality assessment is a process of measuring and improving the aggregate characteristics of data in order to optimize its value; the difficulty in researching its indicators and methods lies in the meaning, content, and classification of those indicators. A data quality assessment should include at least the following two basic evaluation indicators:
First, the data must be credible to the user. Credibility includes indicators such as accuracy, completeness, consistency, validity, and uniqueness, as follows:
1. Accuracy: whether the data correctly describes the characteristics of its corresponding subject.
2. Completeness: whether the data has missing records or missing fields.
3. Consistency: whether the value of the same attribute of the same entity is consistent across different systems.
4. Validity: whether the data satisfies user-defined conditions or lies within a specified value range.
5. Uniqueness: whether the data contains duplicate records.
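Two of these credibility indicators, completeness and uniqueness, lend themselves to simple ratio metrics. A minimal sketch, with the record layout assumed rather than specified by the patent:

```python
def completeness(records, fields):
    """Share of (record, field) cells that are present and non-null."""
    total = len(records) * len(fields)
    filled = sum(1 for r in records for f in fields if r.get(f) is not None)
    return filled / total if total else 1.0

def uniqueness(records, key_fields):
    """Share of distinct keys among all records (1.0 means no duplicates)."""
    keys = [tuple(r.get(f) for f in key_fields) for r in records]
    return len(set(keys)) / len(keys) if keys else 1.0
```

Such ratios can be computed before and after cleaning to quantify the improvement step 5 is meant to verify.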
Second, the data must be available to the user. Availability includes indicators such as timeliness and stability, as follows:
1. Timeliness: whether the data is current or historical.
2. Stability: whether the data is stable, i.e. within its validity period.
The embodiments described above merely express implementations of the present invention, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent claims. It should be pointed out that a person of ordinary skill in the art may, without departing from the inventive concept, make various modifications and improvements, all of which fall within the protection scope of the present invention. The protection scope of this patent shall therefore be determined by the appended claims.
Claims (8)
1. A cleaning method based on large data volumes, characterized in that the method comprises the following steps:
Step 1: configure cleaning rules by business type, based on the data in HDFS or the Hive database;
Step 2: configure the storage mode of said data after cleaning, according to the purpose of said data;
Step 3: configure Spark cluster server resources according to the size of said data to be cleaned;
Step 4: deploy the cleaning program task;
Step 5: evaluate said data after cleaning.
2. The method according to claim 1, characterized in that said cleaning rules in Step 1 are: configuring the fields by which duplicate data is removed in a single table, configuring the fields by which content is completed in said single table, configuring the fields in a single table used to judge junk data, configuring the fields on which multiple tables are joined, configuring the conditions by which the joined data in multiple tables is filtered, and/or configuring the fields of the desired data after the multi-table join.
3. The method according to claim 2, characterized in that said join is a left join, a right join, or a match.
4. The method according to claim 1, characterized in that said storage mode in Step 2 is HDFS, Hive, or HBase.
5. The method according to claim 1, characterized in that said Spark cluster server resources in Step 3 include the memory size of said server, the partition size corresponding to said cleaning program, the maximum number of CPU cores of said server, and/or the log directory of said cleaning program.
6. The method according to claim 1, characterized in that said deploying of the cleaning program task in Step 4 includes: uploading the data package to be cleaned to the task scheduling server, configuring the task schedule, submitting it to said Spark cluster server, and monitoring the running state of the cleaning program.
7. The method according to claim 1, characterized in that the indicators of said evaluation in Step 5 include the credibility of said data and the availability of said data.
8. The method according to claim 7, characterized in that the content of said evaluation includes: the storage mode of said data after cleaning, the accuracy of said data after cleaning, whether said data contains redundancy, whether web access to said data meets the specified response time, and/or the format and content of said data after multi-table joins.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610647894.2A CN106202569A (en) | 2016-08-09 | 2016-08-09 | A kind of cleaning method based on big data quantity |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106202569A true CN106202569A (en) | 2016-12-07 |
Family
ID=57514771
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610647894.2A Pending CN106202569A (en) | 2016-08-09 | 2016-08-09 | A kind of cleaning method based on big data quantity |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106202569A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2013184677A1 (en) * | 2012-06-04 | 2013-12-12 | Google Inc. | Method and system for deleting obsolete files from a file system |
CN104317801A (en) * | 2014-09-19 | 2015-01-28 | 东北大学 | Data cleaning system and method for aiming at big data |
CN104680328A (en) * | 2015-03-16 | 2015-06-03 | 朗新科技股份有限公司 | Power grid construction quality monitoring method based on client perception values |
CN105138650A (en) * | 2015-08-28 | 2015-12-09 | 成都康赛信息技术有限公司 | Hadoop data cleaning method and system based on outlier mining |
Non-Patent Citations (1)
Title |
---|
Jin Hanwei: "Design and Implementation of a Big Data Cleaning Framework Based on Spark", China Master's Theses Full-text Database, Information Science and Technology *
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106897411A (en) * | 2017-02-20 | 2017-06-27 | 广东奡风科技股份有限公司 | ETL system and its method based on Spark technologies |
CN110019152A (en) * | 2017-07-27 | 2019-07-16 | 润泽科技发展有限公司 | A kind of big data cleaning method |
CN107633257A (en) * | 2017-08-15 | 2018-01-26 | 上海数据交易中心有限公司 | Data Quality Assessment Methodology and device, computer-readable recording medium, terminal |
CN107633257B (en) * | 2017-08-15 | 2020-04-17 | 上海数据交易中心有限公司 | Data quality evaluation method and device, computer readable storage medium and terminal |
CN109783341A (en) * | 2017-11-10 | 2019-05-21 | 阿里巴巴集团控股有限公司 | Regression testing method and device |
CN107908720A (en) * | 2017-11-14 | 2018-04-13 | 河北工程大学 | A kind of patent data cleaning method and system based on AdaBoost algorithms |
CN107832451A (en) * | 2017-11-23 | 2018-03-23 | 安徽科创智慧知识产权服务有限公司 | A kind of big data cleaning way of simplification |
CN107895032A (en) * | 2017-11-23 | 2018-04-10 | 安徽科创智慧知识产权服务有限公司 | Carry out the network data acquisition method that data are tentatively cleaned |
CN107943973A (en) * | 2017-11-28 | 2018-04-20 | 上海云信留客信息科技有限公司 | A kind of big data system for washing intelligently and cloud intelligent robot clean service platform |
CN110427356A (en) * | 2018-04-26 | 2019-11-08 | 中移(苏州)软件技术有限公司 | One parameter configuration method and equipment |
CN110427356B (en) * | 2018-04-26 | 2021-08-13 | 中移(苏州)软件技术有限公司 | Parameter configuration method and equipment |
CN109104732A (en) * | 2018-06-13 | 2018-12-28 | 珠海格力电器股份有限公司 | Data transmission method for uplink, device and intelligent electric appliance |
CN109033174A (en) * | 2018-06-21 | 2018-12-18 | 北京国网信通埃森哲信息技术有限公司 | A kind of power quality data cleaning method and device |
CN109753496A (en) * | 2018-11-27 | 2019-05-14 | 天聚地合(苏州)数据股份有限公司 | A kind of data cleaning method for big data |
CN113127460A (en) * | 2019-12-31 | 2021-07-16 | 北京懿医云科技有限公司 | Evaluation method of data cleaning frame, device, equipment and storage medium thereof |
CN113127460B (en) * | 2019-12-31 | 2023-11-17 | 北京懿医云科技有限公司 | Evaluation method of data cleaning frame, device, equipment and storage medium thereof |
CN112000656A (en) * | 2020-09-01 | 2020-11-27 | 北京天源迪科信息技术有限公司 | Intelligent data cleaning method and device based on metadata |
CN112486969A (en) * | 2020-12-01 | 2021-03-12 | 李孔雀 | Data cleaning method applied to big data and deep learning and cloud server |
CN112486969B (en) * | 2020-12-01 | 2021-08-03 | 罗嗣扬 | Data cleaning method applied to big data and deep learning and cloud server |
CN117056576A (en) * | 2023-10-13 | 2023-11-14 | 太极计算机股份有限公司 | Data quality flexible verification method based on big data platform |
CN117056576B (en) * | 2023-10-13 | 2024-04-05 | 太极计算机股份有限公司 | Data quality flexible verification method based on big data platform |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106202569A (en) | A kind of cleaning method based on big data quantity | |
CN104426713B (en) | The monitoring method and device of web site access effect data | |
CN103793465B (en) | Mass users behavior real-time analysis method and system based on cloud computing | |
CA2822900C (en) | Filtering queried data on data stores | |
US9612892B2 (en) | Creating a correlation rule defining a relationship between event types | |
CN105989129B (en) | Real time data statistical method and device | |
CN102946319B (en) | Networks congestion control information analysis system and analytical method thereof | |
CN104424229B (en) | A kind of calculation method and system that various dimensions are split | |
CA2947158A1 (en) | Systems, devices and methods for generating locality-indicative data representations of data streams, and compressions thereof | |
CN104424287B (en) | Data query method and apparatus | |
CN106339331B (en) | A kind of data buffer storage stratification scaling method based on user activity | |
US20060294220A1 (en) | Diagnostics and resolution mining architecture | |
CN104978324B (en) | Data processing method and device | |
CN108021651A (en) | Network public opinion risk assessment method and device | |
CN105224529A (en) | A kind of personalized recommendation method based on user browsing behavior and device | |
CN104081364A (en) | Collaborative caching | |
CN106992886A (en) | A kind of log analysis method and device based on distributed storage | |
CN107844402A (en) | A kind of resource monitoring method, device and terminal based on super fusion storage system | |
CN106372133A (en) | Big data-based user behavior analysis processing method and system | |
CN103518200B (en) | Determine the unique visitor of network site | |
Han et al. | A comparative analysis on Weibo and Twitter | |
CN107704620A (en) | A kind of method, apparatus of file administration, equipment and storage medium | |
Wei et al. | Delle: Detecting latest local events from geotagged tweets | |
CN110019152A (en) | A kind of big data cleaning method | |
CN111459900B (en) | Big data life cycle setting method, device, storage medium and server |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20161207 |
RJ01 | Rejection of invention patent application after publication |