CN106294745A - Big data cleaning method and device - Google Patents
Big data cleaning method and device
- Publication number
- CN106294745A (application number CN201610652750.6A)
- Authority
- CN
- China
- Prior art keywords
- data
- cleaning
- carried out
- spark
- cleaning process
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
Abstract
The invention provides a big data cleaning method and device, belonging to the technical field of big data collation, which can significantly improve the speed and efficiency of data cleaning. The big data cleaning method includes: defining a cleaning process through configuration; parsing the cleaning process and converting it into atomic operations of Spark; submitting the cleaning task to a Spark cluster; and performing data cleaning by the Spark cluster.
Description
Technical field
The present invention relates to the technical field of big data collation, and in particular to a big data cleaning method and device.
Background
With the arrival of the big data era, the scale of data has become huge, the growth rate of data has accelerated, and the types and structures of data have become increasingly varied. How to turn big data into useful data, and how to mine the value contained in massive data, has become increasingly urgent and important.
Data cleaning is the first step in this work. It denoises big data, mainly by removing incomplete data, erroneous data, and duplicate data, thereby obtaining data with higher consistency.
In existing data cleaning technology, most cleaning programs are stand-alone (single-machine) programs, so cleaning speed and cleaning efficiency are low. Up to a certain data magnitude, automatic data cleaning can be achieved by computer technology. In the big data era, however, as data volume and the variety of data types increase, existing data cleaning technology can hardly meet current data cleaning demands.
Summary of the invention
In view of this, an object of the present invention is to provide a big data cleaning method and device that can significantly improve the speed and efficiency of data cleaning.
In a first aspect, an embodiment of the present invention provides a big data cleaning method, including:
defining a cleaning process through configuration;
parsing the cleaning process and converting the cleaning process into atomic operations of Spark;
submitting the cleaning task to a Spark cluster;
performing data cleaning by the Spark cluster.
With reference to the first aspect, an embodiment of the present invention provides a first possible implementation of the first aspect, wherein performing data cleaning by the Spark cluster specifically includes:
loading data from a data source;
cleaning the data using distributed parallel cleaning algorithms;
storing the result of the data cleaning.
With reference to the first aspect, an embodiment of the present invention provides a second possible implementation of the first aspect, wherein the cleaning algorithms include at least one of null-value handling, deduplication, and sorting.
With reference to the first aspect, an embodiment of the present invention provides a third possible implementation of the first aspect, wherein data flowing between multiple cleaning algorithms is passed through resilient distributed datasets.
With reference to the first aspect, an embodiment of the present invention provides a fourth possible implementation of the first aspect, wherein the data source is a database or a distributed file system.
With reference to the first aspect, an embodiment of the present invention provides a fifth possible implementation of the first aspect, wherein defining the cleaning process through configuration is specifically:
defining the cleaning process through configuration based on the JSON format.
In a second aspect, an embodiment of the present invention further provides a big data cleaning device, including:
a big data cleaning engine, configured to define a cleaning process through configuration, parse the cleaning process, convert the cleaning process into atomic operations of Spark, and submit the cleaning task to a Spark cluster;
a Spark cluster, configured to perform data cleaning.
With reference to the second aspect, an embodiment of the present invention provides a first possible implementation of the second aspect, wherein the Spark cluster is specifically configured to:
load data from a data source;
clean the data using distributed parallel cleaning algorithms;
store the result of the data cleaning.
With reference to the second aspect, an embodiment of the present invention provides a second possible implementation of the second aspect, wherein the device further includes a storage component for storing the result of the data cleaning.
With reference to the second aspect, an embodiment of the present invention provides a third possible implementation of the second aspect, wherein the cleaning algorithms include at least one of null-value handling, deduplication, and sorting.
The embodiments of the present invention bring the following beneficial effects. With the big data cleaning method and cleaning device provided by the embodiments of the present invention, the cleaning process is first defined through configuration, and then parsed and converted into atomic operations of Spark. After the cleaning task is submitted to the cluster of the big data analysis framework Spark, the Spark cluster performs the data cleaning. Because every step of every cleaning process has been converted into an atomic operation of Spark, each cleaning step carried out in the Spark cluster can be executed in a distributed, parallel manner. This significantly improves the speed of data cleaning, achieves fast and efficient data cleaning, and is better suited to the current big data environment.
Other features and advantages of the present invention will be set forth in the following description and will, in part, become apparent from the description or be understood by practicing the invention. The objects and other advantages of the invention are realized and attained by the structure particularly pointed out in the description, the claims, and the accompanying drawings.
To make the above objects, features, and advantages of the present invention more apparent, preferred embodiments are described in detail below with reference to the accompanying drawings.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings required by the embodiments are briefly introduced below. It should be understood that the following drawings show only certain embodiments of the present invention and should therefore not be regarded as limiting its scope; those of ordinary skill in the art can obtain other related drawings from these drawings without creative effort.
Fig. 1 shows a flow chart of a big data cleaning method provided by Embodiment one of the present invention;
Fig. 2 shows a schematic diagram of a big data cleaning device provided by Embodiment two of the present invention.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Therefore, the following detailed description of the embodiments provided in the drawings is not intended to limit the scope of the claimed invention, but merely represents selected embodiments. All other embodiments obtained by those skilled in the art on the basis of the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
In current data cleaning technology, most cleaning programs are stand-alone programs, so cleaning speed and cleaning efficiency are low, and it is difficult to meet the demands of data cleaning in the current big data environment.
On this basis, the big data cleaning method and device provided by the embodiments of the present invention can significantly improve the speed of data cleaning, achieve fast and efficient data cleaning, and are better suited to the current big data environment.
Embodiment one:
As shown in Fig. 1, an embodiment of the present invention provides a big data cleaning method, which mainly includes the following steps:
S1: define the cleaning process through configuration.
Specifically, the data cleaning engine is started and first loads the cleaning-process configuration file. The cleaning process is defined through configuration based on the JSON (JavaScript Object Notation) format. An example of the configuration items is as follows:
JSON is a lightweight data interchange format based on a subset of ECMAScript. JSON uses a text format that is completely independent of any programming language, which makes it an ideal data interchange language: it is easy for people to read and write, easy for machines to parse and generate, and well suited to cross-platform data transmission.
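As an illustration of what a JSON-based cleaning-process definition could look like, the following is a minimal Python sketch that parses one and checks its sections. The key names (`source`, `steps`, `result`, `op`), the path, and the table name are hypothetical; the text does not specify a schema.

```python
import json

# Hypothetical cleaning-process configuration. Every key and value here is an
# illustrative assumption, not a schema taken from the description.
CONFIG_TEXT = """
{
  "source": {"type": "hdfs", "path": "/data/raw/records.csv"},
  "steps": [
    {"op": "drop_nulls", "fields": ["id", "name"]},
    {"op": "deduplicate", "keys": ["id"]},
    {"op": "sort", "by": "id", "ascending": true}
  ],
  "result": {"type": "database", "table": "clean_records"}
}
"""


def load_cleaning_config(text):
    """Parse the JSON text and check that the expected sections are present."""
    config = json.loads(text)
    for section in ("source", "steps", "result"):
        if section not in config:
            raise ValueError(f"missing configuration section: {section}")
    return config


config = load_cleaning_config(CONFIG_TEXT)
# The order of the "steps" array is the order in which cleaning would run.
step_names = [step["op"] for step in config["steps"]]
```

Keeping the steps in an ordered array means the sequence of the cleaning process is carried by the configuration itself rather than by program code.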
S2: parse the cleaning process and convert it into atomic operations of Spark.
The big data cleaning engine parses the cleaning process according to the definitions in the configuration file and converts each cleaning step into an atomic operation of Spark.
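The parsing step can be sketched in plain Python as a lookup from configured step names to small callables. This is only a stand-in under stated assumptions: ordinary Python functions over lists of dictionaries take the place of Spark atomic operations, and the step names and parameters (`drop_nulls`, `sort`, `fields`, `by`) are hypothetical.

```python
def make_drop_nulls(step):
    """Build an operation that drops rows with a missing value in any listed field."""
    fields = step["fields"]
    return lambda rows: [
        r for r in rows if all(r.get(f) is not None for f in fields)
    ]


def make_sort(step):
    """Build an operation that sorts rows by one field."""
    key, ascending = step["by"], step.get("ascending", True)
    return lambda rows: sorted(rows, key=lambda r: r[key], reverse=not ascending)


# Hypothetical mapping from a configured step name to its operation factory.
OPERATION_FACTORIES = {"drop_nulls": make_drop_nulls, "sort": make_sort}


def parse_cleaning_flow(steps):
    """Translate configured steps into an ordered list of callables."""
    operations = []
    for step in steps:
        factory = OPERATION_FACTORIES.get(step["op"])
        if factory is None:
            raise ValueError(f"unknown cleaning step: {step['op']}")
        operations.append(factory(step))
    return operations


def run_flow(operations, rows):
    """Apply the operations in the order the flow defines them."""
    for operation in operations:
        rows = operation(rows)
    return rows
```

In a real Spark job each callable would presumably wrap a transformation on a distributed dataset instead of a list comprehension; the translation from configuration to an ordered list of operations is the part this sketch is meant to show.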
Spark is a general-purpose parallel computing framework that became popular in the cloud computing field after Hadoop. It is a scalable cluster data analysis platform based on in-memory computing, and it has a performance advantage over the storage approach of a Hadoop cluster. Spark is built on in-memory distributed datasets and optimizes iterative workloads and interactive queries, thereby improving the speed and efficiency of big data computation.
S3: submit the cleaning task to the Spark cluster.
First, the Spark cluster is initialized and the Spark context is loaded, in preparation for submitting cleaning operations. Then, in the order defined by the cleaning process, the specific data cleaning operations are submitted to the Spark cluster.
S4: perform data cleaning by the Spark cluster.
S41: load data from the data source.
The data source can be of different types. In this embodiment, the data source is a database or a distributed file system (Hadoop Distributed File System, HDFS for short).
In other embodiments, the data sources can be extended according to the specific business. Adding a data source type only requires adding the corresponding atomic data-loading operation, and the loading of the data source is likewise processed in a distributed, parallel manner.
S42: clean the data using distributed parallel cleaning algorithms.
This embodiment illustrates three cleaning algorithms: null-value handling, deduplication, and sorting.
As a preferred scheme, data flowing between multiple cleaning algorithms is passed through resilient distributed datasets (Resilient Distributed Datasets, RDD for short). Because Spark is built on the unified RDD abstraction, a Spark cluster can handle different big data processing scenarios in a largely uniform way, including MapReduce, Streaming, SQL, Machine Learning, and Graph workloads. An RDD is a fault-tolerant, parallel data structure that lets users explicitly keep data on disk or in memory and control how the data is partitioned. RDDs also provide a rich set of operations on the data, such as map, flatMap, filter, join, groupBy, and reduceByKey.
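To make the idea of a partitioned, parallel cleaning step concrete, here is a small Python sketch of deduplication. It is not Spark code: threads stand in for cluster executors (and give no real speedup in CPython), and the function names and the hash-partitioning scheme are assumptions chosen for illustration. Rows are partitioned by key so that duplicates always land in the same partition, each partition is deduplicated independently, and the results are concatenated.

```python
from concurrent.futures import ThreadPoolExecutor


def partition_by_key(rows, key, num_partitions):
    """Assign each row to a partition by hashing its key value."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(row[key]) % num_partitions].append(row)
    return partitions


def dedupe_partition(rows, key):
    """Keep the first row seen for each key value within one partition."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


def parallel_deduplicate(rows, key, num_partitions=4):
    """Deduplicate every partition concurrently, then concatenate the results."""
    partitions = partition_by_key(rows, key, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        cleaned = pool.map(lambda part: dedupe_partition(part, key), partitions)
        return [row for part in cleaned for row in part]
```

Because equal keys hash to the same partition, no cross-partition comparison is needed after the parallel phase. The output order follows the partition order, so a sorting step would normally come afterwards.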
In other embodiments, the cleaning algorithms are not limited to the three above and can be extended according to actual business needs. Adding an algorithm only requires adding a method and specifying it in the configuration definition, after which it can be applied.
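One way such an extension point could work is a name-to-function registry, sketched below in plain Python. The decorator, the registry name, and the `strip_whitespace` algorithm are all hypothetical; they only illustrate "add one method, then name it in the configuration".

```python
# Hypothetical registry of cleaning algorithms, keyed by the name used in the
# configuration definition.
CLEANING_ALGORITHMS = {}


def cleaning_algorithm(name):
    """Decorator that registers a function under a configuration name."""
    def decorator(func):
        CLEANING_ALGORITHMS[name] = func
        return func
    return decorator


@cleaning_algorithm("drop_nulls")
def drop_nulls(rows, fields):
    """Drop rows that have a missing value in any of the listed fields."""
    return [r for r in rows if all(r.get(f) is not None for f in fields)]


# A newly added algorithm: one function plus one registration line.
@cleaning_algorithm("strip_whitespace")
def strip_whitespace(rows, fields):
    """Trim leading and trailing whitespace from the listed string fields."""
    return [
        {**r, **{f: r[f].strip() for f in fields if isinstance(r.get(f), str)}}
        for r in rows
    ]


def apply_steps(rows, steps):
    """Run the configured steps in order, passing the other keys as parameters."""
    for step in steps:
        params = {k: v for k, v in step.items() if k != "op"}
        rows = CLEANING_ALGORITHMS[step["op"]](rows, **params)
    return rows
```

With this layout the driver loop in `apply_steps` never changes when an algorithm is added; only the registry and the configuration do.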
S43: store the result of the data cleaning.
The data storage engine, that is, the program that stores the cleaned result data, is started. The data storage engine selects the storage mode according to the result definition in the configuration definition. In this embodiment, results can be stored in a database or in HDFS. In other embodiments, other storage modes can be added.
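Selecting a storage mode from the result definition can be sketched as a dispatch table. In the Python sketch below, SQLite stands in for the database and a local JSON-lines file stands in for HDFS; the `type`, `path`, and `table` keys and the fixed `(id, name)` row layout are assumptions made for illustration.

```python
import json
import sqlite3


def store_to_database(rows, result_def):
    """Write (id, name) rows to a SQLite table; SQLite stands in for the database."""
    table = result_def["table"]
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table}")
    conn = sqlite3.connect(result_def["path"])
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                f"CREATE TABLE IF NOT EXISTS {table} (id INTEGER, name TEXT)"
            )
            conn.executemany(
                f"INSERT INTO {table} VALUES (?, ?)",
                [(r["id"], r["name"]) for r in rows],
            )
    finally:
        conn.close()


def store_to_file(rows, result_def):
    """Write rows as JSON lines; a local file stands in for HDFS."""
    with open(result_def["path"], "w", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(row) + "\n")


# Hypothetical mapping from the configured result type to a writer.
STORAGE_MODES = {"database": store_to_database, "file": store_to_file}


def store_result(rows, result_def):
    """Pick the writer named by the result definition and run it."""
    writer = STORAGE_MODES.get(result_def["type"])
    if writer is None:
        raise ValueError(f"unknown storage mode: {result_def['type']}")
    writer(rows, result_def)
```

Adding another storage mode then amounts to writing one more writer function and adding one entry to `STORAGE_MODES`.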
In the big data cleaning method provided by the embodiment of the present invention, the cleaning process is first defined through configuration, and then parsed and converted into atomic operations of Spark. After the cleaning task is submitted to the cluster of the big data analysis framework Spark, the open-source Spark cluster performs the data cleaning, and finally the result of the data cleaning is stored. Because every step of every cleaning process has been converted into an atomic operation of Spark, each cleaning step carried out in the Spark cluster can be executed in a distributed, parallel manner. This significantly improves the speed of data cleaning, achieves fast and efficient data cleaning, and is better suited to the current big data environment.
In addition, the Spark cluster supports extension well. Defining the process through configuration reduces the coupling of the program, so that adding or removing a cleaning algorithm requires only minimal changes.
Embodiment two:
As shown in Fig. 2, an embodiment of the present invention provides a big data cleaning device, including a big data cleaning engine 1 and a Spark cluster 2.
The big data cleaning engine 1 is configured to define the cleaning process through configuration, parse the cleaning process, convert the cleaning process into atomic operations of Spark, and submit the cleaning task to the Spark cluster. The Spark cluster 2 is configured to perform data cleaning.
Specifically, after the big data cleaning engine 1 starts, it first loads the cleaning-process configuration file and defines the cleaning process through configuration based on the JSON format. The big data cleaning engine 1 then parses the cleaning process according to the definitions in the configuration file and converts each cleaning step into an atomic operation of Spark.
After the Spark cluster 2 is initialized, the big data cleaning engine 1 submits the specific data cleaning operations to the Spark cluster in the order defined by the cleaning process.
After the Spark cluster 2 receives the data cleaning operations, it first loads data from the data source 4. The data source 4 can be of different types; in this embodiment, the data source 4 includes a database or HDFS. In other embodiments, the data sources can be extended according to the specific business. Adding a data source type only requires adding the corresponding atomic data-loading operation, and the loading of the data source is likewise processed in a distributed, parallel manner.
The Spark cluster 2 then cleans the data using distributed parallel cleaning algorithms. The cleaning algorithms in this embodiment include null-value handling, deduplication, and sorting. In other embodiments, the cleaning algorithms are not limited to these three and can be extended according to actual business needs. Adding an algorithm only requires adding a method and specifying it in the configuration definition, after which it can be applied.
As a preferred scheme, data flowing between multiple cleaning algorithms is passed through RDDs. Because Spark is built on the unified RDD abstraction, a Spark cluster can handle different big data processing scenarios in a largely uniform way, including MapReduce, Streaming, SQL, Machine Learning, and Graph workloads. An RDD is a fault-tolerant, parallel data structure that lets users explicitly keep data on disk or in memory and control how the data is partitioned. RDDs also provide a rich set of operations on the data, such as map, flatMap, filter, join, groupBy, and reduceByKey.
Finally, the Spark cluster 2 starts the data storage engine and stores the result of the data cleaning. The big data cleaning device provided by the embodiment of the present invention further includes a storage component 3 for storing the result of the data cleaning.
The data storage engine selects the storage mode according to the result definition in the configuration definition. In this embodiment, results can be stored in a database or in HDFS. In other embodiments, other storage modes can be added.
In the big data cleaning device provided by the embodiment of the present invention, the big data cleaning engine 1 defines the cleaning process through configuration, and then parses the cleaning process and converts it into atomic operations of Spark. After the cleaning task is submitted to the cluster 2 of the big data analysis framework Spark, the open-source Spark cluster 2 performs the data cleaning, and finally the result of the data cleaning is stored in the storage component 3. Because every step of every cleaning process has been converted into an atomic operation of Spark, each cleaning step carried out in the Spark cluster 2 can be executed in a distributed, parallel manner. This significantly improves the speed of data cleaning, achieves fast and efficient data cleaning, and is better suited to the current big data environment.
In addition, the Spark cluster 2 supports extension well. Defining the process through configuration reduces the coupling of the program, so that adding or removing a cleaning algorithm requires only minimal changes.
If the functions are implemented in the form of software functional units and sold or used as an independent product, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), a magnetic disk, or an optical disc.
The above are only specific embodiments of the present invention, but the scope of protection of the present invention is not limited thereto. Any change or replacement that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be determined by the scope of the claims.
Claims (10)
1. A big data cleaning method, characterized by including:
defining a cleaning process through configuration;
parsing the cleaning process and converting the cleaning process into atomic operations of Spark;
submitting the cleaning task to a Spark cluster;
performing data cleaning by the Spark cluster.
2. The method according to claim 1, characterized in that performing data cleaning by the Spark cluster specifically includes:
loading data from a data source;
cleaning the data using distributed parallel cleaning algorithms;
storing the result of the data cleaning.
3. The method according to claim 2, characterized in that the cleaning algorithms include at least one of null-value handling, deduplication, and sorting.
4. The method according to claim 3, characterized in that data flowing between multiple cleaning algorithms is passed through resilient distributed datasets.
5. The method according to claim 2, characterized in that the data source is a database or a distributed file system.
6. The method according to claim 1, characterized in that defining the cleaning process through configuration is specifically:
defining the cleaning process through configuration based on the JSON format.
7. A big data cleaning device, characterized by including:
a big data cleaning engine, configured to define a cleaning process through configuration, parse the cleaning process, convert the cleaning process into atomic operations of Spark, and submit the cleaning task to a Spark cluster;
a Spark cluster, configured to perform data cleaning.
8. The device according to claim 7, characterized in that the Spark cluster is specifically configured to:
load data from a data source;
clean the data using distributed parallel cleaning algorithms;
store the result of the data cleaning.
9. The device according to claim 8, characterized by further including a storage component for storing the result of the data cleaning.
10. The device according to claim 8, characterized in that the cleaning algorithms include at least one of null-value handling, deduplication, and sorting.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610652750.6A CN106294745A (en) | 2016-08-10 | 2016-08-10 | Big data cleaning method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106294745A true CN106294745A (en) | 2017-01-04 |
Family
ID=57667952
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610652750.6A Pending CN106294745A (en) | 2016-08-10 | 2016-08-10 | Big data cleaning method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106294745A (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107688592A (en) * | 2017-04-06 | 2018-02-13 | 平安科技(深圳)有限公司 | The method and terminal of data cleansing |
CN107832451A (en) * | 2017-11-23 | 2018-03-23 | 安徽科创智慧知识产权服务有限公司 | A kind of big data cleaning way of simplification |
CN109684082A (en) * | 2018-12-11 | 2019-04-26 | 中科恒运股份有限公司 | The data cleaning method and system of rule-based algorithm |
CN109753496A (en) * | 2018-11-27 | 2019-05-14 | 天聚地合(苏州)数据股份有限公司 | A kind of data cleaning method for big data |
CN110019152A (en) * | 2017-07-27 | 2019-07-16 | 润泽科技发展有限公司 | A kind of big data cleaning method |
CN110427356A (en) * | 2018-04-26 | 2019-11-08 | 中移(苏州)软件技术有限公司 | One parameter configuration method and equipment |
CN110502509A (en) * | 2019-08-27 | 2019-11-26 | 广东工业大学 | A kind of traffic big data cleaning method and relevant apparatus based on Hadoop Yu Spark frame |
WO2020211299A1 (en) * | 2019-04-17 | 2020-10-22 | 苏宁云计算有限公司 | Data cleansing method |
CN113377829A (en) * | 2021-05-14 | 2021-09-10 | 中国民生银行股份有限公司 | Big data statistical method and device |
WO2021174791A1 (en) * | 2020-03-05 | 2021-09-10 | 百度在线网络技术(北京)有限公司 | Task migration method and apparatus, and electronic device and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103177094A (en) * | 2013-03-14 | 2013-06-26 | 成都康赛电子科大信息技术有限责任公司 | Cleaning method of data of internet of things |
CN103593352A (en) * | 2012-08-15 | 2014-02-19 | 阿里巴巴集团控股有限公司 | Method and device for cleaning mass data |
CN104680328A (en) * | 2015-03-16 | 2015-06-03 | 朗新科技股份有限公司 | Power grid construction quality monitoring method based on client perception values |
CN105138650A (en) * | 2015-08-28 | 2015-12-09 | 成都康赛信息技术有限公司 | Hadoop data cleaning method and system based on outlier mining |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107688592B (en) * | 2017-04-06 | 2020-03-17 | 平安科技(深圳)有限公司 | Data cleaning method and terminal |
WO2018184418A1 (en) * | 2017-04-06 | 2018-10-11 | 平安科技(深圳)有限公司 | Data cleaning method, terminal and computer readable storage medium |
CN107688592A (en) * | 2017-04-06 | 2018-02-13 | 平安科技(深圳)有限公司 | The method and terminal of data cleansing |
CN110019152A (en) * | 2017-07-27 | 2019-07-16 | 润泽科技发展有限公司 | A kind of big data cleaning method |
CN107832451A (en) * | 2017-11-23 | 2018-03-23 | 安徽科创智慧知识产权服务有限公司 | A kind of big data cleaning way of simplification |
CN110427356A (en) * | 2018-04-26 | 2019-11-08 | 中移(苏州)软件技术有限公司 | One parameter configuration method and equipment |
CN110427356B (en) * | 2018-04-26 | 2021-08-13 | 中移(苏州)软件技术有限公司 | Parameter configuration method and equipment |
CN109753496A (en) * | 2018-11-27 | 2019-05-14 | 天聚地合(苏州)数据股份有限公司 | A kind of data cleaning method for big data |
CN109684082A (en) * | 2018-12-11 | 2019-04-26 | 中科恒运股份有限公司 | The data cleaning method and system of rule-based algorithm |
WO2020211299A1 (en) * | 2019-04-17 | 2020-10-22 | 苏宁云计算有限公司 | Data cleansing method |
CN110502509A (en) * | 2019-08-27 | 2019-11-26 | 广东工业大学 | A kind of traffic big data cleaning method and relevant apparatus based on Hadoop Yu Spark frame |
CN110502509B (en) * | 2019-08-27 | 2023-04-18 | 广东工业大学 | Traffic big data cleaning method based on Hadoop and Spark framework and related device |
WO2021174791A1 (en) * | 2020-03-05 | 2021-09-10 | 百度在线网络技术(北京)有限公司 | Task migration method and apparatus, and electronic device and storage medium |
US11822957B2 (en) | 2020-03-05 | 2023-11-21 | Baidu Online Network Technology (Beijing) Co., Ltd. | Task migration method, apparatus, electronic device and storage medium |
CN113377829A (en) * | 2021-05-14 | 2021-09-10 | 中国民生银行股份有限公司 | Big data statistical method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106294745A (en) | Big data cleaning method and device | |
JP6542785B2 (en) | Implementation of semi-structured data as first class database element | |
US10318882B2 (en) | Optimized training of linear machine learning models | |
US11226963B2 (en) | Method and system for executing queries on indexed views | |
CN106649670A (en) | Streaming computing-based data monitoring method and apparatus | |
CN105740424A (en) | Spark platform based high efficiency text classification method | |
CN103440288A (en) | Big data storage method and device | |
CN102880709A (en) | Data warehouse management system and data warehouse management method | |
CN103514205A (en) | Mass data processing method and system | |
CN109033109A (en) | Data processing method and system | |
Gu et al. | Chronos: An elastic parallel framework for stream benchmark generation and simulation | |
CN109684319A (en) | Data clean system, method, apparatus and storage medium | |
Singh et al. | Spatial data analysis with ArcGIS and MapReduce | |
Mehmood et al. | Distributed real-time ETL architecture for unstructured big data | |
CN106570173A (en) | High-dimensional sparse text data clustering method based on Spark | |
Mondal et al. | Casqd: continuous detection of activity-based subgraph pattern queries on dynamic graphs | |
CN110018997B (en) | Mass small file storage optimization method based on HDFS | |
Papadakis et al. | Blocking for large-scale entity resolution: Challenges, algorithms, and practical examples | |
US20220335270A1 (en) | Knowledge graph compression | |
CN108287889B (en) | A kind of multi-source heterogeneous date storage method and system based on elastic table model | |
CN110019152A (en) | A kind of big data cleaning method | |
CN103699627B (en) | A kind of super large file in parallel data block localization method based on Hadoop clusters | |
CN108319604A (en) | The associated optimization method of size table in a kind of hive | |
CN106599244B (en) | General original log cleaning device and method | |
Sinthong et al. | AFrame: Extending DataFrames for large-scale modern data analysis (Extended Version) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20170104 |