CN106294745A - Big data cleaning method and device - Google Patents

Big data cleaning method and device

Info

Publication number
CN106294745A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610652750.6A
Other languages
Chinese (zh)
Inventor
赵伟伟
张丛喆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netposa Technologies Ltd
Original Assignee
Netposa Technologies Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2016-08-10
Filing date
2016-08-10
Publication date
2017-01-04
Application filed by Netposa Technologies Ltd
Priority to CN201610652750.6A
Publication of CN106294745A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21: Design, administration or maintenance of databases
    • G06F16/215: Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Abstract

The invention provides a big data cleaning method and device, belonging to the technical field of big data processing, which can significantly improve the speed and efficiency of data cleaning. The big data cleaning method includes: defining a cleaning process through configuration; parsing the cleaning process and converting it into atomic operations of Spark; submitting the cleaning tasks to a Spark cluster; and performing data cleaning by the Spark cluster.

Description

Big data cleaning method and device
Technical field
The present invention relates to the technical field of big data processing, and in particular to a big data cleaning method and device.
Background
With the arrival of the big data era, data has become huge in scale, its growth rate has accelerated, and its types and structures have become increasingly diverse. How to turn big data into useful data, and how to mine the value contained in huge volumes of data, has become more and more urgent and important.
Data cleaning is the first step of this work. It reduces the noise in big data, mainly by removing incomplete data, erroneous data, and duplicate data, thereby obtaining data with higher consistency.
In existing data cleaning technology, most cleaning programs are stand-alone programs, so cleaning speed and cleaning efficiency are low. Up to a certain data magnitude, automatic data cleaning can be achieved with computer technology. In the big data era, however, as data volume and the number of data types grow, existing data cleaning technology can hardly meet current data cleaning demands.
Summary of the invention
In view of this, an object of the present invention is to provide a big data cleaning method and device that can significantly improve the speed and efficiency of data cleaning.
In a first aspect, an embodiment of the present invention provides a big data cleaning method, including:
defining a cleaning process through configuration;
parsing the cleaning process and converting it into atomic operations of Spark;
submitting the cleaning tasks to a Spark cluster;
performing data cleaning by the Spark cluster.
With reference to the first aspect, an embodiment of the present invention provides a first possible implementation of the first aspect, in which performing data cleaning by the Spark cluster specifically includes:
loading data from a data source;
cleaning the data using distributed parallel cleaning algorithms;
storing the result of the data cleaning.
With reference to the first aspect, an embodiment of the present invention provides a second possible implementation of the first aspect, in which the cleaning algorithms include at least one of null value processing, deduplication processing, and sorting processing.
With reference to the first aspect, an embodiment of the present invention provides a third possible implementation of the first aspect, in which data flowing between multiple cleaning algorithms is passed through resilient distributed datasets.
With reference to the first aspect, an embodiment of the present invention provides a fourth possible implementation of the first aspect, in which the data source is a database or a distributed file system.
With reference to the first aspect, an embodiment of the present invention provides a fifth possible implementation of the first aspect, in which defining the cleaning process through configuration is specifically:
defining the cleaning process through configuration based on the JSON format.
In a second aspect, an embodiment of the present invention further provides a big data cleaning device, including:
a big data cleaning engine, used to define a cleaning process through configuration, parse the cleaning process, convert it into atomic operations of Spark, and submit the cleaning tasks to a Spark cluster;
a Spark cluster, used to perform data cleaning.
With reference to the second aspect, an embodiment of the present invention provides a first possible implementation of the second aspect, in which the Spark cluster is specifically used to:
load data from a data source;
clean the data using distributed parallel cleaning algorithms;
store the result of the data cleaning.
With reference to the second aspect, an embodiment of the present invention provides a second possible implementation of the second aspect, in which the device further includes a storage component used to store the result of the data cleaning.
With reference to the second aspect, an embodiment of the present invention provides a third possible implementation of the second aspect, in which the cleaning algorithms include at least one of null value processing, deduplication processing, and sorting processing.
The embodiments of the present invention bring the following beneficial effects. With the big data cleaning method and cleaning device provided by the embodiments, the cleaning process is first defined through configuration, then parsed and converted into atomic operations of Spark. After the cleaning tasks are submitted to the Spark cluster, a big data analysis framework, the Spark cluster performs the data cleaning. Because every step of every cleaning process has been converted into atomic operations of Spark, each cleaning step carried out in the Spark cluster can be executed in a distributed, parallel manner. This significantly improves the cleaning speed, achieves high-speed and efficient data cleaning, and is better suited to the current big data environment.
Other features and advantages of the present invention will be set forth in the following description and will in part become apparent from the description or be understood by practicing the invention. The objects and other advantages of the invention are realized and obtained by the structures particularly pointed out in the description, the claims, and the accompanying drawings.
To make the above objects, features, and advantages of the present invention more apparent and easier to understand, preferred embodiments are described in detail below with reference to the accompanying drawings.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly introduced below. It should be understood that the following drawings show only certain embodiments of the present invention and should therefore not be regarded as limiting its scope. Those of ordinary skill in the art can obtain other related drawings from these drawings without creative effort.
Fig. 1 shows a flow chart of a big data cleaning method provided by Embodiment one of the present invention;
Fig. 2 shows a schematic diagram of a big data cleaning device provided by Embodiment two of the present invention.
Detailed description
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Therefore, the following detailed description of the embodiments provided in the drawings is not intended to limit the scope of the claimed invention but merely represents selected embodiments. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
In current data cleaning technology, most cleaning programs are stand-alone programs, so cleaning speed and efficiency are low, and it is difficult to meet the data cleaning demands of the current big data environment.
On this basis, the big data cleaning method and device provided by the embodiments of the present invention can significantly improve the cleaning speed, achieve high-speed and efficient data cleaning, and are better suited to the current big data environment.
Embodiment one:
As shown in Fig. 1, an embodiment of the present invention provides a big data cleaning method, which mainly includes the following steps:
S1: Define the cleaning process through configuration.
Specifically, the data cleaning engine is started and first loads the cleaning process configuration file; the cleaning process is defined through configuration based on the JSON (JavaScript Object Notation) format. (The configuration item example referenced in the original is not reproduced in this text.)
JSON is a lightweight data interchange format based on a subset of ECMAScript. It uses a text format that is completely independent of any programming language, which makes it a good data interchange language: it is easy for people to read and write, easy for machines to parse and generate, and has a clear advantage for cross-platform data transmission.
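As a purely hypothetical illustration of what a JSON-based cleaning process definition could look like, the sketch below parses a small configuration and checks its basic shape. The patent's own configuration items are not available here, so every key name (`source`, `steps`, `sink`, `op`, and so on) and the helper `load_flow` are assumptions made for this example only.

```python
import json

# Hypothetical cleaning-process definition; all key names are assumptions.
FLOW_JSON = """
{
  "source": {"type": "hdfs", "path": "/data/raw/events"},
  "steps": [
    {"op": "drop_nulls", "fields": ["id", "name"]},
    {"op": "dedupe", "keys": ["id"]},
    {"op": "sort", "by": "id", "ascending": true}
  ],
  "sink": {"type": "database", "table": "events_clean"}
}
"""


def load_flow(text):
    """Parse a flow definition and check that the required sections exist."""
    flow = json.loads(text)
    for section in ("source", "steps", "sink"):
        if section not in flow:
            raise ValueError(f"missing section: {section}")
    for index, step in enumerate(flow["steps"]):
        if "op" not in step:
            raise ValueError(f"step {index} has no 'op' field")
    return flow


flow = load_flow(FLOW_JSON)
# The order of the steps in the file is the order in which they would run.
step_names = [step["op"] for step in flow["steps"]]
print(step_names)
```

Keeping the flow in a data file like this is what lets steps be added, removed, or reordered without touching the engine code.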
S2: Parse the cleaning process and convert it into atomic operations of Spark.
The big data cleaning engine parses the cleaning process according to the definition information in the configuration file and converts each cleaning step into atomic operations of Spark.
Spark is a popular general-purpose parallel computing framework that followed Hadoop in the cloud computing field. It is a scalable cluster data analysis platform based on in-memory computing and has a performance advantage over Hadoop's cluster storage approach. Spark is built on in-memory distributed datasets and optimizes iterative workloads and interactive queries, thereby improving the speed and efficiency of big data computation.
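The parsing step can be pictured as a lookup from configured step names to Spark operations. The following is a minimal sketch under stated assumptions: the patent does not say which Spark operations each cleaning step maps to, so the table below (using the RDD operation names `filter`, `distinct`, and `sortBy`) and the step names are hypothetical, and the code only builds a plan without calling Spark.

```python
# Assumed mapping from configured step names to RDD operation names.
# The patent does not specify this table; it is illustrative only.
STEP_TO_RDD_OP = {
    "drop_nulls": "filter",
    "dedupe": "distinct",
    "sort": "sortBy",
}


def to_atomic_plan(steps):
    """Translate configured steps into an ordered list of (operation, parameters)."""
    plan = []
    for step in steps:
        name = step["op"]
        if name not in STEP_TO_RDD_OP:
            raise ValueError(f"unknown cleaning step: {name}")
        # Everything except the step name is passed along as parameters.
        params = {key: value for key, value in step.items() if key != "op"}
        plan.append((STEP_TO_RDD_OP[name], params))
    return plan


steps = [
    {"op": "drop_nulls", "fields": ["id"]},
    {"op": "dedupe"},
    {"op": "sort", "by": "id"},
]
plan = to_atomic_plan(steps)
for operation, params in plan:
    print(operation, params)
```

Rejecting unknown step names at parse time means a bad configuration fails before anything is submitted to the cluster.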
S3: Submit the cleaning tasks to the Spark cluster.
First, the Spark cluster is initialized and the Spark context is loaded in preparation for submitting cleaning jobs. Then, following the order defined in the cleaning process, the specific data cleaning jobs are submitted to the Spark cluster.
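The patent does not describe how submission is carried out. One common route is the `spark-submit` command-line tool, and the sketch below only assembles such a command without running it. The options `--master`, `--deploy-mode`, and `--name` are standard `spark-submit` options; the script name `clean_job.py`, the configuration file `flow.json`, and the function `build_submit_command` are hypothetical.

```python
import shlex


def build_submit_command(master, job_script, flow_path,
                         deploy_mode="cluster", app_name="data-cleaning"):
    """Assemble a spark-submit argument list for one cleaning job."""
    return [
        "spark-submit",
        "--master", master,
        "--deploy-mode", deploy_mode,
        "--name", app_name,
        job_script,   # hypothetical driver script that runs the cleaning flow
        flow_path,    # hypothetical path to the JSON flow definition
    ]


cmd = build_submit_command("yarn", "clean_job.py", "flow.json")
# Printing the command keeps this sketch free of side effects.
print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell quoting problems when it is later passed to a process launcher.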
S4: Perform data cleaning by the Spark cluster.
S41: Load data from the data source.
The data source can be of different types; in this embodiment it is a database or a distributed file system (Hadoop Distributed File System, HDFS).
In other implementations, data sources can be extended according to the specific business. Extending the data source types only requires adding the corresponding atomic data-loading operation, and the loading of data from the source is also processed in a distributed, parallel manner.
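The idea that a new data source type only needs its own loading operation can be sketched as a small registry. This is an assumption-laden illustration, not the patent's implementation: the registry `LOADERS`, the decorator `loader`, and the `csv_text` source type are all hypothetical, and the database and HDFS loaders are left as stubs because the real ones would call Spark's readers.

```python
import csv
import io

LOADERS = {}  # maps a source type name to its loading function


def loader(source_type):
    """Register a loading function under a source type name."""
    def register(func):
        LOADERS[source_type] = func
        return func
    return register


@loader("database")
def load_from_database(cfg):
    raise NotImplementedError("stub: a real loader would read through Spark")


@loader("hdfs")
def load_from_hdfs(cfg):
    raise NotImplementedError("stub: a real loader would read through Spark")


# Extending the supported sources means adding one function and nothing else.
@loader("csv_text")
def load_from_csv_text(cfg):
    return list(csv.DictReader(io.StringIO(cfg["text"])))


def load(source_cfg):
    """Dispatch to the loader named by the configuration's 'type' field."""
    source_type = source_cfg["type"]
    if source_type not in LOADERS:
        raise ValueError(f"unsupported data source: {source_type}")
    return LOADERS[source_type](source_cfg)


rows = load({"type": "csv_text", "text": "id,name\n1,a\n2,b\n"})
print(rows)
```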
S42: Clean the data using distributed parallel cleaning algorithms.
This embodiment gives three cleaning algorithms as examples: null value processing, deduplication processing, and sorting processing.
As a preferred scheme, data flowing between multiple cleaning algorithms is passed through resilient distributed datasets (Resilient Distributed Datasets, RDD). Because the Spark cluster is built on the unified RDD abstraction, it can handle different big data processing scenarios in a largely uniform way, including MapReduce, Streaming, SQL, Machine Learning, and Graph. An RDD is a fault-tolerant, parallel data structure that lets users explicitly persist data to disk or memory and control how the data is partitioned. RDDs also provide a rich set of operations on the data, such as map, flatMap, filter, join, groupBy, and reduceByKey.
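To make the chaining idea concrete without requiring a Spark installation, the following toy class imitates two RDD-style operations over a partitioned dataset. It is a single-process stand-in, not Spark: `MiniDataset` is a hypothetical name, threads replace cluster workers, and fault tolerance and shuffles are not modelled. What it does show is that each step returns a new dataset, steps are recorded lazily, and every partition is processed independently when the result is collected.

```python
from concurrent.futures import ThreadPoolExecutor


class MiniDataset:
    """Toy stand-in for an RDD: a list of partitions plus pending per-record steps."""

    def __init__(self, partitions, steps=()):
        self.partitions = partitions
        self.steps = tuple(steps)

    def map(self, func):
        # Return a new dataset; the original is left unchanged.
        return MiniDataset(self.partitions, self.steps + (("map", func),))

    def filter(self, predicate):
        return MiniDataset(self.partitions, self.steps + (("filter", predicate),))

    def _run_partition(self, partition):
        records = list(partition)
        for kind, func in self.steps:
            if kind == "map":
                records = [func(record) for record in records]
            else:
                records = [record for record in records if func(record)]
        return records

    def collect(self):
        # Each partition is processed independently, here by a thread pool.
        with ThreadPoolExecutor() as pool:
            results = list(pool.map(self._run_partition, self.partitions))
        return [record for part in results for record in part]


raw = MiniDataset([[" a ", None], ["b", " c"]])
cleaned = raw.filter(lambda value: value is not None).map(str.strip).collect()
print(cleaned)
```

Because the output of one step is the input of the next and nothing is computed until `collect`, cleaning algorithms can be composed in any order the configuration specifies.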
In other implementations, the cleaning algorithms are not limited to the three above and can be extended according to actual business needs. A new algorithm only requires adding a new method and specifying it in the configuration definition.
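The three cleaning algorithms and the extension mechanism can be sketched on plain Python lists. This is a single-machine illustration under stated assumptions: the function names, the `ALGORITHMS` table, the step format, and the added `trim` algorithm are hypothetical, and in the described system each function would instead be expressed as Spark operations.

```python
def handle_nulls(records, fields):
    """Null value processing: drop records in which any listed field is missing."""
    return [r for r in records if all(r.get(f) is not None for f in fields)]


def deduplicate(records, keys):
    """Deduplication: keep the first record seen for each key combination."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(record[k] for k in keys)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def sort_records(records, by, ascending=True):
    """Sorting: order records by one field."""
    return sorted(records, key=lambda r: r[by], reverse=not ascending)


ALGORITHMS = {"null": handle_nulls, "dedupe": deduplicate, "sort": sort_records}


def run_flow(records, steps):
    """Apply the configured steps in order, feeding each result to the next."""
    for step in steps:
        params = {k: v for k, v in step.items() if k != "op"}
        records = ALGORITHMS[step["op"]](records, **params)
    return records


# Adding an algorithm: one new function plus one registry entry.
def trim_strings(records, fields):
    return [{**r, **{f: r[f].strip() for f in fields}} for r in records]


ALGORITHMS["trim"] = trim_strings

data = [
    {"id": 2, "name": " bob "},
    {"id": 1, "name": "amy"},
    {"id": 2, "name": " bob "},
    {"id": 3, "name": None},
]
steps = [
    {"op": "null", "fields": ["name"]},
    {"op": "dedupe", "keys": ["id"]},
    {"op": "trim", "fields": ["name"]},
    {"op": "sort", "by": "id"},
]
result = run_flow(data, steps)
print(result)
```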
S43: Store the result of the data cleaning.
The data storage engine, that is, the program that stores the cleaned result data, is started. It selects the storage mode according to the result definition in the configuration definition; in this embodiment, the result can be stored in a database or in HDFS. In other implementations, other storage modes can be added.
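Selecting a storage mode from the configuration can be sketched the same way as loading. In this hedged example, SQLite stands in for the database and a local JSON-lines file stands in for an HDFS path; the sink configuration keys (`type`, `path`, `table`), the fixed two-column schema, and all function names are assumptions for illustration.

```python
import json
import os
import sqlite3
import tempfile


def store_to_database(records, cfg):
    """Write records to a SQLite table (local stand-in for the database)."""
    table = cfg["table"]
    conn = sqlite3.connect(cfg["path"])
    with conn:  # commits on success
        conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" (id INTEGER, name TEXT)')
        conn.executemany(
            f'INSERT INTO "{table}" (id, name) VALUES (:id, :name)', records
        )
    conn.close()


def store_to_file(records, cfg):
    """Write records as JSON lines (local stand-in for an HDFS path)."""
    with open(cfg["path"], "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")


STORES = {"database": store_to_database, "file": store_to_file}


def store(records, sink_cfg):
    """Pick the storage function named by the sink definition."""
    STORES[sink_cfg["type"]](records, sink_cfg)


records = [{"id": 1, "name": "amy"}, {"id": 2, "name": "bob"}]
workdir = tempfile.mkdtemp()

db_path = os.path.join(workdir, "clean.db")
store(records, {"type": "database", "path": db_path, "table": "clean"})
conn = sqlite3.connect(db_path)
stored_rows = conn.execute('SELECT id, name FROM "clean" ORDER BY id').fetchall()
conn.close()

file_path = os.path.join(workdir, "clean.jsonl")
store(records, {"type": "file", "path": file_path})
with open(file_path, encoding="utf-8") as handle:
    stored_lines = [json.loads(line) for line in handle]

print(stored_rows, stored_lines)
```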
In the big data cleaning method provided by this embodiment, the cleaning process is first defined through configuration, then parsed and converted into atomic operations of Spark. After the cleaning tasks are submitted to the Spark cluster, a big data analysis framework, the open-source Spark cluster performs the data cleaning, and finally the result of the data cleaning is stored. Because every step of every cleaning process has been converted into atomic operations of Spark, each cleaning step carried out in the Spark cluster can be executed in a distributed, parallel manner. This significantly improves the cleaning speed, achieves high-speed and efficient data cleaning, and is better suited to the current big data environment.
In addition, the Spark cluster supports extension well. Defining the process through configuration reduces coupling in the program, so cleaning algorithms can be added or removed with minimal changes.
Embodiment two:
As shown in Fig. 2, an embodiment of the present invention provides a big data cleaning device, including a big data cleaning engine 1 and a Spark cluster 2.
The big data cleaning engine 1 is used to define the cleaning process through configuration, parse the cleaning process, convert it into atomic operations of Spark, and submit the cleaning tasks to the Spark cluster; the Spark cluster 2 is used to perform data cleaning.
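The division of responsibilities between the two components can be sketched with two small classes. This is a structural illustration only: `CleaningEngine` and `StubCluster` are hypothetical names, the configuration format is assumed, and the stub merely records what it receives instead of running anything on Spark.

```python
import json


class StubCluster:
    """Stand-in for the Spark cluster 2: records the jobs it receives, in order."""

    def __init__(self):
        self.received = []

    def submit(self, op_name, params):
        self.received.append((op_name, params))


class CleaningEngine:
    """Stand-in for the big data cleaning engine 1."""

    def __init__(self, cluster):
        self.cluster = cluster
        self.flow = None

    def define(self, config_text):
        # Configuration definition: load the JSON flow description.
        self.flow = json.loads(config_text)

    def parse(self):
        # Parsing: turn each configured step into (operation, parameters).
        return [
            (step["op"], {k: v for k, v in step.items() if k != "op"})
            for step in self.flow["steps"]
        ]

    def submit(self):
        # Submission: hand the operations to the cluster in the defined order.
        for op_name, params in self.parse():
            self.cluster.submit(op_name, params)


cluster = StubCluster()
engine = CleaningEngine(cluster)
engine.define('{"steps": [{"op": "dedupe", "keys": ["id"]}, {"op": "sort", "by": "id"}]}')
engine.submit()
print(cluster.received)
```

The engine never touches the data itself; it only turns configuration into an ordered list of jobs, which matches the separation described for the device.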
Specifically, after the big data cleaning engine 1 starts, it first loads the cleaning process configuration file and defines the cleaning process through configuration based on the JSON format. The big data cleaning engine 1 then parses the cleaning process according to the definition information in the configuration file and converts each cleaning step into atomic operations of Spark.
After the Spark cluster 2 is initialized, the big data cleaning engine 1 submits the specific data cleaning jobs to the Spark cluster in the order defined in the cleaning process.
After the Spark cluster 2 receives a data cleaning job, it first loads data from the data source 4. The data source 4 can be of different types; in this embodiment it includes a database or HDFS. In other implementations, data sources can be extended according to the specific business. Extending the data source types only requires adding the corresponding atomic data-loading operation, and the loading of data from the source is also processed in a distributed, parallel manner.
The Spark cluster 2 then cleans the data using distributed parallel cleaning algorithms. The cleaning algorithms in this embodiment include null value processing, deduplication processing, and sorting processing. In other implementations, the cleaning algorithms are not limited to these three and can be extended according to actual business needs. A new algorithm only requires adding a new method and specifying it in the configuration definition.
As a preferred scheme, data flowing between multiple cleaning algorithms is passed through RDDs. Because the Spark cluster is built on the unified RDD abstraction, it can handle different big data processing scenarios in a largely uniform way, including MapReduce, Streaming, SQL, Machine Learning, and Graph. An RDD is a fault-tolerant, parallel data structure that lets users explicitly persist data to disk or memory and control how the data is partitioned. RDDs also provide a rich set of operations on the data, such as map, flatMap, filter, join, groupBy, and reduceByKey.
Finally, the Spark cluster 2 starts the data storage engine to store the result of the data cleaning. The big data cleaning device provided by this embodiment further includes a storage component 3 used to store the result of the data cleaning.
The data storage engine selects the storage mode according to the result definition in the configuration definition; in this embodiment, the result can be stored in a database or in HDFS. In other implementations, other storage modes can be added.
In the big data cleaning device provided by this embodiment, the big data cleaning engine 1 defines the cleaning process through configuration, then parses it and converts it into atomic operations of Spark. After the cleaning tasks are submitted to the Spark cluster 2, a big data analysis framework, the open-source Spark cluster 2 performs the data cleaning, and finally the result is stored in the storage component 3. Because every step of every cleaning process has been converted into atomic operations of Spark, each cleaning step carried out in the Spark cluster 2 can be executed in a distributed, parallel manner. This significantly improves the cleaning speed, achieves high-speed and efficient data cleaning, and is better suited to the current big data environment.
In addition, the Spark cluster 2 supports extension well. Defining the process through configuration reduces coupling in the program, so cleaning algorithms can be added or removed with minimal changes.
If the functions are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.
The above is only a specific implementation of the present invention, but the scope of protection of the present invention is not limited to it. Any change or substitution that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be subject to the scope of protection of the claims.

Claims (10)

1. A big data cleaning method, characterized by including:
defining a cleaning process through configuration;
parsing the cleaning process and converting it into atomic operations of Spark;
submitting the cleaning tasks to a Spark cluster;
performing data cleaning by the Spark cluster.
2. The method according to claim 1, characterized in that performing data cleaning by the Spark cluster specifically includes:
loading data from a data source;
cleaning the data using distributed parallel cleaning algorithms;
storing the result of the data cleaning.
3. The method according to claim 2, characterized in that the cleaning algorithms include at least one of null value processing, deduplication processing, and sorting processing.
4. The method according to claim 3, characterized in that data flowing between multiple cleaning algorithms is passed through resilient distributed datasets.
5. The method according to claim 2, characterized in that the data source is a database or a distributed file system.
6. The method according to claim 1, characterized in that defining the cleaning process through configuration is specifically:
defining the cleaning process through configuration based on the JSON format.
7. A big data cleaning device, characterized by including:
a big data cleaning engine, used to define a cleaning process through configuration, parse the cleaning process, convert it into atomic operations of Spark, and submit the cleaning tasks to a Spark cluster;
a Spark cluster, used to perform data cleaning.
8. The device according to claim 7, characterized in that the Spark cluster is specifically used to:
load data from a data source;
clean the data using distributed parallel cleaning algorithms;
store the result of the data cleaning.
9. The device according to claim 8, characterized by further including a storage component used to store the result of the data cleaning.
10. The device according to claim 8, characterized in that the cleaning algorithms include at least one of null value processing, deduplication processing, and sorting processing.
CN201610652750.6A 2016-08-10 2016-08-10 Big data cleaning method and device Pending CN106294745A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610652750.6A CN106294745A (en) 2016-08-10 2016-08-10 Big data cleaning method and device

Publications (1)

Publication Number Publication Date
CN106294745A true CN106294745A (en) 2017-01-04

Family

ID=57667952

Country Status (1)

Country Link
CN (1) CN106294745A (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170104
