CN106021285A - Method for incremental extraction and analysis of mass data based on Hadoop platform - Google Patents

Info

Publication number: CN106021285A
Application number: CN201610283542.3A
Authority: CN (China)
Other languages: Chinese (zh)
Inventors: 肖骏, 骆金松, 李磊, 夏循国, 许卫国
Current assignee: Wuhan Bai Cheng Cheng Technology Co Ltd (the listed assignee may be inaccurate; no legal analysis has been performed)
Original assignee: Wuhan Bai Cheng Cheng Technology Co Ltd
Priority date: 2016-04-29 (an assumption, not a legal conclusion)
Filing date: 2016-04-29
Publication date: 2016-10-12
Legal status: Pending (an assumption, not a legal conclusion)
Prior art keywords: data, analysis, file, extraction, hdfs

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method for incremental extraction and analysis of mass data based on the Hadoop platform. The method uses the HDFS distributed file system to store unstructured and semi-structured data, uses MapReduce as the big data computation engine, and uses Golden Gate to extract source data from a relational database. For data extraction, the technical solution extracts data incrementally with Golden Gate, which is efficient, fast, and reliable, avoids the drawbacks of extracting data with Sqoop, and has only a slight impact on the source database. When extracting and analyzing large volumes of data, the extracted data is processed with MapReduce; exploiting Hadoop's distributed storage, the extracted data is processed in parallel to obtain reliable, complete, and effective analysis data. The data produced by the parallel processing has already been analyzed and mined, so the data volume is greatly reduced and a foundation is laid for further analysis and processing.

Description

A method for incremental extraction and analysis of mass data based on the Hadoop platform
Technical field
The present invention relates to the technical field of extraction and analysis methods, and in particular to a method for incremental extraction and analysis of mass data based on the Hadoop platform.
Background art
With the development of "Internet Plus", we have entered an era of mass data, and analyzing and processing mass data on the Hadoop platform has become a trend. Existing methods for extracting and analyzing mass data use Sqoop to extract the data, store the extracted data in HDFS (the Hadoop distributed file system), and use MapReduce as the computation engine to analyze, process, and aggregate the data. During computation, the number of Map and Reduce tasks needed to complete the analysis is calculated from the data volume, and the final result is written to HDFS. In these existing methods, the limitations of the Sqoop component make it unsuitable for certain scenarios: extraction of TB-scale data is inefficient, and incremental extraction requires changing the structure of the source tables. The inefficiency arises because data extraction is a serial operation that does not exploit the advantages of Hadoop's distributed architecture. Moreover, data extracted with Sqoop cannot be preprocessed, which reduces the efficiency of analysis.
Summary of the invention
The object of the present invention is to provide an extraction and analysis method that solves the problems raised in the background art above.
To achieve this object, the present invention provides the following technical solution: a method for incremental extraction and analysis of mass data based on the Hadoop platform, comprising the following steps:
Step 1: use Golden Gate to extract a Trial File from the source database;
Step 2: parse the Trial File into Flat Files, marking a Flat File of modifications with U, a Flat File of additions with A, and a Flat File of deletions with D;
Step 3: upload the Flat Files directly to HDFS;
Step 4: use the MapReduce parallel computation engine to accelerate processing of the data in HDFS;
Step 5: write the data processed by MapReduce into HBase.
Preferably, HDFS divides the data into multiple blocks, stores them on different computers, and keeps backup copies.
Preferably, in step 4, the MapReduce parallel computation engine performs integrity checking, classification, and sorting on the data. The system determines a matching number of Map and Reduce tasks according to the data volume and how idle the computers are, and these tasks complete the processing.
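As a rough illustration of how task counts could follow from data volume and idle capacity, the sketch below takes one map task per input split and caps the number of reduce tasks by the free slots. The function name `plan_tasks`, the 128 MiB split size, and the per-reducer byte target are assumptions made for this example; the patent does not specify a sizing rule.

```python
import math

MIB = 1024 * 1024


def plan_tasks(input_bytes, free_slots, split_bytes=128 * MIB,
               bytes_per_reducer=1024 * MIB):
    """Return (map_tasks, reduce_tasks) using a hypothetical sizing heuristic."""
    if input_bytes < 0 or free_slots < 1:
        raise ValueError("input_bytes must be >= 0 and free_slots >= 1")
    # One map task per input split; keep at least one task for empty input.
    map_tasks = max(1, math.ceil(input_bytes / split_bytes))
    # Reducers scale with data volume but never exceed the idle slots.
    wanted = max(1, math.ceil(input_bytes / bytes_per_reducer))
    reduce_tasks = min(wanted, free_slots)
    return map_tasks, reduce_tasks
```

With 10 GiB of input and four idle slots, this heuristic asks for 80 map tasks and is limited to 4 reduce tasks.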
Preferably, in step 5, the data is written into HBase using the Bulkload batch import method. When writing data, HBase first writes it to a write-ahead log file and then to a buffer; only after the buffer is full is its content written to disk in a single pass.
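The log-then-buffer write sequence described here can be modeled with a small toy class. This is only a sketch of the ordering (log first, buffer second, one flush when the buffer fills); `ToyStore` and its `flush_threshold` are hypothetical names, and nothing below is the HBase client API.

```python
class ToyStore:
    """Toy model of a write-ahead-log plus in-memory-buffer write path."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold
        self.wal = []          # append-only log, always written first
        self.buffer = {}       # in-memory buffer keyed by row key
        self.disk_files = []   # each flush adds one sorted, immutable file

    def put(self, row_key, value):
        # 1. Record the change in the write-ahead log so it can be replayed.
        self.wal.append((row_key, value))
        # 2. Apply the change to the in-memory buffer.
        self.buffer[row_key] = value
        # 3. Once the buffer is full, write all of it to disk in one pass.
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        self.disk_files.append(sorted(self.buffer.items()))
        self.buffer = {}
        # Entries that reached disk no longer need to be replayed.
        self.wal.clear()
```

Batching writes this way trades a small recovery log for far fewer disk writes, which is the point the paragraph above makes.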
Compared with the prior art, the beneficial effects of the present invention are as follows. For data extraction, the technical solution extracts data incrementally with Golden Gate, which is efficient, fast, and reliable, overcomes the drawbacks of extracting data with Sqoop, and has very little impact on the source database.
When extracting and analyzing large volumes of data, the extracted data is processed with MapReduce. Exploiting Hadoop's distributed storage, the extracted data is processed in parallel to obtain reliable, complete, and effective analysis data, laying a foundation for further analysis and processing.
Brief description of the drawings
Fig. 1 is a flow block diagram of the present invention.
Detailed description of the embodiments
The present invention is further described below with reference to a specific embodiment, but the scope of protection of the present invention is not limited thereto.
Embodiment 1:
When Golden Gate is used to extract data, the Trial File is parsed into Flat Files, which can be uploaded directly to HDFS. Extraction covers the initial data and the incremental data that has changed, in which modified data is marked with U, added data with A, and deleted data with D.
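To make the U/A/D marking concrete, the following sketch applies a batch of change records to an in-memory snapshot. The patent does not specify the Flat File layout, so the `op|key|value` line format, the `|` separator, and the function names are assumptions for illustration only; every line is assumed to carry all three fields (the value may be empty for a deletion).

```python
def parse_flat_line(line, sep="|"):
    """Split one flat-file line into (op, key, value); the layout is assumed."""
    op, key, value = line.rstrip("\n").split(sep, 2)
    if op not in ("A", "U", "D"):
        raise ValueError(f"unknown operation flag: {op!r}")
    return op, key, value


def apply_changes(snapshot, lines):
    """Apply A/U/D change records to a dict snapshot and return it."""
    for line in lines:
        op, key, value = parse_flat_line(line)
        if op == "D":
            snapshot.pop(key, None)   # deletion: drop the row if present
        else:
            snapshot[key] = value     # A (added) and U (modified) both set the row
    return snapshot
```

For example, `apply_changes({"1001": "alice"}, ["A|1002|bob", "U|1001|alice2", "D|1002|"])` leaves only row `1001` with its updated value.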
The data is uploaded to HDFS, which divides it into multiple blocks, stores them on different computers, and keeps backup copies, ensuring the safety and integrity of the data.
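The block splitting and backup storage described above can be sketched in a few lines. The round-robin placement below is a deliberate simplification chosen for this example (actual HDFS placement also considers rack topology), and the function names, block size, and node names are hypothetical.

```python
def split_into_blocks(file_size, block_size):
    """Return the size in bytes of each block of a file."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    full, rest = divmod(file_size, block_size)
    # All blocks are full-sized except possibly the last one.
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    placement = {}
    for block in range(num_blocks):
        # Consecutive node indices guarantee the replicas land on different nodes.
        placement[block] = [nodes[(block + i) % len(nodes)]
                            for i in range(replication)]
    return placement
```

Because every block lives on several nodes, losing one computer does not lose data, which is the safety property the paragraph refers to.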
The MapReduce engine is used to accelerate data processing. Mass data often contains incomplete or non-conforming records, so before analysis the data must undergo processing that includes integrity checking, classification, and sorting. The system determines how many Map and Reduce tasks to enable according to the data volume and how idle the computers are; these tasks complete the preprocessing, and the aggregated results are finally written to HDFS.
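The three preprocessing operations named above (integrity checking, classification, sorting) map naturally onto a map phase and a reduce phase. The sketch below imitates that flow in plain Python rather than on a Hadoop cluster; the record fields `id`, `category`, and `amount` are invented for the example and do not come from the patent.

```python
from collections import defaultdict


def map_phase(records, required=("id", "category", "amount")):
    """Emit (category, record) pairs, skipping records that fail the check."""
    for rec in records:
        # Integrity check: every required field must be present and non-empty.
        if all(rec.get(f) not in (None, "") for f in required):
            yield rec["category"], rec


def reduce_phase(pairs):
    """Group records by key (classification), then sort each group by id."""
    groups = defaultdict(list)
    for key, rec in pairs:
        groups[key].append(rec)
    # Emit keys in sorted order, as a shuffle stage would deliver them.
    return {key: sorted(groups[key], key=lambda r: r["id"])
            for key in sorted(groups)}
```

Calling `reduce_phase(map_phase(records))` drops incomplete records and returns the remainder grouped by category and ordered by `id`, a much smaller and cleaner input for later analysis.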
The data is written into HBase using the Bulkload batch import method. When writing data, HBase first writes it to a write-ahead log file and then to a buffer; only after the buffer is full is its content written to disk in a single pass. Bulkload loads data in parallel, which is much more efficient than loading data from a single client.
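The efficiency claim for parallel loading can be illustrated by partitioning rows by key range and handing each partition to its own worker. This is a sketch under stated assumptions: `split_keys` stands in for region boundaries and must be sorted, `load_region` is a placeholder for whatever actually ships a prepared batch, and none of these names belong to the HBase API.

```python
from bisect import bisect_right
from concurrent.futures import ThreadPoolExecutor


def partition_by_region(rows, split_keys):
    """Bucket (row_key, value) pairs by key range and sort each bucket."""
    buckets = [[] for _ in range(len(split_keys) + 1)]
    for row_key, value in rows:
        # A key equal to a split point goes to the range that starts there.
        buckets[bisect_right(split_keys, row_key)].append((row_key, value))
    return [sorted(b) for b in buckets]


def load_region(batch):
    """Placeholder for loading one prepared batch; returns the rows loaded."""
    return len(batch)


def bulk_load(rows, split_keys, workers=4):
    """Load all partitions concurrently and return rows loaded per partition."""
    batches = partition_by_region(rows, split_keys)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(load_region, batches))
```

Because the partitions cover disjoint key ranges, the workers never contend for the same rows, which is what lets the load scale beyond a single client.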
Compared with existing technical solutions, the technical solution of the present invention extracts data incrementally with Golden Gate, which is efficient, fast, and reliable, overcomes the drawbacks of extracting data with Sqoop, and has very little impact on the source database.
When extracting and analyzing large volumes of data, the extracted data is processed with MapReduce. Exploiting Hadoop's distributed storage, the extracted data is processed in parallel to obtain reliable, complete, and effective analysis data, laying a foundation for further analysis and processing.
The above is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or change made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solution and inventive concept, shall fall within the scope of protection of the present invention.

Claims (4)

1. A method for incremental extraction and analysis of mass data based on the Hadoop platform, characterized by comprising the following steps:
Step 1: use Golden Gate to extract a Trial File from the source database;
Step 2: parse the Trial File into Flat Files, marking a Flat File of modifications with U, a Flat File of additions with A, and a Flat File of deletions with D;
Step 3: upload the Flat Files directly to HDFS;
Step 4: use the MapReduce parallel computation engine to accelerate processing of the data in HDFS;
Step 5: write the data processed by MapReduce into HBase.
2. The method for incremental extraction and analysis of mass data based on the Hadoop platform according to claim 1, characterized in that HDFS divides the data into multiple blocks, stores them on different computers, and keeps backup copies.
3. The method for incremental extraction and analysis of mass data based on the Hadoop platform according to claim 1, characterized in that, in step 4, the MapReduce parallel computation engine performs integrity checking, classification, and sorting on the data, and the system determines a matching number of Map and Reduce tasks according to the data volume and how idle the computers are to complete the processing.
4. The method for incremental extraction and analysis of mass data based on the Hadoop platform according to claim 1, characterized in that, in step 5, the data is written into HBase using the Bulkload batch import method; when writing data, HBase first writes it to a write-ahead log file and then to a buffer, and only after the buffer is full is its content written to disk in a single pass.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610283542.3A CN106021285A (en) 2016-04-29 2016-04-29 Method for incremental extraction and analysis of mass data based on Hadoop platform

Publications (1)

Publication Number Publication Date
CN106021285A true CN106021285A (en) 2016-10-12

Family

ID=57081423

Country Status (1)

Country Link
CN (1) CN106021285A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102521246A (en) * 2011-11-11 2012-06-27 国网信息通信有限公司 Cloud data warehouse system
CN102546247A (en) * 2011-12-29 2012-07-04 华中科技大学 Massive data continuous analysis system suitable for stream processing
CN102663117A (en) * 2012-04-18 2012-09-12 中国人民大学 OLAP (On Line Analytical Processing) query processing method for a hybrid database and Hadoop platform
CN104331435A (en) * 2014-10-22 2015-02-04 国家电网公司 Low-influence high-efficiency mass data extraction method based on Hadoop big data platform

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106599184A (en) * 2016-12-13 2017-04-26 西北师范大学 Hadoop system optimization method
CN106599184B (en) * 2016-12-13 2020-03-27 西北师范大学 Hadoop system optimization method
CN107871013A (en) * 2017-11-23 2018-04-03 安徽科创智慧知识产权服务有限公司 Efficient mass data extraction method
CN107967319A (en) * 2017-11-23 2018-04-27 安徽科创智慧知识产权服务有限公司 Efficient mass data extraction platform
CN109685375A (en) * 2018-12-26 2019-04-26 重庆誉存大数据科技有限公司 Enterprise risk rule engine operation method based on semi-structured text data
CN109685375B (en) * 2018-12-26 2020-10-30 重庆誉存大数据科技有限公司 Enterprise risk rule engine operation method based on semi-structured text data
CN109801319A (en) * 2019-01-03 2019-05-24 杭州电子科技大学 Method for registering is grouped based on the Hadoop classification figure accelerated parallel

Similar Documents

Publication Publication Date Title
CN106021285A (en) Method for incremental extraction and analysis of mass data based on Hadoop platform
CN110489445B (en) Rapid mass data query method based on polymorphic composition
CN104331435B (en) Low-influence high-efficiency mass data extraction method based on Hadoop big data platform
CN106815203B (en) Method and device for analyzing amount of money in referee document
CN106844507B (en) Method and apparatus for data batch processing
CN102930210B (en) Rogue program behavior automated analysis, detection and classification system and method
CN105205105B (en) A kind of ETL process system and processing method based on storm
CN111597243B (en) Method and system for abstract data loading based on data warehouse
WO2019148713A1 (en) Sql statement processing method and apparatus, computer device, and storage medium
CN107992764B (en) Sensitive webpage identification and detection method and device
CN106055618B (en) Data processing method based on web crawler and structured storage
CN102012896B (en) Method and device for realizing bulk editing of file contents
CN112527948B (en) Sentence-level index-based real-time data deduplication method and system
CN105373607B (en) Method for compressing SQL access log of power business system
CN111881447B (en) Intelligent evidence obtaining method and system for malicious code fragments
CN104253863B (en) TCP stream reassembly method based on Hadoop platform and distributed processing programming model
CN105389471A (en) Method for reducing training set of machine learning
CN103761337A (en) Method and system for processing unstructured data
CN109117426A (en) Distributed networks database query method, apparatus, equipment and storage medium
US20190050298A1 (en) Method and apparatus for improving database recovery speed using log data analysis
US20160275134A1 (en) Nosql database data validation
CN113407495A (en) SIMHASH-based file similarity determination method and system
CN110399432A (en) Table classification method, device, computer equipment and storage medium
Tulkinbekov et al. CLeveldb: Coalesced leveldb for small data
CN111045920B (en) Workload-aware multi-branch software change-level defect prediction method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20161012