CN106021285A

CN106021285A - Method for incremental extraction and analysis of mass data based on Hadoop platform

Info

Publication number: CN106021285A
Application number: CN201610283542.3A
Authority: CN
Inventors: 肖骏; 骆金松; 李磊; 夏循国; 许卫国
Original assignee: Wuhan Bai Cheng Cheng Technology Co Ltd
Current assignee: Wuhan Bai Cheng Cheng Technology Co Ltd
Priority date: 2016-04-29
Filing date: 2016-04-29
Publication date: 2016-10-12

Abstract

The invention discloses a method for incremental extraction and analysis of mass data based on the Hadoop platform. The HDFS distributed file system stores unstructured and semi-structured data, MapReduce serves as the big-data computation engine, and Golden Gate extracts source data from a relational database. For data extraction, the scheme uses Golden Gate incremental extraction, which is efficient, fast and reliable, avoids the drawbacks of Sqoop-based extraction, and has only a slight impact on the source database. When large volumes of data are extracted and analyzed, the extracted data is processed with MapReduce; building on Hadoop's distributed storage, the extracted data is processed in parallel to obtain reliable, complete and valid analysis data. The data produced by the parallel processing has already been analyzed and mined, so its volume is greatly reduced, which lays a foundation for further analysis and processing.

Description

A method for incremental extraction and analysis of mass data based on the Hadoop platform

Technical field

The present invention relates to the technical field of data extraction and analysis methods, and in particular to a method for incremental extraction and analysis of mass data based on the Hadoop platform.

Background art

With the development of Internet Plus, we have entered an era of mass data, and using the Hadoop platform to analyze and process mass data has become a trend. The existing method for extracting and analyzing mass data uses Sqoop to extract the data, stores the extracted data in HDFS (the Hadoop distributed file system), and uses MapReduce as the computing engine to analyze, process and aggregate the data. During computation, the number of Map and Reduce tasks to start is determined from the data volume, and the final result is written to HDFS. In this existing method, the limitations of the Sqoop component make it unsuitable for certain scenarios: extraction of TB-scale data is inefficient, and incremental extraction requires changing the structure of the source data tables. The inefficiency arises because extraction is a serial operation that does not exploit the Hadoop distributed architecture. In addition, data extracted with Sqoop cannot be preprocessed at scale, which reduces the efficiency of analysis.

Summary of the invention

The object of the present invention is to provide a method for extraction and analysis that solves the problems raised in the background art above.

To achieve the above object, the present invention provides the following technical scheme: a method for incremental extraction and analysis of mass data based on the Hadoop platform, comprising the following steps:

First step: use Golden Gate to extract a Trial File from the source database;

Second step: parse the Trial File into Flat Files, marking the Flat File of modified data with U, the Flat File of added data with A, and the Flat File of deleted data with D;

Third step: upload the Flat Files directly to HDFS;

Fourth step: use the MapReduce parallel computing engine to accelerate processing of the data in HDFS;

Fifth step: write the data processed by MapReduce into HBase.

Preferably, HDFS divides the data into multiple blocks, stores the blocks on different computers, and keeps backup copies.

Preferably, in the fourth step, the MapReduce parallel computing engine performs integrity checking, classification and sorting on the data; the system decides, according to the data volume and how idle the computers are, how many Map and Reduce tasks to enable to complete the processing.

Preferably, in the fifth step, the data is written into HBase using the Bulkload batch import method. When writing data, HBase first writes the data to the write-ahead log file and then to a buffer; only after the buffer is full is its content written to disk in a single operation.

Compared with the prior art, the beneficial effects of the present invention are as follows. For data extraction, the technical scheme uses Golden Gate incremental extraction, which is efficient, fast and reliable, overcomes the drawbacks of extracting data with Sqoop, and has very little impact on the source database when extracting and analyzing large volumes of data.

The extracted data is processed with MapReduce: exploiting Hadoop's distributed storage, the extracted data is processed in parallel to obtain complete and valid analysis data, which lays the foundation for further analysis and processing.

Brief description of the drawings

Fig. 1 is a flow diagram of the present invention.

Detailed description of the embodiments

The present invention is further described below with reference to a specific embodiment, but the scope of protection of the invention is not limited thereto.

Embodiment 1:

When Golden Gate is used to extract data, the Trial File is parsed into Flat Files, which can be uploaded directly to HDFS. Extraction takes the initial data plus the incremental data that has changed since, where modified data is marked with U, added data with A, and deleted data with D.
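As a minimal sketch of how downstream code might consume such a Flat File, the Python below assumes a hypothetical comma-delimited layout in which the first field is the operation marker (U, A or D), the second is a primary key, and the remaining fields are column values. The patent does not specify the file layout; the function names and sample rows are illustrative assumptions only.

```python
import csv
import io

# Assumed layout (not specified in the patent): op,key,col1,col2,...
VALID_OPS = {"A", "U", "D"}


def parse_flat_file(text):
    """Yield (op, key, values) tuples from delimited flat-file text."""
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        if len(row) < 2:
            raise ValueError(f"record needs an op marker and a key: {row!r}")
        op, key, values = row[0], row[1], row[2:]
        if op not in VALID_OPS:
            raise ValueError(f"unknown operation marker: {op!r}")
        yield op, key, values


def apply_changes(snapshot, changes):
    """Apply A/U/D records in order to a dict keyed by primary key."""
    result = dict(snapshot)
    for op, key, values in changes:
        if op == "D":
            result.pop(key, None)  # delete: drop the row if present
        else:
            result[key] = values   # add or update: keep the latest values
    return result


sample = "A,1001,alice,wuhan\nA,1002,bob,beijing\nU,1001,alice,shanghai\nD,1002\n"
table = apply_changes({}, parse_flat_file(sample))
print(table)
```

Replaying the records in order leaves only the latest state of each key, which is the effect an incremental load is meant to have on the target data set.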

The data is uploaded to HDFS, which divides it into multiple blocks stored on different computers and keeps backup copies, ensuring the safety and integrity of the data.
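The block splitting and backup storage described above can be illustrated with a small sketch. It assumes a 128 MiB block size and a replication factor of 3 (common HDFS defaults, not values stated in the patent) and uses a simplified round-robin placement over hypothetical node names; real HDFS placement is rack-aware and decided by the NameNode.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size: 128 MiB
REPLICATION = 3                  # assumed replication factor


def plan_blocks(file_size, nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return a list of (block_index, block_length, nodes_holding_a_replica)."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    n_blocks = math.ceil(file_size / block_size)
    plan = []
    for i in range(n_blocks):
        # The last block may be shorter than the configured block size.
        length = min(block_size, file_size - i * block_size)
        # Simplified round-robin placement; each replica lands on a distinct node.
        holders = [nodes[(i + r) % len(nodes)] for r in range(replication)]
        plan.append((i, length, holders))
    return plan


nodes = ["node1", "node2", "node3", "node4"]  # hypothetical cluster
plan = plan_blocks(300 * 1024 * 1024, nodes)
for block in plan:
    print(block)
```

Because every block is held by several distinct nodes, losing one computer leaves at least one other copy of each block available.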

The MapReduce engine is used to accelerate data processing. Mass data often contains incomplete or non-conforming records, so before analysis the data must undergo integrity checking, classification and sorting. The system decides, according to the data volume and how idle the computers are, how many Map and Reduce tasks to enable for this preprocessing, and the aggregated result is finally written to HDFS.
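The preprocessing described above (integrity check, classification, sorting) can be sketched as a map, shuffle and reduce sequence. This is a single-process illustration in plain Python rather than a Hadoop job, and it assumes a hypothetical record layout of category, id and amount; the patent does not define the record fields or the validation rules.

```python
from collections import defaultdict

EXPECTED_FIELDS = 3  # assumed record layout: category,id,amount


def map_record(line):
    """Map step: drop malformed records, emit (category, (id, amount))."""
    fields = line.strip().split(",")
    if len(fields) != EXPECTED_FIELDS or not all(fields):
        return []  # integrity check failed: wrong field count or empty field
    category, rec_id, amount = fields
    try:
        return [(category, (rec_id, float(amount)))]
    except ValueError:
        return []  # integrity check failed: amount is not numeric


def shuffle(pairs):
    """Group mapped values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_group(category, values):
    """Reduce step: sort each category's records by amount, descending."""
    return category, sorted(values, key=lambda v: v[1], reverse=True)


lines = ["food,1,12.5", "tech,2,99.0", "food,3,40", "broken,,", "tech,4,abc", "tech,5,150"]
mapped = [pair for line in lines for pair in map_record(line)]
result = dict(reduce_group(k, v) for k, v in sorted(shuffle(mapped).items()))
print(result)
```

In a real cluster the map calls run in parallel over input splits and each reduce call handles one key group, which is where the speed-up over serial processing comes from.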

The data is written into HBase using the Bulkload batch import method. When writing data, HBase first writes to the write-ahead log file and then to a buffer; only after the buffer is full is its content written to disk in a single operation. Bulkload loads data in parallel, which is much more efficient than loading through a single client.
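The log-then-buffer write path described above can be modelled with a toy class. This is an illustrative sketch only: the class name, the row-count flush threshold and the in-memory lists are assumptions, whereas real HBase flushes according to buffer size in bytes and persists files to HDFS. The sketch models only the write-ahead log and buffer behavior, not the Bulkload import itself.

```python
class ToyRegionStore:
    """Toy model of the write path: log first, buffer next, flush when full."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold  # buffered rows that trigger a flush
        self.wal = []         # stands in for the write-ahead log file
        self.buffer = {}      # stands in for the in-memory write buffer
        self.disk_files = []  # each flush produces one sorted, immutable file

    def put(self, row_key, value):
        self.wal.append((row_key, value))  # 1. record the write in the log
        self.buffer[row_key] = value       # 2. then apply it to the buffer
        if len(self.buffer) >= self.flush_threshold:
            self.flush()                   # 3. buffer full: write it out at once

    def flush(self):
        if self.buffer:
            self.disk_files.append(sorted(self.buffer.items()))
            self.buffer = {}


store = ToyRegionStore(flush_threshold=3)
for key, value in [("r3", "c"), ("r1", "a"), ("r2", "b"), ("r4", "d")]:
    store.put(key, value)
print(store.disk_files)
print(store.buffer)
```

Writing to the log before the buffer means a buffered row that has not yet reached disk can still be recovered by replaying the log after a failure.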

Compared with existing technical schemes, the technical scheme of the present invention uses Golden Gate incremental extraction for data extraction, which is efficient, fast and reliable, overcomes the drawbacks of extracting data with Sqoop, and has very little impact on the source database when extracting and analyzing large volumes of data.

The extracted data is processed with MapReduce: exploiting Hadoop's distributed storage, the extracted data is processed in parallel to obtain complete and valid analysis data, which lays the foundation for further analysis and processing.

The above is only a preferred embodiment of the present invention, but the scope of protection of the invention is not limited thereto. Any equivalent substitution or change made by a person skilled in the art, within the technical scope disclosed by the invention and according to its technical scheme and inventive concept, shall fall within the scope of protection of the present invention.

Claims

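1. A method for incremental extraction and analysis of mass data based on the Hadoop platform, characterized in that it comprises the following steps:

First step: use Golden Gate to extract a Trial File from the source database;

Second step: parse the Trial File into Flat Files, marking the Flat File of modified data with U, the Flat File of added data with A, and the Flat File of deleted data with D;

Third step: upload the Flat Files directly to HDFS;

Fourth step: use the MapReduce parallel computing engine to accelerate processing of the data in HDFS;

Fifth step: write the data processed by MapReduce into HBase.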

2. The method for incremental extraction and analysis of mass data based on the Hadoop platform according to claim 1, characterized in that: HDFS divides the data into multiple blocks, stores the blocks on different computers, and keeps backup copies.

3. The method for incremental extraction and analysis of mass data based on the Hadoop platform according to claim 1, characterized in that: in the fourth step, the MapReduce parallel computing engine performs integrity checking, classification and sorting on the data, and the system decides, according to the data volume and how idle the computers are, how many Map and Reduce tasks to enable to complete the processing.

4. The method for incremental extraction and analysis of mass data based on the Hadoop platform according to claim 1, characterized in that: in the fifth step, the data is written into HBase using the Bulkload batch import method; when writing data, HBase first writes the data to the write-ahead log file and then to a buffer, and only after the buffer is full is its content written to disk in a single operation.