CN106021285A - Method for incremental extraction and analysis of mass data based on Hadoop platform - Google Patents
- Publication number
- CN106021285A
- Authority
- CN
- China
- Prior art keywords
- data
- analysis
- file
- extraction
- hdfs
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
- G06F16/254—Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/27—Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computing Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a method for incremental extraction and analysis of mass data based on the Hadoop platform. The method uses the HDFS distributed file system to store unstructured and semi-structured data, uses MapReduce as the big-data computation engine, and uses Golden Gate to extract source data from a relational database. For data extraction, the disclosed scheme extracts data incrementally with Golden Gate, which is efficient, fast, and reliable, avoids the drawbacks of extracting data with Sqoop, and has only a slight impact on the source database. For extraction and analysis of large data volumes, the extracted data undergoes MapReduce processing: the distributed storage of Hadoop is used to process the extracted data in parallel, yielding reliable, complete, and valid analysis data. The data produced by this parallel processing has already been analyzed and mined, so the data volume is greatly reduced and a foundation is laid for further analysis and processing.
Description
Technical field
The present invention relates to the technical field of data extraction and analysis methods, and in particular to a method for incremental extraction and analysis of mass data based on the Hadoop platform.
Background art
With the development of Internet+, we have entered an era of mass data, and using the Hadoop platform to analyze and process mass data has become a trend. The existing method for extracting and analyzing mass data uses Sqoop to extract the data, stores the extracted data in HDFS (a distributed file system), and uses MapReduce as the computation engine to analyze, process, and aggregate the mass data. During computation, the number of Map and Reduce tasks needed to complete the analysis is calculated from the data volume, and the final result is output to HDFS. In this existing method, the limitations of the Sqoop component itself make it unsuitable for certain scenarios: extraction efficiency for TB-scale data volumes is low, and incremental extraction requires changing the structure of the source data table. The extraction is inefficient because it is a serial operation that does not exploit the advantages of the Hadoop distributed architecture. In addition, data extracted with Sqoop cannot be preprocessed, which reduces the efficiency of the analysis.
Summary of the invention
The object of the present invention is to provide an extraction and analysis method that solves the problems raised in the background art above.
To achieve this object, the present invention provides the following technical solution: a method for incremental extraction and analysis of mass data based on the Hadoop platform, comprising the following steps:
First step: use Golden Gate to extract a Trial File from the source database;
Second step: parse the Trial File into Flat Files, marking a Flat File of modifications with U, a Flat File of additions with A, and a Flat File of deletions with D;
Third step: upload the Flat Files directly to HDFS;
Fourth step: use the MapReduce parallel computation engine to accelerate processing of the data in HDFS;
Fifth step: write the data processed by MapReduce into HBase.
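A minimal sketch of how the U/A/D flags from the second step might be consumed downstream: the flagged change lines are replayed, in order, onto an earlier snapshot to obtain the current state. The line layout (flag, key, then column values separated by `|`) and the name `apply_changes` are assumptions made for illustration; the source does not specify a Flat File layout.

```python
def apply_changes(snapshot, flat_lines, delimiter="|"):
    """Replay flagged change lines, in order, onto a {key: fields} snapshot."""
    state = dict(snapshot)  # copy so the caller's snapshot is not mutated
    for line in flat_lines:
        flag, key, *fields = line.split(delimiter)
        if flag in ("A", "U"):
            state[key] = fields      # added or modified row: store latest values
        elif flag == "D":
            state.pop(key, None)     # deleted row: drop it if present
        else:
            raise ValueError(f"unknown change flag: {flag!r}")
    return state


# Hypothetical initial extract plus one batch of incremental changes.
initial = {"1001": ["alice", "25.00"], "0999": ["dave", "5.00"]}
delta = ["U|1001|alice|30.00", "A|1002|bob|12.50", "D|0999"]
current = apply_changes(initial, delta)
```

Because only the changed rows travel through the pipeline, the cost of each run scales with the size of the delta rather than with the size of the source table.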
Preferably, HDFS divides the data into multiple blocks, stores the blocks on different computers, and keeps backup copies.
Preferably, in the fourth step, the MapReduce parallel computation engine performs integrity checking, classification, and sorting on the data; the system determines how many Map and Reduce tasks to start, matched to the workload, from the data volume and from how idle the computers are.
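The sizing decision described here can be sketched as a small calculation. This is an illustrative heuristic only, not the rule used by the invention or by Hadoop's scheduler: the 128 MB split size, the `maps_per_reducer` ratio, and the notion of "idle slots" are all assumptions introduced for the example.

```python
import math


def plan_tasks(data_bytes, idle_slots, split_bytes=128 * 1024 * 1024, maps_per_reducer=4):
    """Illustrative heuristic: one map task per input split, reducers scaled
    to the map count, and concurrency capped by the currently idle slots."""
    if idle_slots < 1:
        raise ValueError("need at least one idle slot")
    map_tasks = max(1, math.ceil(data_bytes / split_bytes))
    reduce_tasks = max(1, min(idle_slots, math.ceil(map_tasks / maps_per_reducer)))
    concurrent_maps = min(map_tasks, idle_slots)
    map_waves = math.ceil(map_tasks / concurrent_maps)  # rounds needed to run all maps
    return {
        "map_tasks": map_tasks,
        "reduce_tasks": reduce_tasks,
        "concurrent_maps": concurrent_maps,
        "map_waves": map_waves,
    }


GB = 1024 ** 3
task_plan = plan_tasks(10 * GB, idle_slots=16)
```

With 10 GB of input and 16 idle slots, this sketch plans 80 map tasks run in 5 waves and 16 reduce tasks.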
Preferably, in the fifth step, the data is written into HBase using Bulkload batch import. When writing data, HBase first writes the data to a write-ahead log file and then to a buffer; once the buffer is full, its contents are written to disk in a single pass.
Compared with the prior art, the beneficial effects of the present invention are as follows. For data extraction, the technical solution extracts data incrementally with Golden Gate, which is efficient, fast, and reliable, resolves the drawbacks of extracting data with Sqoop, and has very little impact on the source database.
For extraction and analysis of large data volumes, the extracted data undergoes MapReduce processing: the distributed storage of Hadoop is used to process the extracted data in parallel, yielding reliable, complete, and valid analysis data and laying a foundation for further analysis and processing.
Brief description of the drawings
Fig. 1 is a flow diagram of the present invention.
Detailed description of the embodiments
The present invention is further described below with reference to a specific embodiment, but the scope of protection of the present invention is not limited thereto.
Embodiment 1:
When Golden Gate is used to extract data, the Trial File is parsed into a Flat File, which can be uploaded directly to HDFS. Extraction covers both the initial data and the subsequent incremental changes, where modified data is marked with U, added data with A, and deleted data with D.
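The flagging step can be illustrated with a short sketch that renders change records as delimited flat-file lines. The record fields (`op`, `table`, `key`, `values`), the `|` delimiter, and the function name `to_flat_lines` are hypothetical; they stand in for whatever the Trial File parser actually produces, which the source does not describe.

```python
import csv
import io

# Mapping from a hypothetical operation name to the U/A/D flag used in the text.
OP_FLAGS = {"insert": "A", "update": "U", "delete": "D"}


def to_flat_lines(changes, delimiter="|"):
    """Render change records as delimited lines, one per change,
    with the U/A/D flag in the first column."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter, lineterminator="\n")
    for change in changes:
        flag = OP_FLAGS[change["op"]]
        writer.writerow([flag, change["table"], change["key"], *change["values"]])
    return buf.getvalue().splitlines()


# Hypothetical change records standing in for entries decoded from a Trial File.
changes = [
    {"op": "insert", "table": "orders", "key": "1001", "values": ["alice", "25.00"]},
    {"op": "update", "table": "orders", "key": "1001", "values": ["alice", "30.00"]},
    {"op": "delete", "table": "orders", "key": "0999", "values": []},
]
flat_lines = to_flat_lines(changes)
```

Keeping the flag in a fixed first column lets later stages filter or route records by change type without parsing the rest of the line.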
The data is uploaded to HDFS, which divides it into multiple blocks, stores the blocks on different computers, and keeps backup copies, ensuring the safety and integrity of the data.
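The block-and-backup idea can be sketched as a simple planning function. The 128 MB block size and replication factor of 3 are assumed here as example values, and the round-robin placement is a deliberate simplification for illustration; it is not the placement policy HDFS itself uses. The node names and `plan_blocks` are hypothetical.

```python
import math


def plan_blocks(file_size, nodes, block_size=128 * 1024 * 1024, replication=3):
    """Return (block_index, block_length, replica_nodes) tuples for one file.
    Replicas are placed round-robin, purely for illustration."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    n_blocks = math.ceil(file_size / block_size)
    plan = []
    for i in range(n_blocks):
        # The last block holds only the remaining bytes.
        length = min(block_size, file_size - i * block_size)
        replicas = [nodes[(i + r) % len(nodes)] for r in range(replication)]
        plan.append((i, length, replicas))
    return plan


MB = 1024 * 1024
block_plan = plan_blocks(300 * MB, ["node1", "node2", "node3", "node4"])
```

A 300 MB file yields three blocks under these assumptions, the last one only 44 MB long, and every block lives on three distinct nodes, so losing any single node loses no data.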
The MapReduce engine is used to speed up data processing. Mass data often contains incomplete or non-conforming records, so before analysis the data must undergo integrity checking, classification, and sorting. The system decides how many Map and Reduce tasks to start for this preprocessing from the data volume and from how idle the computers are, and finally outputs the aggregated results to HDFS.
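The preprocessing described above can be sketched as a single-process imitation of a map, shuffle, and reduce pass. It assumes flat-file lines of the form `flag|table|key|...`; that layout, the validity rules, and the function names are assumptions for illustration, and a real job would run on the Hadoop MapReduce framework rather than in one Python process.

```python
from collections import defaultdict


def map_phase(line):
    """Integrity check plus classification: emit (table, fields) for
    well-formed lines and nothing for malformed ones."""
    fields = line.strip().split("|")
    if len(fields) < 3 or fields[0] not in {"U", "A", "D"} or not fields[2]:
        return []  # drop incomplete or non-conforming records
    return [(fields[1], fields)]


def reduce_phase(table, records):
    """Sort the records of one table by their key column."""
    return table, sorted(records, key=lambda f: f[2])


def run_job(lines):
    groups = defaultdict(list)  # stands in for the shuffle step
    for line in lines:
        for table, fields in map_phase(line):
            groups[table].append(fields)
    return dict(reduce_phase(t, recs) for t, recs in sorted(groups.items()))


raw = [
    "A|orders|1002|bob|12.50",
    "A|orders|1001|alice|25.00",
    "X|orders|1003|bad-flag",   # unknown flag, rejected by the integrity check
    "U|users|7|carol",
    "A|orders",                 # too few fields, rejected
]
cleaned = run_job(raw)
```

Each map call depends only on its own input line, which is what allows the framework to spread the work across many machines.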
The data is written into HBase using Bulkload batch import. When writing data, HBase first writes the data to a write-ahead log file and then to a buffer; once the buffer is full, its contents are written to disk in a single pass. Bulkload loads data in parallel, which is far more efficient than loading from a single client.
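The write sequence in the paragraph above (log first, then buffer, then one flush when the buffer is full) can be modeled with a toy class. This is not the HBase API, and it does not model Bulkload itself; `ToyRegionStore`, its attributes, and the tiny buffer limit are invented for illustration.

```python
class ToyRegionStore:
    """Toy model of a log-then-buffer-then-flush write path."""

    def __init__(self, buffer_limit=3):
        self.buffer_limit = buffer_limit
        self.wal = []         # write-ahead log entries, in arrival order
        self.buffer = {}      # in-memory buffer: row key -> value
        self.disk_files = []  # each flush produces one sorted file

    def put(self, row_key, value):
        self.wal.append((row_key, value))  # 1. log first, so the write can be recovered
        self.buffer[row_key] = value       # 2. then buffer in memory
        if len(self.buffer) >= self.buffer_limit:
            self.flush()                   # 3. write out once the buffer is full

    def flush(self):
        if self.buffer:
            self.disk_files.append(sorted(self.buffer.items()))
            self.buffer = {}


store = ToyRegionStore(buffer_limit=3)
for key, value in [("r3", "c"), ("r1", "a"), ("r2", "b"), ("r4", "d")]:
    store.put(key, value)
```

After the four writes, the log holds all four entries, one sorted file of three rows has been flushed, and the fourth row is still waiting in the buffer. Batching writes this way replaces many small disk writes with a few large sequential ones.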
Compared with existing technical solutions, the technical solution of the present invention extracts data incrementally with Golden Gate, which is efficient, fast, and reliable, resolves the drawbacks of extracting data with Sqoop, and has very little impact on the source database.
For extraction and analysis of large data volumes, the extracted data undergoes MapReduce processing: the distributed storage of Hadoop is used to process the extracted data in parallel, yielding reliable, complete, and valid analysis data and laying a foundation for further analysis and processing.
The above is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or change made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solution and inventive concept, shall fall within the scope of protection of the present invention.
Claims (4)
1. A method for incremental extraction and analysis of mass data based on the Hadoop platform, characterized in that it comprises the following steps:
First step: use Golden Gate to extract a Trial File from the source database;
Second step: parse the Trial File into Flat Files, marking a Flat File of modifications with U, a Flat File of additions with A, and a Flat File of deletions with D;
Third step: upload the Flat Files directly to HDFS;
Fourth step: use the MapReduce parallel computation engine to accelerate processing of the data in HDFS;
Fifth step: write the data processed by MapReduce into HBase.
2. The method for incremental extraction and analysis of mass data based on the Hadoop platform according to claim 1, characterized in that HDFS divides the data into multiple blocks, stores the blocks on different computers, and keeps backup copies.
3. The method for incremental extraction and analysis of mass data based on the Hadoop platform according to claim 1, characterized in that, in the fourth step, the MapReduce parallel computation engine performs integrity checking, classification, and sorting on the data, and the system determines how many Map and Reduce tasks to start, matched to the workload, from the data volume and from how idle the computers are.
4. The method for incremental extraction and analysis of mass data based on the Hadoop platform according to claim 1, characterized in that, in the fifth step, the data is written into HBase using Bulkload batch import; when writing data, HBase first writes the data to a write-ahead log file and then to a buffer, and once the buffer is full, its contents are written to disk in a single pass.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610283542.3A CN106021285A (en) | 2016-04-29 | 2016-04-29 | Method for incremental extraction and analysis of mass data based on Hadoop platform |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610283542.3A CN106021285A (en) | 2016-04-29 | 2016-04-29 | Method for incremental extraction and analysis of mass data based on Hadoop platform |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106021285A true CN106021285A (en) | 2016-10-12 |
Family
ID=57081423
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610283542.3A Pending CN106021285A (en) | 2016-04-29 | 2016-04-29 | Method for incremental extraction and analysis of mass data based on Hadoop platform |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106021285A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106599184A (en) * | 2016-12-13 | 2017-04-26 | 西北师范大学 | Hadoop system optimization method |
CN107871013A (en) * | 2017-11-23 | 2018-04-03 | 安徽科创智慧知识产权服务有限公司 | A kind of mass data efficient decimation method |
CN107967319A (en) * | 2017-11-23 | 2018-04-27 | 安徽科创智慧知识产权服务有限公司 | A kind of mass data efficient decimation platform |
CN109685375A (en) * | 2018-12-26 | 2019-04-26 | 重庆誉存大数据科技有限公司 | A kind of business risk regulation engine operation method based on semi-structured text data |
CN109801319A (en) * | 2019-01-03 | 2019-05-24 | 杭州电子科技大学 | Method for registering is grouped based on the Hadoop classification figure accelerated parallel |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102521246A (en) * | 2011-11-11 | 2012-06-27 | 国网信息通信有限公司 | Cloud data warehouse system |
CN102546247A (en) * | 2011-12-29 | 2012-07-04 | 华中科技大学 | Massive data continuous analysis system suitable for stream processing |
CN102663117A (en) * | 2012-04-18 | 2012-09-12 | 中国人民大学 | OLAP (On Line Analytical Processing) inquiry processing method facing database and Hadoop mixing platform |
CN104331435A (en) * | 2014-10-22 | 2015-02-04 | 国家电网公司 | Low-influence high-efficiency mass data extraction method based on Hadoop big data platform |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106599184A (en) * | 2016-12-13 | 2017-04-26 | 西北师范大学 | Hadoop system optimization method |
CN106599184B (en) * | 2016-12-13 | 2020-03-27 | 西北师范大学 | Hadoop system optimization method |
CN107871013A (en) * | 2017-11-23 | 2018-04-03 | 安徽科创智慧知识产权服务有限公司 | A kind of mass data efficient decimation method |
CN107967319A (en) * | 2017-11-23 | 2018-04-27 | 安徽科创智慧知识产权服务有限公司 | A kind of mass data efficient decimation platform |
CN109685375A (en) * | 2018-12-26 | 2019-04-26 | 重庆誉存大数据科技有限公司 | A kind of business risk regulation engine operation method based on semi-structured text data |
CN109685375B (en) * | 2018-12-26 | 2020-10-30 | 重庆誉存大数据科技有限公司 | Enterprise risk rule engine operation method based on semi-structured text data |
CN109801319A (en) * | 2019-01-03 | 2019-05-24 | 杭州电子科技大学 | Method for registering is grouped based on the Hadoop classification figure accelerated parallel |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20161012 |