CN106873945A

CN106873945A - Data processing architecture and data processing method based on batch processing and stream processing

Info

Publication number: CN106873945A
Application number: CN201611245710.6A
Authority: CN
Inventors: 吴贺俊; 冯辉
Original assignee: Sun Yat Sen University
Current assignee: Sun Yat Sen University
Priority date: 2016-12-29
Filing date: 2016-12-29
Publication date: 2017-06-20

Abstract

The present invention relates to a data processing architecture based on batch processing and stream processing, comprising: a data acquisition module, which obtains collected real-time data from multiple data collection terminals and transmits the collected data to a batch processing module and a stream processing module; the batch processing module, which persists the received real-time data, batch-processes the persisted real-time data using a re-computation mechanism, and generates batch views of different granularities from the processing results; the stream processing module, which stream-processes the received real-time data using an incremental computation mechanism and generates stream processing views of different granularities from the processing results; a data merging module, which merges the batch views and the stream processing views using a corresponding merge strategy; a data visualization module, which displays the batch views, the stream processing views, or the merged batch and stream processing views; and a resource monitoring module, which monitors the resources of the above modules.

Description

Data processing architecture and data processing method based on batch processing and stream processing

Technical field

The present invention relates to the technical field of data processing, and more particularly to a data processing architecture and a data processing method based on batch processing and stream processing.

Background art

With the spread of the Internet, the rapid development of the Internet of Things, and the widespread use of devices such as smartphones, people can produce data anytime and anywhere, which has caused explosive growth in data. To handle large-scale data, distributed batch processing models and stream processing models have been proposed.

The batch processing model provides high-throughput processing, large-scale analysis, and mining of large volumes of historical data. It stores data first and computes afterwards, so it suits scenarios where real-time requirements are modest and the accuracy and completeness of the data matter more. The batch processing model is widely used in fields such as offline analysis and offline machine learning. The stream processing model, by contrast, focuses on real-time analysis of streaming data. The data arrives as a stream carrying a large amount of information, and only a small fraction of it is held in limited memory. The stream processing model is widely used in low-latency scenarios such as online recommendation, online analysis, and online machine learning.

However, the batch processing model and the stream processing model each process data in a single way and apply to a limited range of scenarios. Each is a solution proposed for a single problem and scenario, and neither is general enough to cover the other's use cases. The batch processing model can process more complete data and therefore obtain more accurate results, but its latency is relatively high. The stream processing model can compute with low latency, but because only a limited amount of data is cached in memory, its computational accuracy is relatively low. As technology develops, modern enterprises increasingly need a low-latency method that processes historical data and real-time data at the same time, one that guarantees comprehensive processing of the whole data set while also guaranteeing processing efficiency.

Summary of the invention

To solve the above technical problem, the present invention provides a data processing architecture based on batch processing and stream processing. The architecture has both batch processing and stream processing capabilities, so it can process the data set comprehensively while maintaining processing efficiency.

To achieve the above object of the invention, the following technical solution is adopted:

A data processing architecture based on batch processing and stream processing, comprising a data acquisition module, a batch processing module, a stream processing module, a data merging module, a data visualization module, and a resource monitoring module;

wherein the data acquisition module is used to obtain collected real-time data from multiple data collection terminals and to transmit the collected data to the batch processing module and the stream processing module;

the batch processing module is used to persist the received real-time data and then, when the condition for executing batch processing is met, to batch-process the persisted real-time data using a re-computation mechanism and to generate batch views of different granularities from the processing results;

the stream processing module is used to stream-process the received real-time data using an incremental computation mechanism and to generate stream processing views of different granularities from the processing results;

the data merging module is used to merge the batch views and the stream processing views using a corresponding merge strategy, according to the specific query requirement;

the data visualization module is used to display the batch views, the stream processing views, or the merged batch and stream processing views;

the resource monitoring module is used to monitor the resources of the data acquisition module, the batch processing module, the stream processing module, the data merging module, and the data visualization module.

Preferably, the data acquisition module comprises a data collection submodule and a data cleaning submodule. The data collection submodule is used to receive the collected real-time data from multiple data collection terminals, and the data cleaning submodule is used to clean the received real-time data using corresponding filtering rules.
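
The cleaning step can be pictured as a set of predicates applied to each record. The following is only a minimal Python sketch under stated assumptions: the record fields (`item`, `ts`) and the two rules are hypothetical and are not specified by the patent.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]
Rule = Callable[[Record], bool]


def clean(records: Iterable[Record], rules: List[Rule]) -> List[Record]:
    """Keep only the records that pass every filtering rule."""
    return [r for r in records if all(rule(r) for rule in rules)]


# Hypothetical filtering rules for a click log (assumptions, not from the patent).
def has_item(record: Record) -> bool:
    return bool(record.get("item"))


def has_valid_timestamp(record: Record) -> bool:
    ts = record.get("ts")
    return isinstance(ts, int) and ts > 0


raw = [
    {"item": "a", "ts": 1482969600},
    {"item": "", "ts": 1482969601},   # dropped: empty item
    {"item": "b", "ts": -1},          # dropped: invalid timestamp
]
cleaned = clean(raw, [has_item, has_valid_timestamp])
```

Keeping the rules as separate functions lets a deployment swap in its own filters without touching the collection logic.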

Preferably, the batch processing module comprises a data preprocessing submodule, a data processing submodule, and a batch view storage submodule;

the data preprocessing submodule is used to persist the received real-time data using data integration, data transformation, and data reduction techniques;

the data processing submodule, when the condition for executing batch processing is met, batch-processes the persisted real-time data using the re-computation mechanism;

the batch view storage submodule is used to store the results obtained by the data processing submodule in HBase, so as to generate batch views of different granularities.
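
To illustrate what re-computation into views of different granularities might look like, here is a small Python sketch. It assumes the views are click counts per item at hourly and daily granularity, and it uses plain dictionaries in place of HBase tables; the event layout and the function name `build_batch_views` are illustrative assumptions.

```python
from collections import Counter
from datetime import datetime, timezone


def build_batch_views(events):
    """Recompute hourly and daily click counts from the full event list.

    events is an iterable of (item, unix_timestamp) pairs. Every call
    starts from scratch, which is what distinguishes re-computation
    from incremental updating.
    """
    hourly, daily = Counter(), Counter()
    for item, ts in events:
        t = datetime.fromtimestamp(ts, tz=timezone.utc)
        hourly[(item, t.strftime("%Y-%m-%d %H"))] += 1
        daily[(item, t.strftime("%Y-%m-%d"))] += 1
    return {"hour": dict(hourly), "day": dict(daily)}


events = [("a", 0), ("a", 1800), ("a", 3600), ("b", 90000)]
views = build_batch_views(events)
```

In this sketch the same pass over the persisted data produces both granularities, so a coarse query can be answered without aggregating the finer view at read time.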

Preferably, the stream processing module comprises a data processing submodule and a stream processing view storage submodule, wherein the data processing submodule is used to stream-process the real-time data using the incremental computation mechanism, and the stream processing view storage submodule is used to store the processing results produced by the data processing submodule in HBase, so as to generate stream processing views of different granularities.

Preferably, the data acquisition module is implemented using the Flume log collection system.

Preferably, the batch processing module is implemented using a Spark cluster.

Preferably, the stream processing module is implemented using a Storm cluster.

The present invention also provides a data processing method based on the above architecture, which comprises the following steps:

S1. The data acquisition module obtains collected real-time data from multiple data collection terminals and transmits the collected data to the batch processing module and the stream processing module;

S2. The batch processing module persists the received real-time data and then, when the condition for executing batch processing is met, batch-processes the persisted real-time data using the re-computation mechanism and generates batch views of different granularities from the processing results;

S3. The stream processing module stream-processes the received real-time data using the incremental computation mechanism and generates stream processing views of different granularities from the processing results;

S4. The data merging module merges the batch views and the stream processing views using a corresponding merge strategy, according to the specific query requirement;

S5. The data visualization module displays the batch views, the stream processing views, or the merged batch and stream processing views;

S6. The resource monitoring module monitors the resources of the data acquisition module, the batch processing module, the stream processing module, the data merging module, and the data visualization module throughout the above flow.

Compared with the prior art, the beneficial effects of the present invention are:

By using the batch processing module and the stream processing module together, the architecture provided by the present invention ensures the accuracy of the overall computation results while maintaining data processing efficiency.

Brief description of the drawings

Fig. 1 is a structural diagram of the architecture provided by the present invention.

Fig. 2 is a schematic diagram of the data acquisition module.

Fig. 3 is a diagram of computation task execution in the Spark cluster.

Fig. 4 is a flow chart of incremental computation in the stream processing module.

Fig. 5, Fig. 6, and Fig. 7 are schematic diagrams of data synchronization between the batch processing module and the stream processing module.

Fig. 8 is a schematic flow chart of data processing performed by the data merging module.

Detailed description of the embodiments

The accompanying drawings are for illustration only and shall not be construed as limiting this patent;

The present invention is further described below with reference to the drawings and embodiments.

Embodiment 1

The data processing architecture based on batch processing and stream processing, as shown in Fig. 1, comprises a data acquisition module 10, a batch processing module 20, a stream processing module 30, a data merging module 40, a data visualization module 50, and a resource monitoring module 60;

In a specific implementation, the data acquisition module 10 may be embodied as follows: a distributed, highly reliable, and highly available system for collecting and transporting massive logs, such as the Flume log collection system, receives multi-source data in real time. As shown in Fig. 2, the architecture has three agents: Agent1, Agent2, and Master Agent. The Flume log collection system receives external data through two sources. One is the Avro Source in Agent1, which listens on an IP address and port number; the other is the Spooldir source in Agent2, which watches a directory. After preliminary filtering of the collected real-time data, the data received from the two sources is sent to the Avro Source in the Master Agent. The architecture uses a replication strategy to send the data received by this Avro Source to both the File Channel and the Memory Channel at the same time; the data is then delivered to the HDFS Sink and the Kafka Sink for batch processing and stream processing.
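
The replication step at the Master Agent can be modeled in a few lines of Python. This is a toy in-process model, not Flume code: the `Channel` class and `replicate` function are hypothetical, and the pairing of the File Channel with the HDFS Sink and the Memory Channel with the Kafka Sink is an assumption that the text does not state explicitly.

```python
from collections import deque


class Channel:
    """A minimal in-process queue standing in for a Flume channel."""

    def __init__(self, name):
        self.name = name
        self.events = deque()

    def put(self, event):
        self.events.append(event)

    def drain(self):
        """Yield and remove every queued event, oldest first."""
        while self.events:
            yield self.events.popleft()


def replicate(event, channels):
    """Replication strategy: every channel receives its own copy of the event."""
    for channel in channels:
        channel.put(dict(event))


file_channel = Channel("file")      # assumed to feed the HDFS sink (batch path)
memory_channel = Channel("memory")  # assumed to feed the Kafka sink (stream path)

# Events as they might arrive from Agent1 (Avro) and Agent2 (spooling directory).
incoming = [
    {"src": "agent1", "body": "click a"},
    {"src": "agent2", "body": "click b"},
]
for event in incoming:
    replicate(event, [file_channel, memory_channel])

hdfs_sink = list(file_channel.drain())
kafka_sink = list(memory_channel.drain())
```

Because both paths receive every event, the batch side and the stream side start from identical input, which is what later makes their views mergeable.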

As shown in Fig. 3, the batch processing module 20 is implemented using a Spark cluster. During implementation, the runtime environment of the Spark application is built first, and the application is then submitted to the resource scheduler. The resources the application needs are prepared all at once, so the environment is built in a coarse-grained manner. The application is then converted into a DAG, and Spark divides the RDD dependencies into different stages. Dependencies are either narrow or wide: in a narrow dependency, each partition of the parent RDD is processed by only one partition of the child RDD, whereas in a wide dependency a partition of the parent RDD may be passed to multiple child RDD partitions. Spark uses a greedy algorithm to place as many narrow dependencies as possible in a single stage and processes multiple tasks in parallel within each stage. When the DAG is executed, stages that do not depend on other stages run first, followed by stages whose dependencies have completed. As with the optimization mechanisms in MapReduce, Spark takes data locality and speculative execution into account. The results of the batch processing module are stored in HBase to generate batch views of different granularities. The results and batch views are stored in HBase primarily to support random reads and writes.
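
The stage division described above can be approximated with a short Python sketch. It is a simplification and not Spark's scheduler: RDDs joined by narrow dependencies are grouped with a union-find structure, wide dependencies become edges between stages, and the stages are then ordered topologically. The RDD names and the function `split_into_stages` are hypothetical.

```python
from graphlib import TopologicalSorter


def split_into_stages(rdds, deps):
    """Group RDDs linked by narrow dependencies; wide dependencies cut stages.

    deps is a list of (child, parent, kind) with kind in {"narrow", "wide"}.
    Returns (stage_of, order): the stage label of each RDD and an
    execution order in which every stage follows the stages it depends on.
    """
    root = {r: r for r in rdds}

    def find(x):
        # Union-find lookup with path halving.
        while root[x] != x:
            root[x] = root[root[x]]
            x = root[x]
        return x

    for child, parent, kind in deps:
        if kind == "narrow":
            root[find(child)] = find(parent)

    stage_of = {r: find(r) for r in rdds}
    graph = {stage: set() for stage in set(stage_of.values())}
    for child, parent, kind in deps:
        if kind == "wide" and stage_of[child] != stage_of[parent]:
            graph[stage_of[child]].add(stage_of[parent])
    order = list(TopologicalSorter(graph).static_order())
    return stage_of, order


# A word-count style lineage: map is narrow, reduceByKey is wide.
rdds = ["lines", "mapped", "counts", "filtered"]
deps = [
    ("mapped", "lines", "narrow"),
    ("counts", "mapped", "wide"),
    ("filtered", "counts", "narrow"),
]
stage_of, order = split_into_stages(rdds, deps)
```

In this example the lineage splits into two stages at the shuffle boundary, and the stage containing `lines` and `mapped` is ordered before the stage containing `counts` and `filtered`.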

In a specific implementation, the stream processing module 30 is implemented using a Storm cluster. Its function is briefly described as follows:

In a Storm cluster, a real-time application is designed as a Topology, and the Topology is submitted to the cluster. The master node of the cluster distributes the code and assigns tasks to worker nodes for execution. A Topology contains two kinds of roles, spouts and bolts. A spout is the message source and is responsible for emitting the data stream in the form of tuples. A bolt is responsible for transforming the data stream; operations such as computation and filtering can be performed in a bolt, and a bolt can in turn send data to other bolts. The results of the stream processing module 30 and the views it generates are all stored in HBase, so that they can be updated with low latency when new data arrives.
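
The division of labor between the two roles can be sketched with Python generators. This is a single-process analogy rather than Storm API code; the input format (`user,item` lines), the blocked-user filter, and all function names are assumptions made for the example.

```python
from collections import Counter


def click_spout(lines):
    """Spout: emit the raw stream as (user, item) tuples."""
    for line in lines:
        user, item = line.split(",")
        yield (user, item)


def filter_bolt(tuples, blocked_users):
    """Bolt: drop tuples from blocked users and forward the rest."""
    for user, item in tuples:
        if user not in blocked_users:
            yield (user, item)


def count_bolt(tuples):
    """Bolt: count clicks per item over the tuples it receives."""
    counts = Counter()
    for _, item in tuples:
        counts[item] += 1
    return dict(counts)


stream = ["u1,a", "u2,a", "bot,a", "u1,b"]
result = count_bolt(filter_bolt(click_spout(stream), blocked_users={"bot"}))
```

Chaining one bolt into another mirrors how a topology wires bolts together; in a real cluster each stage would run as parallel tasks on worker nodes.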

Meanwhile, to improve data processing efficiency, the stream processing module 30 can use an incremental computation mechanism, summarized as follows. As shown in Fig. 4, when new data arrives at the stream processing module, the module first determines whether the new data affects existing data. If it does, the existing data is fetched from HBase and merged with the new data; if it does not, no merging is performed. The result of the above step is treated as the new data, and the corresponding algorithm is applied to it to generate the real-time view for the new data. The generated real-time view is then merged into the existing stream processing view.
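
A minimal Python sketch of this incremental flow follows, assuming the computation is a per-item click count and using dictionaries in place of HBase. The function name `incremental_update` and the data layout are illustrative assumptions.

```python
def incremental_update(store, view, new_clicks):
    """Apply one batch of newly arrived clicks.

    store stands in for the state kept in HBase, view for the stream
    processing view; new_clicks maps item -> number of new clicks.
    """
    touched = {}
    for item, n in new_clicks.items():
        if item in store:
            # New data affects existing data: fetch it and merge.
            touched[item] = store[item] + n
        else:
            # No existing data is affected: the new data stands alone.
            touched[item] = n
    store.update(touched)  # persist the merged state
    view.update(touched)   # refresh only the affected view entries
    return touched


store, view = {"a": 3, "z": 7}, {"a": 3, "z": 7}
delta = incremental_update(store, view, {"a": 2, "b": 1})
```

Only the entries touched by the new data are rewritten, so the cost of an update depends on the size of the increment rather than on the size of the stored view.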

In a specific implementation, to ensure that the data flowing into the batch processing module and the stream processing module is processed only once, data synchronization between the batch processing module and the stream processing module must be considered. The process is as follows:

The batch processing module and the stream processing module collect data at the same time. The batch processing module saves the data on HDFS, and the stream processing module saves the data in a table whose name identifies the current date and the content of the received data. The data synchronization problem is solved by dynamically maintaining two tables. As shown in Fig. 5, after the system has run for a period of time, the batch processing module and the stream processing module hold the same data, but the batch processing module has not yet reached the time point that triggers re-computation, that is, the data in the batch processing module has not yet been computed. At this point, assume that the table of the stream processing module is i_click.

As shown in Fig. 6, when the time point for re-computation is reached, re-computation in the batch processing module is triggered. Before re-computing, the batch processing module creates a new table according to the current system time to hold real-time data. The table is named i+1_click. Assuming that the data received during re-computation is block1 and block2, the stream processing module now holds two tables, i_click and i+1_click. The table i_click holds the real-time data received on day i, and i+1_click holds the new real-time data received on day i+1, namely block1 and block2.

Fig. 7 shows the result after the system has synchronized the data. After completing re-computation, the batch processing module deletes the table i_click, so the stream processing module retains only the data in the i+1_click table. Because the data in i_click has by now been computed in the batch processing module, the stream processing module no longer computes this part of the data; otherwise the data would be computed twice.
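
The two-table handover in Figs. 5 to 7 can be walked through with a small Python simulation. It is a sketch under assumptions: dictionaries stand in for HDFS and HBase, i is taken to be 1, the batch job is assumed to cover exactly the data received before the new table was opened, and the class and function names are hypothetical.

```python
class SpeedLayerTables:
    """Per-day click tables in the stream layer (dicts stand in for HBase tables)."""

    def __init__(self, day):
        self.day = day
        self.tables = {f"{day}_click": {}}

    def write(self, item, n=1):
        table = self.tables[f"{self.day}_click"]
        table[item] = table.get(item, 0) + n

    def roll(self):
        """Open the table for the next day; later writes go there."""
        old = f"{self.day}_click"
        self.day += 1
        self.tables[f"{self.day}_click"] = {}
        return old

    def drop(self, name):
        del self.tables[name]


def recompute(log):
    """Batch re-computation: recount every persisted click from scratch."""
    view = {}
    for item in log:
        view[item] = view.get(item, 0) + 1
    return view


hdfs_log = []                  # persisted copy of every event (batch side)
speed = SpeedLayerTables(day=1)

for item in ["a", "a", "b"]:   # Fig. 5: day-1 traffic reaches both layers
    hdfs_log.append(item)
    speed.write(item)

old_table = speed.roll()       # Fig. 6: open 2_click before re-computing
snapshot = list(hdfs_log)      # assumption: the batch job covers data up to here
for item in ["a", "c"]:        # block1 and block2 arrive during re-computation
    hdfs_log.append(item)
    speed.write(item)

batch_view = recompute(snapshot)
speed.drop(old_table)          # Fig. 7: 1_click is now covered by the batch view
```

After the handover, the batch view and the remaining stream table cover disjoint data, so adding them gives the same totals as counting the full log once.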

In a specific implementation, the data merging module 40 may be embodied as follows: for a user's specific business requirement, the computation results of the batch processing module and the stream processing module are merged, so that queries over the whole data set can be answered. The key point is therefore how to merge the batch views computed by the batch processing module with the real-time views computed by the stream processing module, selecting the corresponding merge strategy according to the specific business logic. If the query function satisfies the monoid property, that is, it is associative, the results of the batch view and the stream processing view can be merged directly. As shown in Fig. 8, to query the click volume of items over different time periods, the span of the input time period is examined first. If it falls entirely within the batch processing module, the result is obtained by querying the batch view alone; if it falls entirely within the stream processing module, the result is obtained by querying the stream processing view alone; if it spans both the batch processing module and the stream processing module, the batch view and the stream processing view are queried separately and the query results are then merged, that is, the counts for the same item are simply added. If the query function does not satisfy the monoid property, it can be converted into several query functions that do. Each of these query functions is evaluated separately against the batch view and the stream processing view, and the final result is then obtained by a further computation over those results.
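
The merge strategy can be sketched in Python as follows. The view layout (`(item, hour) -> count`), the `boundary` parameter separating batch-covered hours from stream-covered hours, and the use of an average as the non-monoid example are all assumptions introduced for illustration; the patent does not name a specific non-monoid query.

```python
def merge_counts(batch_result, stream_result):
    """Monoid merge for counts: add the values key by key."""
    merged = dict(batch_result)
    for item, n in stream_result.items():
        merged[item] = merged.get(item, 0) + n
    return merged


def query_clicks(start, end, boundary, batch_view, stream_view):
    """Answer a click query for hours in [start, end).

    Views map (item, hour) -> count. Hours below boundary are covered
    by the batch view, the rest by the stream processing view.
    """
    def select(view):
        out = {}
        for (item, hour), n in view.items():
            if start <= hour < end:
                out[item] = out.get(item, 0) + n
        return out

    if end <= boundary:       # entirely within the batch view
        return select(batch_view)
    if start >= boundary:     # entirely within the stream view
        return select(stream_view)
    return merge_counts(select(batch_view), select(stream_view))


def merged_mean(batch, stream):
    """A mean cannot be merged directly, but (sum, count) pairs can.

    Each argument is a (sum, count) pair taken from one view; the total
    count is assumed to be non-zero.
    """
    return (batch[0] + stream[0]) / (batch[1] + stream[1])


batch_view = {("a", 0): 2, ("a", 1): 1, ("b", 1): 4}
stream_view = {("a", 2): 3, ("c", 3): 1}
spanning = query_clicks(1, 3, 2, batch_view, stream_view)
```

Decomposing the average into a sum and a count is one common way to turn a query that is not directly mergeable into queries that are, with the division applied only after the merge.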

Obviously, the above embodiments of the present invention are merely examples given to illustrate the invention clearly and are not a limitation on the embodiments of the invention. Those of ordinary skill in the art can make other changes or variations in different forms on the basis of the above description. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the claims of the present invention.

Claims

1. A data processing architecture based on batch processing and stream processing, characterized by comprising a data acquisition module, a batch processing module, a stream processing module, a data merging module, a data visualization module, and a resource monitoring module;

2. The data processing architecture based on batch processing and stream processing according to claim 1, characterized in that: the data acquisition module comprises a data collection submodule and a data cleaning submodule, the data collection submodule is used to receive the collected real-time data from multiple data collection terminals, and the data cleaning submodule is used to clean the received real-time data using corresponding filtering rules.

3. The data processing architecture based on batch processing and stream processing according to claim 1, characterized in that: the batch processing module comprises a data preprocessing submodule, a data processing submodule, and a batch view storage submodule;

4. The data processing architecture based on batch processing and stream processing according to claim 1, characterized in that: the stream processing module comprises a data processing submodule and a stream processing view storage submodule, wherein the data processing submodule is used to stream-process the real-time data using the incremental computation mechanism, and the stream processing view storage submodule is used to store the processing results produced by the data processing submodule in HBase, so as to generate stream processing views of different granularities.

5. The data processing architecture based on batch processing and stream processing according to claim 2, characterized in that: the data acquisition module is implemented using the Flume log collection system.

6. The data processing architecture based on batch processing and stream processing according to claim 3, characterized in that: the batch processing module is implemented using a Spark cluster.

7. The data processing architecture based on batch processing and stream processing according to claim 4, characterized in that: the stream processing module is implemented using a Storm cluster.

8. A data processing method using the architecture according to any one of claims 1 to 7, characterized by comprising the following steps: