CN103345514B - Streaming data processing method under big data environment - Google Patents

Streaming data processing method under big data environment

Info

Publication number
CN103345514B
CN103345514B CN201310287554.XA
Authority
CN
China
Prior art keywords
data
intermediate result
streaming data
result
burst
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201310287554.XA
Other languages
Chinese (zh)
Other versions
CN103345514A (en)
Inventor
Dong Fang (东方)
Luo Junzhou (罗军舟)
Zhang Yi (张毅)
Wang Yuxiang (王宇翔)
Xu Xiaodong (徐晓冬)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Focus Technology Co Ltd
Original Assignee
Southeast University
Focus Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University, Focus Technology Co Ltd filed Critical Southeast University
Priority to CN201310287554.XA priority Critical patent/CN103345514B/en
Publication of CN103345514A publication Critical patent/CN103345514A/en
Application granted granted Critical
Publication of CN103345514B publication Critical patent/CN103345514B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention discloses a streaming data processing method for big data environments, chiefly an improvement of the MapReduce computation model. It comprises: a localized, non-redundant data storage and processing mechanism, under which each computing node stores and processes only the data of its assigned key interval; pipelined scheduling of the Map and Reduce threads to accelerate processing; and an in-memory storage mechanism for intermediate results that guarantees data locality and effective pipelining by providing fast, convenient memory access. Together, these three modules ensure reliable, efficient streaming data processing and meet the practical data-processing demands of big data environments.

Description

Streaming data processing method under big data environment
Technical field
The present invention relates to cloud computing, and in particular to big data processing and streaming data processing; specifically, it concerns a practical streaming data processing method for big data environments.
Background technology
In recent years, the rapid development of portal websites, social networks, e-commerce, and similar network applications, together with the continued growth and extension of their businesses, has produced and accumulated large volumes of business data. These data are large in total volume, diverse in structure, and grow at a high rate — they are typical big data.
On the other hand, users constantly access and use these network applications to obtain services, forming a series of real-time data streams. To meet users' demand for real-time service, a network application must not only analyze large volumes of historical data but also process the real-time streaming data quickly. This application scenario — fast processing of streaming data on top of big data — is a typical big data service.
Compared with ordinary data services, such big data services have distinctive characteristics. First, the business data are big data, while the newly arriving streaming data are small in scale and simple in structure. Second, the data stream arrives continuously, and the business data keep growing and are updated regularly. Finally, the streaming data must be processed quickly on top of the big data in order to provide efficient service.
For example, an e-commerce real-time recommendation system stores information on a large number of commodities along with users' registration, search, bookmarking, and purchase records; these are collectively called historical data. At the same time, the real-time access of a large number of users continuously generates real-time service request data. Effective real-time recommendation requires both analysis of the historical data and fast processing of the real-time data stream; only by combining the two can it be achieved. It is therefore necessary to study real-time streaming data processing techniques for big data environments, to support this class of big data application.
To process big data, Google first proposed the MapReduce batch processing model. Data processing methods built on this MapReduce framework are oriented mainly toward batch processing and do not support streaming data.
Existing streaming data processing systems tend to target specific application environments. For example, S4 (Simple Scalable Streaming System) is a distributed stream processing system inspired by MapReduce, used mainly for real-world applications such as search, error detection, and online dating. To avoid system complexity, S4 is designed exclusively for streaming data processing.
At present, effective methods for streaming data processing in big data environments are lacking.
Summary of the invention
Technical problem: as seen above, there is currently no effective streaming data processing system for big data environments. The present invention builds an efficient real-time streaming data processing framework on MapReduce, comprising three modules: a localized, non-redundant data storage and processing mechanism; pipelined scheduling of the Map and Reduce threads; and an in-memory storage mechanism for data and intermediate results.
Technical scheme: the streaming data processing method under a big data environment of the present invention comprises the following steps:
1): process the accumulated big data and historical data to generate an intermediate result set, partition this result set, and distribute it to the caches of the computing nodes;
2): each computing node periodically receives the whole data stream and obtains intermediate results through Map processing;
3): filter out, by the intermediate result partitioning method, the intermediate results belonging to this node; cache them on the local node; form a shard once the threshold of 10,000 is reached; and send this shard;
4): when an intermediate result shard arrives, input it, together with the historical data intermediate results, to Reduce according to the pipeline scheduling algorithm;
5): output the computation result; this result is a partial output of one task at one point in time, and all such results are merged into the same file to form the final output.
The accumulated big data and historical data in step 1) are all backed up in a distributed fashion. Before the system starts, or before a computation task is launched, this data must be read and preprocessed, then stored across the computing nodes in a distributed manner, in preparation for the subsequent computation.
In step 2), each computing node periodically receives the whole data stream; every node caches this data and generates intermediate results by processing it. Meanwhile, depending on the chosen historical data backup method, this streaming data must be merged into the historical data set at regular intervals.
In the intermediate result partitioning method of step 3), each node corresponds to a certain key value interval and processes only the data in that interval; that is, from all received streaming data for which intermediate results have been generated, it filters out the intermediate results falling in its interval. Once the specified threshold is reached, an intermediate result shard is formed and sent.
In step 4), when an intermediate result shard arrives, it is combined with the intermediate result group from step 1) as the input of the next Reduce task. Because one computation task generally produces multiple shards in step 3), a thread must be allocated for each such Reduce task, and the tasks computed asynchronously.
Step 5) outputs the computation result, i.e. the result produced by one Reduce computation. After all Reduce tasks complete, their results are merged and a single final result file is output.
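The five steps above can be outlined in a minimal Python sketch. All names here are illustrative (the patent does not define an API), and the shard threshold of 10,000 is lowered for demonstration:

```python
# Sketch of steps 1)-5): map the stream, filter by key interval,
# shard at a threshold, then reduce against historical intermediate results.
from collections import defaultdict

THRESHOLD = 3  # the patent uses 10,000; reduced for illustration

def map_phase(records):
    """Step 2): turn raw stream records into (key, value) intermediate results."""
    return [(rec["key"], rec["value"]) for rec in records]

def filter_for_node(pairs, interval):
    """Step 3): keep only pairs whose key falls in this node's interval."""
    return [(k, v) for k, v in pairs if k in interval]

def shard(buffer, pairs):
    """Step 3): cache pairs locally; emit a shard once the threshold is reached."""
    buffer.extend(pairs)
    if len(buffer) >= THRESHOLD:
        out, buffer[:] = buffer[:], []   # copy out the shard, clear the cache
        return out
    return None

def reduce_phase(historical, shard_pairs):
    """Step 4): merge a shard with the historical intermediate results."""
    totals = defaultdict(int, historical)
    for k, v in shard_pairs:
        totals[k] += v
    return dict(totals)
```

The sum-based reduce is one possible aggregation; any associative combine function fits the same pipeline.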
Beneficial effects: the advantages of the present invention are:
By adding three modules to the existing Hadoop platform to support fast streaming data processing in big data environments, the invention can support applications such as e-commerce real-time recommendation.
Compared with the prior art, this invention has the following advantages:
1. By integrating static big data processing techniques with real-time streaming data processing techniques, the invention provides a reference data processing method bridging the two classes of application;
2. Partitioning the data set with a probability model keeps the system load balanced, effectively increases system throughput, and offers a new approach to optimizing the parallel processing of big data;
3. Pipelining batches the data processing, refines its granularity, and accelerates computation, meeting the requirements of tasks demanding high responsiveness;
4. Adding memory management supports the effective operation of the above two modules and enhances the system's adaptability to data of different scales.
Description of the drawings
Fig. 1 is the processing flow chart;
Fig. 2 is the system architecture diagram;
Fig. 3 is the memory structure diagram.
Embodiment
Streaming data processing under a big data environment is implemented on Hadoop, an existing MapReduce-based framework, by adding three modules to its existing functionality to support stream processing. The method is especially suited to real-time streaming data processing on top of a large, structurally complex historical data set. It comprises a data localization module, a pipeline scheduling module, and a memory management module, implemented as follows:
In the data localization module, the key value range is usually partitioned with a hash function so that each node stores data without redundancy. To keep the data distribution balanced, a partitioning based on probability and statistics is adopted so that the data approximately follow a uniform distribution. The main steps are:
Step 1: randomly collect part of the historical data or streaming data as a sample and sort it by key; if the key is a string, sort by its encoded value;
Step 2: count the frequency with which each key occurs in the sorted sample;
Step 3: from the number of computing nodes N, obtain each node's desired load factor, normally the reciprocal 1/N of the node count;
Step 4: according to key frequency, assign keys from the key list to the N sets in order, keeping the total frequency of each set as close to the load factor as possible;
Step 5: each computing node receives the keys of one set, which form its key interval (set);
Step 6: each node receives or processes only the data within its key interval.
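Steps 1-6 can be sketched as a greedy frequency-balanced assignment. This is one simple way to realize the described partitioning; the function and variable names are assumptions, not part of the patent:

```python
# Sketch of steps 1-6: sample keys, count frequencies, then assign keys in
# sorted order to N sets so each set's total frequency stays near the 1/N
# load factor. A greedy "fill to capacity" pass stands in for the patent's
# unspecified assignment rule.
from collections import Counter

def partition_keys(sample, n_nodes):
    freq = Counter(sample)                  # step 2: key frequencies
    total = sum(freq.values())
    load_factor = total / n_nodes           # step 3: each node targets 1/N of the load
    sets = [set() for _ in range(n_nodes)]
    weights = [0.0] * n_nodes
    node = 0
    for key in sorted(freq):                # step 1: keys in sorted order
        # step 4: move to the next set once this one is near its load factor
        if weights[node] >= load_factor and node < n_nodes - 1:
            node += 1
        sets[node].add(key)
        weights[node] += freq[key]
    return sets                             # step 5: one key set (interval) per node
```

With the sample `"aaabbcddd"` and two nodes, the nine occurrences split into `{"a", "b"}` (weight 5) and `{"c", "d"}` (weight 4), each close to the 4.5 load factor.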
In the pipeline scheduling module, computation is accelerated by distributing intermediate data (map output) and computation results (reduce output) asynchronously; this asynchronous process must be controlled according to system load. The system's runtime parameters are monitored, and the distribution speed and the dispatch of computation tasks are regulated accordingly. The main steps are:
Step 1: monitor the number of streaming data intermediate result shards, i.e. the number of intermediate results cached after each pass of the stream through map;
Step 2: monitor the number of intermediate results already dispatched, i.e. those not yet processed by the reduce stage and still sitting in the cache;
Step 3: count the number of map threads already dispatched, the number of reduce threads, and the maximum thread count the platform allocates to the system.
Having obtained these parameters, the system regulates the execution speed of each task by the following steps, maximizing task execution speed while ensuring that no resource overflows:
Step 1: when the number of threads allocated by the system is limited, if the data stream arrives slowly, so that the map tasks are lightly loaded, proceed to step 2; otherwise proceed to step 3;
Step 2: if the reduce stage consumes intermediate results faster than map produces them, shrink the map intermediate result cache; otherwise enlarge the reduce-stage cache, or moderately increase the reduce thread count;
Step 3: increase the map thread count as far as possible to ensure no data is lost, and enlarge the map cache;
Step 4: return to step 1.
Generally, when executing lightweight jobs the system load is light and threads are plentiful, so these cache limits and the restrictions on dispatching map and reduce tasks can be lifted, letting the system respond to and process data as quickly as possible.
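The regulation loop of steps 1-4 can be sketched as one adjustment pass over a state record. The state fields and the unit adjustment steps are illustrative assumptions; the patent leaves the exact quanta unspecified:

```python
# Sketch of the regulation loop (steps 1-4 above): inspect the monitored
# load indicators, then adjust caches or thread counts one unit at a time.
def regulate(state):
    """One pass of the scheduling loop over a `state` dict holding the
    monitored quantities; mutates and returns it."""
    if state["map_load_light"]:                       # step 1 -> step 2
        if state["reduce_rate"] > state["map_rate"]:  # reduce outpaces map
            state["map_cache"] = max(1, state["map_cache"] - 1)
        else:                                         # map outpaces reduce
            state["reduce_cache"] += 1
            if state["map_threads"] + state["reduce_threads"] < state["max_threads"]:
                state["reduce_threads"] += 1
    else:                                             # step 1 -> step 3
        if state["map_threads"] + state["reduce_threads"] < state["max_threads"]:
            state["map_threads"] += 1                 # keep up with arrivals
        state["map_cache"] += 1                       # avoid losing data
    return state                                      # step 4: the caller loops again
```

In a running system this function would be invoked periodically by a monitor thread; here it is pure bookkeeping so the control logic is easy to inspect.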
In the memory management module, storage space is expanded by combining main memory with external storage (mainly disk), guaranteeing the scalability of the intermediate data cache while keeping data lookup and reads fast. This targets mainly situations where the system's tasks are heavy and the data volume is large, and comprises the following steps:
Step 1: build an intermediate result index area keyed by a hash of the key, resident in memory, with each hash entry pointing to an information header structure;
Step 2: set up an intermediate result cache area in memory, sized according to the available memory; the information headers of step 1 locate data in this cache area and return the results;
Step 3: when the intermediate results are large and memory space is insufficient, set up a storage area on external storage; the external area likewise contains an index area and a data area, but no cache area;
Step 4: when an information header cannot find the corresponding data in the cache area, look it up in the external storage area.
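Steps 1-4 amount to a two-tier store behind a memory-resident index. The sketch below uses a plain dict as a stand-in for external storage, and the class and field names are assumptions for illustration:

```python
# Sketch of steps 1-4: a memory-resident hash index of "information headers",
# a bounded in-memory cache area, and an overflow area standing in for disk.
class TieredStore:
    def __init__(self, cache_capacity):
        self.index = {}            # step 1: key -> info header ("mem" or "ext")
        self.cache = {}            # step 2: in-memory cache area
        self.external = {}         # step 3: external storage area (disk stand-in)
        self.capacity = cache_capacity

    def put(self, key, value):
        if len(self.cache) < self.capacity:
            self.cache[key] = value
            self.index[key] = "mem"
        else:                      # memory full: spill to external storage
            self.external[key] = value
            self.index[key] = "ext"

    def get(self, key):
        where = self.index.get(key)
        if where == "mem":         # found in the cache area
            return self.cache[key]
        if where == "ext":         # step 4: fall through to the external area
            return self.external[key]
        return None                # key unknown to the index
```

The "information header" is reduced here to a location tag; a fuller implementation would carry offsets and sizes for the external records.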
In a practical system, the intermediate results produced are generally key-value pairs. If the cache still has free space, the intermediate result is stored there directly, the index area's table entry is updated, and an information header is assigned; otherwise the data is written to external storage and the information header points to the external copy.
In typical applications the historical data set is large, so the historical data may be stored on external storage with an internal swap area set up; the real-time data stream is kept in memory. Ready-made algorithms and data structures exist for these techniques.
The present invention is described below in further detail with reference to the drawings and specific embodiments.
The basic processing flow of the invention is shown in Fig. 1, and its clear hierarchical structure in Fig. 2. The main additions are the localization, pipeline, and memory management modules; memory management chiefly serves the other two modules, guaranteeing the reliability of their data storage. Adding these three modules under the Hadoop framework provides support for data streams on top of the existing big data processing. The implementation is as follows:
In the data localization module, the data processed are generally <key, value> pairs, and the key value range is partitioned by the byte code of the key using a probability statistics model: samples are drawn at random from the historical data and part of the data stream, the occurrence frequency of each key in the key-value pairs is analyzed, and the interval boundaries are drawn so that the key values are roughly evenly distributed. After partitioning, node Ni corresponds to the key value interval Si; assuming the node count is N, the interval partitioning algorithm is as follows:
After interval partitioning, each node processes only the <key, value> pairs in its own interval. Thus, when node k processes a data item <key(x), value(y)> in the stream, it can decide whether to filter out the item by checking whether key(x) lies in interval[k].
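The filtering test at node k can be sketched directly; the interval is represented here as a key set of the kind the partitioning step produces, and the names are illustrative:

```python
# Sketch of the filtering test: node k keeps a data item <key, value>
# only if its key falls in interval[k].
def belongs_to_node(item, interval):
    key, _value = item
    return key in interval

def process_stream(stream, interval):
    """Return only the <key, value> items this node should process."""
    return [item for item in stream if belongs_to_node(item, interval)]
```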
In the pipeline scheduling module, parameter information is first gathered while the relevant job runs, the system's health is then judged from these parameters, and the job's running state is changed by controlling them.
The information to gather is:
Vin (split/s): the average arrival rate of the streaming data;
Vmc (split/s): the average consumption rate of the map threads;
Vmp (piece/s): the average production rate of the map threads;
Vrc (piece/s): the average consumption rate of the reduce threads;
TN: the thread count allocated to the job; TM: the map thread count; TR: the reduce thread count.
First, judge whether TN exceeds the thread count an ordinary job needs. If it does, system resources are ample, and a shard can be sent directly to the reduce stage for processing as soon as map produces it; otherwise the map and reduce thread counts and the intermediate result caches must be controlled according to circumstances. The specific algorithm is as follows:
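The TN judgment can be sketched with the monitored rates. The estimate of the threads an "ordinary job" needs is an assumption (the patent does not give a formula), as are the function name and return labels:

```python
# Sketch of the dispatch decision: if the allocated thread count T_N exceeds
# what an ordinary job needs, ship each map shard straight to reduce;
# otherwise fall back to throttling via caches and thread counts.
import math

def dispatch_mode(t_n, t_r, v_in, v_mc):
    # illustrative estimate: enough map threads to keep up with the stream
    # arrival rate V_in given per-thread consumption rate V_mc, plus the
    # reduce threads already in use
    needed = math.ceil(v_in / max(v_mc, 1e-9)) + t_r
    if t_n > needed:
        return "direct"       # ample resources: shards go straight to reduce
    return "throttled"        # control caches and thread counts instead
```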
In the memory management module, the storage and access patterns of the intermediate results are reorganized to guarantee the high reliability and transparency of intermediate result storage, as shown in Fig. 3. A hash table is first used to address the intermediate results quickly; once the information header is found, the in-memory or external data is accessed directly through it. The information headers and the hash table are kept in memory; the intermediate data is stored partly in memory and partly on external storage, and the exchange of pages (key-value records) between memory and external storage can be handled by the LRU (least recently used) algorithm.
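The LRU page exchange between memory and external storage can be sketched with `collections.OrderedDict`, whose `move_to_end`/`popitem` calls maintain recency order; the class name and disk stand-in are illustrative:

```python
# Sketch of the LRU page exchange: the least recently used key-value
# record is evicted to external storage when the in-memory area is full,
# and paged back in on access.
from collections import OrderedDict

class LRUPager:
    def __init__(self, capacity):
        self.memory = OrderedDict()   # most recently used at the end
        self.disk = {}                # stand-in for external storage
        self.capacity = capacity

    def access(self, key, value=None):
        if key in self.memory:
            self.memory.move_to_end(key)      # refresh recency
        elif key in self.disk:                # page the record back in
            self.memory[key] = self.disk.pop(key)
        elif value is not None:               # first insertion of this key
            self.memory[key] = value
        else:
            return None
        if len(self.memory) > self.capacity:  # evict the least recently used
            old_key, old_val = self.memory.popitem(last=False)
            self.disk[old_key] = old_val
        return self.memory[key]
```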

Claims (6)

1. A streaming data processing method under a big data environment, characterized in that the method comprises the following steps:
1): process the accumulated big data and historical data to generate an intermediate result set, partition this result set, and distribute it to the caches of the computing nodes;
2): each computing node periodically receives the whole data stream and obtains intermediate results through Map processing;
3): filter out, by the intermediate result partitioning method, the intermediate results belonging to this node; cache them on the local node; form a shard once the threshold of 10,000 is reached; and send this shard;
4): when an intermediate result shard arrives, input it, together with the historical data intermediate results, to Reduce according to the pipeline scheduling algorithm;
5): output the computation result; this result is a partial output of one task at one point in time, and all such results are merged into the same file to form the final output.
2. The streaming data processing method under a big data environment according to claim 1, characterized in that: the accumulated big data and historical data in step 1) are all backed up in a distributed fashion; before the system starts, or before a computation task is launched, this data must be read and preprocessed, then stored across the computing nodes in a distributed manner, in preparation for the subsequent computation.
3. The streaming data processing method under a big data environment according to claim 1, characterized in that: in step 2) each computing node periodically receives the whole data stream; every computing node caches this whole stream and generates intermediate results by processing it; meanwhile, depending on the chosen historical data backup method, this whole stream must be merged into the historical data set at regular intervals.
4. The streaming data processing method under a big data environment according to claim 1, characterized in that: in the intermediate result partitioning method of step 3), each node corresponds to a certain key value interval and processes only the data in that interval; that is, from all received streaming data for which intermediate results have been generated, it filters out the intermediate results falling in its interval; once the specified threshold is reached, an intermediate result shard is formed and sent.
5. The streaming data processing method under a big data environment according to claim 1, characterized in that: in step 4), when an intermediate result shard arrives, it is combined with the intermediate result group of step 1) as the input of the next Reduce task; meanwhile, because one computation task generally produces multiple shards in step 3), a thread must be allocated for each such Reduce task and the tasks computed asynchronously.
6. The streaming data processing method under a big data environment according to claim 1, characterized in that: step 5) outputs the computation result, i.e. the result produced by one Reduce computation; after all Reduce tasks complete, these Reduce results are merged and a single final result file is output.
CN201310287554.XA 2013-07-09 2013-07-09 Streaming data processing method under big data environment Expired - Fee Related CN103345514B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310287554.XA CN103345514B (en) 2013-07-09 2013-07-09 Streaming data processing method under big data environment


Publications (2)

Publication Number Publication Date
CN103345514A CN103345514A (en) 2013-10-09
CN103345514B true CN103345514B (en) 2016-06-08

Family

ID=49280309

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310287554.XA Expired - Fee Related CN103345514B (en) 2013-07-09 2013-07-09 Streaming data processing method under big data environment

Country Status (1)

Country Link
CN (1) CN103345514B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106850257A (en) * 2016-12-22 2017-06-13 北京锐安科技有限公司 The detection method and device of a kind of stream data

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104978232A (en) * 2014-04-09 2015-10-14 阿里巴巴集团控股有限公司 Computation resource capacity expansion method for real-time stream-oriented computation, computation resource release method for real-time stream-oriented computation, computation resource capacity expansion device for real-time stream-oriented computation and computation resource release device for real-time stream-oriented computation
CN105243063B (en) * 2014-06-18 2019-11-15 北京新媒传信科技有限公司 The method and apparatus of information recommendation
WO2016123808A1 (en) * 2015-02-06 2016-08-11 华为技术有限公司 Data processing system, calculation node and data processing method
CN104636209B (en) * 2015-02-15 2018-08-24 大连云动力科技有限公司 The resource scheduling system and method optimized based on big data and cloud storage system directional properties
CN104636199A (en) * 2015-03-13 2015-05-20 华存数据信息技术有限公司 Real-time large data processing system and method based on distributed internal memory calculation
CN104683488B (en) * 2015-03-31 2018-03-30 百度在线网络技术(北京)有限公司 Streaming computing system and its dispatching method and device
US9900386B2 (en) 2015-04-09 2018-02-20 International Business Machines Corporation Provisioning data to distributed computing systems
GB201516727D0 (en) * 2015-09-22 2015-11-04 Ibm Distributed merging of data sets
CN105205563B (en) * 2015-09-28 2017-02-08 国网山东省电力公司菏泽供电公司 Short-term load predication platform based on large data
CN106681991A (en) * 2015-11-05 2017-05-17 阿里巴巴集团控股有限公司 Method and equipment for detecting continuous time signal data
CN105930203B (en) * 2015-12-29 2019-08-13 中国银联股份有限公司 A kind of method and device of control message distribution
CN105681303B (en) * 2016-01-15 2019-02-01 中国科学院计算机网络信息中心 A kind of network safety situation monitoring of big data driving and method for visualizing
CN105976140B (en) * 2016-04-27 2019-10-11 大连海事大学 Vehicle goods real-time matching method under extensive stream data environment
CN105975521A (en) * 2016-04-28 2016-09-28 乐视控股(北京)有限公司 Stream data uploading method and device
JP7002459B2 (en) 2016-08-22 2022-01-20 オラクル・インターナショナル・コーポレイション Systems and methods for ontology induction with statistical profiling and reference schema matching
CN106383886B (en) * 2016-09-21 2019-08-30 深圳市博瑞得科技有限公司 A kind of big data based on the distributed programmed frame of big data is united system and method in advance
CN106815299A (en) * 2016-12-09 2017-06-09 中电科华云信息技术有限公司 The detection method of the Density Estimator outlier based on distributed traffic
CN106844712A (en) * 2017-02-07 2017-06-13 济南浪潮高新科技投资发展有限公司 The implementation method of the real-time analysis for crawl data is calculated using streaming
CN106850849A (en) * 2017-03-15 2017-06-13 联想(北京)有限公司 A kind of data processing method, device and server
CN107341084B (en) * 2017-05-16 2021-07-06 创新先进技术有限公司 Data processing method and device
CN108289125B (en) * 2018-01-26 2021-05-28 华南理工大学 TCP session recombination and statistical data extraction method based on stream processing
CN110533183B (en) * 2019-08-30 2021-08-20 东南大学 Task placement method for heterogeneous network perception in pipeline distributed deep learning
CN111210340B (en) * 2020-01-03 2023-08-18 中国建设银行股份有限公司 Automatic task processing method, device, server and storage medium
CN111399851B (en) * 2020-06-06 2021-01-15 四川新网银行股份有限公司 Batch processing execution method based on distributed system
CN112202692A (en) * 2020-09-30 2021-01-08 北京百度网讯科技有限公司 Data distribution method, device, equipment and storage medium
WO2023066248A1 (en) * 2021-10-22 2023-04-27 华为技术有限公司 Data processing method and apparatus, device, and system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1858735A (en) * 2005-12-30 2006-11-08 华为技术有限公司 Method for processing mass data
CN102467570A (en) * 2010-11-17 2012-05-23 日电(中国)有限公司 Connection query system and method for distributed data warehouse



Also Published As

Publication number Publication date
CN103345514A (en) 2013-10-09

Similar Documents

Publication Publication Date Title
CN103345514B (en) Streaming data processing method under big data environment
CN110166282B (en) Resource allocation method, device, computer equipment and storage medium
US9740706B2 (en) Management of intermediate data spills during the shuffle phase of a map-reduce job
US9235611B1 (en) Data growth balancing
CN111176832A (en) Performance optimization and parameter configuration method based on memory computing framework Spark
CN107329814A (en) A kind of distributed memory database query engine system based on RDMA
CN109885397A (en) The loading commissions migration algorithm of time delay optimization in a kind of edge calculations environment
CN107103068A (en) The update method and device of service buffer
CN105808358B (en) A kind of data dependence thread packet mapping method for many-core system
CN103324765A (en) Multi-core synchronization data query optimization method based on column storage
CN102054000A (en) Data querying method, device and system
CN103440246A (en) Intermediate result data sequencing method and system for MapReduce
Tang et al. An intermediate data partition algorithm for skew mitigation in spark computing environment
Fan et al. Intelligent resource scheduling based on locality principle in data center networks
CN106909624B (en) Real-time sequencing optimization method for mass data
Theeten et al. Chive: Bandwidth optimized continuous querying in distributed clouds
CN108389152B (en) Graph processing method and device for graph structure perception
Wang et al. An Improved Memory Cache Management Study Based on Spark.
CN107544848B (en) Cluster expansion method, apparatus, electronic equipment and storage medium
Chen et al. MRSIM: mitigating reducer skew In MapReduce
CN112445776A (en) Presto-based dynamic barrel dividing method, system, equipment and readable storage medium
Wang et al. Waterwheel: Realtime indexing and temporal range query processing over massive data streams
CN108932258A (en) Data directory processing method and processing device
CN103324577A (en) Large-scale itemizing file distributing system based on minimum IO access conflict and file itemizing
KR20180077830A (en) Processing method for a relational query in distributed stream processing engine based on shared-nothing architecture, recording medium and device for performing the method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160608