CN103345514B - Streaming data processing method under big data environment - Google Patents

Streaming data processing method under big data environment

Info

Publication number
CN103345514B
CN103345514B CN201310287554.XA
Authority
CN
China
Prior art keywords
data
intermediate result
streaming data
result
burst
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201310287554.XA
Other languages
Chinese (zh)
Other versions
CN103345514A (en)
Inventor
Dong Fang (东方)
Luo Junzhou (罗军舟)
Zhang Yi (张毅)
Wang Yuxiang (王宇翔)
Xu Xiaodong (徐晓冬)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Focus Technology Co Ltd
Original Assignee
Southeast University
Focus Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University, Focus Technology Co Ltd filed Critical Southeast University
Priority to CN201310287554.XA priority Critical patent/CN103345514B/en
Publication of CN103345514A publication Critical patent/CN103345514A/en
Application granted granted Critical
Publication of CN103345514B publication Critical patent/CN103345514B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention discloses a streaming data processing method for big data environments, chiefly an improvement of the MapReduce computation model. It comprises: a localized, non-redundant data storage and processing mechanism, under which each computing node stores and processes only the data of its assigned key interval; pipelined scheduling of the Map and Reduce threads to accelerate processing; and an in-memory storage mechanism for intermediate results that guarantees data locality and effective pipelining by providing fast, convenient memory access. Together, these three modules ensure reliable, efficient streaming data processing and meet the practical data-processing demands of big data environments.

Description

Streaming data processing method under big data environment
Technical field
The present invention relates to cloud computing, and in particular to big data processing and streaming data processing; specifically, it concerns a practical streaming data processing method for big data environments.
Background technology
In recent years, the rapid development of portal websites, social networks, e-commerce, and similar network applications, together with the continued growth and extension of their businesses, has produced and accumulated large volumes of business data. These data are large in total volume, diverse in structure, and grow at a high rate — they are typical big data.
On the other hand, users constantly access and use these network applications to obtain services, forming a series of real-time data streams. To meet users' demand for real-time service, a network application must not only analyze large volumes of historical data but also process the real-time streaming data quickly. This application scenario — fast processing of streaming data on top of big data — is a typical big data service.
Compared with ordinary data services, such big data services have distinctive characteristics. First, the business data are big data, while the newly arriving streaming data are small in scale and simple in structure. Second, the data stream arrives continuously, and the business data keep growing and are updated regularly. Finally, the streaming data must be processed quickly on top of the big data in order to provide efficient service.
For example, an e-commerce real-time recommendation system stores information on a large number of commodities along with users' registration, search, bookmarking, and purchase records; these are collectively called historical data. At the same time, the real-time access of a large number of users continuously generates real-time service request data. Effective real-time recommendation requires both analysis of the historical data and fast processing of the real-time data stream; only by combining the two can it be achieved. It is therefore necessary to study real-time streaming data processing techniques for big data environments, to support this class of big data application.
To process big data, Google first proposed the MapReduce batch processing model. Data processing methods built on this MapReduce framework are oriented mainly toward batch processing and do not support streaming data.
Existing streaming data processing systems tend to target specific application environments. For example, S4 (Simple Scalable Streaming System) is a distributed stream processing system inspired by MapReduce, used mainly for real-world applications such as search, error detection, and online dating. To avoid system complexity, S4 is designed exclusively for streaming data processing.
At present, effective methods for streaming data processing in big data environments are lacking.
Summary of the invention
Technical problem: as seen above, there is currently no effective streaming data processing system for big data environments. The present invention builds an efficient real-time streaming data processing framework on MapReduce, comprising three modules: a localized, non-redundant data storage and processing mechanism; pipelined scheduling of the Map and Reduce threads; and an in-memory storage mechanism for data and intermediate results.
Technical scheme: the streaming data processing method under a big data environment of the present invention comprises the following steps:
1): process the accumulated big data and historical data to generate an intermediate result set, partition this result set, and distribute it to the caches of the computing nodes;
2): each computing node periodically receives the whole data stream and obtains intermediate results through Map processing;
3): filter out, by the intermediate result partitioning method, the intermediate results belonging to this node; cache them on the local node; form a shard once the threshold of 10,000 is reached; and send this shard;
4): when an intermediate result shard arrives, input it, together with the historical data intermediate results, to Reduce according to the pipeline scheduling algorithm;
5): output the computation result; this result is a partial output of one task at one point in time, and all such results are merged into the same file to form the final output.
The accumulated big data and historical data in step 1) are all backed up in a distributed fashion. Before the system starts, or before a computation task is launched, this data must be read and preprocessed, then stored across the computing nodes in a distributed manner, in preparation for the subsequent computation.
In step 2), each computing node periodically receives the whole data stream; every node caches this data and generates intermediate results by processing it. Meanwhile, depending on the chosen historical data backup method, this streaming data must be merged into the historical data set at regular intervals.
In the intermediate result partitioning method of step 3), each node corresponds to a certain key value interval and processes only the data in that interval; that is, from all received streaming data for which intermediate results have been generated, it filters out the intermediate results falling in its interval. Once the specified threshold is reached, an intermediate result shard is formed and sent.
In step 4), when an intermediate result shard arrives, it is combined with the intermediate result group from step 1) as the input of the next Reduce task. Because one computation task generally produces multiple shards in step 3), a thread must be allocated for each such Reduce task, and the tasks computed asynchronously.
Step 5) outputs the computation result, i.e. the result produced by one Reduce computation. After all Reduce tasks complete, their results are merged and a single final result file is output.
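The five steps above can be outlined in a minimal Python sketch. All names here are illustrative (the patent does not define an API), and the shard threshold of 10,000 is lowered for demonstration:

```python
# Sketch of steps 1)-5): map the stream, filter by key interval,
# shard at a threshold, then reduce against historical intermediate results.
from collections import defaultdict

THRESHOLD = 3  # the patent uses 10,000; reduced for illustration

def map_phase(records):
    """Step 2): turn raw stream records into (key, value) intermediate results."""
    return [(rec["key"], rec["value"]) for rec in records]

def filter_for_node(pairs, interval):
    """Step 3): keep only pairs whose key falls in this node's interval."""
    return [(k, v) for k, v in pairs if k in interval]

def shard(buffer, pairs):
    """Step 3): cache pairs locally; emit a shard once the threshold is reached."""
    buffer.extend(pairs)
    if len(buffer) >= THRESHOLD:
        out, buffer[:] = buffer[:], []   # copy out the shard, clear the cache
        return out
    return None

def reduce_phase(historical, shard_pairs):
    """Step 4): merge a shard with the historical intermediate results."""
    totals = defaultdict(int, historical)
    for k, v in shard_pairs:
        totals[k] += v
    return dict(totals)
```

The sum-based reduce is one possible aggregation; any associative combine function fits the same pipeline.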
Beneficial effects: the advantages of the present invention are:
By adding three modules to the existing Hadoop platform to support fast streaming data processing in big data environments, the invention can support applications such as e-commerce real-time recommendation.
Compared with the prior art, this invention has the following advantages:
1. By integrating static big data processing techniques with real-time streaming data processing techniques, the invention provides a reference data processing method bridging the two classes of application;
2. Partitioning the data set with a probability model keeps the system load balanced, effectively increases system throughput, and offers a new approach to optimizing the parallel processing of big data;
3. Pipelining batches the data processing, refines its granularity, and accelerates computation, meeting the requirements of tasks demanding high responsiveness;
4. Adding memory management supports the effective operation of the above two modules and enhances the system's adaptability to data of different scales.
Description of the drawings
Fig. 1 is the processing flow chart;
Fig. 2 is the system architecture diagram;
Fig. 3 is the memory structure diagram.
Embodiment
Streaming data processing under a big data environment is implemented on Hadoop, an existing MapReduce-based framework, by adding three modules to its existing functionality to support stream processing. The method is especially suited to real-time streaming data processing on top of a large, structurally complex historical data set. It comprises a data localization module, a pipeline scheduling module, and a memory management module, implemented as follows:
In the data localization module, the key value range is usually partitioned with a hash function so that each node stores data without redundancy. To keep the data distribution balanced, a partitioning based on probability and statistics is adopted so that the data approximately follow a uniform distribution. The main steps are:
Step 1: randomly collect part of the historical data or streaming data as a sample and sort it by key; if the key is a string, sort by its encoded value;
Step 2: count the frequency with which each key occurs in the sorted sample;
Step 3: from the number of computing nodes N, obtain each node's desired load factor, normally the reciprocal 1/N of the node count;
Step 4: according to key frequency, assign keys from the key list to the N sets in order, keeping the total frequency of each set as close to the load factor as possible;
Step 5: each computing node receives the keys of one set, which form its key interval (set);
Step 6: each node receives or processes only the data within its key interval.
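Steps 1-6 can be sketched as a greedy frequency-balanced assignment. This is one simple way to realize the described partitioning; the function and variable names are assumptions, not part of the patent:

```python
# Sketch of steps 1-6: sample keys, count frequencies, then assign keys in
# sorted order to N sets so each set's total frequency stays near the 1/N
# load factor. A greedy "fill to capacity" pass stands in for the patent's
# unspecified assignment rule.
from collections import Counter

def partition_keys(sample, n_nodes):
    freq = Counter(sample)                  # step 2: key frequencies
    total = sum(freq.values())
    load_factor = total / n_nodes           # step 3: each node targets 1/N of the load
    sets = [set() for _ in range(n_nodes)]
    weights = [0.0] * n_nodes
    node = 0
    for key in sorted(freq):                # step 1: keys in sorted order
        # step 4: move to the next set once this one is near its load factor
        if weights[node] >= load_factor and node < n_nodes - 1:
            node += 1
        sets[node].add(key)
        weights[node] += freq[key]
    return sets                             # step 5: one key set (interval) per node
```

With the sample `"aaabbcddd"` and two nodes, the nine occurrences split into `{"a", "b"}` (weight 5) and `{"c", "d"}` (weight 4), each close to the 4.5 load factor.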
In the pipeline scheduling module, computation is accelerated by distributing intermediate data (map output) and computation results (reduce output) asynchronously; this asynchronous process must be controlled according to system load. The system's runtime parameters are monitored, and the distribution speed and the dispatch of computation tasks are regulated accordingly. The main steps are:
Step 1: monitor the number of streaming data intermediate result shards, i.e. the number of intermediate results cached after each pass of the stream through map;
Step 2: monitor the number of intermediate results already dispatched, i.e. those not yet processed by the reduce stage and still sitting in the cache;
Step 3: count the number of map threads already dispatched, the number of reduce threads, and the maximum thread count the platform allocates to the system.
Having obtained these parameters, the system regulates the execution speed of each task by the following steps, maximizing task execution speed while ensuring that no resource overflows:
Step 1: when the number of threads allocated by the system is limited, if the data stream arrives slowly, so that the map tasks are lightly loaded, proceed to step 2; otherwise proceed to step 3;
Step 2: if the reduce stage consumes intermediate results faster than map produces them, shrink the map intermediate result cache; otherwise enlarge the reduce-stage cache, or moderately increase the reduce thread count;
Step 3: increase the map thread count as far as possible to ensure no data is lost, and enlarge the map cache;
Step 4: return to step 1.
Generally, when executing lightweight jobs the system load is light and threads are plentiful, so these cache limits and the restrictions on dispatching map and reduce tasks can be lifted, letting the system respond to and process data as quickly as possible.
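The regulation loop of steps 1-4 can be sketched as one adjustment pass over a state record. The state fields and the unit adjustment steps are illustrative assumptions; the patent leaves the exact quanta unspecified:

```python
# Sketch of the regulation loop (steps 1-4 above): inspect the monitored
# load indicators, then adjust caches or thread counts one unit at a time.
def regulate(state):
    """One pass of the scheduling loop over a `state` dict holding the
    monitored quantities; mutates and returns it."""
    if state["map_load_light"]:                       # step 1 -> step 2
        if state["reduce_rate"] > state["map_rate"]:  # reduce outpaces map
            state["map_cache"] = max(1, state["map_cache"] - 1)
        else:                                         # map outpaces reduce
            state["reduce_cache"] += 1
            if state["map_threads"] + state["reduce_threads"] < state["max_threads"]:
                state["reduce_threads"] += 1
    else:                                             # step 1 -> step 3
        if state["map_threads"] + state["reduce_threads"] < state["max_threads"]:
            state["map_threads"] += 1                 # keep up with arrivals
        state["map_cache"] += 1                       # avoid losing data
    return state                                      # step 4: the caller loops again
```

In a running system this function would be invoked periodically by a monitor thread; here it is pure bookkeeping so the control logic is easy to inspect.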
In the memory management module, storage space is expanded by combining main memory with external storage (mainly disk), guaranteeing the scalability of the intermediate data cache while keeping data lookup and reads fast. This targets mainly situations where the system's tasks are heavy and the data volume is large, and comprises the following steps:
Step 1: build an intermediate result index area keyed by a hash of the key, resident in memory, with each hash entry pointing to an information header structure;
Step 2: set up an intermediate result cache area in memory, sized according to the available memory; the information headers of step 1 locate data in this cache area and return the results;
Step 3: when the intermediate results are large and memory space is insufficient, set up a storage area on external storage; the external area likewise contains an index area and a data area, but no cache area;
Step 4: when an information header cannot find the corresponding data in the cache area, look it up in the external storage area.
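Steps 1-4 amount to a two-tier store behind a memory-resident index. The sketch below uses a plain dict as a stand-in for external storage, and the class and field names are assumptions for illustration:

```python
# Sketch of steps 1-4: a memory-resident hash index of "information headers",
# a bounded in-memory cache area, and an overflow area standing in for disk.
class TieredStore:
    def __init__(self, cache_capacity):
        self.index = {}            # step 1: key -> info header ("mem" or "ext")
        self.cache = {}            # step 2: in-memory cache area
        self.external = {}         # step 3: external storage area (disk stand-in)
        self.capacity = cache_capacity

    def put(self, key, value):
        if len(self.cache) < self.capacity:
            self.cache[key] = value
            self.index[key] = "mem"
        else:                      # memory full: spill to external storage
            self.external[key] = value
            self.index[key] = "ext"

    def get(self, key):
        where = self.index.get(key)
        if where == "mem":         # found in the cache area
            return self.cache[key]
        if where == "ext":         # step 4: fall through to the external area
            return self.external[key]
        return None                # key unknown to the index
```

The "information header" is reduced here to a location tag; a fuller implementation would carry offsets and sizes for the external records.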
In a practical system, the intermediate results produced are generally key-value pairs. If the cache still has free space, the intermediate result is stored there directly, the index area's table entry is updated, and an information header is assigned; otherwise the data is written to external storage and the information header points to the external copy.
In typical applications the historical data set is large, so the historical data may be stored on external storage with an internal swap area set up; the real-time data stream is kept in memory. Ready-made algorithms and data structures exist for these techniques.
The present invention is described below in further detail with reference to the drawings and specific embodiments.
The basic processing flow of the invention is shown in Fig. 1, and its clear hierarchical structure in Fig. 2. The main additions are the localization, pipeline, and memory management modules; memory management chiefly serves the other two modules, guaranteeing the reliability of their data storage. Adding these three modules under the Hadoop framework provides support for data streams on top of the existing big data processing. The implementation is as follows:
In the data localization module, the data processed are generally <key, value> pairs, and the key value range is partitioned by the byte code of the key using a probability statistics model: samples are drawn at random from the historical data and part of the data stream, the occurrence frequency of each key in the key-value pairs is analyzed, and the interval boundaries are drawn so that the key values are roughly evenly distributed. After partitioning, node Ni corresponds to the key value interval Si; assuming the node count is N, the interval partitioning algorithm is as follows:
After interval partitioning, each node processes only the <key, value> pairs in its own interval. Thus, when node k processes a data item <key(x), value(y)> in the stream, it can decide whether to filter out the item by checking whether key(x) lies in interval[k].
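The filtering test at node k can be sketched directly; the interval is represented here as a key set of the kind the partitioning step produces, and the names are illustrative:

```python
# Sketch of the filtering test: node k keeps a data item <key, value>
# only if its key falls in interval[k].
def belongs_to_node(item, interval):
    key, _value = item
    return key in interval

def process_stream(stream, interval):
    """Return only the <key, value> items this node should process."""
    return [item for item in stream if belongs_to_node(item, interval)]
```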
In the pipeline scheduling module, parameter information is first gathered while the relevant job runs, the system's health is then judged from these parameters, and the job's running state is changed by controlling them.
The information to gather is:
Vin (split/s): the average arrival rate of the streaming data;
Vmc (split/s): the average consumption rate of the map threads;
Vmp (piece/s): the average production rate of the map threads;
Vrc (piece/s): the average consumption rate of the reduce threads;
TN: the thread count allocated to the job; TM: the map thread count; TR: the reduce thread count.
First, judge whether TN exceeds the thread count an ordinary job needs. If it does, system resources are ample, and a shard can be sent directly to the reduce stage for processing as soon as map produces it; otherwise the map and reduce thread counts and the intermediate result caches must be controlled according to circumstances. The specific algorithm is as follows:
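The TN judgment can be sketched with the monitored rates. The estimate of the threads an "ordinary job" needs is an assumption (the patent does not give a formula), as are the function name and return labels:

```python
# Sketch of the dispatch decision: if the allocated thread count T_N exceeds
# what an ordinary job needs, ship each map shard straight to reduce;
# otherwise fall back to throttling via caches and thread counts.
import math

def dispatch_mode(t_n, t_r, v_in, v_mc):
    # illustrative estimate: enough map threads to keep up with the stream
    # arrival rate V_in given per-thread consumption rate V_mc, plus the
    # reduce threads already in use
    needed = math.ceil(v_in / max(v_mc, 1e-9)) + t_r
    if t_n > needed:
        return "direct"       # ample resources: shards go straight to reduce
    return "throttled"        # control caches and thread counts instead
```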
In the memory management module, the storage and access patterns of the intermediate results are reorganized to guarantee the high reliability and transparency of intermediate result storage, as shown in Fig. 3. A hash table is first used to address the intermediate results quickly; once the information header is found, the in-memory or external data is accessed directly through it. The information headers and the hash table are kept in memory; the intermediate data is stored partly in memory and partly on external storage, and the exchange of pages (key-value records) between memory and external storage can be handled by the LRU (least recently used) algorithm.
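The LRU page exchange between memory and external storage can be sketched with `collections.OrderedDict`, whose `move_to_end`/`popitem` calls maintain recency order; the class name and disk stand-in are illustrative:

```python
# Sketch of the LRU page exchange: the least recently used key-value
# record is evicted to external storage when the in-memory area is full,
# and paged back in on access.
from collections import OrderedDict

class LRUPager:
    def __init__(self, capacity):
        self.memory = OrderedDict()   # most recently used at the end
        self.disk = {}                # stand-in for external storage
        self.capacity = capacity

    def access(self, key, value=None):
        if key in self.memory:
            self.memory.move_to_end(key)      # refresh recency
        elif key in self.disk:                # page the record back in
            self.memory[key] = self.disk.pop(key)
        elif value is not None:               # first insertion of this key
            self.memory[key] = value
        else:
            return None
        if len(self.memory) > self.capacity:  # evict the least recently used
            old_key, old_val = self.memory.popitem(last=False)
            self.disk[old_key] = old_val
        return self.memory[key]
```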

Claims (6)

1. A streaming data processing method under a big data environment, characterized in that the method comprises the following steps:
1): process the accumulated big data and historical data to generate an intermediate result set, partition this result set, and distribute it to the caches of the computing nodes;
2): each computing node periodically receives the whole data stream and obtains intermediate results through Map processing;
3): filter out, by the intermediate result partitioning method, the intermediate results belonging to this node; cache them on the local node; form a shard once the threshold of 10,000 is reached; and send this shard;
4): when an intermediate result shard arrives, input it, together with the historical data intermediate results, to Reduce according to the pipeline scheduling algorithm;
5): output the computation result; this result is a partial output of one task at one point in time, and all such results are merged into the same file to form the final output.
2. The streaming data processing method under a big data environment according to claim 1, characterized in that: the accumulated big data and historical data in step 1) are all backed up in a distributed fashion; before the system starts, or before a computation task is launched, this data must be read and preprocessed, then stored across the computing nodes in a distributed manner, in preparation for the subsequent computation.
3. The streaming data processing method under a big data environment according to claim 1, characterized in that: in step 2) each computing node periodically receives the whole data stream; every computing node caches this whole stream and generates intermediate results by processing it; meanwhile, depending on the chosen historical data backup method, this whole stream must be merged into the historical data set at regular intervals.
4. The streaming data processing method under a big data environment according to claim 1, characterized in that: in the intermediate result partitioning method of step 3), each node corresponds to a certain key value interval and processes only the data in that interval; that is, from all received streaming data for which intermediate results have been generated, it filters out the intermediate results falling in its interval; once the specified threshold is reached, an intermediate result shard is formed and sent.
5. The streaming data processing method under a big data environment according to claim 1, characterized in that: in step 4), when an intermediate result shard arrives, it is combined with the intermediate result group of step 1) as the input of the next Reduce task; meanwhile, because one computation task generally produces multiple shards in step 3), a thread must be allocated for each such Reduce task and the tasks computed asynchronously.
6. The streaming data processing method under a big data environment according to claim 1, characterized in that: step 5) outputs the computation result, i.e. the result produced by one Reduce computation; after all Reduce tasks complete, these Reduce results are merged and a single final result file is output.
CN201310287554.XA 2013-07-09 2013-07-09 Streaming data processing method under big data environment Expired - Fee Related CN103345514B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310287554.XA CN103345514B (en) 2013-07-09 2013-07-09 Streaming data processing method under big data environment


Publications (2)

Publication Number Publication Date
CN103345514A CN103345514A (en) 2013-10-09
CN103345514B true CN103345514B (en) 2016-06-08

Family

ID=49280309

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310287554.XA Expired - Fee Related CN103345514B (en) 2013-07-09 2013-07-09 Streaming data processing method under big data environment

Country Status (1)

Country Link
CN (1) CN103345514B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106850257A (en) * 2016-12-22 2017-06-13 北京锐安科技有限公司 The detection method and device of a kind of stream data

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104978232A (en) * 2014-04-09 2015-10-14 阿里巴巴集团控股有限公司 Computation resource capacity expansion method for real-time stream-oriented computation, computation resource release method for real-time stream-oriented computation, computation resource capacity expansion device for real-time stream-oriented computation and computation resource release device for real-time stream-oriented computation
CN105243063B (en) * 2014-06-18 2019-11-15 北京新媒传信科技有限公司 The method and apparatus of information recommendation
WO2016123808A1 (en) * 2015-02-06 2016-08-11 华为技术有限公司 Data processing system, calculation node and data processing method
CN104636209B (en) * 2015-02-15 2018-08-24 大连云动力科技有限公司 The resource scheduling system and method optimized based on big data and cloud storage system directional properties
CN104636199A (en) * 2015-03-13 2015-05-20 华存数据信息技术有限公司 Real-time large data processing system and method based on distributed internal memory calculation
CN104683488B (en) * 2015-03-31 2018-03-30 百度在线网络技术(北京)有限公司 Streaming computing system and its dispatching method and device
US9900386B2 (en) 2015-04-09 2018-02-20 International Business Machines Corporation Provisioning data to distributed computing systems
GB201516727D0 (en) * 2015-09-22 2015-11-04 Ibm Distributed merging of data sets
CN105205563B (en) * 2015-09-28 2017-02-08 国网山东省电力公司菏泽供电公司 Short-term load predication platform based on large data
CN106681991A (en) * 2015-11-05 2017-05-17 阿里巴巴集团控股有限公司 Method and equipment for detecting continuous time signal data
CN105930203B (en) * 2015-12-29 2019-08-13 中国银联股份有限公司 A kind of method and device of control message distribution
CN105681303B (en) * 2016-01-15 2019-02-01 中国科学院计算机网络信息中心 A kind of network safety situation monitoring of big data driving and method for visualizing
CN105976140B (en) * 2016-04-27 2019-10-11 大连海事大学 Vehicle goods real-time matching method under extensive stream data environment
CN105975521A (en) * 2016-04-28 2016-09-28 乐视控股(北京)有限公司 Stream data uploading method and device
JP7002459B2 (en) 2016-08-22 2022-01-20 オラクル・インターナショナル・コーポレイション Systems and methods for ontology induction with statistical profiling and reference schema matching
CN106383886B (en) * 2016-09-21 2019-08-30 深圳市博瑞得科技有限公司 A kind of big data based on the distributed programmed frame of big data is united system and method in advance
CN106815299A (en) * 2016-12-09 2017-06-09 中电科华云信息技术有限公司 The detection method of the Density Estimator outlier based on distributed traffic
CN106844712A (en) * 2017-02-07 2017-06-13 济南浪潮高新科技投资发展有限公司 The implementation method of the real-time analysis for crawl data is calculated using streaming
CN106850849A (en) * 2017-03-15 2017-06-13 联想(北京)有限公司 A kind of data processing method, device and server
CN107341084B (en) * 2017-05-16 2021-07-06 创新先进技术有限公司 Data processing method and device
CN108289125B (en) * 2018-01-26 2021-05-28 华南理工大学 TCP session recombination and statistical data extraction method based on stream processing
CN110533183B (en) * 2019-08-30 2021-08-20 东南大学 Task placement method for heterogeneous network perception in pipeline distributed deep learning
CN111210340B (en) * 2020-01-03 2023-08-18 中国建设银行股份有限公司 Automatic task processing method, device, server and storage medium
CN111399851B (en) * 2020-06-06 2021-01-15 四川新网银行股份有限公司 Batch processing execution method based on distributed system
CN112202692A (en) * 2020-09-30 2021-01-08 北京百度网讯科技有限公司 Data distribution method, device, equipment and storage medium
WO2023066248A1 (en) * 2021-10-22 2023-04-27 华为技术有限公司 Data processing method and apparatus, device, and system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1858735A (en) * 2005-12-30 2006-11-08 华为技术有限公司 Method for processing mass data
CN102467570A (en) * 2010-11-17 2012-05-23 日电(中国)有限公司 Connection query system and method for distributed data warehouse



Also Published As

Publication number Publication date
CN103345514A (en) 2013-10-09

Similar Documents

Publication Publication Date Title
CN103345514B (en) Streaming data processing method under big data environment
CN110166282B (en) Resource allocation method, device, computer equipment and storage medium
US9740706B2 (en) Management of intermediate data spills during the shuffle phase of a map-reduce job
US9235611B1 (en) Data growth balancing
CN111176832A (en) Performance optimization and parameter configuration method based on memory computing framework Spark
CN107329814A (en) A kind of distributed memory database query engine system based on RDMA
CN109885397A (en) The loading commissions migration algorithm of time delay optimization in a kind of edge calculations environment
CN107103068A (en) The update method and device of service buffer
CN105808358B (en) A kind of data dependence thread packet mapping method for many-core system
CN103324765A (en) Multi-core synchronization data query optimization method based on column storage
CN102054000A (en) Data querying method, device and system
CN103440246A (en) Intermediate result data sequencing method and system for MapReduce
Tang et al. An intermediate data partition algorithm for skew mitigation in spark computing environment
Fan et al. Intelligent resource scheduling based on locality principle in data center networks
CN106909624B (en) Real-time sequencing optimization method for mass data
Theeten et al. Chive: Bandwidth optimized continuous querying in distributed clouds
CN108389152B (en) Graph processing method and device for graph structure perception
Wang et al. An Improved Memory Cache Management Study Based on Spark.
CN107544848B (en) Cluster expansion method, apparatus, electronic equipment and storage medium
Chen et al. MRSIM: mitigating reducer skew In MapReduce
CN112445776A (en) Presto-based dynamic barrel dividing method, system, equipment and readable storage medium
Wang et al. Waterwheel: Realtime indexing and temporal range query processing over massive data streams
CN108932258A (en) Data directory processing method and processing device
CN103324577A (en) Large-scale itemizing file distributing system based on minimum IO access conflict and file itemizing
KR20180077830A (en) Processing method for a relational query in distributed stream processing engine based on shared-nothing architecture, recording medium and device for performing the method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160608