CN105930373A - Spark streaming based big data stream processing method and system

Info

Publication number: CN105930373A
Application number: CN201610228189.9A
Authority: CN
Inventors: 杜旭苗
Original assignee: Beijing Si Tech Information Technology Co Ltd
Current assignee: Beijing Si Tech Information Technology Co Ltd
Priority date: 2016-04-13
Filing date: 2016-04-13
Publication date: 2016-09-07

Abstract

The invention relates to a big data stream processing method and system based on Spark Streaming. The method comprises: step S1, receiving data sent by a data source at an agreed location, then executing step S2 if the data source is HDFS, or step S3 if the data source is FLUME; step S2, storing the data in file form and executing step S3; step S3, processing the received data or file with Spark Streaming; and step S4, writing the processing result of the file or data to a result directory with Spark Streaming at each time interval. In terms of fault tolerance and data guarantees, the method and system provide well-supported fault-tolerant stateful computation; in terms of the programming API, both Scala and Java are supported; and in terms of cluster-manager integration, Spark Streaming can run on its own cluster as well as on YARN and Mesos.

Description

A big data stream processing method and system based on Spark Streaming

Technical field

The present invention relates to the field of big data stream processing, and in particular to a big data stream processing method and system based on Spark Streaming.

Background art

In the prior art, Storm is commonly used to implement the data stream model. In Storm's model, data flows continuously through a network of transformation entities. The abstraction of a data stream is called a stream, which is an unbounded sequence of tuples. A tuple is a structure that, with some additional serialization code, can represent standard data types (such as integers, floats and byte arrays) or user-defined types. Each stream is defined by a unique ID, which can be used to build the topology of data sources and receivers.

However, Storm has shortcomings of its own. In terms of fault tolerance and data guarantees, Storm must track every individual record as it passes through the system, so Storm guarantees that each record is processed at least once, but it allows records to be processed more than once when recovering from a failure, which means that mutable state may be incorrectly updated twice. In terms of implementation and programming API, Storm's core is written in Clojure (although most of its extensions are written in Java), which makes its implementation harder for us to understand. In terms of cluster-manager integration, Storm can run on its own cluster and on Mesos, but running on YARN requires a third-party supporting component, Storm on YARN, and is not natively supported.
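
The double-update problem described above can be illustrated with a toy model in plain Python. This is only a sketch under stated assumptions: the function names `count_at_least_once` and `count_per_batch` and the simulated failure points are hypothetical, and neither function reproduces Storm's actual acknowledgement mechanism or Spark's actual recovery mechanism. The first function mutates state before a record is acknowledged, so a replay after a failure counts that record twice; the second recomputes a whole batch and commits it only once.

```python
from collections import Counter

def count_at_least_once(records, fail_at):
    """Per-record processing with replay: state is updated before the ack."""
    counts = Counter()
    acked = 0                     # number of records acknowledged so far
    failed = False
    while acked < len(records):
        word = records[acked]
        counts[word] += 1         # mutable state is updated first ...
        if acked == fail_at and not failed:
            failed = True         # ... then the worker fails before the ack,
            continue              # so the same record is replayed
        acked += 1
    return counts

def count_per_batch(batches, fail_at_batch):
    """Per-batch processing: a batch is recomputed and committed as a unit."""
    committed = Counter()
    done = 0                      # number of batches committed so far
    failed = False
    while done < len(batches):
        partial = Counter(batches[done])   # rebuilt from scratch on each attempt
        if done == fail_at_batch and not failed:
            failed = True                  # failure before commit: partial is dropped
            continue
        committed.update(partial)          # each batch is committed exactly once
        done += 1
    return committed

if __name__ == "__main__":
    print(count_at_least_once(["a", "b", "a"], fail_at=1))
    print(count_per_batch([["a", "b"], ["a"]], fail_at_batch=0))
```

In this model the replayed record "b" is counted twice by the per-record version, while the per-batch version yields the true counts even though its first batch was attempted twice.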

Summary of the invention

The technical problem to be solved by the present invention is to provide, in view of the deficiencies of the prior art, a big data stream processing method and system based on Spark Streaming.

The technical solution of the present invention to the above technical problem is as follows: a big data stream processing method based on Spark Streaming, comprising the following steps:

Step S1: receive, at an agreed location, the data sent by a data source; if the data source is HDFS, perform step S2; if the data source is FLUME, perform step S3;

Step S2: store the data in file form, then perform step S3;

Step S3: Spark Streaming processes the received data or file;

Step S4: Spark Streaming writes the processing result of the file or data to a result directory at each time interval.
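
The control flow of steps S1 to S4 can be sketched in plain Python for a single time interval. This is a minimal illustration, not the patented implementation: the local file system stands in for HDFS, the word count in `process` is a placeholder (the patent does not fix the computation), and all function names, file names and directory names are assumptions made for the example.

```python
import os
import tempfile
from collections import Counter

def store_as_file(records, staging_dir):
    """Step S2: persist received records as a file (local disk stands in for HDFS)."""
    os.makedirs(staging_dir, exist_ok=True)
    path = os.path.join(staging_dir, "part-0.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(records))
    return path

def process(records):
    """Step S3: placeholder computation, here a word count."""
    return Counter(word for line in records for word in line.split())

def write_result(result, result_dir, batch_id):
    """Step S4: write one result file per time interval into the result directory."""
    os.makedirs(result_dir, exist_ok=True)
    path = os.path.join(result_dir, f"batch-{batch_id}.txt")
    with open(path, "w", encoding="utf-8") as f:
        for word, n in sorted(result.items()):
            f.write(f"{word}\t{n}\n")
    return path

def run_once(source, records, work_dir, batch_id=0):
    """Steps S1-S4 for one time interval; returns the path of the result file."""
    if source == "HDFS":                      # S1 routes HDFS data through S2
        path = store_as_file(records, os.path.join(work_dir, "staging"))
        with open(path, encoding="utf-8") as f:
            records = f.read().splitlines()
    elif source != "FLUME":                   # FLUME data goes straight to S3
        raise ValueError(f"unsupported data source: {source}")
    result = process(records)                 # S3
    return write_result(result, os.path.join(work_dir, "results"), batch_id)  # S4

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as work_dir:
        print(run_once("HDFS", ["a b", "a"], work_dir))
```

The only difference between the two sources in this sketch is whether the data passes through a file before processing, which mirrors the S1 branch described above.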

The beneficial effects of the present invention are as follows: Spark Streaming batch-processes the received files or data at each time interval and writes the results to the result directory at each time interval. Compared with the prior art, in which Storm processes data or files one at a time and records each result individually, the present invention increases processing speed and improves processing efficiency; and because results are recorded per time interval, fault tolerance is better.

On the basis of the above technical solution, the present invention can be further improved as follows.

Further, in step S1, if the data source is HDFS, the agreed location is a fixed HDFS directory; if the data source is FLUME, the agreed location is an agreed port on an agreed host.

The beneficial effect of this further solution is that assigning the receiving location appropriately for each kind of data source avoids missing data and ensures that the data remains original and accurate.

Further, if the data source is HDFS, the method also includes, before step S3 is performed:

Spark Streaming monitors, at each time interval, whether a new file has appeared under the fixed HDFS directory; if so, step S3 is performed to process the new file; otherwise monitoring continues.

The beneficial effect of this further solution is that, when the data source is HDFS, monitoring the fixed HDFS directory at each time interval for new files allows the new files to be collected for subsequent processing.
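
The interval-based directory monitoring described above can be sketched as a simple polling loop. This is a plain-Python stand-in that watches a local directory instead of HDFS; the names `poll_new_files` and `monitor` and the `handle` callback are hypothetical. In Spark Streaming itself, the comparable facility is `StreamingContext.textFileStream`, which is not used here so that the sketch runs without a Spark installation.

```python
import os
import tempfile
import time

def poll_new_files(directory, seen):
    """Return the names that appeared in `directory` since the previous poll."""
    fresh = sorted(set(os.listdir(directory)) - seen)
    seen.update(fresh)            # remember them so they are reported only once
    return fresh

def monitor(directory, interval_s, rounds, handle):
    """Check the directory once per interval and pass new files to `handle`."""
    seen = set()                  # note: files present at start-up count as new here
    for _ in range(rounds):
        for name in poll_new_files(directory, seen):
            handle(os.path.join(directory, name))   # corresponds to step S3
        time.sleep(interval_s)    # nothing new: keep monitoring

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as watched:
        with open(os.path.join(watched, "a.log"), "w") as f:
            f.write("hello")
        monitor(watched, interval_s=0.1, rounds=2, handle=print)
```

The `rounds` parameter exists only so that the example terminates; a long-running monitor would loop until stopped.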

Further, if the data source is FLUME and FLUME operates in push mode, the method also includes, before step S1 is performed:

Spark Streaming is started and, based on Spark Streaming, the agreed port of the agreed host is monitored at each time interval for new data; if there is new data, step S1 is performed to receive it; otherwise monitoring continues.

The beneficial effect of this further solution is that, when the data source is FLUME operating in push mode, monitoring the agreed port at each time interval for new data allows the new data to be collected for subsequent processing.

Further, if the data source is FLUME and FLUME operates in pull mode, the method also includes, before step S3 is performed:

Spark Streaming is started and, based on Spark Streaming, the agreed port of the agreed host is monitored at each time interval for new data; if there is new data, step S3 is performed to process it; otherwise monitoring continues.

The beneficial effect of this further solution is that, when the data source is FLUME operating in pull mode, monitoring the agreed port at each time interval for new data allows the new data to be collected for subsequent processing.
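
The difference between push mode and pull mode is mainly a matter of which side initiates the transfer. The following toy model, using in-memory queues in place of a network port, is a sketch under that assumption only; the class names `PushReceiver` and `PullSink` are hypothetical and are not part of FLUME or Spark. In the Flume integration shipped with Spark 1.x and 2.x, the two modes correspond roughly to `FlumeUtils.createStream` (push) and `FlumeUtils.createPollingStream` (pull).

```python
from collections import deque

class PushReceiver:
    """Push mode: the sender delivers each event to the receiver as it occurs."""

    def __init__(self):
        self._buffer = deque()

    def on_event(self, event):
        """Called by the sender; the receiver only accepts what arrives."""
        self._buffer.append(event)

    def drain(self):
        """Called once per time interval to hand the buffered events onward."""
        batch = list(self._buffer)
        self._buffer.clear()
        return batch

class PullSink:
    """Pull mode: the sender parks events in a sink and the consumer fetches them."""

    def __init__(self):
        self._pending = deque()

    def put(self, event):
        """Called by the sender; events wait here until they are fetched."""
        self._pending.append(event)

    def take(self, max_events):
        """Called by the consumer once per interval; it decides how much to fetch."""
        batch = []
        while self._pending and len(batch) < max_events:
            batch.append(self._pending.popleft())
        return batch

if __name__ == "__main__":
    receiver = PushReceiver()
    receiver.on_event("e1")
    print(receiver.drain())

    sink = PullSink()
    for e in ("e1", "e2", "e3"):
        sink.put(e)
    print(sink.take(2), sink.take(2))
```

In the pull variant the consumer controls the batch size, and events that are not yet fetched stay in the sink until the next interval.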

Another technical solution of the present invention to the above technical problem is as follows: a big data stream processing system based on Spark Streaming, comprising a data receiving module, a file storage module, a processing module and a writing module:

The data receiving module is configured to receive, at an agreed location, the data sent by a data source; if the data source is HDFS, it calls the file storage module; if the data source is FLUME, it calls the processing module;

The file storage module is configured to store the data in file form and to call the processing module;

The processing module is configured to process the received data or file based on Spark Streaming and to call the writing module;

The writing module is configured to write, based on Spark Streaming, the processing result of the file or data to a result directory at each time interval.

Further, if the data source is HDFS, the agreed location is a fixed HDFS directory; if the data source is FLUME, the agreed location is an agreed port on an agreed host.

Further, if the data source is HDFS, the system also includes:

A first monitoring module, connected to the file storage module and the processing module respectively, and configured to monitor, based on Spark Streaming and at each time interval, whether a new file has appeared under the fixed HDFS directory; if so, it calls the processing module; otherwise it continues monitoring.

Further, if the data source is FLUME and FLUME operates in push mode, the system also includes:

A second monitoring module, connected to the data receiving module, and configured to start Spark Streaming and to monitor, based on Spark Streaming and at each time interval, whether the agreed port of the agreed host has new data; if so, it calls the data receiving module; otherwise it continues monitoring.

Further, if the data source is FLUME and FLUME operates in pull mode, the system also includes:

A third monitoring module, connected to the data receiving module and the processing module respectively, and configured to start Spark Streaming and to monitor, based on Spark Streaming and at each time interval, whether the agreed port of the agreed host has new data; if so, it calls the processing module; otherwise it continues monitoring.

Brief description of the drawings

Fig. 1 is a flow chart of the big data stream processing method based on Spark Streaming according to the present invention;

Fig. 2 is a flow chart of stream processing by Spark Streaming when the data source is HDFS;

Fig. 3 is a flow chart of stream processing by Spark Streaming when the data source is FLUME in push mode;

Fig. 4 is a flow chart of stream processing by Spark Streaming when the data source is FLUME in pull mode;

Fig. 5 is a structural diagram of the big data stream processing system based on Spark Streaming according to the present invention;

Fig. 6 is a system structure diagram of stream processing by Spark Streaming when the data source is HDFS;

Fig. 7 is a system structure diagram of stream processing by Spark Streaming when the data source is FLUME in push mode;

Fig. 8 is a system structure diagram of stream processing by Spark Streaming when the data source is FLUME in pull mode.

Detailed description of the invention

The principles and features of the present invention are described below with reference to the accompanying drawings. The examples given serve only to explain the invention and are not intended to limit its scope.

Spark Streaming is an extension of the Spark Core API that enables high-throughput, fault-tolerant processing of live data streams. Spark Streaming can ingest data from many sources, including Kafka, Flume, Twitter, ZeroMQ and plain TCP sockets.

Unlike Storm, Spark Streaming does not process a data stream one record at a time; instead, it first divides the stream by time interval into a series of batch jobs. Spark's abstraction for a continuous data stream is called a DStream (Discretized Stream). A DStream is a sequence of micro-batches, each of which is an RDD (Resilient Distributed Dataset). An RDD is a distributed dataset that can be operated on in parallel in two ways: by transformations with arbitrary functions and by sliding-window computations.
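
The idea of cutting a stream into micro-batches by time interval, and then computing over a sliding window of those batches, can be shown without Spark. The sketch below uses plain lists in place of RDDs; the function names `discretize` and `windowed`, the assumption that timestamps are non-negative seconds counted from zero, and the sample events are all illustrative choices, not Spark API.

```python
from collections import defaultdict

def discretize(events, batch_interval):
    """Group (timestamp, value) events into consecutive batches of fixed duration."""
    by_index = defaultdict(list)
    for ts, value in events:
        by_index[int(ts // batch_interval)].append(value)
    if not by_index:
        return []
    # Intervals with no events still yield an (empty) batch.
    return [by_index.get(i, []) for i in range(max(by_index) + 1)]

def windowed(batches, window_len, slide):
    """Every `slide` batches, merge the most recent `window_len` batches."""
    windows = []
    for end in range(slide, len(batches) + 1, slide):
        start = max(0, end - window_len)
        windows.append([v for batch in batches[start:end] for v in batch])
    return windows

if __name__ == "__main__":
    events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (3.5, "d")]
    batches = discretize(events, batch_interval=1)
    print(batches)
    print(windowed(batches, window_len=2, slide=1))
```

Each inner list plays the role of one micro-batch; a per-batch function and a windowed function correspond to the two kinds of parallel operation mentioned above.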

Fig. 1 is a flow chart of the big data stream processing method based on Spark Streaming according to the present invention.

As shown in Fig. 1, a big data stream processing method based on Spark Streaming comprises the following steps:

Step S1: receive, at an agreed location, the data sent by a data source. If the data source is HDFS, the agreed location is a fixed HDFS directory; if the data source is FLUME, the agreed location is an agreed port on an agreed host. If the data source is HDFS, perform step S2; if the data source is FLUME, perform step S3;

Step S2: store the data in file form, then perform step S3;

Step S3: Spark Streaming processes the received data or file;

Step S4: Spark Streaming writes the processing result of the file or data to a result directory at each time interval.

Fig. 2 is a flow chart of stream processing by Spark Streaming when the data source is HDFS. As shown in Fig. 2, if the data source is HDFS, the method also includes, before step S3 is performed: Spark Streaming monitors, at each time interval, whether a new file has appeared under the fixed HDFS directory; if so, step S3 is performed to process the new file; otherwise monitoring continues.

Fig. 3 is a flow chart of stream processing by Spark Streaming when the data source is FLUME in push mode. As shown in Fig. 3, if the data source is FLUME and FLUME operates in push mode, the method also includes, before step S1 is performed: Spark Streaming is started and, based on Spark Streaming, the agreed port of the agreed host is monitored at each time interval for new data; if there is new data, step S1 is performed to receive it; otherwise monitoring continues.

Fig. 4 is a flow chart of stream processing by Spark Streaming when the data source is FLUME in pull mode. As shown in Fig. 4, if the data source is FLUME and FLUME operates in pull mode, the method also includes, before step S3 is performed: Spark Streaming is started and, based on Spark Streaming, the agreed port of the agreed host is monitored at each time interval for new data; if there is new data, step S3 is performed to process it; otherwise monitoring continues.

Fig. 5 is a structural diagram of the big data stream processing system based on Spark Streaming according to the present invention. Corresponding to the method described above, the system shown in Fig. 5 comprises a data receiving module, a file storage module, a processing module and a writing module. The data receiving module is configured to receive, at an agreed location, the data sent by a data source: if the data source is HDFS, the agreed location is a fixed HDFS directory; if the data source is FLUME, the agreed location is an agreed port on an agreed host. If the data source is HDFS, the data receiving module calls the file storage module; if the data source is FLUME, it calls the processing module. The file storage module is configured to store the data in file form and to call the processing module. The processing module is configured to process the received data or file based on Spark Streaming and to call the writing module. The writing module is configured to write, based on Spark Streaming, the processing result of the file or data to the result directory at each time interval.

Fig. 6 is a system structure diagram of stream processing by Spark Streaming when the data source is HDFS. As shown in Fig. 6, if the data source is HDFS, the system also includes a first monitoring module, connected to the file storage module and the processing module respectively, and configured to monitor, based on Spark Streaming and at each time interval, whether a new file has appeared under the fixed HDFS directory; if so, it calls the processing module; otherwise it continues monitoring.

Fig. 7 is a system structure diagram of stream processing by Spark Streaming when the data source is FLUME in push mode. As shown in Fig. 7, if the data source is FLUME and FLUME operates in push mode, the system also includes a second monitoring module, connected to the data receiving module, and configured to start Spark Streaming and to monitor, based on Spark Streaming and at each time interval, whether the agreed port of the agreed host has new data; if so, it calls the data receiving module; otherwise it continues monitoring.

Fig. 8 is a system structure diagram of stream processing by Spark Streaming when the data source is FLUME in pull mode. As shown in Fig. 8, if the data source is FLUME and FLUME operates in pull mode, the system also includes a third monitoring module, connected to the data receiving module and the processing module respectively, and configured to start Spark Streaming and to monitor, based on Spark Streaming and at each time interval, whether the agreed port of the agreed host has new data; if so, it calls the processing module; otherwise it continues monitoring.

Compared with Storm in the prior art, the present invention has the following advantages. In terms of fault tolerance and data guarantees, Spark Streaming provides better support for fault-tolerant stateful computation. In terms of implementation and programming API, Spark Streaming is programmed in Scala and also supports Java. A further useful property of Spark Streaming is that it runs on Spark, so the same code can be used for batch processing, without writing separate code for real-time streaming data and for historical data. In terms of cluster-manager integration, Spark Streaming can run on its own cluster and also on both YARN and Mesos, with YARN supported natively.
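
The point about sharing code between batch and streaming workloads can be illustrated with a small plain-Python sketch. The function `word_count` and the sample data are hypothetical, and lists stand in for Spark's datasets; the sketch only shows that one function, written once, can be applied both to a historical data set as a whole and to each micro-batch of a stream.

```python
from collections import Counter

def word_count(lines):
    """Logic written once; a word count stands in for any processing job."""
    return Counter(word for line in lines for word in line.split())

# Historical data: processed in a single batch run.
historical = ["error disk", "ok", "error net"]
batch_total = word_count(historical)

# Live data: the same function is applied to every micro-batch as it arrives.
micro_batches = [["error disk"], ["ok", "error net"]]
stream_total = Counter()
for batch in micro_batches:
    stream_total.update(word_count(batch))

if __name__ == "__main__":
    print(batch_total == stream_total)
```

Because both paths call the same function, the accumulated streaming result matches the batch result for the same input lines.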

In this specification, a description referring to the terms "embodiment one", "embodiment two", "example", "specific example" or "some examples" means that the specific method, device or feature described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. Schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, methods, devices or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, provided they do not contradict one another, those skilled in the art may combine the different embodiments or examples described in this specification and the features of those embodiments or examples.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A big data stream processing method based on Spark Streaming, characterized in that it comprises the following steps:

Step S1: receive, at an agreed location, the data sent by a data source; if the data source is HDFS, perform step S2; if the data source is FLUME, perform step S3;

Step S2: store the data in file form, then perform step S3;

Step S3: Spark Streaming processes the received data or file;

Step S4: Spark Streaming writes the processing result of the file or data to a result directory at each time interval.

2. The big data stream processing method based on Spark Streaming according to claim 1, characterized in that, in step S1, if the data source is HDFS, the agreed location is a fixed HDFS directory; if the data source is FLUME, the agreed location is an agreed port on an agreed host.

3. The big data stream processing method based on Spark Streaming according to claim 2, characterized in that, if the data source is HDFS, the method also includes, before step S3 is performed:

Spark Streaming monitors, at each time interval, whether a new file has appeared under the fixed HDFS directory; if so, step S3 is performed to process the new file; otherwise monitoring continues.

4. The big data stream processing method based on Spark Streaming according to claim 2, characterized in that, if the data source is FLUME and FLUME operates in push mode, the method also includes, before step S1 is performed:

Spark Streaming is started and, based on Spark Streaming, the agreed port of the agreed host is monitored at each time interval for new data; if there is new data, step S1 is performed to receive it; otherwise monitoring continues.

5. The big data stream processing method based on Spark Streaming according to claim 2, characterized in that, if the data source is FLUME and FLUME operates in pull mode, the method also includes, before step S3 is performed:

Spark Streaming is started and, based on Spark Streaming, the agreed port of the agreed host is monitored at each time interval for new data; if there is new data, step S3 is performed to process it; otherwise monitoring continues.

6. A big data stream processing system based on Spark Streaming, characterized in that it comprises a data receiving module, a file storage module, a processing module and a writing module:

The data receiving module is configured to receive, at an agreed location, the data sent by a data source; if the data source is HDFS, it calls the file storage module; if the data source is FLUME, it calls the processing module;

The file storage module is configured to store the data in file form and to call the processing module;

The processing module is configured to process the received data or file based on Spark Streaming and to call the writing module;

The writing module is configured to write, based on Spark Streaming, the processing result of the file or data to a result directory at each time interval.

7. The big data stream processing system based on Spark Streaming according to claim 6, characterized in that, if the data source is HDFS, the agreed location is a fixed HDFS directory; if the data source is FLUME, the agreed location is an agreed port on an agreed host.

8. The big data stream processing system based on Spark Streaming according to claim 7, characterized in that, if the data source is HDFS, the system also includes:

A first monitoring module, connected to the file storage module and the processing module respectively, and configured to monitor, based on Spark Streaming and at each time interval, whether a new file has appeared under the fixed HDFS directory; if so, it calls the processing module; otherwise it continues monitoring.

9. The big data stream processing system based on Spark Streaming according to claim 7, characterized in that, if the data source is FLUME and FLUME operates in push mode, the system also includes:

A second monitoring module, connected to the data receiving module, and configured to start Spark Streaming and to monitor, based on Spark Streaming and at each time interval, whether the agreed port of the agreed host has new data; if so, it calls the data receiving module; otherwise it continues monitoring.

10. The big data stream processing system based on Spark Streaming according to claim 7, characterized in that, if the data source is FLUME and FLUME operates in pull mode, the system also includes:

A third monitoring module, connected to the data receiving module and the processing module respectively, and configured to start Spark Streaming and to monitor, based on Spark Streaming and at each time interval, whether the agreed port of the agreed host has new data; if so, it calls the processing module; otherwise it continues monitoring.