CN105677752A

CN105677752A - Streaming computing and batch computing combined processing system and method

Info

Publication number: CN105677752A
Application number: CN201511019708.2A
Authority: CN
Inventors: 范小朋; 卞嫣然; 杨望仙; 须成忠
Original assignee: Shenzhen Institute of Advanced Technology of CAS
Current assignee: Shenzhen Institute of Advanced Technology of CAS
Priority date: 2015-12-30
Filing date: 2015-12-30
Publication date: 2016-06-15

Abstract

The invention provides a processing system and method that combine stream computing and batch computing. The system comprises: an infrastructure layer, which provides the hardware environment in which the system runs and includes virtualization, machine room, network and cluster; a data storage management layer, which stores data in a distributed manner, preserves the processed real-time data and batch data, and manages metadata; a data computing layer, which provides stream computing and batch computing; a task scheduling layer, which schedules system tasks, determines the type of each system job, and, based on the determination result, controls the data computing layer to perform real-time processing or batch processing on the metadata; a data analysis layer, which analyzes the processed data; a data view and query optimization layer, which provides a real-time data view, a batch data view and query optimization; and a data presentation layer, which provides visual information on the data analysis results. The invention both effectively increases data processing efficiency and optimizes the data query method.

Description

A processing system and method combining stream computing and batch computing

Technical field

The invention belongs to the technical field of data processing with stream computing and batch computing, and in particular relates to a processing system and method combining stream computing and batch computing.

Background

With the rapid development of science, technology and engineering, many fields have generated massive amounts of data over the past 20 years, and the growth of big data has attracted wide attention. Big data processing falls into two main modes: batch processing and real-time processing. In batch processing, data are first stored and then analyzed. Real-time processing is a dynamic mode in which data are computed as they flow in; stream computing is an important derivative model of real-time processing.

Some existing work focuses on stream computing alone or batch computing alone and uses a single computation model for data processing. However, as data volumes grow massively and user needs become increasingly diverse, practical requirements on data processing keep rising, and a single computation model can no longer provide the service on its own. Other work focuses on combining stream processing with batch processing but fails to fuse them effectively. Existing big data analysis systems fuse stream computing and batch computing mainly in three ways:

The first way adds support for batch computing on top of a stream computing system. This approach only needs to consider the data and the query functions over the data at the batch layer, so the batch layer is easy to control. The real-time layer needs incremental algorithms and a complex NoSQL database; isolating all the complex problems in the real-time layer can significantly improve the robustness and reliability of the system. In practice, however, building a simple and unified data query function is not easy. Earlier relational database systems were data processing systems built on a complete relational model, so it is difficult to find such a simple function model that handles different kinds of structured and unstructured data.

The second way starts from batch computing and incorporates stream data processing, for example by modifying the MapReduce programming model to perform real-time stream processing. Such MapReduce-based stream processing has several shortcomings: a) the input data are split at intervals into fixed-size fragments, which are then processed by the MapReduce platform; the processing delay is proportional to the fragment length and to the overhead of initializing the processing tasks, dependency management between fragments is complex, and the optimal fragment size depends on the specific application; b) to support stream processing, MapReduce is transformed into a pipeline mode instead of having Reduce output directly, and, to improve processing efficiency, intermediate results are kept only in memory; these changes greatly increase the complexity of the original MapReduce framework and hinder maintenance and extension of the system; c) users are forced to define streaming jobs through the MapReduce interface, which reduces the scalability of user programs.

The third way is a combined mode. Taking Twitter Summingbird as an example, although Summingbird integrates platforms through a unified programming interface and offers good generality and strong extensibility, its execution efficiency is not ideal in actual operation.

Summary of the invention

In view of the above problems in the prior art, the present invention provides a processing system combining stream computing and batch computing that offers high data processing efficiency and an optimized data query method.

An embodiment of the invention provides a processing system combining stream computing and batch computing, comprising:

Infrastructure layer, for providing the hardware environment in which the system runs, comprising virtualization, machine room, network and cluster;

Data storage management layer, for storing data in a distributed manner, saving the processed real-time data and batch data, and managing metadata;

Data computing layer, for providing stream computing and batch computing;

Task scheduling layer, for scheduling system tasks, determining the type of each system job, and, according to the determination result, controlling the data computing layer to perform real-time processing or batch processing on the metadata;

Data analysis layer, for analyzing the processed data;

Data view and query optimization layer, for providing a real-time data processing view, a batch data view and query optimization;

Data presentation layer, for providing visual information on the data analysis results.

Preferably, the data computing layer comprises:

Stream computing framework, for processing streaming data when the task scheduling layer determines that the current system job is a real-time processing job;

Batch computing framework, for processing the data when the task scheduling layer determines that the current system job is a batch processing job.

Preferably, the data view and query optimization layer comprises:

Real-time view processing module, which uses an incremental algorithm to save the results computed by the stream computing framework into the data storage management layer and form different real-time processing views;

Batch view module, which saves the results computed by the batch computing framework into the data storage management layer and forms different batch views;

Query optimization module, which lets users use keywords to query the stream-processed and batch-processed data before and after a period of time or a certain distance.

Preferably, the query optimization module comprises:

Syntax analysis unit, for receiving the character string entered by the user, performing syntax analysis, and outputting an abstract syntax tree;

Semantic analysis unit, for performing semantic analysis on the abstract syntax tree and generating a LogicalPlan and a DAG that the system can recognize;

Optimization strategy execution unit, for optimizing the LogicalPlan to filter out unneeded data;

Delivery and execution unit, for converting the LogicalPlan into RDDs and actually performing the data analysis on the Spark cluster.

An embodiment of the invention also provides a processing method combining stream computing and batch computing, comprising the following steps:

Building the hardware environment in which the system runs, comprising virtualization, machine room, network and cluster;

Obtaining job instructions, scheduling system tasks, determining the type of each system job, and, according to the determination result, controlling real-time processing or batch processing of the metadata;

Storing and managing the processed real-time data, the batch data and the metadata;

Forming and displaying different real-time data processing views and batch data views, and providing query optimization for data processing.

Preferably, the step of obtaining job instructions, scheduling system tasks, determining the type of each system job, and controlling real-time processing or batch processing of the metadata according to the determination result specifically comprises:

Obtaining job instructions, scheduling the system jobs and, at the same time, determining the type of each job;

If the job is a real-time processing job, using stream processing technology for real-time data processing: the stream computation is decomposed into a series of short batch jobs, that is, the input data are divided into segments according to the batch size, each segment is converted into an RDD in Spark, Spark Streaming turns the Transformation operations on the DStream into Transformation operations on the RDDs in Spark, and the intermediate results produced by operating on the RDDs are kept in memory;

If the job is a batch processing job, using a batch processing system: the distributed data set is abstracted into a resilient distributed dataset (RDD); application task scheduling, RPC, serialization and compression are implemented, and APIs are provided for the upper-layer components running on top; the data are partitioned in the distributed environment, the job is then converted into a directed acyclic graph (DAG), and the DAG is scheduled stage by stage with the tasks processed in parallel across the cluster.

Preferably, the step of forming and displaying different real-time data processing views and batch data views specifically comprises:

Using an incremental algorithm to store the results of the real-time computation and form different real-time processing views; and storing the results of the batch computation and forming different batch views.

Preferably, query optimization for data processing is provided in that users can use keywords to query the stream-processed and batch-processed data before and after a period of time or a certain distance.

Preferably, the method further comprises the following steps:

Syntax analysis: receiving the character string entered by the user, performing syntax analysis, and outputting an abstract syntax tree;

Semantic analysis: performing semantic analysis on the abstract syntax tree and generating a LogicalPlan and a DAG that the system can recognize;

Optimization strategy execution: optimizing the LogicalPlan to filter out unneeded data;

Delivery and execution: converting the LogicalPlan into RDDs and actually performing the data analysis on the Spark cluster.

Preferably, the syntax analysis step specifically comprises:

Converting the input character set into words according to the defined lexical rules;

On the basis of the lexical analysis, judging whether the words entered by the user conform to the grammar;

Outputting an abstract syntax tree according to the analysis process.

In the above technical solution, system jobs are classified while system tasks are being scheduled, and real-time processing or batch processing of the metadata is controlled according to the determination result. This achieves interoperability between different computation models at the architectural level, effectively improves fault-tolerant processing efficiency, and optimizes the data query method, making joint analysis of historical data and real-time data more convenient.

Brief description of the drawings

Fig. 1 is a system architecture diagram of the processing system combining stream computing and batch computing according to an embodiment of the invention.

Fig. 2 is a schematic structural diagram of the data computing layer according to an embodiment of the invention.

Fig. 3 is a schematic structural diagram of the data view and query optimization layer according to an embodiment of the invention.

Fig. 4 is a schematic structural diagram of the query optimization module according to an embodiment of the invention.

Fig. 5 is a flowchart of query syntax analysis according to an embodiment of the invention.

Detailed description of the embodiments

To make the technical problem solved, the technical solution and the beneficial effects of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and are not intended to limit it.

As shown in Fig. 1, an embodiment of the invention provides a processing system combining stream computing and batch computing, comprising:

Infrastructure layer 10, for providing the hardware environment in which the system runs, that is, the system infrastructure, comprising virtualization, machine room, network, cluster and the like;

Data storage management layer 20, for storing data in a distributed manner, saving the processed real-time data and batch data, and managing metadata. The distributed data stores include but are not limited to HDFS, a distributed MySQL data warehouse, and Cassandra.

Data computing layer 30, for providing stream computing and batch computing, that is, providing stream computing and batch computing frameworks and models.

Task scheduling layer 40, for scheduling system tasks, determining the type of each system job, and, according to the determination result, controlling the data computing layer to perform real-time processing or batch processing on the metadata. The job scheduling modes include but are not limited to FIFO and FAIR.

Data analysis layer 50, for analyzing the processed data, including but not limited to data mining, machine learning and deep learning.

Data view and query optimization layer 60, for providing a real-time data processing view, a batch data view and query optimization;

Data presentation layer 70, for providing visual information on the data analysis results.

Further, as shown in Fig. 2, the data computing layer 30 comprises:

Stream computing framework 301, for processing streaming data when the task scheduling layer determines that the current system job is a real-time processing job;

Specifically, stream processing technology (for example Spark Streaming) is used for real-time data processing. The stream computation is decomposed into a series of short batch jobs: the input data are divided into segments according to the batch size, and each segment is converted into an RDD in Spark. Spark Streaming then turns the Transformation operations on the DStream into Transformation operations on the RDDs in Spark, and the intermediate results produced by operating on the RDDs are kept in memory.
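
The decomposition of a stream into short batch jobs can be illustrated with a minimal sketch in plain Python rather than Spark code. The names `micro_batches` and `run_stream`, and the use of a record count as the batch size, are assumptions made for illustration only; Spark Streaming itself slices the stream by a time interval.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def micro_batches(stream: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Cut a (possibly unbounded) stream into consecutive small batches."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def run_stream(stream: Iterable[int], batch_size: int,
               transform: Callable[[int], int]) -> List[List[int]]:
    """Apply the same transformation to every micro-batch.

    The per-batch results are kept in an in-memory list, standing in for
    the intermediate results that the text says are kept in memory.
    """
    results = []
    for batch in micro_batches(stream, batch_size):
        results.append([transform(x) for x in batch])
    return results


if __name__ == "__main__":
    # Hypothetical input: seven records, batch size 3, doubling transformation.
    print(run_stream(range(7), 3, lambda x: x * 2))
```

Each inner list plays the role of one segment that would become an RDD; the last batch may be shorter than the others.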

Batch computing framework 302, for processing the data when the task scheduling layer determines that the current system job is a batch processing job;

Specifically, a batch processing system (for example Spark) is used for batch processing. The distributed data set is abstracted into a resilient distributed dataset (RDD); application task scheduling, RPC, serialization and compression are implemented, and APIs are provided for the upper-layer components running on top. Spark partitions the data in the distributed environment, converts the job into a directed acyclic graph (DAG), and schedules the DAG stage by stage with the tasks processed in parallel across the cluster.
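
As a rough illustration of stage-by-stage DAG scheduling, the sketch below groups the tasks of a small, made-up job into stages so that every task runs only after the tasks it depends on. The `stages` function and the `job` dictionary are hypothetical, and the level-by-level grouping is a simplification: Spark's own scheduler cuts stages at shuffle boundaries. The sketch assumes Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def stages(dependencies: Dict[str, Set[str]]) -> List[List[str]]:
    """Group tasks into stages; tasks within one stage are independent
    of each other and could be processed in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    result = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose inputs are complete
        result.append(ready)
        sorter.done(*ready)
    return result


# Hypothetical job: each task maps to the set of tasks it depends on.
job = {
    "read": set(),
    "map": {"read"},
    "filter": {"read"},
    "join": {"map", "filter"},
    "save": {"join"},
}

if __name__ == "__main__":
    for number, stage in enumerate(stages(job)):
        print(number, stage)
```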

When system jobs are scheduled as described above, each job is classified: if it is a real-time processing job, the stream computing framework 301 is called to compute it; if it is a batch processing job, the batch computing framework 302 is called.
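
The dispatch rule can be sketched as follows, using FIFO ordering as one of the scheduling modes the text mentions. Everything here is hypothetical: `Job`, `JobType`, `schedule_fifo` and the two stand-in framework functions are illustrative names, and the toy computations only mark which path a job took.

```python
from collections import deque
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class JobType(Enum):
    REALTIME = "realtime"
    BATCH = "batch"


@dataclass
class Job:
    name: str
    job_type: JobType
    payload: List[int]


def stream_framework(payload: List[int]) -> list:
    """Stand-in for stream computing framework 301: handles records one by one."""
    return [("stream", x) for x in payload]


def batch_framework(payload: List[int]) -> list:
    """Stand-in for batch computing framework 302: handles the whole data set at once."""
    return [("batch", sum(payload))]


FRAMEWORKS = {JobType.REALTIME: stream_framework, JobType.BATCH: batch_framework}


def schedule_fifo(jobs: List[Job]) -> Dict[str, list]:
    """Run jobs in arrival order, routing each one by its job type."""
    queue = deque(jobs)
    results = {}
    while queue:
        job = queue.popleft()
        results[job.name] = FRAMEWORKS[job.job_type](job.payload)
    return results


if __name__ == "__main__":
    print(schedule_fifo([
        Job("clicks", JobType.REALTIME, [1, 2]),
        Job("nightly", JobType.BATCH, [1, 2, 3]),
    ]))
```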

Further, as shown in Fig. 3, the data view and query optimization layer 60 comprises:

Real-time view module 601, which provides real-time processing views. Specifically, an incremental algorithm is used to save the results produced by the stream computing framework 301 in the data storage management layer 20, forming different real-time processing views (speed view). When the data set recomputed in batch processing includes data already handled by real-time processing, the corresponding data are deleted from the current real-time view.
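
A minimal sketch of this behaviour, assuming a simple per-key count as the view and a timestamp to decide which records the batch view already covers. `SpeedView`, `expire` and `merged_count` are hypothetical names, and the in-memory structures stand in for the data storage management layer.

```python
from collections import Counter
from typing import Dict, List, Tuple


class SpeedView:
    """Real-time view maintained incrementally, one record at a time."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()          # key -> running count
        self.events: List[Tuple[int, str]] = []   # (timestamp, key)

    def update(self, timestamp: int, key: str) -> None:
        """Incremental update: only the new record is touched."""
        self.events.append((timestamp, key))
        self.counts[key] += 1

    def expire(self, batch_watermark: int) -> None:
        """Delete records up to the watermark, which the recomputed
        batch view now covers."""
        kept = []
        for ts, key in self.events:
            if ts <= batch_watermark:
                self.counts[key] -= 1
                if self.counts[key] == 0:
                    del self.counts[key]
            else:
                kept.append((ts, key))
        self.events = kept


def merged_count(batch_view: Dict[str, int], speed_view: SpeedView, key: str) -> int:
    """Answer a query by combining the batch view and the real-time view."""
    return batch_view.get(key, 0) + speed_view.counts.get(key, 0)
```

Deleting the covered records keeps a query from counting the same data twice, once in each view.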

Batch view module 602, which provides batch views. Specifically, the results produced by the batch computing framework 302 are saved in the data storage management layer 20, forming different batch views (batch view).

Query optimization module 603, which provides query operations with richer semantics. Specifically, a data query optimization strategy is proposed on the basis of Spark SQL to address the slow query speed and limited semantics of the native Spark SQL API. Compared with traditional SQL, new keywords are added for specific application scenarios, such as the "every" keyword, which lets users query the stream-processed and batch-processed data after a period of time or a certain distance. In addition, the invention applies scenario-specific optimization while parsing the SQL, which reduces the data traversed and significantly improves query speed.

Preferably, as shown in Fig. 4, the query optimization module 603 comprises:

Syntax analysis unit 6031, for receiving the character string entered by the user, performing syntax analysis, and outputting an abstract syntax tree;

Semantic analysis unit 6032, for performing semantic analysis on the abstract syntax tree and generating a LogicalPlan and a DAG that the system can recognize.

Optimization strategy execution unit 6033, for optimizing the LogicalPlan to filter out unneeded data;

Delivery and execution unit 6034, for converting the LogicalPlan into RDDs and actually performing the data analysis on the Spark cluster.

An embodiment of the invention also provides a processing method combining stream computing and batch computing, comprising the following steps:

Further, the step of obtaining job instructions, scheduling system tasks, determining the type of each system job, and controlling real-time processing or batch processing of the metadata according to the determination result specifically comprises:

Obtaining job instructions, scheduling the system jobs and, at the same time, determining the type of each job;

Further, the step of forming and displaying different real-time data processing views and batch data views specifically comprises:

Further, query optimization for data processing is provided in that users can use keywords to query the stream-processed and batch-processed data before and after a period of time or a certain distance.

Further, the query optimization flow specifically comprises the following steps:

With reference to Fig. 5, this step preferably includes:

S60311 Lexical analysis. Specifically, the input character set is converted into "words" according to the defined lexical rules. For example, when select foo+100 from books is entered, the lexical analyzer outputs a "sentence" made up of "words":

(keyWord:SELECT) (Identifier:Foo) (keyword:+) (Number:100) (keyword:From) (Identifier:books)

Taking "every" as an example, the present invention adds the new keyword "every" through the following operation.
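
The lexical analysis step can be sketched with a small regular-expression tokenizer. This is only an illustration, not the patent's implementation: the `tokenize` function, the `KEYWORDS` set and the token labels are assumptions. Operators are labelled as keywords to mirror the sample output above, and "EVERY" is listed as a keyword to reflect the added keyword.

```python
import re
from typing import List, Tuple

# Hypothetical keyword set; "EVERY" stands for the keyword added by the invention.
KEYWORDS = {"SELECT", "FROM", "WHERE", "EVERY"}

TOKEN_RE = re.compile(
    r"\s*(?:(?P<number>\d+)|(?P<word>[A-Za-z_]\w*)|(?P<op>[+\-*/=<>(),]))"
)


def tokenize(text: str) -> List[Tuple[str, str]]:
    """Convert an input string into a list of (kind, value) tokens."""
    text = text.rstrip()
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"cannot tokenize input at position {pos}")
        pos = match.end()
        if match.group("number"):
            tokens.append(("Number", match.group("number")))
        elif match.group("word"):
            word = match.group("word")
            if word.upper() in KEYWORDS:
                tokens.append(("Keyword", word.upper()))
            else:
                tokens.append(("Identifier", word))
        else:
            # Operators are tagged as keywords, following the sample output.
            tokens.append(("Keyword", match.group("op")))
    return tokens


if __name__ == "__main__":
    print(tokenize("select foo+100 from books"))
```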

S60312 Syntax analysis. Specifically, syntax analysis judges, on the basis of the lexical analysis, whether the words entered by the user conform to the grammar. SELECT FOO+100 FROM POKES is a grammatical sentence, whereas SELECT FOO+100 FROM is an illegal statement, because FROM must be followed by a table name; otherwise the syntax analyzer reports an error.

S60313 Outputting the abstract syntax tree. Specifically, the abstract syntax tree is built as syntax analysis proceeds. When syntax analysis ends normally, the syntax analyzer outputs an abstract syntax tree whose structure corresponds one-to-one to the user's input. At this point, the "character string" entered by the user has been fully turned into a "structure".
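
The syntax check and the construction of the abstract syntax tree can be sketched with a tiny recursive-descent parser for statements of the form SELECT expression FROM table. The node classes and `parse_select` are hypothetical, the grammar is deliberately minimal, and the input is assumed to be a list of (kind, value) tuples such as a lexical analyzer would produce.

```python
from dataclasses import dataclass
from typing import List, Tuple

Token = Tuple[str, str]


@dataclass
class Column:
    name: str


@dataclass
class Literal:
    value: int


@dataclass
class BinaryOp:
    op: str
    left: object
    right: object


@dataclass
class Select:
    projection: object
    table: str


def parse_select(tokens: List[Token]) -> Select:
    """Check the token sequence against the grammar and build the tree."""
    pos = 0

    def expect(kind: str, value: str = None) -> str:
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError(f"expected {value or kind}, found end of input")
        tok_kind, tok_value = tokens[pos]
        if tok_kind != kind or (value is not None and tok_value != value):
            raise SyntaxError(f"expected {value or kind}, found {tok_value!r}")
        pos += 1
        return tok_value

    def operand():
        if pos < len(tokens) and tokens[pos][0] == "Number":
            return Literal(int(expect("Number")))
        return Column(expect("Identifier"))

    def expression():
        node = operand()
        while pos < len(tokens) and tokens[pos] in (("Keyword", "+"), ("Keyword", "-")):
            op = expect("Keyword")
            node = BinaryOp(op, node, operand())
        return node

    expect("Keyword", "SELECT")
    projection = expression()
    expect("Keyword", "FROM")
    table = expect("Identifier")  # FROM must be followed by a table name
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos][1]!r}")
    return Select(projection, table)
```

With this sketch, the tokens of SELECT FOO+100 FROM POKES yield a `Select` node, while the same tokens without the table name raise a `SyntaxError`.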

Semantic analysis: performing semantic analysis on the abstract syntax tree and generating a LogicalPlan and a DAG that the system can recognize;

Specifically, semantic analysis outputs a query plan from the abstract syntax tree of the previous stage, and generally comprises a logical analysis phase and a physical analysis phase. Logical analysis is essentially a purely algebraic analysis: it determines what the input SQL statement is meant to do and which operations it contains. In general, a SQL statement always has one input and one output, and the input data yield the output data after being processed by the SQL. The physical query plan is generated from the logical query plan produced earlier; during the conversion it is adapted to the Spark programming framework, producing a DAG that the system can recognize.

Specifically, the optimization stage optimizes the LogicalPlan, using the following rule:

Filter pushdown (FilterPushdown): filtering out unneeded elements as early as possible usually reduces overhead. The present invention modifies FilterPus to filter out the data that are not needed in the specific scenario.
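
The general idea of filter pushdown can be shown on a toy logical plan. This sketch does not use Spark SQL and does not reproduce the invention's scenario-specific modification; the `Scan`, `Project` and `Filter` node classes and `push_down_filters` are hypothetical, and `Project` is assumed to be a plain column selection without aliases or computed expressions.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Project:
    columns: Tuple[str, ...]
    child: object


@dataclass(frozen=True)
class Filter:
    column: str
    op: str
    value: int
    child: object


def push_down_filters(plan):
    """Rewrite Filter(Project(x)) into Project(Filter(x)) where the
    filtered column is available below the projection, so that rows
    are discarded before later operators see them."""
    if isinstance(plan, Filter):
        child = push_down_filters(plan.child)
        if isinstance(child, Project) and plan.column in child.columns:
            pushed = push_down_filters(
                Filter(plan.column, plan.op, plan.value, child.child)
            )
            return Project(child.columns, pushed)
        return Filter(plan.column, plan.op, plan.value, child)
    if isinstance(plan, Project):
        return Project(plan.columns, push_down_filters(plan.child))
    return plan


if __name__ == "__main__":
    plan = Filter("price", ">", 10, Project(("title", "price"), Scan("books")))
    print(push_down_filters(plan))
```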

Delivery and execution: converting the LogicalPlan into RDDs and actually performing the data analysis on the Spark cluster.

Specifically, the LogicalPlan is converted into RDDs and the data analysis is actually performed on the Spark cluster. The conversion from LogicalPlan to RDD introduces SparkPlan, whose main task is to generate the RDDs.

The system and method combining stream computing and batch computing proposed in the above embodiments form a framework built on Spark for processing streaming data. The basic principle is to divide the streaming data into small time slices (a few seconds each) and to process these small portions of data in a batch-like manner. Spark Streaming is built on Spark because, on the one hand, Spark's low-latency execution engine can be used for real-time computation and, on the other hand, compared with record-based processing frameworks (such as Storm), RDD data sets make efficient fault-tolerant processing easier. In addition, the solution of the invention optimizes the data query method and makes joint analysis of historical data and real-time data more convenient.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A processing system combining stream computing and batch computing, characterized by comprising:

Data analysis layer, for analyzing the processed data;

2. The processing system combining stream computing and batch computing according to claim 1, characterized in that the data computing layer comprises:

3. The processing system combining stream computing and batch computing according to claim 2, characterized in that the data view and query optimization layer comprises:

4. The processing system combining stream computing and batch computing according to claim 3, characterized in that the query optimization module comprises:

5. A processing method combining stream computing and batch computing, characterized by comprising the following steps:

6. The method according to claim 5, characterized in that the step of obtaining job instructions, scheduling system tasks, determining the type of each system job, and controlling real-time processing or batch processing of the metadata according to the determination result specifically comprises:

Obtaining job instructions, scheduling the system jobs and, at the same time, determining the type of each job;

7. The method according to claim 5, characterized in that the step of forming and displaying different real-time data processing views and batch data views specifically comprises:

8. The method according to claim 5, characterized in that query optimization for data processing is provided in that users can use keywords to query the stream-processed and batch-processed data before and after a period of time or a certain distance.

9. The method according to claim 8, characterized by further comprising the following steps:

10. The method according to claim 9, characterized in that the syntax analysis step specifically comprises:

Outputting an abstract syntax tree according to the analysis process.