CN105677752A - Streaming computing and batch computing combined processing system and method

Info

Publication number: CN105677752A
Authority: CN (China)
Prior art keywords: data, processing, batch, layer, batch processing
Legal status: Pending (the listed status is an assumption and is not a legal conclusion)
Application number: CN201511019708.2A
Other languages: Chinese (zh)
Inventors: 范小朋, 卞嫣然, 杨望仙, 须成忠
Current assignee: Shenzhen Institute of Advanced Technology of CAS (the listed assignees may be inaccurate)
Original assignee: Shenzhen Institute of Advanced Technology of CAS
Application filed by: Shenzhen Institute of Advanced Technology of CAS
Priority: CN201511019708.2A
Publication: CN105677752A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/3005 Arrangements for executing specific machine instructions to perform operations for flow control
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/9032 Query formulation
    • G06F16/90332 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching

Abstract

The invention provides a processing system and method combining stream computing and batch computing. The system comprises: an infrastructure layer, which provides the hardware environment for operating the system and includes virtualization, machine rooms, networks and clusters; a data storage management layer, which stores distributed data, preserves real-time data and batch data, and manages metadata; a data computing layer, which provides stream computing and batch computing; a task scheduling layer, which schedules system tasks, judges the system jobs, and controls the data computing layer to perform real-time processing or batch processing on metadata based on the judgment results; a data analysis layer, which analyzes the processed data; a data view and query optimization layer, which provides data speed views, data batch views and query optimization; and a data presentation layer, which provides visual information on the data analysis results. The invention effectively increases data processing efficiency and optimizes the data query method.

Description

A processing system and method combining stream computing and batch computing
Technical field
The invention belongs to the technical field of data processing with stream computing and batch computing, and in particular relates to a processing system and method combining stream computing and batch computing.
Background
With the rapid development of science, technology and engineering, many fields have produced massive amounts of data over the past 20 years, and the growth of big data has attracted wide attention. Big data processing falls into two main modes: batch processing and real-time processing. In batch processing, data are first stored and then analyzed. Real-time processing is a dynamic mode in which computation is performed as data flow in; stream computing is an important model derived from real-time processing.
At present, part of the existing work focuses on stream computing alone or batch computing alone and processes data with a single computing model. However, as data volumes grow and user needs diversify, the requirements placed on data processing keep rising, and a single computing model can no longer provide the service on its own. Another part of the existing work focuses on combining stream processing with batch processing but fails to integrate them effectively. Existing big data analysis systems mainly fuse stream computing and batch computing in one of three ways:
The first approach adds batch computing support on top of a stream computing system. With this method, the batch layer only needs to consider the data and the query functions over the data, so the batch layer is easy to control. The real-time layer needs incremental algorithms and a complex NoSQL database; isolating all of the complexity in the real-time layer substantially improves the robustness and reliability of the system. In practice, however, building a simple and unified data query function is not easy. Earlier relational database systems were data processing systems built on a complete relational model, so such a simple function model is hard to obtain when dealing with different kinds of structured and unstructured data.
The second approach starts from batch computing and incorporates stream data processing, for example by modifying the MapReduce programming model to perform real-time stream processing. MapReduce-based stream processing of this kind has several shortcomings: a) the input stream is cut into fragments of fixed size that are then processed by the MapReduce platform, so the processing delay is proportional to the fragment length and to the overhead of initializing processing tasks, dependency management between fragments is complicated, and the optimal fragment size depends on the specific application; b) to support stream processing, MapReduce is converted into a pipeline mode instead of having Reduce output its results directly, and, to improve processing efficiency, intermediate results are kept only in memory. These changes greatly increase the complexity of the original MapReduce framework and hinder maintenance and extension of the system; c) users are forced to define streaming jobs through the MapReduce interface, which reduces the scalability of user programs.
The third approach is a combined mode. Taking Twitter Summingbird as an example, Summingbird integrates platforms through a unified programming interface and offers good generality and strong extensibility, but its execution efficiency in actual operation is not ideal.
Summary of the invention
In view of the above problems in the prior art, the present invention provides a processing system combining stream computing and batch computing that achieves high data processing efficiency and an optimized data query method.
Embodiments of the invention provide a processing system combining stream computing and batch computing, comprising:
Infrastructure layer, for providing the hardware environment in which the system runs, including virtualization, machine rooms, networks and clusters;
Data storage management layer, for distributed data storage, preserving the processed real-time data and batch data, and managing metadata at the same time;
Data computing layer, for providing stream computing and batch computing;
Task scheduling layer, for scheduling system tasks, judging the system jobs, and controlling the data computing layer, according to the judgment result, to perform real-time processing or batch processing on the metadata;
Data analysis layer, for analyzing the processed data;
Data view and query optimization layer, for providing real-time data views, batch data views and query optimization;
Data presentation layer, for providing visual information on the data analysis results.
Preferably, the data computing layer comprises:
Stream computing framework, for processing streaming data when the task scheduling layer judges that the current system job is a real-time processing job;
Batch computing framework, for processing streaming data when the task scheduling layer judges that the current system job is a batch processing job.
Preferably, the data view and query optimization layer comprises:
Real-time view module, which uses an incremental algorithm to save the results computed by the stream computing framework into the data storage management layer and forms different real-time views;
Batch view module, which saves the results computed by the batch computing framework into the data storage management layer and forms different batch views;
Query optimization module, which allows the user to use keywords to query stream-processed and batch-processed data before and after a certain period of time or a certain distance.
Preferably, the query optimization module comprises:
Syntax analysis unit, for receiving the character string entered by the user, performing syntax analysis, and outputting an abstract syntax tree;
Semantic analysis unit, for performing semantic analysis on the abstract syntax tree and generating a LogicalPlan and a DAG that the system can recognize;
Optimization strategy execution unit, for optimizing the LogicalPlan by filtering out data that are not needed;
Submission execution unit, for converting the LogicalPlan into RDDs and actually performing the data analysis on the Spark cluster.
Embodiments of the invention also provide a processing method combining stream computing and batch computing, comprising the following steps:
Building the hardware environment in which the system runs, including virtualization, machine rooms, networks and clusters;
Obtaining a job instruction, scheduling system tasks, judging the system job, and performing real-time processing or batch processing on the metadata according to the judgment result;
Storing and managing the processed real-time data, the batch data and the metadata;
Forming and displaying different real-time data views and batch data views, while providing query optimization for data processing.
Preferably, the step of obtaining a job instruction, scheduling system tasks, judging the system job, and performing real-time processing or batch processing on the metadata according to the judgment result is specifically:
Obtaining a job instruction, scheduling the system job, and judging the job at the same time;
If it is a real-time processing job, stream processing technology is used for real-time data processing: the stream computation is decomposed into a series of short batch jobs, that is, the input data are divided into segments according to the batch size, each segment is converted into an RDD in Spark, Spark Streaming then turns each Transformation operation on the DStream into a Transformation operation on RDDs in Spark, and the intermediate results produced by operating on the RDDs are kept in memory;
If it is a batch processing job, a batch processing system is used: distributed data sets are abstracted as resilient distributed datasets (RDDs); the system implements application task scheduling, RPC, serialization and compression, and provides an API for the upper-layer components that run on it; data are partitioned in the distributed environment, jobs are then converted into directed acyclic graphs (DAGs), and the DAG is scheduled stage by stage with the tasks processed in parallel across the cluster.
Preferably, the step of forming and displaying different real-time data views and batch data views specifically comprises:
Storing the results of the real-time computation using an incremental algorithm to form different real-time views; and storing the results of the batch computation to form different batch views.
Preferably, query optimization for data processing is provided in that the user can use keywords to query stream-processed and batch-processed data before and after a certain period of time or a certain distance.
Preferably, the method further comprises the following steps:
Syntax analysis: receiving the character string entered by the user, performing syntax analysis, and outputting an abstract syntax tree;
Semantic analysis: performing semantic analysis on the abstract syntax tree and generating a LogicalPlan and a DAG that the system can recognize;
Optimization strategy execution: optimizing the LogicalPlan by filtering out data that are not needed;
Submission and execution: converting the LogicalPlan into RDDs and actually performing the data analysis on the Spark cluster.
Preferably, the syntax analysis step is specifically:
Converting the input character set into words according to the defined lexical rules;
On the basis of the lexical analysis, judging whether the words entered by the user conform to the grammar;
Outputting an abstract syntax tree according to the analysis process.
In the above technical solution, system jobs are judged when system tasks are scheduled, and real-time processing or batch processing is performed on the metadata according to the judgment result. This achieves interoperability between different computing models at the architectural level, effectively improves the efficiency of fault-tolerant processing, optimizes the data query method, and makes joint analysis of historical data and real-time data more convenient.
Brief description of the drawings
Fig. 1 is a system architecture diagram of the processing system combining stream computing and batch computing according to an embodiment of the invention.
Fig. 2 is a structural diagram of the data computing layer according to an embodiment of the invention.
Fig. 3 is a structural diagram of the data view and query optimization layer according to an embodiment of the invention.
Fig. 4 is a structural diagram of the query optimization module according to an embodiment of the invention.
Fig. 5 is a flow chart of query syntax analysis according to an embodiment of the invention.
Detailed description of the embodiments
To make the technical problem solved by the invention, the technical solution and the beneficial effects clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it.
As shown in Fig. 1, embodiments of the invention provide a processing system combining stream computing and batch computing, comprising:
Infrastructure layer 10, for providing the hardware environment in which the system runs, specifically the construction of system infrastructure, including virtualization, machine rooms, networks, clusters and the like;
Data storage management layer 20, for distributed data storage, preserving the processed real-time data and batch data, and managing metadata at the same time; distributed data storage includes but is not limited to HDFS, a distributed MySQL data warehouse, and Cassandra.
Data computing layer 30, for providing stream computing and batch computing, that is, providing stream computing and batch computing frameworks and models.
Task scheduling layer 40, for scheduling system tasks, judging the system jobs, and controlling the data computing layer, according to the judgment result, to perform real-time processing or batch processing on the metadata; the ways of scheduling system jobs include but are not limited to FIFO and FAIR.
Data analysis layer 50, for analyzing the processed data, including but not limited to data mining, machine learning and deep learning.
Data view and query optimization layer 60, for providing real-time data views, batch data views and query optimization;
Data presentation layer 70, for providing visual information on the data analysis results.
Further, as shown in Fig. 2, the data computing layer 30 comprises:
Stream computing framework 301, for processing streaming data when the task scheduling layer judges that the current system job is a real-time processing job;
Specifically, stream processing technology (such as Spark Streaming) is used for real-time data processing: the stream computation is decomposed into a series of short batch jobs, that is, the input data are divided into segments according to the batch size, each segment is converted into an RDD in Spark, Spark Streaming then turns each Transformation operation on the DStream into a Transformation operation on RDDs in Spark, and the intermediate results produced by operating on the RDDs are kept in memory.
Batch computing framework 302, for processing streaming data when the task scheduling layer judges that the current system job is a batch processing job;
Specifically, a batch processing system (such as Spark) is used: distributed data sets are abstracted as resilient distributed datasets (RDDs); Spark implements application task scheduling, RPC, serialization and compression, and provides an API for the upper-layer components that run on it. Spark partitions data in the distributed environment, converts jobs into directed acyclic graphs (DAGs), and schedules the DAG stage by stage with the tasks processed in parallel across the cluster.
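The stage-by-stage scheduling of a DAG described for the batch computing framework can be illustrated with a minimal sketch in plain Python rather than Spark. The job graph, the task names and the `schedule_stages` helper are hypothetical and not taken from the patent. Grouping all tasks whose dependencies are satisfied into one stage is a simplification; Spark itself cuts stages at shuffle boundaries.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def schedule_stages(dependencies: Dict[str, Set[str]]) -> List[List[str]]:
    """Group tasks into stages; every task in a stage has all its inputs ready."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises CycleError if the graph is not acyclic
    stages: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks that could run in parallel
        stages.append(ready)
        sorter.done(*ready)  # mark the whole stage as finished
    return stages


# Hypothetical job: each task maps to the set of tasks it depends on.
job = {
    "read": set(),
    "filter": {"read"},
    "map": {"read"},
    "join": {"filter", "map"},
    "save": {"join"},
}
stages = schedule_stages(job)
```

Here `filter` and `map` land in the same stage because both depend only on `read`, so a scheduler could run them in parallel on different partitions.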
In the above process of scheduling system jobs, each job is judged: if it is a real-time processing job, the stream computing framework 301 is called to perform the computation; if it is a batch processing job, the batch computing framework 302 is called.
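The judgment-and-dispatch step can be sketched as follows. This is a minimal illustration under stated assumptions: `Job`, `JobType`, `dispatch` and the two stand-in framework functions are hypothetical names, and the queue is handled in FIFO order, which is one of the scheduling modes the text mentions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List


class JobType(Enum):
    REAL_TIME = "real_time"
    BATCH = "batch"


@dataclass
class Job:
    name: str
    job_type: JobType
    records: List[int]


def stream_framework(job: Job) -> str:
    """Stand-in for stream computing framework 301."""
    return f"stream:{job.name}:{len(job.records)} records"


def batch_framework(job: Job) -> str:
    """Stand-in for batch computing framework 302."""
    return f"batch:{job.name}:{len(job.records)} records"


# The judgment result selects the framework that handles the job.
FRAMEWORKS: Dict[JobType, Callable[[Job], str]] = {
    JobType.REAL_TIME: stream_framework,
    JobType.BATCH: batch_framework,
}


def dispatch(queue: List[Job]) -> List[str]:
    """Handle jobs in arrival (FIFO) order, routing each one by its type."""
    return [FRAMEWORKS[job.job_type](job) for job in queue]


results = dispatch([
    Job("clicks", JobType.REAL_TIME, [1, 2, 3]),
    Job("nightly", JobType.BATCH, [4, 5]),
])
```

A lookup table keyed by job type keeps the routing decision in one place, so a further job type would only need one more entry.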
Further, as shown in Fig. 3, the data view and query optimization layer 60 comprises:
Real-time view module 601, which provides real-time views. Specifically, the results produced by the stream computing framework 301 are saved in the data storage management layer 20 using an incremental algorithm, forming different real-time views (speed views). When the data set recomputed in batch processing includes data already processed in real time, the corresponding data are deleted from the current real-time view.
Batch view module 602, which provides batch views. Specifically, the results produced by the batch computing framework 302 are saved in the data storage management layer 20, forming different batch views.
Query optimization module 603, which provides query operations with richer semantics. Specifically, a data query optimization strategy is proposed on the basis of Spark SQL to address the slow query speed and limited semantics of the native Spark SQL API. Compared with traditional SQL, new keywords are added according to the specific application scenario, such as the keyword "every", which lets the user query stream-processed and batch-processed data after a certain period of time or a certain distance. In addition, the invention applies scenario-specific optimizations while the SQL is being parsed, which reduces the amount of data traversed and significantly improves query speed.
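The behaviour described for the real-time view module 601 (incremental updates, and deletion of entries once a batch recomputation covers them) can be sketched in plain Python. The class `SpeedView`, its methods and the `query` helper are hypothetical. Answering a query by adding the batch view and the speed view together is an assumption of this sketch, not something the text specifies.

```python
from collections import Counter
from typing import Dict, List, Tuple


class SpeedView:
    """Real-time view: per-key counts of events not yet covered by a batch run."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()
        self.events: List[Tuple[int, str]] = []  # (timestamp, key)

    def update(self, timestamp: int, key: str) -> None:
        """Incremental update: only the counter of the new event's key changes."""
        self.events.append((timestamp, key))
        self.counts[key] += 1

    def purge_through(self, covered_until: int) -> None:
        """Delete events that a batch recomputation has already absorbed."""
        remaining = [(ts, key) for ts, key in self.events if ts > covered_until]
        self.events = remaining
        self.counts = Counter(key for _, key in remaining)


def query(batch_view: Dict[str, int], speed_view: SpeedView, key: str) -> int:
    """Combine the batch view with the not-yet-batched real-time counts."""
    return batch_view.get(key, 0) + speed_view.counts[key]


speed = SpeedView()
for ts, key in [(1, "a"), (2, "b"), (3, "a")]:
    speed.update(ts, key)
before = query({}, speed, "a")           # no batch view yet

batch_view = {"a": 1, "b": 1}            # batch run covering timestamps <= 2
speed.purge_through(2)                   # drop what the batch view now contains
after = query(batch_view, speed, "a")
```

The count for `"a"` is the same before and after the purge, because every event is counted exactly once, either in the batch view or in the speed view.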
Preferably, as shown in Fig. 4, the query optimization module 603 comprises:
Syntax analysis unit 6031, for receiving the character string entered by the user, performing syntax analysis, and outputting an abstract syntax tree;
Semantic analysis unit 6032, for performing semantic analysis on the abstract syntax tree and generating a LogicalPlan and a DAG that the system can recognize.
Optimization strategy execution unit 6033, for optimizing the LogicalPlan by filtering out data that are not needed;
Submission execution unit 6034, for converting the LogicalPlan into RDDs and actually performing the data analysis on the Spark cluster.
Embodiments of the invention also provide a processing method combining stream computing and batch computing, characterized by comprising the following steps:
Building the hardware environment in which the system runs, including virtualization, machine rooms, networks and clusters;
Obtaining a job instruction, scheduling system tasks, judging the system job, and performing real-time processing or batch processing on the metadata according to the judgment result;
Storing and managing the processed real-time data, the batch data and the metadata;
Forming and displaying different real-time data views and batch data views, while providing query optimization for data processing.
Further, the step of obtaining a job instruction, scheduling system tasks, judging the system job, and performing real-time processing or batch processing on the metadata according to the judgment result is specifically:
Obtaining a job instruction, scheduling the system job, and judging the job at the same time;
If it is a real-time processing job, stream processing technology is used for real-time data processing: the stream computation is decomposed into a series of short batch jobs, that is, the input data are divided into segments according to the batch size, each segment is converted into an RDD in Spark, Spark Streaming then turns each Transformation operation on the DStream into a Transformation operation on RDDs in Spark, and the intermediate results produced by operating on the RDDs are kept in memory;
If it is a batch processing job, a batch processing system is used: distributed data sets are abstracted as resilient distributed datasets (RDDs); the system implements application task scheduling, RPC, serialization and compression, and provides an API for the upper-layer components that run on it; data are partitioned in the distributed environment, jobs are then converted into directed acyclic graphs (DAGs), and the DAG is scheduled stage by stage with the tasks processed in parallel across the cluster.
Further, the step of forming and displaying different real-time data views and batch data views specifically comprises:
Storing the results of the real-time computation using an incremental algorithm to form different real-time views; and storing the results of the batch computation to form different batch views.
Further, query optimization for data processing is provided in that the user can use keywords to query stream-processed and batch-processed data before and after a certain period of time or a certain distance.
Further, the query optimization flow specifically comprises the following steps:
Syntax analysis: receiving the character string entered by the user, performing syntax analysis, and outputting an abstract syntax tree;
With reference to Fig. 5, this step preferably includes:
S60311 Lexical analysis. Specifically, the input character set is converted into "words" according to the defined lexical rules. For example, when the input is select foo+100 from books, the lexical analyzer outputs a "sentence" composed of "words":
(Keyword:SELECT) (Identifier:foo) (Keyword:+) (Number:100) (Keyword:FROM) (Identifier:books)
Taking "every" as an example, the invention adds "every" as a new keyword at this stage.
S60312 Syntax analysis. Specifically, on the basis of the lexical analysis, syntax analysis judges whether the words entered by the user conform to the grammar. SELECT FOO+100 FROM POKES is a grammatical sentence, whereas SELECT FOO+100 FROM is an illegal statement, because FROM must be followed by a table name; otherwise the parser reports an error.
S60313 Output of the abstract syntax tree. Specifically, the abstract syntax tree is constructed as syntax analysis proceeds. When syntax analysis ends normally, the parser outputs an abstract syntax tree whose structure corresponds one to one with the user's input. At this point, the "character string" entered by the user has been completely turned into a "structure".
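Steps S60311 to S60313 can be sketched as a small hand-written lexer and parser. This is a minimal illustration in plain Python, not the Spark SQL parser: the grammar covers only `SELECT term ('+' term)* FROM table`, and `tokenize`, `parse` and the AST classes are hypothetical names. `EVERY` is listed merely to show where an extra keyword would be registered; its query semantics are not modeled.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple, Union

Token = Tuple[str, str]  # (kind, text)

# "EVERY" stands for the added keyword; only its recognition is sketched here.
KEYWORDS = {"SELECT", "FROM", "EVERY"}
TOKEN_RE = re.compile(r"\s*(?:(?P<number>\d+)|(?P<word>[A-Za-z_]\w*)|(?P<symbol>[+]))")


def tokenize(text: str) -> List[Token]:
    """Lexical analysis: turn the input characters into a list of 'words'."""
    text = text.rstrip()
    tokens: List[Token] = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected input at position {pos}")
        pos = match.end()
        if match.group("number") is not None:
            tokens.append(("Number", match.group("number")))
        elif match.group("word") is not None:
            word = match.group("word")
            is_keyword = word.upper() in KEYWORDS
            tokens.append(("Keyword", word.upper()) if is_keyword else ("Identifier", word))
        else:
            # Operators are classed as keywords, as in the example above.
            tokens.append(("Keyword", match.group("symbol")))
    return tokens


@dataclass
class Column:
    name: str


@dataclass
class Literal:
    value: int


@dataclass
class Add:
    left: "Expr"
    right: "Expr"


Expr = Union[Column, Literal, Add]


@dataclass
class Select:
    projection: Expr
    table: str


def parse(tokens: List[Token]) -> Select:
    """Syntax analysis; returns the abstract syntax tree or raises SyntaxError."""
    pos = 0

    def take(kind: str, text: str = "") -> str:
        nonlocal pos
        found = tokens[pos] if pos < len(tokens) else ("EOF", "")
        if found[0] != kind or (text and found[1] != text):
            raise SyntaxError(f"expected {text or kind}, found {found[1] or found[0]}")
        pos += 1
        return found[1]

    def term() -> Expr:
        if pos < len(tokens) and tokens[pos][0] == "Number":
            return Literal(int(take("Number")))
        return Column(take("Identifier"))

    take("Keyword", "SELECT")
    expr = term()
    while pos < len(tokens) and tokens[pos] == ("Keyword", "+"):
        take("Keyword", "+")
        expr = Add(expr, term())
    take("Keyword", "FROM")
    table = take("Identifier")  # FROM must be followed by a table name
    take("EOF")
    return Select(expr, table)


ast = parse(tokenize("select foo+100 from books"))
```

Parsing `select foo+100 from` raises `SyntaxError`, because no table name follows `FROM`.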
Semantic analysis: performing semantic analysis on the abstract syntax tree and generating a LogicalPlan and a DAG that the system can recognize;
Specifically, semantic analysis outputs a query plan from the abstract syntax tree of the previous stage and generally comprises a logical analysis phase and a physical analysis phase. Logical analysis is essentially a purely algebraic process: it works out what the input SQL statement is meant to do and which operations it contains. In general, a SQL statement has one input and one output, and the input data become the output data after being processed by the SQL. The physical query plan is generated from the logical query plan produced earlier; during the conversion it is adapted to the Spark programming framework, generating a DAG that the system can recognize.
Optimization strategy execution: optimizing the LogicalPlan by filtering out data that are not needed;
Specifically, the optimization stage operates on the LogicalPlan, using the following rule:
FilterPushdown (filter push-down): filtering out unneeded elements as early as possible reduces overhead. The invention modifies FilterPushdown so that data not needed in the specific scenario are filtered out.
Submission and execution: converting the LogicalPlan into RDDs and actually performing the data analysis on the Spark cluster.
Specifically, the LogicalPlan is converted into RDDs and the data analysis is actually performed on the Spark cluster. The conversion from LogicalPlan to RDD introduces SparkPlan, whose main task is to generate RDDs.
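The filter push-down rule named in the optimization step can be illustrated on a toy logical plan. This is a minimal sketch in plain Python and not the Spark SQL optimizer: the node types `Scan`, `Project` and `Filter` and the function `push_down_filters` are hypothetical, and `Project` is assumed to pass columns through unchanged, so a filter on a projected column can safely move below it.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Project:
    columns: Tuple[str, ...]
    child: "Plan"


@dataclass(frozen=True)
class Filter:
    column: str
    op: str
    value: int
    child: "Plan"


Plan = Union[Scan, Project, Filter]


def push_down_filters(plan: Plan) -> Plan:
    """Move each Filter below any Project that merely passes its column through."""
    if isinstance(plan, Scan):
        return plan
    if isinstance(plan, Project):
        return Project(plan.columns, push_down_filters(plan.child))
    child = push_down_filters(plan.child)
    if isinstance(child, Project) and plan.column in child.columns:
        # Filter first, then project: fewer rows reach the projection.
        pushed = push_down_filters(Filter(plan.column, plan.op, plan.value, child.child))
        return Project(child.columns, pushed)
    return Filter(plan.column, plan.op, plan.value, child)


before = Filter("price", ">", 100, Project(("title", "price"), Scan("books")))
after = push_down_filters(before)
```

After the rewrite the filter sits directly above the scan, so rows that fail the predicate are dropped before the projection processes them.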
The system and method combining stream computing and batch computing proposed in the above embodiments is a framework built on Spark for processing streaming data. The basic principle is to divide the streaming data into small time slices (a few seconds each) and to process each small portion of data in a batch-like manner. Spark Streaming is built on Spark because, on the one hand, Spark's low-latency execution engine can be used for real-time computation and, on the other hand, compared with record-based processing frameworks (such as Storm), RDD data sets make efficient fault-tolerant processing easier. In addition, the solution of the invention optimizes the data query method and makes joint analysis of historical data and real-time data more convenient.
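The time-slicing principle can be sketched without Spark. This is a minimal illustration under stated assumptions: events carry an explicit timestamp in seconds, `slice_by_time` and `process_stream` are hypothetical helpers, and empty slices are simply skipped.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, payload)


def slice_by_time(events: Iterable[Event], interval: float) -> List[List[str]]:
    """Group events into consecutive time slices of `interval` seconds."""
    slices: Dict[int, List[str]] = defaultdict(list)
    for timestamp, payload in events:
        slices[int(timestamp // interval)].append(payload)
    return [slices[index] for index in sorted(slices)]


def process_stream(
    events: Iterable[Event],
    interval: float,
    batch_job: Callable[[List[str]], int],
) -> List[int]:
    """Run the same batch job on every slice; the results stay in memory."""
    return [batch_job(batch) for batch in slice_by_time(events, interval)]


events = [(0.5, "a"), (1.2, "b"), (2.1, "c"), (2.9, "d"), (4.0, "e")]
counts = process_stream(events, 2.0, len)  # count the events in each 2-second slice
```

A shorter interval lowers latency but produces more batches, and with them more per-batch overhead.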
The above are only preferred embodiments of the invention and do not limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (10)

1. a streaming calculates and calculates, with batch processing, the treatment system that combines, it is characterised in that, comprising:
Infrastructure layer, for providing the hardware environment of system cloud gray model, comprises virtualization, machine room, network and cluster;
Data store management layer, for distributed storage data, real time data after specimens preserving and batching data, manage metadata simultaneously;
Data computation layer, for providing streaming to calculate and batch processing account form;
Task dispatch layer, for system task being dispatched, and judges system job, carries out processing in real time or batch processing to metadata according to judged result control data computation layer;
Data analysis layer, for carrying out data analysis to the data after process;
Data View and inquiry Optimization Layer, for providing generating date view, data batch view and inquiry to optimize;
Data display layer, for providing the visual information of data analysis result.
2. streaming according to claim 1 calculates and calculates, with batch processing, the treatment system that combines, it is characterised in that, described data computation layer comprises:
Stream calculation framework, for processing for convection type data when described task dispatch layer judges that the current operation of system is real-time processing operation;
Batch processing Computational frame, for processing for convection type data when described task dispatch layer judges that the current operation of system is batch processing job.
3. streaming according to claim 2 calculates and calculates, with batch processing, the treatment system that combines, it is characterised in that, described Data View and inquiry Optimization Layer comprise:
RUNTIME VIEW processing module, adopts delta algorithm stream calculation framework to be calculated the result produced and is saved in described data store management layer, and form different real-time process views;
Batch view module, calculates batch processing Computational frame the result produced and is saved in described data store management layer, and form different batch view;
Module is optimized in inquiry, for providing user to utilize the data of the Stream Processing before and after keyword query for some time or a certain segment distance and batch processing.
4. The processing system combining streaming computing and batch computing according to claim 3, characterized in that the query optimization module comprises:
A syntax analysis unit, for receiving the character string entered by the user, performing syntax analysis, and outputting an abstract syntax tree;
A lexical analysis unit, for performing lexical analysis on the abstract syntax tree and generating a LogicalPlan and a DAG that the system can recognize;
An optimization strategy executor unit, for optimizing the LogicalPlan so as to filter out data that is not needed;
A delivery executor unit, for converting the LogicalPlan into RDDs and actually carrying out the data analysis on the Spark cluster.
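One way to picture the optimization step in claim 4 is column pruning: the optimizer records which columns the scan really needs, so unneeded data is discarded as early as possible. The Python sketch below is an assumption-laden illustration, not the patent's optimizer; `LogicalPlan` here is a hypothetical dataclass (unrelated to Spark's internal class of the same name), and `execute` is a plain-Python stand-in for running the plan as RDD operations.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple

Row = Dict[str, Any]


@dataclass(frozen=True)
class LogicalPlan:
    """Hypothetical plan: output columns plus an optional equality predicate."""
    select: Tuple[str, ...]
    where: Optional[Tuple[str, Any]] = None          # (column, required value)
    scan_columns: Optional[Tuple[str, ...]] = None   # filled in by the optimizer


def optimize(plan: LogicalPlan) -> LogicalPlan:
    """Column pruning: record the only columns the scan has to read."""
    needed = list(plan.select)
    if plan.where is not None and plan.where[0] not in needed:
        needed.append(plan.where[0])
    return LogicalPlan(plan.select, plan.where, tuple(needed))


def execute(plan: LogicalPlan, table: List[Row]) -> List[Row]:
    """Stand-in executor; a real system would run this as RDD operations."""
    cols = plan.scan_columns  # None means "read every column"
    result = []
    for row in table:
        scanned = row if cols is None else {c: row[c] for c in cols}
        if plan.where is not None and scanned[plan.where[0]] != plan.where[1]:
            continue  # rows failing the predicate are dropped at scan time
        result.append({c: scanned[c] for c in plan.select})
    return result
```

The optimized and unoptimized plans return the same rows; the difference is only in how much data the scan has to carry.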
5. A processing method combining streaming computing and batch computing, characterized in that it comprises the following steps:
Building the hardware environment in which the system runs, including virtualization, machine rooms, networks, and clusters;
Obtaining job instructions, scheduling the system tasks, judging the system job, and, according to the judgment result, controlling real-time processing or batch processing of the metadata;
Storing and managing the processed real-time data, the batch data, and the metadata;
Forming and displaying different real-time data processing views and batch data processing views, while providing query optimization for the data processing.
6. The method according to claim 5, characterized in that the step of obtaining job instructions, scheduling the system tasks, judging the system job, and controlling real-time processing or batch processing of the metadata according to the judgment result is specifically:
Obtaining job instructions and scheduling the system jobs while judging each job;
If the job is a real-time processing job, stream processing technology is used for real-time data processing: the streaming computation is decomposed into a series of short batch jobs, that is, the input data is divided into segments according to the batch size (batchsize), each segment is converted into an RDD in Spark, Spark Streaming then turns each Transformation operation on the DStream into a Transformation operation on the RDDs in Spark, and the intermediate results produced by the RDD operations are kept in memory;
If the job is a batch processing job, a batch processing system is used for batch processing: the distributed data set is abstracted as a resilient distributed dataset (RDD); application task scheduling, RPC, serialization, and compression are implemented, and APIs are provided for the upper-layer components that run on top of it; the data is partitioned in the distributed environment, the job is then converted into a directed acyclic graph (DAG), and the DAG is scheduled in stages and the tasks are executed in parallel in a distributed manner.
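The two mechanisms in claim 6 can be sketched in plain Python without Spark. `micro_batches` splits an input stream into segments of a fixed record count, and `schedule_stages` groups the tasks of a dependency graph into stages that can run in parallel. Both function names are hypothetical. Segmenting by record count is a simplification assumed here: Spark Streaming itself cuts batches by a time interval. The `deps` mapping must list every task as a key.

```python
from typing import Dict, Iterable, Iterator, List, Set


def micro_batches(records: Iterable, batch_size: int) -> Iterator[List]:
    """Split an input stream into consecutive segments of at most batch_size records."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch: List = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly shorter, segment


def schedule_stages(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Group DAG tasks into stages; a task is ready once all its dependencies ran."""
    remaining = {task: set(d) for task, d in deps.items()}
    stages: List[List[str]] = []
    while remaining:
        ready = sorted(t for t, d in remaining.items() if not d)
        if not ready:
            raise ValueError("dependency graph contains a cycle or an unknown task")
        stages.append(ready)
        for t in ready:
            del remaining[t]
        for d in remaining.values():
            d.difference_update(ready)
    return stages
```

Tasks within one stage have no dependencies on each other, so an executor is free to run them concurrently before moving on to the next stage.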
7. The method according to claim 5, characterized in that the step of forming and displaying different real-time data processing views and batch data processing views specifically comprises:
Storing the results of the real-time processing computation using an incremental (delta) algorithm to form different real-time processing views; and storing the results of the batch processing computation to form different batch views.
8. The method according to claim 5, characterized in that the query optimization for the data processing consists in enabling a user to query, by keyword, the stream-processed and batch-processed data before and after a certain period of time or a certain distance.
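A keyword query restricted to a time window, as described in claim 8, might look like the Python sketch below. The record layout (a `ts` timestamp and a `text` field) and the function name `query_window` are assumptions made for illustration; the same filter could be applied to records taken from either the real-time views or the batch views.

```python
from typing import Dict, List


def query_window(records: List[Dict], keyword: str, center: float, span: float) -> List[Dict]:
    """Return records containing `keyword` whose timestamp is within `span` of `center`."""
    return [
        r for r in records
        if keyword in r["text"] and abs(r["ts"] - center) <= span
    ]
```

Replacing the timestamp with a position value would give the distance-based variant of the same query.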
9. The method according to claim 8, characterized in that it further comprises the following steps:
Syntax analysis: receiving the character string entered by the user, performing syntax analysis, and outputting an abstract syntax tree;
Lexical analysis: performing lexical analysis on the abstract syntax tree and generating a LogicalPlan and a DAG that the system can recognize;
Optimization strategy execution: optimizing the LogicalPlan so as to filter out data that is not needed;
Delivery execution: converting the LogicalPlan into RDDs and actually carrying out the data analysis on the Spark cluster.
10. The method according to claim 9, characterized in that the syntax analysis step is specifically:
Converting the input character set into words according to the defined lexical rules;
On the basis of the lexical analysis, judging whether the words entered by the user conform to the grammar;
Outputting an abstract syntax tree according to the analysis process.
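The three sub-steps of claim 10 correspond to a classic tokenizer followed by a grammar check that builds a tree. The Python sketch below does this for a deliberately tiny, assumed query form, `SELECT <column> FROM <table> [WHERE <column> = <number>]`; the grammar, the token kinds, and the dictionary-shaped syntax tree are all illustrative choices, not the query language of the patent.

```python
import re

KEYWORDS = {"SELECT", "FROM", "WHERE"}
TOKEN_RE = re.compile(r"\s*(?:(\d+)|(\w+)|(=))")


def tokenize(text):
    """Lexical step: convert the input characters into (kind, value) tokens."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character at position {pos}")
        number, word, _eq = m.groups()
        if number is not None:
            tokens.append(("NUMBER", int(number)))
        elif word is not None:
            upper = word.upper()
            tokens.append(("KEYWORD", upper) if upper in KEYWORDS else ("IDENT", word))
        else:
            tokens.append(("EQ", "="))
        pos = m.end()
    return tokens


def parse(text):
    """Grammar step: check the token order and build the abstract syntax tree."""
    tokens = tokenize(text)
    pos = 0

    def expect(kind, value=None):
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError(f"expected {value or kind}, found end of input")
        tok_kind, tok_value = tokens[pos]
        if tok_kind != kind or (value is not None and tok_value != value):
            raise SyntaxError(f"expected {value or kind}, found {tok_value!r}")
        pos += 1
        return tok_value

    expect("KEYWORD", "SELECT")
    column = expect("IDENT")
    expect("KEYWORD", "FROM")
    table = expect("IDENT")
    where = None
    if pos < len(tokens):
        expect("KEYWORD", "WHERE")
        left = expect("IDENT")
        expect("EQ")
        where = {"op": "=", "column": left, "value": expect("NUMBER")}
    if pos != len(tokens):
        raise SyntaxError("unexpected trailing input")
    return {"type": "select", "column": column, "table": table, "where": where}
```

Input that does not follow the assumed grammar, such as a query missing `FROM`, raises `SyntaxError` instead of producing a tree.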
CN201511019708.2A 2015-12-30 2015-12-30 Streaming computing and batch computing combined processing system and method Pending CN105677752A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201511019708.2A CN105677752A (en) 2015-12-30 2015-12-30 Streaming computing and batch computing combined processing system and method

Publications (1)

Publication Number Publication Date
CN105677752A true CN105677752A (en) 2016-06-15

Family

ID=56297986

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201511019708.2A Pending CN105677752A (en) 2015-12-30 2015-12-30 Streaming computing and batch computing combined processing system and method

Country Status (1)

Country Link
CN (1) CN105677752A (en)

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106254896A (en) * 2016-08-05 2016-12-21 中国传媒大学 A kind of distributed cryptographic method for real-time video
CN106873945A (en) * 2016-12-29 2017-06-20 中山大学 Data processing architecture and data processing method based on batch processing and Stream Processing
CN107016128A (en) * 2017-05-16 2017-08-04 郑州云海信息技术有限公司 A kind of data processing method and device
CN107341084A (en) * 2017-05-16 2017-11-10 阿里巴巴集团控股有限公司 A kind of method and device of data processing
CN107391719A (en) * 2017-07-31 2017-11-24 南京邮电大学 Distributed stream data processing method and system in a kind of cloud environment
CN108241722A (en) * 2016-12-23 2018-07-03 北京金山云网络技术有限公司 A kind of data processing system, method and device
CN108491507A (en) * 2018-03-22 2018-09-04 北京交通大学 A kind of parallel continuous Query method of uncertain traffic flow data based on Hadoop distributed environments
CN108806797A (en) * 2018-06-27 2018-11-13 思派(北京)网络科技有限公司 A kind of processing method and system of medical data
CN108920206A (en) * 2018-06-13 2018-11-30 北京交通大学 A kind of plug-in unit dispatching method and device
CN108984279A (en) * 2018-07-02 2018-12-11 山东汇贸电子口岸有限公司 A kind of streaming computing method of internet of things oriented tradition SQL developer
CN109192248A (en) * 2017-07-21 2019-01-11 上海桑格信息技术有限公司 Biological information analysis system, method and cloud computing platform system based on cloud platform
CN109375912A (en) * 2018-10-18 2019-02-22 腾讯科技(北京)有限公司 Model sequence method, apparatus and storage medium
CN109598348A (en) * 2017-09-28 2019-04-09 北京猎户星空科技有限公司 A kind of image pattern obtains, model training method and system
CN109828751A (en) * 2019-02-15 2019-05-31 福州大学 Integrated machine learning algorithm library and unified programming framework
CN109918391A (en) * 2019-03-12 2019-06-21 威讯柏睿数据科技(北京)有限公司 A kind of streaming transaction methods and system
CN110532283A (en) * 2019-09-03 2019-12-03 衢州学院 A kind of smart city big data processing system based on Hadoop aggregated structure
WO2020168901A1 (en) * 2019-02-19 2020-08-27 阿里巴巴集团控股有限公司 Data calculation method and engine
CN111611221A (en) * 2019-02-26 2020-09-01 北京京东尚科信息技术有限公司 Hybrid computing system, data processing method and device
CN112416537A (en) * 2020-12-15 2021-02-26 东北大学 Unified expression API calling system and calling method in Gaia system
CN112597200A (en) * 2020-12-22 2021-04-02 南京三眼精灵信息技术有限公司 Batch and streaming combined data processing method and device
CN112800091A (en) * 2021-01-26 2021-05-14 北京明略软件系统有限公司 Flow-batch integrated calculation control system and method
CN113297212A (en) * 2021-04-28 2021-08-24 上海淇玥信息技术有限公司 Spark query method and device based on materialized view and electronic equipment
CN113641705A (en) * 2021-08-16 2021-11-12 神州数码融信软件有限公司 Marketing disposal rule engine method based on calculation engine
CN113807710A (en) * 2021-09-22 2021-12-17 四川新网银行股份有限公司 Method for sectionally paralleling and dynamically scheduling system batch tasks and storage medium
CN115599524A (en) * 2022-10-27 2023-01-13 中国兵器工业计算机应用技术研究所(Cn) Data lake system based on cooperative scheduling processing of streaming data and batch data
CN116383238A (en) * 2023-06-06 2023-07-04 湖南红普创新科技发展有限公司 Data virtualization system, method, device, equipment and medium based on graph structure
CN113641705B (en) * 2021-08-16 2024-04-26 神州数码融信软件有限公司 Marketing disposal rule engine method based on calculation engine

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110128379A1 (en) * 2009-11-30 2011-06-02 Dah-Jye Lee Real-time optical flow sensor design and its application to obstacle detection
CN103761309A (en) * 2014-01-23 2014-04-30 中国移动(深圳)有限公司 Operation data processing method and system
CN104008007A (en) * 2014-06-12 2014-08-27 深圳先进技术研究院 Interoperability data processing system and method based on streaming calculation and batch processing calculation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
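梁国蓉: "Design and Implementation of a Dataflow-Based Big Data Query Engine System" (一个基于Dataflow的大数据Query Engine系统的设计与实现), China Master's Theses Full-text Database, Information Science and Technology *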

Cited By (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106254896A (en) * 2016-08-05 2016-12-21 中国传媒大学 A kind of distributed cryptographic method for real-time video
CN108241722A (en) * 2016-12-23 2018-07-03 北京金山云网络技术有限公司 A kind of data processing system, method and device
CN106873945A (en) * 2016-12-29 2017-06-20 中山大学 Data processing architecture and data processing method based on batch processing and Stream Processing
CN107016128A (en) * 2017-05-16 2017-08-04 郑州云海信息技术有限公司 A kind of data processing method and device
CN107341084A (en) * 2017-05-16 2017-11-10 阿里巴巴集团控股有限公司 A kind of method and device of data processing
CN107341084B (en) * 2017-05-16 2021-07-06 创新先进技术有限公司 Data processing method and device
CN109192248A (en) * 2017-07-21 2019-01-11 上海桑格信息技术有限公司 Biological information analysis system, method and cloud computing platform system based on cloud platform
CN107391719A (en) * 2017-07-31 2017-11-24 南京邮电大学 Distributed stream data processing method and system in a kind of cloud environment
CN109598348A (en) * 2017-09-28 2019-04-09 北京猎户星空科技有限公司 A kind of image pattern obtains, model training method and system
CN108491507A (en) * 2018-03-22 2018-09-04 北京交通大学 A kind of parallel continuous Query method of uncertain traffic flow data based on Hadoop distributed environments
CN108920206A (en) * 2018-06-13 2018-11-30 北京交通大学 A kind of plug-in unit dispatching method and device
CN108806797A (en) * 2018-06-27 2018-11-13 思派(北京)网络科技有限公司 A kind of processing method and system of medical data
CN108984279A (en) * 2018-07-02 2018-12-11 山东汇贸电子口岸有限公司 A kind of streaming computing method of internet of things oriented tradition SQL developer
CN109375912A (en) * 2018-10-18 2019-02-22 腾讯科技(北京)有限公司 Model sequence method, apparatus and storage medium
CN109375912B (en) * 2018-10-18 2021-09-21 腾讯科技(北京)有限公司 Model serialization method, device and storage medium
CN109828751A (en) * 2019-02-15 2019-05-31 福州大学 Integrated machine learning algorithm library and unified programming framework
WO2020168901A1 (en) * 2019-02-19 2020-08-27 阿里巴巴集团控股有限公司 Data calculation method and engine
TWI723535B (en) * 2019-02-19 2021-04-01 開曼群島商創新先進技術有限公司 Data calculation method and engine
CN111611221A (en) * 2019-02-26 2020-09-01 北京京东尚科信息技术有限公司 Hybrid computing system, data processing method and device
CN109918391B (en) * 2019-03-12 2020-09-22 威讯柏睿数据科技(北京)有限公司 Streaming transaction processing method and system
CN109918391A (en) * 2019-03-12 2019-06-21 威讯柏睿数据科技(北京)有限公司 A kind of streaming transaction methods and system
CN110532283A (en) * 2019-09-03 2019-12-03 衢州学院 A kind of smart city big data processing system based on Hadoop aggregated structure
CN112416537A (en) * 2020-12-15 2021-02-26 东北大学 Unified expression API calling system and calling method in Gaia system
CN112597200A (en) * 2020-12-22 2021-04-02 南京三眼精灵信息技术有限公司 Batch and streaming combined data processing method and device
CN112597200B (en) * 2020-12-22 2024-01-12 南京三眼精灵信息技术有限公司 Batch and stream combined data processing method and device
CN112800091A (en) * 2021-01-26 2021-05-14 北京明略软件系统有限公司 Flow-batch integrated calculation control system and method
CN113297212A (en) * 2021-04-28 2021-08-24 上海淇玥信息技术有限公司 Spark query method and device based on materialized view and electronic equipment
CN113641705A (en) * 2021-08-16 2021-11-12 神州数码融信软件有限公司 Marketing disposal rule engine method based on calculation engine
CN113641705B (en) * 2021-08-16 2024-04-26 神州数码融信软件有限公司 Marketing disposal rule engine method based on calculation engine
CN113807710A (en) * 2021-09-22 2021-12-17 四川新网银行股份有限公司 Method for sectionally paralleling and dynamically scheduling system batch tasks and storage medium
CN113807710B (en) * 2021-09-22 2023-06-20 四川新网银行股份有限公司 System batch task segmentation parallel and dynamic scheduling method and storage medium
CN115599524A (en) * 2022-10-27 2023-01-13 中国兵器工业计算机应用技术研究所(Cn) Data lake system based on cooperative scheduling processing of streaming data and batch data
CN116383238A (en) * 2023-06-06 2023-07-04 湖南红普创新科技发展有限公司 Data virtualization system, method, device, equipment and medium based on graph structure
CN116383238B (en) * 2023-06-06 2023-08-29 湖南红普创新科技发展有限公司 Data virtualization system, method, device, equipment and medium based on graph structure

Similar Documents

Publication Publication Date Title
CN105677752A (en) Streaming computing and batch computing combined processing system and method
CN111344693B (en) Aggregation in dynamic and distributed computing systems
CN102609451B (en) SQL (structured query language) query plan generation method oriented to streaming data processing
CN107391719A (en) Distributed stream data processing method and system in a kind of cloud environment
US9146959B2 (en) Database query in a share-nothing database architecture
JP6050272B2 (en) Low latency query engine for APACHE HADOOP
CN109189589A (en) A kind of distribution big data computing engines and framework method
CN103430144A (en) Data source analytics
JP2014194769A6 (en) Low latency query engine for APACHE HADOOP
CN104050261A (en) Stormed-based variable logic general data processing system and method
CN104268428A (en) Visual configuration method for index calculation
CN103425762A (en) Telecom operator mass data processing method based on Hadoop platform
CN108021809A (en) A kind of data processing method and system
CN109799976B (en) Real-time wind control variable calculation method based on distributed stream type calculation engine
CN106951552A (en) A kind of user behavior data processing method based on Hadoop
CN107480202B (en) Data processing method and device for multiple parallel processing frameworks
CN108108466A (en) A kind of distributed system journal query analysis method and device
CN103646051A (en) Big-data parallel processing system and method based on column storage
CN103699656A (en) GPU-based mass-multimedia-data-oriented MapReduce platform
CN109063017A (en) A kind of data persistence location mode of cloud computing platform
CN102193958A (en) Method for implementing spatial decision support system based on Internet
CN113568938A (en) Data stream processing method and device, electronic equipment and storage medium
Theeten et al. Chive: Bandwidth optimized continuous querying in distributed clouds
CN117009038B (en) Graph computing platform based on cloud native technology
CN104299170B (en) Intermittent energy source mass data processing method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160615