CN105677752A - Streaming computing and batch computing combined processing system and method - Google Patents
- Publication number
- CN105677752A CN201511019708.2A
- Authority
- CN
- China
- Prior art keywords
- data
- processing
- batch
- layer
- batch processing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/3005—Arrangements for executing specific machine instructions to perform operations for flow control
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/9032—Query formulation
- G06F16/90332—Natural language query formulation or dialogue systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
Abstract
The invention provides a processing system and method that combine stream computing with batch computing. The system comprises: an infrastructure layer, which provides the hardware environment in which the system runs, including virtualization, machine room, network and cluster; a data storage management layer, which stores data in a distributed manner, saves the processed real-time data and batch data, and manages metadata; a data computing layer, which provides stream computing and batch computing; a task scheduling layer, which schedules system tasks, determines the type of each system job, and controls the data computing layer to perform real-time processing or batch processing on the metadata according to the determination result; a data analysis layer, which analyzes the processed data; a data view and query optimization layer, which provides a data speed view, a data batch view and query optimization; and a data presentation layer, which provides visual information about the data analysis results. The invention both effectively improves data processing efficiency and optimizes the data query method.
Description
Technical field
The invention belongs to the technical field of data processing with stream computing and batch computing, and in particular relates to a processing system and method that combine stream computing with batch computing.
Background art
With the rapid development of science, technology and engineering, many fields have produced massive amounts of data over the past 20 years, and the growth of big data has attracted widespread attention. Big data processing falls into two main modes, batch processing and real-time processing. In batch processing, data is first stored and then analyzed. Real-time processing is a dynamic mode in which data is computed as it flows in; stream computing is an important model derived from real-time processing.
At present, some systems focus on stream computing alone or batch computing alone and process data with a single computing model. However, as data volumes grow and user needs diversify, the demands placed on data processing keep rising, and a single computing model can no longer carry the service on its own. Other systems focus on combining stream processing with batch processing but fail to merge them effectively. Existing big data analysis systems fuse stream computing and batch computing mainly in three ways:
The first way: add support for batch computing on top of a stream computing system. In this approach the batch layer only has to consider the data and the query functions over the data, so the batch layer is easy to control. The real-time layer has to use incremental algorithms and a complex NoSQL database; isolating all of the complexity in the real-time layer significantly improves the robustness and reliability of the system. In practice, however, establishing simple and unified data query functions is not easy. Earlier relational database systems were data processing systems built on a complete relational model, so it is hard to find such a simple functional model for handling the different kinds of structured and unstructured data.
The second way: start from batch computing and combine it with stream data processing, for example by modifying the MapReduce programming model to process streams in real time. Stream processing based on MapReduce has several shortcomings: a) the input data is cut at intervals into fixed-size fragments that are then processed by the MapReduce platform; the processing latency is proportional to the length of a data fragment and to the overhead of initializing the processing task, dependency management between fragments is complicated, and the optimal fragment size depends on the specific application; b) to support stream processing, MapReduce has to be converted into a pipeline mode instead of having Reduce output directly, and, to improve efficiency, intermediate results are kept only in memory. These changes greatly increase the complexity of the original MapReduce framework and hinder maintenance and extension of the system; c) users are forced to define streaming jobs through the MapReduce interface, which reduces the scalability of user programs.
The third way: a hybrid mode. Take Twitter Summingbird as an example: Summingbird integrates platforms through a unified programming interface and offers good generality and strong extensibility, but its execution efficiency in actual operation is not ideal.
Summary of the invention
To address the above problems in the prior art, the present invention provides a processing system that combines stream computing with batch computing, with high data processing efficiency and an optimized data query method.
An embodiment of the present invention provides a processing system that combines stream computing with batch computing, comprising:
An infrastructure layer, for providing the hardware environment in which the system runs, including virtualization, machine room, network and cluster;
A data storage management layer, for storing data in a distributed manner, saving the processed real-time data and batch data, and managing metadata;
A data computing layer, for providing stream computing and batch computing;
A task scheduling layer, for scheduling system tasks, determining the type of each system job, and controlling the data computing layer to perform real-time processing or batch processing on the metadata according to the determination result;
A data analysis layer, for analyzing the processed data;
A data view and query optimization layer, for providing a real-time data view, a batch data view and query optimization;
A data presentation layer, for providing visual information about the data analysis results.
Preferably, the data computing layer comprises:
A stream computing framework, for processing the streaming data when the task scheduling layer determines that the current system job is a real-time processing job;
A batch computing framework, for processing the streaming data when the task scheduling layer determines that the current system job is a batch processing job.
Preferably, the data view and query optimization layer comprises:
A real-time view processing module, which uses an incremental algorithm to save the results computed by the stream computing framework in the data storage management layer and forms different real-time processing views;
A batch view module, which saves the results computed by the batch computing framework in the data storage management layer and forms different batch views;
A query optimization module, which lets a user query, by keyword, the stream-processed and batch-processed data before and after a certain period of time or a certain distance.
Preferably, the query optimization module comprises:
A syntax analysis unit, for receiving the character string input by the user, performing syntax analysis, and outputting an abstract syntax tree;
A semantic analysis unit, for performing semantic analysis on the abstract syntax tree and generating a LogicalPlan and a DAG that the system can recognize;
An optimization strategy execution unit, for optimizing the LogicalPlan by filtering out data that is not needed;
A submission and execution unit, for converting the LogicalPlan into RDDs and actually carrying out the data analysis on the Spark cluster.
An embodiment of the present invention also provides a processing method that combines stream computing with batch computing, comprising the following steps:
Building the hardware environment in which the system runs, including virtualization, machine room, network and cluster;
Obtaining a job instruction, scheduling system tasks, determining the type of the system job, and controlling real-time processing or batch processing of the metadata according to the determination result;
Storing and managing the processed real-time data, the batch data and the metadata;
Forming and displaying different real-time data views and batch data views, while providing query optimization for data processing.
Preferably, the step of obtaining a job instruction, scheduling system tasks, determining the type of the system job, and controlling real-time processing or batch processing of the metadata according to the determination result is specifically:
Obtaining the job instruction, scheduling the system job, and at the same time determining the type of the job;
If it is a real-time processing job, using stream processing technology to process the data in real time: the stream computation is decomposed into a series of short batch jobs, that is, the input data is divided into segments according to the batch size, each segment is converted into an RDD in Spark, Spark Streaming then turns each Transformation operation on the DStream into a Transformation operation on the RDDs in Spark, and the intermediate results produced by operating on the RDDs are kept in memory;
If it is a batch processing job, using a batch processing system: the distributed data set is abstracted as a resilient distributed dataset (RDD); the system implements application task scheduling, RPC, serialization and compression, and provides APIs for the upper-layer components that run on it; the data is partitioned in the distributed environment, the job is then converted into a directed acyclic graph (DAG), and the DAG is scheduled and its tasks are processed in parallel across the cluster stage by stage.
Preferably, the step of forming and displaying different real-time data views and batch data views specifically comprises:
Using an incremental algorithm to store the results of the real-time processing and form different real-time processing views; and storing the results of the batch computation and forming different batch views.
Preferably, the query optimization for data processing is provided as follows: a user can query, by keyword, the stream-processed and batch-processed data before and after a certain period of time or a certain distance.
Preferably, the method further comprises the following steps:
Syntax analysis: receiving the character string input by the user, performing syntax analysis, and outputting an abstract syntax tree;
Semantic analysis: performing semantic analysis on the abstract syntax tree and generating a LogicalPlan and a DAG that the system can recognize;
Optimization strategy execution: optimizing the LogicalPlan by filtering out data that is not needed;
Submission and execution: converting the LogicalPlan into RDDs and actually carrying out the data analysis on the Spark cluster.
Preferably, the syntax analysis step is specifically:
Converting the input character set into words according to the defined lexical rules;
On the basis of the lexical analysis, judging whether the words input by the user conform to the grammar;
Outputting an abstract syntax tree from the analysis process.
In the above technical solution, the system job is judged when system tasks are scheduled, and real-time processing or batch processing of the metadata is controlled according to the determination result. This achieves interoperability between the different computing models at the architectural level, effectively improves the efficiency of fault-tolerant processing, and optimizes the data query method, which makes joint analysis of historical data and real-time data more convenient.
Brief description of the drawings
Fig. 1 is a system architecture diagram of the processing system combining stream computing with batch computing according to an embodiment of the present invention.
Fig. 2 is a schematic structural diagram of the data computing layer according to an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of the data view and query optimization layer according to an embodiment of the present invention.
Fig. 4 is a schematic structural diagram of the query optimization module according to an embodiment of the present invention.
Fig. 5 is a flowchart of query syntax analysis according to an embodiment of the present invention.
Detailed description of the embodiments
In order to make the technical problem solved by the invention, the technical solution and the beneficial effects clearer, the present invention is further described below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the present invention and are not intended to limit it.
As shown in Fig. 1, an embodiment of the present invention provides a processing system that combines stream computing with batch computing, comprising:
Infrastructure layer 10, for providing the hardware environment in which the system runs, specifically the construction of the system infrastructure, including virtualization, machine room, network, cluster and the like;
Data storage management layer 20, for storing data in a distributed manner, saving the processed real-time data and batch data, and managing metadata; the distributed data storage includes but is not limited to HDFS, a distributed MySQL data warehouse, and Cassandra.
Data computing layer 30, for providing stream computing and batch computing, that is, the stream computing and batch computing frameworks and models.
Task scheduling layer 40, for scheduling system tasks, determining the type of each system job, and controlling the data computing layer to perform real-time processing or batch processing on the metadata according to the determination result; the ways of scheduling system jobs include but are not limited to FIFO and FAIR.
Data analysis layer 50, for analyzing the processed data, including but not limited to data mining, machine learning and deep learning.
Data view and query optimization layer 60, for providing a real-time data view, a batch data view and query optimization;
Data presentation layer 70, for providing visual information about the data analysis results.
Further, as shown in Fig. 2, the data computing layer 30 comprises:
Stream computing framework 301, for processing the streaming data when the task scheduling layer determines that the current system job is a real-time processing job;
Specifically, stream processing technology (such as Spark Streaming) is used to process the data in real time. The stream computation is decomposed into a series of short batch jobs: the input data is divided into segments according to the batch size, and each segment is converted into an RDD in Spark. Spark Streaming then turns each Transformation operation on the DStream into a Transformation operation on the RDDs in Spark, and the intermediate results produced by operating on the RDDs are kept in memory.
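The decomposition of a stream into short batch jobs can be illustrated with a minimal pure-Python sketch. It does not use Spark: plain lists stand in for RDDs, and the names `split_into_batches` and `run_micro_batches` are hypothetical. It also assumes the batch size is a record count, whereas Spark Streaming itself cuts batches by a time interval.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def split_into_batches(records: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Cut an input stream into consecutive segments of at most batch_size records."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    segment: List[T] = []
    for record in records:
        segment.append(record)
        if len(segment) == batch_size:
            yield segment
            segment = []
    if segment:  # emit the final, possibly shorter, segment
        yield segment


def run_micro_batches(
    records: Iterable[T], batch_size: int, transform: Callable[[List[T]], R]
) -> List[R]:
    """Apply the same transformation to every segment; results stay in memory."""
    return [transform(segment) for segment in split_into_batches(records, batch_size)]


# Seven records with a batch size of three give segments [1,2,3], [4,5,6], [7].
segment_sums = run_micro_batches(range(1, 8), batch_size=3, transform=sum)
print(segment_sums)  # [6, 15, 7]
```

Each segment is handled by the same transformation, which is the property that lets one batch engine serve both modes.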
Batch computing framework 302, for processing the streaming data when the task scheduling layer determines that the current system job is a batch processing job;
Specifically, a batch processing system (such as Spark) is used for batch processing. It abstracts the distributed data set as a resilient distributed dataset (RDD), implements application task scheduling, RPC, serialization and compression, and provides APIs for the upper-layer components that run on it. Spark partitions the data in the distributed environment, then converts the job into a directed acyclic graph (DAG), and schedules the DAG and processes its tasks in parallel across the cluster stage by stage.
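Stage-by-stage scheduling of a DAG can be sketched as follows. This is a simplified illustration, not Spark's scheduler: Spark draws its stage boundaries at shuffle dependencies, while this sketch simply groups nodes whose parents have all finished. The function name `schedule_in_stages` and the example job are hypothetical.

```python
from typing import Dict, List, Set


def schedule_in_stages(dependencies: Dict[str, Set[str]]) -> List[List[str]]:
    """Group DAG nodes into stages; a node runs only after all of its parents.

    Every parent must also appear as a key of `dependencies`.
    """
    remaining = {node: set(parents) for node, parents in dependencies.items()}
    stages: List[List[str]] = []
    while remaining:
        # Nodes with no unfinished parents can run in parallel in this stage.
        ready = sorted(node for node, parents in remaining.items() if not parents)
        if not ready:
            raise ValueError("dependency graph contains a cycle")
        stages.append(ready)
        for node in ready:
            del remaining[node]
        for parents in remaining.values():
            parents.difference_update(ready)
    return stages


job = {
    "read_logs": set(),
    "read_users": set(),
    "join": {"read_logs", "read_users"},
    "aggregate": {"join"},
}
stages = schedule_in_stages(job)
print(stages)  # [['read_logs', 'read_users'], ['join'], ['aggregate']]
```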
When system jobs are scheduled as described above, each job is judged: if it is a real-time processing job, the stream computing framework 301 is called to perform the computation; if it is a batch processing job, the batch computing framework 302 is called.
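The judge-then-dispatch behaviour of the task scheduling layer might look like the sketch below, using FIFO order, one of the scheduling modes the text names. All identifiers (`JobKind`, `run_stream_framework`, `run_batch_framework`, `run_fifo`) are hypothetical stand-ins; the two framework functions only report what they were given.

```python
from collections import deque
from enum import Enum
from typing import Callable, Deque, Dict, List, Tuple


class JobKind(Enum):
    REAL_TIME = "real_time"
    BATCH = "batch"


def run_stream_framework(records: List[int]) -> str:
    # Stand-in for stream computing framework 301.
    return f"stream framework handled {len(records)} records"


def run_batch_framework(records: List[int]) -> str:
    # Stand-in for batch computing framework 302.
    return f"batch framework handled {len(records)} records"


FRAMEWORKS: Dict[JobKind, Callable[[List[int]], str]] = {
    JobKind.REAL_TIME: run_stream_framework,
    JobKind.BATCH: run_batch_framework,
}


def run_fifo(jobs: Deque[Tuple[JobKind, List[int]]]) -> List[str]:
    """Take jobs in arrival order, judge each one, and call the matching framework."""
    results = []
    while jobs:
        kind, records = jobs.popleft()
        results.append(FRAMEWORKS[kind](records))
    return results


pending = deque([(JobKind.BATCH, [1, 2, 3]), (JobKind.REAL_TIME, [4])])
results = run_fifo(pending)
print(results)
```

A lookup table keyed by job kind keeps the judgment in one place, so a further computing model could be added without touching the scheduling loop.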
Further, as shown in Fig. 3, the data view and query optimization layer 60 comprises:
Real-time view module 601, which provides real-time processing views. Specifically, an incremental algorithm is used to save the results produced by the stream computing framework 301 in the data storage management layer 20, forming different real-time processing views (speed views). When the data set recomputed by batch processing includes data that has been processed in real time, the corresponding data is deleted from the current real-time view.
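A minimal sketch of such a real-time view is given below. It assumes, purely for illustration, that each event carries an integer timestamp and that the batch layer reports how far it has recomputed as a single watermark; neither detail comes from the text. The class name `SpeedView` and its methods are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Tuple


class SpeedView:
    """In-memory real-time view holding per-key counts bucketed by timestamp."""

    def __init__(self) -> None:
        self._counts: Dict[Tuple[int, str], int] = defaultdict(int)

    def apply(self, timestamp: int, key: str, amount: int = 1) -> None:
        """Incrementally fold one new event into the view (no full recomputation)."""
        self._counts[(timestamp, key)] += amount

    def purge_covered(self, batch_watermark: int) -> None:
        """Delete entries that the batch recomputation already includes."""
        covered = [entry for entry in self._counts if entry[0] <= batch_watermark]
        for entry in covered:
            del self._counts[entry]

    def total(self, key: str) -> int:
        return sum(value for (_, k), value in self._counts.items() if k == key)


view = SpeedView()
view.apply(1, "clicks")
view.apply(2, "clicks")
view.apply(3, "clicks", amount=2)
before = view.total("clicks")   # 4: all three events are only in the speed view
view.purge_covered(batch_watermark=2)
after = view.total("clicks")    # 2: events at timestamps 1 and 2 now live in the batch view
print(before, after)
```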
Batch view module 602, which provides batch views. Specifically, the results produced by the batch computing framework 302 are saved in the data storage management layer 20, forming different batch views.
Query optimization module 603, which provides semantically richer query operations. Specifically, a data query optimization strategy is proposed on top of Spark SQL to address the slow query speed and limited semantics of the native Spark SQL API. Compared with traditional SQL, new keywords are added for the specific application scenario, such as the keyword "every", which lets a user query the stream-processed and batch-processed data after a certain period of time or a certain distance. In addition, the present invention optimizes for the specific application scenario while parsing the SQL, which reduces the amount of data traversed and significantly increases query speed.
Preferably, as shown in Fig. 4, the query optimization module 603 comprises:
Syntax analysis unit 6031, for receiving the character string input by the user, performing syntax analysis, and outputting an abstract syntax tree;
Semantic analysis unit 6032, for performing semantic analysis on the abstract syntax tree and generating a LogicalPlan and a DAG that the system can recognize.
Optimization strategy execution unit 6033, for optimizing the LogicalPlan by filtering out data that is not needed;
Submission and execution unit 6034, for converting the LogicalPlan into RDDs and actually carrying out the data analysis on the Spark cluster.
An embodiment of the present invention also provides a processing method that combines stream computing with batch computing, characterized by comprising the following steps:
Building the hardware environment in which the system runs, including virtualization, machine room, network and cluster;
Obtaining a job instruction, scheduling system tasks, determining the type of the system job, and controlling real-time processing or batch processing of the metadata according to the determination result;
Storing and managing the processed real-time data, the batch data and the metadata;
Forming and displaying different real-time data views and batch data views, while providing query optimization for data processing.
Further, the step of obtaining a job instruction, scheduling system tasks, determining the type of the system job, and controlling real-time processing or batch processing of the metadata according to the determination result is specifically:
Obtaining the job instruction, scheduling the system job, and at the same time determining the type of the job;
If it is a real-time processing job, using stream processing technology to process the data in real time: the stream computation is decomposed into a series of short batch jobs, that is, the input data is divided into segments according to the batch size, each segment is converted into an RDD in Spark, Spark Streaming then turns each Transformation operation on the DStream into a Transformation operation on the RDDs in Spark, and the intermediate results produced by operating on the RDDs are kept in memory;
If it is a batch processing job, using a batch processing system: the distributed data set is abstracted as a resilient distributed dataset (RDD); the system implements application task scheduling, RPC, serialization and compression, and provides APIs for the upper-layer components that run on it; the data is partitioned in the distributed environment, the job is then converted into a directed acyclic graph (DAG), and the DAG is scheduled and its tasks are processed in parallel across the cluster stage by stage.
Further, the step of forming and displaying different real-time data views and batch data views specifically comprises:
Using an incremental algorithm to store the results of the real-time processing and form different real-time processing views; and storing the results of the batch computation and forming different batch views.
Further, the query optimization for data processing is provided as follows: a user can query, by keyword, the stream-processed and batch-processed data before and after a certain period of time or a certain distance.
Further, the query optimization flow specifically comprises the following steps:
Syntax analysis: receiving the character string input by the user, performing syntax analysis, and outputting an abstract syntax tree;
With reference to Fig. 5, this step preferably includes:
S60311, lexical analysis. Specifically, the input character set is converted into "words" according to the defined lexical rules. For example, for the input select foo+100 from books, the lexical analyzer outputs a "sentence" made up of "words":
(Keyword:SELECT) (Identifier:foo) (Keyword:+) (Number:100) (Keyword:FROM) (Identifier:books)
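The lexical analysis step can be sketched with a small regular-expression tokenizer. This is an illustrative stand-in, not the lexer used by Spark SQL or by the patent; the keyword set, the token pattern and the function name `tokenize` are assumptions. Following the example above, operators such as "+" are labeled as keywords, and "every" is included in the keyword set only to show where a new keyword would be registered.

```python
import re
from typing import List, Tuple

# Hypothetical keyword set; "EVERY" stands for a newly added keyword.
KEYWORDS = {"SELECT", "FROM", "EVERY"}
TOKEN_PATTERN = re.compile(r"\s*(?:(\d+)|([A-Za-z_]\w*)|([+\-*/(),]))")


def tokenize(text: str) -> List[Tuple[str, str]]:
    """Convert an input string into (kind, value) tokens."""
    tokens: List[Tuple[str, str]] = []
    text = text.rstrip()
    position = 0
    while position < len(text):
        match = TOKEN_PATTERN.match(text, position)
        if match is None:
            raise ValueError(f"unexpected character near position {position}")
        number, word, symbol = match.groups()
        if number is not None:
            tokens.append(("Number", number))
        elif word is not None:
            if word.upper() in KEYWORDS:
                tokens.append(("Keyword", word.upper()))
            else:
                tokens.append(("Identifier", word))
        else:
            tokens.append(("Keyword", symbol))
        position = match.end()
    return tokens


tokens = tokenize("select foo+100 from books")
print(tokens)
```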
Taking "every" as an example, the present invention adds the new keyword "every" by the following operation
S60312, syntax analysis. Specifically, on the basis of the lexical analysis, syntax analysis judges whether the words input by the user conform to the grammar. SELECT FOO+100 FROM POKES is a grammatical sentence, whereas SELECT FOO+100 FROM is an illegal statement: FROM must be followed by a table name, otherwise the syntax analyzer reports an error.
S60313, outputting the abstract syntax tree. Specifically, the abstract syntax tree is built as syntax analysis proceeds. When syntax analysis finishes normally, the syntax analyzer outputs an abstract syntax tree whose structure corresponds one-to-one to the user's input. At this point the "character string" input by the user has been turned completely into a "structure".
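Steps S60312 and S60313 can be illustrated with a small recursive-descent parser that checks the grammar and builds the tree at the same time. The grammar handled here (SELECT, an expression of columns and numbers joined by "+", FROM, a table name) and every class and function name are assumptions made for the sketch; the tokens are written out by hand as (kind, value) pairs so the block stands on its own.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union

Token = Tuple[str, str]


@dataclass
class Column:
    name: str


@dataclass
class Literal:
    value: int


@dataclass
class Add:
    left: "Expr"
    right: "Expr"


Expr = Union[Column, Literal, Add]


@dataclass
class Select:
    projection: Expr
    table: str


def parse_select(tokens: List[Token]) -> Select:
    """Check the token sequence against the grammar and build the syntax tree."""
    position = 0

    def peek() -> Token:
        return tokens[position] if position < len(tokens) else ("End", "")

    def expect(kind: str, value: Optional[str] = None) -> str:
        nonlocal position
        token_kind, token_value = peek()
        if token_kind != kind or (value is not None and token_value != value):
            raise SyntaxError(
                f"expected {value or kind}, found {token_value or 'end of input'}"
            )
        position += 1
        return token_value

    def parse_term() -> Expr:
        if peek()[0] == "Number":
            return Literal(int(expect("Number")))
        return Column(expect("Identifier"))

    def parse_expression() -> Expr:
        node = parse_term()
        while peek() == ("Keyword", "+"):
            expect("Keyword", "+")
            node = Add(node, parse_term())
        return node

    expect("Keyword", "SELECT")
    projection = parse_expression()
    expect("Keyword", "FROM")
    table = expect("Identifier")  # FROM must be followed by a table name
    expect("End")
    return Select(projection, table)


valid = [("Keyword", "SELECT"), ("Identifier", "FOO"), ("Keyword", "+"),
         ("Number", "100"), ("Keyword", "FROM"), ("Identifier", "POKES")]
tree = parse_select(valid)
print(tree)
```

Passing the same tokens without the final table name raises `SyntaxError`, which mirrors the illegal statement described above.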
Semantic analysis: performing semantic analysis on the abstract syntax tree and generating a LogicalPlan and a DAG that the system can recognize;
Specifically, semantic analysis outputs a query plan from the abstract syntax tree of the previous stage, and generally comprises a logical analysis stage and a physical analysis stage. Logical analysis is essentially a purely algebraic process: it works out what the input SQL statement is meant to do and which operations it contains. In general, a SQL statement has one input and one output, and the input data becomes the output data after being processed by the SQL. The physical query plan is generated from the logical query plan produced earlier; during the conversion it is adapted to the Spark programming framework, producing a DAG that the system can recognize.
Optimization strategy execution: optimizing the LogicalPlan by filtering out data that is not needed;
Specifically, the optimization step optimizes the LogicalPlan using the following rule:
FilterPushdown (filter push-down): filtering out unneeded elements as early as possible generally reduces overhead. The present invention modifies FilterPushdown so that data not needed in the specific scenario is filtered out.
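The general idea of filter push-down can be shown on a toy logical plan. The sketch below is not Spark's optimizer and not the scenario-specific rule of the invention, which the text does not spell out; the plan node classes and `push_down_filters` are hypothetical. It rewrites a filter sitting above a projection so that the filter runs first, directly on the scanned data.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Project:
    columns: Tuple[str, ...]
    child: "Plan"


@dataclass(frozen=True)
class Filter:
    column: str
    minimum: int  # keeps rows where column >= minimum
    child: "Plan"


Plan = Union[Scan, Project, Filter]


def push_down_filters(plan: Plan) -> Plan:
    """Move each filter as close to the scan as the plan shape allows."""
    if isinstance(plan, Scan):
        return plan
    if isinstance(plan, Project):
        return Project(plan.columns, push_down_filters(plan.child))
    child = push_down_filters(plan.child)
    if isinstance(child, Project) and plan.column in child.columns:
        # Filter(Project(x)) becomes Project(Filter(x)); recurse so it keeps sinking.
        pushed = push_down_filters(Filter(plan.column, plan.minimum, child.child))
        return Project(child.columns, pushed)
    return Filter(plan.column, plan.minimum, child)


original = Filter("price", 10, Project(("title", "price"), Scan("books")))
optimized = push_down_filters(original)
print(optimized)
```

After the rewrite the projection only sees rows that already passed the filter, so less data flows through the upper operators.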
Submission and execution: converting the LogicalPlan into RDDs and actually carrying out the data analysis on the Spark cluster.
Specifically, the LogicalPlan is converted into RDDs and the data analysis is actually carried out on the Spark cluster. The conversion from LogicalPlan to RDD introduces SparkPlan, whose main task is to generate RDDs.
The system and method combining stream computing with batch computing proposed in the above embodiment are a framework built on Spark for processing streaming data. The basic principle is to divide the streaming data into small time slices (a few seconds each) and to process these small portions of data in a batch-like manner. Spark Streaming is built on Spark for two reasons: on the one hand, Spark's low-latency execution engine can be used for real-time computation; on the other hand, compared with record-based processing frameworks (such as Storm), RDD data sets make efficient fault-tolerant processing easier. In addition, the solution of the present invention optimizes the data query method, which makes joint analysis of historical data and real-time data more convenient.
The foregoing are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (10)
1. A processing system combining stream computing with batch computing, characterized by comprising:
An infrastructure layer, for providing the hardware environment in which the system runs, including virtualization, machine room, network and cluster;
A data storage management layer, for storing data in a distributed manner, saving the processed real-time data and batch data, and managing metadata;
A data computing layer, for providing stream computing and batch computing;
A task scheduling layer, for scheduling system tasks, determining the type of each system job, and controlling the data computing layer to perform real-time processing or batch processing on the metadata according to the determination result;
A data analysis layer, for analyzing the processed data;
A data view and query optimization layer, for providing a real-time data view, a batch data view and query optimization;
A data presentation layer, for providing visual information about the data analysis results.
2. The processing system combining stream computing with batch computing according to claim 1, characterized in that the data computing layer comprises:
A stream computing framework, for processing the streaming data when the task scheduling layer determines that the current system job is a real-time processing job;
A batch computing framework, for processing the streaming data when the task scheduling layer determines that the current system job is a batch processing job.
3. The processing system combining stream computing with batch computing according to claim 2, characterized in that the data view and query optimization layer comprises:
A real-time view processing module, which uses an incremental algorithm to save the results computed by the stream computing framework in the data storage management layer and forms different real-time processing views;
A batch view module, which saves the results computed by the batch computing framework in the data storage management layer and forms different batch views;
A query optimization module, which lets a user query, by keyword, the stream-processed and batch-processed data before and after a certain period of time or a certain distance.
4. The processing system combining stream computing with batch computing according to claim 3, characterized in that the query optimization module comprises:
A syntax analysis unit, for receiving the character string input by the user, performing syntax analysis, and outputting an abstract syntax tree;
A semantic analysis unit, for performing semantic analysis on the abstract syntax tree and generating a LogicalPlan and a DAG that the system can recognize;
An optimization strategy execution unit, for optimizing the LogicalPlan by filtering out data that is not needed;
A submission and execution unit, for converting the LogicalPlan into RDDs and actually carrying out the data analysis on the Spark cluster.
5. A processing method combining stream computing with batch computing, characterized by comprising the following steps:
Building the hardware environment in which the system runs, including virtualization, machine room, network and cluster;
Obtaining a job instruction, scheduling system tasks, determining the type of the system job, and controlling real-time processing or batch processing of the metadata according to the determination result;
Storing and managing the processed real-time data, the batch data and the metadata;
Forming and displaying different real-time data views and batch data views, while providing query optimization for data processing.
6. The method according to claim 5, characterized in that the step of obtaining a work order, scheduling the system tasks, judging the system job, and controlling, according to the judgment result, whether real-time processing or batch processing is performed on the metadata is specifically:
Obtaining the work order and scheduling the system job while, at the same time, judging the job;
If it is a real-time processing job, stream processing technology is used to process the data in real time: the streaming computation is decomposed into a series of short batch jobs, that is, the input data is divided into segments according to the batch size, each segment is converted into an RDD in Spark, Spark Streaming then turns the Transformation operations on the DStream into Transformation operations on RDDs in Spark, and the intermediate results produced by operating on the RDDs are kept in memory;
If it is a batch processing job, the batch processing system is used to perform the batch processing: the distributed data set is abstracted as a resilient distributed dataset (RDD); application task scheduling, RPC, serialization and compression are implemented, and an API is provided for the upper-layer components that run on top of it; the data is partitioned in the distributed environment, the job is then converted into a directed acyclic graph (DAG), and the DAG is scheduled in stages and its tasks are processed in a distributed, parallel manner.
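The two ideas in claim 6, cutting a stream into small batches by batch size and running a DAG stage by stage, can be sketched in plain Python. This deliberately does not use the Spark or Spark Streaming APIs: lists stand in for RDDs, and the function names `micro_batches`, `run_stream` and `stages` are hypothetical. `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List, Set


def micro_batches(stream: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Split an input stream into consecutive segments of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def run_stream(
    stream: Iterable[int],
    batch_size: int,
    transform: Callable[[List[int]], List[int]],
) -> List[List[int]]:
    """Apply the same transformation to every small batch and keep the
    intermediate results in memory (a plain list stands in for an RDD)."""
    return [transform(batch) for batch in micro_batches(stream, batch_size)]


def stages(dag: Dict[str, Set[str]]) -> List[List[str]]:
    """Group the tasks of a DAG into stages. `dag` maps each task to the set
    of tasks it depends on; tasks in one stage have no dependencies on each
    other and could run in parallel."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()
    result: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        result.append(ready)
        sorter.done(*ready)
    return result


if __name__ == "__main__":
    print(run_stream(range(1, 8), 3, lambda b: [x * 2 for x in b]))
    print(stages({"join": {"map_a", "map_b"}, "map_a": {"read"},
                  "map_b": {"read"}, "save": {"join"}}))
```

With a batch size of 3, the seven inputs fall into three segments, the last one shorter than the others. In the DAG example, `map_a` and `map_b` land in the same stage because both depend only on `read`.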
7. The method according to claim 5, characterized in that the step of forming and displaying different real-time data processing views and batch data views specifically comprises:
Using an incremental (delta) algorithm to store the results of the real-time computation and form different real-time processing views; and storing the results of the batch computation to form different batch views.
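The difference between an incrementally maintained real-time view and a batch view can be sketched as follows. The claim does not specify the incremental algorithm, so this assumes the simplest case, a per-key count updated with each new chunk of events; `RealtimeView` and `build_batch_view` are hypothetical names.

```python
from collections import Counter
from typing import Iterable


class RealtimeView:
    """Real-time view kept up to date by applying only the newly arrived events."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def apply_delta(self, events: Iterable[str]) -> None:
        # Only the increment is processed; earlier events are not re-read.
        self.counts.update(events)


def build_batch_view(all_events: Iterable[str]) -> Counter:
    """Batch view computed in one pass over the complete data set."""
    return Counter(all_events)
```

For an aggregate such as a count, applying the deltas chunk by chunk yields the same view as recomputing over all events, which is what makes the two views comparable.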
8. The method according to claim 5, characterized in that the query optimization for the data processing is provided as follows: the user can use a keyword to query the stream-processed and batch-processed data from a period of time, or a certain range, before and after.
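One possible reading of this keyword query is sketched below: find the records that contain the keyword, then return every stream or batch record whose timestamp lies within a window around one of those hits. This interpretation, the `Record` type and the `query_around` function are assumptions, not details given in the claim.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Record:
    ts: int        # timestamp in seconds (hypothetical unit)
    source: str    # "stream" or "batch"
    text: str


def query_around(records: Sequence[Record], keyword: str, window: int) -> List[Record]:
    """Return records from both sources within `window` seconds before or
    after any record whose text contains `keyword`, ordered by timestamp."""
    hits = [r.ts for r in records if keyword in r.text]
    nearby = [r for r in records if any(abs(r.ts - t) <= window for t in hits)]
    return sorted(nearby, key=lambda r: r.ts)
```

The result includes the matching records themselves as well as their neighbours in time, regardless of which processing path produced them.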
9. The method according to claim 8, characterized in that it further comprises the following steps:
Syntax analysis: receiving the character string entered by the user, performing syntax analysis, and outputting an abstract syntax tree;
Lexical analysis: performing lexical analysis on the abstract syntax tree and generating a LogicalPlan and a DAG that the system can recognize;
Optimization strategy execution: optimizing the LogicalPlan so that data that is not needed is filtered out;
Delivery execution: converting the LogicalPlan into RDDs and actually carrying out the data analysis on the Spark cluster.
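The optimization and execution steps of claim 9 can be sketched with a toy logical plan. This is not Spark's LogicalPlan or optimizer: the `Scan`, `Filter` and `Project` operators, the single rewrite rule (move filters directly after the scan so unneeded rows are dropped early), and the in-memory `execute` function are assumptions for illustration. The rule is only valid here because the plan is linear with one scan and every filter column survives the projection.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Filter:
    column: str
    value: object


@dataclass(frozen=True)
class Project:
    columns: Tuple[str, ...]


Op = Union[Scan, Filter, Project]
Row = Dict[str, object]


def optimize(plan: List[Op]) -> List[Op]:
    """Move every Filter directly after the Scan so rows that are not
    needed are discarded before any later operator sees them."""
    scans = [op for op in plan if isinstance(op, Scan)]
    filters = [op for op in plan if isinstance(op, Filter)]
    rest = [op for op in plan if not isinstance(op, (Scan, Filter))]
    return scans + filters + rest


def execute(plan: List[Op], tables: Dict[str, List[Row]]) -> List[Row]:
    """Run the plan over in-memory rows; a real system would instead hand
    the plan to the cluster for distributed execution."""
    rows: List[Row] = []
    for op in plan:
        if isinstance(op, Scan):
            rows = list(tables[op.table])
        elif isinstance(op, Filter):
            rows = [r for r in rows if r.get(op.column) == op.value]
        elif isinstance(op, Project):
            rows = [{c: r[c] for c in op.columns} for r in rows]
    return rows
```

Under the stated assumptions, the optimized plan returns the same rows as the original while projecting fewer of them.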
10. The method according to claim 9, characterized in that the syntax analysis step is specifically:
Converting the input character stream into words (tokens) according to the defined lexical rules;
On the basis of the lexical analysis, judging whether the words entered by the user conform to the grammar;
Outputting an abstract syntax tree according to the analysis process.
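The three sub-steps of claim 10 (tokenize, check against the grammar, emit an abstract syntax tree) can be sketched for a very small query language. The grammar is an assumption made up for this example, not the patent's: `query := comparison ("AND" comparison)*` and `comparison := IDENT "=" (IDENT | NUMBER)`. The tuple-based tree shape is likewise hypothetical.

```python
import re
from typing import List, Tuple

Token = Tuple[str, str]

# Lexical rules: numbers, identifiers and the "=" operator.
TOKEN_RE = re.compile(r"\s*(?:(?P<NUMBER>\d+)|(?P<IDENT>[A-Za-z_]\w*)|(?P<OP>=))")


def tokenize(text: str) -> List[Token]:
    """Convert the input characters into (kind, value) tokens."""
    text = text.rstrip()
    tokens: List[Token] = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected input at position {pos}")
        kind = match.lastgroup
        value = match.group(kind)
        if kind == "IDENT" and value.upper() == "AND":
            kind, value = "AND", "AND"   # treat AND as a keyword
        tokens.append((kind, value))
        pos = match.end()
    return tokens


def parse(tokens: List[Token]):
    """Check the tokens against the grammar and build a tuple-based AST."""
    pos = 0

    def expect(kind: str) -> str:
        nonlocal pos
        if pos >= len(tokens) or tokens[pos][0] != kind:
            found = tokens[pos] if pos < len(tokens) else "end of input"
            raise SyntaxError(f"expected {kind}, found {found}")
        value = tokens[pos][1]
        pos += 1
        return value

    def comparison():
        column = expect("IDENT")
        expect("OP")
        if pos < len(tokens) and tokens[pos][0] == "NUMBER":
            return ("eq", column, int(expect("NUMBER")))
        return ("eq", column, expect("IDENT"))

    node = comparison()
    while pos < len(tokens) and tokens[pos][0] == "AND":
        expect("AND")
        node = ("and", node, comparison())
    if pos != len(tokens):
        raise SyntaxError("unexpected trailing input")
    return node
```

For example, `parse(tokenize("kind = click AND hour = 9"))` yields a tree with an `and` node over two `eq` comparisons, while an incomplete input such as `"kind ="` is rejected with a `SyntaxError`.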
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201511019708.2A CN105677752A (en) | 2015-12-30 | 2015-12-30 | Streaming computing and batch computing combined processing system and method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201511019708.2A CN105677752A (en) | 2015-12-30 | 2015-12-30 | Streaming computing and batch computing combined processing system and method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105677752A true CN105677752A (en) | 2016-06-15 |
Family
ID=56297986
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201511019708.2A Pending CN105677752A (en) | 2015-12-30 | 2015-12-30 | Streaming computing and batch computing combined processing system and method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105677752A (en) |
Cited By (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106254896A (en) * | 2016-08-05 | 2016-12-21 | 中国传媒大学 | A kind of distributed cryptographic method for real-time video |
CN106873945A (en) * | 2016-12-29 | 2017-06-20 | 中山大学 | Data processing architecture and data processing method based on batch processing and Stream Processing |
CN107016128A (en) * | 2017-05-16 | 2017-08-04 | 郑州云海信息技术有限公司 | A kind of data processing method and device |
CN107341084A (en) * | 2017-05-16 | 2017-11-10 | 阿里巴巴集团控股有限公司 | A kind of method and device of data processing |
CN107391719A (en) * | 2017-07-31 | 2017-11-24 | 南京邮电大学 | Distributed stream data processing method and system in a kind of cloud environment |
CN108241722A (en) * | 2016-12-23 | 2018-07-03 | 北京金山云网络技术有限公司 | A kind of data processing system, method and device |
CN108491507A (en) * | 2018-03-22 | 2018-09-04 | 北京交通大学 | A kind of parallel continuous Query method of uncertain traffic flow data based on Hadoop distributed environments |
CN108806797A (en) * | 2018-06-27 | 2018-11-13 | 思派(北京)网络科技有限公司 | A kind of processing method and system of medical data |
CN108920206A (en) * | 2018-06-13 | 2018-11-30 | 北京交通大学 | A kind of plug-in unit dispatching method and device |
CN108984279A (en) * | 2018-07-02 | 2018-12-11 | 山东汇贸电子口岸有限公司 | A kind of streaming computing method of internet of things oriented tradition SQL developer |
CN109192248A (en) * | 2017-07-21 | 2019-01-11 | 上海桑格信息技术有限公司 | Biological information analysis system, method and cloud computing platform system based on cloud platform |
CN109375912A (en) * | 2018-10-18 | 2019-02-22 | 腾讯科技(北京)有限公司 | Model sequence method, apparatus and storage medium |
CN109598348A (en) * | 2017-09-28 | 2019-04-09 | 北京猎户星空科技有限公司 | A kind of image pattern obtains, model training method and system |
CN109828751A (en) * | 2019-02-15 | 2019-05-31 | 福州大学 | Integrated machine learning algorithm library and unified programming framework |
CN109918391A (en) * | 2019-03-12 | 2019-06-21 | 威讯柏睿数据科技(北京)有限公司 | A kind of streaming transaction methods and system |
CN110532283A (en) * | 2019-09-03 | 2019-12-03 | 衢州学院 | A kind of smart city big data processing system based on Hadoop aggregated structure |
WO2020168901A1 (en) * | 2019-02-19 | 2020-08-27 | 阿里巴巴集团控股有限公司 | Data calculation method and engine |
CN111611221A (en) * | 2019-02-26 | 2020-09-01 | 北京京东尚科信息技术有限公司 | Hybrid computing system, data processing method and device |
CN112416537A (en) * | 2020-12-15 | 2021-02-26 | 东北大学 | Unified expression API calling system and calling method in Gaia system |
CN112597200A (en) * | 2020-12-22 | 2021-04-02 | 南京三眼精灵信息技术有限公司 | Batch and streaming combined data processing method and device |
CN112800091A (en) * | 2021-01-26 | 2021-05-14 | 北京明略软件系统有限公司 | Flow-batch integrated calculation control system and method |
CN113297212A (en) * | 2021-04-28 | 2021-08-24 | 上海淇玥信息技术有限公司 | Spark query method and device based on materialized view and electronic equipment |
CN113641705A (en) * | 2021-08-16 | 2021-11-12 | 神州数码融信软件有限公司 | Marketing disposal rule engine method based on calculation engine |
CN113807710A (en) * | 2021-09-22 | 2021-12-17 | 四川新网银行股份有限公司 | Method for sectionally paralleling and dynamically scheduling system batch tasks and storage medium |
CN115599524A (en) * | 2022-10-27 | 2023-01-13 | 中国兵器工业计算机应用技术研究所(Cn) | Data lake system based on cooperative scheduling processing of streaming data and batch data |
CN116383238A (en) * | 2023-06-06 | 2023-07-04 | 湖南红普创新科技发展有限公司 | Data virtualization system, method, device, equipment and medium based on graph structure |
CN113641705B (en) * | 2021-08-16 | 2024-04-26 | 神州数码融信软件有限公司 | Marketing disposal rule engine method based on calculation engine |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110128379A1 (en) * | 2009-11-30 | 2011-06-02 | Dah-Jye Lee | Real-time optical flow sensor design and its application to obstacle detection |
CN103761309A (en) * | 2014-01-23 | 2014-04-30 | 中国移动(深圳)有限公司 | Operation data processing method and system |
CN104008007A (en) * | 2014-06-12 | 2014-08-27 | 深圳先进技术研究院 | Interoperability data processing system and method based on streaming calculation and batch processing calculation |
- 2015-12-30 CN CN201511019708.2A patent/CN105677752A/en active Pending
Non-Patent Citations (1)
Title |
---|
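Liang Guorong (梁国蓉): "Design and Implementation of a Dataflow-Based Big Data Query Engine System", China Master's Theses Full-text Database, Information Science and Technology * |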
Cited By (34)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106254896A (en) * | 2016-08-05 | 2016-12-21 | 中国传媒大学 | A kind of distributed cryptographic method for real-time video |
CN108241722A (en) * | 2016-12-23 | 2018-07-03 | 北京金山云网络技术有限公司 | A kind of data processing system, method and device |
CN106873945A (en) * | 2016-12-29 | 2017-06-20 | 中山大学 | Data processing architecture and data processing method based on batch processing and Stream Processing |
CN107016128A (en) * | 2017-05-16 | 2017-08-04 | 郑州云海信息技术有限公司 | A kind of data processing method and device |
CN107341084A (en) * | 2017-05-16 | 2017-11-10 | 阿里巴巴集团控股有限公司 | A kind of method and device of data processing |
CN107341084B (en) * | 2017-05-16 | 2021-07-06 | 创新先进技术有限公司 | Data processing method and device |
CN109192248A (en) * | 2017-07-21 | 2019-01-11 | 上海桑格信息技术有限公司 | Biological information analysis system, method and cloud computing platform system based on cloud platform |
CN107391719A (en) * | 2017-07-31 | 2017-11-24 | 南京邮电大学 | Distributed stream data processing method and system in a kind of cloud environment |
CN109598348A (en) * | 2017-09-28 | 2019-04-09 | 北京猎户星空科技有限公司 | A kind of image pattern obtains, model training method and system |
CN108491507A (en) * | 2018-03-22 | 2018-09-04 | 北京交通大学 | A kind of parallel continuous Query method of uncertain traffic flow data based on Hadoop distributed environments |
CN108920206A (en) * | 2018-06-13 | 2018-11-30 | 北京交通大学 | A kind of plug-in unit dispatching method and device |
CN108806797A (en) * | 2018-06-27 | 2018-11-13 | 思派(北京)网络科技有限公司 | A kind of processing method and system of medical data |
CN108984279A (en) * | 2018-07-02 | 2018-12-11 | 山东汇贸电子口岸有限公司 | A kind of streaming computing method of internet of things oriented tradition SQL developer |
CN109375912A (en) * | 2018-10-18 | 2019-02-22 | 腾讯科技(北京)有限公司 | Model sequence method, apparatus and storage medium |
CN109375912B (en) * | 2018-10-18 | 2021-09-21 | 腾讯科技(北京)有限公司 | Model serialization method, device and storage medium |
CN109828751A (en) * | 2019-02-15 | 2019-05-31 | 福州大学 | Integrated machine learning algorithm library and unified programming framework |
WO2020168901A1 (en) * | 2019-02-19 | 2020-08-27 | 阿里巴巴集团控股有限公司 | Data calculation method and engine |
TWI723535B (en) * | 2019-02-19 | 2021-04-01 | 開曼群島商創新先進技術有限公司 | Data calculation method and engine |
CN111611221A (en) * | 2019-02-26 | 2020-09-01 | 北京京东尚科信息技术有限公司 | Hybrid computing system, data processing method and device |
CN109918391B (en) * | 2019-03-12 | 2020-09-22 | 威讯柏睿数据科技(北京)有限公司 | Streaming transaction processing method and system |
CN109918391A (en) * | 2019-03-12 | 2019-06-21 | 威讯柏睿数据科技(北京)有限公司 | A kind of streaming transaction methods and system |
CN110532283A (en) * | 2019-09-03 | 2019-12-03 | 衢州学院 | A kind of smart city big data processing system based on Hadoop aggregated structure |
CN112416537A (en) * | 2020-12-15 | 2021-02-26 | 东北大学 | Unified expression API calling system and calling method in Gaia system |
CN112597200A (en) * | 2020-12-22 | 2021-04-02 | 南京三眼精灵信息技术有限公司 | Batch and streaming combined data processing method and device |
CN112597200B (en) * | 2020-12-22 | 2024-01-12 | 南京三眼精灵信息技术有限公司 | Batch and stream combined data processing method and device |
CN112800091A (en) * | 2021-01-26 | 2021-05-14 | 北京明略软件系统有限公司 | Flow-batch integrated calculation control system and method |
CN113297212A (en) * | 2021-04-28 | 2021-08-24 | 上海淇玥信息技术有限公司 | Spark query method and device based on materialized view and electronic equipment |
CN113641705A (en) * | 2021-08-16 | 2021-11-12 | 神州数码融信软件有限公司 | Marketing disposal rule engine method based on calculation engine |
CN113641705B (en) * | 2021-08-16 | 2024-04-26 | 神州数码融信软件有限公司 | Marketing disposal rule engine method based on calculation engine |
CN113807710A (en) * | 2021-09-22 | 2021-12-17 | 四川新网银行股份有限公司 | Method for sectionally paralleling and dynamically scheduling system batch tasks and storage medium |
CN113807710B (en) * | 2021-09-22 | 2023-06-20 | 四川新网银行股份有限公司 | System batch task segmentation parallel and dynamic scheduling method and storage medium |
CN115599524A (en) * | 2022-10-27 | 2023-01-13 | 中国兵器工业计算机应用技术研究所(Cn) | Data lake system based on cooperative scheduling processing of streaming data and batch data |
CN116383238A (en) * | 2023-06-06 | 2023-07-04 | 湖南红普创新科技发展有限公司 | Data virtualization system, method, device, equipment and medium based on graph structure |
CN116383238B (en) * | 2023-06-06 | 2023-08-29 | 湖南红普创新科技发展有限公司 | Data virtualization system, method, device, equipment and medium based on graph structure |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105677752A (en) | Streaming computing and batch computing combined processing system and method | |
CN111344693B (en) | Aggregation in dynamic and distributed computing systems | |
CN102609451B (en) | SQL (structured query language) query plan generation method oriented to streaming data processing | |
CN107391719A (en) | Distributed stream data processing method and system in a kind of cloud environment | |
US9146959B2 (en) | Database query in a share-nothing database architecture | |
JP6050272B2 (en) | Low latency query engine for APACHE HADOOP | |
CN109189589A (en) | A kind of distribution big data computing engines and framework method | |
CN103430144A (en) | Data source analytics | |
JP2014194769A6 (en) | Low latency query engine for APACHE HADOOP | |
CN104050261A (en) | Stormed-based variable logic general data processing system and method | |
CN104268428A (en) | Visual configuration method for index calculation | |
CN103425762A (en) | Telecom operator mass data processing method based on Hadoop platform | |
CN108021809A (en) | A kind of data processing method and system | |
CN109799976B (en) | Real-time wind control variable calculation method based on distributed stream type calculation engine | |
CN106951552A (en) | A kind of user behavior data processing method based on Hadoop | |
CN107480202B (en) | Data processing method and device for multiple parallel processing frameworks | |
CN108108466A (en) | A kind of distributed system journal query analysis method and device | |
CN103646051A (en) | Big-data parallel processing system and method based on column storage | |
CN103699656A (en) | GPU-based mass-multimedia-data-oriented MapReduce platform | |
CN109063017A (en) | A kind of data persistence location mode of cloud computing platform | |
CN102193958A (en) | Method for implementing spatial decision support system based on Internet | |
CN113568938A (en) | Data stream processing method and device, electronic equipment and storage medium | |
Theeten et al. | Chive: Bandwidth optimized continuous querying in distributed clouds | |
CN117009038B (en) | Graph computing platform based on cloud native technology | |
CN104299170B (en) | Intermittent energy source mass data processing method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20160615 |