CN112800091A - Flow-batch integrated calculation control system and method - Google Patents

Flow-batch integrated calculation control system and method

Publication number
CN112800091A
Authority
CN
China
Prior art keywords
batch
data
data source
streaming
real
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110105453.0A
Other languages
Chinese (zh)
Inventor
张玮霖
王泽东
于政
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Mininglamp Software System Co ltd
Original Assignee
Beijing Mininglamp Software System Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Mininglamp Software System Co ltd filed Critical Beijing Mininglamp Software System Co ltd
Priority to CN202110105453.0A priority Critical patent/CN112800091A/en
Publication of CN112800091A publication Critical patent/CN112800091A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24568 Data stream processing; Continuous queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5022 Mechanisms to release resources

Abstract

The application provides a flow-batch integrated calculation control system and method. The system comprises: a control device, configured to convert metadata of a batch data source into dimension table metadata, continuously read batch offline data from the batch data source, and generate a real-time dimension table based on the batch offline data each time batch offline data is read; and a calculating device, configured to perform streaming calculation on the streaming real-time data of a streaming data source together with the real-time dimension table to obtain a streaming calculation result. Because the whole system of the embodiment uses only the streaming system to perform streaming calculation, the extra calculation consumption and operation and maintenance cost of simultaneously maintaining a streaming system and a batch system in a stream-batch separation scenario can be avoided. In addition, the real-time dimension table serves as a buffer between the batch offline data and the streaming real-time data, which reduces the access delay difference between the two and relieves the high system load caused by loading the batch offline data into the memory of the streaming system.

Description

Flow-batch integrated calculation control system and method
Technical Field
The application relates to the technical field of data processing, in particular to a flow-batch integrated calculation control system and method.
Background
In the data era, data is an important factor in productivity. Evaluating the efficiency of a business is inseparable from timely data support, and promptly capturing the 'actions' of data makes the business agile, enabling efficient feedback and rapid response. In the power industry, devices generate large amounts of data. These data reflect the operating state of the equipment, so the equipment data need to be calculated and analyzed in real time. Equipment of the same model has different parameters in different installation environments, and two devices of the same model at different locations convey different meanings even when they emit the same signal data. If a single piece of data is analyzed in isolation by real-time streaming calculation, the meaning behind the data is difficult to discover; if the equipment data and its historical data are analyzed by batch calculation, the calculation takes a certain amount of time, so the real-time nature of the equipment data is lost.
The prior art has the following two solutions:
1) Stream-batch separation calculation scheme: on top of the original batch calculation, a streaming calculation module that executes the same calculation is added. After real-time data enters the system, the streaming calculation module computes a real-time result, and the data is stored in an offline data warehouse. When the periodically triggered batch calculation produces a result, the batch calculation result overwrites the real-time result of the streaming calculation. In this scheme, the same data must be calculated twice, which consumes more system resources; in addition, a streaming system with the same calculation logic must be maintained alongside the batch system, which incurs additional operation and maintenance cost.
2) Stream-replaces-batch calculation scheme: batch calculation is regarded as a special, input-bounded form of streaming calculation, and the original batch calculation process is replaced entirely by streaming calculation. When offline data needs to be processed, it is read into the streaming system as a bounded data stream, and a new streaming calculation task is started to complete the original batch calculation. Although this scheme requires no additional maintenance work, replacing batch calculation with streaming calculation means the batch data must be loaded into the system as a stream, which depends heavily on the caching capacity of the middleware. Moreover, to ensure correctness, streaming calculation must load a large amount of data into the system at once, which increases the system load and may produce incorrect results if data is lost.
Disclosure of Invention
In view of this, an object of the present application is to provide a flow-batch integrated calculation control system and method in which the entire system uses only the streaming system for streaming calculation, so that the extra calculation consumption and operation and maintenance cost of simultaneously maintaining a streaming system and a batch system in a stream-batch separation scenario can be avoided. In addition, the real-time dimension table serves as a buffer between the batch offline data and the streaming real-time data, which reduces the access delay difference between the two and relieves the high system load caused by loading the batch offline data into the memory of the streaming system.
In a first aspect, an embodiment of the present application provides a flow-batch integrated calculation control system, including:
the control device is used for converting metadata of a batch data source into dimension table metadata, continuously reading batch offline data from the batch data source, and generating a real-time dimension table based on the batch offline data each time the batch offline data is read from the batch data source;
and the calculating device is used for carrying out stream calculation on the stream real-time data of the stream data source and the real-time dimension table together to obtain a stream calculation result.
In one possible embodiment, the control device comprises:
the metadata import module is used for respectively acquiring metadata from the batch data source and the streaming data source and importing the metadata of the batch data source and the streaming data source into the metadata directory module;
the metadata directory module is used for storing and retrieving metadata of the streaming data source and the batch data source and converting the metadata of the batch data source into dimension table metadata;
the dimension table synchronization module is used for continuously reading batch offline data from a batch data source, and generating a real-time dimension table based on the batch offline data, the dimension table metadata and a user-configured dimension table synchronization strategy when the batch offline data are read from the batch data source each time; the dimension table synchronization strategy is used for controlling the frequency of data synchronization, so that the timeliness of the data in the real-time dimension table and the system consumption generated in the data synchronization process are balanced.
In one possible embodiment, the control device further comprises:
and the metadata management module is used for triggering the metadata import module to start importing and for generating the dimension table synchronization strategy based on the configuration operation of a user.
In one possible embodiment, the control device further comprises:
the SQL statement analysis module is used for converting the standard SQL statement configured by the user into an abstract semantic tree;
the execution plan generating module is used for generating an execution plan based on the abstract semantic tree, the metadata of the batch data source and the metadata of the streaming data source; the execution plan comprises a directed acyclic computation flow graph ordered according to topology, and each vertex in the directed acyclic computation flow graph corresponds to a streaming computation thread.
In one possible implementation, the computing device includes:
the connection information temporary storage module is used for temporarily storing the streaming real-time data of the streaming data source and the data source connection information of the real-time dimension table, and releasing the memory space occupied by the data source connection information after the calculation of the streaming calculation module is finished;
and the stream type calculation module is used for reading the stream type real-time data of the stream type data source and the real-time dimension table based on the data source connection information, and performing stream type calculation on the stream type real-time data of the stream type data source and the real-time dimension table together based on the execution plan to obtain a stream type calculation result.
In a possible implementation manner, the connection information temporary storage module is further configured to temporarily store data source connection information of batch offline data of the batch data source, and release a memory space occupied by the data source connection information after the calculation of the streaming calculation module is completed;
and the streaming calculation module is further configured to read batch offline data of the batch data source based on the data source connection information, and perform streaming calculation on the batch offline data of the batch data source based on the execution plan to obtain a streaming calculation result.
In a possible implementation manner, the connection information temporary storage module is further configured to temporarily store data source connection information of streaming real-time data of the streaming data source, and release a memory space occupied by the data source connection information after the streaming calculation module finishes calculation;
the streaming calculation module is further configured to read streaming real-time data of the streaming data source based on the data source connection information, and perform streaming calculation on the streaming real-time data of the streaming data source based on the execution plan to obtain a streaming calculation result.
In a second aspect, an embodiment of the present application further provides a flow-batch integrated calculation control method, including:
converting metadata of the batch data source into dimension table metadata;
continuously reading batch off-line data from the batch data source, and generating a real-time dimension table based on the batch off-line data each time the batch off-line data is read from the batch data source;
and carrying out stream type calculation on the stream type real-time data of the stream type data source and the real-time dimension table together to obtain a stream type calculation result.
In a third aspect, an embodiment of the present application further provides an electronic device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating over the bus when the electronic device is operating, the machine-readable instructions when executed by the processor performing the steps of the second aspect.
In a fourth aspect, the present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and the computer program is executed by a processor to perform the steps in the second aspect.
The flow-batch integrated calculation control system comprises a control device and a calculating device. The control device is used for converting metadata of a batch data source into dimension table metadata, continuously reading batch offline data from the batch data source, and generating a real-time dimension table based on the batch offline data each time the batch offline data is read from the batch data source; and the calculating device is used for carrying out streaming calculation on the streaming real-time data of the streaming data source and the real-time dimension table together to obtain a streaming calculation result. The system connects the streaming real-time data and the batch offline data through the real-time dimension table, which brings the offline data online so that the streaming system can quickly access the batch offline data. Because calculation is driven jointly by the batch offline data and the streaming real-time data, batch bounded data and streaming unbounded data are managed, calculated and stored uniformly; supplementing the stream with dimension data allows the value of high-density data to be mined and improved, enhances real-time feedback capability, reduces resource consumption, and improves organizational coordination. On one hand, as the whole system uses only the streaming system for streaming calculation, the extra calculation consumption and operation and maintenance cost of simultaneously maintaining a streaming system and a batch system in a stream-batch separation scenario can be avoided.
On the other hand, the real-time dimension table serves as a buffer between the batch-type offline data and the streaming-type real-time data, so that the access delay difference between the batch-type offline data and the streaming-type real-time data can be reduced, and the high system load generated by loading the batch-type offline data into the memory of the streaming-type system can be relieved.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.
FIG. 1 is a schematic structural diagram illustrating a flow-batch integrated calculation control system according to an embodiment of the present application;
FIG. 2 is a flow chart illustrating a flow-batch integrated calculation control method according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram illustrating an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
In the power industry, devices generate large amounts of data. These data reflect the operating state of the equipment, so the equipment data need to be calculated and analyzed in real time. Equipment of the same model has different parameters in different installation environments, and two devices of the same model at different locations convey different meanings even when they emit the same signal data. If a single piece of data is analyzed in isolation by real-time streaming calculation, the meaning behind the data is difficult to discover; if the equipment data and its historical data are analyzed by batch calculation, the calculation takes a certain amount of time, so the real-time nature of the equipment data is lost. The prior art has the following two solutions:
1) Stream-batch separation calculation scheme: on top of the original batch calculation, a streaming calculation module that executes the same calculation is added. After real-time data enters the system, the streaming calculation module computes a real-time result, and the data is stored in an offline data warehouse. When the periodically triggered batch calculation produces a result, the batch calculation result overwrites the real-time result of the streaming calculation. In this scheme, the same data must be calculated twice, which consumes more system resources; in addition, a streaming system with the same calculation logic must be maintained alongside the batch system, which incurs additional operation and maintenance cost.
2) Stream-replaces-batch calculation scheme: batch calculation is regarded as a special, input-bounded form of streaming calculation, and the original batch calculation process is replaced entirely by streaming calculation. When offline data needs to be processed, it is read into the streaming system as a bounded data stream, and a new streaming calculation task is started to complete the original batch calculation. Although this scheme requires no additional maintenance work, replacing batch calculation with streaming calculation means the batch data must be loaded into the system as a stream, which depends heavily on the caching capacity of the middleware. Moreover, to ensure correctness, streaming calculation must load a large amount of data into the system at once, which increases the system load and may produce incorrect results if data is lost.
Based on this, the embodiments of the present application provide a flow-batch integrated calculation control system and method, which are described below by way of embodiments.
To facilitate understanding of the present embodiment, the flow-batch integrated calculation control system disclosed in the embodiments of the present application is described in detail first.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a flow-batch integrated calculation control system according to an embodiment of the present application. As shown in fig. 1, the system may include:
a control device 10 for converting metadata of a batch data source into dimension table metadata, continuously reading batch offline data from the batch data source, and generating a real-time dimension table based on the batch offline data each time the batch offline data is read from the batch data source;
and the calculating device 20 is used for performing stream calculation on the stream real-time data of the stream data source and the real-time dimension table together to obtain a stream calculation result.
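The role of the calculating device can be pictured with a minimal Python sketch, offered only as an illustration and not as the implementation of the application. The device identifiers, the field names (`site`, `rated_voltage_kv`, `signal`) and the `enrich` function are all hypothetical. The sketch shows how the same signal value from two devices of the same model acquires different meaning once each streaming record is joined with its row in the dimension table.

```python
from typing import Dict, Iterable, Iterator

# Hypothetical real-time dimension table: device_id -> installation attributes.
# In the described system this table would be produced by the control device.
dimension_table: Dict[str, dict] = {
    "dev-1": {"site": "substation-A", "rated_voltage_kv": 110},
    "dev-2": {"site": "substation-B", "rated_voltage_kv": 220},
}


def enrich(stream: Iterable[dict], dim: Dict[str, dict]) -> Iterator[dict]:
    """Join each streaming record with its dimension row as it arrives."""
    for event in stream:
        # A missing key yields an empty row, so the event passes through as-is.
        row = dim.get(event["device_id"], {})
        yield {**event, **row}


# Two devices emit the same signal value (assumed sample data).
events = [
    {"device_id": "dev-1", "signal": 0.97},
    {"device_id": "dev-2", "signal": 0.97},
]
results = list(enrich(events, dimension_table))
```

Because the lookup is a constant-time key access against an already materialized table, the streaming side does not need to load the batch data source itself into memory.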
The control device 10 and the calculation device 20 will be described in detail below.
As shown in fig. 1, the control device 10 may include the following modules:
The metadata importing module 101 is configured to obtain metadata from the batch data source and the streaming data source, and import the metadata of the batch data source and the metadata of the streaming data source into the metadata directory module. The metadata of the batch data source refers to data describing the batch offline data, and generally includes the table structure of the batch data table and configuration information of the batch data source such as connection and sharding settings. The metadata of the streaming data source refers to data describing the streaming real-time data, and generally includes the table structure of the streaming data table and configuration information of the streaming data source such as connection and sharding settings.
The metadata directory module 102 is configured to store and retrieve metadata (data field information and data source connection information) of the streaming data source and the batch data source, and convert the metadata of the batch data source into dimension table metadata. The dimension table is a copy of a subset of the batch data table, and is merged with the streaming real-time data to obtain a detailed wide table of the input data. Dimension table metadata refers to metadata obtained by converting the batch data table structure.
The dimension table synchronization module 103 is configured to continuously read batch offline data from the batch data source, and generate a real-time dimension table based on the batch offline data, the dimension table metadata, and a user-configured dimension table synchronization policy each time batch offline data is read from the batch data source. The real-time dimension table is an in-memory key-value database. Because the batch offline data entries in the batch data source usually keep growing, the real-time dimension table needs to continuously synchronize, or incrementally read, new data from the batch data source. A synchronization policy is therefore needed to control the frequency of data synchronization, so as to balance the timeliness of the data in the real-time dimension table against the system consumption incurred by synchronization.
It should be noted that when the real-time requirement on the data is not high or the volume of batch offline data is very large, the real-time dimension table can use an RDBMS or a distributed data storage system. The access delay of an RDBMS or a distributed data storage system is higher than that of an in-memory key-value database, but such systems can store more data and provide better stability. Therefore, in those cases an RDBMS or a distributed data storage system can replace the in-memory key-value database as the real-time dimension table.
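As a rough illustration of the conversion from batch metadata to dimension table metadata, and of an in-memory key-value dimension table built from it, consider the following Python sketch. The class names, the fields, the JDBC-style connection string and the choice of a plain `dict` as the key-value store are assumptions made for the example; the application does not prescribe them.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Sequence, Tuple


@dataclass(frozen=True)
class BatchTableMeta:
    """Hypothetical metadata of a batch data table."""
    table: str
    columns: Tuple[str, ...]
    connection: str


@dataclass(frozen=True)
class DimTableMeta:
    """Hypothetical dimension table metadata derived from the batch metadata."""
    source_table: str
    key_column: str
    value_columns: Tuple[str, ...]


def to_dim_meta(meta: BatchTableMeta, key_column: str,
                keep: Sequence[str]) -> DimTableMeta:
    """Select a subset of the batch columns to form the dimension table."""
    missing = [c for c in (key_column, *keep) if c not in meta.columns]
    if missing:
        raise ValueError(f"columns not in {meta.table}: {missing}")
    return DimTableMeta(meta.table, key_column, tuple(keep))


def build_dim_table(rows: Iterable[dict], dim: DimTableMeta) -> Dict[str, dict]:
    """Materialize the dimension table as key -> selected column values."""
    return {
        row[dim.key_column]: {c: row[c] for c in dim.value_columns}
        for row in rows
    }


batch_meta = BatchTableMeta(
    table="device_registry",
    columns=("device_id", "site", "model", "installed_on"),
    connection="jdbc:example://warehouse/devices",  # placeholder, not a real endpoint
)
dim_meta = to_dim_meta(batch_meta, key_column="device_id", keep=("site", "model"))
dim_table = build_dim_table(
    [{"device_id": "dev-1", "site": "A", "model": "X", "installed_on": "2020-01-01"}],
    dim_meta,
)
```

Swapping the `dict` for an RDBMS or a distributed store would change only `build_dim_table` and the lookup path, which is the substitution the note above describes.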
In one possible embodiment, the system may further include:
The metadata management module 104 is configured to trigger the metadata importing module 101 to start importing and to generate the dimension table synchronization policy based on the configuration operation of the user. The user may configure the parameters of the various modules in the control device 10 via the metadata management module 104. Specifically, through the metadata management module 104 the user may schedule the import of metadata from the batch data source and the streaming data source or trigger it manually. The user may also perform a configuration operation on the metadata management module 104 through the front-end page, and the metadata management module 104 generates the dimension table synchronization policy based on that configuration operation.
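One simple way to picture a synchronization policy is a minimum interval between incremental reads. The Python sketch below assumes exactly that; the `SyncPolicy` and `DimTableSynchronizer` names, the `interval_seconds` setting and the list-offset form of incremental reading are hypothetical, and the application leaves the concrete policy to user configuration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SyncPolicy:
    """Hypothetical user-configured policy: minimum seconds between syncs."""
    interval_seconds: float


class DimTableSynchronizer:
    """Incrementally copies new batch rows into an in-memory dimension table."""

    def __init__(self, policy: SyncPolicy, key_column: str) -> None:
        self.policy = policy
        self.key_column = key_column
        self.table: Dict[str, dict] = {}
        self._offset = 0  # number of source rows already synchronized
        self._last_sync: Optional[float] = None

    def due(self, now: float) -> bool:
        """A sync is due on first use or once the interval has elapsed."""
        return (self._last_sync is None
                or now - self._last_sync >= self.policy.interval_seconds)

    def maybe_sync(self, source_rows: List[dict], now: float) -> int:
        """Sync only when the policy allows; return the number of rows loaded."""
        if not self.due(now):
            return 0
        new_rows = source_rows[self._offset:]
        for row in new_rows:
            self.table[row[self.key_column]] = row
        self._offset = len(source_rows)
        self._last_sync = now
        return len(new_rows)


source = [{"device_id": "dev-1", "site": "A"}]
sync = DimTableSynchronizer(SyncPolicy(interval_seconds=60), "device_id")
first = sync.maybe_sync(source, now=0.0)      # first call always synchronizes
source.append({"device_id": "dev-2", "site": "B"})
skipped = sync.maybe_sync(source, now=30.0)   # interval not yet elapsed
later = sync.maybe_sync(source, now=60.0)     # interval elapsed, reads the new row
```

A shorter interval keeps the dimension table fresher at the cost of more frequent reads from the batch data source, which is the trade-off the policy is meant to balance.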
In one possible embodiment, the system may further include:
The SQL statement parsing module 105 is configured to convert the standard SQL statement configured by the user into an abstract syntax tree (AST), referred to in this application as the abstract semantic tree.
The execution plan generating module 106 is configured to generate an execution plan based on the abstract semantic tree, the metadata of the batch data source, and the metadata of the streaming data source. The execution plan comprises a topologically ordered directed acyclic computation flow graph, which corresponds to a topological connection and allocation scheme for a group of streaming computing resources. A streaming computing resource is specifically a streaming computing thread located in a distributed host cluster, and corresponds to a vertex in the directed acyclic computation flow graph. Data is transmitted between streaming computing resources over TCP, and each such transmission corresponds to a directed edge in the directed acyclic computation flow graph. One host may hold multiple computing resources.
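The topological ordering of such a computation flow graph can be sketched with the `graphlib` module from the Python standard library (Python 3.9 or later). The vertex names, the two-host cluster and the round-robin placement below are illustrative assumptions; the application does not specify how vertices are assigned to hosts.

```python
from graphlib import TopologicalSorter

# Hypothetical execution plan: each vertex maps to the set of vertices it
# depends on, i.e. the sources of its incoming directed edges.
plan = {
    "read_stream": set(),
    "read_dim_table": set(),
    "join": {"read_stream", "read_dim_table"},
    "aggregate": {"join"},
    "sink": {"aggregate"},
}

# static_order() yields every vertex after all of its predecessors and
# raises graphlib.CycleError if the graph is not acyclic.
order = list(TopologicalSorter(plan).static_order())

# Assumed placement: spread the streaming computation threads over two hosts.
hosts = ["host-a", "host-b"]
placement = {vertex: hosts[i % len(hosts)] for i, vertex in enumerate(order)}
```

Since one host may hold several computing resources, more than one vertex can map to the same host, as happens here with five vertices on two hosts.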
As shown in fig. 1, the computing device 20 may include the following modules:
The connection information temporary storage module 201 is configured to temporarily store the data source connection information of the streaming real-time data of the streaming data source and of the real-time dimension table (which may reside on different hosts), and to release the memory space occupied by the data source connection information after the streaming calculation module 202 finishes its calculation. The data source connection information is represented as an in-memory temporary data view. Similar to the view concept in a relational database management system (RDBMS), the in-memory temporary data view is a virtual table that temporarily holds the data source connection information; it does not store the data itself, and the memory space it occupies is released automatically after the calculation task ends.
And the streaming calculation module 202 is configured to read the streaming real-time data of the streaming data source and the real-time dimension table based on the data source connection information, and perform streaming calculation on the streaming real-time data of the streaming data source and the real-time dimension table together based on the execution plan to obtain a streaming calculation result.
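The lifecycle of the temporarily stored connection information, registered before the calculation and released once it finishes, resembles a scoped resource. The following Python sketch models it with a context manager; the `ConnectionInfoStore` class, its method names and the connection strings are hypothetical and serve only to show the register-then-release pattern.

```python
from contextlib import contextmanager
from typing import Dict, Iterator, List


class ConnectionInfoStore:
    """Holds data source connection info only while a calculation task runs."""

    def __init__(self) -> None:
        self._views: Dict[str, str] = {}

    @contextmanager
    def temporary_views(self, **connections: str) -> Iterator[Dict[str, str]]:
        """Register connection info, then release it when the block exits."""
        self._views.update(connections)
        try:
            yield dict(connections)
        finally:
            # Runs even if the calculation raises, so nothing is left behind.
            for name in connections:
                self._views.pop(name, None)

    def active(self) -> List[str]:
        """Names of the views currently held in memory."""
        return sorted(self._views)


store = ConnectionInfoStore()
with store.temporary_views(
    stream="kafka://broker:9092/device-events",  # placeholder address
    dim="redis://cache:6379/0",                  # placeholder address
) as views:
    during = store.active()  # both views exist while the task runs
after = store.active()       # released once the task has finished
```

Only connection descriptors are held, not the data they point to, which mirrors the statement that the temporary view does not store the data itself.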
In a possible implementation manner, the connection information temporary storage module 201 is further configured to temporarily store data source connection information of batch offline data of the batch data source, and release a memory space occupied by the data source connection information after the calculation of the streaming calculation module 202 is finished;
the streaming calculation module 202 is further configured to read batch offline data of the batch data source based on the data source connection information, and perform streaming calculation on the batch offline data of the batch data source based on the execution plan to obtain a streaming calculation result.
In another possible implementation, the connection information temporary storage module 201 is further configured to temporarily store data source connection information of streaming real-time data of the streaming data source, and release a memory space occupied by the data source connection information after the streaming calculation module 202 finishes calculating;
the streaming calculation module 202 is further configured to read streaming real-time data of the streaming data source based on the data source connection information, and perform streaming calculation on the streaming real-time data of the streaming data source based on the execution plan to obtain a streaming calculation result.
The flow batch integrated calculation control system provided by this embodiment not only can perform flow batch integrated calculation on the streaming real-time data and the batch offline data in a real-time dimension table, but also can perform flow calculation on the batch offline data of the batch data source or the streaming real-time data of the streaming data source separately, and can be compatible with existing systems.
The system for controlling integrated calculation of batch and flow provided by the embodiment comprises a control device and a calculation device, wherein the control device is adopted to convert metadata of a batch data source into metadata of a dimension table, batch offline data are continuously read from the batch data source, and a real-time dimension table is generated based on the batch offline data each time the batch offline data are read from the batch data source; and then, carrying out stream type calculation on the stream type real-time data of the stream type data source and the real-time dimension table together by adopting a calculating device to obtain a stream type calculation result. The system connects the streaming real-time data and the batch offline data through the real-time dimension table, and the offline data is online through the real-time dimension table, so that the streaming system can quickly access the batch offline data. On one hand, as the whole system completely uses the streaming system for streaming calculation, the extra calculation consumption and operation and maintenance cost for simultaneously maintaining the streaming system and the batch system under the scene of stream-batch separation can be avoided. On the other hand, the real-time dimension table serves as a buffer between the batch-type offline data and the streaming-type real-time data, so that the access delay difference between the batch-type offline data and the streaming-type real-time data can be reduced, and the high system load generated by loading the batch-type offline data into the memory of the streaming-type system can be relieved.
Based on the same technical concept, embodiments of the present application further provide a flow-batch integrated calculation control method, an electronic device, a computer storage medium, and the like; for details, see the following embodiments.
Fig. 2 is a flowchart illustrating a flow-batch integrated calculation control method according to an embodiment of the present disclosure. As shown in Fig. 2, the method may include the following steps:
S210: converting metadata of the batch data source into dimension table metadata;
S220: continuously reading batch offline data from the batch data source;
S230: judging whether new batch offline data has been read from the batch data source; if so, proceeding to step S240; if not, returning to step S220;
S240: generating a real-time dimension table based on the batch offline data;
S250: performing streaming calculation on the streaming real-time data of the streaming data source together with the real-time dimension table to obtain a streaming calculation result.
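The five steps above can be sketched as a single loop. This is a simplified, sequential sketch under stated assumptions: a real system would likely read the batch data source and consume the stream concurrently, and the metadata fields (`name`, `primary_key`, `columns`) and the `poll_batch` callback are hypothetical names that do not come from the patent.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional


def to_dimension_metadata(batch_metadata: dict) -> dict:
    """S210: derive dimension-table metadata from batch-source metadata."""
    return {
        "table": batch_metadata["name"] + "_dim",
        "key": batch_metadata["primary_key"],
        "columns": list(batch_metadata["columns"]),
    }


def build_dimension_table(rows: List[dict], dim_meta: dict) -> Dict[object, dict]:
    """S240: index the batch rows by the key named in the metadata."""
    return {row[dim_meta["key"]]: row for row in rows}


def run(
    batch_metadata: dict,
    poll_batch: Callable[[], Optional[List[dict]]],
    stream: Iterable[dict],
) -> Iterator[dict]:
    dim_meta = to_dimension_metadata(batch_metadata)       # S210
    table: Dict[object, dict] = {}
    for event in stream:
        rows = poll_batch()                                # S220
        if rows is not None:                               # S230: new data?
            table = build_dimension_table(rows, dim_meta)  # S240
        # S250: compute on the event together with the current table.
        yield {**table.get(event[dim_meta["key"]], {}), **event}


# Hypothetical batch snapshots; None means "nothing new was read".
snapshots = iter([[{"id": 1, "name": "a"}], None, [{"id": 1, "name": "b"}]])
results = list(run(
    {"name": "users", "primary_key": "id", "columns": ["id", "name"]},
    lambda: next(snapshots, None),
    [{"id": 1, "v": 10}, {"id": 1, "v": 20}, {"id": 1, "v": 30}],
))
```

In this run the second event still sees the first snapshot because no new batch data arrived, while the third event sees the refreshed table.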
In a possible implementation, before step S210 the method further includes: obtaining metadata from the batch data source and the streaming data source, respectively.
In a possible implementation manner, step S240 specifically includes: generating a real-time dimension table based on the batch-type offline data, the dimension table metadata and a dimension table synchronization strategy configured by a user; the dimension table synchronization strategy is used for controlling the frequency of data synchronization, so that the timeliness of the data in the real-time dimension table and the system consumption generated in the data synchronization process are balanced.
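One simple way to picture a dimension table synchronization strategy is a minimum interval between synchronizations: a shorter interval gives fresher data, a longer one costs less. The sketch below assumes that form; the class name `SyncPolicy`, the `min_interval_s` parameter and the injectable `clock` are illustrative choices, not details taken from the patent.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SyncPolicy:
    """Hypothetical user-configured policy: minimum seconds between syncs."""

    min_interval_s: float
    clock: Callable[[], float] = time.monotonic
    _last_sync: float = field(default=float("-inf"), init=False)

    def should_sync(self) -> bool:
        """Return True, and record the time, if a sync is due now."""
        now = self.clock()
        if now - self._last_sync >= self.min_interval_s:
            self._last_sync = now
            return True
        return False


# Usage with a fake clock so the behaviour is deterministic.
fake_now = [0.0]
policy = SyncPolicy(min_interval_s=5.0, clock=lambda: fake_now[0])
first = policy.should_sync()   # no previous sync, so one is due
fake_now[0] = 3.0
second = policy.should_sync()  # only 3 s elapsed, below the 5 s minimum
fake_now[0] = 5.0
third = policy.should_sync()   # 5 s elapsed since the last sync
```

Raising `min_interval_s` lowers the synchronization load at the cost of staler rows in the real-time dimension table, which is the trade-off the strategy is meant to balance.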
In a possible implementation manner, after step S240 and before step S250, the method further includes: converting standard SQL statements configured by a user into an abstract semantic tree; and generating an execution plan based on the abstract semantic tree, the metadata of the batch data source and the metadata of the streaming data source. The execution plan comprises a topologically sorted directed acyclic computation flow graph, and each vertex in the directed acyclic computation flow graph corresponds to a streaming computation thread.
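To make the topologically sorted computation flow graph concrete, the sketch below orders the vertices of a small hand-written plan with Python's standard `graphlib` module (Python 3.9+). The operator names and the plan itself are hypothetical, and SQL parsing is deliberately left out; the sketch only shows the ordering step.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Dict, List, Set


def plan_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return the operators so that every input precedes its consumer.

    `dependencies` maps each vertex to the set of vertices it reads from.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical plan for a query that joins a stream with a dimension table,
# filters the joined rows and writes them to a sink.
plan = {
    "scan_stream": set(),
    "scan_dim": set(),
    "join": {"scan_stream", "scan_dim"},
    "filter": {"join"},
    "sink": {"filter"},
}
order = plan_order(plan)
```

`TopologicalSorter` raises `graphlib.CycleError` if the graph contains a cycle, which gives a cheap check that a generated plan really is acyclic before any thread is started for its vertices.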
In a possible implementation manner, step S250 specifically includes: temporarily storing the data source connection information of the streaming real-time data of the streaming data source and of the real-time dimension table, and releasing the memory space occupied by the data source connection information after the calculation of the streaming calculation module is finished; and reading the streaming real-time data of the streaming data source and the real-time dimension table based on the data source connection information, and performing streaming calculation on the streaming real-time data of the streaming data source together with the real-time dimension table based on the execution plan to obtain a streaming calculation result.
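The store-then-release handling of data source connection information maps naturally onto a context manager. The following is a minimal sketch under that assumption; the class `ConnectionInfoCache`, the job identifier and the placeholder URIs are hypothetical and only illustrate the lifecycle.

```python
from contextlib import contextmanager
from typing import Dict, Iterator


class ConnectionInfoCache:
    """Holds data source connection information only while a job runs."""

    def __init__(self) -> None:
        self._info: Dict[str, dict] = {}

    @contextmanager
    def hold(self, job_id: str, info: dict) -> Iterator[dict]:
        """Store `info` for the duration of the `with` block."""
        self._info[job_id] = info
        try:
            yield info
        finally:
            # Release the entry whether the calculation succeeded or failed.
            self._info.pop(job_id, None)

    def __len__(self) -> int:
        return len(self._info)


cache = ConnectionInfoCache()
# Placeholder values; real connection information depends on the data sources.
job_info = {"stream": "example-stream-uri", "dimension_table": "example-table-uri"}
with cache.hold("job-1", job_info) as info:
    held_during = len(cache)  # the streaming calculation would read via `info` here
held_after = len(cache)
```

Placing the release in a `finally` clause means the memory is freed even if the streaming calculation raises an error.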
In a possible implementation, step S250 further includes: temporarily storing data source connection information of batch offline data of the batch data source, and releasing memory space occupied by the data source connection information after the calculation of the streaming calculation module is finished; and reading the batch offline data of the batch data source based on the data source connection information, and performing streaming calculation on the batch offline data of the batch data source based on the execution plan to obtain a streaming calculation result.
In a possible implementation, step S250 further includes: temporarily storing the data source connection information of the streaming real-time data of the streaming data source, and releasing the memory space occupied by the data source connection information after the calculation of the streaming calculation module is finished; and reading the streaming real-time data of the streaming data source based on the data source connection information, and performing streaming calculation on the streaming real-time data of the streaming data source based on the execution plan to obtain a streaming calculation result.
In the flow-batch integrated calculation control method provided in this embodiment, metadata of a batch data source is first converted into dimension table metadata, batch offline data is continuously read from the batch data source, and a real-time dimension table is generated based on the batch offline data each time batch offline data is read from the batch data source. Streaming calculation is then performed on the streaming real-time data of the streaming data source together with the real-time dimension table to obtain a streaming calculation result. The method connects the streaming real-time data and the batch offline data through the real-time dimension table, which brings the offline data online so that the streaming system can access the batch offline data quickly. On the one hand, because the whole method performs all calculation with the streaming system, it avoids the extra computation and operation and maintenance costs of maintaining a streaming system and a batch system side by side, as a separated stream and batch architecture requires. On the other hand, the real-time dimension table serves as a buffer between the batch offline data and the streaming real-time data, which reduces the difference in access latency between the two and relieves the high system load that would result from loading the batch offline data into the memory of the streaming system.
An embodiment of the present application discloses an electronic device, as shown in Fig. 3, including: a processor 301, a memory 302, and a bus 303. The memory 302 stores machine-readable instructions executable by the processor 301, and the processor 301 and the memory 302 communicate via the bus 303 when the electronic device is operating. When executed by the processor 301, the machine-readable instructions perform the method described in the foregoing method embodiment; for the specific implementation, refer to the method embodiment, which is not repeated here.
The computer program product of the flow-batch integrated calculation control method provided in the embodiment of the present application includes a computer-readable storage medium storing non-volatile program code executable by a processor. The instructions included in the program code may be used to execute the method described in the foregoing method embodiment; for the specific implementation, refer to the method embodiment, which is not repeated here.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a processor-executable non-volatile computer-readable storage medium. Based on such understanding, the technical solution of the present application in essence, or the part of it that contributes over the prior art, or a part of the technical solution, may be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that the above embodiments are only specific embodiments of the present application, intended to illustrate its technical solutions rather than to limit them, and the protection scope of the present application is not limited thereto. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that, within the technical scope disclosed in the present application, anyone skilled in the art may still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present application and shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A flow-batch integrated calculation control system, comprising:
a control device, configured to convert metadata of a batch data source into dimension table metadata, continuously read batch offline data from the batch data source, and generate a real-time dimension table based on the batch offline data each time the batch offline data is read from the batch data source; and
a calculation device, configured to perform streaming calculation on streaming real-time data of a streaming data source together with the real-time dimension table to obtain a streaming calculation result.
2. The system of claim 1, wherein the control device comprises:
a metadata import module, configured to acquire metadata from the batch data source and the streaming data source respectively, and import the metadata of the batch data source and the streaming data source into a metadata directory module;
the metadata directory module, configured to store and retrieve the metadata of the streaming data source and the batch data source, and convert the metadata of the batch data source into the dimension table metadata; and
a dimension table synchronization module, configured to continuously read the batch offline data from the batch data source and, each time the batch offline data is read from the batch data source, generate the real-time dimension table based on the batch offline data, the dimension table metadata and a user-configured dimension table synchronization strategy; wherein the dimension table synchronization strategy is used for controlling the frequency of data synchronization, so as to balance the timeliness of the data in the real-time dimension table against the system consumption generated in the data synchronization process.
3. The system of claim 2, wherein the control device further comprises:
a metadata management module, configured to control the metadata import module to start operating, and to generate the dimension table synchronization strategy based on a configuration operation of a user.
4. The system of claim 2, wherein the control device further comprises:
an SQL statement analysis module, configured to convert a standard SQL statement configured by the user into an abstract semantic tree; and
an execution plan generating module, configured to generate an execution plan based on the abstract semantic tree, the metadata of the batch data source and the metadata of the streaming data source; wherein the execution plan comprises a topologically sorted directed acyclic computation flow graph, and each vertex in the directed acyclic computation flow graph corresponds to a streaming computation thread.
5. The system of claim 4, wherein the calculation device comprises:
a connection information temporary storage module, configured to temporarily store data source connection information of the streaming real-time data of the streaming data source and of the real-time dimension table, and release the memory space occupied by the data source connection information after the calculation of a streaming calculation module is finished; and
the streaming calculation module, configured to read the streaming real-time data of the streaming data source and the real-time dimension table based on the data source connection information, and perform streaming calculation on the streaming real-time data of the streaming data source together with the real-time dimension table based on the execution plan to obtain the streaming calculation result.
6. The system of claim 5, wherein:
the connection information temporary storage module is further configured to temporarily store data source connection information of the batch offline data of the batch data source, and release the memory space occupied by the data source connection information after the calculation of the streaming calculation module is finished; and
the streaming calculation module is further configured to read the batch offline data of the batch data source based on the data source connection information, and perform streaming calculation on the batch offline data of the batch data source based on the execution plan to obtain a streaming calculation result.
7. The system of claim 6, wherein:
the connection information temporary storage module is further configured to temporarily store data source connection information of the streaming real-time data of the streaming data source, and release the memory space occupied by the data source connection information after the calculation of the streaming calculation module is finished; and
the streaming calculation module is further configured to read the streaming real-time data of the streaming data source based on the data source connection information, and perform streaming calculation on the streaming real-time data of the streaming data source based on the execution plan to obtain a streaming calculation result.
8. A flow-batch integrated calculation control method, comprising the following steps:
converting metadata of a batch data source into dimension table metadata;
continuously reading batch offline data from the batch data source, and generating a real-time dimension table based on the batch offline data each time the batch offline data is read from the batch data source; and
performing streaming calculation on streaming real-time data of a streaming data source together with the real-time dimension table to obtain a streaming calculation result.
9. An electronic device, comprising: a processor, a storage medium and a bus, the storage medium storing machine-readable instructions executable by the processor, the processor and the storage medium communicating over the bus when the electronic device is operating, the processor executing the machine-readable instructions to perform the steps of the method of claim 8.
10. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the method of claim 8.
CN202110105453.0A 2021-01-26 2021-01-26 Flow-batch integrated calculation control system and method Pending CN112800091A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110105453.0A CN112800091A (en) 2021-01-26 2021-01-26 Flow-batch integrated calculation control system and method

Publications (1)

Publication Number Publication Date
CN112800091A (en) 2021-05-14

Family

ID=75811918

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110105453.0A Pending CN112800091A (en) 2021-01-26 2021-01-26 Flow-batch integrated calculation control system and method

Country Status (1)

Country Link
CN (1) CN112800091A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105677752A (en) * 2015-12-30 2016-06-15 深圳先进技术研究院 Streaming computing and batch computing combined processing system and method
CN106909598A (en) * 2016-07-01 2017-06-30 阿里巴巴集团控股有限公司 It is a kind of to ensure processing method, the apparatus and system for calculating data consistency
CN109189835A (en) * 2018-08-21 2019-01-11 北京京东尚科信息技术有限公司 The method and apparatus of the wide table of data are generated in real time
CN109522341A (en) * 2018-11-27 2019-03-26 北京京东金融科技控股有限公司 Realize method, apparatus, the equipment of the stream data processing engine based on SQL
US20200278969A1 (en) * 2019-02-28 2020-09-03 Microsoft Technology Licensing, Llc Unified metrics computation platform
CN110309848A (en) * 2019-05-08 2019-10-08 重庆天蓬网络有限公司 The method that off-line data and stream data real time fusion calculate

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117435596A (en) * 2023-12-20 2024-01-23 杭州网易云音乐科技有限公司 Streaming batch task integration method and device, storage medium and electronic equipment
CN117435596B (en) * 2023-12-20 2024-04-02 杭州网易云音乐科技有限公司 Streaming batch task integration method and device, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
Shi et al. Mrtuner: a toolkit to enable holistic optimization for mapreduce jobs
US9298774B2 (en) Changing the compression level of query plans
CN107515878B (en) Data index management method and device
EP3432157B1 (en) Data table joining mode processing method and apparatus
US9471651B2 (en) Adjustment of map reduce execution
WO2017080431A1 (en) Log analysis-based database replication method and device
WO2017019879A1 (en) Multi-query optimization
CN105824957A (en) Query engine system and query method of distributive memory column-oriented database
US20180248934A1 (en) Method and System for a Scheduled Map Executor
US11030196B2 (en) Method and apparatus for processing join query
CN111339073A (en) Real-time data processing method and device, electronic equipment and readable storage medium
US10102098B2 (en) Method and system for recommending application parameter setting and system specification setting in distributed computation
CN109471893B (en) Network data query method, equipment and computer readable storage medium
CN110851234A (en) Log processing method and device based on docker container
CN114722119A (en) Data synchronization method and system
US20160170462A1 (en) Resource capacity management in a cluster of host computers using power management analysis
CN112328592A (en) Data storage method, electronic device and computer readable storage medium
CN112800091A (en) Flow-batch integrated calculation control system and method
CN112506869A (en) File processing method, device and system
CN110222046B (en) List data processing method, device, server and storage medium
CN109408711B (en) Data filtering method and device, electronic equipment and storage medium
US20220360458A1 (en) Control method, information processing apparatus, and non-transitory computer-readable storage medium for storing control program
CN107515916B (en) Performance optimization method and device for data query
CN109902067B (en) File processing method and device, storage medium and computer equipment
CN111241099A (en) Industrial big data storage method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination