CN118193565A - Distributed big data calculation engine - Google Patents

Distributed big data calculation engine

Info

Publication number
CN118193565A
Authority
CN
China
Prior art keywords
task
data
computing
engine
node
Prior art date
Legal status
Pending
Application number
CN202311844217.6A
Other languages
Chinese (zh)
Inventor
孟英谦
李旭光
葛晋鹏
李泽宇
陈朔
邬书豪
随秋林
张敏
刘晓兰
薛行
王嘉岩
李子行
Current Assignee
China North Computer Application Technology Research Institute
Original Assignee
China North Computer Application Technology Research Institute
Priority date
Filing date
Publication date
Application filed by China North Computer Application Technology Research Institute filed Critical China North Computer Application Technology Research Institute
Priority to CN202311844217.6A priority Critical patent/CN118193565A/en
Publication of CN118193565A publication Critical patent/CN118193565A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a distributed big data calculation engine, comprising: a unified interface module, a distributed computing engine module and an operation result processing module. The unified interface module is used for receiving a computing task and parsing the task based on its data type identifier so as to start the corresponding computing engine. The distributed computing engine module comprises a stream computing engine, a batch computing engine and a batch-stream integrated computing engine, which are respectively used for reading and executing the corresponding computing tasks. The operation result processing module is used for collecting the running state data of each computing engine, monitoring the running state and returning the task computation results to the client. The invention solves the problem that big data calculation engines in the prior art cannot provide parallel, efficient, real-time computation for multiple types of data processing tasks, and therefore cannot respond quickly in real time when facing multiple complex computing scenarios.

Description

Distributed big data calculation engine
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a distributed big data computing engine.
Background
The heterogeneous data generated by intelligence, surveillance and reconnaissance systems keeps growing, including electronic reconnaissance data, visible-spectrum data, intelligence data, real-time situation data from various sites, and instruction data such as logistics support and personnel mobilization. The diversity and complexity of data services in various scenarios therefore increase, and the challenges of breaking through in data storage, computation and data analysis and processing services grow as well.
In terms of distributed computation, besides the traditional requirement of batch processing of massive data, efficient real-time processing of data is needed to assist combat decision-making. There is therefore a need in the art for a unified big data computing engine that handles all of the requirements of batch computing, stream computing and batch-stream integrated computing.
Disclosure of Invention
In view of the above analysis, the present invention aims to disclose a distributed big data computing engine, which solves the problem that, when facing multiple complex computing scenarios, data computing engines in the prior art cannot provide efficient real-time computation for multiple types of data processing tasks, resulting in high latency and an inability to respond quickly in real time.
The aim of the invention is mainly realized by the following technical scheme:
In one aspect, the invention discloses a distributed big data computing engine, the computing engine comprising: a unified interface module, a distributed computing engine module and an operation result processing module;
The unified interface module is used for receiving a computing task and analyzing the task based on a data type identifier of the computing task so as to start a corresponding computing engine;
The distributed computing engine module comprises a flow computing engine, a batch computing engine and a batch-flow integrated computing engine which are respectively used for reading and executing corresponding computing tasks; the stream computing engine is used for carrying out task scheduling based on task information of each node in the directed acyclic graph and resource information of each corresponding physical computing node so as to realize low-delay processing of stream data;
The operation result processing module is used for collecting the operation state data of each calculation engine, monitoring the operation state and returning the task calculation result to the client.
Further, the batch computing engine processes batch computing tasks through a distributed computing and mixed load scheduler;
The batch and stream integrated calculation engine performs unified data storage on batch data and stream data in a mode of unified storage management system and row-column mixed storage; and the batch data and the stream data are respectively calculated and processed by an offline processing module and a real-time processing module through mixed load scheduling and vectorization execution.
Further, the row-column mixed storage includes: dividing a table into a plurality of tablets, each tablet comprising MetaData and a plurality of rowsets RowSet, a rowset RowSet comprising one MemRowSet and a plurality of DiskRowSets;
The MemRowSet is used for inserting new data and updating data already in the MemRowSet; the data in the MemRowSet is stored by row, and after one MemRowSet is full, its data is flushed to disk to form a plurality of DiskRowSets;
The DiskRowSet is used for changing old data, and compaction is performed on the DiskRowSets periodically in the background so as to delete useless data and merge historical data; wherein the data in a DiskRowSet is organized by column.
Further, the flow calculation engine comprises a control node module, a calculation node module and a Zookeeper cluster module;
The computing node module comprises a plurality of physical computing nodes and is used for monitoring and executing corresponding flow computing tasks;
The Zookeeper cluster module is deployed on a plurality of servers and is used for storing all state information and task information of a plurality of physical computing nodes so as to enable the computing node module and the control node module to carry out real-time monitoring and calling;
The control node module is used for generating a directed acyclic graph based on the task information to be executed in the stream processing task and the resource information of each physical computing node; and issuing the tasks to be executed to the corresponding physical computing nodes according to the corresponding relation in the directed acyclic graph for processing, and scheduling the tasks based on the resource information of each physical computing node so as to realize low-delay processing of the stream data.
Further, the physical computing node comprises a monitoring process Tracker and a working process Worker; the monitoring process Tracker acquires task information through the Zookeeper cluster module and creates a working process Worker to compute and process the tasks issued to the physical computing node;
The monitoring process Tracker is also used for monitoring task abnormality information of the work process Worker and performing task recovery based on a preset fault recovery flow so as to realize continuous execution of the stream processing task.
Further, the task scheduling based on the resource information of each physical computing node includes:
acquiring resource utilization rate of each physical computing node, wherein the resource utilization rate comprises CPU utilization rate, memory occupancy rate, disk I/O utilization rate and bandwidth utilization rate;
Calculating the proportion of the CPU utilization rate, the memory occupancy rate, the disk I/O utilization rate and the bandwidth utilization rate in the total resource utilization rate;
If the proportion occupied by any one of the utilization rates exceeds a preset threshold, judging that task scheduling is required to be carried out on the task to be executed by the physical computing node;
And calculating to obtain task scheduling priorities corresponding to the physical computing nodes based on the resource information of each physical computing node, and performing task scheduling on tasks to be scheduled based on the physical computing node with the highest task scheduling priority.
Further, the calculating, based on the resource information of each physical computing node, a task scheduling priority corresponding to each physical computing node includes:
Obtaining the residual rate of each resource of each physical computing node based on the utilization rate of each resource of each physical computing node;
obtaining task priority contribution degrees of all the resources according to the residual rates of all the resources of all the physical computing nodes and the resource quantity required by the tasks to be scheduled;
And obtaining task scheduling priorities of tasks to be scheduled corresponding to the physical computing nodes based on the task priority contribution degrees of the resources of the physical computing nodes.
Further, based on the differences between the resource remaining rates of the CPU, the memory and the disk I/O interface and the amounts of those resources required by the task, the task priority contribution degrees of the CPU, the memory and the disk I/O are obtained through the following formula:
fcpu, fmem and fio are the task priority contribution degrees of the CPU, the memory and the disk I/O respectively; Δcpu, Δmem and Δio are the differences between the resource remaining rates of the CPU, the memory and the disk I/O and the amounts of those resources required by the task; taskcpu denotes the amount of CPU resources required by the task, taskmem denotes the amount of memory resources required by the task, taskio denotes the amount of disk I/O resources required by the task, qoscpu, qosmem and qosio are the resource remaining rates of the CPU, the memory and the disk I/O interfaces respectively, and α, β and γ are weight factors.
Further, the task scheduling priority of the task to be scheduled corresponding to the physical computing node is obtained through the following formula:
Rank(task)=ρ+R(task)*fcpu*fmem*fio*v;
Wherein Rank(task) is the task scheduling priority; fcpu, fmem and fio are the task scheduling priority contribution degrees of the CPU, the memory and the disk I/O of the current physical computing node for the task to be scheduled; ρ is a compensation factor; R(task) is the distance of the task to be scheduled from the end task in the scheduling sequence; v is a task processing speed influence factor, v=(1-s)/t; s is the ratio of the output data volume of the task to be scheduled to its input data volume, and t is the elapsed processing time of the task.
Further, each resource amount required by the stream computing task is obtained by the following method respectively:
Independently running the stream computing task on a physical computing node, and separately counting the idle time t1 and the running time t2 of the CPU (Central Processing Unit); the amount of CPU resources taskcpu required by the stream computing task is obtained by the following formula:
taskcpu=1-P=1-t1/(t2*Q);
wherein P is CPU idle rate when independently running tasks, and Q is CPU number;
the memory resource amount taskmem and the disk I/O resource amount taskio required by the stream computing task are obtained through statistics of memory and disk I/O statistics tools provided by the corresponding physical computing nodes.
The invention can realize at least one of the following beneficial effects:
1. The invention integrates various data calculation engines, performs data task analysis based on a unified data interface module, and realizes the parallel high-efficiency real-time processing of stream data, batch data and batch-stream integrated data.
2. The invention calculates the task scheduling priority corresponding to each physical computing node based on the hardware resource information of the physical computing nodes, and performs task scheduling on the task to be scheduled based on the physical computing node with the highest task scheduling priority, so that the streaming data processing task is distributed and executed under the constraint of fixed hardware resources and task priority, the low-delay processing of the streaming processing task is realized, and the real-time performance of the system is improved.
3. The invention introduces compensation factors and task processing speed influencing factors in task scheduling, considers the influence of different resources on scheduling and the influence of task processing speed in task scheduling, compensates running tasks, improves the scheduling priority of the running tasks, ensures the continuity of task running and the correctness and effectiveness of a scheduling algorithm, reduces the system time delay and can better solve the problem of stream data processing bottleneck.
4. The invention realizes the continuous execution of the flow calculation task through the Zookeeper cluster module, the backup control node setting and the fault recovery flow, and ensures the reliability of the flow calculation engine.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, like reference numerals being used to designate like parts throughout the drawings;
FIG. 1 is a block diagram of a distributed big data calculation engine in an embodiment of the invention;
FIG. 2 is a workflow of a stream computation engine in an embodiment of the invention;
Detailed Description
Preferred embodiments of the present application are described in detail below with reference to the attached drawing figures, which form a part of the present application and are used in conjunction with embodiments of the present application to illustrate the principles of the present application.
In one embodiment of the present invention, a distributed big data calculation engine is disclosed, as shown in fig. 1, the big data calculation engine includes: the system comprises a unified interface module, a distributed computing engine module and an operation result processing module; wherein,
The unified interface module is used for receiving a computing task and analyzing the task based on a data type identifier of the computing task so as to start a corresponding computing engine;
Specifically, the distributed big data computing engine of this embodiment is deployed in a domestic server cluster based on a distributed architecture design, and supports servers built on domestic-architecture CPUs such as Feiteng, Shenwei, Loongson, Zhaoxin and Hygon. On the hardware servers, a Docker + Kubernetes basic resource management service is deployed first to realize unified resource management and scheduling of the underlying domestic heterogeneous hardware; then the batch computing, stream computing and batch-stream integrated computing services are deployed, and finally the engine result service is deployed. In this embodiment, computing services are provided externally through JDBC, ODBC, API and other interfaces based on a unified SQL engine; the data type identifier carried in a received computing task is parsed and matched against a preset identifier database of each computing engine, and the corresponding computing engine is started according to the parsing and matching result.
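As an illustration only, the following Python sketch shows how such a unified interface might parse the data type identifier carried in a task and select the matching engine; the identifier values, the ENGINE_BY_IDENTIFIER table and the dispatch_task function are assumptions for this sketch, not part of the patent.

```python
# Hypothetical sketch of the unified interface dispatch logic described above.
# The identifier values and engine names are illustrative assumptions.

ENGINE_BY_IDENTIFIER = {
    "stream": "StreamComputeEngine",
    "batch": "BatchComputeEngine",
    "batch_stream": "BatchStreamUnifiedEngine",
}

def dispatch_task(task: dict) -> str:
    """Parse the data type identifier of a computing task and return
    the name of the compute engine that should be started."""
    identifier = task.get("data_type_identifier")
    engine = ENGINE_BY_IDENTIFIER.get(identifier)
    if engine is None:
        raise ValueError(f"unknown data type identifier: {identifier!r}")
    return engine

# Example: a task tagged as stream data is routed to the stream engine.
print(dispatch_task({"data_type_identifier": "stream", "sql": "SELECT ..."}))
```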
Further, the distributed computing engine module comprises a flow computing engine, a batch computing engine and a batch-flow integrated computing engine which are respectively used for reading and executing corresponding computing tasks;
The stream computing engine is used for carrying out task scheduling based on task information of each node in the directed acyclic graph and resource information of each corresponding physical computing node so as to realize low-delay processing of stream data;
Specifically, the stream computation engine comprises a control node module, a computation node module and a Zookeeper cluster module, wherein,
The computing node module comprises a plurality of physical computing nodes and is used for monitoring and executing corresponding flow computing tasks;
The Zookeeper cluster module is deployed on a plurality of servers and is used for storing all state information and task information of a plurality of physical computing nodes so as to enable the computing node module and the control node module to carry out real-time monitoring and calling;
the control node module is used for generating a directed acyclic graph based on the task information to be executed in the stream processing task and the resource information of each physical computing node; and issuing the tasks to be executed to the corresponding physical computing nodes for processing according to the corresponding relation in the directed acyclic graph, and scheduling the tasks based on the resource information of each physical computing node so as to realize low-delay processing of the stream data.
Specifically, the control node module is the central hub of the distributed stream computing engine and is configured with a master control node and a slave control node. When the stream computing engine is started, the master control node and the slave control node are initialized; the control node started later sends registration information to the control node started earlier, and a heartbeat keep-alive connection between the control nodes is established. In the normal working state, the master control node is in working mode and executes the control tasks, while the slave control node serves as a backup of the master control node and is in standby mode; when the master control node fails, the slave control node takes over its control tasks based on the heartbeat keep-alive connection, which ensures continuous running of the stream processing tasks and improves the reliability of the system.
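A minimal sketch of the standby takeover logic, under assumed heartbeat interval and timeout values (the patent does not specify them); the class and method names are illustrative only.

```python
import time

HEARTBEAT_INTERVAL = 1.0   # seconds between keep-alive messages (assumed)
HEARTBEAT_TIMEOUT = 3.0    # silence longer than this triggers failover (assumed)

class StandbyControlNode:
    """Sketch of the slave control node's heartbeat-based failover check."""

    def __init__(self):
        self.last_heartbeat = time.monotonic()
        self.active = False  # standby mode until the master fails

    def on_heartbeat(self):
        # Called whenever a keep-alive message from the master arrives.
        self.last_heartbeat = time.monotonic()

    def check_master(self):
        # If no heartbeat has arrived within the timeout, take over the
        # master's control tasks so stream processing keeps running.
        if not self.active and time.monotonic() - self.last_heartbeat > HEARTBEAT_TIMEOUT:
            self.active = True
            print("master lost, standby control node takes over control tasks")
```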
In a distributed stream computing system, tasks are the basic unit of scheduling, there is typically a constraint of priority between different tasks, and tasks with higher priority may be executed earlier than tasks with lower priority. The task scheduling in this embodiment is aimed at allocating and executing tasks under the constraint of fixed hardware resources and task priorities, so as to reduce the total time taken for completing the tasks.
Preferably, the task scheduling based on the resource information of each physical computing node includes:
Acquiring the resource utilization rate of each physical computing node, wherein the resource utilization rate comprises CPU utilization rate, memory occupancy rate, disk I/O utilization rate and bandwidth utilization rate; calculating the proportion of the CPU utilization rate, memory occupancy rate, disk I/O utilization rate and bandwidth utilization rate in the total resource utilization rate; if the proportion occupied by any one of the utilization rates exceeds a preset threshold, judging that task scheduling is required for the tasks to be executed by that physical computing node; calculating the task scheduling priority corresponding to each physical computing node based on the resource information of each physical computing node, and scheduling the task to be scheduled onto the physical computing node with the highest task scheduling priority;
The task scheduling priority corresponding to each physical computing node is obtained through calculation by the following method: obtaining the residual rate of each resource of each physical computing node based on the utilization rate of each resource of each physical computing node; obtaining task priority contribution degrees of all the resources according to the residual rates of all the resources of all the physical computing nodes and the resource quantity required by the tasks to be scheduled; and obtaining task scheduling priorities of tasks to be scheduled corresponding to the physical computing nodes based on the task priority contribution degrees of the resources of the physical computing nodes.
Specifically, the resource amount required by the task to be scheduled is obtained by the following methods:
Independently running a stream computing task on a physical computing node, and separately counting the idle time t1 and the running time t2 of the CPU (Central Processing Unit); the amount of CPU resources taskcpu required by the stream computation task is obtained by the following formula:
taskcpu=1-P=1-t1/(t2*Q);
wherein P is CPU idle rate when independently running tasks, and Q is CPU number;
The amount of memory resources taskmem and the amount of disk I/O resources taskio required by the stream computing task are obtained by statistics of memory and disk I/O statistics tools provided by the corresponding physical computing nodes, respectively.
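The CPU requirement formula above can be exercised with a few lines of arithmetic; the sample times below are invented numbers used only to illustrate the calculation.

```python
def required_cpu(idle_time_t1: float, run_time_t2: float, cpu_count_q: int) -> float:
    """taskcpu = 1 - P = 1 - t1 / (t2 * Q), as defined in the embodiment."""
    idle_rate_p = idle_time_t1 / (run_time_t2 * cpu_count_q)
    return 1.0 - idle_rate_p

# Example with made-up measurements: 12 s of idle CPU time over a 10 s run on 4 CPUs.
# taskcpu = 1 - 12 / (10 * 4) = 0.7, i.e. the task needs about 70% of the node's CPU.
print(required_cpu(idle_time_t1=12.0, run_time_t2=10.0, cpu_count_q=4))
```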
In this embodiment, tasks on heavily loaded physical computing nodes that slow down task processing are scheduled onto relatively idle physical computing nodes based on the resource information of the physical computing nodes, which greatly improves the stream data processing speed and solves the problems of high latency and poor real-time performance caused by hardware resource limitations.
More specifically, different tasks require different resources: computation-intensive tasks require more CPU resources, while I/O-intensive tasks require higher I/O efficiency. Scheduling tasks according to the influence of the different resources on scheduling therefore better addresses stream data processing bottlenecks. Since a task consumes several kinds of resources, a mapping between the load of a physical computing node and its resource usage in different dimensions can be established, and task scheduling can be performed for the tasks on each node.
Factors affecting node load generally include CPU usage, memory usage, and I/O efficiency, where:
The CPU is an important computing resource and has a significant influence on task scheduling. The CPU utilization rate is counted by the following method:
the running time and the idle time of the CPU are checked separately; the time difference between two adjacent CPU run-time samples is taken as a calculation period for counting the CPU running time, and the CPU idle time is counted by the same method, so that the CPU utilization rate Usecpu can be obtained from the idle time and the running time;
The memory occupancy Usemem can be counted with the command provided by the Linux system for reporting memory occupation. The disk I/O utilization UseIO is obtained similarly to the memory occupancy, using the I/O statistics tool provided by the system.
The control node module in this embodiment starts a thread dedicated to monitoring the running state of each task node. In general, the running state of a computing node is relatively stable: the usage of each resource does not exceed its threshold and the data flow rate is steady, so periodic task rescheduling is not required. Task scheduling is performed again only when a computing node fails, or when its load exceeds the threshold and jitter becomes relatively large.
In the distributed flow computing platform, task allocation scheduling is performed based on load information of nodes, so that analysis of node loads is critical to node operation efficiency, and the relation between the loads and resources of physical computing nodes is as follows:
L=Usecpu+Usemem+UseIO
Where Usecpu represents the CPU utilization rate, Usemem represents the memory occupancy rate, and UseIO represents the disk I/O utilization rate. In practical application, a certain time interval can be set according to actual conditions, each item of resource information is collected periodically to obtain the load condition of the physical computing node, and the proportion occupied by each resource occupancy is calculated as shown in the following formulas:
Ccpu=ΔUsecpu/ΔL;
Cmem=ΔUsemem/ΔL;
CIO=ΔUseIO/ΔL;
wherein ΔUsecpu, ΔUsemem and ΔUseIO are respectively the differences in CPU utilization, memory occupancy and disk I/O utilization between two adjacent collection periods, and ΔL is the difference in total load between two adjacent collection periods.
The factor with the greatest influence on the load can be identified from Ccpu, Cmem and CIO, which also reveals which factor may become the bottleneck of the node. A large number of experiments show that when the largest of Ccpu, Cmem and CIO exceeds 30%, large load fluctuations are likely to occur, so the threshold is set to 30% in this embodiment; that is, when any one of Ccpu, Cmem or CIO exceeds the 30% threshold, task scheduling is performed for the tasks corresponding to that physical computing node. Conversely, when Ccpu, Cmem or CIO takes a small value, the corresponding resource is relatively free, and tasks needing that kind of resource should preferentially be scheduled to that node.
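The following sketch illustrates the bottleneck check just described. Only the formulas L = Usecpu + Usemem + UseIO, C = ΔUse/ΔL and the 30% threshold come from the embodiment; the sample utilization values and the function name are invented.

```python
# Illustrative sketch of the per-node load bottleneck check described above.
THRESHOLD = 0.30

def bottleneck_check(prev: dict, curr: dict) -> dict:
    """Return the share of the load change attributable to each resource
    and whether rescheduling is needed for this physical computing node."""
    delta = {k: curr[k] - prev[k] for k in ("cpu", "mem", "io")}
    delta_l = sum(delta.values())                      # ΔL, change in total load
    shares = {k: (d / delta_l if delta_l else 0.0) for k, d in delta.items()}
    needs_reschedule = any(abs(s) > THRESHOLD for s in shares.values())
    return {"shares": shares, "needs_reschedule": needs_reschedule}

# Two consecutive collection periods with made-up utilization values.
prev = {"cpu": 0.40, "mem": 0.30, "io": 0.10}
curr = {"cpu": 0.70, "mem": 0.35, "io": 0.15}
print(bottleneck_check(prev, curr))   # CPU accounts for 75% of the load change -> reschedule
```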
More specifically, based on the differences between the resource remaining rates of the CPU, the memory and the disk I/O interface and the amounts of those resources required by the task, the task priority contribution degrees of the CPU, the memory and the disk I/O are obtained through the following formula:
fcpu, fmem and fio are the task priority contribution degrees of the CPU, the memory and the disk I/O respectively; Δcpu, Δmem and Δio are the differences between the resource remaining rates of the CPU, the memory and the disk I/O and the amounts of those resources required by the task; taskcpu denotes the amount of CPU resources required by the task, taskmem denotes the amount of memory resources required by the task, taskio denotes the amount of disk I/O resources required by the task, qoscpu, qosmem and qosio are the resource remaining rates of the CPU, the memory and the disk I/O interfaces respectively, and α, β and γ are weight factors.
Preferably, qoscpu, qosmem and qosio are obtained by the following formulas:
qoscpu=1–Usecpu
qosmem=1–Usemem
qosio=1–Useio
The differences Δcpu, Δmem and Δio between the remaining rates of the CPU, the memory and the disk I/O and the amounts of CPU, memory and disk I/O resources required by the task are obtained by the following formulas:
Δcpu=qoscpu-taskcpu
Δmem=qosmem-taskmem
Δio=qosio-taskio
Finally, the task scheduling priority corresponding to the physical computing node is obtained through the following formula:
Rank(task)=ρ+R(task)*fcpu*fmem*fio*v;
Wherein Rank (task) is task scheduling priority; fcpu, fmem, fio is the task priority contribution of CUP, memory and disk I/O respectively; ρ is a compensation factor and, R (task) is the size of the node task from the end point task in the scheduling sequence; v is a task processing speed influence factor, v= (1-s)/t; s is the ratio of the output data quantity to the input quantity of the task, and t is the processed time of the task.
The higher the task scheduling priority is, the higher the efficiency of processing the corresponding task to be scheduled is, therefore, after the task scheduling priority is obtained by calculation, the task to be scheduled is scheduled to the physical computing node with the highest corresponding priority for calculation processing, so that the running efficiency of the system is improved, and the delay of data processing is reduced.
It should be noted that, taking fcpu as an example, when Δcpu>0 the remaining CPU rate of the physical computing node exceeds the CPU amount required by the task, and fcpu has a positive effect on the task scheduling priority Rank(task); this positive effect is correlated with the weight factor α, and the values of α, β and γ all lie between 0 and 1 and can be set according to the actual application. When Δcpu<0, the task's CPU demand is greater than the current node's remaining CPU rate, and according to the formula fcpu then has a negative effect on the task scheduling priority Rank(task). For a node, the more input data a task processes per unit time, the less memory it consumes; this embodiment therefore introduces the task processing speed influence factor v, defined as v=(1-s)/t, which takes the influence of task processing speed into account and improves the effectiveness of the task scheduling algorithm. In addition, this embodiment introduces a compensation factor ρ: running tasks are compensated so that their scheduling priority is raised, which guarantees the continuity of task execution and the correctness and effectiveness of the scheduling algorithm, and reduces system latency.
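The priority computation can be sketched directly from the formulas given above: qos = 1 - Use, Δ = qos - task requirement, v = (1-s)/t and Rank(task) = ρ + R(task)·fcpu·fmem·fio·v. The contribution degrees fcpu, fmem and fio are passed in as already-computed values here, because their exact formula is not reproduced in this text; all numeric inputs in the example are invented.

```python
# Sketch of the scheduling priority computation described above.

def remaining_rates(use_cpu: float, use_mem: float, use_io: float):
    # qos = 1 - Use for each resource, per the embodiment.
    return 1 - use_cpu, 1 - use_mem, 1 - use_io

def deltas(qos, task_req):
    # Δ = remaining rate minus the amount of that resource the task requires.
    return tuple(q - r for q, r in zip(qos, task_req))

def rank(rho: float, r_task: float, fcpu: float, fmem: float, fio: float,
         s: float, t: float) -> float:
    v = (1 - s) / t                      # task processing speed influence factor
    return rho + r_task * fcpu * fmem * fio * v

qos = remaining_rates(use_cpu=0.4, use_mem=0.5, use_io=0.3)
print(deltas(qos, task_req=(0.2, 0.1, 0.1)))      # (Δcpu, Δmem, Δio)
print(rank(rho=0.1, r_task=3, fcpu=0.8, fmem=0.9, fio=0.7, s=0.5, t=2.0))
```

The task to be scheduled is then placed on the physical computing node whose Rank(task) is highest.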
Further, as shown in fig. 2, the operation of the stream processing computing engine includes:
The Master control node Master and the slave control node Secondary Master are initialized in sequence; the control node started later sends registration information to the control node started earlier, and a heartbeat keep-alive connection between the control nodes is established;
The monitoring process Tracker of each physical computing node in the computing node module is started and initialized;
the computing node monitoring process acquires metadata such as hardware resource information of a physical computer through a Zookeeper cluster module, registers the metadata with a main control node and reports the hardware resource information;
a user sends a task starting instruction to a Master control node Master through a management interface Portal process;
the main control node parses the StreamSQL to obtain the static topology (static directed acyclic graph, static DAG) of the task, and performs task scheduling according to the cluster resources and the currently running tasks to obtain the dynamic topology (dynamic directed acyclic graph, dynamic DAG) of the task;
the main control node transmits the operation instruction to the corresponding physical computing node according to the corresponding relation of the dynamic topological graph;
the computing node monitoring process Tracker creates a working process Worker according to the instruction of the control node and runs the designated stream processing task;
the computing node monitoring process Tracker reports the result of creating the working process to the main control node;
after the Master control node Master waits for the working processes of all the computing units in the task topological graph to be established, an instruction for connecting the downstream computing units is sent to the computing node monitoring process according to the topological graph relation of the task;
the computing node monitoring process Tracker forwards a downstream connection instruction to a working process Worker, and the working process Worker establishes network connection with a working process of a downstream computing node;
the computing node monitoring process Tracker reports the result of the working process Worker's downstream connection to the Master control node Master;
After the Master control node Master waits for the complete establishment of the topological graph of the task, the Master control node Master replies the task starting execution result to the management interface.
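As a toy illustration of the workflow above, the sketch below turns a task's static topology into a dynamic DAG by assigning each logical compute unit to a physical node. The example topology, the node list and the round-robin placement are assumptions made for brevity; real placement follows the priority formula described earlier.

```python
# Toy mapping of a static DAG (parsed from StreamSQL) to a dynamic DAG whose
# edges name the physical nodes the Workers must connect to.

static_dag = {"source": ["map"], "map": ["sink"], "sink": []}   # illustrative topology
nodes = ["node-1", "node-2"]

def build_dynamic_dag(static_dag: dict, nodes: list):
    placement = {}
    for i, unit in enumerate(static_dag):
        placement[unit] = nodes[i % len(nodes)]   # simple round-robin placement (assumed)
    # Each edge now carries both the logical link and the target physical node.
    dynamic = {
        unit: [(child, placement[child]) for child in children]
        for unit, children in static_dag.items()
    }
    return dynamic, placement

dynamic_dag, placement = build_dynamic_dag(static_dag, nodes)
print(placement)      # which physical node runs each compute unit
print(dynamic_dag)    # downstream connections the Workers must establish
```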
Further, in practical application, the control node module may be provided with a resource management sub-module, a task management sub-module, a task scheduling sub-module, a data source management sub-module and an output management sub-module.
The task management sub-module is used for parsing the to-be-executed task information of the stream processing provided by the client and for generating and managing the Directed Acyclic Graph (DAG). More specifically, a distributed stream computing task is organized in the form of a Directed Acyclic Graph (DAG); the control node generates the topological relation of the task by parsing the StreamSQL. The topological relation contains the execution order of each task to be executed in the stream processing task and the physical information of the computing nodes, including the id of each physical computing node, its parent node list and child node list, and records the IP address, listening port, process number, process state and the like of the physical computing node where the working process is located.
The task management sub-module mainly manages the state information and operations of tasks. In this embodiment a task information table is used to manage tasks: the task table is a hash table in which the task states and the managed Directed Acyclic Graph (DAG) information are stored, so as to ensure the reliability of task management. The business_table contains the task name, task id, task state information, task storage location and a pointer to the DAG.
The scheduling algorithm sub-module is used for distributing tasks to corresponding physical computing nodes according to Directed Acyclic Graph (DAG) information after receiving a task starting request sent by a client, rescheduling the tasks when the utilization rate of certain resources fluctuates greatly or exceeds a threshold value, and starting a fault recovery algorithm to schedule the tasks if the nodes fail.
The resource management sub-module is used for collecting the CPU information, bandwidth information, memory information and disk I/O information of the physical machine where the computing node is located, collecting and counting the resource usage information, and providing raw data for the scheduling algorithm. During system operation, the computing node monitoring process Tracker periodically collects the remaining hardware resource information of the computing node and the resource occupation information of the working processes, places them in a heartbeat protocol packet and sends them to the control node in a piggybacked manner. The control node module uses a sliding window mechanism to count the hardware resource information reported by the computing nodes: each time the latest resource report is received, the oldest resource information in the current window is replaced. The average of the hardware resource information within the sliding statistical window is taken as the actual hardware resource usage information.
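A minimal sketch of that sliding-window statistic follows; the window size of 5 reports and the class name are assumptions, since the text does not fix them.

```python
from collections import deque

class ResourceWindow:
    """Sliding-window statistics for resource reports piggybacked on heartbeats,
    as described above. The window size is an assumed value."""

    def __init__(self, size: int = 5):
        self.reports = deque(maxlen=size)  # the oldest report is dropped automatically

    def on_report(self, cpu: float, mem: float, io: float):
        # Each new heartbeat report replaces the oldest entry in the window.
        self.reports.append((cpu, mem, io))

    def current_usage(self):
        # The average over the window is taken as the actual resource usage.
        n = len(self.reports)
        return tuple(sum(r[i] for r in self.reports) / n for i in range(3))

w = ResourceWindow()
for report in [(0.5, 0.4, 0.2), (0.6, 0.4, 0.3), (0.7, 0.5, 0.3)]:
    w.on_report(*report)
print(w.current_usage())   # averaged (cpu, mem, io) usage over the window
```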
The data source management sub-module performs unified management on the data source configuration such as a source table, a dimension table and the like used by the execution of the streamSQL task, selects the managed data source to operate when writing the SQL task, and avoids the need of manually inputting connection information for each task.
Data source management includes: newly added data sources: for the newly added data source, the type of the added data source needs to be selected, and the connection configuration parameters of the data sources of different types are also different. And after the connection information is filled in, performing connection test on the data source, and after connection establishment is successful, storing the connection information of the current data source. Deleting the data source: when some data sources are no longer in use, the data source is removed from the managed data sources. Modifying the data source: when some data source connection information is changed, the data source is reconnected by modifying the data source parameter configuration.
The output management sub-module is responsible for the persistence of stream computation results, supporting distributed file systems (HDFS), relational databases (e.g., MySQL, KunDB, etc.), columnar or row-column hybrid storage systems (e.g., HBase, Holodesk, etc.), search storage systems (e.g., Elasticsearch) and event-type storage systems (e.g., Kafka). Here, the relational database MySQL, the columnar storage system HBase and the search storage system Elasticsearch are selected to introduce their relevant configurations.
Furthermore, the Zookeeper cluster module adopts a one-leader, multiple-follower arrangement in which the leader and each follower are deployed on different servers and data is synchronized from the leader to the followers, so that the leader and every follower store all of the state information and task information of the same physical computing nodes; when the leader fails, a new leader is elected from the followers through an election mechanism, keeping the distributed engine running continuously. That is, the Zookeeper module is a cluster formed by one leader and several followers, each server stores an identical copy of the data, a client sees consistent data no matter which server it connects to, and the leader handles the forwarding of distributed read-write and update requests.
The computing node modules are deployed on different physical clusters (physical computing nodes) and are responsible for receiving the tasks distributed by the control node, managing the working processes (such as starting and stopping them) and running the specific tasks. A physical computing node comprises a monitoring process Tracker and a working process Worker; the monitoring process Tracker acquires task information through the Zookeeper cluster module and creates a working process Worker to compute and process the tasks issued to the physical computing node;
the monitoring process Tracker is also used for monitoring task exception information of the work process Worker and performing task recovery based on a preset fault recovery flow so as to realize continuous execution of the flow processing task.
Specifically, performing task recovery based on a preset failure recovery flow includes:
Receiving, through the monitoring process Tracker, the task exception message of the failed working process Worker1 and forwarding it to the control node module;
The control node module issues a rescheduling command to the monitoring process Tracker1 where the working process Worker1 is located; the monitoring process Tracker1 issues a command to terminate the working process Worker1, and the Worker process list of the control node module is maintained and updated;
The control node module checks the current resource information of each physical computing node and the system resources required to replace the working process Worker1, so as to recalculate the task scheduling priorities, and issues a command to the monitoring process Tracker2 on the physical computing node with the highest task scheduling priority to pull up a working process Worker2;
The working process Worker2 loads the program run by the working process Worker1 according to the location designated in the command and connects to the downstream node of the working process Worker1;
The control node module notifies the upstream node of the working process Worker1 to actively connect to the working process Worker2 and to send the data stream it carries to the working process Worker2 according to the distribution rule preset by the stream computing task, completing the task recovery.
Because dynamic data continuously generated by the data flow needs to be processed, the flow calculation engine needs to achieve the effects of rapidness, high efficiency and low delay on the data processing.
Further, the batch computing engine processes batch computing tasks through a distributed computing and mixed load scheduler;
specifically, the batch calculation engine comprises a storage layer, an execution layer, a compiling layer and a service layer;
The storage layer is composed of two functional modules, namely a data source connector and a data table management module, so that the batch processing engine is adapted to be connected with various data sources and manage various data tables, and the requirement of multi-source data analysis is met.
The execution layer consists of two functional modules of a distributed computing node and a mixed load scheduler, so that the distributed execution and task scheduling of the large data volume analysis task are realized;
specifically, the hybrid load scheduler of the present embodiment includes a first-in first-out scheduling policy and a fair scheduling policy:
Wherein, the first-in first-out scheduling strategy comprises: queuing all SQL tasks in a scheduler according to the submitted time sequence; when idle resources appear, the idle resources are preferentially distributed to the tasks submitted first; and when no idle resources exist, queuing the subsequent tasks. That is, all tasks are performed in the order of commit.
The fair scheduling strategy comprises a resource pool level and a fair scheduling strategy of two levels in the resource pool;
The resource-pool-level scheduling strategy sets a Minimum Share (the minimum occupied resources) and a Pool Weight (the weight of the resource pool) for each resource pool. The Pool Weight represents the priority of the resource pool: when idle resources exist, the tasks in the resource pool with the higher priority acquire resources first. The Minimum Share represents the minimum resources occupied by a resource pool: even if a resource pool with higher priority is contending for resources, the minimum resource amount of a low-priority resource pool is guaranteed first. The intra-pool scheduling policy shares all resources fairly among the tasks submitted to the same resource pool, regardless of task submission order.
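The sketch below illustrates the two ideas just described (Minimum Share first, then Pool Weight). The pool definitions, the total resource figure and the allocate function are invented for illustration; they are not the engine's actual scheduler.

```python
# Minimal sketch of a fair allocation honouring Minimum Share and Pool Weight.

def allocate(pools: dict, total: int) -> dict:
    """pools maps name -> {"min_share": int, "weight": float}."""
    # First guarantee every pool its Minimum Share.
    alloc = {name: p["min_share"] for name, p in pools.items()}
    remaining = total - sum(alloc.values())
    # Then hand out the rest in proportion to Pool Weight (priority).
    weight_sum = sum(p["weight"] for p in pools.values())
    for name, p in pools.items():
        alloc[name] += int(remaining * p["weight"] / weight_sum)
    return alloc

pools = {
    "high_priority": {"min_share": 10, "weight": 3.0},
    "low_priority": {"min_share": 10, "weight": 1.0},
}
print(allocate(pools, total=100))   # {'high_priority': 70, 'low_priority': 30}
```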
The embodiment realizes unified multi-task execution and multi-level scheduling through the hybrid load scheduler, optimizes the utilization rate of the whole resources of the cluster, realizes the functions of queue scheduling, quota storage, load balancing and the like, and improves the whole job throughput of the distributed batch computing engine.
Further, the compiling layer comprises an SQL compiler, a stored procedure compiler, a transaction management unit, an optimizer and other components. The batch computing engine optimizes the execution plans of insert, delete and query SQL through the SQL compiler and executes them concurrently in the cluster through the distributed engine, which meets the high-throughput requirements of batch processing services; at the same time it provides support for transactional operations, meeting the business requirements of batch processing services.
Further, the service layer provides unified development interface service, engine ecological connection management service, security and high availability management service and the like for the batch processing computing engine.
Preferably, the batch-stream integrated calculation engine performs unified data storage on batch data and stream data in a mode of unified storage management system and row-column mixed storage; the batch data and the stream data are respectively calculated and processed by an offline processing module and a real-time processing module through mixed load scheduling and vectorization execution;
Specifically, conventional database query execution typically adopts a tuple-at-a-time pipeline execution model, in which most of the CPU's processing time is spent traversing the query operation tree rather than actually processing data, resulting in low CPU utilization, poor instruction cache performance and frequent jumps. This embodiment adopts a vectorized execution strategy that changes the tuple-at-a-time execution model and uses the CPU's SIMD acceleration instructions (single instruction, multiple data) to achieve data parallelism and improve computational efficiency.
Preferably, the vectorization execution engine may be implemented by the following method:
Based on the Volcano model, the tuple-at-a-time processing mode is changed into a mode that returns a batch of column-stored row values (for example, 100 to 1000 rows) at a time, so as to improve computational efficiency; alternatively, a compiled execution model is adopted, converting the optimized execution plan tree into compiled execution based on a hierarchical execution mode, that is, for each call the data is returned upward only after each layer has completed, which minimizes the number of calls between the nodes of each layer and improves the effective computational efficiency of the CPU.
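A toy contrast between tuple-at-a-time and batch-at-a-time (vectorized) execution follows; the batch size of 1000 and the "scan"/"filter" operators are assumptions for the sketch, not part of the patent text.

```python
# Batch-at-a-time operators: the per-call overhead of the Volcano model is
# paid once per batch instead of once per tuple.

BATCH_SIZE = 1000

def scan_batches(table):
    # Source operator: yield batches of rows instead of single tuples.
    for start in range(0, len(table), BATCH_SIZE):
        yield table[start:start + BATCH_SIZE]

def filter_batches(batches, predicate):
    # Each operator call processes a whole batch of rows.
    for batch in batches:
        yield [row for row in batch if predicate(row)]

table = list(range(10_000))
result = sum(len(b) for b in filter_batches(scan_batches(table), lambda x: x % 2 == 0))
print(result)   # 5000 even values, processed 1000 rows per operator call
```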
Further, consider an existing computing engine such as the Lambda architecture: after data passes through the Kafka message middleware and enters the Lambda architecture, it simultaneously enters two processing modules, offline processing (Hadoop) and real-time processing (Storm). Offline processing performs batch calculations and aggregates large amounts of data, while real-time processing performs stream processing or micro-batch processing and computes results within seconds or minutes. Finally, the data is written into a service database (service DB) for summarization and exposed to upper-layer service calls. However, this existing approach requires maintaining two sets of code, for real-time processing and offline processing, while also guaranteeing the consistency of the results produced by the two code paths, so the complexity of the architecture is high.
The embodiment adopts high-performance row-column mixed storage, utilizes real-time computing service and offline computing service in a computing layer, thoroughly opens up metadata, adopts an event-driven mode, ensures that data can be processed in real time with low delay, and can be searched and queried in the offline computing service immediately after the data is written.
The built-in small-file merging service of the storage layer can merge and archive the small files generated by the real-time warehousing task without the upper-layer application being aware of it;
Specifically, in order to guarantee write performance, every write operation writes a new file at the bottom layer. When frequent write operations with small data volumes are performed, a large number of base/delta files appear whose content is very small (KB level); files below 32 MB are regarded as small files, and when there are too many of them the IO overhead increases greatly. This embodiment designs a Compact functional module to merge multiple small files into one file, thereby solving the problems caused by small files.
Specifically, small-file merging falls into three categories: Full, Minor and Major.
Combining a plurality of base files into one base file by full compact, and deleting delta files together;
The minor compact merges the delta files of a base to generate a new delta file, and applies the new delta file to the original base file;
The major compact merges a base and delta files thereof to generate a new base file;
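A toy model of the three compaction types follows; files are represented as plain dicts, since the real base/delta file formats are not specified at this level of detail in the text.

```python
# Illustrative sketches of minor, major and full compaction as described above.

def minor_compact(base: dict, deltas: list) -> dict:
    # Merge the delta files of one base into a single new delta file.
    merged_delta = {}
    for d in deltas:
        merged_delta.update(d)
    return {"base": base, "deltas": [merged_delta]}

def major_compact(base: dict, deltas: list) -> dict:
    # Merge a base file and its delta files into one new base file.
    new_base = dict(base)
    for d in deltas:
        new_base.update(d)
    return {"base": new_base, "deltas": []}

def full_compact(rowsets: list) -> dict:
    # Merge several base files (with their deltas) into one base file.
    new_base = {}
    for rs in rowsets:
        new_base.update(major_compact(rs["base"], rs["deltas"])["base"])
    return {"base": new_base, "deltas": []}

rs = {"base": {"k1": 1, "k2": 2}, "deltas": [{"k2": 20}, {"k3": 3}]}
print(major_compact(rs["base"], rs["deltas"]))   # {'base': {'k1': 1, 'k2': 20, 'k3': 3}, 'deltas': []}
```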
The embodiment solves the problems that too many small files greatly increase IO overhead, influence the performance of a computing engine and further influence the stability of a system through small file merging.
Further, the row-column mixed storage includes: a table is divided into tablets, each of which includes MetaData meta-information and RowSets; a RowSet includes one MemRowSet and several DiskRowSets. The MemRowSet is used for inserting new data and updating data already in the MemRowSet; the data in the MemRowSet is stored by row, and when one MemRowSet is full its data is flushed to disk to form DiskRowSets. The DiskRowSet is used for changing old data (mutation); the background periodically performs compaction on the DiskRowSets to delete useless data and merge historical data, reducing the IO overhead during queries, and the data in a DiskRowSet is organized by column.
More specifically, the data write process of the batch flow unified computing engine is as follows:
the client connects to the Master to acquire the relevant information of the table, including the partition information and the information of all tablets in the table;
The client finds the Data Server where the tablet responsible for handling the read-write request is located. The platform receives the client's request and checks whether it conforms to the table structure;
The platform searches all RowSets (MemRowSet and DiskRowSets) in the tablet to confirm whether data with the same primary key as the data to be inserted already exists; if so, an error is returned, otherwise the process continues;
Write operations are first committed to the tablet's write-ahead log (WAL) and replicated to the follower nodes according to the Raft consensus algorithm, then applied to the memory of one of the tablets, with the insert added to that tablet's MemRowSet. To support multi-version concurrency control (MVCC) in the MemRowSet, update and delete operations on a recently inserted row (i.e., a new row that has not yet been flushed to disk) are appended after the original row in the MemRowSet to generate a list of REDO records;
Streaming data is written to the MemRowSet; when the MemRowSet reaches a certain size, its data is flushed to disk to generate a DiskRowSet for persistence, and a new MemRowSet is created to continue receiving new data. The background periodically performs compaction on the DiskRowSets to delete useless data and merge historical data.
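A simplified sketch of this write path follows: duplicate primary-key check across rowsets, append to the write-ahead log, insert into the MemRowSet, and flush to a DiskRowSet once the MemRowSet is full. Raft replication and MVCC REDO records are omitted, and the flush limit is an invented value.

```python
# Condensed model of the tablet write path described above.

MEMROWSET_LIMIT = 4   # rows before a flush, for illustration only

class Tablet:
    def __init__(self):
        self.wal = []           # write-ahead log entries
        self.mem_rowset = {}    # row-oriented in-memory data, keyed by primary key
        self.disk_rowsets = []  # column-oriented persisted rowsets (modelled as dicts)

    def insert(self, key, row):
        if key in self.mem_rowset or any(key in d for d in self.disk_rowsets):
            raise KeyError(f"primary key {key!r} already exists")
        self.wal.append(("insert", key, row))     # commit to the WAL first
        self.mem_rowset[key] = row                # then apply to the MemRowSet
        if len(self.mem_rowset) >= MEMROWSET_LIMIT:
            self.disk_rowsets.append(self.mem_rowset)   # flush as a DiskRowSet
            self.mem_rowset = {}                        # new MemRowSet for new data

t = Tablet()
for i in range(5):
    t.insert(i, {"value": i * 10})
print(len(t.disk_rowsets), len(t.mem_rowset))   # 1 DiskRowSet flushed, 1 row still in memory
```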
It should be emphasized that, in the batch-stream integrated computing engine solution, a core problem is that the service database layer, i.e. the stream processing storage layer, becomes a bottleneck, making it impossible to combine efficient writing with real-time analysis of the service. In existing open-source schemes, if every piece of data must be written in real time and be queryable in real time, the write throughput is likely to drop and a serious small-file problem arises; on the other hand, if data is written in batches to improve throughput, the scheme cannot fully meet the real-time requirements on the data.
The storage management system designed by the batch-flow integrated computing engine can solve the problem of excessive small files while supporting high-efficiency writing of data; the embodiment performs depth integration on real-time calculation and offline calculation; in the calculation layer, the real-time calculation task and the offline calculation service are thoroughly communicated on metadata, and an event driving mode is adopted, so that the data can be processed in real time with low delay, and the data can be searched and inquired in the offline calculation service immediately after being written; in the storage layer, a set of efficient storage system is developed, and on the premise of ensuring the writing throughput of an approximate file system, the built-in small file merging service can be utilized to merge and archive small files generated by a real-time warehousing task on the premise of no perception of upper-layer application.
Further, the operation result processing module of the big data computing engine of the embodiment is used for collecting the operation state data of each computing engine, monitoring the operation state and returning the task computing result to the client;
Specifically, the operation result processing module can collect operation state data in the task execution process, monitor the operation state and write in an operation log, if an error or a problem occurs, start an error alarm, and return the calculation result of the executed calculation task to the corresponding client.
In summary, the distributed big data computing engine integrates various data computing engines, performs data task analysis based on the unified data interface module, and realizes parallel, efficient and real-time processing of stream data, batch data and batch-stream integrated data. The flow computing engine calculates task scheduling priorities corresponding to the physical computing nodes based on hardware resource information of the physical computing nodes, and performs task scheduling on tasks to be scheduled based on the physical computing node with the highest task scheduling priority, so that the flow data processing tasks are distributed and executed under the constraint of fixed hardware resources and task priorities, low-delay processing of the flow processing tasks is realized, and real-time performance of the system is improved. And the compensation factors and the task processing speed influence factors are introduced into the task scheduling, so that the influence of different resources on the scheduling is considered during the task scheduling, the influence of the task processing speed is considered, the running task is compensated, the scheduling priority of the running task is improved, the running continuity of the task and the correctness and effectiveness of a scheduling algorithm are ensured, the system time delay is reduced, and the problem of stream data processing bottleneck can be better solved. Continuous execution of the flow calculation task is realized through the Zookeeper cluster module, the backup control node setting and the fault recovery flow, and the reliability of the flow calculation engine is ensured; the bottom layer storage of the batch-stream integrated computing engine adopts high-performance row-column mixed storage, so that the problem that the existing computing engine cannot achieve efficient writing and real-time analysis of services due to the bottleneck of a stream processing storage layer is solved.
Those skilled in the art will appreciate that implementing all or part of the processes of the methods in the above embodiments may be accomplished by computer programs to instruct related hardware, and that the programs may be stored in a computer readable storage medium. Wherein the computer readable storage medium is a magnetic disk, an optical disk, a read-only memory or a random access memory, etc.
The present invention is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present invention are intended to be included in the scope of the present invention.

Claims (10)

1. A distributed big data computing engine, comprising: a unified interface module, a distributed computing engine module and an operation result processing module;
The unified interface module is used for receiving a computing task and analyzing the task based on a data type identifier of the computing task so as to start a corresponding computing engine;
The distributed computing engine module comprises a stream computing engine, a batch computing engine and a batch-stream integrated computing engine, which are respectively used for reading and executing corresponding computing tasks; the stream computing engine is used for carrying out task scheduling based on task information of each node in a directed acyclic graph and resource information of each corresponding physical computing node so as to realize low-delay processing of stream data;
The operation result processing module is used for collecting the operation state data of each calculation engine, monitoring the operation state and returning the task calculation result to the client.
2. The distributed big data calculation engine of claim 1, wherein,
The batch computing engine processes batch computing tasks through distributed computation and a mixed-load scheduler;
The batch-stream integrated computing engine performs unified storage of batch data and stream data by means of a unified storage management system and row-column mixed storage; and the batch data and the stream data are respectively computed and processed by an offline processing module and a real-time processing module through mixed-load scheduling and vectorized execution.
3. The distributed big data computing engine of claim 2, wherein the row-column mixed storage comprises: dividing a table into a plurality of tablets, each tablet comprising metadata MetaData and a plurality of row sets RowSet, and each row set RowSet comprising one MemRowSet and a plurality of DiskRowSet;
the MemRowSet is used for inserting new data and updating data already in MemRowSet; data in MemRowSet is stored by row, and after one MemRowSet is full, its data is flushed to disk to form a DiskRowSet;
the DiskRowSet is used for changes to old data, and compaction is periodically carried out on DiskRowSet in the background so as to delete unused data and merge historical data; wherein the data in DiskRowSet is organized by column.
4. The distributed big data computing engine of claim 1, wherein the stream computing engine comprises a control node module, a compute node module, and a Zookeeper cluster module;
The computing node module comprises a plurality of physical computing nodes and is used for monitoring and executing corresponding flow computing tasks;
The Zookeeper cluster module is deployed on a plurality of servers and is used for storing all state information and task information of a plurality of physical computing nodes so as to enable the computing node module and the control node module to carry out real-time monitoring and calling;
The control node module is used for generating a directed acyclic graph based on the task information to be executed in the stream processing task and the resource information of each physical computing node; and issuing the tasks to be executed to the corresponding physical computing nodes according to the corresponding relation in the directed acyclic graph for processing, and scheduling the tasks based on the resource information of each physical computing node so as to realize low-delay processing of the stream data.
5. The distributed big data computing engine of claim 4, wherein the physical computing nodes comprise a monitoring process Tracker and a work process Worker; the monitoring process Tracker acquires task information through the Zookeeper cluster module and creates a work process Worker to compute and process the tasks issued to the physical computing node;
The monitoring process Tracker is also used for monitoring task abnormality information of the work process Worker and performing task recovery based on a preset fault recovery flow so as to realize continuous execution of the stream processing task.
6. The distributed big data computing engine of claim 4, wherein the task scheduling based on the resource information of each physical computing node comprises:
acquiring resource utilization rate of each physical computing node, wherein the resource utilization rate comprises CPU utilization rate, memory occupancy rate, disk I/O utilization rate and bandwidth utilization rate;
Calculating the proportion of the CPU utilization rate, the memory occupancy rate, the disk I/O utilization rate and the bandwidth utilization rate in the total resource utilization rate;
If the proportion occupied by any one of the utilization rates exceeds a preset threshold, judging that task scheduling is required to be carried out on the task to be executed by the physical computing node;
And calculating to obtain task scheduling priorities corresponding to the physical computing nodes based on the resource information of each physical computing node, and performing task scheduling on tasks to be scheduled based on the physical computing node with the highest task scheduling priority.
7. The distributed big data computing engine of claim 4, wherein the computing, based on the resource information of each physical computing node, a task scheduling priority corresponding to each physical computing node includes:
Obtaining the residual rate of each resource of each physical computing node based on the utilization rate of each resource of each physical computing node;
obtaining task priority contribution degrees of all the resources according to the residual rates of all the resources of all the physical computing nodes and the resource quantity required by the tasks to be scheduled;
And obtaining task scheduling priorities of tasks to be scheduled corresponding to the physical computing nodes based on the task priority contribution degrees of the resources of the physical computing nodes.
8. The distributed big data computing engine of claim 5, wherein the task priority contribution degrees of the CPU, the memory and the disk I/O are obtained, based on the differences between the resource residual rates corresponding to the CPU, the memory and the disk I/O interfaces and the amounts of resources required by the task, by the following formula:
wherein fcpu, fmem and fio are the task priority contribution degrees of the CPU, the memory and the disk I/O respectively; Δcpu, Δmem and Δio are respectively the differences between the resource residual rates of the CPU, the memory and the disk I/O and the amounts of the corresponding resources required by the task; taskcpu denotes the amount of CPU resources required by the task, taskmem denotes the amount of memory resources required by the task, and taskio denotes the amount of disk I/O resources required by the task; qos_cpu, qos_mem and qos_io are the resource residual rates of the CPU, the memory and the disk I/O interfaces respectively; and α, β and γ are weight factors.
9. The distributed big data computing engine of claim 6, wherein the task scheduling priority of the task to be scheduled corresponding to the physical computing node is obtained by the following formula:
Rank(task)=ρ+R(task)*fcpu*fmem*fio*v;
wherein Rank(task) is the task scheduling priority; fcpu, fmem and fio are the task scheduling priority contribution degrees of the CPU, the memory and the disk I/O of the current physical computing node to the task to be scheduled, respectively; ρ is a compensation factor; R(task) is the distance of the task to be scheduled from the end-point task in the scheduling sequence; v is a task processing speed influence factor, v = (1-s)/t, where s is the ratio of the output data quantity to the input data quantity of the task to be scheduled, and t is the processed time of the task.
10. The distributed big data computing engine of claim 6, wherein the amount of resources required by the stream computing task is obtained by:
running the stream computing task independently on a physical computing node, and counting the CPU (Central Processing Unit) idle time t1 and the running time t2 respectively; the amount of CPU resources taskcpu required by the stream computing task is obtained by the following formula:
taskcpu=1-P=1-t1/(t2*Q);
wherein P is the CPU idle rate when the task runs independently, and Q is the number of CPUs;
the memory resource amount taskmem and the disk I/O resource amount taskio required by the stream computing task are obtained through the memory and disk I/O statistics tools provided by the corresponding physical computing node.
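For illustration of claim 10, the following sketch computes the CPU resource amount required by a stream computing task from the measured idle time t1, running time t2 and CPU count Q; the function name and the use of seconds as the time unit are assumptions.

```python
def required_cpu(idle_time_s, run_time_s, cpu_count):
    """taskcpu = 1 - P = 1 - t1 / (t2 * Q), where t1 is the accumulated CPU
    idle time, t2 the running time and Q the number of CPUs (claim 10)."""
    idle_rate = idle_time_s / (run_time_s * cpu_count)  # P, the CPU idle rate
    return 1.0 - idle_rate


# Example: 120 s of accumulated idle time over a 60 s run on 4 CPUs
# gives taskcpu = 1 - 120 / (60 * 4) = 0.5.
print(required_cpu(120.0, 60.0, 4))
```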
CN202311844217.6A 2023-12-28 2023-12-28 Distributed big data calculation engine Pending CN118193565A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311844217.6A CN118193565A (en) 2023-12-28 2023-12-28 Distributed big data calculation engine

Publications (1)

Publication Number Publication Date
CN118193565A true CN118193565A (en) 2024-06-14

Family

ID=91398996

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311844217.6A Pending CN118193565A (en) 2023-12-28 2023-12-28 Distributed big data calculation engine

Country Status (1)

Country Link
CN (1) CN118193565A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination