CN118193565A - Distributed big data calculation engine - Google Patents

Distributed big data calculation engine

Info

Publication number
CN118193565A
Authority
CN
China
Prior art keywords
task
data
computing
engine
node
Prior art date
Legal status
Pending
Application number
CN202311844217.6A
Other languages
Chinese (zh)
Inventor
孟英谦
李旭光
葛晋鹏
李泽宇
陈朔
邬书豪
随秋林
张敏
刘晓兰
薛行
王嘉岩
李子行
Current Assignee
China North Computer Application Technology Research Institute
Original Assignee
China North Computer Application Technology Research Institute
Priority date
Filing date
Publication date
Application filed by China North Computer Application Technology Research Institute filed Critical China North Computer Application Technology Research Institute
Priority to CN202311844217.6A priority Critical patent/CN118193565A/en
Publication of CN118193565A publication Critical patent/CN118193565A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a distributed big data calculation engine, comprising: a unified interface module, a distributed computing engine module and an operation result processing module. The unified interface module is used for receiving a computing task and parsing the task based on its data type identifier so as to start the corresponding computing engine. The distributed computing engine module comprises a stream computing engine, a batch computing engine and a batch-stream integrated computing engine, which are respectively used for reading and executing the corresponding computing tasks. The operation result processing module is used for collecting the running state data of each computing engine, monitoring the running state and returning the task computation results to the client. The invention solves the problem that big data calculation engines in the prior art cannot provide parallel, efficient, real-time computation for multiple types of data processing tasks, and therefore cannot respond quickly in real time when facing multiple complex computing scenarios.

Description

Distributed big data calculation engine
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a distributed big data computing engine.
Background
The heterogeneous data generated by intelligence, surveillance and reconnaissance systems keeps growing, including electronic reconnaissance data, visible-spectrum data, intelligence data, real-time situation data from various sites, and instruction data such as logistics support and personnel mobilization. The diversity and complexity of data services in various scenarios therefore increase, and the challenges of breaking through in data storage, computation and data analysis and processing services grow as well.
In terms of distributed computation, besides the traditional requirement of batch processing of massive data, efficient real-time processing of data is needed to assist combat decision-making. There is therefore a need in the art for a unified big data computing engine that handles all of the requirements of batch computing, stream computing and batch-stream integrated computing.
Disclosure of Invention
In view of the above analysis, the present invention aims to disclose a distributed big data computing engine, which solves the problem that, when facing multiple complex computing scenarios, data computing engines in the prior art cannot provide efficient real-time computation for multiple types of data processing tasks, resulting in high latency and an inability to respond quickly in real time.
The aim of the invention is mainly realized by the following technical scheme:
In one aspect, the invention discloses a distributed big data computing engine, the computing engine comprising: a unified interface module, a distributed computing engine module and an operation result processing module;
The unified interface module is used for receiving a computing task and analyzing the task based on a data type identifier of the computing task so as to start a corresponding computing engine;
The distributed computing engine module comprises a flow computing engine, a batch computing engine and a batch-flow integrated computing engine which are respectively used for reading and executing corresponding computing tasks; the stream computing engine is used for carrying out task scheduling based on task information of each node in the directed acyclic graph and resource information of each corresponding physical computing node so as to realize low-delay processing of stream data;
The operation result processing module is used for collecting the operation state data of each calculation engine, monitoring the operation state and returning the task calculation result to the client.
Further, the batch computing engine processes batch computing tasks through a distributed computing and mixed load scheduler;
The batch and stream integrated calculation engine performs unified data storage on batch data and stream data in a mode of unified storage management system and row-column mixed storage; and the batch data and the stream data are respectively calculated and processed by an offline processing module and a real-time processing module through mixed load scheduling and vectorization execution.
Further, the row-column mixed storage includes: dividing a table into a plurality of tablets, each tablet comprising MetaData and a plurality of rowsets RowSet, a rowset RowSet comprising one MemRowSet and a plurality of DiskRowSets;
The MemRowSet is used for inserting new data and updating data already in the MemRowSet; the data in the MemRowSet is stored by row, and after one MemRowSet is full, its data is flushed to disk to form a plurality of DiskRowSets;
The DiskRowSet is used for changing old data, and compaction is performed on the DiskRowSets periodically in the background so as to delete useless data and merge historical data; wherein the data in a DiskRowSet is organized by column.
Further, the flow calculation engine comprises a control node module, a calculation node module and a Zookeeper cluster module;
The computing node module comprises a plurality of physical computing nodes and is used for monitoring and executing corresponding flow computing tasks;
The Zookeeper cluster module is deployed on a plurality of servers and is used for storing all state information and task information of a plurality of physical computing nodes so as to enable the computing node module and the control node module to carry out real-time monitoring and calling;
The control node module is used for generating a directed acyclic graph based on the task information to be executed in the stream processing task and the resource information of each physical computing node; and issuing the tasks to be executed to the corresponding physical computing nodes according to the corresponding relation in the directed acyclic graph for processing, and scheduling the tasks based on the resource information of each physical computing node so as to realize low-delay processing of the stream data.
Further, the physical computing node comprises a monitoring process Tracker and a working process Worker; the monitoring process Tracker acquires task information through the Zookeeper cluster module and creates a working process Worker to compute and process the tasks issued to the physical computing node;
The monitoring process Tracker is also used for monitoring task abnormality information of the work process Worker and performing task recovery based on a preset fault recovery flow so as to realize continuous execution of the stream processing task.
Further, the task scheduling based on the resource information of each physical computing node includes:
acquiring resource utilization rate of each physical computing node, wherein the resource utilization rate comprises CPU utilization rate, memory occupancy rate, disk I/O utilization rate and bandwidth utilization rate;
Calculating the proportion of the CPU utilization rate, the memory occupancy rate, the disk I/O utilization rate and the bandwidth utilization rate in the total resource utilization rate;
If the proportion occupied by any one of the utilization rates exceeds a preset threshold, judging that task scheduling is required to be carried out on the task to be executed by the physical computing node;
And calculating to obtain task scheduling priorities corresponding to the physical computing nodes based on the resource information of each physical computing node, and performing task scheduling on tasks to be scheduled based on the physical computing node with the highest task scheduling priority.
Further, the calculating, based on the resource information of each physical computing node, a task scheduling priority corresponding to each physical computing node includes:
Obtaining the residual rate of each resource of each physical computing node based on the utilization rate of each resource of each physical computing node;
obtaining task priority contribution degrees of all the resources according to the residual rates of all the resources of all the physical computing nodes and the resource quantity required by the tasks to be scheduled;
And obtaining task scheduling priorities of tasks to be scheduled corresponding to the physical computing nodes based on the task priority contribution degrees of the resources of the physical computing nodes.
Further, based on the differences between the resource remaining rates of the CPU, the memory and the disk I/O interface and the amounts of those resources required by the task, the task priority contribution degrees of the CPU, the memory and the disk I/O are obtained through the following formula:
fcpu, fmem and fio are the task priority contribution degrees of the CPU, the memory and the disk I/O respectively; Δcpu, Δmem and Δio are the differences between the resource remaining rates of the CPU, the memory and the disk I/O and the amounts of those resources required by the task; taskcpu denotes the amount of CPU resources required by the task, taskmem denotes the amount of memory resources required by the task, taskio denotes the amount of disk I/O resources required by the task, qoscpu, qosmem and qosio are the resource remaining rates of the CPU, the memory and the disk I/O interfaces respectively, and α, β and γ are weight factors.
Further, the task scheduling priority of the task to be scheduled corresponding to the physical computing node is obtained through the following formula:
Rank(task)=ρ+R(task)*fcpu*fmem*fio*v;
Wherein Rank(task) is the task scheduling priority; fcpu, fmem and fio are the task scheduling priority contribution degrees of the CPU, the memory and the disk I/O of the current physical computing node for the task to be scheduled; ρ is a compensation factor; R(task) is the distance of the task to be scheduled from the end task in the scheduling sequence; v is a task processing speed influence factor, v=(1-s)/t; s is the ratio of the output data volume of the task to be scheduled to its input data volume, and t is the elapsed processing time of the task.
Further, each resource amount required by the stream computing task is obtained by the following method respectively:
Independently running the stream computing task on a physical computing node, and separately counting the idle time t1 and the running time t2 of the CPU (Central Processing Unit); the amount of CPU resources taskcpu required by the stream computing task is obtained by the following formula:
taskcpu=1-P=1-t1/(t2*Q);
wherein P is CPU idle rate when independently running tasks, and Q is CPU number;
the memory resource amount taskmem and the disk I/O resource amount taskio required by the stream computing task are obtained through statistics of memory and disk I/O statistics tools provided by the corresponding physical computing nodes.
The invention can realize at least one of the following beneficial effects:
1. The invention integrates various data calculation engines, performs data task analysis based on a unified data interface module, and realizes the parallel high-efficiency real-time processing of stream data, batch data and batch-stream integrated data.
2. The invention calculates the task scheduling priority corresponding to each physical computing node based on the hardware resource information of the physical computing nodes, and performs task scheduling on the task to be scheduled based on the physical computing node with the highest task scheduling priority, so that the streaming data processing task is distributed and executed under the constraint of fixed hardware resources and task priority, the low-delay processing of the streaming processing task is realized, and the real-time performance of the system is improved.
3. The invention introduces compensation factors and task processing speed influencing factors in task scheduling, considers the influence of different resources on scheduling and the influence of task processing speed in task scheduling, compensates running tasks, improves the scheduling priority of the running tasks, ensures the continuity of task running and the correctness and effectiveness of a scheduling algorithm, reduces the system time delay and can better solve the problem of stream data processing bottleneck.
4. The invention realizes the continuous execution of the flow calculation task through the Zookeeper cluster module, the backup control node setting and the fault recovery flow, and ensures the reliability of the flow calculation engine.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, like reference numerals being used to designate like parts throughout the drawings;
FIG. 1 is a block diagram of a distributed big data calculation engine in an embodiment of the invention;
FIG. 2 is a workflow of a stream computation engine in an embodiment of the invention;
Detailed Description
Preferred embodiments of the present application are described in detail below with reference to the attached drawing figures, which form a part of the present application and are used in conjunction with embodiments of the present application to illustrate the principles of the present application.
In one embodiment of the present invention, a distributed big data calculation engine is disclosed, as shown in fig. 1, the big data calculation engine includes: the system comprises a unified interface module, a distributed computing engine module and an operation result processing module; wherein,
The unified interface module is used for receiving a computing task and analyzing the task based on a data type identifier of the computing task so as to start a corresponding computing engine;
Specifically, the distributed big data computing engine of this embodiment is deployed in a domestic server cluster based on a distributed architecture design, and supports servers built on domestic-architecture CPUs such as Feiteng, Shenwei, Loongson, Zhaoxin and Hygon. On the hardware servers, a Docker + Kubernetes basic resource management service is deployed first to realize unified resource management and scheduling of the underlying domestic heterogeneous hardware; then the batch computing, stream computing and batch-stream integrated computing services are deployed, and finally the engine result service is deployed. In this embodiment, computing services are provided externally through JDBC, ODBC, API and other interfaces based on a unified SQL engine; the data type identifier carried in a received computing task is parsed and matched against a preset identifier database of each computing engine, and the corresponding computing engine is started according to the parsing and matching result.
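As an illustration only, the following Python sketch shows how such a unified interface might parse the data type identifier carried in a task and select the matching engine; the identifier values, the ENGINE_BY_IDENTIFIER table and the dispatch_task function are assumptions for this sketch, not part of the patent.

```python
# Hypothetical sketch of the unified interface dispatch logic described above.
# The identifier values and engine names are illustrative assumptions.

ENGINE_BY_IDENTIFIER = {
    "stream": "StreamComputeEngine",
    "batch": "BatchComputeEngine",
    "batch_stream": "BatchStreamUnifiedEngine",
}

def dispatch_task(task: dict) -> str:
    """Parse the data type identifier of a computing task and return
    the name of the compute engine that should be started."""
    identifier = task.get("data_type_identifier")
    engine = ENGINE_BY_IDENTIFIER.get(identifier)
    if engine is None:
        raise ValueError(f"unknown data type identifier: {identifier!r}")
    return engine

# Example: a task tagged as stream data is routed to the stream engine.
print(dispatch_task({"data_type_identifier": "stream", "sql": "SELECT ..."}))
```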
Further, the distributed computing engine module comprises a flow computing engine, a batch computing engine and a batch-flow integrated computing engine which are respectively used for reading and executing corresponding computing tasks;
The stream computing engine is used for carrying out task scheduling based on task information of each node in the directed acyclic graph and resource information of each corresponding physical computing node so as to realize low-delay processing of stream data;
Specifically, the stream computation engine comprises a control node module, a computation node module and a Zookeeper cluster module, wherein,
The computing node module comprises a plurality of physical computing nodes and is used for monitoring and executing corresponding flow computing tasks;
The Zookeeper cluster module is deployed on a plurality of servers and is used for storing all state information and task information of a plurality of physical computing nodes so as to enable the computing node module and the control node module to carry out real-time monitoring and calling;
the control node module is used for generating a directed acyclic graph based on the task information to be executed in the stream processing task and the resource information of each physical computing node; and issuing the tasks to be executed to the corresponding physical computing nodes for processing according to the corresponding relation in the directed acyclic graph, and scheduling the tasks based on the resource information of each physical computing node so as to realize low-delay processing of the stream data.
Specifically, the control node module is the central hub of the distributed stream computing engine and is configured with a master control node and a slave control node. When the stream computing engine is started, the master control node and the slave control node are initialized; the control node started later sends registration information to the control node started earlier, and a heartbeat keep-alive connection between the control nodes is established. In the normal working state, the master control node is in working mode and executes the control tasks, while the slave control node serves as a backup of the master control node and is in standby mode; when the master control node fails, the slave control node takes over its control tasks based on the heartbeat keep-alive connection, which ensures continuous running of the stream processing tasks and improves the reliability of the system.
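A minimal sketch of the standby takeover logic, under assumed heartbeat interval and timeout values (the patent does not specify them); the class and method names are illustrative only.

```python
import time

HEARTBEAT_INTERVAL = 1.0   # seconds between keep-alive messages (assumed)
HEARTBEAT_TIMEOUT = 3.0    # silence longer than this triggers failover (assumed)

class StandbyControlNode:
    """Sketch of the slave control node's heartbeat-based failover check."""

    def __init__(self):
        self.last_heartbeat = time.monotonic()
        self.active = False  # standby mode until the master fails

    def on_heartbeat(self):
        # Called whenever a keep-alive message from the master arrives.
        self.last_heartbeat = time.monotonic()

    def check_master(self):
        # If no heartbeat has arrived within the timeout, take over the
        # master's control tasks so stream processing keeps running.
        if not self.active and time.monotonic() - self.last_heartbeat > HEARTBEAT_TIMEOUT:
            self.active = True
            print("master lost, standby control node takes over control tasks")
```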
In a distributed stream computing system, tasks are the basic unit of scheduling, there is typically a constraint of priority between different tasks, and tasks with higher priority may be executed earlier than tasks with lower priority. The task scheduling in this embodiment is aimed at allocating and executing tasks under the constraint of fixed hardware resources and task priorities, so as to reduce the total time taken for completing the tasks.
Preferably, the task scheduling based on the resource information of each physical computing node includes:
Acquiring the resource utilization rate of each physical computing node, wherein the resource utilization rate comprises CPU utilization rate, memory occupancy rate, disk I/O utilization rate and bandwidth utilization rate; calculating the proportion of the CPU utilization rate, memory occupancy rate, disk I/O utilization rate and bandwidth utilization rate in the total resource utilization rate; if the proportion occupied by any one of the utilization rates exceeds a preset threshold, judging that task scheduling is required for the tasks to be executed by that physical computing node; calculating the task scheduling priority corresponding to each physical computing node based on the resource information of each physical computing node, and scheduling the task to be scheduled onto the physical computing node with the highest task scheduling priority;
The task scheduling priority corresponding to each physical computing node is obtained through calculation by the following method: obtaining the residual rate of each resource of each physical computing node based on the utilization rate of each resource of each physical computing node; obtaining task priority contribution degrees of all the resources according to the residual rates of all the resources of all the physical computing nodes and the resource quantity required by the tasks to be scheduled; and obtaining task scheduling priorities of tasks to be scheduled corresponding to the physical computing nodes based on the task priority contribution degrees of the resources of the physical computing nodes.
Specifically, the resource amount required by the task to be scheduled is obtained by the following methods:
Independently running a stream computing task on a physical computing node, and separately counting the idle time t1 and the running time t2 of the CPU (Central Processing Unit); the amount of CPU resources taskcpu required by the stream computation task is obtained by the following formula:
taskcpu=1-P=1-t1/(t2*Q);
wherein P is CPU idle rate when independently running tasks, and Q is CPU number;
The amount of memory resources taskmem and the amount of disk I/O resources taskio required by the stream computing task are obtained by statistics of memory and disk I/O statistics tools provided by the corresponding physical computing nodes, respectively.
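The CPU requirement formula above can be exercised with a few lines of arithmetic; the sample times below are invented numbers used only to illustrate the calculation.

```python
def required_cpu(idle_time_t1: float, run_time_t2: float, cpu_count_q: int) -> float:
    """taskcpu = 1 - P = 1 - t1 / (t2 * Q), as defined in the embodiment."""
    idle_rate_p = idle_time_t1 / (run_time_t2 * cpu_count_q)
    return 1.0 - idle_rate_p

# Example with made-up measurements: 12 s of idle CPU time over a 10 s run on 4 CPUs.
# taskcpu = 1 - 12 / (10 * 4) = 0.7, i.e. the task needs about 70% of the node's CPU.
print(required_cpu(idle_time_t1=12.0, run_time_t2=10.0, cpu_count_q=4))
```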
In this embodiment, tasks on heavily loaded physical computing nodes that slow down task processing are scheduled onto relatively idle physical computing nodes based on the resource information of the physical computing nodes, which greatly improves the stream data processing speed and solves the problems of high latency and poor real-time performance caused by hardware resource limitations.
More specifically, different tasks require different resources: computation-intensive tasks require more CPU resources, while I/O-intensive tasks require higher I/O efficiency. Scheduling tasks according to the influence of the different resources on scheduling therefore better addresses stream data processing bottlenecks. Since a task consumes several kinds of resources, a mapping between the load of a physical computing node and its resource usage in different dimensions can be established, and task scheduling can be performed for the tasks on each node.
Factors affecting node load generally include CPU usage, memory usage, and I/O efficiency, where:
The CPU is an important computing resource and has a significant influence on task scheduling. The CPU utilization rate is counted by the following method:
the running time and the idle time of the CPU are checked separately; the time difference between two adjacent CPU run-time samples is taken as a calculation period for counting the CPU running time, and the CPU idle time is counted by the same method, so that the CPU utilization rate Usecpu can be obtained from the idle time and the running time;
The memory occupancy Usemem can be counted with the command provided by the Linux system for reporting memory occupation. The disk I/O utilization UseIO is obtained similarly to the memory occupancy, using the I/O statistics tool provided by the system.
The control node module in this embodiment starts a thread dedicated to monitoring the running state of each task node. In general, the running state of a computing node is relatively stable: the usage of each resource does not exceed its threshold and the data flow rate is steady, so periodic task rescheduling is not required. Task scheduling is performed again only when a computing node fails, or when its load exceeds the threshold and jitter becomes relatively large.
In the distributed flow computing platform, task allocation scheduling is performed based on load information of nodes, so that analysis of node loads is critical to node operation efficiency, and the relation between the loads and resources of physical computing nodes is as follows:
L=Usecpu+Usemem+UseIO
Where Usecpu represents the CPU utilization rate, Usemem represents the memory occupancy rate, and UseIO represents the disk I/O utilization rate. In practical application, a certain time interval can be set according to actual conditions, each item of resource information is collected periodically to obtain the load condition of the physical computing node, and the proportion occupied by each resource occupancy is calculated as shown in the following formulas:
Ccpu=ΔUsecpu/ΔL;
Cmem=ΔUsemem/ΔL;
CIO=ΔUseIO/ΔL;
wherein ΔUsecpu, ΔUsemem and ΔUseIO are respectively the differences in CPU utilization, memory occupancy and disk I/O utilization between two adjacent collection periods, and ΔL is the difference in total load between two adjacent collection periods.
The factor with the greatest influence on the load can be identified from Ccpu, Cmem and CIO, which also reveals which factor may become the bottleneck of the node. A large number of experiments show that when the largest of Ccpu, Cmem and CIO exceeds 30%, large load fluctuations are likely to occur, so the threshold is set to 30% in this embodiment; that is, when any one of Ccpu, Cmem or CIO exceeds the 30% threshold, task scheduling is performed for the tasks corresponding to that physical computing node. Conversely, when Ccpu, Cmem or CIO takes a small value, the corresponding resource is relatively free, and tasks needing that kind of resource should preferentially be scheduled to that node.
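The following sketch illustrates the bottleneck check just described. Only the formulas L = Usecpu + Usemem + UseIO, C = ΔUse/ΔL and the 30% threshold come from the embodiment; the sample utilization values and the function name are invented.

```python
# Illustrative sketch of the per-node load bottleneck check described above.
THRESHOLD = 0.30

def bottleneck_check(prev: dict, curr: dict) -> dict:
    """Return the share of the load change attributable to each resource
    and whether rescheduling is needed for this physical computing node."""
    delta = {k: curr[k] - prev[k] for k in ("cpu", "mem", "io")}
    delta_l = sum(delta.values())                      # ΔL, change in total load
    shares = {k: (d / delta_l if delta_l else 0.0) for k, d in delta.items()}
    needs_reschedule = any(abs(s) > THRESHOLD for s in shares.values())
    return {"shares": shares, "needs_reschedule": needs_reschedule}

# Two consecutive collection periods with made-up utilization values.
prev = {"cpu": 0.40, "mem": 0.30, "io": 0.10}
curr = {"cpu": 0.70, "mem": 0.35, "io": 0.15}
print(bottleneck_check(prev, curr))   # CPU accounts for 75% of the load change -> reschedule
```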
More specifically, based on the differences between the resource remaining rates of the CPU, the memory and the disk I/O interface and the amounts of those resources required by the task, the task priority contribution degrees of the CPU, the memory and the disk I/O are obtained through the following formula:
fcpu, fmem and fio are the task priority contribution degrees of the CPU, the memory and the disk I/O respectively; Δcpu, Δmem and Δio are the differences between the resource remaining rates of the CPU, the memory and the disk I/O and the amounts of those resources required by the task; taskcpu denotes the amount of CPU resources required by the task, taskmem denotes the amount of memory resources required by the task, taskio denotes the amount of disk I/O resources required by the task, qoscpu, qosmem and qosio are the resource remaining rates of the CPU, the memory and the disk I/O interfaces respectively, and α, β and γ are weight factors.
Preferably, qoscpu, qosmem and qosio are obtained by the following formulas:
qoscpu=1–Usecpu
qosmem=1–Usemem
qosio=1–Useio
The differences Δcpu, Δmem and Δio between the remaining rates of the CPU, the memory and the disk I/O and the amounts of CPU, memory and disk I/O resources required by the task are obtained by the following formulas:
Δcpu=qoscpu-taskcpu
Δmem=qosmem-taskmem
Δio=qosio-taskio
Finally, the task scheduling priority corresponding to the physical computing node is obtained through the following formula:
Rank(task)=ρ+R(task)*fcpu*fmem*fio*v;
Wherein Rank (task) is task scheduling priority; fcpu, fmem, fio is the task priority contribution of CUP, memory and disk I/O respectively; ρ is a compensation factor and, R (task) is the size of the node task from the end point task in the scheduling sequence; v is a task processing speed influence factor, v= (1-s)/t; s is the ratio of the output data quantity to the input quantity of the task, and t is the processed time of the task.
The higher the task scheduling priority is, the higher the efficiency of processing the corresponding task to be scheduled is, therefore, after the task scheduling priority is obtained by calculation, the task to be scheduled is scheduled to the physical computing node with the highest corresponding priority for calculation processing, so that the running efficiency of the system is improved, and the delay of data processing is reduced.
It should be noted that, taking fcpu as an example, when Δcpu>0 the remaining CPU rate of the physical computing node exceeds the CPU amount required by the task, and fcpu has a positive effect on the task scheduling priority Rank(task); this positive effect is correlated with the weight factor α, and the values of α, β and γ all lie between 0 and 1 and can be set according to the actual application. When Δcpu<0, the task's CPU demand is greater than the current node's remaining CPU rate, and according to the formula fcpu then has a negative effect on the task scheduling priority Rank(task). For a node, the more input data a task processes per unit time, the less memory it consumes; this embodiment therefore introduces the task processing speed influence factor v, defined as v=(1-s)/t, which takes the influence of task processing speed into account and improves the effectiveness of the task scheduling algorithm. In addition, this embodiment introduces a compensation factor ρ: running tasks are compensated so that their scheduling priority is raised, which guarantees the continuity of task execution and the correctness and effectiveness of the scheduling algorithm, and reduces system latency.
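The priority computation can be sketched directly from the formulas given above: qos = 1 - Use, Δ = qos - task requirement, v = (1-s)/t and Rank(task) = ρ + R(task)·fcpu·fmem·fio·v. The contribution degrees fcpu, fmem and fio are passed in as already-computed values here, because their exact formula is not reproduced in this text; all numeric inputs in the example are invented.

```python
# Sketch of the scheduling priority computation described above.

def remaining_rates(use_cpu: float, use_mem: float, use_io: float):
    # qos = 1 - Use for each resource, per the embodiment.
    return 1 - use_cpu, 1 - use_mem, 1 - use_io

def deltas(qos, task_req):
    # Δ = remaining rate minus the amount of that resource the task requires.
    return tuple(q - r for q, r in zip(qos, task_req))

def rank(rho: float, r_task: float, fcpu: float, fmem: float, fio: float,
         s: float, t: float) -> float:
    v = (1 - s) / t                      # task processing speed influence factor
    return rho + r_task * fcpu * fmem * fio * v

qos = remaining_rates(use_cpu=0.4, use_mem=0.5, use_io=0.3)
print(deltas(qos, task_req=(0.2, 0.1, 0.1)))      # (Δcpu, Δmem, Δio)
print(rank(rho=0.1, r_task=3, fcpu=0.8, fmem=0.9, fio=0.7, s=0.5, t=2.0))
```

The task to be scheduled is then placed on the physical computing node whose Rank(task) is highest.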
Further, as shown in fig. 2, the operation of the stream processing computing engine includes:
The Master control node Master and the slave control node Secondary Master are initialized in sequence; the control node started later sends registration information to the control node started earlier, and a heartbeat keep-alive connection between the control nodes is established;
The monitoring process Tracker of each physical computing node in the computing node module is started and initialized;
the computing node monitoring process acquires metadata such as hardware resource information of a physical computer through a Zookeeper cluster module, registers the metadata with a main control node and reports the hardware resource information;
a user sends a task starting instruction to a Master control node Master through a management interface Portal process;
the main control node parses the StreamSQL to obtain the static topology (static directed acyclic graph, static DAG) of the task, and performs task scheduling according to the cluster resources and the currently running tasks to obtain the dynamic topology (dynamic directed acyclic graph, dynamic DAG) of the task;
the main control node transmits the operation instruction to the corresponding physical computing node according to the corresponding relation of the dynamic topological graph;
the computing node monitoring process Tracker creates a working process Worker according to the instruction of the control node and runs the designated stream processing task;
the computing node monitoring process Tracker reports the result of creating the working process to the main control node;
after the Master control node Master waits for the working processes of all the computing units in the task topological graph to be established, an instruction for connecting the downstream computing units is sent to the computing node monitoring process according to the topological graph relation of the task;
the computing node monitoring process Tracker forwards a downstream connection instruction to a working process Worker, and the working process Worker establishes network connection with a working process of a downstream computing node;
the computing node monitoring process Tracker reports the result of the working process Worker's downstream connection to the Master control node Master;
After the Master control node Master waits for the complete establishment of the topological graph of the task, the Master control node Master replies the task starting execution result to the management interface.
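As a toy illustration of the workflow above, the sketch below turns a task's static topology into a dynamic DAG by assigning each logical compute unit to a physical node. The example topology, the node list and the round-robin placement are assumptions made for brevity; real placement follows the priority formula described earlier.

```python
# Toy mapping of a static DAG (parsed from StreamSQL) to a dynamic DAG whose
# edges name the physical nodes the Workers must connect to.

static_dag = {"source": ["map"], "map": ["sink"], "sink": []}   # illustrative topology
nodes = ["node-1", "node-2"]

def build_dynamic_dag(static_dag: dict, nodes: list):
    placement = {}
    for i, unit in enumerate(static_dag):
        placement[unit] = nodes[i % len(nodes)]   # simple round-robin placement (assumed)
    # Each edge now carries both the logical link and the target physical node.
    dynamic = {
        unit: [(child, placement[child]) for child in children]
        for unit, children in static_dag.items()
    }
    return dynamic, placement

dynamic_dag, placement = build_dynamic_dag(static_dag, nodes)
print(placement)      # which physical node runs each compute unit
print(dynamic_dag)    # downstream connections the Workers must establish
```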
Further, in practical application, the control node module may be provided with a resource management sub-module, a task management sub-module, a task scheduling sub-module, a data source management sub-module and an output management sub-module.
The task management sub-module is used for parsing the to-be-executed task information of the stream processing provided by the client and for generating and managing the Directed Acyclic Graph (DAG). More specifically, a distributed stream computing task is organized in the form of a Directed Acyclic Graph (DAG); the control node generates the topological relation of the task by parsing the StreamSQL. The topological relation contains the execution order of each task to be executed in the stream processing task and the physical information of the computing nodes, including the id of each physical computing node, its parent node list and child node list, and records the IP address, listening port, process number, process state and the like of the physical computing node where the working process is located.
The task management sub-module mainly manages the state information and operations of tasks. In this embodiment a task information table is used to manage tasks: the task table is a hash table in which the task states and the managed Directed Acyclic Graph (DAG) information are stored, so as to ensure the reliability of task management. The business_table contains the task name, task id, task state information, task storage location and a pointer to the DAG.
The scheduling algorithm sub-module is used for distributing tasks to corresponding physical computing nodes according to Directed Acyclic Graph (DAG) information after receiving a task starting request sent by a client, rescheduling the tasks when the utilization rate of certain resources fluctuates greatly or exceeds a threshold value, and starting a fault recovery algorithm to schedule the tasks if the nodes fail.
The resource management sub-module is used for collecting the CPU information, bandwidth information, memory information and disk I/O information of the physical machine where the computing node is located, collecting and counting the resource usage information, and providing raw data for the scheduling algorithm. During system operation, the computing node monitoring process Tracker periodically collects the remaining hardware resource information of the computing node and the resource occupation information of the working processes, places them in a heartbeat protocol packet and sends them to the control node in a piggybacked manner. The control node module uses a sliding window mechanism to count the hardware resource information reported by the computing nodes: each time the latest resource report is received, the oldest resource information in the current window is replaced. The average of the hardware resource information within the sliding statistical window is taken as the actual hardware resource usage information.
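A minimal sketch of that sliding-window statistic follows; the window size of 5 reports and the class name are assumptions, since the text does not fix them.

```python
from collections import deque

class ResourceWindow:
    """Sliding-window statistics for resource reports piggybacked on heartbeats,
    as described above. The window size is an assumed value."""

    def __init__(self, size: int = 5):
        self.reports = deque(maxlen=size)  # the oldest report is dropped automatically

    def on_report(self, cpu: float, mem: float, io: float):
        # Each new heartbeat report replaces the oldest entry in the window.
        self.reports.append((cpu, mem, io))

    def current_usage(self):
        # The average over the window is taken as the actual resource usage.
        n = len(self.reports)
        return tuple(sum(r[i] for r in self.reports) / n for i in range(3))

w = ResourceWindow()
for report in [(0.5, 0.4, 0.2), (0.6, 0.4, 0.3), (0.7, 0.5, 0.3)]:
    w.on_report(*report)
print(w.current_usage())   # averaged (cpu, mem, io) usage over the window
```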
The data source management sub-module performs unified management on the data source configuration such as a source table, a dimension table and the like used by the execution of the streamSQL task, selects the managed data source to operate when writing the SQL task, and avoids the need of manually inputting connection information for each task.
Data source management includes: newly added data sources: for the newly added data source, the type of the added data source needs to be selected, and the connection configuration parameters of the data sources of different types are also different. And after the connection information is filled in, performing connection test on the data source, and after connection establishment is successful, storing the connection information of the current data source. Deleting the data source: when some data sources are no longer in use, the data source is removed from the managed data sources. Modifying the data source: when some data source connection information is changed, the data source is reconnected by modifying the data source parameter configuration.
The output management sub-module is responsible for the persistence of stream computation results, supporting distributed file systems (HDFS), relational databases (e.g., MySQL, KunDB, etc.), columnar or row-column hybrid storage systems (e.g., HBase, Holodesk, etc.), search storage systems (e.g., Elasticsearch) and event-type storage systems (e.g., Kafka). Here, the relational database MySQL, the columnar storage system HBase and the search storage system Elasticsearch are selected to introduce their relevant configurations.
Furthermore, the Zookeeper cluster module adopts a one-leader, multiple-follower arrangement in which the leader and each follower are deployed on different servers and data is synchronized from the leader to the followers, so that the leader and every follower store all of the state information and task information of the same physical computing nodes; when the leader fails, a new leader is elected from the followers through an election mechanism, keeping the distributed engine running continuously. That is, the Zookeeper module is a cluster formed by one leader and several followers, each server stores an identical copy of the data, a client sees consistent data no matter which server it connects to, and the leader handles the forwarding of distributed read-write and update requests.
The computing node modules are deployed on different physical clusters (physical computing nodes) and are responsible for receiving the tasks distributed by the control node, managing the working processes (such as starting and stopping them) and running the specific tasks. A physical computing node comprises a monitoring process Tracker and a working process Worker; the monitoring process Tracker acquires task information through the Zookeeper cluster module and creates a working process Worker to compute and process the tasks issued to the physical computing node;
the monitoring process Tracker is also used for monitoring task exception information of the work process Worker and performing task recovery based on a preset fault recovery flow so as to realize continuous execution of the flow processing task.
Specifically, performing task recovery based on a preset failure recovery flow includes:
Receiving, through the monitoring process Tracker, the task exception message of the failed working process Worker1 and forwarding it to the control node module;
The control node module issues a rescheduling command to the monitoring process Tracker1 where the working process Worker1 is located; the monitoring process Tracker1 issues a command to terminate the working process Worker1, and the Worker process list of the control node module is maintained and updated;
The control node module checks the current resource information of each physical computing node and the system resources required to replace the working process Worker1, so as to recalculate the task scheduling priorities, and issues a command to the monitoring process Tracker2 on the physical computing node with the highest task scheduling priority to pull up a working process Worker2;
The working process Worker2 loads the program run by the working process Worker1 according to the location designated in the command and connects to the downstream node of the working process Worker1;
The control node module notifies the upstream node of the working process Worker1 to actively connect to the working process Worker2 and to send the data stream it carries to the working process Worker2 according to the distribution rule preset by the stream computing task, completing the task recovery.
Because dynamic data continuously generated by the data flow needs to be processed, the flow calculation engine needs to achieve the effects of rapidness, high efficiency and low delay on the data processing.
Further, the batch computing engine processes batch computing tasks through a distributed computing and mixed load scheduler;
specifically, the batch calculation engine comprises a storage layer, an execution layer, a compiling layer and a service layer;
The storage layer is composed of two functional modules, namely a data source connector and a data table management module, so that the batch processing engine is adapted to be connected with various data sources and manage various data tables, and the requirement of multi-source data analysis is met.
The execution layer consists of two functional modules of a distributed computing node and a mixed load scheduler, so that the distributed execution and task scheduling of the large data volume analysis task are realized;
specifically, the hybrid load scheduler of the present embodiment includes a first-in first-out scheduling policy and a fair scheduling policy:
Wherein, the first-in first-out scheduling strategy comprises: queuing all SQL tasks in a scheduler according to the submitted time sequence; when idle resources appear, the idle resources are preferentially distributed to the tasks submitted first; and when no idle resources exist, queuing the subsequent tasks. That is, all tasks are performed in the order of commit.
The fair scheduling strategy comprises a resource pool level and a fair scheduling strategy of two levels in the resource pool;
The resource-pool-level scheduling strategy sets a Minimum Share (the minimum occupied resources) and a Pool Weight (the weight of the resource pool) for each resource pool. The Pool Weight represents the priority of the resource pool: when idle resources exist, the tasks in the resource pool with the higher priority acquire resources first. The Minimum Share represents the minimum resources occupied by a resource pool: even if a resource pool with higher priority is contending for resources, the minimum resource amount of a low-priority resource pool is guaranteed first. The intra-pool scheduling policy shares all resources fairly among the tasks submitted to the same resource pool, regardless of task submission order.
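The sketch below illustrates the two ideas just described (Minimum Share first, then Pool Weight). The pool definitions, the total resource figure and the allocate function are invented for illustration; they are not the engine's actual scheduler.

```python
# Minimal sketch of a fair allocation honouring Minimum Share and Pool Weight.

def allocate(pools: dict, total: int) -> dict:
    """pools maps name -> {"min_share": int, "weight": float}."""
    # First guarantee every pool its Minimum Share.
    alloc = {name: p["min_share"] for name, p in pools.items()}
    remaining = total - sum(alloc.values())
    # Then hand out the rest in proportion to Pool Weight (priority).
    weight_sum = sum(p["weight"] for p in pools.values())
    for name, p in pools.items():
        alloc[name] += int(remaining * p["weight"] / weight_sum)
    return alloc

pools = {
    "high_priority": {"min_share": 10, "weight": 3.0},
    "low_priority": {"min_share": 10, "weight": 1.0},
}
print(allocate(pools, total=100))   # {'high_priority': 70, 'low_priority': 30}
```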
The embodiment realizes unified multi-task execution and multi-level scheduling through the hybrid load scheduler, optimizes the utilization rate of the whole resources of the cluster, realizes the functions of queue scheduling, quota storage, load balancing and the like, and improves the whole job throughput of the distributed batch computing engine.
Further, the compiling layer comprises an SQL compiler, a stored procedure compiler, a transaction management unit, an optimizer and other components. The batch computing engine optimizes the execution plans of insert, delete and query SQL through the SQL compiler and executes them concurrently in the cluster through the distributed engine, which meets the high-throughput requirements of batch processing services; at the same time it provides support for transactional operations, meeting the business requirements of batch processing services.
Further, the service layer provides unified development interface service, engine ecological connection management service, security and high availability management service and the like for the batch processing computing engine.
Preferably, the batch-stream integrated calculation engine performs unified data storage on batch data and stream data in a mode of unified storage management system and row-column mixed storage; the batch data and the stream data are respectively calculated and processed by an offline processing module and a real-time processing module through mixed load scheduling and vectorization execution;
Specifically, conventional database query execution typically adopts a tuple-at-a-time pipeline execution model, in which most of the CPU's processing time is spent traversing the query operation tree rather than actually processing data, resulting in low CPU utilization, poor instruction cache performance and frequent jumps. This embodiment adopts a vectorized execution strategy that changes the tuple-at-a-time execution model and uses the CPU's SIMD acceleration instructions (single instruction, multiple data) to achieve data parallelism and improve computational efficiency.
Preferably, the vectorization execution engine may be implemented by the following method:
Based on the Volcano model, the tuple-at-a-time processing mode is changed into a mode that returns a batch of column-stored row values (for example, 100 to 1000 rows) at a time, so as to improve computational efficiency; alternatively, a compiled execution model is adopted, converting the optimized execution plan tree into compiled execution based on a hierarchical execution mode, that is, for each call the data is returned upward only after each layer has completed, which minimizes the number of calls between the nodes of each layer and improves the effective computational efficiency of the CPU.
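A toy contrast between tuple-at-a-time and batch-at-a-time (vectorized) execution follows; the batch size of 1000 and the "scan"/"filter" operators are assumptions for the sketch, not part of the patent text.

```python
# Batch-at-a-time operators: the per-call overhead of the Volcano model is
# paid once per batch instead of once per tuple.

BATCH_SIZE = 1000

def scan_batches(table):
    # Source operator: yield batches of rows instead of single tuples.
    for start in range(0, len(table), BATCH_SIZE):
        yield table[start:start + BATCH_SIZE]

def filter_batches(batches, predicate):
    # Each operator call processes a whole batch of rows.
    for batch in batches:
        yield [row for row in batch if predicate(row)]

table = list(range(10_000))
result = sum(len(b) for b in filter_batches(scan_batches(table), lambda x: x % 2 == 0))
print(result)   # 5000 even values, processed 1000 rows per operator call
```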
Further, consider an existing computing engine such as the Lambda architecture: after data passes through the Kafka message middleware and enters the Lambda architecture, it simultaneously enters two processing modules, offline processing (Hadoop) and real-time processing (Storm). Offline processing performs batch calculations and aggregates large amounts of data, while real-time processing performs stream processing or micro-batch processing and computes results within seconds or minutes. Finally, the data is written into a service database (service DB) for summarization and exposed to upper-layer service calls. However, this existing approach requires maintaining two sets of code, for real-time processing and offline processing, while also guaranteeing the consistency of the results produced by the two code paths, so the complexity of the architecture is high.
The embodiment adopts high-performance row-column mixed storage, utilizes real-time computing service and offline computing service in a computing layer, thoroughly opens up metadata, adopts an event-driven mode, ensures that data can be processed in real time with low delay, and can be searched and queried in the offline computing service immediately after the data is written.
The built-in small-file merging service of the storage layer can merge and archive the small files generated by the real-time warehousing task without the upper-layer application being aware of it;
Specifically, in order to guarantee write performance, every write operation writes a new file at the bottom layer. When frequent write operations with small data volumes are performed, a large number of base/delta files appear whose content is very small (KB level); files below 32 MB are regarded as small files, and when there are too many of them the IO overhead increases greatly. This embodiment designs a Compact functional module to merge multiple small files into one file, thereby solving the problems caused by small files.
Specifically, small-file merging falls into three categories: Full, Minor and Major.
Combining a plurality of base files into one base file by full compact, and deleting delta files together;
The minor compact merges the delta files of a base to generate a new delta file, and applies the new delta file to the original base file;
The major compact merges a base and delta files thereof to generate a new base file;
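A toy model of the three compaction types follows; files are represented as plain dicts, since the real base/delta file formats are not specified at this level of detail in the text.

```python
# Illustrative sketches of minor, major and full compaction as described above.

def minor_compact(base: dict, deltas: list) -> dict:
    # Merge the delta files of one base into a single new delta file.
    merged_delta = {}
    for d in deltas:
        merged_delta.update(d)
    return {"base": base, "deltas": [merged_delta]}

def major_compact(base: dict, deltas: list) -> dict:
    # Merge a base file and its delta files into one new base file.
    new_base = dict(base)
    for d in deltas:
        new_base.update(d)
    return {"base": new_base, "deltas": []}

def full_compact(rowsets: list) -> dict:
    # Merge several base files (with their deltas) into one base file.
    new_base = {}
    for rs in rowsets:
        new_base.update(major_compact(rs["base"], rs["deltas"])["base"])
    return {"base": new_base, "deltas": []}

rs = {"base": {"k1": 1, "k2": 2}, "deltas": [{"k2": 20}, {"k3": 3}]}
print(major_compact(rs["base"], rs["deltas"]))   # {'base': {'k1': 1, 'k2': 20, 'k3': 3}, 'deltas': []}
```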
The embodiment solves the problems that too many small files greatly increase IO overhead, influence the performance of a computing engine and further influence the stability of a system through small file merging.
Further, the row-column mixed storage includes: a table is divided into tablets, each of which includes MetaData meta-information and RowSets; a RowSet includes one MemRowSet and several DiskRowSets. The MemRowSet is used for inserting new data and updating data already in the MemRowSet; the data in the MemRowSet is stored by row, and when one MemRowSet is full its data is flushed to disk to form DiskRowSets. The DiskRowSet is used for changing old data (mutation); the background periodically performs compaction on the DiskRowSets to delete useless data and merge historical data, reducing the IO overhead during queries, and the data in a DiskRowSet is organized by column.
More specifically, the data write process of the batch flow unified computing engine is as follows:
the client connects to the Master to acquire the relevant information of the table, including the partition information and the information of all tablets in the table;
The client finds the Data Server where the tablet responsible for handling the read-write request is located. The platform receives the client's request and checks whether it conforms to the table structure;
The platform searches all RowSets (MemRowSet and DiskRowSets) in the tablet to confirm whether data with the same primary key as the data to be inserted already exists; if so, an error is returned, otherwise the process continues;
Write operations are first committed to the tablet's write-ahead log (WAL) and replicated to the follower nodes according to the Raft consensus algorithm, then applied to the memory of one of the tablets, with the insert added to that tablet's MemRowSet. To support multi-version concurrency control (MVCC) in the MemRowSet, update and delete operations on a recently inserted row (i.e., a new row that has not yet been flushed to disk) are appended after the original row in the MemRowSet to generate a list of REDO records;
Streaming data is written to the MemRowSet; when the MemRowSet reaches a certain size, its data is flushed to disk to generate a DiskRowSet for persistence, and a new MemRowSet is created to continue receiving new data. The background periodically performs compaction on the DiskRowSets to delete useless data and merge historical data.
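A simplified sketch of this write path follows: duplicate primary-key check across rowsets, append to the write-ahead log, insert into the MemRowSet, and flush to a DiskRowSet once the MemRowSet is full. Raft replication and MVCC REDO records are omitted, and the flush limit is an invented value.

```python
# Condensed model of the tablet write path described above.

MEMROWSET_LIMIT = 4   # rows before a flush, for illustration only

class Tablet:
    def __init__(self):
        self.wal = []           # write-ahead log entries
        self.mem_rowset = {}    # row-oriented in-memory data, keyed by primary key
        self.disk_rowsets = []  # column-oriented persisted rowsets (modelled as dicts)

    def insert(self, key, row):
        if key in self.mem_rowset or any(key in d for d in self.disk_rowsets):
            raise KeyError(f"primary key {key!r} already exists")
        self.wal.append(("insert", key, row))     # commit to the WAL first
        self.mem_rowset[key] = row                # then apply to the MemRowSet
        if len(self.mem_rowset) >= MEMROWSET_LIMIT:
            self.disk_rowsets.append(self.mem_rowset)   # flush as a DiskRowSet
            self.mem_rowset = {}                        # new MemRowSet for new data

t = Tablet()
for i in range(5):
    t.insert(i, {"value": i * 10})
print(len(t.disk_rowsets), len(t.mem_rowset))   # 1 DiskRowSet flushed, 1 row still in memory
```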
It should be emphasized that, in the batch-stream integrated computing engine solution, a core problem is that the service database layer, i.e. the stream processing storage layer, becomes a bottleneck, making it impossible to combine efficient writing with real-time analysis of the service. In existing open-source schemes, if every piece of data must be written in real time and be queryable in real time, the write throughput is likely to drop and a serious small-file problem arises; on the other hand, if data is written in batches to improve throughput, the scheme cannot fully meet the real-time requirements on the data.
The storage management system designed by the batch-flow integrated computing engine can solve the problem of excessive small files while supporting high-efficiency writing of data; the embodiment performs depth integration on real-time calculation and offline calculation; in the calculation layer, the real-time calculation task and the offline calculation service are thoroughly communicated on metadata, and an event driving mode is adopted, so that the data can be processed in real time with low delay, and the data can be searched and inquired in the offline calculation service immediately after being written; in the storage layer, a set of efficient storage system is developed, and on the premise of ensuring the writing throughput of an approximate file system, the built-in small file merging service can be utilized to merge and archive small files generated by a real-time warehousing task on the premise of no perception of upper-layer application.
Further, the operation result processing module of the big data computing engine of the embodiment is used for collecting the operation state data of each computing engine, monitoring the operation state and returning the task computing result to the client;
Specifically, the operation result processing module can collect operation state data in the task execution process, monitor the operation state and write in an operation log, if an error or a problem occurs, start an error alarm, and return the calculation result of the executed calculation task to the corresponding client.
In summary, the distributed big data computing engine integrates various data computing engines, performs data task analysis based on the unified data interface module, and realizes parallel, efficient and real-time processing of stream data, batch data and batch-stream integrated data. The flow computing engine calculates task scheduling priorities corresponding to the physical computing nodes based on hardware resource information of the physical computing nodes, and performs task scheduling on tasks to be scheduled based on the physical computing node with the highest task scheduling priority, so that the flow data processing tasks are distributed and executed under the constraint of fixed hardware resources and task priorities, low-delay processing of the flow processing tasks is realized, and real-time performance of the system is improved. And the compensation factors and the task processing speed influence factors are introduced into the task scheduling, so that the influence of different resources on the scheduling is considered during the task scheduling, the influence of the task processing speed is considered, the running task is compensated, the scheduling priority of the running task is improved, the running continuity of the task and the correctness and effectiveness of a scheduling algorithm are ensured, the system time delay is reduced, and the problem of stream data processing bottleneck can be better solved. Continuous execution of the flow calculation task is realized through the Zookeeper cluster module, the backup control node setting and the fault recovery flow, and the reliability of the flow calculation engine is ensured; the bottom layer storage of the batch-stream integrated computing engine adopts high-performance row-column mixed storage, so that the problem that the existing computing engine cannot achieve efficient writing and real-time analysis of services due to the bottleneck of a stream processing storage layer is solved.
Those skilled in the art will appreciate that implementing all or part of the processes of the methods in the above embodiments may be accomplished by computer programs to instruct related hardware, and that the programs may be stored in a computer readable storage medium. Wherein the computer readable storage medium is a magnetic disk, an optical disk, a read-only memory or a random access memory, etc.
The present invention is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present invention are intended to be included in the scope of the present invention.

Claims (10)

1. A distributed big data computing engine, comprising: a unified interface module, a distributed computing engine module and an operation result processing module;
The unified interface module is used for receiving a computing task and analyzing the task based on a data type identifier of the computing task so as to start a corresponding computing engine;
The distributed computing engine module comprises a stream computing engine, a batch computing engine and a batch-stream integrated computing engine, which are respectively used for reading and executing corresponding computing tasks; the stream computing engine is used for carrying out task scheduling based on task information of each node in a directed acyclic graph and resource information of each corresponding physical computing node so as to realize low-delay processing of stream data;
The operation result processing module is used for collecting the operation state data of each calculation engine, monitoring the operation state and returning the task calculation result to the client.
2. The distributed big data calculation engine of claim 1, wherein,
The batch computing engine processes batch computing tasks through distributed computation and a mixed-load scheduler;
The batch-stream integrated computing engine performs unified storage of batch data and stream data by means of a unified storage management system and row-column mixed storage; and the batch data and the stream data are respectively computed and processed by an offline processing module and a real-time processing module through mixed-load scheduling and vectorized execution.
3. The distributed big data computing engine of claim 2, wherein the row-column mixed storage comprises: dividing a table into a plurality of tablets, each tablet comprising metadata MetaData and a plurality of row sets RowSet, and each row set RowSet comprising one MemRowSet and a plurality of DiskRowSet;
the MemRowSet is used for inserting new data and updating data already in MemRowSet; data in MemRowSet is stored by row, and after one MemRowSet is full, its data is flushed to disk to form a DiskRowSet;
the DiskRowSet is used for changes to old data, and compaction is periodically carried out on DiskRowSet in the background so as to delete unused data and merge historical data; wherein the data in DiskRowSet is organized by column.
4. The distributed big data computing engine of claim 1, wherein the stream computing engine comprises a control node module, a compute node module, and a Zookeeper cluster module;
The computing node module comprises a plurality of physical computing nodes and is used for monitoring and executing corresponding flow computing tasks;
The Zookeeper cluster module is deployed on a plurality of servers and is used for storing all state information and task information of a plurality of physical computing nodes so as to enable the computing node module and the control node module to carry out real-time monitoring and calling;
The control node module is used for generating a directed acyclic graph based on the task information to be executed in the stream processing task and the resource information of each physical computing node; and issuing the tasks to be executed to the corresponding physical computing nodes according to the corresponding relation in the directed acyclic graph for processing, and scheduling the tasks based on the resource information of each physical computing node so as to realize low-delay processing of the stream data.
5. The distributed big data computing engine of claim 4, wherein the physical computing nodes comprise a monitoring process Tracker and a work process Worker; the monitoring process Tracker acquires task information through the Zookeeper cluster module and creates a work process Worker to compute and process the tasks issued to the physical computing node;
The monitoring process Tracker is also used for monitoring task abnormality information of the work process Worker and performing task recovery based on a preset fault recovery flow so as to realize continuous execution of the stream processing task.
6. The distributed big data computing engine of claim 4, wherein the task scheduling based on the resource information of each physical computing node comprises:
acquiring resource utilization rate of each physical computing node, wherein the resource utilization rate comprises CPU utilization rate, memory occupancy rate, disk I/O utilization rate and bandwidth utilization rate;
Calculating the proportion of the CPU utilization rate, the memory occupancy rate, the disk I/O utilization rate and the bandwidth utilization rate in the total resource utilization rate;
If the proportion occupied by any one of the utilization rates exceeds a preset threshold, judging that task scheduling is required to be carried out on the task to be executed by the physical computing node;
And calculating to obtain task scheduling priorities corresponding to the physical computing nodes based on the resource information of each physical computing node, and performing task scheduling on tasks to be scheduled based on the physical computing node with the highest task scheduling priority.
7. The distributed big data computing engine of claim 4, wherein the computing, based on the resource information of each physical computing node, a task scheduling priority corresponding to each physical computing node includes:
Obtaining the residual rate of each resource of each physical computing node based on the utilization rate of each resource of each physical computing node;
obtaining task priority contribution degrees of all the resources according to the residual rates of all the resources of all the physical computing nodes and the resource quantity required by the tasks to be scheduled;
And obtaining task scheduling priorities of tasks to be scheduled corresponding to the physical computing nodes based on the task priority contribution degrees of the resources of the physical computing nodes.
8. The distributed big data computing engine of claim 5, wherein the task priority contribution degrees of the CPU, the memory and the disk I/O are obtained, based on the differences between the resource residual rates corresponding to the CPU, the memory and the disk I/O interfaces and the amounts of resources required by the task, by the following formula:
wherein fcpu, fmem and fio are the task priority contribution degrees of the CPU, the memory and the disk I/O respectively; Δcpu, Δmem and Δio are respectively the differences between the resource residual rates of the CPU, the memory and the disk I/O and the amounts of the corresponding resources required by the task; taskcpu denotes the amount of CPU resources required by the task, taskmem denotes the amount of memory resources required by the task, and taskio denotes the amount of disk I/O resources required by the task; qos_cpu, qos_mem and qos_io are the resource residual rates of the CPU, the memory and the disk I/O interfaces respectively; and α, β and γ are weight factors.
9. The distributed big data computing engine of claim 6, wherein the task scheduling priority of the task to be scheduled corresponding to the physical computing node is obtained by the following formula:
Rank(task)=ρ+R(task)*fcpu*fmem*fio*v;
wherein Rank(task) is the task scheduling priority; fcpu, fmem and fio are the task scheduling priority contribution degrees of the CPU, the memory and the disk I/O of the current physical computing node to the task to be scheduled, respectively; ρ is a compensation factor; R(task) is the distance of the task to be scheduled from the end-point task in the scheduling sequence; v is a task processing speed influence factor, v = (1-s)/t, where s is the ratio of the output data quantity to the input data quantity of the task to be scheduled, and t is the processed time of the task.
10. The distributed big data computing engine of claim 6, wherein the amount of resources required by the stream computing task is obtained by:
running the stream computing task independently on a physical computing node, and counting the CPU (Central Processing Unit) idle time t1 and the running time t2 respectively; the amount of CPU resources taskcpu required by the stream computing task is obtained by the following formula:
taskcpu=1-P=1-t1/(t2*Q);
wherein P is the CPU idle rate when the task runs independently, and Q is the number of CPUs;
the memory resource amount taskmem and the disk I/O resource amount taskio required by the stream computing task are obtained through the memory and disk I/O statistics tools provided by the corresponding physical computing node.
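For illustration of claim 10, the following sketch computes the CPU resource amount required by a stream computing task from the measured idle time t1, running time t2 and CPU count Q; the function name and the use of seconds as the time unit are assumptions.

```python
def required_cpu(idle_time_s, run_time_s, cpu_count):
    """taskcpu = 1 - P = 1 - t1 / (t2 * Q), where t1 is the accumulated CPU
    idle time, t2 the running time and Q the number of CPUs (claim 10)."""
    idle_rate = idle_time_s / (run_time_s * cpu_count)  # P, the CPU idle rate
    return 1.0 - idle_rate


# Example: 120 s of accumulated idle time over a 60 s run on 4 CPUs
# gives taskcpu = 1 - 120 / (60 * 4) = 0.5.
print(required_cpu(120.0, 60.0, 4))
```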
CN202311844217.6A 2023-12-28 2023-12-28 Distributed big data calculation engine Pending CN118193565A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311844217.6A CN118193565A (en) 2023-12-28 2023-12-28 Distributed big data calculation engine

Publications (1)

Publication Number Publication Date
CN118193565A true CN118193565A (en) 2024-06-14

Family

ID=91398996

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311844217.6A Pending CN118193565A (en) 2023-12-28 2023-12-28 Distributed big data calculation engine

Country Status (1)

Country Link
CN (1) CN118193565A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination