CN113568722A - Task scheduling optimization data processing system based on resource load prediction - Google Patents


Info

Publication number
CN113568722A
Authority
CN
China
Prior art keywords
data
task
task scheduling
load
resource load
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202110633424.1A
Other languages
Chinese (zh)
Inventor
李晖
韩文彪
丁玺润
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guizhou Youlian Borui Technology Co ltd
Original Assignee
Guizhou Youlian Borui Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guizhou Youlian Borui Technology Co ltd filed Critical Guizhou Youlian Borui Technology Co ltd
Priority to CN202110633424.1A priority Critical patent/CN113568722A/en
Publication of CN113568722A publication Critical patent/CN113568722A/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • G06F16/252Integrating or interfacing systems involving database management systems between a Database Management System and a front-end application
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083Techniques for rebalancing the load in a distributed system

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a task scheduling optimization data processing system based on resource load prediction. The system comprises: a data acquisition module for acquiring and processing data; a database module for storing and reading the obtained data; a resource load acquisition module for acquiring the resource utilization of the nodes; a task scheduling module for realizing cluster management, monitoring, load prediction and a UPSA task scheduling strategy; and a data visualization module for visually displaying the historical load data of the nodes. The invention sets a task scheduling strategy that predicts the utilization rate of each node in advance, reducing load skew among the nodes and improving task execution efficiency.

Description

Task scheduling optimization data processing system based on resource load prediction
Technical Field
The invention relates to the technical field of big data computing task scheduling, in particular to a task scheduling optimization data processing system based on resource load prediction.
Background
With the rapid development of information technology, enterprise information systems generate a large amount of business data. How to effectively extract useful information from this massive business data to support enterprise decision analysis has become a challenge for management. The basic purpose of data processing is to extract and derive valuable, meaningful data for specific areas from large, chaotic, hard-to-understand data. As various fields of social production and social life depend ever more heavily on data processing, the hardware resources of a data processing cluster are increasingly likely to become a bottleneck, reducing task execution efficiency and service quality.
At present, a single processing node can no longer meet the demands on data processing efficiency, so large-scale applications have begun using clusters to improve database reliability and performance, and major database vendors are striving to develop highly scalable database cluster technology. Although increasing the number of processing nodes can greatly improve processing performance, it also raises an enterprise's operating costs. Moreover, because requests to the database are real-time and dynamic, and the computing resources consumed by each request task may differ greatly, a default task scheduling policy can cause serious problems such as uneven load distribution in the cluster, under-utilization of resources, and even severely degraded task response times that harm the user experience. Research on task scheduling policies has therefore become a hotspot for optimizing data processing performance.
Disclosure of Invention
In order to overcome the technical defects in the prior art, the invention provides a task scheduling optimization data processing system based on resource load prediction, which can effectively solve the problems in the background art.
In order to solve the technical problems, the technical scheme provided by the invention is as follows:
the embodiment of the invention discloses a task scheduling optimization data processing system based on resource load prediction, which comprises:
the data acquisition module is used for realizing the acquisition and processing of data;
the database module is used for storing the obtained data and reading the data;
the resource load acquisition module is used for acquiring the resource utilization rate of the nodes;
the task scheduling module is used for realizing cluster management, monitoring, load prediction and a UPSA task scheduling strategy;
and the data visualization module is used for visually displaying the historical load data of the nodes.
In any of the above schemes, preferably, the functions of the data acquisition module include:
(1) collecting data, and collecting required information through personnel, equipment and network tools;
(2) converting the data: converting data inconsistent with the data in the database, aggregating the data according to the database's granularity, and applying the business rules;
(3) grouping efficiently according to the characteristics of the data;
(4) carrying out arithmetic and logic operation on the acquired information;
(5) storing the data after primary processing into a database;
(6) the data list is ordered according to rules.
In any of the above schemes, preferably, the database module includes a database and a time sequence database, the time sequence database is used for processing data with time tags, and the database is used for storing the obtained data.
In any of the above aspects, preferably, the database is an external storage device.
In any of the above schemes, preferably, the time-series database is an InfluxDB time-series data store.
In any of the above schemes, preferably, the InfluxDB time-series data store adopts an LSM structure: when data is recorded, new data is first written into memory, and when the amount of data in memory reaches a certain threshold, the data is persisted to the external storage device.
In any of the above schemes, preferably, the time-series data is compressed with delta-of-delta encoding, which improves compression efficiency.
In any of the above schemes, preferably, the time-series data is represented in two coordinate dimensions: the abscissa represents time, and the ordinate is composed of a data source and a monitoring index, where the data source is uniquely identified by a set of data tags.
In any of the above schemes, preferably, the CPU data source includes four monitoring indicators, namely user, system, iowait, and idle.
In any of the above schemes, preferably, the resource load acquisition module is configured to collect load indexes related to the CPU, memory, and I/O: it collects the CPU- and I/O-related load indexes by wrapping the iostat tool, and collects the memory indexes by wrapping the free tool.
In any of the above schemes, preferably, the resource load acquisition module communicates with the task scheduling module through a message queue system, and the message queue system adopts an Apache Kafka system.
In any of the foregoing schemes, preferably, the resource load acquisition module further includes a historical load data storage, the historical load data storage is provided with a CPU table, a mem table and a diskio table, the CPU table is used to record the CPU historical load condition of each machine, the mem table is used to record the memory usage condition of each machine, and the diskio table is used to store the I/O historical load of each machine.
In any of the above schemes, preferably, the task scheduling module includes a workload calculator, a monitor, and a predictor, the workload calculator is configured to calculate a workload of a task, the monitor is configured to collect load information of an execution node, and the predictor is configured to predict a future I/O utilization of the node.
In any of the above aspects, preferably, the workload calculator calculates the workload of a query task by the formula COST = S + P + W × T, where S is the start-up cost; P is the number of disk pages accessed while the query task runs; T is the number of tuples accessed; and W is a weighting factor.
In any of the above solutions, preferably, the execution of each task is divided into n steps, and the total workload of each task is calculated by the formula COST_total = COST_IO1 + COST_IO2 + … + COST_IOn, where COST_total is the total workload of the task; n is the number of steps that generate I/O; and COST_IOn is the I/O cost generated by the nth step, calculated by COST_IOn = EndCost_n − StartCost_n, where StartCost_n is the overhead at the start of the step and EndCost_n is the overhead at the end of the step.
In any of the above schemes, preferably, the predictor comprises the following working steps:
1. establishing a random forest prediction model to realize real-time prediction of node load;
2. and distributing tasks according to the load conditions of the nodes predicted by the random forest prediction model.
In any of the above schemes, preferably, when the random forest prediction model is built, four TPC-H data sets with sizes of 256 MB, 512 MB, 1 GB and 2 GB are generated and loaded into the database.
In any of the above schemes, preferably, the task scheduling module is provided with a load generation program, the load generation program randomly generates a query task and sends the query task to the execution node, the resource load collector is used to record resource load index changes and execution states of each node, and the obtained data is organized into a load snapshot and stored in the time-series database.
In any of the above schemes, preferably, after the task scheduling module receives the new task, the workload calculator estimates the I/O overhead of the new task in real time, and the pseudo code thereof is:
procedure receiveNewTask(Task t)
t.cost←getTaskCost(t)
addToTaskQueue(t)
end procedure。
in any of the above schemes, preferably, after the workload calculator estimates the I/O overhead of a new task in real time, the task is placed into the queue of tasks to be executed. A node whose predicted I/O utilization is below the threshold is regarded as available; when a node's predicted I/O utilization exceeds the threshold, it is temporarily regarded as unavailable. The default threshold is set to 100. The pseudo code is:
procedure scheduleTasks( )
    while taskQueue is not empty do
        availableNodes ← nodes whose predicted I/O utilization < threshold
        if availableNodes is empty then
            wait until a node becomes available
        else
            t ← taskQueue.dequeue( )
            node ← the node in availableNodes with the lowest predicted I/O utilization
            assignTask(t, node)
        end if
    end while
end procedure。
in any of the above schemes, preferably, the predictor simultaneously receives the current computing-resource usage of each node and predicts each node's load; the scheduler then assigns tasks according to each node's load and each task's workload. When the queue of tasks to be executed is not empty, tasks are taken out one by one and assigned to the node with the lowest predicted I/O utilization.
In any of the above schemes, preferably, after the task scheduling module sends the task, the predictor works to update the I/O utilization rate of the node, and the pseudo code is:
procedure afterTaskSent(Task t, Node n)
    load ← monitor.currentLoad(n)
    predictedIOUtilization(n) ← predictor.predict(load, t.cost)
end procedure。
in any of the above schemes, preferably, the data visualization module is configured to implement functions of: load data visualization and task log query.
In any of the above schemes, preferably, the load data in the load data visualization includes: a CPU line graph showing the proportions of time the CPU spends idle, waiting for I/O, in user mode and in system mode; memory usage; changes in disk read/write rate; disk read/write latency; changes in disk utilization; and changes in request queue length.
Compared with the prior art, the invention has the beneficial effects that:
1. according to the invention, by establishing the random forest prediction model, the node load is predicted in real time when the self-scheduling monitoring module is started.
2. The invention realizes the processing of complex tasks by setting the task scheduling strategy and is suitable for services in different scenes.
3. The invention sets the task scheduling strategy to predict the utilization rate of each node in advance, reduce load skew among the nodes, and improve task execution efficiency.
Drawings
The drawings are included to provide a further understanding of the invention, and are incorporated in and constitute a part of this specification.
FIG. 1 is a block diagram of a data processing system for optimizing task scheduling based on resource load prediction according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a task scheduling module in a task scheduling optimization data processing system based on resource load prediction according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
It will be understood that when an element is referred to as being "secured to" or "disposed on" another element, it can be directly on the other element or be indirectly on the other element. When an element is referred to as being "connected to" another element, it can be directly connected to the other element or be indirectly connected to the other element.
In the description of the present invention, it is to be understood that the terms "length", "width", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like, indicate orientations or positional relationships based on those shown in the drawings, and are only for convenience of description and simplicity of description, and do not indicate or imply that the referenced devices or elements must have a particular orientation, be constructed and operated in a particular orientation, and are not to be considered as limiting.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more, unless otherwise explicitly defined. The task scheduling optimization data processing system based on resource load prediction provided in this embodiment is described with reference to fig. 1 and fig. 2.
For better understanding of the above technical solutions, the technical solutions of the present invention will be described in detail below with reference to the drawings and the detailed description of the present invention.
Referring to fig. 1 and fig. 2, in a task scheduling optimization data processing system based on resource load prediction according to the present embodiment, the system includes:
the data acquisition module is used for realizing the acquisition and processing of data;
the database module is used for storing the obtained data and reading the data;
the resource load acquisition module is used for acquiring the resource utilization rate of the nodes;
the task scheduling module is used for realizing cluster management, monitoring, load prediction and the UPSA task scheduling strategy;
and the data visualization module is used for visually displaying the historical load data of the nodes.
Referring to fig. 1 and fig. 2, in the task scheduling optimization data processing system based on resource load prediction according to the present embodiment, the functions of the data acquisition module include:
(1) collecting data, and collecting required information through personnel, equipment and network tools;
(2) converting the data: converting data inconsistent with the data in the database, aggregating the data according to the database's granularity, and applying the business rules;
(3) grouping efficiently according to the characteristics of the data;
(4) carrying out arithmetic and logic operation on the acquired information;
(5) storing the data after primary processing into a database;
(6) the data list is ordered according to rules.
Referring to fig. 1 and fig. 2, in the task scheduling optimization data processing system based on resource load prediction provided in this embodiment, the database module includes a database and a time sequence database, where the time sequence database is used for processing data with time tags, and the database is used for storing the obtained data.
Time-series data is a series of data points indexed in the time dimension, describing how an object's state changes over historical time. Time-series workloads have prominent advantages in practical applications: high concurrency and throughput, with stable and continuous writes; write-heavy, read-light access; and no update operations. Time-series data also has significant storage advantages over other kinds of data.
Preferably, the database is an external storage device.
Preferably, the time-series database is an InfluxDB time-series data store.
InfluxDB is a professional time-series database: it stores only time-series data, and its storage model is heavily optimized for that workload.
Preferably, the InfluxDB time-series data store adopts an LSM structure: when data is recorded, new data is first written into memory, and when the amount of data in memory reaches a certain threshold, the data is persisted to the external storage device.
In use, because the InfluxDB store adopts an LSM structure, data tags shared by the same data source need not be stored redundantly, which greatly saves storage space; meanwhile, the time column and the index values are stored separately, so the time column and the metric columns can each be compressed independently.
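The LSM-style write path described above can be illustrated with a minimal Python sketch: writes are absorbed in an in-memory buffer and flushed to sorted, immutable segments once a size threshold is reached. Class and parameter names (MemTable, flush_threshold) are illustrative, not from the patent, and the segment list stands in for the external storage device.

```python
import bisect

class MemTable:
    """Minimal LSM-style write buffer: absorb writes in memory,
    flush a sorted, immutable segment once a threshold is reached."""
    def __init__(self, flush_threshold=4):
        self.flush_threshold = flush_threshold
        self.buffer = {}          # in-memory writes: timestamp -> value
        self.segments = []        # persisted sorted segments (stand-in for disk)

    def write(self, timestamp, value):
        self.buffer[timestamp] = value
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Persist the current buffer as one sorted immutable segment.
        self.segments.append(sorted(self.buffer.items()))
        self.buffer = {}

    def read(self, timestamp):
        # Newest data wins: check the buffer first, then newest segments.
        if timestamp in self.buffer:
            return self.buffer[timestamp]
        for segment in reversed(self.segments):
            i = bisect.bisect_left(segment, (timestamp,))
            if i < len(segment) and segment[i][0] == timestamp:
                return segment[i][1]
        return None
```

Sorting each flushed segment is what makes the later per-column compression and range queries cheap; real stores additionally compact segments in the background.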
Preferably, the time-series data is compressed with delta-of-delta encoding, thereby improving compression efficiency.
Preferably, the time-series data is represented in two coordinate dimensions: the abscissa represents time, and the ordinate is composed of a data source and a monitoring index, where the data source is uniquely identified by a set of data tags.
Preferably, the CPU data source comprises four monitoring indexes of a user, a system, iowait and idle.
When the system is used, adopting time-series data ensures high concurrency and throughput, with stable and continuous writes; the workload is also write-heavy and read-light, with no update operations.
Referring to fig. 1 and fig. 2, in the task scheduling optimization data processing system based on resource load prediction provided in this embodiment, the resource load collection module is configured to collect load indexes related to the CPU, memory, and I/O: it collects the CPU- and I/O-related load indexes by wrapping the iostat tool, and collects the memory indexes by wrapping the free tool.
Preferably, the resource load acquisition module communicates with the task scheduling module through a message queue system, and the message queue system adopts an Apache Kafka system.
In use, different processing operations cause different types of tasks to consume different resources. Generally, when executing a data processing task, disk I/O and the CPU are the most likely bottlenecks; and because in practice the CPU processes data much faster than the I/O subsystem, tasks are allocated and collected mainly according to I/O-related load.
Preferably, the resource load acquisition module further includes a historical load data storage, the historical load data storage is provided with a CPU table, a mem table and a diskio table, the CPU table is used for recording the CPU historical load condition of each machine, the mem table is used for recording the memory usage condition of each machine, and the diskio table is used for storing the I/O historical load of each machine.
When the method is used, the CPU table, the mem table and the diskio table are set, so that the query efficiency of massive historical data can be improved, and the requirement for massive real-time writing can be met.
Referring to fig. 1 and fig. 2, in the task scheduling optimization data processing system based on resource load prediction provided in this embodiment, the task scheduling module includes a workload calculator, a monitor, and a predictor, where the workload calculator is configured to calculate a workload of a task, the monitor is configured to collect load information of an execution node, and the predictor is configured to predict future I/O utilization of the node.
Preferably, the workload calculator calculates the workload of a query task by the formula COST = S + P + W × T, where S is the start-up cost; P is the number of disk pages accessed while the query task runs; T is the number of tuples accessed; and W is a weighting factor.
Preferably, the execution of each task is divided into n steps, and the total workload of each task is calculated by the formula COST_total = COST_IO1 + COST_IO2 + … + COST_IOn, where COST_total is the total workload of the task; n is the number of steps that generate I/O; and COST_IOn is the I/O cost generated by the nth step, calculated by COST_IOn = EndCost_n − StartCost_n, where StartCost_n is the overhead at the start of the step and EndCost_n is the overhead at the end of the step.
Preferably, the working steps of the predictor include:
1. establishing a random forest prediction model to realize real-time prediction of node load;
2. and distributing tasks according to the load conditions of the nodes predicted by the random forest prediction model.
Preferably, when the random forest prediction model is built, four TPC-H data sets with sizes of 256 MB, 512 MB, 1 GB and 2 GB are generated and loaded into the database.
Preferably, the task scheduling module is provided with a load generation program that randomly generates query tasks and sends them to the execution nodes; the resource load collector records the resource load index changes and execution state of each node, and the resulting data is organized into load snapshots and stored in the time-series database.
Referring to fig. 1 and fig. 2, in the task scheduling optimization data processing system based on resource load prediction according to the present embodiment, after the task scheduling module receives a new task, the workload calculator estimates the I/O overhead of the new task in real time, and the pseudo code of the workload calculator is:
procedure receiveNewTask(Task t)
t.cost←getTaskCost(t)
addToTaskQueue(t)
end procedure。
preferably, after the workload calculator estimates the I/O overhead of a new task in real time, the task is placed into the queue of tasks to be executed. A node whose predicted I/O utilization is below the threshold is regarded as available; when a node's predicted I/O utilization exceeds the threshold, it is temporarily regarded as unavailable. The default threshold is set to 100. The pseudo code is:
procedure scheduleTasks( )
    while taskQueue is not empty do
        availableNodes ← nodes whose predicted I/O utilization < threshold
        if availableNodes is empty then
            wait until a node becomes available
        else
            t ← taskQueue.dequeue( )
            node ← the node in availableNodes with the lowest predicted I/O utilization
            assignTask(t, node)
        end if
    end while
end procedure。
when the method is used, whenever the task queue is not empty, the UPSA obtains the list of available nodes: a node whose predicted I/O utilization is below the threshold is regarded as available. When a node's predicted I/O utilization exceeds the threshold, continuing to assign it tasks would cause resource contention and reduce execution efficiency, so it is temporarily regarded as unavailable until its predicted I/O utilization returns to a normal level.
Preferably, the predictor simultaneously receives the current computing-resource usage of each node and predicts each node's load; the scheduler then assigns tasks according to each node's load and each task's workload. When the queue of tasks to be executed is not empty, tasks are taken out one by one and assigned to the node with the lowest predicted I/O utilization.
Preferably, after the task scheduling module sends the task, the predictor works to update the I/O utilization rate of the node, and the pseudo code of the predictor is as follows:
procedure afterTaskSent(Task t, Node n)
    load ← monitor.currentLoad(n)
    predictedIOUtilization(n) ← predictor.predict(load, t.cost)
end procedure。
referring to fig. 1 and fig. 2, in the task scheduling optimization data processing system based on resource load prediction provided in this embodiment, the data visualization module is configured to implement the following functions: load data visualization and task log query.
Preferably, the load data in the load data visualization includes: a CPU line graph showing the proportions of time the CPU spends idle, waiting for I/O, in user mode and in system mode; memory usage; changes in disk read/write rate; disk read/write latency; changes in disk utilization; and changes in request queue length.
Compared with the prior art, the invention has the beneficial effects that:
1. according to the invention, by establishing the random forest prediction model, the node load is predicted in real time when the self-scheduling monitoring module is started.
2. The invention realizes the processing of complex tasks by setting the task scheduling strategy and is suitable for services in different scenes.
3. The invention sets the task scheduling strategy to predict the utilization rate of each node in advance, reduce load skew among the nodes, and improve task execution efficiency.
Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that various changes, modifications and substitutions can be made without departing from the spirit and scope of the invention as defined by the appended claims. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A task scheduling optimization data processing system based on resource load prediction is characterized in that: the system comprises:
the data acquisition module is used for realizing the acquisition and processing of data;
the database module is used for storing the obtained data and reading the data;
the resource load acquisition module is used for acquiring the resource utilization rate of the nodes;
the task scheduling module is used for realizing cluster management, monitoring, load prediction and a UPSA task scheduling strategy;
and the data visualization module is used for visually displaying the historical load data of the nodes.
2. The task scheduling optimization data processing system based on resource load prediction according to claim 1, wherein the database module comprises a database and a time series database, the time series database being used to process data with time labels and the database being used to store the obtained data.
3. The task scheduling optimization data processing system based on resource load prediction according to claim 2, wherein the time series database is an InfluxDB time series data store that adopts an LSM structure: when data are recorded, new data are first written into memory, and when the data volume in memory reaches a certain threshold, the data are persisted to an external storage device.
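As a rough illustration of the LSM-style write path described in this claim (buffer new points in memory, flush to external storage at a threshold), the sketch below uses an in-memory list in place of a real storage engine; all names are hypothetical and this is not InfluxDB's actual API:

```python
class MemTable:
    """Toy LSM-style write buffer: accumulate in memory, flush at a threshold."""

    def __init__(self, flush_threshold):
        self.flush_threshold = flush_threshold
        self.buffer = []            # in-memory writes (timestamp, value)
        self.flushed_segments = []  # stands in for sorted runs on external storage

    def write(self, timestamp, value):
        # New data is first written to memory.
        self.buffer.append((timestamp, value))
        # Once the in-memory volume reaches the threshold, persist the
        # batch (sorted, as an LSM segment would be) and clear the buffer.
        if len(self.buffer) >= self.flush_threshold:
            self.flushed_segments.append(sorted(self.buffer))
            self.buffer = []

table = MemTable(flush_threshold=3)
for t in range(5):
    table.write(t, t * 10)
# 5 writes with threshold 3: one flushed segment of 3 points, 2 still buffered
```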
4. The task scheduling optimization data processing system based on resource load prediction according to claim 3, wherein the time series data is compressed in a delta-of-delta manner and represented in two-dimensional coordinates, the abscissa representing time and the ordinate being composed of a data source and a monitoring index, wherein the data source is uniquely identified by a series of data labels.
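Delta-of-delta compression stores, instead of raw timestamps, the difference between successive deltas, which is near zero for regularly sampled metrics and therefore compresses well. A minimal encoder sketch (illustrative only, not the database's internal format):

```python
def delta_of_delta_encode(timestamps):
    """Encode timestamps as [first, first_delta, delta-of-deltas...]."""
    if len(timestamps) < 2:
        return list(timestamps)
    # First-order deltas between consecutive timestamps.
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    # Second-order: differences between consecutive deltas.
    return [timestamps[0], deltas[0]] + [
        b - a for a, b in zip(deltas, deltas[1:])
    ]

# A regularly sampled series has constant deltas, so delta-of-deltas are 0
encoded = delta_of_delta_encode([1000, 1010, 1020, 1030])
# → [1000, 10, 0, 0]
```

The long runs of zeros in the encoded form are what a bit-level codec then packs into very few bits per sample.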
5. The task scheduling optimization data processing system based on resource load prediction according to claim 4, wherein the resource load acquisition module is used to acquire load indicators related to the CPU, the memory and I/O; the resource load acquisition module acquires the CPU- and I/O-related load indicators by wrapping the iostat tool, and acquires the memory indicators by wrapping the free tool.
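Wrapping a command-line tool like free typically means invoking it via a subprocess and parsing its text output. A minimal sketch, assuming the Linux `free -m` output format (the parsing and field positions are simplified; the function names are not from the original):

```python
import subprocess

def parse_free_output(text):
    """Parse `free -m` output; return (total_mb, used_mb) for physical memory."""
    for line in text.splitlines():
        if line.startswith("Mem:"):
            fields = line.split()
            # Column 1 is total, column 2 is used (in MiB with -m)
            return int(fields[1]), int(fields[2])
    raise RuntimeError("unexpected `free` output")

def memory_usage_mb():
    # Run the tool and hand its stdout to the parser.
    out = subprocess.run(["free", "-m"], capture_output=True,
                         text=True, check=True).stdout
    return parse_free_output(out)

# Example output text for testing the parser without invoking `free`
sample = (
    "              total        used        free\n"
    "Mem:           7821        3200        4621\n"
)
```

Splitting the parser from the subprocess call keeps the parsing logic testable on canned output, which matters for a collector that must run on machines with slightly different tool versions.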
6. The task scheduling optimization data processing system based on resource load prediction according to claim 5, wherein the resource load acquisition module further comprises a historical load data store provided with a CPU table, a mem table and a diskio table, the CPU table recording the historical CPU load of each machine, the mem table recording the memory usage of each machine, and the diskio table storing the historical I/O load of each machine.
7. The task scheduling optimization data processing system based on resource load prediction according to claim 6, wherein the task scheduling module comprises a workload calculator, a monitor and a predictor, the workload calculator being used to calculate the workload of a task, the monitor to collect the load information of the execution nodes, and the predictor to predict the future I/O utilization rate of a node.
8. The task scheduling optimization data processing system based on resource load prediction according to claim 7, wherein the execution of each task is divided into N steps, and the total workload of each task is calculated by the formula

COST_total = Σ(n=1..N) COST_IOn

where COST_total is the total workload of the task, N is the number of I/O-generating steps, and COST_IOn is the I/O cost generated by the n-th step, calculated as COST_IOn = EndCost_n − StartCost_n, StartCost_n being the overhead at the start of the step and EndCost_n the overhead at its end.
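Numerically, the cost model of this claim just sums the per-step I/O overheads. A small worked example (the step cost pairs below are illustrative, not from the original):

```python
def total_workload(step_costs):
    """COST_total = sum over I/O-generating steps of (EndCost - StartCost)."""
    return sum(end - start for start, end in step_costs)

# Three I/O-generating steps given as (StartCost, EndCost) overhead pairs
steps = [(0.0, 4.0), (4.0, 9.0), (9.0, 12.0)]
# → 4 + 5 + 3 = 12
```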
9. The task scheduling optimization data processing system based on resource load prediction according to claim 8, wherein the working steps of the predictor comprise:
establishing a random forest prediction model to realize real-time prediction of node load;
distributing tasks according to the load conditions of the nodes predicted by the random forest prediction model;
after the task scheduling module receives a new task, the workload calculator estimates the I/O overhead of the new task in real time; the pseudo code of the workload calculator is as follows:
procedure receiveNewTask(Task t)
    t.cost ← getTaskCost(t)
    addToTaskQueue(t)
end procedure.
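The claim's pseudocode can be rendered as runnable Python; the Task fields, the queue, and the cost estimator below are placeholders wired to the cost model of claim 8, not the original implementation:

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    step_costs: list   # (StartCost, EndCost) per I/O-generating step
    cost: float = 0.0  # filled in on receipt

task_queue = deque()

def get_task_cost(task):
    # Estimate total I/O overhead as the sum of per-step costs (claim 8).
    return sum(end - start for start, end in task.step_costs)

def receive_new_task(task):
    # procedure receiveNewTask(Task t)
    task.cost = get_task_cost(task)   # t.cost <- getTaskCost(t)
    task_queue.append(task)           # addToTaskQueue(t)

receive_new_task(Task("etl-job", [(0.0, 2.0), (2.0, 5.0)]))
```

Estimating the cost at receipt time means the queue always holds tasks with known workloads, so the scheduler can match each task against the predicted node loads without recomputing anything at dispatch.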
10. The task scheduling optimization data processing system based on resource load prediction according to claim 9, wherein the load data in the load data visualization comprises: a CPU line graph showing the proportion of time the CPU spends idle, waiting for I/O, in user mode, and in system mode; memory usage; changes in disk read/write rate; disk read/write latency; changes in disk utilization; and changes in request queue length.
CN202110633424.1A 2021-06-07 2021-06-07 Task scheduling optimization data processing system based on resource load prediction Withdrawn CN113568722A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110633424.1A CN113568722A (en) 2021-06-07 2021-06-07 Task scheduling optimization data processing system based on resource load prediction


Publications (1)

Publication Number Publication Date
CN113568722A true CN113568722A (en) 2021-10-29

Family

ID=78161108

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110633424.1A Withdrawn CN113568722A (en) 2021-06-07 2021-06-07 Task scheduling optimization data processing system based on resource load prediction

Country Status (1)

Country Link
CN (1) CN113568722A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116166443A (en) * 2023-04-23 2023-05-26 欢喜时代(深圳)科技有限公司 Load optimization method and system of game task system
CN116166443B (en) * 2023-04-23 2023-06-23 欢喜时代(深圳)科技有限公司 Load optimization method and system of game task system

Similar Documents

Publication Publication Date Title
CN110166282B (en) Resource allocation method, device, computer equipment and storage medium
US7979399B2 (en) Database journaling in a multi-node environment
US8090710B2 (en) Index maintenance in a multi-node database
Cheng et al. Erms: An elastic replication management system for hdfs
US8627322B2 (en) System and method of active risk management to reduce job de-scheduling probability in computer clusters
CN104965861B (en) A kind of data access monitoring device
US9870269B1 (en) Job allocation in a clustered environment
US8195642B2 (en) Partial indexes for multi-node database
US9235590B1 (en) Selective data compression in a database system
US20070016558A1 (en) Method and apparatus for dynamically associating different query execution strategies with selective portions of a database table
US20080065588A1 (en) Selectively Logging Query Data Based On Cost
CN110888714A (en) Container scheduling method, device and computer-readable storage medium
CN107291539B (en) Cluster program scheduler method based on resource significance level
US20060074875A1 (en) Method and apparatus for predicting relative selectivity of database query conditions using respective cardinalities associated with different subsets of database records
CN109918450A (en) Based on the distributed parallel database and storage method under analysis classes scene
CN116755939B (en) Intelligent data backup task planning method and system based on system resources
CN110807145A (en) Query engine acquisition method, device and computer-readable storage medium
CN111488323B (en) Data processing method and device and electronic equipment
CN111176831B (en) Dynamic thread mapping optimization method and device based on multithreading shared memory communication
US7979400B2 (en) Database journaling in a multi-node environment
CN113568722A (en) Task scheduling optimization data processing system based on resource load prediction
Chai et al. Adaptive lower-level driven compaction to optimize LSM-tree key-value stores
KR20190061247A (en) Real time resource usage ratio monitoring system of big data processing platform
CN109117285B (en) Distributed memory computing cluster system supporting high concurrency
US9305045B1 (en) Data-temperature-based compression in a database system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20211029