CN113568722A - Task scheduling optimization data processing system based on resource load prediction - Google Patents


Info

Publication number
CN113568722A
Authority
CN
China
Prior art keywords
data
task
task scheduling
load
resource load
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202110633424.1A
Other languages
Chinese (zh)
Inventor
李晖
韩文彪
丁玺润
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guizhou Youlian Borui Technology Co ltd
Original Assignee
Guizhou Youlian Borui Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guizhou Youlian Borui Technology Co ltd filed Critical Guizhou Youlian Borui Technology Co ltd
Priority to CN202110633424.1A priority Critical patent/CN113568722A/en
Publication of CN113568722A publication Critical patent/CN113568722A/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • G06F16/252Integrating or interfacing systems involving database management systems between a Database Management System and a front-end application
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083Techniques for rebalancing the load in a distributed system

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a task scheduling optimization data processing system based on resource load prediction. The system comprises: a data acquisition module for acquiring and processing data; a database module for storing and reading the obtained data; a resource load acquisition module for acquiring the resource utilization of the nodes; a task scheduling module for realizing cluster management, monitoring, load prediction and a UPSA task scheduling strategy; and a data visualization module for visually displaying the historical load data of the nodes. The invention sets a task scheduling strategy that predicts the utilization rate of each node in advance, reducing load skew among the nodes and improving task execution efficiency.

Description

Task scheduling optimization data processing system based on resource load prediction
Technical Field
The invention relates to the technical field of big data computing task scheduling, in particular to a task scheduling optimization data processing system based on resource load prediction.
Background
With the rapid development of information technology, enterprise information systems generate a large amount of business data. How to effectively extract useful information from this massive business data to support enterprise decision analysis has become a challenge for management. The basic purpose of data processing is to extract and derive valuable, meaningful data for specific areas from large, chaotic, hard-to-understand data. As various fields of social production and social life depend ever more heavily on data processing, the hardware resources of a data processing cluster are increasingly likely to become a bottleneck, reducing task execution efficiency and service quality.
At present, a single processing node can no longer meet the demands on data processing efficiency, so large-scale applications have begun using clusters to improve database reliability and performance, and major database vendors are striving to develop highly scalable database cluster technology. Although increasing the number of processing nodes can greatly improve processing performance, it also raises an enterprise's operating costs. Moreover, because requests to the database are real-time and dynamic, and the computing resources consumed by each request task may differ greatly, a default task scheduling policy can cause serious problems such as uneven load distribution in the cluster, under-utilization of resources, and even severely degraded task response times that harm the user experience. Research on task scheduling policies has therefore become a hotspot for optimizing data processing performance.
Disclosure of Invention
In order to overcome the technical defects in the prior art, the invention provides a task scheduling optimization data processing system based on resource load prediction, which can effectively solve the problems in the background art.
In order to solve the technical problems, the technical scheme provided by the invention is as follows:
the embodiment of the invention discloses a task scheduling optimization data processing system based on resource load prediction, which comprises:
the data acquisition module is used for realizing the acquisition and processing of data;
the database module is used for storing the obtained data and reading the data;
the resource load acquisition module is used for acquiring the resource utilization rate of the nodes;
the task scheduling module is used for realizing cluster management, monitoring, load prediction and a UPSA task scheduling strategy;
and the data visualization module is used for visually displaying the historical load data of the nodes.
In any of the above schemes, preferably, the functions of the data acquisition module include:
(1) collecting data, and collecting required information through personnel, equipment and network tools;
(2) converting the data: converting data inconsistent with the data in the database, aggregating the data according to the database's granularity, and applying the business rules;
(3) grouping efficiently according to the characteristics of the data;
(4) carrying out arithmetic and logic operation on the acquired information;
(5) storing the data after primary processing into a database;
(6) the data list is ordered according to rules.
In any of the above schemes, preferably, the database module includes a database and a time sequence database, the time sequence database is used for processing data with time tags, and the database is used for storing the obtained data.
In any of the above aspects, preferably, the database is an external storage device.
In any of the above schemes, preferably, the time-series database is an InfluxDB time-series data store.
In any of the above schemes, preferably, the InfluxDB time-series data store adopts an LSM structure: when data is recorded, new data is first written into memory, and when the amount of data in memory reaches a certain threshold, the data is persisted to the external storage device.
In any of the above schemes, preferably, the time-series data is compressed with delta-of-delta encoding, which improves compression efficiency.
In any of the above schemes, preferably, the time-series data is represented in two coordinate dimensions: the abscissa represents time, and the ordinate is composed of a data source and a monitoring index, where the data source is uniquely identified by a set of data tags.
In any of the above schemes, preferably, the CPU data source includes four monitoring indicators, namely user, system, iowait, and idle.
In any of the above schemes, preferably, the resource load acquisition module is configured to collect load indexes related to the CPU, memory, and I/O: it collects the CPU- and I/O-related load indexes by wrapping the iostat tool, and collects the memory indexes by wrapping the free tool.
In any of the above schemes, preferably, the resource load acquisition module communicates with the task scheduling module through a message queue system, and the message queue system adopts an Apache Kafka system.
In any of the foregoing schemes, preferably, the resource load acquisition module further includes a historical load data storage, the historical load data storage is provided with a CPU table, a mem table and a diskio table, the CPU table is used to record the CPU historical load condition of each machine, the mem table is used to record the memory usage condition of each machine, and the diskio table is used to store the I/O historical load of each machine.
In any of the above schemes, preferably, the task scheduling module includes a workload calculator, a monitor, and a predictor, the workload calculator is configured to calculate a workload of a task, the monitor is configured to collect load information of an execution node, and the predictor is configured to predict a future I/O utilization of the node.
In any of the above aspects, preferably, the workload calculator calculates the workload of a query task by the formula COST = S + P + W × T, where S is the start-up cost; P is the number of disk pages accessed while the query task runs; T is the number of tuples accessed; and W is a weighting factor.
In any of the above solutions, preferably, the execution of each task is divided into n steps, and the total workload of each task is calculated by the formula COST_total = COST_IO1 + COST_IO2 + … + COST_IOn, where COST_total is the total workload of the task; n is the number of steps that generate I/O; and COST_IOn is the I/O cost generated by the nth step, calculated by COST_IOn = EndCost_n − StartCost_n, where StartCost_n is the overhead at the start of the step and EndCost_n is the overhead at the end of the step.
In any of the above schemes, preferably, the predictor comprises the following working steps:
1. establishing a random forest prediction model to realize real-time prediction of node load;
2. and distributing tasks according to the load conditions of the nodes predicted by the random forest prediction model.
In any of the above schemes, preferably, when the random forest prediction model is built, four TPC-H data sets with sizes of 256 MB, 512 MB, 1 GB and 2 GB are generated and loaded into the database.
In any of the above schemes, preferably, the task scheduling module is provided with a load generation program, the load generation program randomly generates a query task and sends the query task to the execution node, the resource load collector is used to record resource load index changes and execution states of each node, and the obtained data is organized into a load snapshot and stored in the time-series database.
In any of the above schemes, preferably, after the task scheduling module receives the new task, the workload calculator estimates the I/O overhead of the new task in real time, and the pseudo code thereof is:
procedure receiveNewTask(Task t)
t.cost←getTaskCost(t)
addToTaskQueue(t)
end procedure。
in any of the above schemes, preferably, after the workload calculator estimates the I/O overhead of a new task in real time, the task is placed into the queue of tasks to be executed. A node whose predicted I/O utilization is below the threshold is regarded as available; when a node's predicted I/O utilization exceeds the threshold, it is temporarily regarded as unavailable. The default threshold is set to 100. The pseudo code is:
procedure scheduleTasks( )
    while taskQueue is not empty do
        availableNodes ← nodes whose predicted I/O utilization < threshold
        if availableNodes is empty then
            wait until a node becomes available
        else
            t ← taskQueue.dequeue( )
            node ← the node in availableNodes with the lowest predicted I/O utilization
            assignTask(t, node)
        end if
    end while
end procedure。
in any of the above schemes, preferably, the predictor simultaneously receives the current computing-resource usage of each node and predicts each node's load; the scheduler then assigns tasks according to each node's load and each task's workload. When the queue of tasks to be executed is not empty, tasks are taken out one by one and assigned to the node with the lowest predicted I/O utilization.
In any of the above schemes, preferably, after the task scheduling module sends the task, the predictor works to update the I/O utilization rate of the node, and the pseudo code is:
procedure afterTaskSent(Task t, Node n)
    load ← monitor.currentLoad(n)
    predictedIOUtilization(n) ← predictor.predict(load, t.cost)
end procedure。
in any of the above schemes, preferably, the data visualization module is configured to implement functions of: load data visualization and task log query.
In any of the above schemes, preferably, the load data in the load data visualization includes: a CPU line graph showing the proportions of time the CPU spends idle, waiting for I/O, in user mode and in system mode; memory usage; changes in disk read/write rate; disk read/write latency; changes in disk utilization; and changes in request queue length.
Compared with the prior art, the invention has the beneficial effects that:
1. according to the invention, by establishing the random forest prediction model, the node load is predicted in real time when the self-scheduling monitoring module is started.
2. The invention realizes the processing of complex tasks by setting the task scheduling strategy and is suitable for services in different scenes.
3. The invention sets the task scheduling strategy to predict the utilization rate of each node in advance, reduce load skew among the nodes, and improve task execution efficiency.
Drawings
The drawings are included to provide a further understanding of the invention, and are incorporated in and constitute a part of this specification.
FIG. 1 is a block diagram of a data processing system for optimizing task scheduling based on resource load prediction according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a task scheduling module in a task scheduling optimization data processing system based on resource load prediction according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
It will be understood that when an element is referred to as being "secured to" or "disposed on" another element, it can be directly on the other element or be indirectly on the other element. When an element is referred to as being "connected to" another element, it can be directly connected to the other element or be indirectly connected to the other element.
In the description of the present invention, it is to be understood that the terms "length", "width", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like, indicate orientations or positional relationships based on those shown in the drawings, and are only for convenience of description and simplicity of description, and do not indicate or imply that the referenced devices or elements must have a particular orientation, be constructed and operated in a particular orientation, and are not to be considered as limiting.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more, unless otherwise explicitly defined. The task scheduling optimization data processing system based on resource load prediction provided in this embodiment is described with reference to fig. 1 and fig. 2.
For better understanding of the above technical solutions, the technical solutions of the present invention will be described in detail below with reference to the drawings and the detailed description of the present invention.
Referring to fig. 1 and fig. 2, in a task scheduling optimization data processing system based on resource load prediction according to the present embodiment, the system includes:
the data acquisition module is used for realizing the acquisition and processing of data;
the database module is used for storing the obtained data and reading the data;
the resource load acquisition module is used for acquiring the resource utilization rate of the nodes;
the task scheduling module is used for realizing cluster management, monitoring, load prediction and the UPSA task scheduling strategy;
and the data visualization module is used for visually displaying the historical load data of the nodes.
Referring to fig. 1 and fig. 2, in the task scheduling optimization data processing system based on resource load prediction according to the present embodiment, the functions of the data acquisition module include:
(1) collecting data, and collecting required information through personnel, equipment and network tools;
(2) converting the data: converting data inconsistent with the data in the database, aggregating the data according to the database's granularity, and applying the business rules;
(3) grouping efficiently according to the characteristics of the data;
(4) carrying out arithmetic and logic operation on the acquired information;
(5) storing the data after primary processing into a database;
(6) the data list is ordered according to rules.
Referring to fig. 1 and fig. 2, in the task scheduling optimization data processing system based on resource load prediction provided in this embodiment, the database module includes a database and a time sequence database, where the time sequence database is used for processing data with time tags, and the database is used for storing the obtained data.
Time-series data is a series of data points indexed in the time dimension, describing how an object's state changes over historical time. Time-series workloads have prominent advantages in practical applications: high concurrency and throughput, with stable and continuous writes; write-heavy, read-light access; and no update operations. Time-series data also has significant storage advantages over other kinds of data.
Preferably, the database is an external storage device.
Preferably, the time-series database is an InfluxDB time-series data store.
InfluxDB is a professional time-series database: it stores only time-series data, and its storage model is heavily optimized for that workload.
Preferably, the InfluxDB time-series data store adopts an LSM structure: when data is recorded, new data is first written into memory, and when the amount of data in memory reaches a certain threshold, the data is persisted to the external storage device.
In use, because the InfluxDB store adopts an LSM structure, data tags shared by the same data source need not be stored redundantly, which greatly saves storage space; meanwhile, the time column and the index values are stored separately, so the time column and the metric columns can each be compressed independently.
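The LSM-style write path described above can be illustrated with a minimal Python sketch: writes are absorbed in an in-memory buffer and flushed to sorted, immutable segments once a size threshold is reached. Class and parameter names (MemTable, flush_threshold) are illustrative, not from the patent, and the segment list stands in for the external storage device.

```python
import bisect

class MemTable:
    """Minimal LSM-style write buffer: absorb writes in memory,
    flush a sorted, immutable segment once a threshold is reached."""
    def __init__(self, flush_threshold=4):
        self.flush_threshold = flush_threshold
        self.buffer = {}          # in-memory writes: timestamp -> value
        self.segments = []        # persisted sorted segments (stand-in for disk)

    def write(self, timestamp, value):
        self.buffer[timestamp] = value
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Persist the current buffer as one sorted immutable segment.
        self.segments.append(sorted(self.buffer.items()))
        self.buffer = {}

    def read(self, timestamp):
        # Newest data wins: check the buffer first, then newest segments.
        if timestamp in self.buffer:
            return self.buffer[timestamp]
        for segment in reversed(self.segments):
            i = bisect.bisect_left(segment, (timestamp,))
            if i < len(segment) and segment[i][0] == timestamp:
                return segment[i][1]
        return None
```

Sorting each flushed segment is what makes the later per-column compression and range queries cheap; real stores additionally compact segments in the background.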
Preferably, the time-series data is compressed with delta-of-delta encoding, thereby improving compression efficiency.
Preferably, the time-series data is represented in two coordinate dimensions: the abscissa represents time, and the ordinate is composed of a data source and a monitoring index, where the data source is uniquely identified by a set of data tags.
Preferably, the CPU data source comprises four monitoring indexes of a user, a system, iowait and idle.
When the system is used, adopting time-series data ensures high concurrency and throughput, with stable and continuous writes; the workload is also write-heavy and read-light, with no update operations.
Referring to fig. 1 and fig. 2, in the task scheduling optimization data processing system based on resource load prediction provided in this embodiment, the resource load collection module is configured to collect load indexes related to the CPU, memory, and I/O: it collects the CPU- and I/O-related load indexes by wrapping the iostat tool, and collects the memory indexes by wrapping the free tool.
Preferably, the resource load acquisition module communicates with the task scheduling module through a message queue system, and the message queue system adopts an Apache Kafka system.
In use, different processing operations cause different types of tasks to consume different resources. Generally, when executing a data processing task, disk I/O and the CPU are the most likely bottlenecks; and because in practice the CPU processes data much faster than the I/O subsystem, tasks are allocated and collected mainly according to I/O-related load.
Preferably, the resource load acquisition module further includes a historical load data storage, the historical load data storage is provided with a CPU table, a mem table and a diskio table, the CPU table is used for recording the CPU historical load condition of each machine, the mem table is used for recording the memory usage condition of each machine, and the diskio table is used for storing the I/O historical load of each machine.
When the method is used, the CPU table, the mem table and the diskio table are set, so that the query efficiency of massive historical data can be improved, and the requirement for massive real-time writing can be met.
Referring to fig. 1 and fig. 2, in the task scheduling optimization data processing system based on resource load prediction provided in this embodiment, the task scheduling module includes a workload calculator, a monitor, and a predictor, where the workload calculator is configured to calculate a workload of a task, the monitor is configured to collect load information of an execution node, and the predictor is configured to predict future I/O utilization of the node.
Preferably, the workload calculator calculates the workload of a query task by the formula COST = S + P + W × T, where S is the start-up cost; P is the number of disk pages accessed while the query task runs; T is the number of tuples accessed; and W is a weighting factor.
Preferably, the execution of each task is divided into n steps, and the total workload of each task is calculated by the formula COST_total = COST_IO1 + COST_IO2 + … + COST_IOn, where COST_total is the total workload of the task; n is the number of steps that generate I/O; and COST_IOn is the I/O cost generated by the nth step, calculated by COST_IOn = EndCost_n − StartCost_n, where StartCost_n is the overhead at the start of the step and EndCost_n is the overhead at the end of the step.
Preferably, the working steps of the predictor include:
1. establishing a random forest prediction model to realize real-time prediction of node load;
2. and distributing tasks according to the load conditions of the nodes predicted by the random forest prediction model.
Preferably, when the random forest prediction model is built, four TPC-H data sets with sizes of 256 MB, 512 MB, 1 GB and 2 GB are generated and loaded into the database.
Preferably, the task scheduling module is provided with a load generation program that randomly generates query tasks and sends them to the execution nodes; the resource load collector records the resource load index changes and execution state of each node, and the resulting data is organized into load snapshots and stored in the time-series database.
Referring to fig. 1 and fig. 2, in the task scheduling optimization data processing system based on resource load prediction according to the present embodiment, after the task scheduling module receives a new task, the workload calculator estimates the I/O overhead of the new task in real time, and the pseudo code of the workload calculator is:
procedure receiveNewTask(Task t)
t.cost←getTaskCost(t)
addToTaskQueue(t)
end procedure。
preferably, after the workload calculator estimates the I/O overhead of a new task in real time, the task is placed into the queue of tasks to be executed. A node whose predicted I/O utilization is below the threshold is regarded as available; when a node's predicted I/O utilization exceeds the threshold, it is temporarily regarded as unavailable. The default threshold is set to 100. The pseudo code is:
procedure scheduleTasks( )
    while taskQueue is not empty do
        availableNodes ← nodes whose predicted I/O utilization < threshold
        if availableNodes is empty then
            wait until a node becomes available
        else
            t ← taskQueue.dequeue( )
            node ← the node in availableNodes with the lowest predicted I/O utilization
            assignTask(t, node)
        end if
    end while
end procedure。
when the method is used, whenever the task queue is not empty, the UPSA obtains the list of available nodes: a node whose predicted I/O utilization is below the threshold is regarded as available. When a node's predicted I/O utilization exceeds the threshold, continuing to assign it tasks would cause resource contention and reduce execution efficiency, so it is temporarily regarded as unavailable until its predicted I/O utilization returns to a normal level.
Preferably, the predictor simultaneously receives the current computing-resource usage of each node and predicts each node's load; the scheduler then assigns tasks according to each node's load and each task's workload. When the queue of tasks to be executed is not empty, tasks are taken out one by one and assigned to the node with the lowest predicted I/O utilization.
Preferably, after the task scheduling module sends the task, the predictor works to update the I/O utilization rate of the node, and the pseudo code of the predictor is as follows:
procedure afterTaskSent(Task t, Node n)
    load ← monitor.currentLoad(n)
    predictedIOUtilization(n) ← predictor.predict(load, t.cost)
end procedure。
referring to fig. 1 and fig. 2, in the task scheduling optimization data processing system based on resource load prediction provided in this embodiment, the data visualization module is configured to implement the following functions: load data visualization and task log query.
Preferably, the load data in the load data visualization includes: a CPU line graph showing the proportions of time the CPU spends idle, waiting for I/O, in user mode and in system mode; memory usage; changes in disk read/write rate; disk read/write latency; changes in disk utilization; and changes in request queue length.
Compared with the prior art, the invention has the beneficial effects that:
1. according to the invention, by establishing the random forest prediction model, the node load is predicted in real time when the self-scheduling monitoring module is started.
2. The invention realizes the processing of complex tasks by setting the task scheduling strategy and is suitable for services in different scenes.
3. The invention sets the task scheduling strategy to predict the utilization rate of each node in advance, reduce load skew among the nodes, and improve task execution efficiency.
Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that various changes, modifications and substitutions can be made without departing from the spirit and scope of the invention as defined by the appended claims. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A task scheduling optimization data processing system based on resource load prediction is characterized in that: the system comprises:
the data acquisition module is used for realizing the acquisition and processing of data;
the database module is used for storing the obtained data and reading the data;
the resource load acquisition module is used for acquiring the resource utilization rate of the nodes;
the task scheduling module is used for realizing cluster management, monitoring, load prediction and a UPSA task scheduling strategy;
and the data visualization module is used for visually displaying the historical load data of the nodes.
2. The task scheduling optimization data processing system based on resource load prediction according to claim 1, wherein the database module comprises a database and a time series database, the time series database being used to process data with time labels and the database being used to store the obtained data.
3. The task scheduling optimization data processing system based on resource load prediction according to claim 2, wherein the time series database is an InfluxDB time series data store that adopts an LSM structure: when data are recorded, new data are first written into memory, and when the data volume in memory reaches a certain threshold, the data are persisted to an external storage device.
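As a rough illustration of the LSM-style write path described in this claim (buffer new points in memory, flush to external storage at a threshold), the sketch below uses an in-memory list in place of a real storage engine; all names are hypothetical and this is not InfluxDB's actual API:

```python
class MemTable:
    """Toy LSM-style write buffer: accumulate in memory, flush at a threshold."""

    def __init__(self, flush_threshold):
        self.flush_threshold = flush_threshold
        self.buffer = []            # in-memory writes (timestamp, value)
        self.flushed_segments = []  # stands in for sorted runs on external storage

    def write(self, timestamp, value):
        # New data is first written to memory.
        self.buffer.append((timestamp, value))
        # Once the in-memory volume reaches the threshold, persist the
        # batch (sorted, as an LSM segment would be) and clear the buffer.
        if len(self.buffer) >= self.flush_threshold:
            self.flushed_segments.append(sorted(self.buffer))
            self.buffer = []

table = MemTable(flush_threshold=3)
for t in range(5):
    table.write(t, t * 10)
# 5 writes with threshold 3: one flushed segment of 3 points, 2 still buffered
```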
4. The task scheduling optimization data processing system based on resource load prediction according to claim 3, wherein the time series data is compressed in a delta-of-delta manner and represented in two-dimensional coordinates, the abscissa representing time and the ordinate being composed of a data source and a monitoring index, wherein the data source is uniquely identified by a series of data labels.
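Delta-of-delta compression stores, instead of raw timestamps, the difference between successive deltas, which is near zero for regularly sampled metrics and therefore compresses well. A minimal encoder sketch (illustrative only, not the database's internal format):

```python
def delta_of_delta_encode(timestamps):
    """Encode timestamps as [first, first_delta, delta-of-deltas...]."""
    if len(timestamps) < 2:
        return list(timestamps)
    # First-order deltas between consecutive timestamps.
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    # Second-order: differences between consecutive deltas.
    return [timestamps[0], deltas[0]] + [
        b - a for a, b in zip(deltas, deltas[1:])
    ]

# A regularly sampled series has constant deltas, so delta-of-deltas are 0
encoded = delta_of_delta_encode([1000, 1010, 1020, 1030])
# → [1000, 10, 0, 0]
```

The long runs of zeros in the encoded form are what a bit-level codec then packs into very few bits per sample.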
5. The task scheduling optimization data processing system based on resource load prediction according to claim 4, wherein the resource load acquisition module is used to acquire load indicators related to the CPU, the memory and I/O; the resource load acquisition module acquires the CPU- and I/O-related load indicators by wrapping the iostat tool, and acquires the memory indicators by wrapping the free tool.
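Wrapping a command-line tool like free typically means invoking it via a subprocess and parsing its text output. A minimal sketch, assuming the Linux `free -m` output format (the parsing and field positions are simplified; the function names are not from the original):

```python
import subprocess

def parse_free_output(text):
    """Parse `free -m` output; return (total_mb, used_mb) for physical memory."""
    for line in text.splitlines():
        if line.startswith("Mem:"):
            fields = line.split()
            # Column 1 is total, column 2 is used (in MiB with -m)
            return int(fields[1]), int(fields[2])
    raise RuntimeError("unexpected `free` output")

def memory_usage_mb():
    # Run the tool and hand its stdout to the parser.
    out = subprocess.run(["free", "-m"], capture_output=True,
                         text=True, check=True).stdout
    return parse_free_output(out)

# Example output text for testing the parser without invoking `free`
sample = (
    "              total        used        free\n"
    "Mem:           7821        3200        4621\n"
)
```

Splitting the parser from the subprocess call keeps the parsing logic testable on canned output, which matters for a collector that must run on machines with slightly different tool versions.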
6. The task scheduling optimization data processing system based on resource load prediction according to claim 5, wherein the resource load acquisition module further comprises a historical load data store provided with a CPU table, a mem table and a diskio table, the CPU table recording the historical CPU load of each machine, the mem table recording the memory usage of each machine, and the diskio table storing the historical I/O load of each machine.
7. The task scheduling optimization data processing system based on resource load prediction according to claim 6, wherein the task scheduling module comprises a workload calculator, a monitor and a predictor, the workload calculator being used to calculate the workload of a task, the monitor to collect the load information of the execution nodes, and the predictor to predict the future I/O utilization rate of a node.
8. The task scheduling optimization data processing system based on resource load prediction according to claim 7, wherein the execution of each task is divided into N steps, and the total workload of each task is calculated by the formula

COST_total = Σ(n=1..N) COST_IOn

where COST_total is the total workload of the task, N is the number of I/O-generating steps, and COST_IOn is the I/O cost generated by the n-th step, calculated as COST_IOn = EndCost_n − StartCost_n, StartCost_n being the overhead at the start of the step and EndCost_n the overhead at its end.
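Numerically, the cost model of this claim just sums the per-step I/O overheads. A small worked example (the step cost pairs below are illustrative, not from the original):

```python
def total_workload(step_costs):
    """COST_total = sum over I/O-generating steps of (EndCost - StartCost)."""
    return sum(end - start for start, end in step_costs)

# Three I/O-generating steps given as (StartCost, EndCost) overhead pairs
steps = [(0.0, 4.0), (4.0, 9.0), (9.0, 12.0)]
# → 4 + 5 + 3 = 12
```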
9. The task scheduling optimization data processing system based on resource load prediction according to claim 8, wherein the working steps of the predictor comprise:
establishing a random forest prediction model to realize real-time prediction of node load;
distributing tasks according to the load conditions of the nodes predicted by the random forest prediction model;
after the task scheduling module receives a new task, the workload calculator estimates the I/O overhead of the new task in real time; the pseudo code of the workload calculator is as follows:
procedure receiveNewTask(Task t)
    t.cost ← getTaskCost(t)
    addToTaskQueue(t)
end procedure.
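The claim's pseudocode can be rendered as runnable Python; the Task fields, the queue, and the cost estimator below are placeholders wired to the cost model of claim 8, not the original implementation:

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    step_costs: list   # (StartCost, EndCost) per I/O-generating step
    cost: float = 0.0  # filled in on receipt

task_queue = deque()

def get_task_cost(task):
    # Estimate total I/O overhead as the sum of per-step costs (claim 8).
    return sum(end - start for start, end in task.step_costs)

def receive_new_task(task):
    # procedure receiveNewTask(Task t)
    task.cost = get_task_cost(task)   # t.cost <- getTaskCost(t)
    task_queue.append(task)           # addToTaskQueue(t)

receive_new_task(Task("etl-job", [(0.0, 2.0), (2.0, 5.0)]))
```

Estimating the cost at receipt time means the queue always holds tasks with known workloads, so the scheduler can match each task against the predicted node loads without recomputing anything at dispatch.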
10. The task scheduling optimization data processing system based on resource load prediction according to claim 9, wherein the load data in the load data visualization comprises: a CPU line graph showing the proportion of time the CPU spends idle, waiting for I/O, in user mode, and in system mode; memory usage; changes in disk read/write rate; disk read/write latency; changes in disk utilization; and changes in request queue length.
CN202110633424.1A 2021-06-07 2021-06-07 Task scheduling optimization data processing system based on resource load prediction Withdrawn CN113568722A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110633424.1A CN113568722A (en) 2021-06-07 2021-06-07 Task scheduling optimization data processing system based on resource load prediction


Publications (1)

Publication Number Publication Date
CN113568722A true CN113568722A (en) 2021-10-29

Family

ID=78161108

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110633424.1A Withdrawn CN113568722A (en) 2021-06-07 2021-06-07 Task scheduling optimization data processing system based on resource load prediction

Country Status (1)

Country Link
CN (1) CN113568722A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116166443A (en) * 2023-04-23 2023-05-26 欢喜时代(深圳)科技有限公司 Load optimization method and system of game task system
CN116166443B (en) * 2023-04-23 2023-06-23 欢喜时代(深圳)科技有限公司 Load optimization method and system of game task system

Similar Documents

Publication Publication Date Title
CN110166282B (en) Resource allocation method, device, computer equipment and storage medium
US7979399B2 (en) Database journaling in a multi-node environment
US8090710B2 (en) Index maintenance in a multi-node database
Cheng et al. Erms: An elastic replication management system for hdfs
US8627322B2 (en) System and method of active risk management to reduce job de-scheduling probability in computer clusters
CN104965861B (en) A kind of data access monitoring device
US9870269B1 (en) Job allocation in a clustered environment
US8195642B2 (en) Partial indexes for multi-node database
US9235590B1 (en) Selective data compression in a database system
US20070016558A1 (en) Method and apparatus for dynamically associating different query execution strategies with selective portions of a database table
US20080065588A1 (en) Selectively Logging Query Data Based On Cost
CN110888714A (en) Container scheduling method, device and computer-readable storage medium
CN107291539B (en) Cluster program scheduler method based on resource significance level
US20060074875A1 (en) Method and apparatus for predicting relative selectivity of database query conditions using respective cardinalities associated with different subsets of database records
CN109918450A (en) Based on the distributed parallel database and storage method under analysis classes scene
CN116755939B (en) Intelligent data backup task planning method and system based on system resources
CN110807145A (en) Query engine acquisition method, device and computer-readable storage medium
CN111488323B (en) Data processing method and device and electronic equipment
CN111176831B (en) Dynamic thread mapping optimization method and device based on multithreading shared memory communication
US7979400B2 (en) Database journaling in a multi-node environment
CN113568722A (en) Task scheduling optimization data processing system based on resource load prediction
Chai et al. Adaptive lower-level driven compaction to optimize LSM-tree key-value stores
KR20190061247A (en) Real time resource usage ratio monitoring system of big data processing platform
CN109117285B (en) Distributed memory computing cluster system supporting high concurrency
US9305045B1 (en) Data-temperature-based compression in a database system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20211029