CN114741187A - Resource scheduling method, system, electronic device and medium - Google Patents

Resource scheduling method, system, electronic device and medium

Info

Publication number
CN114741187A
Authority
CN
China
Prior art keywords
model group
online
operation information
online model
resources
Prior art date
Legal status
Pending
Application number
CN202210320056.XA
Other languages
Chinese (zh)
Inventor
黄杰
姜婧妍
位凯志
古亮
Current Assignee
Sangfor Technologies Co Ltd
Original Assignee
Sangfor Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Sangfor Technologies Co Ltd filed Critical Sangfor Technologies Co Ltd
Priority to CN202210320056.XA
Publication of CN114741187A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application discloses a resource scheduling method, system, electronic device and medium, and relates to the field of computer technology. Firstly, each online model group in a Flink task is acquired; then the current operation information of each online model group and the historical operation information corresponding to each online model group are obtained; finally, resources are scheduled according to the current operation information and the corresponding historical operation information of each online model group. In this method, the change rule of the load can be acquired from the historical information of each model group, so that resources can be scheduled reasonably according to the change rule of the load and the current operation information. In addition, the application also provides a resource scheduling system, an electronic device and a computer-readable storage medium corresponding to the above resource scheduling method, with the same effects.

Description

Resource scheduling method, system, electronic device and medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method, a system, an electronic device, and a medium for resource scheduling.
Background
During Flink task operation, each model group is usually allocated a fixed number of cores and a fixed amount of memory to process its load input. However, load fluctuation is common: an allocation that is too low causes a large backlog in the model group, so that data cannot be consumed in time, while an allocation that is too high avoids backlog but wastes resources and increases operating cost.
Therefore, how to find a more appropriate resource allocation scheme is an urgent problem to be solved by those skilled in the art.
Disclosure of Invention
The present application provides a method, a system, an electronic device, and a medium for resource scheduling, which are used to allocate resources more reasonably.
In order to solve the above technical problem, the present application provides a method for resource scheduling, including:
acquiring each online model group in the Flink task;
acquiring current operation information of each online model group and historical operation information corresponding to each online model group;
and scheduling resources according to the current operation information of each online model group and the historical operation information corresponding to each online model group.
Preferably, the acquiring each online model group in the Flink task comprises:
under the condition that the abnormal offline or abnormal restarting model groups do not exist, obtaining each online model group;
scanning an abnormal log and acquiring the information of the abnormal log under the condition that an abnormal offline or abnormal restarting model group exists;
determining the memory occupation condition according to the abnormal log information;
if the memory meets the preset requirement, bringing the abnormally offline model groups online and acquiring each online model group;
and if the memory does not meet the preset requirement, increasing the memory and returning to the step of determining the memory occupation condition according to the abnormal log information.
Preferably, the scheduling the resource according to the current operation information of each online model group and the historical operation information corresponding to each online model group includes:
performing binary classification on the historical operation information corresponding to each online model group and the current operation information of each online model group by using a machine learning model, so as to determine an adjustment strategy for each online model group; the adjustment strategy comprises increasing resources or decreasing resources;
fitting the current operation information of each online model group by a method of fitting a plurality of curves to determine the adjustment step length of each online model group;
and scheduling the resources of each online model group according to the adjustment step length and the adjustment strategy.
Preferably, the scheduling the resources of each online model group according to the adjustment step size and the adjustment policy includes:
obtaining an unstable model group in the online model group;
taking the unstable model group offline;
and scheduling the resources of the unstable model group according to the adjustment step length and the adjustment strategy.
Preferably, before the scheduling the resource of the unstable model group according to the adjustment step size and the adjustment policy, the method further includes:
under the condition that the adjustment step length and the adjustment strategy meet the configurable resources of the system, the step of scheduling the resources of the unstable model group according to the adjustment step length and the adjustment strategy is carried out;
and in the case that the adjustment step size and the adjustment strategy do not satisfy the configurable resources of the system, performing global resource allocation according to the priority, the accumulation amount and the input speed of the corresponding model groups.
Preferably, after the obtaining of the current operation information of each online model group and the historical operation information corresponding to each online model group, the method further includes:
judging the delay parameter, the processing speed and the stability of the online model group under the condition that the online model group is stable;
and if the delay parameters, the processing speed and the stability of the online model groups meet corresponding preset requirements, storing the scheduled operation data of each online model group.
Preferably, the preset requirements met by the delay parameter include: the mean value of the data delay processing time of each online model group is smaller than a threshold value;
the preset requirements that the processing speed satisfies include: the average processing speed of each online model group is smaller than the preset times of the maximum processing speed of each online model group within a first preset time;
the preset requirements met by the stability of the online model group include: and the restarting frequency of each online model group is smaller than a preset value in second preset time.
Preferably, the obtaining of the historical operation information corresponding to each online model group includes:
acquiring historical operation information of each model group in the Flink task;
and obtaining historical operation information corresponding to each online model group from the historical operation information of each model group.
In order to solve the above technical problem, the present application further provides a system for resource scheduling, including:
the first acquisition module is used for acquiring each online model group in the Flink task;
the second acquisition module is used for acquiring the current operation information of each online model group and the historical operation information corresponding to each online model group;
and the scheduling module is used for scheduling resources according to the current operation information of each online model group and the historical operation information corresponding to each online model group.
In order to solve the above technical problem, the present application further provides an electronic device, including:
a memory for storing a computer program;
a processor for implementing the steps of the above-mentioned method for resource scheduling when executing the computer program.
In order to solve the above technical problem, the present application further provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the method for resource scheduling described above.
The resource scheduling method provided by the application first acquires each online model group in a Flink task; then obtains the current operation information of each online model group and the historical operation information corresponding to each online model group; and finally schedules resources according to the current operation information and the corresponding historical operation information of each online model group. In the method, the change rule of the load can be acquired from the historical information of each model group, so that resources can be scheduled reasonably according to the change rule of the load and the current operation information.
In addition, the application also provides a system for resource scheduling, an electronic device and a computer readable storage medium, which correspond to the above mentioned method for resource scheduling, and the effects are the same.
Drawings
In order to more clearly illustrate the embodiments of the present application, the drawings needed for the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained by those skilled in the art without inventive effort.
Fig. 1 is a flowchart of a resource scheduling method provided in the present application;
fig. 2 is a block diagram of a system for resource scheduling according to an embodiment of the present application;
FIG. 3 is a block diagram of an electronic device according to another embodiment of the present application;
fig. 4 is a schematic view of an application scenario of the resource scheduling method according to the embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without any creative effort belong to the protection scope of the present application.
The core of the application is to provide a method, a system, an electronic device and a medium for resource scheduling, which are used for realizing reasonable scheduling of resources.
For ease of understanding, the hardware structure used in the technical solution of the present application is described below. The hardware architecture of the resource scheduling method provided by the application mainly comprises a Central Processing Unit (CPU) and memory. The CPU core count is the number of cores of the CPU; the more cores there are, the faster the CPU runs and the better its performance. The memory likewise determines how fast the computer runs as a whole. During resource allocation, the CPU core count and the memory are the basic units of allocation, representing the number of CPU cores a model group may call and the memory it uses.
In order that those skilled in the art will better understand the disclosure, the following detailed description will be given with reference to the accompanying drawings. Fig. 1 is a flowchart of a resource scheduling method provided in the present application. As shown in fig. 1, the method includes:
s10: and acquiring each online model group in the Flink task.
Apache Flink is a distributed open-source computing framework oriented to data stream processing and batch data processing. Based on the same Flink streaming execution model, it supports both streaming and batch application types. A Flink task implemented by a user is composed of two basic building blocks, Stream and Transformation, where a Stream is intermediate result data and a Transformation is an operation that performs computation on one or more input streams and outputs one or more result streams. A Flink task usually includes a plurality of model groups; a model group is the object to which resources are allocated, one model group may include a plurality of workflows, and workflows processing the same topic or the same type of topic are generally grouped into the same model group. When a Flink task is executed, it is mapped to a Workflow. A Workflow is composed of a set of streams and operators and resembles a directed graph that starts at one or more Source operators and ends at one or more Sink operators. The basic units allocated to a model group are the core count and the memory, representing the number of CPU cores the model group can call and the memory it uses; the memory is generally equal to twice the core count minus one, and is increased to three or four times if the model group has too many workflows. Model groups can be classified into online model groups and offline model groups according to their operating state.
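As an illustration only, the default core/memory rule described above could be sketched as follows; the function name, the memory unit (GB) and the workflow threshold are assumptions for the example and are not part of the disclosed method.

```python
def default_allocation(cores: int, workflow_count: int, many_workflows: int = 50) -> dict:
    """Illustrative sketch of the default core/memory rule described above.

    Assumes memory is expressed in GB; 'many_workflows' is a hypothetical
    threshold for deciding when a model group has "too many" workflows.
    """
    if workflow_count > many_workflows:
        # enlarge the memory to three or four times the core count
        memory = 4 * cores
    else:
        # memory is generally twice the core count minus one
        memory = 2 * cores - 1
    return {"cores": cores, "memory_gb": memory}

# Example: a model group with 3 cores and 20 workflows
print(default_allocation(3, 20))   # {'cores': 3, 'memory_gb': 5}
```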
When the online model groups in the Flink task are obtained, all model groups can first be queried for abnormal offline or abnormal restart. If such a condition exists, the running logs of the affected model groups are scanned to determine the cause of the problem and a corresponding solution is applied; the abnormally offline or abnormally restarted model groups are then brought back online, and finally all online model groups in the Flink task are obtained. If no abnormally offline or abnormally restarted model group exists, each online model group in the Flink task is obtained directly.
S11: and acquiring the current operation information of each online model group and the historical operation information corresponding to each online model group.
Currently, during Flink task operation, a fixed core count and memory are generally allocated to each model group to process its load input. However, load fluctuation is ubiquitous and often results in insufficient or excessive resources, that is, resources cannot be allocated reasonably. The present application analyses the change rule of the load through the historical operation information of the online model groups, so that resources can be allocated reasonably. On the basis of obtaining each online model group in the Flink task, the operation information of each online model group is collected and processed, and the historical operation information of each online model group is thereby obtained. The historical operation information (also referred to as meta-information) of each online model group includes: input rate, accumulation amount, processing rate, processing amount, current offset, final offset, operator type, operator count, workflow type, workflow count, and the like.
When obtaining the historical operation information of each online model group, the indexes of Flink, Kafka and MySQL are first matched and screened. Kafka is a distributed message queue: the sender is called the Producer, the receiver is called the Consumer, and a Kafka cluster consists of multiple Kafka instances, each instance (server) being called a Broker. Both the Kafka cluster and the Consumer rely on a ZooKeeper cluster to maintain some meta-information and ensure system availability. MySQL is one of the most popular relational database management systems (RDBMS) and is widely used in WEB applications. During matching, the indexes of Flink, Kafka and MySQL are associated according to the ID of the task; during screening, collection can be performed according to the importance of the model group. The indexes of Flink, Kafka and MySQL are then collected, at either a fixed or a non-fixed frequency; the collection frequency is not limited here. In implementation, in order to accurately obtain the change rule of the load, the indexes are collected at a fixed frequency, and the specific value of that frequency is not limited. It should be noted that the collected indexes are relatively simple ones that do not need to be obtained through calculation, such as the current offset and the final offset. After the indexes are collected, they are written into the same file, whose format is not limited; for example, it may be a txt file. The file is read into a feature format that can be used directly, such as the pandas DataFrame format, and new indexes such as input rate, processing rate and accumulation rate are constructed from the collected ones, for example processing rate = (final offset − current offset) / acquisition interval. Finally, the historical operation information of each online model group is output. It should be noted that the manner of obtaining the historical information of each online model group is not limited here: the historical operation information of each online model group may be obtained directly, or, after the historical information of all model groups is obtained, the historical operation information of the online model groups may be queried from it.
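A minimal sketch of the collection-and-processing step just described, assuming the raw offsets have already been pulled from Flink, Kafka and MySQL and written to a file with one row per model group and timestamp; the file layout, column names and helper function are illustrative assumptions, not the patented implementation.

```python
import pandas as pd

def build_operation_info(metrics_file: str, interval_seconds: float) -> pd.DataFrame:
    """Read the collected raw indexes and derive rate-type indexes.

    Assumes 'metrics_file' is a CSV with 'model_group', 'timestamp',
    'current_offset' and 'final_offset' columns (an assumed layout).
    """
    df = pd.read_csv(metrics_file)
    df = df.sort_values(["model_group", "timestamp"])
    # processing rate per the formula given above:
    # (final offset - current offset) / acquisition interval
    df["processing_rate"] = (df["final_offset"] - df["current_offset"]) / interval_seconds
    # accumulation amount as the gap between final and current offset (assumed definition)
    df["accumulation"] = df["final_offset"] - df["current_offset"]
    # input rate approximated from the growth of the final offset between samples (assumed)
    df["input_rate"] = df.groupby("model_group")["final_offset"].diff() / interval_seconds
    return df
```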
The current operation information of each online model group is obtained in the same way, and likewise includes the input rate, accumulation amount, processing rate, processing amount, current offset, final offset, operator type, operator count, workflow type, workflow count, and the like.
S12: and scheduling the resources according to the current operation information of each online model group and the historical operation information corresponding to each online model group.
With the current operation information and the corresponding historical operation information of each online model group obtained in the above steps, a machine learning model is then used to perform binary classification and determine the adjustment strategy of each model group, i.e. whether to increase or decrease resources, and the adjustment step size of each model group, i.e. how much to increase or decrease, is determined by fitting a plurality of curves, thereby achieving the scheduling of resources. It should be noted that after the adjustment strategy and the adjustment step size are determined, it must be judged whether they satisfy the configurable resources of the system; if not, global resource allocation is performed to ensure the stability of the key service streams, and if the adjustment strategy and step size still do not satisfy the configurable resources of the system after global allocation, an Application Programming Interface (API) of the platform is called.
The method for scheduling resources provided by this embodiment first obtains each online model group in the Flink task; then obtaining current operation information of each online model group and historical operation information corresponding to each online model group; and finally, scheduling the resources according to the current operation information of each online model group and the historical operation information corresponding to each online model group. In the method, the change rule of the load can be acquired through the historical information of each model group, so that the resources can be reasonably scheduled according to the change rule of the load and the current operation information.
In practice, there will typically be a set of models that are abnormally offline or abnormally restarted. When resource adjustment is performed, if an abnormally offline model group exists, resources allocated to the model group before may not be used, and when the number of the abnormally offline model groups is large, a large amount of waste of resources may be caused; if the abnormally restarted model group exists, resources are not allocated to the model group before, so that a large amount of accumulation of the model group occurs, and each piece of data cannot be consumed in time. Therefore, in the implementation, it is first determined whether there is an abnormal offline or abnormally restarted model group, and if so, the abnormal model group is brought online again. Specifically, the obtaining of each online model group in the Flink task includes:
under the condition that the abnormal offline or abnormal restarting model groups do not exist, obtaining each online model group;
scanning an abnormal log and acquiring abnormal log information under the condition that an abnormal offline or abnormal restarting model group exists;
determining the memory occupation condition according to the abnormal log information;
if the memory meets the preset requirement, bringing the abnormally offline model groups online and acquiring each online model group;
and if the memory does not meet the preset requirement, increasing the memory and returning to the step of determining the memory occupation condition according to the abnormal log information.
Each model group has a corresponding state database, which is queried to determine whether an abnormally offline or abnormally restarted model group exists. If not, all online model groups, i.e. all running model groups, are queried directly. If such a model group exists, the abnormal log is scanned and the memory occupation is determined from the abnormal log information; if the memory meets the preset requirement, the abnormally offline model groups are brought online and all online model groups are obtained; if the memory does not meet the preset requirement, the memory is increased first and the step of determining the memory occupation from the abnormal log information is repeated. It should be noted that the memory meeting the preset requirement means that the memory size can satisfy the memory required for each model group to run. When a specific character string appears in the log information, it indicates that the abnormal offline or abnormal restart was caused by insufficient memory. For an abnormality caused by insufficient memory, workmemory is added, the abnormal model groups are then brought online, and finally all online model groups are queried. After workmemory is added, whether the memory occupation is satisfied can again be judged from the log information; if not, workmemory can be added several times until the memory is sufficient. For abnormalities not caused by insufficient memory, all online model groups are queried directly. In implementation, if bringing a model group online fails, the attempt can be repeated; the number of repetitions is not limited here. If, for example, it is set to three, and the model group still fails to come online after three attempts, no further attempt is made to bring it online.
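A schematic sketch of the exception-handling loop described above. The state-database queries, log-scanning helper, marker string, retry limit and memory-bump cap are all assumptions made for illustration.

```python
def recover_offline_groups(state_db, scheduler, max_online_attempts=3, max_memory_bumps=5):
    """Illustrative sketch: bring abnormally offline/restarted model groups back online."""
    abnormal = state_db.query_abnormal_groups()          # hypothetical state-database query
    if not abnormal:
        return state_db.query_online_groups()
    for group in abnormal:
        log_text = scheduler.read_exception_log(group)   # hypothetical log-scanning helper
        if "OutOfMemoryError" in log_text:               # assumed marker for insufficient memory
            bumps = 0
            # add workmemory until the memory occupation is satisfied (bounded here for safety)
            while not scheduler.memory_satisfied(group) and bumps < max_memory_bumps:
                scheduler.increase_workmemory(group)
                bumps += 1
        for _attempt in range(max_online_attempts):      # repeat the online attempt, e.g. three times
            if scheduler.bring_online(group):
                break
        else:
            scheduler.alarm(group)                       # give up and alarm after repeated failures
    return state_db.query_online_groups()
```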
In this embodiment, when obtaining each online model group in the Flink task, abnormally offline or abnormally restarted model groups are taken into account and brought back online before the online model groups are obtained, which ensures as far as possible that resources are allocated and scheduled while the model groups are stable, making resource scheduling more reasonable.
In implementation, resources can be scheduled more accurately according to the current operation information of the online model groups and the historical operation information of each model group. As a preferred embodiment, the scheduling the resource according to the current operation information of each online model group and the historical operation information corresponding to each online model group includes:
performing binary classification on the historical operation information corresponding to each online model group and the current operation information of each online model group by using a machine learning model, so as to determine the adjustment strategy of each online model group; the adjustment strategy comprises increasing resources or decreasing resources;
fitting the current operation information of each online model group by a method of fitting a plurality of curves to determine the adjustment step length of each online model group;
and scheduling the resources of each online model group according to the adjustment step length and the adjustment strategy.
When the machine learning model is used to perform the binary classification, the procedure is specifically as follows:
(1) Extract the corresponding features from the metadata obtained in the above steps. Because the metadata is collected at regular time intervals, it is equivalent to a time series, and the corresponding features are extracted from the whole series with a sliding window; the features include the minimum, maximum, mean and variance of the input rate, processing rate, accumulation rate and the like.
(2) Label the extracted feature data. The label marks whether resources should be increased or decreased.
(3) Perform some feature engineering. Since the collected maxima, minima and the like are too simple for the model, these simple features are combined through feature engineering to construct higher-order features, such as cross features and discretised features.
(4) Calculate the importance of each feature with the LightGBM algorithm to complete feature screening.
(5) Train the model with the same algorithm again, and feed the previously collected data into the model to judge whether the resources of the model group should be increased or decreased.
It should be noted that this embodiment uses the LightGBM algorithm to train the model; in implementation other algorithms may also be used, and the specific algorithm used is not limited here.
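A compact sketch of steps (1)–(5), assuming the meta-information has already been assembled into a pandas DataFrame with the rate columns named below; the window length, column names, label encoding and LightGBM parameters are illustrative assumptions, and the cross-feature/discretisation step (3) is omitted for brevity.

```python
import lightgbm as lgb
import pandas as pd

def make_window_features(ts: pd.DataFrame, window: int = 12) -> pd.DataFrame:
    """Step (1): slide a window over the time series and extract min/max/mean/variance
    of the input rate, processing rate and accumulation rate."""
    feats = {}
    for col in ("input_rate", "processing_rate", "accumulation_rate"):
        roll = ts[col].rolling(window)
        feats[f"{col}_min"] = roll.min()
        feats[f"{col}_max"] = roll.max()
        feats[f"{col}_mean"] = roll.mean()
        feats[f"{col}_var"] = roll.var()
    return pd.DataFrame(feats).dropna()

def train_adjustment_classifier(features: pd.DataFrame, labels: pd.Series) -> lgb.LGBMClassifier:
    """Steps (2)-(5): 'labels' marks each window as increase (1) or decrease (0) resources."""
    model = lgb.LGBMClassifier(n_estimators=200)
    model.fit(features, labels)
    # step (4): use feature importances to screen features before retraining (step 5)
    importances = dict(zip(features.columns, model.feature_importances_))
    keep = [name for name, score in importances.items() if score > 0] or list(features.columns)
    return model.fit(features[keep], labels)
```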
The specific steps of determining the step size of the adjustment by a method of fitting a plurality of curves are as follows:
(1) fitting a plurality of curves according to the metadata;
(2) selecting a curve with the highest correlation coefficient;
(3) and determining the number of the distributed kernels according to the current input speed and the fitted curve with the highest correlation coefficient.
After the metadata is collected and stored, the processing speed under various configurations is known, and the number of cores to allocate is determined from the current input speed and a curve fitted to all the metadata. Suppose the processing speed of a model group is less than its input speed; the model group then requires more resources. Increasing resources means increasing the processing speed, so the core count must be increased. For example, if the historical operation information shows that the model group's processing speed is 1000 with 2 cores and 2000 with 3 cores, the processing speeds for other core counts can be inferred from the curve with the highest correlation coefficient fitted over the different core counts, and an appropriate core count can then be allocated according to that curve and the current input speed. Assuming the processing speed with 4 cores is calculated to be 3000 from the 2-core and 3-core data, then if the current input speed is 2900, allocating 4 cores to the model group satisfies the condition that the processing speed is greater than the input speed, and the backlog is reduced.
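A sketch of the curve-fitting idea in the example above, assuming the history provides at least two (core count, processing speed) pairs per model group; the candidate polynomial degrees and the use of the Pearson correlation coefficient are assumptions, not the claimed implementation.

```python
import numpy as np

def choose_core_count(cores_hist, speed_hist, current_input_speed, max_cores=32):
    """Fit several curves of processing speed vs. core count, keep the one with the
    highest correlation coefficient, and return the smallest core count whose
    predicted processing speed exceeds the current input speed."""
    best_fit, best_corr = None, -np.inf
    max_degree = min(3, len(cores_hist) - 1)           # degree bounded by the sample count
    for degree in range(1, max_degree + 1):            # several candidate curves
        coeffs = np.polyfit(cores_hist, speed_hist, degree)
        predicted = np.polyval(coeffs, cores_hist)
        corr = np.corrcoef(predicted, speed_hist)[0, 1]
        if corr > best_corr:
            best_fit, best_corr = coeffs, corr
    for cores in range(1, max_cores + 1):
        if np.polyval(best_fit, cores) > current_input_speed:
            return cores
    return max_cores

# With the example figures above (2 cores -> 1000, 3 cores -> 2000) and an input
# speed of 2900, the fitted curve suggests 4 cores.
print(choose_core_count([2, 3], [1000, 2000], 2900))   # 4
```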
The resource of the model group is adjusted by using the machine learning model and the method of fitting the multiple curves, and the change rule of the load is obtained according to the historical operation data of the model group, so that the resource can be reasonably scheduled according to the change rule of the load and the current operation information.
In order to increase the speed of resource adjustment for all model groups in implementation, it is preferable to schedule resources only for unstable model groups. Therefore, scheduling the resources of each online model group according to the adjustment step size and the adjustment strategy includes:
obtaining an unstable model group in the online model group;
taking the unstable model groups offline;
and scheduling the resource of the unstable model group according to the adjustment step length and the adjustment strategy.
In practice, whether a model group is stable can be judged from the following aspects:
(1) According to the result of historical allocation: if resource adjustment has been performed on the model group many times in historical allocations, the model group is unstable; otherwise it is stable.
(2) According to the maximum processing speed: if the current maximum processing speed is lower than the input speed, accumulation will occur, so resource adjustment is not performed on the model group and it is regarded as stable.
(3) Whether a large backlog is being processed: a model group is also considered stable while it is processing a large backlog.
(4) Whether the processing speed is slightly greater than the input rate: when the processing speed is slightly higher than the input speed, there is no accumulation and the model group is determined to be stable; otherwise it is considered unstable.
(5) Whether the input rate is approximately equal to the maximum processing speed: when the input speed and the maximum processing speed are approximately equal, the model group is likewise not accumulating and is stable; otherwise it is considered unstable.
And judging whether the model group is stable or not in the steps. In implementation, in order to reduce the occupation of the CPU, the resource scheduling is performed only on the unstable model group. If the resources need to be increased for the model group A and the resources need to be reduced for the model group B and the model group C according to the adjustment strategy and the adjustment step length, the assumption is that the model group A is an unstable model group, but the model group B and the model group C are stable model groups, the resources need to be increased only for the model group A, and the resources do not decrease for the model group B and the model group C.
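A sketch of a stability check that combines the aspects listed above. The thresholds, tolerances and attribute names on the model-group object are assumptions for illustration; the decision logic follows the aspects as stated.

```python
def is_stable(group) -> bool:
    """Illustrative stability check for a model group (all thresholds are assumed)."""
    if group.historical_adjustments > 3:                 # (1) repeatedly adjusted before -> unstable
        return False
    if group.max_processing_speed < group.input_rate:    # (2) as stated above, treated as stable
        return True
    if group.accumulation > 10_000:                      # (3) still draining a large backlog
        return True
    # (4) processing speed slightly above the input speed
    slightly_faster = group.input_rate < group.processing_speed <= 1.2 * group.input_rate
    # (5) input speed approximately equal to the maximum processing speed
    near_ceiling = abs(group.input_rate - group.max_processing_speed) <= 0.05 * group.max_processing_speed
    return slightly_faster or near_ceiling
```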
Compared with the method for adjusting the resources of both the stable model group and the unstable model group, the method for adjusting the resources of only the unstable model group provided by the embodiment can more quickly adjust the resources of all the model groups and reduce the occupation of the CPU.
In implementation, it may happen that the adjustment strategy and the adjustment step size cannot satisfy the configurable resources of the system, thereby resulting in ineffective adjustment of the model group resources. Therefore, a preferred embodiment is that, before scheduling the resource of the unstable model group according to the adjustment step size and the adjustment policy, the method further includes:
under the condition that the adjustment step length and the adjustment strategy meet the configurable resources of the system, the step of scheduling the resources of the unstable model group according to the adjustment step length and the adjustment strategy is carried out;
and in the case that the adjustment step size and the adjustment strategy do not satisfy the configurable resources of the system, performing global resource allocation according to the priority, the accumulation amount and the input speed of the corresponding model groups.
Before the resources of the unstable model groups are scheduled according to the adjustment step size and the adjustment strategy, it is judged whether the adjustment strategy and step size satisfy the configurable resources of the system. If they do not exceed the configurable resources of the system, the resources of the unstable model groups are scheduled directly according to the adjustment step size and strategy; if they exceed the configurable resources of the system, global resource allocation is performed. During global resource allocation, resources may be allocated according to the priority, the accumulation amount, the input speed and the like of the model groups. The priority of a model group is set by the user, and a model group with high priority can be regarded as one that must keep running. When resources are insufficient and the accumulation amount is high, resources need to be added to the model group, but doing so would exceed the configurable resources of the system; in that case resources can be allocated according to the priority of the model groups. In implementation, global resource allocation may be performed according to any one or two of the three quantities, i.e. priority, accumulation amount and input speed, or according to all three together. In order to improve the accuracy of global resource allocation, it is preferable to allocate according to all three quantities together. If the configurable resources of the system are still not satisfied after global resource allocation, the alarm API of the platform is called.
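One possible reading of the global allocation step, ranking model groups jointly by priority, accumulation amount and input speed when the requested adjustment exceeds the configurable resources; the ranking key, attribute names and the alarm callback are assumptions, not the claimed scheme.

```python
def allocate_globally(groups, total_cores, alarm_api):
    """Illustrative global allocation by priority, accumulation and input speed."""
    # higher priority, larger backlog and faster input are served first (ordering assumed)
    ranked = sorted(
        groups,
        key=lambda g: (g.priority, g.accumulation, g.input_rate),
        reverse=True,
    )
    remaining = total_cores
    plan = {}
    for g in ranked:
        cores = min(g.requested_cores, remaining)
        plan[g.name] = cores
        remaining -= cores
    if any(cores == 0 for cores in plan.values()):
        # configurable resources still insufficient after global allocation
        alarm_api("global allocation could not satisfy all model groups")
    return plan
```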
In this embodiment, before the resources of the unstable model groups are scheduled according to the adjustment step size and the adjustment strategy, it is judged whether the adjustment strategy and step size satisfy the configurable resources of the system, which effectively prevents the adjustment of the model group resources from being ineffective, and the global resource allocation scheme ensures the stability of the key service streams.
In the above embodiment, the resource is scheduled for the unstable model group, and the resource is not scheduled for the stable model group, but the resource may be unreasonably allocated in the stable model group, so that the operating state of the model group needs to be further determined. In implementation, after obtaining the current operation information of each online model group and the historical operation information corresponding to each online model group, the method further includes:
judging the delay parameters, the processing speed and the stability of the online model group under the condition that the online model group is stable;
and if the delay parameters, the processing speed and the stability of the online model groups meet the corresponding preset requirements, storing the scheduled operation data of each online model group.
The judgment of whether an online model group is stable has been described in the above embodiments and is not repeated here. The stable model groups considered here include both those judged stable before the unstable model groups were scheduled and those found stable once all model groups have been re-checked after the unstable model groups were scheduled. When an online model group is stable, its delay parameter, processing speed and stability are judged. These three indexes describe the operating state of the model group: the delay parameter reflects the timeliness of the model group's data processing, the processing speed reflects the rationality of resource allocation, and the stability of the model group means that it is not started and stopped frequently, which would otherwise cause some data loss and an unstable processing speed. If the delay parameter, the processing speed and the stability of the online model groups meet the corresponding preset requirements, the scheduled operation data of each online model group is stored. The preset requirements for each index are not limited and are set reasonably according to the actual situation.
After obtaining the current operation information of each online model group and the historical operation information corresponding to each online model group, the method further judges the delay parameter, the processing speed and the stability of the online model group under the condition that the online model group is stable; and if the delay parameters, the processing speed and the stability of the online model groups meet the corresponding preset requirements, storing the scheduled operation data of each online model group. By the method provided by the embodiment, whether the resources of the model group are reasonable can be further judged according to the index capable of reflecting the running state of the model group, and the running data is stored as historical running data, so that the rule of load change is more comprehensively known, and the resources can be reasonably scheduled according to the change of the load.
In the above embodiment, the preset conditions that the delay parameter, the processing speed, and the stability of the online model set satisfy are not limited, and in implementation, the preset requirements that the delay parameter satisfies preferably include: the mean value of the data delay processing time of each online model group is smaller than a threshold value;
the preset requirements met by the processing speed include: the average processing speed of each online model group is less than the preset times of the maximum processing speed of each online model group within the first preset time;
the preset requirement met by the stability of the online model group includes: the number of restarts of each online model group within a second preset time is less than a preset value.
In implementation, for the delay parameter it is determined whether the mean data delay processing time of all model groups is within N hours (N is adjustable), calculated according to equation (1): in a scenario where data flows in continuously, whether the currently accumulated data can be consumed within N hours. The consumption time of the accumulation of each model group is given by equation (1):
t_i = l_i / (v̄_i^proc − v̄_i^in)  (1)
In the above formula (1), t_i denotes the consumption time of the accumulation of the i-th model group, l_i represents the mean accumulation amount of the i-th model group over the time Δt, v̄_i^proc represents the mean processing speed of the i-th model group over Δt, and v̄_i^in represents the mean load input speed of the i-th model group over Δt. The specific value of Δt is not limited.
The requirement for the processing speed is that the average processing speed of a model group over the Δt time in operation must not be higher than a preset multiple of the maximum processing speed of the model group (in terms of the currently configured upper limit of processing capacity). The specific values of the first preset time and the preset multiple are not limited and are chosen according to the actual situation.
The requirement for the stability of the model groups is that the number of restarts of each online model group within the second preset time is less than a preset value. The second preset time and the preset value are not limited and are determined according to the actual situation. For example, if a model group restarts no more than twice within three days it is considered stable, and if it restarts more than twice within three days it is considered unstable. It should be noted that this criterion for the stability of the operating state is different from the criterion mentioned above for judging whether a model group is stable.
It should be noted that, in implementation, the operation state of the model set may be determined according to any one or two of the three indexes, i.e., the delay parameter, the processing speed and the stability of the online model set, or according to the three indexes together. In order to more accurately acquire the operating state of the model group, it is preferable that the operating state of the model group is determined according to the three indexes in common.
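A sketch that combines the three running-state checks described above; N, the speed multiple, the restart limit and the attribute names are all configurable assumptions, with the delay check following equation (1).

```python
def running_state_ok(group, n_hours=2.0, speed_multiple=0.8, max_restarts=2):
    """Check delay, processing speed and stability of a stable online model group."""
    # delay: time to consume the current backlog, per equation (1), converted to hours
    drain_time_hours = group.mean_accumulation / max(
        group.mean_processing_speed - group.mean_input_speed, 1e-9
    ) / 3600.0
    delay_ok = drain_time_hours < n_hours
    # processing speed: average speed below a preset multiple of the maximum speed
    speed_ok = group.mean_processing_speed < speed_multiple * group.max_processing_speed
    # stability: few restarts within the second preset time window
    stability_ok = group.restarts_in_window <= max_restarts
    return delay_ok and speed_ok and stability_ok
```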
The method provided by the embodiment further judges whether the resources of the model group are reasonable according to the delay parameter, the processing speed and the stability of the online model group, so that the model group with unreasonable resource allocation can be scheduled in time.
In order to quickly acquire the historical operation information corresponding to the online model groups and speed up resource scheduling, in implementation the acquiring of the historical operation information corresponding to each online model group includes:
acquiring historical operation information of each model group in the Flink task;
and acquiring historical operating information corresponding to each online model group from the historical operating information of each model group.
The historical operation information of each model group in the Flink task is stored in a database in advance, and when the historical operation information of the online model groups is needed it can be obtained directly from that database. Since the historical operation data is collected at regular time intervals, if the historical operation information corresponding to each online model group were not stored in advance, reasonable resources could not be allocated to a model group in time according to the current input speed. Moreover, the set of online model groups may grow, and if the stored history covered only the previously online model groups, a newly added model group could not be analysed for resource scheduling in time. The historical operation information of every model group in the Flink task is therefore acquired in advance, so that even when new model groups come online, their corresponding historical operation information can be found in the historical operation information of all model groups.
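A minimal sketch of the lookup described above, assuming the full history of all model groups is kept in a relational table; the database engine, table name and column names are illustrative assumptions.

```python
import sqlite3

def history_for_online_groups(db_path, online_group_ids):
    """Fetch pre-collected historical operation information for the online model groups."""
    conn = sqlite3.connect(db_path)
    placeholders = ",".join("?" for _ in online_group_ids)
    rows = conn.execute(
        f"SELECT * FROM model_group_history WHERE group_id IN ({placeholders})",
        list(online_group_ids),
    ).fetchall()
    conn.close()
    return rows
```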
In this embodiment, the historical operation information of each model group in the Flink task is acquired, and the historical operation information corresponding to each online model group is obtained from it. Because the history of each online model group is contained in the pre-acquired history of all model groups, the historical operation information corresponding to each online model group can be obtained quickly, which shortens the resource scheduling time.
In the foregoing embodiment, a method for resource scheduling is described in detail, and the present application also provides embodiments corresponding to a system and an electronic device for resource scheduling. It should be noted that the present application describes the embodiments of the apparatus portion from two perspectives, one from the perspective of the function module and the other from the perspective of the hardware.
Fig. 2 is a block diagram of a system for resource scheduling according to an embodiment of the present application. The present embodiment is based on the angle of the function module, including:
a first obtaining module 10, configured to obtain each online model group in the Flink task;
a second obtaining module 11, configured to obtain current operation information of each online model group and historical operation information corresponding to each online model group;
and the scheduling module 12 is configured to schedule the resource according to the current operation information of each online model group and the historical operation information corresponding to each online model group.
Since the embodiments of the apparatus portion and the method portion correspond to each other, please refer to the description of the embodiments of the method portion for the embodiments of the apparatus portion, which is not repeated here.
In the system for scheduling resources provided in this embodiment, first, each online model group in the Flink task is obtained through the first obtaining module, then, the current operation information of each online model group and the historical operation information corresponding to each online model group are obtained through the second obtaining module, and finally, the resources are scheduled through the scheduling module according to the current operation information of each online model group and the historical operation information corresponding to each online model group. In the system, the change rule of the load can be acquired through the historical information of each model group, so that the reasonable scheduling of resources can be realized according to the change rule of the load and the current operation information.
Fig. 3 is a block diagram of an electronic device according to another embodiment of the present application. This embodiment is based on a hardware perspective, and as shown in fig. 3, the electronic device includes:
a memory 20 for storing a computer program;
a processor 21 for implementing the steps of the method of resource scheduling as mentioned in the above embodiments when executing the computer program.
The electronic device provided by the embodiment may include, but is not limited to, a smart phone, a tablet computer, a notebook computer, or a desktop computer.
The processor 21 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like. The Processor 21 may be implemented in hardware using at least one of a Digital Signal Processor (DSP), a Field-Programmable Gate Array (FPGA), and a Programmable Logic Array (PLA). The processor 21 may also include a main processor and a coprocessor, where the main processor is a processor, also called a CPU, for processing data in an awake state; a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 21 may be integrated with a Graphics Processing Unit (GPU) which is responsible for rendering and drawing the content required to be displayed by the display screen. In some embodiments, the processor 21 may further include an Artificial Intelligence (AI) processor for processing computational operations related to machine learning.
The memory 20 may include one or more computer-readable storage media, which may be non-transitory. Memory 20 may also include high speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In this embodiment, the memory 20 is at least used for storing the following computer program 201, wherein after being loaded and executed by the processor 21, the computer program can implement the relevant steps of the method for resource scheduling disclosed in any of the foregoing embodiments. In addition, the resources stored in the memory 20 may also include an operating system 202, data 203, and the like, and the storage manner may be a transient storage manner or a permanent storage manner. Operating system 202 may include, among others, Windows, Unix, Linux, and the like. The data 203 may include, but is not limited to, data related to the above-mentioned method of resource scheduling, and the like.
In some embodiments, the electronic device may further include a display 22, an input/output interface 23, a communication interface 24, a power supply 25, and a communication bus 26.
Those skilled in the art will appreciate that the configuration shown in fig. 3 is not intended to be limiting of electronic devices and may include more or fewer components than those shown.
The electronic device provided by the embodiment of the application comprises a memory and a processor; when the processor executes the program stored in the memory, it can implement the resource scheduling method described above, with the same effects.
Finally, the application also provides a corresponding embodiment of the computer readable storage medium. The computer-readable storage medium has stored thereon a computer program which, when being executed by a processor, carries out the steps as set forth in the above-mentioned method embodiments.
It is to be understood that if the method in the above embodiments is implemented in the form of software functional units and sold or used as a stand-alone product, it can be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium and executes all or part of the steps of the methods described in the embodiments of the present application, or all or part of the technical solutions. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The computer-readable storage medium provided by the present application includes the above-mentioned method for resource scheduling, and the effects are the same as above.
In order to enable those skilled in the art to better understand the technical solution of the present application, the foregoing present application is further described in detail with reference to fig. 4, and fig. 4 is a schematic application scenario diagram of a resource scheduling method provided in an embodiment of the present application. The process comprises the following steps:
s13: automatic index collection and processing;
s14: model group operation data;
s15: inquiring whether an abnormal offline or abnormal restarting model group exists; if yes, the process proceeds to step S16, otherwise, the process proceeds to step S20;
s16: scanning an abnormal log;
s17: judging whether the memory is insufficient, if so, entering step S18, otherwise, entering step S19;
s18: adding workmemory;
s19: bringing the model group online;
s20: querying all online model groups;
s21: reading historical operating data of the model group;
s22: a model of automated resource allocation;
s23: outputting the distribution scheme;
s24: judging whether the model group is stable; if not, go to step S25-step S28, and then go to step S30; if yes, go to step S29;
s25: a set of offline models;
s26: adjusting the allocated resources;
s27: an online model group;
s28: if the three times of line-up are not successful, alarming;
s29: calculating whether each index meets the requirement;
s30: and storing historical operation data.
In the resource scheduling method provided by this embodiment, historical operation information of the online model groups is acquired and processed through automatic index collection, and an automatic resource allocation model is then established to output an allocation scheme. If a model group is unstable, it is first taken offline, its allocated resources are adjusted, and it is then brought back online; if bringing it online fails, the attempt is repeated up to three times, an alarm is raised if it is still unsuccessful, and the historical operation data is stored. If the model group is stable, whether each index meets the requirements is calculated, where the indexes are the delay parameter, the processing speed, and the stability of the online model group mentioned in the foregoing embodiments; if the requirements are met, the historical operation data is stored. In this way, the method can obtain the load change pattern from the historical information of each model group, so that resources can be scheduled reasonably according to the load change pattern and the current operation information.
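The S13 to S30 flow can be read as a single monitoring-and-adjustment cycle. The Python sketch below is only an illustration of that cycle; the object methods (collect_metrics, scan_exception_log, bring_online, and so on) and the allocate_resources and raise_alarm callables are hypothetical names standing in for the components described above, not an interface defined by this application.

```python
# Illustrative only: every method and helper below is a hypothetical placeholder
# for the components described in steps S13 to S30.
MAX_ONLINE_RETRIES = 3  # S28: raise an alarm after three failed online attempts


def scheduling_cycle(model_groups, history_store, allocate_resources, raise_alarm):
    metrics = {g.name: g.collect_metrics() for g in model_groups}            # S13 / S14

    for group in model_groups:
        if group.is_abnormally_offline() or group.was_abnormally_restarted():  # S15
            log = group.scan_exception_log()                                  # S16
            if log.indicates_insufficient_memory():                           # S17
                group.increase_working_memory()                               # S18
            group.bring_online()                                              # S19

    online = [g for g in model_groups if g.is_online()]                       # S20
    history = {g.name: history_store.load(g.name) for g in online}            # S21
    plan = allocate_resources(online, metrics, history)                       # S22 / S23

    for group in online:
        if group.is_stable():                                                 # S24
            group.check_index_requirements()                                  # S29
            continue
        group.take_offline()                                                  # S25
        group.apply_allocation(plan[group.name])                              # S26
        for _ in range(MAX_ONLINE_RETRIES):                                   # S27
            if group.bring_online():
                break
        else:
            raise_alarm(group)                                                # S28

    history_store.save(metrics)                                               # S30
```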
The method, system, electronic device, and medium for resource scheduling provided by the present application are described in detail above. The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts among the embodiments may be referred to one another. Since the device disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is relatively brief, and the relevant points can be found in the description of the method. It should be noted that those skilled in the art can make several improvements and modifications to the present application without departing from the principles of the present application, and such improvements and modifications also fall within the scope of the claims of the present application.
It is further noted that, in the present specification, relational terms such as first and second are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual relationship or order between such entities or actions. Moreover, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

Claims (11)

1. A method for resource scheduling, comprising:
acquiring each online model group in the Flink task;
acquiring current operation information of each online model group and historical operation information corresponding to each online model group;
and scheduling resources according to the current operation information of each online model group and the historical operation information corresponding to each online model group.
2. The method according to claim 1, wherein the obtaining of each online model group in the Flink task comprises:
acquiring each online model group when there is no abnormally offline or abnormally restarted model group;
when there is an abnormally offline or abnormally restarted model group, scanning an exception log and acquiring exception log information;
determining a memory occupation condition according to the exception log information;
if the memory meets a preset requirement, bringing the abnormally offline model group online and acquiring each online model group;
and if the memory does not meet the preset requirement, increasing the memory and returning to the step of determining the memory occupation condition according to the exception log information.
3. The method according to claim 2, wherein the scheduling of resources according to the current operation information of each online model group and the historical operation information corresponding to each online model group comprises:
performing binary classification on the historical operation information corresponding to each online model group and the current operation information of each online model group by using a machine learning model, so as to determine an adjustment strategy for each online model group, wherein the adjustment strategy comprises increasing resources or decreasing resources;
fitting the current operation information of each online model group by multi-curve fitting to determine an adjustment step size for each online model group;
and scheduling the resources of each online model group according to the adjustment step size and the adjustment strategy.
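As an illustration of the two stages described in claim 3, the sketch below assumes numpy and scikit-learn are available, uses logistic regression as the binary classifier, and derives the adjustment step size from the slope of a fitted polynomial; the feature layout, the classifier choice, and the step formula are assumptions made for the example, not details fixed by the claim.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def decide_adjustment(history_features, history_labels, current_features):
    """Binary classification: label 1 -> increase resources, 0 -> decrease resources.
    Each row is an (assumed) metric vector for one online model group."""
    clf = LogisticRegression().fit(history_features, history_labels)
    return clf.predict(current_features)  # one adjustment label per online model group


def adjustment_step(timestamps, backlog, degree=2):
    """Fit a polynomial to the group's recent backlog and use the slope at the
    latest sample, clamped to a small range, as the adjustment step size."""
    coeffs = np.polyfit(timestamps, backlog, degree)
    slope = np.polyval(np.polyder(coeffs), timestamps[-1])
    return int(np.clip(round(abs(slope)), 1, 8))  # step in parallelism units (assumed)
```

For example, a backlog series that grows steadily over the sampling window produces a large slope and therefore a larger step size, while a flat series yields the minimum step of one unit.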
4. The method of claim 3, wherein the scheduling the resources of each online model group according to the adjustment step size and the adjustment strategy comprises:
acquiring an unstable model group among the online model groups;
taking the unstable model group offline;
and scheduling the resources of the unstable model group according to the adjustment step size and the adjustment strategy.
5. The method of claim 4, further comprising, before the scheduling the resources of the unstable model group according to the adjustment step size and the adjustment strategy:
if the adjustment step size and the adjustment strategy can be satisfied by the configurable resources of the system, proceeding to the step of scheduling the resources of the unstable model group according to the adjustment step size and the adjustment strategy;
and if the adjustment step size and the adjustment strategy cannot be satisfied by the configurable resources of the system, performing global resource allocation according to the priority, the backlog amount, and the input speed of the corresponding model groups.
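When the requested adjustment cannot be satisfied from the system's configurable resources, claim 5 falls back to a global allocation weighted by priority, backlog, and input speed. The sketch below shows one way such a weighting could work; the score formula and the proportional split are assumptions made for illustration, not a scheme fixed by the claim.

```python
def global_allocation(groups, total_resources):
    """Split total_resources across model groups in proportion to a score built
    from priority, accumulated backlog, and input speed (all assumed normalized)."""
    scores = {
        g["name"]: g["priority"] * (1 + g["backlog"]) * (1 + g["input_speed"])
        for g in groups
    }
    total_score = sum(scores.values()) or 1
    return {name: total_resources * s / total_score for name, s in scores.items()}


# Example: two hypothetical model groups competing for 16 task slots.
plan = global_allocation(
    [
        {"name": "ids_model", "priority": 3, "backlog": 0.8, "input_speed": 0.5},
        {"name": "dns_model", "priority": 1, "backlog": 0.1, "input_speed": 0.2},
    ],
    total_resources=16,
)
```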
6. The method according to claim 2, further comprising, after the acquiring of the current operation information of each online model group and the historical operation information corresponding to each online model group:
when the online model groups are stable, evaluating the delay parameter, the processing speed, and the stability of the online model groups;
and if the delay parameter, the processing speed, and the stability of the online model groups meet corresponding preset requirements, storing the scheduled operation data of each online model group.
7. The method of claim 6, wherein the preset requirement met by the delay parameter comprises: the mean data processing delay of each online model group is smaller than a threshold;
the preset requirement met by the processing speed comprises: within a first preset duration, the average processing speed of each online model group is smaller than a preset multiple of the maximum processing speed of that online model group;
and the preset requirement met by the stability of the online model groups comprises: within a second preset duration, the number of restarts of each online model group is smaller than a preset value.
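The three checks in claim 7 translate directly into threshold comparisons. The minimal sketch below uses placeholder thresholds and window lengths; the concrete values are illustrative assumptions rather than values fixed by the claim.

```python
def meets_requirements(delays_ms, speeds, max_speed, restarts,
                       delay_threshold_ms=500.0, speed_factor=0.8, max_restarts=2):
    """Claim 7 checks: mean processing delay below a threshold, average processing
    speed below a preset multiple of the maximum speed over the first window, and
    restart count below a preset value over the second window."""
    delay_ok = sum(delays_ms) / len(delays_ms) < delay_threshold_ms
    speed_ok = sum(speeds) / len(speeds) < speed_factor * max_speed
    stable_ok = restarts < max_restarts
    return delay_ok and speed_ok and stable_ok
```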
8. The method according to any one of claims 1 to 7, wherein the obtaining of the historical operation information corresponding to each online model group comprises:
acquiring historical operation information of each model group in the Flink task;
and obtaining historical operation information corresponding to each online model group from the historical operation information of each model group.
9. A system for resource scheduling, comprising:
the first acquisition module is used for acquiring each online model group in the Flink task;
the second acquisition module is used for acquiring the current operation information of each online model group and the historical operation information corresponding to each online model group;
and the scheduling module is used for scheduling resources according to the current operation information of each online model group and the historical operation information corresponding to each online model group.
10. An electronic device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the method of resource scheduling according to any one of claims 1 to 8 when executing the computer program.
11. A computer-readable storage medium, characterized in that a computer program is stored thereon, which computer program, when being executed by a processor, carries out the steps of the method of resource scheduling according to any one of claims 1 to 8.
CN202210320056.XA 2022-03-29 2022-03-29 Resource scheduling method, system, electronic device and medium Pending CN114741187A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210320056.XA CN114741187A (en) 2022-03-29 2022-03-29 Resource scheduling method, system, electronic device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210320056.XA CN114741187A (en) 2022-03-29 2022-03-29 Resource scheduling method, system, electronic device and medium

Publications (1)

Publication Number Publication Date
CN114741187A true CN114741187A (en) 2022-07-12

Family

ID=82276494

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210320056.XA Pending CN114741187A (en) 2022-03-29 2022-03-29 Resource scheduling method, system, electronic device and medium

Country Status (1)

Country Link
CN (1) CN114741187A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116701150A (en) * 2023-06-19 2023-09-05 深圳市银闪科技有限公司 Storage data safety supervision system and method based on Internet of things
CN116701150B (en) * 2023-06-19 2024-01-16 深圳市银闪科技有限公司 Storage data safety supervision system and method based on Internet of things
CN117234711A (en) * 2023-09-05 2023-12-15 合芯科技(苏州)有限公司 Dynamic allocation method, system, equipment and medium for Flink system resources

Similar Documents

Publication Publication Date Title
US10558498B2 (en) Method for scheduling data flow task and apparatus
CN109783237B (en) Resource allocation method and device
CN114741187A (en) Resource scheduling method, system, electronic device and medium
US20150295970A1 (en) Method and device for augmenting and releasing capacity of computing resources in real-time stream computing system
CN111813624B (en) Robot execution time length estimation method based on time length analysis and related equipment thereof
CN112328399A (en) Cluster resource scheduling method and device, computer equipment and storage medium
CN110929799B (en) Method, electronic device, and computer-readable medium for detecting abnormal user
EP3961384A1 (en) Automatic derivation of software engineering artifact attributes from product or service development concepts
CN111381970B (en) Cluster task resource allocation method and device, computer device and storage medium
CN113849848B (en) Data permission configuration method and system
CN114490160A (en) Method, device, equipment and medium for automatically adjusting data tilt optimization factor
CN108595251B (en) Dynamic graph updating method, device, storage engine interface and program medium
US20080215664A1 (en) Occasionally connected edge application architecture
US11636377B1 (en) Artificial intelligence system incorporating automatic model updates based on change point detection using time series decomposing and clustering
CN115774602A (en) Container resource allocation method, device, equipment and storage medium
CN110019783B (en) Attribute word clustering method and device
CN113946566B (en) Web system fingerprint database construction method and device and electronic equipment
CN115665157A (en) Balanced scheduling method and system based on application resource types
CN114860672A (en) Node management method and system for batch processing data task
CN110728372B (en) Cluster design method and cluster system for dynamic loading of artificial intelligent model
CN113296951A (en) Resource allocation scheme determination method and equipment
CN113343133A (en) Display page generation method, related device and computer program product
CN112231299A (en) Method and device for dynamically adjusting feature library
CN112632990B (en) Label acquisition method, device, equipment and readable storage medium
CN117519913B (en) Method and system for elastically telescoping scheduling of container memory resources

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination