CN113419837A - Method and device for scheduling machine learning task

Method and device for scheduling machine learning task

Info

Publication number
CN113419837A
CN113419837A
Authority
CN
China
Prior art keywords
machine learning
task
cluster
model
target
Prior art date
Legal status
Pending
Application number
CN202110782059.0A
Other languages
Chinese (zh)
Inventor
李龙飞
周俊
Current Assignee
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd
Priority to CN202110782059.0A
Publication of CN113419837A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Abstract

The present disclosure discloses a method and apparatus for scheduling machine learning tasks. The method comprises: receiving a machine learning task submitted by a user to a cluster, wherein the cluster is configured with a plurality of machine learning models capable of executing the machine learning task; selecting a target machine learning model from the plurality of machine learning models according to model statistical information of the plurality of machine learning models; and scheduling the machine learning task to a target working node of the cluster, and instructing the target working node to process the machine learning task using the target machine learning model.

Description

Method and device for scheduling machine learning task
Technical Field
The disclosure relates to the field of task scheduling, in particular to a method and a device for scheduling machine learning tasks.
Background
With the rise of artificial intelligence, machine learning models, as a core technology of artificial intelligence, are applied more and more widely. To provide sufficient computing and storage resources for a machine learning model, the model is typically loaded and run by a cluster, so that a user can have the model process a machine learning task by submitting that task to the cluster.
At present, users place increasingly high reliability requirements on the processing of machine learning tasks by clusters. However, existing task scheduling mechanisms mainly aim to reduce the processing latency of tasks, and such mechanisms can reduce the reliability with which a cluster processes machine learning tasks.
Disclosure of Invention
In view of this, the present disclosure provides a method and an apparatus for scheduling a machine learning task to improve reliability of cluster processing of the machine learning task.
In a first aspect, a method of scheduling a machine learning task is provided, the method comprising: receiving a machine learning task submitted by a user to a cluster, wherein the cluster is configured with a plurality of machine learning models capable of executing the machine learning task; selecting a target machine learning model from the multiple machine learning models according to the model statistical information of the multiple machine learning models; and scheduling the machine learning task to a target working node of the cluster, and indicating the target working node to process the machine learning task by adopting the target machine learning model.
In a second aspect, an apparatus for scheduling machine learning tasks is provided, the apparatus comprising: a receiving unit configured to receive a machine learning task submitted by a user to a cluster, the cluster being configured with a plurality of machine learning models capable of executing the machine learning task; a first processing unit configured to select a target machine learning model from the plurality of machine learning models according to model statistical information of the plurality of machine learning models; a second processing unit configured to schedule the machine learning task to a target work node of the cluster and instruct the target work node to process the machine learning task using the target machine learning model.
In a third aspect, there is provided an apparatus for scheduling machine learning tasks, comprising a memory having executable code stored therein and a processor configured to execute the executable code to implement the method of the first aspect.
In a fourth aspect, there is provided a computer readable storage medium having stored thereon executable code which, when executed, is capable of implementing the method of the first aspect.
In a fifth aspect, there is provided a computer program product comprising executable code which, when executed, is capable of implementing the method of the first aspect.
The scheduling scheme for machine learning tasks provided by the present disclosure can select a suitable target machine learning model from the plurality of machine learning models configured in the cluster, based on model statistical information of those models, to process the machine learning task submitted by the user. This helps improve the reliability with which the cluster processes machine learning tasks.
Drawings
Fig. 1 is a schematic diagram of a cluster to which the embodiments of the present disclosure are applicable.
Fig. 2 is a flowchart of a method of scheduling a machine learning task according to an embodiment of the present disclosure.
Fig. 3 is a flowchart of a method of scheduling a machine learning task according to another embodiment of the present disclosure.
Fig. 4 is a flowchart of a method of scheduling a machine learning task according to another embodiment of the present disclosure.
Fig. 5 is a schematic structural diagram of an apparatus for scheduling a machine learning task according to an embodiment of the present disclosure.
Fig. 6 is a schematic structural diagram of an apparatus for scheduling a machine learning task according to another embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the drawings in the embodiments of the present disclosure. The described embodiments are only some, rather than all, of the embodiments of the present disclosure.
With the wide application of machine learning, users have increasingly high requirements on the reliability of machine learning tasks. Many factors influence this reliability, such as the success rate of a machine learning model and the amount of memory the model requires. However, the conventional task scheduling mechanism mainly aims to reduce the processing delay of a task: the task is distributed to some working node immediately after being received. Such a task scheduling mechanism may reduce the reliability with which the cluster processes machine learning tasks.
In order to improve the reliability of cluster processing of machine learning tasks, the present disclosure provides a scheme for scheduling machine learning tasks: based on model statistical information of the machine learning models supported by a cluster, a suitable target machine learning model is selected from those models, and the machine learning task submitted by the user is processed using the target machine learning model.
For ease of understanding, a cluster to which embodiments of the present disclosure are applicable will be described below with reference to fig. 1. The cluster shown in fig. 1 comprises a scheduling apparatus 110, a plurality of working nodes 120 and a database 130.
The scheduling apparatus 110 may include a model statistics unit 111, a queue statistics unit 112, and a work node statistics unit 113. In some embodiments, the scheduling apparatus 110 may include one or more of a model statistics unit 111, a queue statistics unit 112, and a work node statistics unit 113.
The model statistics unit 111 may be configured to collect model statistical information for the machine learning models supported by the cluster. The model statistical information may include one or more of the following items for a machine learning model: model size, memory occupancy, number of models, number of calls within a preset time period, time required to execute the machine learning task, and success rate of executing the machine learning task.
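By way of illustration only (this code is not part of the patent disclosure), the items of model statistical information enumerated above could be grouped into a single record as in the following sketch; all names, field types, and units are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class ModelStats:
    """Hypothetical record grouping the model statistics listed above."""
    model_name: str
    model_size_mb: float     # model size
    memory_usage_mb: float   # memory occupancy when loaded
    num_models: int          # number of models
    calls_in_window: int     # number of calls within a preset time period
    avg_exec_time_s: float   # time required to execute the machine learning task
    success_rate: float      # success rate of executing the task, in [0.0, 1.0]
```

A record of this shape is what the selection examples under step S220 below would consume.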
The model statistical information may be fed back (or reported) by the working nodes 120 to the scheduling apparatus 110; the specific feedback manner is not limited in the embodiments of the present disclosure. For example, a working node 120 may periodically feed back its statistical information to the scheduling apparatus 110. For another example, a working node 120 may feed back its statistical information to the scheduling apparatus 110 upon an indication from the scheduling apparatus 110.
The queue statistics unit 112 may be configured to collect queue statistical information of the task queue in the cluster. The queue statistical information may include one or more of the following items for a pending task of the cluster: submission time, time already waited, and expected waiting time; and/or it may include one or more of the following items for a task being processed or already processed by the cluster: the working node executing the task, the amount of resources consumed to execute the task, and the time consumed to execute the task.
The amount of resources consumed to execute a task may include one or more of: the Central Processing Unit (CPU) resources consumed, the Graphics Processing Unit (GPU) resources consumed, the memory resources of the working node consumed, the network resources consumed, and the like.
In some embodiments, the waiting time of a pending task may be estimated from the waiting times of tasks of the same type. For example, the average waiting time of same-type tasks in the cluster may be used as the expected waiting time of the pending task. For another example, the maximum waiting time of same-type tasks in the cluster may be used. Of course, the waiting time may also be predicted by another machine learning model.
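As a minimal sketch of the same-type estimation just described (the function name and data layout are assumptions for illustration, not part of the disclosure):

```python
def estimate_wait_time(task_type, history):
    """Estimate a pending task's waiting time from same-type tasks.

    `history` maps a task type to a list of observed waiting times in
    seconds (assumed layout). Returns the average, as in the first
    example above; using max(samples) instead would give the cautious
    variant. Returns None when no same-type history exists, in which
    case another model could predict the waiting time.
    """
    samples = history.get(task_type)
    if not samples:
        return None
    return sum(samples) / len(samples)
```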
Queue statistical information may be obtained in two ways: collected by the scheduling apparatus itself, or reported by the working nodes. For example, the queue statistical information of pending tasks, as well as which working node is executing a task being processed or executed an already processed task, may be counted by the scheduling apparatus itself. The amount of resources and the time consumed to execute tasks being processed or already processed may be fed back by the working nodes; for the specific feedback manner, refer to the description of the feedback of model statistical information above.
The working node statistics unit 113 may be configured to collect working node statistical information for the working nodes in the cluster. The working node statistical information includes one or more of the following items for a working node: processing capacity, free resources, and memory capacity.
The working node statistical information above may be fed back by the working nodes; for the feedback manner, refer to the introduction of the feedback of model statistical information above. Of course, the processing capacity and/or memory capacity in the working node statistical information may also be entered by operation and maintenance personnel when the cluster is built.
As described above, the scheduling apparatus 110 may collect one or more of the model statistical information, queue statistical information, and working node statistical information of the whole cluster; therefore, the scheduling apparatus may also be referred to as the central control apparatus of the cluster.
A plurality of worker nodes 120 may be used to perform tasks submitted by users to the cluster. For example, a machine learning model may be employed to process a machine learning task.
The database 130 may be configured to store model information of the machine learning models in the cluster, task processing conditions in the cluster, the processing capacity and load conditions of the working nodes in the cluster, and the like, so that the scheduling apparatus 110 can derive from them the model statistical information, queue statistical information, and working node statistical information.
In some embodiments, the database 130 may be stored in a storage device separate from the scheduling apparatus 110. For example, in a storage server. In other embodiments, the database 130 may also be stored in an internal storage device of the scheduling apparatus 110, for example, may be stored in an internal memory of the scheduling apparatus 110.
The method of the embodiments of the present disclosure will be described with reference to fig. 2 based on the clusters shown in fig. 1. Fig. 2 is a flow chart of a method of scheduling machine learning tasks of an embodiment of the present disclosure. It should be understood that the method shown in fig. 2 may be performed by the scheduling apparatus 110 shown in fig. 1, and may also be performed by other apparatuses having a control function in the cluster. The method shown in fig. 2 includes steps S210 to S230.
In step S210, a machine learning task submitted by a user to a cluster is received.
The cluster may be configured with a plurality of machine learning models capable of executing machine learning tasks. A machine learning model here may be a general machine learning model such as a linear regression model, or a neural network model such as a Convolutional Neural Network (CNN) or a Recurrent Neural Network (RNN).
Different kinds of machine learning models may refer to machine learning models with different principles and/or architectures, or to machine learning models of different versions under the same principle and/or architecture. For example, convolutional neural networks and recurrent neural networks belong to different kinds of machine learning models; different versions of a model under the same convolutional neural network architecture can also be understood as different kinds of machine learning models.
The type of the machine learning task and the type of the machine learning model are not particularly limited in the embodiments of the present disclosure. For example, the machine learning task may be a prediction task, and accordingly, the machine learning model may be a neural network model having a prediction function. For another example, the machine learning task may be a classification task, and accordingly, the machine learning model may be a machine learning model having a data classification function. For another example, the machine learning task may be an image dimension reduction task, and accordingly, the machine learning model may be a machine learning model having an image dimension reduction function.
In step S220, a target machine learning model is selected from the plurality of machine learning models based on the model statistics of the plurality of machine learning models.
The following describes how the target machine learning model may be selected, based on the specific contents of the model statistical information described above. It should be noted that several common selection manners are listed below by way of example only; in the embodiments of the present disclosure, the target machine learning model may be selected based on any one item, or any combination of items, of the model statistical information. For brevity, these combinations are not enumerated one by one.
As an example, when the target model statistical information includes the model sizes of the plurality of machine learning models and their success rates in executing the machine learning task, the target machine learning model selected in step S220 may be the model with the smallest model size among those with the highest success rate. This helps select a machine learning model of reasonable size while ensuring the success rate of the target machine learning model, so that cluster resources are used reasonably.
As an example, when the target model statistical information includes the memory occupancy of the plurality of machine learning models and their success rates in executing the machine learning task, the target machine learning model selected in step S220 may be the model with the smallest memory occupancy among those with the highest success rate. This likewise makes reasonable use of cluster resources while ensuring the success rate of the target machine learning model.
As an example, when the target model statistical information includes the number of calls of the plurality of machine learning models within a preset time period, the target machine learning model selected in step S220 may be the machine learning model with the highest number of calls within that period.
Generally, the number of calls within a preset time period reflects users' experience with a machine learning model: a model with a high success rate, or one that executes machine learning tasks quickly, generally provides a good user experience and is therefore called frequently. Selecting the target machine learning model based on the number of calls within a preset time period thus helps improve the user experience of having the cluster execute machine learning tasks.
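The selection examples above can be sketched as follows. This is an illustrative interpretation only, not the patent's implementation; the function, criterion names, and field names are assumptions (fields match the hypothetical record sketched earlier):

```python
def select_target_model(stats, criterion="success_then_size"):
    """Select a target model per the step S220 examples above.

    stats: a non-empty list of objects with success_rate,
    model_size_mb, and calls_in_window attributes (assumed names).
    - "success_then_size": among the models with the highest success
      rate, pick the one with the smallest model size (first example).
    - "most_called": pick the model with the most calls within the
      preset time period (third example).
    """
    if criterion == "most_called":
        return max(stats, key=lambda m: m.calls_in_window)
    best_rate = max(m.success_rate for m in stats)
    candidates = [m for m in stats if m.success_rate == best_rate]
    return min(candidates, key=lambda m: m.model_size_mb)
```

The memory-occupancy example works the same way with memory occupancy substituted for model size as the tie-breaking key.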
The process of selecting the target machine learning model may be directly selected by the scheduling apparatus 110 as shown in fig. 1 based on the model statistical information, or may be determined by the scheduling apparatus 110 based on the model statistical information and the feedback of the user, and the specific process may refer to the method shown in fig. 3 below.
In step S230, the machine learning task is dispatched to the target work node of the cluster, and the target work node is instructed to process the machine learning task by using the target machine learning model.
As described above, the process of selecting the target machine learning model may be performed by the scheduling apparatus in combination with the model statistics and the feedback of the user, that is, the above step S220 includes steps S310 to S320, and specifically, the method flow may refer to fig. 3.
In step S310, a machine learning model suitable for processing a machine learning task is recommended to a user based on model statistics of a plurality of machine learning models.
In step S320, a target machine learning model is selected from the plurality of machine learning models according to the user' S feedback for the recommendation.
In some embodiments, the scheduling apparatus may recommend a machine learning model suitable for processing the machine learning task directly after the user submits the task, for the user's reference. In other embodiments, the scheduling apparatus may make a recommendation after the user has already selected a machine learning model, when another, better machine learning model exists.
For example, suppose the user selects convolutional neural network version 1 based on the task requirements, and the scheduling apparatus finds, based on the model statistical information, that convolutional neural network version 2 has a higher success rate. The scheduling apparatus may then recommend convolutional neural network version 2 to the user; accordingly, after the user feeds back acceptance of the recommendation, the scheduling apparatus may select convolutional neural network version 2 as the target machine learning model.
Of course, in the embodiments of the present disclosure, instead of recommending a machine learning model to the user, the scheduling apparatus may simply feed the model statistical information back to the user, and the user may autonomously select the target machine learning model based on that information.
In some embodiments, assuming the user has opted in to automatic model selection, then when the performance of the machine learning model selected by the user is poor, or the scheduling apparatus finds, based on the model statistical information, a target machine learning model better suited to processing the machine learning task, the scheduling apparatus may automatically replace the user-selected model with the target machine learning model, further improving the reliability of the machine learning task. If the user has not opted in to automatic model selection, the scheduling apparatus does not readjust the machine learning model the user has selected.
As described above, users have high requirements on the latency of machine learning tasks (especially prediction tasks). In a conventional machine learning task scheduling mechanism, after a user submits a machine learning task request to the cluster, the cluster may roughly select a lightly loaded working node as the target working node based on the load of the working nodes, and send the machine learning task to that node as soon as possible, to reduce the time spent selecting a working node. However, if the selected target working node, although less loaded than other nodes, has no idle resources with which to process the machine learning task, the task still has to wait, and the problem of long task latency is not improved.
Therefore, to reduce the waiting time of machine learning tasks in the cluster, in the embodiments of the present disclosure, before the machine learning task is scheduled to the target working node (i.e., before step S230), the target working node may be selected from the plurality of working nodes based on the working node statistical information of the cluster's working nodes, so as to meet users' stringent latency requirements for cluster processing of machine learning tasks.
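One plausible reading of this node-selection step, sketched for illustration only (the function, dictionary keys, and tie-breaking rule are all assumptions, not from the disclosure):

```python
def select_target_node(nodes, required_cpu, required_mem):
    """Pick a working node with enough free resources for the task.

    nodes: list of dicts with assumed keys "free_cpu" and "free_mem"
    (the working node statistics described above). Returns None when
    no node can execute the task as required, which corresponds to the
    case where the cluster rejects the task.
    """
    feasible = [n for n in nodes
                if n["free_cpu"] >= required_cpu and n["free_mem"] >= required_mem]
    if not feasible:
        return None
    # Among feasible nodes, prefer the one with the most free CPU so
    # the task can start without waiting (one possible tie-break).
    return max(feasible, key=lambda n: n["free_cpu"])
```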
In some embodiments, to improve the user experience, if the working node statistical information of the plurality of working nodes indicates that the cluster has no working node capable of executing the machine learning task as the user requires, information rejecting execution of the machine learning task may further be sent to the user. After receiving this information, the user may choose to resubmit the machine learning task to the scheduling apparatus of another cluster and continue the scheduling process shown in fig. 2 or fig. 3, which helps reduce the time the user waits for the cluster to process the task. Of course, the user may also choose to continue waiting after receiving this information.
It should be noted that, if the cluster does not have a working node capable of executing the machine learning task according to the requirement of the user, the information may not be fed back to the user, which is not limited in the embodiment of the present disclosure.
Users generally want the machine learning tasks they submit to be executed quickly, so even if a submitted machine learning task does not actually require large resources, the user may request more resources from the cluster than the task needs. As a result, after the machine learning task is sent to a working node, if that node does not have sufficient resources, the task submitted by the user must continue to wait.
In order to solve the above problem, in the embodiment of the present disclosure, queue statistics of a task queue of a cluster may be sent to a user before a machine learning task is scheduled to a target work node of the cluster, so that the user may determine whether to adjust a resource requested to execute the machine learning task based on the queue statistics.
For example, after a user submits a machine learning task to the cluster and before the task is forwarded to a target working node, the user may obtain the cluster's queue statistical information. If the statistics indicate that tasks of the same type require a long waiting time, while another type of task that requests fewer resources and provides a similar function requires a shorter waiting time, the user may adjust the resources requested by the machine learning task to shorten its waiting time.
In other embodiments, the scheduling apparatus may automatically adjust the resources requested for a machine learning task based on the type of the task submitted by the user and the queue statistical information. For example, after the user submits a machine learning task to the cluster, if the scheduling apparatus determines from the cluster's queue statistical information that tasks of this type require a long waiting time while another type of task that requests fewer resources and provides a similar function requires a shorter one, the scheduling apparatus may automatically adjust the resources requested by the task to shorten its waiting time.
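The automatic adjustment just described could look like the following sketch. Everything here is an illustrative assumption (function name, data layout, scalar resource amounts), not the patent's implementation:

```python
def maybe_adjust_request(task_type, requested, queue_stats, similar):
    """Shrink a task's resource request when a similar, smaller
    request is known to wait less.

    queue_stats: assumed map of task type -> {"expected_wait_s": ...}.
    similar: assumed map of task type -> {"type": ..., "resources": ...}
    describing a similar-function task type with its resource request.
    Returns the (possibly adjusted) requested resource amount.
    """
    alt = similar.get(task_type)
    if alt is None:
        return requested
    wait_same = queue_stats.get(task_type, {}).get("expected_wait_s", 0)
    wait_alt = queue_stats.get(alt["type"], {}).get("expected_wait_s", 0)
    if wait_alt < wait_same and alt["resources"] < requested:
        return alt["resources"]  # shorter queue with fewer resources
    return requested
```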
To facilitate understanding of the solution of the embodiment of the present disclosure, a specific flow of scheduling a machine learning task of the embodiment of the present disclosure is described below with reference to fig. 4 based on the cluster shown in fig. 1. The method shown in fig. 4 includes steps S410 to S420.
It should be understood that the flow shown in fig. 4 comprehensively considers the model statistical information, the queue statistical information, and the work node statistical information, and of course, only one of the pieces of information, or a combination of any two of the pieces of information may also be considered in the embodiment of the present disclosure.
In step S410, the user submits a machine learning task to the scheduler.
In step S411, the scheduling apparatus feeds back queue statistics to the user.
In step S412, the user adjusts the resources requested to execute the machine learning task based on the queue statistics and resubmits the machine learning task.
In step S413, the scheduling apparatus feeds back the model statistical information to the user.
In step S414, the user selects a first machine learning model based on the model statistics.
In step S415, the user transmits a model selection result indicating selection of the first machine learning model to the scheduling apparatus.
In step S416, the scheduling apparatus determines, based on the model statistical information, whether a second machine learning model exists whose performance is higher than that of the first machine learning model. If the second machine learning model exists, steps S417 and S418 are executed; if it does not exist, step S419 is executed directly.
In step S417, the scheduling apparatus recommends the second machine learning model to the user.
In step S418, the scheduling device receives user feedback to agree to adjust the first machine learning model to the second machine learning model.
In step S419, the scheduling apparatus selects a target worker node based on the worker node statistical information.
In step S420, the scheduling device schedules the machine learning task to the target work node and instructs the target work node to execute the machine learning task using the selected machine learning model.
Specifically, in the case where the second machine learning model exists, the selected machine learning model may be the second machine learning model, and in the case where the second machine learning model does not exist, the selected machine learning model may be the first machine learning model.
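The fig. 4 flow described in steps S410 through S420 can be summarized in the following sketch. The objects and method names are placeholders invented for illustration; the patent does not prescribe this interface:

```python
def schedule(task, user, scheduler):
    """Illustrative walk-through of the fig. 4 flow (S411-S420)."""
    user.receive(scheduler.queue_stats())                # S411: feed back queue statistics
    task = user.resubmit_with_adjusted_resources(task)   # S412: adjust requested resources
    user.receive(scheduler.model_stats())                # S413: feed back model statistics
    first = user.select_model()                          # S414-S415: user picks a model
    second = scheduler.find_better_model(first)          # S416: look for a better model
    model = first
    if second is not None and user.accepts_recommendation(second):
        model = second                                   # S417-S418: recommendation accepted
    node = scheduler.select_node()                       # S419: pick target working node
    scheduler.dispatch(task, node, model)                # S420: schedule task to the node
```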
Method embodiments of the present disclosure are described in detail above in conjunction with fig. 1-4, and apparatus embodiments of the present disclosure are described in detail below in conjunction with fig. 5-6. It is to be understood that the description of the method embodiments corresponds to the description of the apparatus embodiments, and therefore reference may be made to the preceding method embodiments for parts not described in detail.
Fig. 5 is a schematic structural diagram of an apparatus for scheduling a machine learning task according to an embodiment of the present disclosure. The apparatus 500 shown in fig. 5 may be the scheduling apparatus 110 described above. The apparatus 500 may comprise a receiving unit 510, a first processing unit 520 and a second processing unit 530. These units are described in detail below.
A receiving unit 510 configured to receive a machine learning task submitted by a user to a cluster, the cluster being configured with a plurality of machine learning models capable of performing the machine learning task.
A first processing unit 520 configured to select a target machine learning model from the plurality of machine learning models according to model statistics of the plurality of machine learning models.
A second processing unit 530 configured to schedule the machine learning task to a target work node of the cluster and instruct the target work node to process the machine learning task using the target machine learning model.
Optionally, the apparatus further comprises: a recommending unit configured to recommend a machine learning model suitable for processing the machine learning task to the user according to model statistical information of the plurality of machine learning models; a third processing unit configured to select a target machine learning model from the plurality of machine learning models according to the user's feedback for the recommendation.
Optionally, the model statistical information includes one or more of the following information of a machine learning model: model size, memory footprint, number of models, number of calls within a preset time period, time required to execute the machine learning task, and success rate of executing the machine learning task.
Optionally, the apparatus further comprises: a fourth processing unit, configured to select the target working node from a plurality of working nodes of the cluster according to working node statistical information of the working nodes, where the working node statistical information includes one or more of the following information of the working nodes: processing power, free resources, and memory capacity.
Optionally, the apparatus further comprises: a fifth processing unit, configured to send information to the user refusing to execute the machine learning task if the working node statistical information of the plurality of working nodes indicates that the cluster has no working node capable of executing the machine learning task as required by the user.
Optionally, the apparatus further comprises: a sending unit configured to send queue statistical information of the task queue of the cluster to the user, where the queue statistical information includes one or more of the following information of the pending tasks of the cluster: submission time, time already waited, and expected waiting time; and/or one or more of the following information of the in-progress or completed tasks of the cluster: the working node executing the task, the amount of resources consumed to execute the task, and the time consumed to execute the task.
Fig. 6 is a schematic structural diagram of an apparatus for scheduling a machine learning task according to another embodiment of the present disclosure. The apparatus 600 shown in fig. 6 may be any node having a control function in a cluster. For example, the apparatus 600 may be a scheduling apparatus or other control server. The apparatus 600 may include a memory 610 and a processor 620. Memory 610 may be used to store executable code. The processor 620 is operable to execute the executable code stored in the memory 610 to implement the steps of the various methods described previously. In some embodiments, the apparatus 600 may further include a network interface 630, and the data exchange between the processor 620 and the external device may be implemented through the network interface 630.
In the embodiment of the present disclosure, the processor 620 may be a general-purpose CPU, a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to execute related programs, so as to implement the technical solutions provided by the embodiments of the present disclosure.
The memory 610, which may include both read-only memory and random access memory, provides instructions and data to the processor 620. A portion of the memory 610 may also include non-volatile random access memory. For example, the memory 610 may also store device type information.
In the above embodiments, the implementation may be realized in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, the embodiments may be realized in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the procedures or functions described in accordance with the embodiments of the present disclosure are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, from one website, computer, server, or data center to another via a wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) connection. The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., digital video disc (DVD)), or a semiconductor medium (e.g., solid-state drive (SSD)), among others.
In the above embodiments, the term "and/or" describes an association relationship between associated objects and indicates that three relationships may exist. For example, "A and/or B" may mean: A exists alone, both A and B exist, or B exists alone. In addition, the character "/" herein generally indicates an "or" relationship between the preceding and following objects.
In the various embodiments of the present disclosure, the sequence numbers of the foregoing processes do not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present disclosure.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a logical division, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through interfaces, devices, or units, and may be electrical, mechanical, or in another form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The above description covers only specific embodiments of the present disclosure, but the scope of the present disclosure is not limited thereto; any changes or substitutions that a person skilled in the art can readily conceive of within the technical scope of the present disclosure shall be covered by the scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (13)

1. A method of scheduling machine learning tasks, comprising:
receiving a machine learning task submitted by a user to a cluster, wherein the cluster is configured with a plurality of machine learning models capable of executing the machine learning task;
selecting a target machine learning model from the multiple machine learning models according to the model statistical information of the multiple machine learning models;
and scheduling the machine learning task to a target working node of the cluster, and indicating the target working node to process the machine learning task by adopting the target machine learning model.
2. The method of claim 1, the selecting a target machine learning model from the plurality of machine learning models according to the model statistics of the plurality of machine learning models, comprising:
recommending a machine learning model suitable for processing the machine learning task to the user according to the model statistical information of the multiple machine learning models;
selecting a target machine learning model from the plurality of machine learning models based on the user's feedback for the recommendation.
3. The method of claim 1, the model statistical information comprising one or more of the following information of a machine learning model: model size, memory footprint, number of models, number of calls within a preset time period, time required to execute the machine learning task, and success rate of executing the machine learning task.
4. The method of claim 1, prior to the scheduling the machine learning task to a target work node of the cluster, the method further comprising:
selecting the target working node from the plurality of working nodes according to working node statistical information of the plurality of working nodes of the cluster, wherein the working node statistical information comprises one or more of the following information of the working nodes: processing power, free resources, and memory capacity.
5. The method of claim 4, the selecting the target worker node from the plurality of worker nodes of the cluster based on worker node statistics for the plurality of worker nodes, comprising:
if the working node statistical information of the plurality of working nodes indicates that the cluster has no working node capable of executing the machine learning task as required by the user, sending information to the user refusing to execute the machine learning task.
6. The method of claim 1, prior to scheduling the machine learning task to a target work node of the cluster, the method further comprising:
sending queue statistical information of the task queue of the cluster to the user, wherein the queue statistical information of the task queue of the cluster comprises one or more of the following information of the pending tasks of the cluster: submission time, time already waited, and expected waiting time; and/or
the queue statistical information of the task queue of the cluster comprises one or more of the following information of the in-progress or completed tasks of the cluster: a working node executing a task, an amount of resources consumed to execute the task, and a time consumed to execute the task.
7. An apparatus to schedule machine learning tasks, comprising:
a receiving unit configured to receive a machine learning task submitted by a user to a cluster, the cluster being configured with a plurality of machine learning models capable of executing the machine learning task;
a first processing unit configured to select a target machine learning model from the plurality of machine learning models according to model statistical information of the plurality of machine learning models;
a second processing unit configured to schedule the machine learning task to a target work node of the cluster and instruct the target work node to process the machine learning task using the target machine learning model.
8. The apparatus of claim 7, further comprising:
a recommending unit configured to recommend a machine learning model suitable for processing the machine learning task to the user according to model statistical information of the plurality of machine learning models;
a third processing unit configured to select a target machine learning model from the plurality of machine learning models according to the user's feedback for the recommendation.
9. The apparatus of claim 7, the model statistical information comprising one or more of the following information of a machine learning model: model size, memory footprint, number of models, number of calls within a preset time period, time required to execute the machine learning task, and success rate of executing the machine learning task.
10. The apparatus of claim 7, further comprising:
a fourth processing unit, configured to select the target working node from a plurality of working nodes of the cluster according to working node statistical information of the working nodes, where the working node statistical information includes one or more of the following information of the working nodes: processing power, free resources, and memory capacity.
11. The apparatus of claim 10, the apparatus further comprising:
a fifth processing unit, configured to send information to the user refusing to execute the machine learning task if the working node statistical information of the plurality of working nodes indicates that the cluster has no working node capable of executing the machine learning task as required by the user.
12. The apparatus of claim 7, further comprising:
a sending unit configured to send queue statistical information of the task queue of the cluster to the user, wherein the queue statistical information of the task queue of the cluster comprises one or more of the following information of the pending tasks of the cluster: submission time, time already waited, and expected waiting time; and/or
the queue statistical information of the task queue of the cluster comprises one or more of the following information of the in-progress or completed tasks of the cluster: a working node executing a task, an amount of resources consumed to execute the task, and a time consumed to execute the task.
13. An apparatus to schedule a machine learning task, comprising a memory having executable code stored therein and a processor configured to execute the executable code to implement the method of any of claims 1-6.
CN202110782059.0A 2021-07-09 2021-07-09 Method and device for scheduling machine learning task Pending CN113419837A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110782059.0A CN113419837A (en) 2021-07-09 2021-07-09 Method and device for scheduling machine learning task


Publications (1)

Publication Number Publication Date
CN113419837A true CN113419837A (en) 2021-09-21

Family

ID=77720654

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110782059.0A Pending CN113419837A (en) 2021-07-09 2021-07-09 Method and device for scheduling machine learning task

Country Status (1)

Country Link
CN (1) CN113419837A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190087383A1 (en) * 2017-09-19 2019-03-21 Beijing Baidu Netcom Science And Technology Co., Ltd. Intelligent big data system, and method and apparatus for providing intelligent big data service
CN110750342A (en) * 2019-05-23 2020-02-04 北京嘀嘀无限科技发展有限公司 Scheduling method, scheduling device, electronic equipment and readable storage medium
US20200184494A1 (en) * 2018-12-05 2020-06-11 Legion Technologies, Inc. Demand Forecasting Using Automatic Machine-Learning Model Selection
CN112966438A (en) * 2021-03-05 2021-06-15 北京金山云网络技术有限公司 Machine learning algorithm selection method and distributed computing system


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TIAN Chunqi; LI Jing; WANG Wei; ZHANG Liqing: "A Machine Learning Based Method for Improving the Performance of Spark Container Clusters", 信息网络安全 (Netinfo Security), no. 04, 10 April 2019 (2019-04-10) *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination