CN113419837A - Method and device for scheduling machine learning task

Method and device for scheduling machine learning task

Info

Publication number
CN113419837A
CN113419837A
Authority
CN
China
Prior art keywords
machine learning
task
cluster
model
target
Prior art date
Legal status
Pending
Application number
CN202110782059.0A
Other languages
Chinese (zh)
Inventor
李龙飞
周俊
Current Assignee
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd
Priority to CN202110782059.0A
Publication of CN113419837A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Abstract

The present disclosure discloses a method and apparatus for scheduling machine learning tasks. The method comprises: receiving a machine learning task submitted by a user to a cluster, wherein the cluster is configured with a plurality of machine learning models capable of executing the machine learning task; selecting a target machine learning model from the plurality of machine learning models according to model statistical information of the plurality of machine learning models; and scheduling the machine learning task to a target working node of the cluster, and instructing the target working node to process the machine learning task using the target machine learning model.

Description

Method and device for scheduling machine learning task
Technical Field
The disclosure relates to the field of task scheduling, in particular to a method and a device for scheduling machine learning tasks.
Background
With the rise of artificial intelligence, machine learning models, as a core technology of artificial intelligence, are applied more and more widely. To provide sufficient computing and storage resources for a machine learning model, the model is typically loaded and run by a cluster, so that a user can have the model process a machine learning task by submitting that task to the cluster.
At present, users place increasingly high reliability requirements on the processing of machine learning tasks by clusters. However, existing task scheduling mechanisms mainly aim to reduce the processing latency of tasks, and such mechanisms can reduce the reliability with which a cluster processes machine learning tasks.
Disclosure of Invention
In view of this, the present disclosure provides a method and an apparatus for scheduling a machine learning task to improve reliability of cluster processing of the machine learning task.
In a first aspect, a method of scheduling a machine learning task is provided, the method comprising: receiving a machine learning task submitted by a user to a cluster, wherein the cluster is configured with a plurality of machine learning models capable of executing the machine learning task; selecting a target machine learning model from the multiple machine learning models according to the model statistical information of the multiple machine learning models; and scheduling the machine learning task to a target working node of the cluster, and indicating the target working node to process the machine learning task by adopting the target machine learning model.
In a second aspect, an apparatus for scheduling machine learning tasks is provided, the apparatus comprising: a receiving unit configured to receive a machine learning task submitted by a user to a cluster, the cluster being configured with a plurality of machine learning models capable of executing the machine learning task; a first processing unit configured to select a target machine learning model from the plurality of machine learning models according to model statistical information of the plurality of machine learning models; a second processing unit configured to schedule the machine learning task to a target work node of the cluster and instruct the target work node to process the machine learning task using the target machine learning model.
In a third aspect, there is provided an apparatus for scheduling machine learning tasks, comprising a memory having executable code stored therein and a processor configured to execute the executable code to implement the method of the first aspect.
In a fourth aspect, there is provided a computer readable storage medium having stored thereon executable code which, when executed, is capable of implementing the method of the first aspect.
In a fifth aspect, there is provided a computer program product comprising executable code which, when executed, is capable of implementing the method of the first aspect.
The scheduling scheme for machine learning tasks provided by the present disclosure can select a suitable target machine learning model from the plurality of machine learning models configured in the cluster, based on model statistical information of those models, to process the machine learning task submitted by the user. This helps improve the reliability with which the cluster processes machine learning tasks.
Drawings
Fig. 1 is a schematic diagram of a cluster to which the embodiments of the present disclosure are applicable.
Fig. 2 is a flowchart of a method of scheduling a machine learning task according to an embodiment of the present disclosure.
Fig. 3 is a flowchart of a method of scheduling a machine learning task according to another embodiment of the present disclosure.
Fig. 4 is a flowchart of a method of scheduling a machine learning task according to another embodiment of the present disclosure.
Fig. 5 is a schematic structural diagram of an apparatus for scheduling a machine learning task according to an embodiment of the present disclosure.
Fig. 6 is a schematic structural diagram of an apparatus for scheduling a machine learning task according to another embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the drawings in the embodiments of the present disclosure. The described embodiments are only some, rather than all, of the embodiments of the present disclosure.
With the wide application of machine learning, users have increasingly high requirements on the reliability of machine learning tasks. Many factors influence this reliability, such as the success rate of a machine learning model and the amount of memory the model requires. However, the conventional task scheduling mechanism mainly aims to reduce the processing delay of a task: the task is distributed to some working node immediately after being received. Such a task scheduling mechanism may reduce the reliability with which the cluster processes machine learning tasks.
In order to improve the reliability of cluster processing of machine learning tasks, the present disclosure provides a scheme for scheduling machine learning tasks: based on model statistical information of the machine learning models supported by a cluster, a suitable target machine learning model is selected from those models, and the machine learning task submitted by the user is processed using the target machine learning model.
For ease of understanding, a cluster to which embodiments of the present disclosure are applicable will be described below with reference to fig. 1. The cluster shown in fig. 1 comprises a scheduling apparatus 110, a plurality of working nodes 120 and a database 130.
The scheduling apparatus 110 may include a model statistics unit 111, a queue statistics unit 112, and a work node statistics unit 113. In some embodiments, the scheduling apparatus 110 may include one or more of a model statistics unit 111, a queue statistics unit 112, and a work node statistics unit 113.
The model statistics unit 111 may be configured to collect model statistical information for the machine learning models supported by the cluster. The model statistical information may include one or more of the following items for a machine learning model: model size, memory occupancy, number of models, number of calls within a preset time period, time required to execute the machine learning task, and success rate of executing the machine learning task.
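By way of illustration only (this code is not part of the patent disclosure), the items of model statistical information enumerated above could be grouped into a single record as in the following sketch; all names, field types, and units are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class ModelStats:
    """Hypothetical record grouping the model statistics listed above."""
    model_name: str
    model_size_mb: float     # model size
    memory_usage_mb: float   # memory occupancy when loaded
    num_models: int          # number of models
    calls_in_window: int     # number of calls within a preset time period
    avg_exec_time_s: float   # time required to execute the machine learning task
    success_rate: float      # success rate of executing the task, in [0.0, 1.0]
```

A record of this shape is what the selection examples under step S220 below would consume.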
The model statistical information may be fed back (or reported) by the working nodes 120 to the scheduling apparatus 110; the specific feedback manner is not limited in the embodiments of the present disclosure. For example, a working node 120 may periodically feed back its statistical information to the scheduling apparatus 110. For another example, a working node 120 may feed back its statistical information to the scheduling apparatus 110 upon an indication from the scheduling apparatus 110.
The queue statistics unit 112 may be configured to collect queue statistical information of the task queue in the cluster. The queue statistical information may include one or more of the following items for a pending task of the cluster: submission time, time already waited, and expected waiting time; and/or it may include one or more of the following items for a task being processed or already processed by the cluster: the working node executing the task, the amount of resources consumed to execute the task, and the time consumed to execute the task.
The amount of resources consumed to execute a task may include one or more of: the Central Processing Unit (CPU) resources consumed, the Graphics Processing Unit (GPU) resources consumed, the memory resources of the working node consumed, the network resources consumed, and the like.
In some embodiments, the waiting time of a pending task may be estimated from the waiting times of tasks of the same type. For example, the average waiting time of same-type tasks in the cluster may be used as the expected waiting time of the pending task. For another example, the maximum waiting time of same-type tasks in the cluster may be used. Of course, the waiting time may also be predicted by another machine learning model.
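As a minimal sketch of the same-type estimation just described (the function name and data layout are assumptions for illustration, not part of the disclosure):

```python
def estimate_wait_time(task_type, history):
    """Estimate a pending task's waiting time from same-type tasks.

    `history` maps a task type to a list of observed waiting times in
    seconds (assumed layout). Returns the average, as in the first
    example above; using max(samples) instead would give the cautious
    variant. Returns None when no same-type history exists, in which
    case another model could predict the waiting time.
    """
    samples = history.get(task_type)
    if not samples:
        return None
    return sum(samples) / len(samples)
```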
Queue statistical information may be obtained in two ways: collected by the scheduling apparatus itself, or reported by the working nodes. For example, the queue statistical information of pending tasks, as well as which working node is executing a task being processed or executed an already processed task, may be counted by the scheduling apparatus itself. The amount of resources and the time consumed to execute tasks being processed or already processed may be fed back by the working nodes; for the specific feedback manner, refer to the description of the feedback of model statistical information above.
The working node statistics unit 113 may be configured to collect working node statistical information for the working nodes in the cluster. The working node statistical information includes one or more of the following items for a working node: processing capacity, free resources, and memory capacity.
The working node statistical information above may be fed back by the working nodes; for the feedback manner, refer to the introduction of the feedback of model statistical information above. Of course, the processing capacity and/or memory capacity in the working node statistical information may also be entered by operation and maintenance personnel when the cluster is built.
As described above, the scheduling apparatus 110 may collect one or more of the model statistical information, queue statistical information, and working node statistical information of the whole cluster; therefore, the scheduling apparatus may also be referred to as the central control apparatus of the cluster.
A plurality of worker nodes 120 may be used to perform tasks submitted by users to the cluster. For example, a machine learning model may be employed to process a machine learning task.
The database 130 may be configured to store model information of the machine learning models in the cluster, task processing conditions in the cluster, the processing capacity and load conditions of the working nodes in the cluster, and the like, so that the scheduling apparatus 110 can derive from them the model statistical information, queue statistical information, and working node statistical information.
In some embodiments, the database 130 may be stored in a storage device separate from the scheduling apparatus 110. For example, in a storage server. In other embodiments, the database 130 may also be stored in an internal storage device of the scheduling apparatus 110, for example, may be stored in an internal memory of the scheduling apparatus 110.
The method of the embodiments of the present disclosure will be described with reference to fig. 2 based on the clusters shown in fig. 1. Fig. 2 is a flow chart of a method of scheduling machine learning tasks of an embodiment of the present disclosure. It should be understood that the method shown in fig. 2 may be performed by the scheduling apparatus 110 shown in fig. 1, and may also be performed by other apparatuses having a control function in the cluster. The method shown in fig. 2 includes steps S210 to S230.
In step S210, a machine learning task submitted by a user to a cluster is received.
The cluster may be configured with a plurality of machine learning models capable of executing machine learning tasks. A machine learning model here may be a general machine learning model such as a linear regression model, or a neural network model such as a Convolutional Neural Network (CNN) or a Recurrent Neural Network (RNN).
Different kinds of machine learning models may refer to machine learning models with different principles and/or architectures, or to machine learning models of different versions under the same principle and/or architecture. For example, convolutional neural networks and recurrent neural networks belong to different kinds of machine learning models; different versions of a model under the same convolutional neural network architecture can also be understood as different kinds of machine learning models.
The type of the machine learning task and the type of the machine learning model are not particularly limited in the embodiments of the present disclosure. For example, the machine learning task may be a prediction task, and accordingly, the machine learning model may be a neural network model having a prediction function. For another example, the machine learning task may be a classification task, and accordingly, the machine learning model may be a machine learning model having a data classification function. For another example, the machine learning task may be an image dimension reduction task, and accordingly, the machine learning model may be a machine learning model having an image dimension reduction function.
In step S220, a target machine learning model is selected from the plurality of machine learning models based on the model statistics of the plurality of machine learning models.
The following describes how the target machine learning model may be selected, based on the specific contents of the model statistical information described above. It should be noted that several common selection manners are listed below by way of example only; in the embodiments of the present disclosure, the target machine learning model may be selected based on any one item, or any combination of items, of the model statistical information. For brevity, these combinations are not enumerated one by one.
As an example, when the target model statistical information includes the model sizes of the plurality of machine learning models and their success rates in executing the machine learning task, the target machine learning model selected in step S220 may be the model with the smallest model size among those with the highest success rate. This helps select a machine learning model of reasonable size while ensuring the success rate of the target machine learning model, so that cluster resources are used reasonably.
As an example, when the target model statistical information includes the memory occupancy of the plurality of machine learning models and their success rates in executing the machine learning task, the target machine learning model selected in step S220 may be the model with the smallest memory occupancy among those with the highest success rate. This likewise makes reasonable use of cluster resources while ensuring the success rate of the target machine learning model.
As an example, when the target model statistical information includes the number of calls of the plurality of machine learning models within a preset time period, the target machine learning model selected in step S220 may be the machine learning model with the highest number of calls within that period.
Generally, the number of calls within a preset time period reflects users' experience with a machine learning model: a model with a high success rate, or one that executes machine learning tasks quickly, generally provides a good user experience and is therefore called frequently. Selecting the target machine learning model based on the number of calls within a preset time period thus helps improve the user experience of having the cluster execute machine learning tasks.
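The selection examples above can be sketched as follows. This is an illustrative interpretation only, not the patent's implementation; the function, criterion names, and field names are assumptions (fields match the hypothetical record sketched earlier):

```python
def select_target_model(stats, criterion="success_then_size"):
    """Select a target model per the step S220 examples above.

    stats: a non-empty list of objects with success_rate,
    model_size_mb, and calls_in_window attributes (assumed names).
    - "success_then_size": among the models with the highest success
      rate, pick the one with the smallest model size (first example).
    - "most_called": pick the model with the most calls within the
      preset time period (third example).
    """
    if criterion == "most_called":
        return max(stats, key=lambda m: m.calls_in_window)
    best_rate = max(m.success_rate for m in stats)
    candidates = [m for m in stats if m.success_rate == best_rate]
    return min(candidates, key=lambda m: m.model_size_mb)
```

The memory-occupancy example works the same way with memory occupancy substituted for model size as the tie-breaking key.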
The process of selecting the target machine learning model may be directly selected by the scheduling apparatus 110 as shown in fig. 1 based on the model statistical information, or may be determined by the scheduling apparatus 110 based on the model statistical information and the feedback of the user, and the specific process may refer to the method shown in fig. 3 below.
In step S230, the machine learning task is dispatched to the target work node of the cluster, and the target work node is instructed to process the machine learning task by using the target machine learning model.
As described above, the process of selecting the target machine learning model may be performed by the scheduling apparatus in combination with the model statistics and the feedback of the user, that is, the above step S220 includes steps S310 to S320, and specifically, the method flow may refer to fig. 3.
In step S310, a machine learning model suitable for processing a machine learning task is recommended to a user based on model statistics of a plurality of machine learning models.
In step S320, a target machine learning model is selected from the plurality of machine learning models according to the user' S feedback for the recommendation.
In some embodiments, the scheduling apparatus may recommend a machine learning model suitable for processing the machine learning task directly after the user submits the task, for the user's reference. In other embodiments, the scheduling apparatus may make a recommendation after the user has already selected a machine learning model, when another, better machine learning model exists.
For example, suppose the user selects convolutional neural network version 1 based on the task requirements, and the scheduling apparatus finds, based on the model statistical information, that convolutional neural network version 2 has a higher success rate. The scheduling apparatus may then recommend convolutional neural network version 2 to the user; accordingly, after the user feeds back acceptance of the recommendation, the scheduling apparatus may select convolutional neural network version 2 as the target machine learning model.
Of course, in the embodiments of the present disclosure, instead of recommending a machine learning model to the user, the scheduling apparatus may simply feed the model statistical information back to the user, and the user may autonomously select the target machine learning model based on that information.
In some embodiments, assuming the user has opted in to automatic model selection, then when the performance of the machine learning model selected by the user is poor, or the scheduling apparatus finds, based on the model statistical information, a target machine learning model better suited to processing the machine learning task, the scheduling apparatus may automatically replace the user-selected model with the target machine learning model, further improving the reliability of the machine learning task. If the user has not opted in to automatic model selection, the scheduling apparatus does not readjust the machine learning model the user has selected.
As described above, users have high requirements on the latency of machine learning tasks (especially prediction tasks). In a conventional machine learning task scheduling mechanism, after a user submits a machine learning task request to the cluster, the cluster may roughly select a lightly loaded working node as the target working node based on the load of the working nodes, and send the machine learning task to that node as soon as possible, to reduce the time spent selecting a working node. However, if the selected target working node, although less loaded than other nodes, has no idle resources with which to process the machine learning task, the task still has to wait, and the problem of long task latency is not improved.
Therefore, to reduce the waiting time of machine learning tasks in the cluster, in the embodiments of the present disclosure, before the machine learning task is scheduled to the target working node (i.e., before step S230), the target working node may be selected from the plurality of working nodes based on the working node statistical information of the cluster's working nodes, so as to meet users' stringent latency requirements for cluster processing of machine learning tasks.
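One plausible reading of this node-selection step, sketched for illustration only (the function, dictionary keys, and tie-breaking rule are all assumptions, not from the disclosure):

```python
def select_target_node(nodes, required_cpu, required_mem):
    """Pick a working node with enough free resources for the task.

    nodes: list of dicts with assumed keys "free_cpu" and "free_mem"
    (the working node statistics described above). Returns None when
    no node can execute the task as required, which corresponds to the
    case where the cluster rejects the task.
    """
    feasible = [n for n in nodes
                if n["free_cpu"] >= required_cpu and n["free_mem"] >= required_mem]
    if not feasible:
        return None
    # Among feasible nodes, prefer the one with the most free CPU so
    # the task can start without waiting (one possible tie-break).
    return max(feasible, key=lambda n: n["free_cpu"])
```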
In some embodiments, to improve the user experience, if the working node statistical information of the plurality of working nodes indicates that the cluster has no working node capable of executing the machine learning task as the user requires, information rejecting execution of the machine learning task may further be sent to the user. After receiving this information, the user may choose to resubmit the machine learning task to the scheduling apparatus of another cluster and continue the scheduling process shown in fig. 2 or fig. 3, which helps reduce the time the user waits for the cluster to process the task. Of course, the user may also choose to continue waiting after receiving this information.
It should be noted that, if the cluster does not have a working node capable of executing the machine learning task according to the requirement of the user, the information may not be fed back to the user, which is not limited in the embodiment of the present disclosure.
Users generally want the machine learning tasks they submit to be executed quickly, so even if a submitted machine learning task does not actually require large resources, the user may request more resources from the cluster than the task needs. As a result, after the machine learning task is sent to a working node, if that node does not have sufficient resources, the task submitted by the user must continue to wait.
In order to solve the above problem, in the embodiment of the present disclosure, queue statistics of a task queue of a cluster may be sent to a user before a machine learning task is scheduled to a target work node of the cluster, so that the user may determine whether to adjust a resource requested to execute the machine learning task based on the queue statistics.
For example, after a user submits a machine learning task to the cluster and before the task is forwarded to a target working node, the user may obtain the cluster's queue statistical information. If the statistics indicate that tasks of the same type require a long waiting time, while another type of task that requests fewer resources and provides a similar function requires a shorter waiting time, the user may adjust the resources requested by the machine learning task to shorten its waiting time.
In other embodiments, the scheduling apparatus may automatically adjust the resources requested for a machine learning task based on the type of the task submitted by the user and the queue statistical information. For example, after the user submits a machine learning task to the cluster, if the scheduling apparatus determines from the cluster's queue statistical information that tasks of this type require a long waiting time while another type of task that requests fewer resources and provides a similar function requires a shorter one, the scheduling apparatus may automatically adjust the resources requested by the task to shorten its waiting time.
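The automatic adjustment just described could look like the following sketch. Everything here is an illustrative assumption (function name, data layout, scalar resource amounts), not the patent's implementation:

```python
def maybe_adjust_request(task_type, requested, queue_stats, similar):
    """Shrink a task's resource request when a similar, smaller
    request is known to wait less.

    queue_stats: assumed map of task type -> {"expected_wait_s": ...}.
    similar: assumed map of task type -> {"type": ..., "resources": ...}
    describing a similar-function task type with its resource request.
    Returns the (possibly adjusted) requested resource amount.
    """
    alt = similar.get(task_type)
    if alt is None:
        return requested
    wait_same = queue_stats.get(task_type, {}).get("expected_wait_s", 0)
    wait_alt = queue_stats.get(alt["type"], {}).get("expected_wait_s", 0)
    if wait_alt < wait_same and alt["resources"] < requested:
        return alt["resources"]  # shorter queue with fewer resources
    return requested
```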
To facilitate understanding of the solution of the embodiment of the present disclosure, a specific flow of scheduling a machine learning task of the embodiment of the present disclosure is described below with reference to fig. 4 based on the cluster shown in fig. 1. The method shown in fig. 4 includes steps S410 to S420.
It should be understood that the flow shown in fig. 4 comprehensively considers the model statistical information, the queue statistical information, and the work node statistical information, and of course, only one of the pieces of information, or a combination of any two of the pieces of information may also be considered in the embodiment of the present disclosure.
In step S410, the user submits a machine learning task to the scheduler.
In step S411, the scheduling apparatus feeds back queue statistics to the user.
In step S412, the user adjusts the resources requested to execute the machine learning task based on the queue statistics and resubmits the machine learning task.
In step S413, the scheduling apparatus feeds back the model statistical information to the user.
In step S414, the user selects a first machine learning model based on the model statistics.
In step S415, the user transmits a model selection result indicating selection of the first machine learning model to the scheduling apparatus.
In step S416, the scheduling apparatus determines, based on the model statistical information, whether a second machine learning model exists whose performance is higher than that of the first machine learning model. If the second machine learning model exists, steps S417 and S418 are executed; if it does not exist, step S419 is executed directly.
In step S417, the scheduling apparatus recommends the second machine learning model to the user.
In step S418, the scheduling device receives user feedback to agree to adjust the first machine learning model to the second machine learning model.
In step S419, the scheduling apparatus selects a target worker node based on the worker node statistical information.
In step S420, the scheduling device schedules the machine learning task to the target work node and instructs the target work node to execute the machine learning task using the selected machine learning model.
Specifically, in the case where the second machine learning model exists, the selected machine learning model may be the second machine learning model, and in the case where the second machine learning model does not exist, the selected machine learning model may be the first machine learning model.
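The fig. 4 flow described in steps S410 through S420 can be summarized in the following sketch. The objects and method names are placeholders invented for illustration; the patent does not prescribe this interface:

```python
def schedule(task, user, scheduler):
    """Illustrative walk-through of the fig. 4 flow (S411-S420)."""
    user.receive(scheduler.queue_stats())                # S411: feed back queue statistics
    task = user.resubmit_with_adjusted_resources(task)   # S412: adjust requested resources
    user.receive(scheduler.model_stats())                # S413: feed back model statistics
    first = user.select_model()                          # S414-S415: user picks a model
    second = scheduler.find_better_model(first)          # S416: look for a better model
    model = first
    if second is not None and user.accepts_recommendation(second):
        model = second                                   # S417-S418: recommendation accepted
    node = scheduler.select_node()                       # S419: pick target working node
    scheduler.dispatch(task, node, model)                # S420: schedule task to the node
```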
Method embodiments of the present disclosure are described in detail above in conjunction with fig. 1-4, and apparatus embodiments of the present disclosure are described in detail below in conjunction with fig. 5-6. It is to be understood that the description of the method embodiments corresponds to the description of the apparatus embodiments, and therefore reference may be made to the preceding method embodiments for parts not described in detail.
Fig. 5 is a schematic structural diagram of an apparatus for scheduling a machine learning task according to an embodiment of the present disclosure. The apparatus 500 shown in fig. 5 may be the scheduling apparatus 110 described above. The apparatus 500 may comprise a receiving unit 510, a first processing unit 520 and a second processing unit 530. These units are described in detail below.
A receiving unit 510 configured to receive a machine learning task submitted by a user to a cluster, the cluster being configured with a plurality of machine learning models capable of performing the machine learning task.
A first processing unit 520 configured to select a target machine learning model from the plurality of machine learning models according to model statistics of the plurality of machine learning models.
A second processing unit 530 configured to schedule the machine learning task to a target work node of the cluster and instruct the target work node to process the machine learning task using the target machine learning model.
Optionally, the apparatus further comprises: a recommending unit configured to recommend a machine learning model suitable for processing the machine learning task to the user according to model statistical information of the plurality of machine learning models; a third processing unit configured to select a target machine learning model from the plurality of machine learning models according to the user's feedback for the recommendation.
Optionally, the model statistical information includes one or more of the following information of a machine learning model: model size, memory footprint, number of models, number of calls within a preset time period, time required to execute the machine learning task, and success rate of executing the machine learning task.
Optionally, the apparatus further comprises: a fourth processing unit, configured to select the target working node from a plurality of working nodes of the cluster according to working node statistical information of the working nodes, where the working node statistical information includes one or more of the following information of the working nodes: processing power, free resources, and memory capacity.
Optionally, the apparatus further comprises: a fifth processing unit, configured to send information to the user refusing to execute the machine learning task if the working node statistical information of the plurality of working nodes indicates that the cluster has no working node capable of executing the machine learning task as required by the user.
Optionally, the apparatus further comprises: a sending unit configured to send queue statistical information of the task queue of the cluster to the user, where the queue statistical information includes one or more of the following information of the pending tasks of the cluster: submission time, time already waited, and expected waiting time; and/or one or more of the following information of the in-progress or completed tasks of the cluster: the working node executing the task, the amount of resources consumed to execute the task, and the time consumed to execute the task.
Fig. 6 is a schematic structural diagram of an apparatus for scheduling a machine learning task according to another embodiment of the present disclosure. The apparatus 600 shown in fig. 6 may be any node having a control function in a cluster. For example, the apparatus 600 may be a scheduling apparatus or other control server. The apparatus 600 may include a memory 610 and a processor 620. Memory 610 may be used to store executable code. The processor 620 is operable to execute the executable code stored in the memory 610 to implement the steps of the various methods described previously. In some embodiments, the apparatus 600 may further include a network interface 630, and the data exchange between the processor 620 and the external device may be implemented through the network interface 630.
In the embodiment of the present disclosure, the processor 620 may be a general-purpose CPU, a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to execute related programs, so as to implement the technical solutions provided by the embodiments of the present disclosure.
The memory 610, which may include both read-only memory and random access memory, provides instructions and data to the processor 620. A portion of the memory 610 may also include non-volatile random access memory. For example, the memory 610 may also store device type information.
In the above embodiments, the implementation may be realized in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, the embodiments may be realized in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the procedures or functions described in accordance with the embodiments of the present disclosure are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, from one website, computer, server, or data center to another via a wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) connection. The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., digital video disc (DVD)), or a semiconductor medium (e.g., solid-state drive (SSD)), among others.
In the above embodiments, the term "and/or" describes an association relationship between associated objects and indicates that three relationships may exist. For example, "A and/or B" may mean: A exists alone, both A and B exist, or B exists alone. In addition, the character "/" herein generally indicates an "or" relationship between the preceding and following objects.
In the various embodiments of the present disclosure, the sequence numbers of the foregoing processes do not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present disclosure.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a logical division, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through interfaces, devices, or units, and may be electrical, mechanical, or in another form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The above description covers only specific embodiments of the present disclosure, but the scope of the present disclosure is not limited thereto; any changes or substitutions that a person skilled in the art can readily conceive of within the technical scope of the present disclosure shall be covered by the scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (13)

1. A method of scheduling machine learning tasks, comprising:
receiving a machine learning task submitted by a user to a cluster, wherein the cluster is configured with a plurality of machine learning models capable of executing the machine learning task;
selecting a target machine learning model from the multiple machine learning models according to the model statistical information of the multiple machine learning models;
and scheduling the machine learning task to a target working node of the cluster, and indicating the target working node to process the machine learning task by adopting the target machine learning model.
2. The method of claim 1, the selecting a target machine learning model from the plurality of machine learning models according to the model statistics of the plurality of machine learning models, comprising:
recommending a machine learning model suitable for processing the machine learning task to the user according to the model statistical information of the multiple machine learning models;
selecting a target machine learning model from the plurality of machine learning models based on the user's feedback for the recommendation.
3. The method of claim 1, the model statistical information comprising one or more of the following information of a machine learning model: model size, memory footprint, number of models, number of calls within a preset time period, time required to execute the machine learning task, and success rate of executing the machine learning task.
4. The method of claim 1, prior to the scheduling the machine learning task to a target work node of the cluster, the method further comprising:
selecting the target working node from the plurality of working nodes according to working node statistical information of the plurality of working nodes of the cluster, wherein the working node statistical information comprises one or more of the following information of the working nodes: processing power, free resources, and memory capacity.
5. The method of claim 4, the selecting the target worker node from the plurality of worker nodes of the cluster based on worker node statistics for the plurality of worker nodes, comprising:
if the working node statistical information of the plurality of working nodes indicates that the cluster has no working node capable of executing the machine learning task as required by the user, sending information to the user refusing to execute the machine learning task.
6. The method of claim 1, prior to scheduling the machine learning task to a target work node of the cluster, the method further comprising:
sending queue statistical information of the task queue of the cluster to the user, wherein the queue statistical information of the task queue of the cluster comprises one or more of the following information of the pending tasks of the cluster: submission time, time already waited, and expected waiting time; and/or
the queue statistical information of the task queue of the cluster comprises one or more of the following information of the in-progress or completed tasks of the cluster: a working node executing a task, an amount of resources consumed to execute the task, and a time consumed to execute the task.
7. An apparatus to schedule machine learning tasks, comprising:
a receiving unit configured to receive a machine learning task submitted by a user to a cluster, the cluster being configured with a plurality of machine learning models capable of executing the machine learning task;
a first processing unit configured to select a target machine learning model from the plurality of machine learning models according to model statistical information of the plurality of machine learning models;
a second processing unit configured to schedule the machine learning task to a target work node of the cluster and instruct the target work node to process the machine learning task using the target machine learning model.
8. The apparatus of claim 7, further comprising:
a recommending unit configured to recommend a machine learning model suitable for processing the machine learning task to the user according to model statistical information of the plurality of machine learning models;
a third processing unit configured to select a target machine learning model from the plurality of machine learning models according to the user's feedback for the recommendation.
9. The apparatus of claim 7, the model statistical information comprising one or more of the following information of a machine learning model: model size, memory footprint, number of models, number of calls within a preset time period, time required to execute the machine learning task, and success rate of executing the machine learning task.
10. The apparatus of claim 7, further comprising:
a fourth processing unit, configured to select the target working node from a plurality of working nodes of the cluster according to working node statistical information of the working nodes, where the working node statistical information includes one or more of the following information of the working nodes: processing power, free resources, and memory capacity.
11. The apparatus of claim 10, the apparatus further comprising:
a fifth processing unit, configured to send information to the user refusing to execute the machine learning task if the working node statistical information of the plurality of working nodes indicates that the cluster has no working node capable of executing the machine learning task as required by the user.
12. The apparatus of claim 7, further comprising:
a sending unit configured to send queue statistical information of the task queue of the cluster to the user, wherein the queue statistical information of the task queue of the cluster comprises one or more of the following information of the pending tasks of the cluster: submission time, time already waited, and expected waiting time; and/or
the queue statistical information of the task queue of the cluster comprises one or more of the following information of the in-progress or completed tasks of the cluster: a working node executing a task, an amount of resources consumed to execute the task, and a time consumed to execute the task.
13. An apparatus to schedule a machine learning task, comprising a memory having executable code stored therein and a processor configured to execute the executable code to implement the method of any of claims 1-6.
CN202110782059.0A 2021-07-09 2021-07-09 Method and device for scheduling machine learning task Pending CN113419837A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110782059.0A CN113419837A (en) 2021-07-09 2021-07-09 Method and device for scheduling machine learning task


Publications (1)

Publication Number Publication Date
CN113419837A true CN113419837A (en) 2021-09-21

Family

ID=77720654

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110782059.0A Pending CN113419837A (en) 2021-07-09 2021-07-09 Method and device for scheduling machine learning task

Country Status (1)

Country Link
CN (1) CN113419837A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190087383A1 (en) * 2017-09-19 2019-03-21 Beijing Baidu Netcom Science And Technology Co., Ltd. Intelligent big data system, and method and apparatus for providing intelligent big data service
CN110750342A (en) * 2019-05-23 2020-02-04 北京嘀嘀无限科技发展有限公司 Scheduling method, scheduling device, electronic equipment and readable storage medium
US20200184494A1 (en) * 2018-12-05 2020-06-11 Legion Technologies, Inc. Demand Forecasting Using Automatic Machine-Learning Model Selection
CN112966438A (en) * 2021-03-05 2021-06-15 北京金山云网络技术有限公司 Machine learning algorithm selection method and distributed computing system


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TIAN Chunqi; LI Jing; WANG Wei; ZHANG Liqing: "A Machine Learning Based Method for Improving the Performance of Spark Container Clusters", 信息网络安全 (Netinfo Security), no. 04, 10 April 2019 (2019-04-10) *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination