CN113033814A - Method, apparatus and storage medium for training machine learning model - Google Patents

Method, apparatus and storage medium for training machine learning model

Info

Publication number
CN113033814A
CN113033814A (application CN201911253316.0A)
Authority
CN
China
Prior art keywords
machine learning
training
model
training task
management system
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911253316.0A
Other languages
Chinese (zh)
Inventor
韩卫强
李云彬
彭祚聪
权圣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongguancun Kejin Technology Co Ltd
Original Assignee
Beijing Zhongguancun Kejin Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongguancun Kejin Technology Co Ltd filed Critical Beijing Zhongguancun Kejin Technology Co Ltd
Priority to CN201911253316.0A priority Critical patent/CN113033814A/en
Publication of CN113033814A publication Critical patent/CN113033814A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 - Machine learning

Abstract

The application discloses a method, an apparatus and a storage medium for training a machine learning model. The method comprises the following steps: receiving a training task to be processed; creating resources for operating the machine learning platform by utilizing a custom resource component in the container cluster management system; starting a training task based on the machine learning platform resources in the container cluster management system; executing the training task to obtain a trained machine learning model; and loading the machine learning model and providing a model service corresponding to the machine learning model. In this way, the machine learning platform can be deployed in the container cluster management system without modifying the container cluster management system code or creating a custom interface service.

Description

Method, apparatus and storage medium for training machine learning model
Technical Field
The present application relates to the field of artificial intelligence and cloud computing, and in particular, to a method, an apparatus, and a storage medium for training a machine learning model.
Background
With the development of artificial intelligence technology, machine learning algorithms, especially deep learning algorithms suited to large-scale data, are receiving more and more attention and application. Referring to fig. 2 and 3, in a chat robot (BOT) system, the prediction of each module in the Natural Language Understanding (NLU) system (such as the intention recognition model, the emotion recognition model, the FAQ model, and the like) requires a prediction (inference) system to be called. When a user changes knowledge in the knowledge base management system (KMS), the knowledge base management system notifies the learning module to carry out model training; after the learning module has trained the model, it notifies the Natural Language Understanding (NLU) system of the training result and updates the model information, and the natural language understanding system instructs the prediction module to load the new model according to the new model information and unload the old model. When a new user question is input, the natural language understanding system calls the prediction module to obtain the prediction result of the new model.
However, as the input training data grows and models go online, the number of users may increase suddenly; machine learning training on a single node is limited by memory and may take weeks or even months. For example, training very large-scale parametric models, such as the pre-trained model BERT used in natural language processing modules or end-to-end speech synthesis models, typically requires several days of computation time.
Distributed machine learning has emerged in response. Most existing distributed machine learning platforms adopt Hadoop Yarn for GPU scheduling, but as business continues to grow, Hadoop Yarn cannot meet the computing requirements of large-scale machine learning.
For the technical problem that existing distributed machine learning platforms mostly adopt Hadoop Yarn for GPU scheduling, which, as business continues to grow, cannot meet the computing requirements of large-scale machine learning, no effective solution has yet been proposed.
Disclosure of Invention
The embodiments of the present disclosure provide a method, an apparatus and a storage medium for training a machine learning model, so as to at least solve the technical problem that existing distributed machine learning platforms mostly adopt Hadoop Yarn for GPU scheduling, which, as business continues to grow, cannot meet the computing requirements of large-scale machine learning.
According to an aspect of an embodiment of the present disclosure, there is provided a method of training a machine learning model, including: receiving a training task to be processed; creating resources for operating the machine learning platform by utilizing a custom resource component in the container cluster management system; starting a training task based on machine learning platform resources in a container cluster management system; processing the training task to obtain a training result corresponding to the training task, wherein the training result comprises a training model; and loading the training model to generate model service.
According to another aspect of the embodiments of the present disclosure, there is also provided a storage medium including a stored program, wherein the method of any one of the above is performed by a processor when the program is executed.
There is also provided, in accordance with another aspect of the disclosed embodiments, an apparatus for training a machine learning model, including: a receiving module, used for receiving a training task to be processed; a creation module, used for creating resources for operating the machine learning platform by utilizing a custom resource component in the container cluster management system; a starting module, used for starting a training task based on machine learning platform resources in the container cluster management system; an execution module, used for executing the training task to obtain a trained machine learning model; and a loading module, used for loading the machine learning model and providing a model service corresponding to the machine learning model.
There is also provided, in accordance with another aspect of the disclosed embodiments, an apparatus for training a machine learning model, including: a processor; and a memory coupled to the processor for providing instructions to the processor for processing the following processing steps: receiving a training task to be processed; creating resources for operating the machine learning platform by utilizing a custom resource component in the container cluster management system; starting a training task based on machine learning platform resources in a container cluster management system; executing a training task to obtain a trained machine learning model; and loading the machine learning model and providing a model service corresponding to the machine learning model.
In an embodiment of the present disclosure, a training task to be processed is received; resources for operating the machine learning platform are created by utilizing a custom resource component in the container cluster management system; a training task based on the machine learning platform resources is started in the container cluster management system; the training task is processed to obtain a trained machine learning model; and the machine learning model is loaded and a model service corresponding to the machine learning model is provided. In this way, the machine learning platform can be deployed in the container cluster management system without modifying the container cluster management system code or creating a custom interface service. In this embodiment, distributed training of the machine learning training task is realized using the distributed capability of the container cluster management system. After the training is finished, the trained machine learning model is obtained, and a model service corresponding to the machine learning model is provided by loading the machine learning model, so that other systems can conveniently call the model service. Therefore, even large-scale machine learning can be computed quickly, which improves computing efficiency, saves time, and brings great convenience to users. This solves the technical problem that existing distributed machine learning platforms mostly adopt Hadoop Yarn for GPU scheduling, which, as business continues to grow, cannot meet the computing requirements of large-scale machine learning.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the disclosure and together with the description serve to explain the disclosure and not to limit the disclosure. In the drawings:
fig. 1 is a block diagram of a hardware structure of a computer terminal for implementing the method according to embodiment 1 of the present disclosure;
FIG. 2 is a schematic diagram illustrating a response flow of a BOT (chat robot) system according to the background art;
FIG. 3 is a schematic diagram of a model training module and a model prediction module according to the background art;
fig. 4 is an architecture diagram of a method of container-based distributed machine learning according to embodiment 1 of the present disclosure;
fig. 5 is a schematic flow chart of a method of container-based distributed machine learning according to a first aspect of embodiment 1 of the present disclosure;
fig. 6 is a schematic diagram of an apparatus for container-based distributed machine learning according to embodiment 2 of the present disclosure; and
fig. 7 is a schematic diagram of an apparatus for container-based distributed machine learning according to embodiment 3 of the present disclosure.
Detailed Description
In order to make those skilled in the art better understand the technical solutions of the present disclosure, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure. It is to be understood that the described embodiments are merely exemplary of some, and not all, of the present disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
According to the present embodiment, there is also provided an embodiment of a method of training a machine learning model, it being noted that the steps illustrated in the flowchart of the drawings may be performed in a computer system such as a set of computer-executable instructions, and that while a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in an order different than here.
The method embodiments provided by the present embodiment may be executed in a mobile terminal, a computer terminal, a server or a similar computing device. FIG. 1 illustrates a block diagram of a hardware architecture of a computing device for implementing training of a machine learning model. As shown in fig. 1, the computing device may include one or more processors (which may include, but are not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA), a memory for storing data, and a transmission device for communication functions. In addition, the computing device may also include: a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the I/O interface), a network interface, a power source, and/or a camera. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration and is not intended to limit the structure of the electronic device. For example, the computing device may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
It should be noted that the one or more processors and/or other data processing circuitry described above may be referred to generally herein as "data processing circuitry". The data processing circuitry may be embodied in whole or in part in software, hardware, firmware, or any combination thereof. Further, the data processing circuitry may be a single, stand-alone processing module, or incorporated in whole or in part into any of the other elements in the computing device. As referred to in the disclosed embodiments, the data processing circuit acts as a processor control (e.g., selection of a variable resistance termination path connected to the interface).
The memory may be used to store software programs and modules of application software, such as program instructions/data storage devices corresponding to the method for training a machine learning model in the embodiments of the present disclosure, and the processor executes various functional applications and data processing by running the software programs and modules stored in the memory, that is, implementing the above-mentioned method for container-based distributed machine learning of application programs. The memory may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory may further include memory located remotely from the processor, which may be connected to the computing device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device is used for receiving or transmitting data via a network. Specific examples of such networks may include wireless networks provided by communication providers of the computing devices. In one example, the transmission device includes a Network adapter (NIC) that can be connected to other Network devices through a base station to communicate with the internet. In one example, the transmission device may be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.
The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the computing device.
It should be noted here that in some alternative embodiments, the computing device shown in fig. 1 described above may include hardware elements (including circuitry), software elements (including computer code stored on a computer-readable medium), or a combination of both hardware and software elements. It should be noted that FIG. 1 is only one specific example and is intended to illustrate the types of components that may be present in the computing device described above.
Fig. 4 is a schematic diagram of an architecture of a system for training a machine learning model according to the present embodiment. Referring to fig. 4, the system includes: a natural language understanding module (NLU), a knowledge base management system (KMS), a prediction module (inference), a learning module (learning), and a distributed machine learning platform. The distributed machine learning platform is a machine learning platform (such as TensorFlow) deployed in a container cluster management system (Kubernetes).
In the above operating environment, according to a first aspect of the present embodiment, a method for training a machine learning model is provided, which is implemented by the distributed machine learning platform shown in fig. 4. Fig. 5 shows a flow diagram of the method, which, with reference to fig. 5, comprises:
s502: receiving a training task to be processed;
s504: creating resources for operating the machine learning platform by utilizing a custom resource component in the container cluster management system;
s506: starting a training task based on machine learning platform resources in a container cluster management system;
s508: executing a training task to obtain a trained machine learning model; and
s510: and loading the machine learning model and providing a model service corresponding to the machine learning model.
As described in the background section above, as the input training data grows and models go online, the number of users may increase suddenly; machine learning training on a single node is limited by memory and may take weeks or even months. For example, training very large-scale parametric models, such as the pre-trained model BERT used in NLP (natural language processing) modules or end-to-end speech synthesis models, usually requires several days of computation time.
Distributed machine learning has emerged in response. Most existing distributed machine learning platforms adopt Hadoop Yarn for GPU scheduling, but as business continues to grow, Hadoop Yarn cannot meet the computing requirements of large-scale machine learning.
In view of the above problems in the background art, the present embodiment provides a method for training a machine learning model. Specifically, referring to fig. 4, after a user initiates model training in the knowledge base management system, the model training module sends a machine learning training task to the distributed machine learning platform for training according to the configuration information of the robot.
Further, the distributed machine learning platform receives the machine learning training task sent by the model training module, and then creates resources for running the machine learning platform (e.g., TensorFlow) in the container cluster management system (e.g., Kubernetes) through a custom resource (e.g., Kubernetes CRD). Of course, it is also possible to create resources for running a deep learning platform (e.g., PyTorch) or another platform that can handle artificial intelligence training tasks through a custom resource (Kubernetes CRD). Through a Kubernetes CRD (custom resource definition), for example, new resource types may be added using the API interface, so that there is no need to modify the Kubernetes code or create a custom API server. Next, a training task based on the machine learning platform resources is started in the container cluster management system.
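As a purely illustrative sketch of the custom resource mechanism described above, a custom resource definition registering a new training-task type with the container cluster management system (Kubernetes) could look roughly as follows; the group name, version, and resource names are assumptions made for this example and are not taken from the original filing.

# Illustrative sketch only: registering a "TensorflowJob" resource type via a
# Kubernetes CustomResourceDefinition, so the cluster accepts training-task
# objects without any change to the Kubernetes code or a custom API server.
# The group name "example.com" and all naming below are assumptions.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: tensorflowjobs.example.com
spec:
  group: example.com
  scope: Namespaced
  names:
    kind: TensorflowJob        # the Kind later checked when a task is submitted
    plural: tensorflowjobs
    singular: tensorflowjob
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          x-kubernetes-preserve-unknown-fields: true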
Further, the distributed machine learning platform executes the training task to obtain the trained machine learning model. Finally, the distributed machine learning platform loads the machine learning model and provides the model service corresponding to the machine learning model.
Therefore, in this way, the present embodiment can deploy the machine learning platform in the container cluster management system without modifying the container cluster management system code or creating a custom interface service. In this embodiment, distributed training of the machine learning training task is realized using the distributed capability of the container cluster management system. After the training is finished, the trained machine learning model is obtained, and a model service corresponding to the machine learning model is generated by loading the machine learning model, so that other systems can conveniently call the model service. Therefore, even large-scale machine learning can be computed quickly, which improves computing efficiency, saves time, and brings great convenience to users. This solves the technical problem that existing distributed machine learning platforms mostly adopt Hadoop Yarn for GPU scheduling, which, as business continues to grow, cannot meet the computing requirements of large-scale machine learning.
Optionally, before the container cluster management system starts the training task based on the machine learning platform resources, the method further includes: verifying the training task by using the classification field of the standard definition file in the custom resource component in the container cluster management system; and starting the basic operation unit of the training task based on the machine learning platform resources if the type of the training task conforms to the defined type.
Specifically, before the container cluster management system starts the resource-based basic operation unit, the distributed machine learning platform verifies the type of the training task by using the classification field in the configuration file of the machine learning platform resource. For example, when a resource is extended through a custom resource in the container cluster management system, it may be determined whether the extended resource satisfies the specification definition in the custom resource at runtime. For example, the specification of a training task of the machine learning platform (e.g., TensorFlow) may be set as follows:
(The specification listing appears as figures in the original filing.)
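Purely as a non-authoritative sketch of what a training-task specification of this kind might contain, a TensorflowJob object could be written roughly as below; the apiVersion, field names and replica layout are assumptions for illustration, not the content of the original figures.

# Illustrative sketch only: a possible training-task specification whose
# classification field Kind (TensorflowJob) is verified before the task is
# started. apiVersion, names and fields are assumptions.
apiVersion: example.com/v1
kind: TensorflowJob
metadata:
  name: intent-model-training          # hypothetical task name
spec:
  replicaSpecs:
    Master:                            # management sub-node
      replicas: 1
    Worker:                            # concrete training subtask
      replicas: 2
    PS:                                # parameter service
      replicas: 1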
Whether the training task satisfies the specification definition is verified through the classification field Kind (e.g., Kind: TensorflowJob) in the specification definition file of the training task. When the Kind field matches the specification definition, the distributed machine learning platform starts the training task based on the machine learning platform (e.g., TensorFlow) resources.
Therefore, to start a training task of the machine learning platform resources on the container cluster management system, the specification file of the training task must conform to the specification of the container cluster management system. This helps the container cluster management system manage the custom resources.
Optionally, the training task includes a plurality of subtasks, where each subtask corresponds to a basic operation unit.
Specifically, a training task based on machine learning platform (e.g., TensorFlow) resources includes a plurality of subtasks, where each subtask corresponds to one basic operation unit (i.e., a pod). For example, the distributed machine learning platform defines, for each training task, a management sub-node (e.g., Master), concrete subtasks (e.g., Workers), and a parameter service (PS service), where the management sub-node (Master) is used to manage the concrete subtasks (Workers). One Master corresponds to one pod, one Worker corresponds to one pod, and one PS corresponds to one pod; the Master, the Worker and the PS service are each started in their own pod, the pods do not interfere with one another, and each pod executes its own task, thereby completing the distributed training. Therefore, each subtask can be executed in its corresponding basic operation unit without interference, which improves the efficiency of distributed computing.
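Purely as an illustration of how each replica type could be realized as its own pod, each entry in a specification like the sketch above might carry a pod template along the following lines; the container image and the GPU resource key are assumptions.

# Illustrative sketch only: a per-replica pod template. Each Master, Worker
# and PS replica becomes one basic operation unit (pod); the image name and
# the GPU resource key (NVIDIA device plugin) are assumptions.
template:
  spec:
    containers:
      - name: tensorflow
        image: registry.example.com/tf-train:latest   # hypothetical image
        resources:
          limits:
            nvidia.com/gpu: 1          # one GPU per pod, assuming a device plugin
    restartPolicy: OnFailure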
Optionally, after the training task based on the machine learning platform resources is started, the method further includes: creating a batch scheduler resource using the custom resource component in the container cluster management system; scheduling the basic operation units of a plurality of training tasks using the batch scheduler; and scheduling the basic operation units of a plurality of subtasks within a training task using the batch scheduler.
Specifically, after the container cluster management system starts the training task based on the machine learning platform resources, a resource for the batch scheduler kube-batch may be created in the container cluster management system (e.g., Kubernetes) through a custom resource (e.g., Kubernetes CRD). If the training task is a distributed training task, coordination among a plurality of tasks is needed, or one training task contains a plurality of subtasks that must run in coordination. For example, a machine learning training task requires a Master, a Worker and a PS service, where one Master corresponds to one pod, one Worker corresponds to one pod, and one PS corresponds to one pod, so the training task needs at least 3 pods to be completed in coordination. At this point, the scheduling component of the container cluster management system (e.g., Kubernetes) itself can no longer meet this scheduling requirement. The distributed machine learning platform PAI therefore needs to schedule these 3 pods simultaneously with a batch scheduler such as kube-batch that is pre-deployed in the container cluster management system (e.g., Kubernetes), where the batch scheduler kube-batch can schedule multiple basic operation units (pods) at the same time.
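For illustration, gang-scheduling the three pods of one training task with kube-batch might be expressed with a PodGroup resource similar to the sketch below; the API group/version, annotation key and schedulerName value reflect common kube-batch deployments and are assumptions rather than details from the original description.

# Illustrative sketch only: a kube-batch PodGroup requesting that the Master,
# Worker and PS pods of one training task be scheduled together.
# API group/version and field names are assumptions.
apiVersion: scheduling.incubator.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: intent-model-training
spec:
  minMember: 3       # do not schedule until all three pods can run together
# Each pod of the task would then reference the batch scheduler and its group,
# e.g. (fragment of a pod spec, also an assumption):
#   metadata:
#     annotations:
#       scheduling.k8s.io/group-name: intent-model-training
#   spec:
#     schedulerName: kube-batch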
Therefore, when one task is a service that needs multiple sub-services to complete in coordination, or one distributed task needs multiple basic operation units (pods) to complete in coordination, there is no need to wait: the batch scheduler kube-batch coordinates multiple basic operation units (pods) at the same time and runs them simultaneously, which is faster and more efficient.
Optionally, after the operations of loading the training model and generating the model service, the method further includes: creating a model service management resource using the custom resource component in the container cluster management system; and monitoring and managing the model service using the model service management resource.
Specifically, after the distributed machine learning platform loads the training model and obtains the model service, a model service management resource (e.g., Seldon) is created in the container cluster management system (e.g., Kubernetes) through a custom resource (e.g., Kubernetes CRD). The PAI distributed machine learning platform loads the training model to generate the model service. Because the number of model services can be very large, a unified tool is needed to monitor and manage them. The model service management resource (e.g., Seldon) is such a tool, which can monitor and manage model services at very large scale. Thus, even if the number of users increases suddenly, the distributed machine learning platform can still monitor and manage the model services through the model service management resource.
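As an illustrative sketch of how a trained model might be exposed and managed through such a model service management resource (e.g., Seldon), a deployment object could look roughly as follows; the apiVersion, prepackaged server type and model location are assumptions.

# Illustrative sketch only: a SeldonDeployment exposing one trained model as a
# monitored, managed model service. apiVersion, implementation and modelUri
# are assumptions, not details from the original description.
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: intent-model
spec:
  predictors:
    - name: default
      replicas: 1
      graph:
        name: intent-model
        implementation: TENSORFLOW_SERVER          # assumed prepackaged server
        modelUri: s3://models/intent-model/v1      # hypothetical model location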
In addition, referring to fig. 4, a user initiates model training (robot learning) in the knowledge base management system, and the learning module (model training module) distributes the training task to the distributed machine learning platform according to the configuration information of the robot. After the training task is started, the distributed machine learning platform pulls the corpora from the API provided by the knowledge base management system, and the model obtained after training is stored in the distributed machine learning platform. When the NLU (natural language understanding) module needs to obtain the model service, the inference module (model prediction module) calls the interface service of the distributed machine learning platform to obtain the model service, and the NLU module then calls the interface of the prediction module (inference) to obtain the model service. The NLU module, the prediction module (inference) and the learning module (learning) receive the message of the model training task and store the result, and the prediction module (inference) is informed to load the model. The prediction module (inference) publishes the model service to the machine learning platform according to the configuration information of the robot.
Further, referring to fig. 1, according to a second aspect of the present embodiment, a storage medium 104 is provided. The storage medium 104 comprises a stored program, wherein the method of any of the above is performed by a processor when the program is run.
Therefore, according to this embodiment, the machine learning platform can be deployed in the container cluster management system in this way without modifying the container cluster management system code or creating a custom interface service. In this embodiment, distributed training of the machine learning training task is realized using the distributed capability of the container cluster management system. After the training is finished, the trained machine learning model is obtained, and a model service corresponding to the machine learning model is provided by loading the machine learning model, so that other systems can conveniently call the model service. Therefore, even large-scale machine learning can be computed quickly, which improves computing efficiency, saves time, and brings great convenience to users. This solves the technical problem that existing distributed machine learning platforms mostly adopt Hadoop Yarn for GPU scheduling, which, as business continues to grow, cannot meet the computing requirements of large-scale machine learning.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
Example 2
Fig. 6 shows an apparatus 600 for container-based distributed machine learning according to the first aspect of the present embodiment, the apparatus 600 corresponding to the method according to the first aspect of embodiment 1. Referring to fig. 6, the apparatus 600 includes: a receiving module 610, configured to receive a training task to be processed; a creating module 620, configured to create a resource for running the machine learning platform by using a custom resource component in the container cluster management system; a starting module 630, configured to start a training task based on machine learning platform resources in the container cluster management system; the execution module 640 is configured to execute a training task to obtain a trained machine learning model; and a loading module 650 for loading the machine learning model and providing a model service corresponding to the machine learning model.
Optionally, the starting module 630, before the container cluster management system starts the training task based on the machine learning platform resource, further includes: verifying the training task by using the classification field of the standard definition file in the custom resource component in the container cluster management system; and starting the training task based on the machine learning platform resource under the condition that the type of the training task conforms to the defined type.
Optionally, the training task includes a plurality of subtasks, where each subtask corresponds to a basic operation unit.
Optionally, after the starting module 630 starts the training task based on the machine learning platform resource, the method further includes: the first creating submodule is used for creating the batch scheduler resource by utilizing the self-defined resource component in the container cluster management system; the first scheduling submodule is used for scheduling the basic operation units of the plurality of training tasks by using the batch processing scheduler; and the second scheduling submodule is used for scheduling the basic operation units of the plurality of subtasks in the training task by using the batch processing scheduler.
Optionally, the loading module 650, after loading the training model and generating the model service, further includes:
the second creating submodule is used for creating model service management resources by utilizing the self-defined resource assembly in the container cluster management system; and the monitoring management submodule is used for managing resource monitoring and model service by utilizing the model service.
Thus, according to this embodiment, by means of the apparatus 600 for container-based distributed machine learning, a machine learning platform can be deployed in a container cluster management system without modifying the container cluster management system code or creating a custom interface service. In this embodiment, distributed training of the machine learning training task is realized using the distributed capability of the container cluster management system. After the training is finished, the trained machine learning model is obtained, and a model service corresponding to the machine learning model is provided by loading the machine learning model, so that other systems can conveniently call the model service. Therefore, even large-scale machine learning can be computed quickly, which improves computing efficiency, saves time, and brings great convenience to users. This solves the technical problem that existing distributed machine learning platforms mostly adopt Hadoop Yarn for GPU scheduling, which, as business continues to grow, cannot meet the computing requirements of large-scale machine learning.
Example 3
Fig. 7 shows an apparatus 700 for container-based distributed machine learning according to the present embodiment, the apparatus 700 corresponding to the method according to the first aspect of embodiment 1. Referring to fig. 7, the apparatus 700 includes: a processor 710; and a memory 720, coupled to the processor 710, for providing instructions to the processor 710 to process the following process steps: receiving a training task to be processed; creating resources for operating the machine learning platform by utilizing a custom resource component in the container cluster management system; starting a training task based on machine learning platform resources in a container cluster management system; executing a training task to obtain a trained machine learning model; and loading the machine learning model and providing a model service corresponding to the machine learning model.
Optionally, before the container cluster management system starts the operation of the training task based on the machine learning platform resource, the method further includes: verifying the training task by using the classification field of the standard definition file in the custom resource component in the container cluster management system; and starting the training task based on the machine learning platform resource under the condition that the type of the training task conforms to the defined type.
Optionally, the training task includes a plurality of subtasks, where each subtask corresponds to a basic operation unit.
Optionally, after the training task based on the machine learning platform resource is started, the method further includes: creating a batch scheduler resource by using a custom resource component in the container cluster management system; scheduling a plurality of basic operation units of training tasks by using a batch scheduler; and scheduling a plurality of basic operation units of subtasks in the training task by using the batch scheduler.
Optionally, after the operations of loading the training model and generating the model service, the method further includes: creating a model service management resource using the custom resource component in the container cluster management system; and monitoring and managing the model service using the model service management resource.
Thus, according to this embodiment, by means of the apparatus 700 for container-based distributed machine learning, a machine learning platform can be deployed in a container cluster management system without modifying the container cluster management system code or creating a custom interface service. In this embodiment, distributed training of the machine learning training task is realized using the distributed capability of the container cluster management system. After the training is finished, the model in the training result is obtained, and the model service is generated by loading the model, so that other systems can conveniently call the model service. Therefore, even large-scale machine learning can be computed quickly, which improves computing efficiency, saves time, and brings great convenience to users. This solves the technical problem that existing distributed machine learning platforms mostly adopt Hadoop Yarn for GPU scheduling, which, as business continues to grow, cannot meet the computing requirements of large-scale machine learning.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and decorations can be made without departing from the principle of the present invention, and these modifications and decorations should also be regarded as the protection scope of the present invention.

Claims (10)

1. A method of training a machine learning model, comprising:
receiving a training task to be processed;
creating resources for operating the machine learning platform by utilizing a custom resource component in the container cluster management system;
starting a training task based on the machine learning platform resources at the container cluster management system;
executing the training task to obtain a trained machine learning model; and
and loading the machine learning model and providing a model service corresponding to the machine learning model.
2. The method of claim 1, further comprising, prior to the container cluster management system initiating operation of the training task based on the machine learning platform resource:
verifying the training task by using the classification field of the standard definition file in the custom resource component in the container cluster management system; and
initiating the training task based on the machine learning platform resource if the type of the training task conforms to the defined type.
3. The method of claim 2, wherein the training task comprises a plurality of subtasks, wherein each subtask corresponds to a basic unit of operation.
4. The method of claim 1, wherein after initiating the training task based on the machine learning platform resource, further comprising:
creating a batch scheduler resource by using a custom resource component in the container cluster management system;
scheduling a plurality of basic operation units of the training tasks by using the batch scheduler; and
and scheduling basic operation units of a plurality of subtasks in the training task by using the batch scheduler.
5. The method of claim 1, further comprising, after the operations of loading the training model, generating a model service:
creating model service management resources by utilizing a custom resource component in the container cluster management system; and
and monitoring and managing the model service by using the model service management resource.
6. A storage medium comprising a stored program, wherein the method of any one of claims 1 to 5 is performed by a processor when the program is run.
7. An apparatus for training a machine learning model, comprising:
the receiving module is used for receiving a training task to be processed;
the creation module is used for creating resources for operating the machine learning platform by utilizing a custom resource component in the container cluster management system;
the starting module is used for starting a training task based on the machine learning platform resources in the container cluster management system;
the execution module is used for executing the training task to obtain a trained machine learning model; and
and the loading module is used for loading the machine learning model and providing model service corresponding to the machine learning model.
8. The apparatus of claim 7, wherein the initiation module, prior to the container cluster management system initiating the training task based on the machine learning platform resource, further comprises:
verifying the training task by using the classification field of the standard definition file in the custom resource component in the container cluster management system; and
initiating the training task based on the machine learning platform resource if the type of the training task conforms to the defined type.
9. The apparatus of claim 8, wherein the training task comprises a plurality of subtasks, and wherein each subtask corresponds to a basic unit of operation.
10. An apparatus for training a machine learning model, comprising:
a processor; and
a memory coupled to the processor for providing instructions to the processor for processing the following processing steps:
receiving a training task to be processed;
creating resources for operating the machine learning platform by utilizing a custom resource component in the container cluster management system;
starting a training task based on the machine learning platform resources at the container cluster management system;
executing the training task to obtain a trained machine learning model; and
and loading the machine learning model and providing a model service corresponding to the machine learning model.
CN201911253316.0A 2019-12-09 2019-12-09 Method, apparatus and storage medium for training machine learning model Pending CN113033814A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911253316.0A CN113033814A (en) 2019-12-09 2019-12-09 Method, apparatus and storage medium for training machine learning model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911253316.0A CN113033814A (en) 2019-12-09 2019-12-09 Method, apparatus and storage medium for training machine learning model

Publications (1)

Publication Number Publication Date
CN113033814A true CN113033814A (en) 2021-06-25

Family

ID=76451127

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911253316.0A Pending CN113033814A (en) 2019-12-09 2019-12-09 Method, apparatus and storage medium for training machine learning model

Country Status (1)

Country Link
CN (1) CN113033814A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113742065A (en) * 2021-08-07 2021-12-03 中国航空工业集团公司沈阳飞机设计研究所 Distributed reinforcement learning method and device based on kubernets container cluster
CN113886055A (en) * 2021-12-07 2022-01-04 中国电子科技集团公司第二十八研究所 Intelligent model training resource scheduling method based on container cloud technology
CN114091029A (en) * 2022-01-24 2022-02-25 深信服科技股份有限公司 Training system, method, device, medium and platform for malicious file detection model
CN114791856A (en) * 2022-06-27 2022-07-26 北京瑞莱智慧科技有限公司 K8 s-based distributed training task processing method, related equipment and medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108427992A (en) * 2018-03-16 2018-08-21 济南飞象信息科技有限公司 A kind of machine learning training system and method based on edge cloud computing
CN109508238A (en) * 2019-01-05 2019-03-22 咪付(广西)网络技术有限公司 A kind of resource management system and method for deep learning
CN109885389A (en) * 2019-02-19 2019-06-14 山东浪潮云信息技术有限公司 A kind of parallel deep learning scheduling training method and system based on container
CN109961151A (en) * 2017-12-21 2019-07-02 同方威视科技江苏有限公司 For the system for calculating service of machine learning and for the method for machine learning
CN110413391A (en) * 2019-07-24 2019-11-05 上海交通大学 Deep learning task service method for ensuring quality and system based on container cluster

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109961151A (en) * 2017-12-21 2019-07-02 同方威视科技江苏有限公司 For the system for calculating service of machine learning and for the method for machine learning
CN108427992A (en) * 2018-03-16 2018-08-21 济南飞象信息科技有限公司 A kind of machine learning training system and method based on edge cloud computing
CN109508238A (en) * 2019-01-05 2019-03-22 咪付(广西)网络技术有限公司 A kind of resource management system and method for deep learning
CN109885389A (en) * 2019-02-19 2019-06-14 山东浪潮云信息技术有限公司 A kind of parallel deep learning scheduling training method and system based on container
CN110413391A (en) * 2019-07-24 2019-11-05 上海交通大学 Deep learning task service method for ensuring quality and system based on container cluster

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113742065A (en) * 2021-08-07 2021-12-03 中国航空工业集团公司沈阳飞机设计研究所 Distributed reinforcement learning method and device based on kubernets container cluster
CN113886055A (en) * 2021-12-07 2022-01-04 中国电子科技集团公司第二十八研究所 Intelligent model training resource scheduling method based on container cloud technology
CN113886055B (en) * 2021-12-07 2022-04-15 中国电子科技集团公司第二十八研究所 Intelligent model training resource scheduling method based on container cloud technology
CN114091029A (en) * 2022-01-24 2022-02-25 深信服科技股份有限公司 Training system, method, device, medium and platform for malicious file detection model
CN114091029B (en) * 2022-01-24 2022-06-21 深信服科技股份有限公司 Training system, method, device, medium and platform for malicious file detection model
CN114791856A (en) * 2022-06-27 2022-07-26 北京瑞莱智慧科技有限公司 K8 s-based distributed training task processing method, related equipment and medium

Similar Documents

Publication Publication Date Title
CN113033814A (en) Method, apparatus and storage medium for training machine learning model
CN106548262B (en) Scheduling method, device and system for resources for processing tasks
US10462018B2 (en) Managing a number of secondary clouds by a master cloud service manager
CN105975351A (en) User behavior message reporting method and apparatus
CN112380020A (en) Computing power resource allocation method, device, equipment and storage medium
CN111274033B (en) Resource deployment method, device, server and storage medium
CN110333939B (en) Task mixed scheduling method and device, scheduling server and resource server
CN112463535A (en) Multi-cluster exception handling method and device
CN111031133B (en) Operation method and device of business engine, storage medium and electronic device
CN114819084B (en) Model reasoning method, device, equipment and storage medium
CN114924851A (en) Training task scheduling method and device, electronic equipment and storage medium
CN113849302A (en) Task execution method and device, storage medium and electronic device
CN116991560B (en) Parallel scheduling method, device, equipment and storage medium for language model
CN113946389A (en) Federal learning process execution optimization method, device, storage medium, and program product
CN113849300A (en) Edge configuration system and method
CN107025126B (en) Resource scheduling method, NFVO and system
Ahmed-Nacer et al. Simulation of configurable resource allocation for cloud-based business processes
CN114327846A (en) Cluster capacity expansion method and device, electronic equipment and computer readable storage medium
CN111190731A (en) Cluster task scheduling system based on weight
CN114564249B (en) Recommendation scheduling engine, recommendation scheduling method and computer readable storage medium
CN110427260B (en) Host job scheduling method, device and system
CN112799797B (en) Task management method and device
CN113760403A (en) State machine linkage method and device
CN115361382B (en) Data processing method, device, equipment and storage medium based on data group
CN112394944A (en) Distributed development method, device, storage medium and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination