CN114265690A - Method and device for realizing remote training - Google Patents

Method and device for realizing remote training

Info

Publication number
CN114265690A
Authority
CN
China
Prior art keywords: training, cluster, task, target, remote
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111545018.6A
Other languages
Chinese (zh)
Inventor
于子淇
林立翔
游亮
龙欣
张尉东
卓钧亮
戚余航
刘思超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd
Priority to CN202111545018.6A
Publication of CN114265690A
Legal status: Pending (current)

Abstract

The embodiments of this specification provide a method for realizing remote training, which includes the following steps: a client runs target training code for executing a model training task; in response to a call to a server generated by running the target training code, the client generates a remote training request and sends it to the server, the remote training request including cluster parameters input for the model training task; the server applies for and configures cluster resources in a resource pool according to the cluster parameters to obtain a target cluster; and the target cluster, in response to a remote method invocation (RMI) generated by the client running the target training code, executes a computation task in the model training task.

Description

Method and device for realizing remote training
Technical Field
One or more embodiments of the present disclosure relate to the field of computer technologies, and in particular, to a method and an apparatus for implementing remote training.
Background
In the field of artificial intelligence, deep learning development often raises a model's prediction accuracy by enlarging the data set and increasing the number of model parameters, which in turn increases the amount of computation and the time consumed by the training process. To shorten model training time and improve model iteration efficiency, distributed training is gradually replacing single-card or single-machine training.
As the scale of distributed training grows, the requirements on the cluster become higher and higher. If an individual or a company maintains its own offline cluster, it has to provide larger instance resources and bear increased operation, maintenance and depreciation costs. Large-scale distributed training on the cloud has therefore become the mainstream model training mode, since the cloud has sufficient capacity to maintain a unified resource pool and execute concurrent training tasks.
However, current schemes that use cloud resources for model training have difficulty meeting higher practical application requirements. An improved scheme is therefore urgently needed that can meet such requirements, including allowing a user to view the execution state information of model training in real time.
Disclosure of Invention
One or more embodiments of this specification describe a method and an apparatus for implementing remote training, in which training code is run at the client and local data is transmitted to a remote cluster for training through remote method invocation (RMI), so that the state information of the cluster executing the training can be viewed in real time.
According to a first aspect, a method of enabling remote training is provided, comprising: a client runs target training code for executing a model training task; in response to a call to a server generated by running the target training code, the client generates a remote training request and sends it to the server, the remote training request including cluster parameters input for the model training task; the server applies for and configures cluster resources in a resource pool according to the cluster parameters to obtain a target cluster; and the target cluster, in response to a remote method invocation (RMI) generated by the client running the target training code, executes a computation task in the model training task.
In one embodiment, the target training code is derived from an adaptation modification of original training code, and the call to the server and the RMI are implemented based on the adaptation modification.
Further, in a specific embodiment, the original training code includes a specific compute unified device architecture (CUDA) operation; the adaptation modification comprises: automatically identifying the CUDA operation, intercepting the call path of the application programming interface (API) corresponding to the CUDA operation, and setting the call path to be executed in the server.
In one embodiment, generating the remote training request includes: generating the remote training request based on the cluster parameters and the identity information of the user bound to the client; and applying for and configuring cluster resources in the resource pool includes: applying for and configuring the cluster resources if the identity information passes verification.
In one embodiment, the remote training request includes a syntax tree constructed based on the target training code; before the computation task in the model training task is executed, the method further comprises: the server parses and renders the syntax tree to generate code executable by the cluster; and executing the computation task in the model training task comprises: executing the computation task by executing the executable code.
In one embodiment, the cluster parameters include a cluster demand parameter and a cluster configuration parameter; the server applying for and configuring cluster resources in the resource pool according to the cluster parameters comprises: the server applies for cluster resources in the resource pool that match the cluster demand parameter; and the server configures the cluster resources according to the cluster configuration parameter to obtain the target cluster.
Further, in a specific embodiment, the cluster demand parameter includes one or more of the following: the number of occupied machines, the number of graphics processing unit (GPU) cards, and the number of central processing unit (CPU) cards.
In another specific embodiment, the cluster configuration parameter indicates the training environment of the cluster; the server configuring the cluster resources according to the cluster configuration parameter comprises: the server provides the cluster resources with an installation file corresponding to the training environment; and the cluster resources create the training environment by running the installation file.
Further, in one example, the installation file comprises a container docker image file indicating that a single training card is to be split into a specified number of virtual cards; the cluster resources creating the training environment by running the installation file comprises: the cluster resources split a training card among them into the specified number of virtual cards by running the container docker image file.
In one embodiment, the generation of the RMI call includes: by running the target training code, the client encapsulates task data involved in the model training task into an object of an RMI class, the RMI class being pre-registered with an RMI registry by the server; and the client calls a target method of a remote object in the target cluster based on the object; executing the computation task in the model training task then comprises: running the target method, based on the task data, to execute the computation task.
Further, in a specific embodiment, the target training code is derived from an adaptation modification of the original training code, the adaptation modification including adding a modifier to the original training code; encapsulating the task data involved in the model training task into an object of the RMI class comprises: the modifier creates the RMI class; and the modifier encapsulates the task data as objects of the RMI class.
In another specific embodiment, the task data includes model parameters and/or training sample data.
In a further specific embodiment, the cluster parameter indicates that CPUs in the resource pool are not to be used; encapsulating the task data involved in the model training task into an object of the RMI class comprises: performing predetermined processing on the training sample data using the local CPU of the terminal where the client is located; and encapsulating the training sample data after the predetermined processing into an object of the RMI class.
In one embodiment, the method further comprises: the target cluster provides execution state information of the computation task to the client as the return result of the RMI call, the execution state information including one or more of the following: the current training round, the training loss, the number of batch samples corresponding to the training round, and the evaluation index value of the training effect.
According to a second aspect, a method of enabling remote training is provided, comprising: a client runs target training code for executing a model training task; in response to a call to a server generated by running the target training code, the client generates a remote training request and sends it to the server, the remote training request including cluster parameters input for the model training task; the server applies for and configures cluster resources in a resource pool according to the cluster parameters to obtain a target cluster; and the target cluster, in response to a remote method invocation (RMI) generated by the client running the target training code, executes a computation task in the model training task.
According to a third aspect, a system for enabling remote training is provided, comprising: a client configured to run target training code for executing a model training task, and further configured to, in response to a call to a server generated by running the target training code, generate a remote training request and send it to the server, the remote training request including cluster parameters input for the model training task; the server, configured to apply for and configure cluster resources in a resource pool according to the cluster parameters to obtain a target cluster; and the target cluster, configured to execute a computation task in the model training task in response to a remote method invocation (RMI) generated by the client running the target training code.
According to a fourth aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of the first aspect.
According to a fifth aspect, there is provided a computing device comprising a memory having stored therein executable code and a processor which, when executing the executable code, implements the method of the first aspect.
With the method and apparatus provided in the embodiments of this specification, genuine remote training can be realized: the client can view the training progress in real time while the remote end performs the actual training computation, processes such as purchasing and deploying instances and relocating data are eliminated, and the resource pool provides cluster resources on demand, which effectively reduces customer cost and improves the utilization of clusters on the cloud.
Drawings
To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those skilled in the art can obtain other drawings based on them without creative effort.
FIG. 1 illustrates a diagram of method steps for implementing remote training as disclosed in embodiments of the present specification;
FIG. 2 illustrates a remote training call flow diagram according to one embodiment;
FIG. 3 shows a schematic structural diagram of a system for implementing remote training disclosed in an embodiment of the present specification.
Detailed Description
The scheme provided by the specification is described below with reference to the accompanying drawings.
As mentioned earlier, the cloud has sufficient capacity to maintain a unified resource pool and execute concurrent training tasks. However, on the one hand, current cloud vendors fall short in the way resources are maintained and supplied: for example, machine models are usually matched in advance for candidate model training tasks and then offered for users to choose from, so customized design according to user requirements is impossible and the efficiency of the resource pool cannot be maximized; a user may, for instance, want to use local client resources in part and cloud resources in part, while existing schemes cannot decouple the resources at the two ends. On the other hand, when cloud resources are currently used for model training, the entire training code is usually run in the cloud, for example, the client's training code is copied to the cloud for execution and the training result is sent back to the client after training is completed, so the user cannot follow the training process in real time, for example, cannot know which round the current training iteration has reached or what the loss of the last round was.
Based on the above, the inventors propose a scheme for realizing remote training, which can meet practical application requirements such as letting the user view the state information of remote training in real time and dynamically applying for cluster resources from the resource pool according to the user's desired configuration, thereby substantially improving the user experience, bringing the resource pool to its full efficiency, and significantly improving the utilization of cluster resources.
Fig. 1 shows a diagram of the method steps for implementing remote training disclosed in the embodiments of this specification. It should first be noted that there is no absolute order between the steps shown in the figure, as long as the method can be implemented logically. As shown in Fig. 1, the method comprises the following steps:
In step S110, the client runs target training code for executing a model training task.
For ease of understanding, the generation of the target training code is described below. In one embodiment, the code originally written at the client is not target training code capable of realizing remote training, but original training code that can only realize local cluster training; the original training code therefore involves no calls to the server and no interaction with the remote cluster or other parties.
Based on this, the original training code needs to be modified into the target training code. In one embodiment, the service party providing the remote training service may give the user a modification guide written against a code library, pre-established by the service party, for implementing remote training; for clarity, this code library is referred to below as the Remote Deep Learning (RDL) library. The user can then manually modify the original training code according to the modification guide to obtain the target training code.
In another embodiment, the user may be provided with an RDL adaptation package developed based on the RDL library, for example in the form of a Python whl package. By importing the RDL adaptation package into the client and installing the files it contains, part of the original training code can be modified automatically, for example by adding a definition of the server interface, adding code related to remote method invocation (RMI), or adding modifiers (decorators) that reduce the complexity of the code modification. Further, the original training code may include a specific compute unified device architecture (CUDA) operation, such as a torch.cuda.set_device operation, and the adaptation modification may further include automatically recognizing the CUDA operation, intercepting the call path of the application programming interface (API) corresponding to the CUDA operation, and setting that call path to be executed in the server. In this way, the automatically modified code is adapted to the remote training task.
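By way of a non-limiting illustration only, such interception could be sketched in Python roughly as follows; the queue name and overall structure are assumptions for illustration and are not the actual RDL adaptation package, and torch is assumed to be installed at the client.

# Minimal sketch (not the actual RDL package): intercept a CUDA call path by
# monkey-patching torch.cuda.set_device so that the device placement is
# recorded for execution on the server instead of being applied locally.
import functools
import torch

_remote_device_requests = []  # hypothetical queue consumed by the RDL client side

def _intercept_set_device(original_fn):
    @functools.wraps(original_fn)
    def wrapper(device):
        # Instead of touching a local GPU, remember the request so the
        # server side can replay it on the remote cluster.
        _remote_device_requests.append(("cuda.set_device", device))
        return None  # skip the local call entirely
    return wrapper

# Applied once when the adaptation package is imported at the client.
torch.cuda.set_device = _intercept_set_device(torch.cuda.set_device)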
On the other hand, in an embodiment, training acceleration adaptation may first be performed on the original training code, that is, code adaptation for implementing training acceleration, and the RDL adaptation is then performed on the resulting code to obtain the target training code. Training acceleration adaptation can be implemented by importing a related adaptation code package or by manually modifying the code; its principle is mainly to use communication-based performance optimization to improve the efficiency of data exchange between machines and between GPU cards during distributed training, thereby effectively increasing the training speed. Illustratively, the adaptation code package may include communication interface classes and basic component classes abstracted uniformly over mainstream artificial intelligence (AI) computing frameworks (such as the TensorFlow, PyTorch and MXNet frameworks), together with a unified basic communication class and a gradient entry layer, so as to implement unified distributed performance optimization.
In another embodiment, the service party may open the RDL library to the user, so that the user can directly write the target training code according to their own training requirements.
After obtaining the target training code, the client can run it locally. In one embodiment, the client runs the target training code in response to the user submitting the model training task. Further, in a specific embodiment, the client first receives data information input by the user for the model training task, such as cluster parameters, a training data set and model parameters, and then receives a submission operation triggered by the user based on that data information. In another specific embodiment, the training task is submitted through the communication-framework launch script in the adaptation code package used for the training acceleration adaptation.
The client can thus start running the target training code, so that the subsequent calls to the server and the remote cluster are made by running the target training code locally, thereby realizing remote training.
In step S120, the client generates a remote training request in response to a call to the server generated by running the target training code.
Specifically, the client generates the remote training request based at least on the cluster parameters input for the model training task. In one embodiment, the cluster parameters include cluster demand parameters, such as the number of occupied machines (e.g., 2 servers), the number of graphics processing unit (GPU) cards, the number of central processing unit (CPU) cards, the cluster size (e.g., 20 compute nodes), and so on. In another embodiment, the cluster parameters include cluster configuration parameters, such as a parameter indicating the training environment or a container docker image.
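Purely as a hedged illustration, the cluster parameters carried in a remote training request might be organized as follows; the field names and values below are assumptions made for this sketch, not the actual request schema of this specification.

# Illustrative only: one possible shape for the cluster parameters carried in
# a remote training request.
cluster_params = {
    # cluster demand parameters
    "num_machines": 2,        # number of occupied machines
    "num_gpu_cards": 8,       # graphics processing unit cards
    "num_cpu_cards": 0,       # 0 here could mean: do not use CPUs in the resource pool
    # cluster configuration parameters
    "training_env": {
        "docker_image": "registry.example.com/rdl/train:py38",  # hypothetical image name
        "virtual_cards_per_gpu": 2,  # split each physical card into 2 virtual cards
    },
}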
In one embodiment, the client generates the remote training request based on the cluster parameters and the identity information of the user bound to the client. Illustratively, the identity information may include a registered account number and a password of the user in the cloud computing service.
In another embodiment, the client generates the remote training request based on the cluster parameters and the target training code. In a specific embodiment, the client first constructs an abstract syntax tree (AST) based on the target training code and then generates the remote training request based on the AST and the cluster parameters. Further, in one example, the client recursively traverses the result of running the target training code, constructs the AST through the call paths, then encrypts the constructed AST and includes it in the remote training request. In another specific embodiment, the client includes the encrypted target training code in the remote training request.
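As a minimal sketch of this idea, assuming the target training code is available as Python source text, the standard ast module could be used to construct and serialize such a syntax tree; the encryption step is only indicated, not implemented.

# Minimal sketch: parse the target training code into an AST and serialize it
# so it can be encrypted and embedded in the remote training request.
import ast

def build_syntax_tree(training_source: str) -> str:
    tree = ast.parse(training_source)
    # ast.dump gives a textual form of the tree suitable for transmission.
    return ast.dump(tree)

example_source = "def train_step(x):\n    return x * 2\n"
serialized_tree = build_syntax_tree(example_source)
# encrypted = encrypt(serialized_tree)  # encryption scheme not specified in the text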
In yet another embodiment, the client generates a remote training request based on the cluster parameters, the identity information, and the target training code.
Having generated the remote training request, the client sends it to the server in step S130. Then, in step S140, the server applies for and configures cluster resources in the resource pool according to the remote training request to obtain the target cluster.
In one embodiment, the server applies for cluster resources in the resource pool that match the cluster demand parameters and then configures the applied-for cluster resources according to the cluster configuration parameters to obtain the target cluster. It should be understood that the resource pool, also called a computing resource pool or cloud resource pool, is usually obtained by a cloud service provider integrating the idle resources of individual regions, individuals or companies, and the cloud service provider can plan and schedule the resource pool in a unified manner; the resource pool may include computing resources (e.g., CPU cards and GPU cards), storage resources and network resources.
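An illustrative sketch of how matching cluster demand parameters against a resource pool might look is given below; the pool representation and field names are assumptions made for illustration only.

# Illustrative sketch: select machines from a resource pool that satisfy the
# cluster demand parameters (enough idle machines with enough free GPU cards).
from typing import Dict, List

def apply_for_cluster(pool: List[Dict], num_machines: int, gpus_per_machine: int) -> List[Dict]:
    candidates = [m for m in pool if m["idle"] and m["free_gpus"] >= gpus_per_machine]
    if len(candidates) < num_machines:
        raise RuntimeError("resource pool cannot satisfy the cluster demand parameters")
    return candidates[:num_machines]

pool = [
    {"host": "10.0.0.1", "idle": True, "free_gpus": 8},
    {"host": "10.0.0.2", "idle": True, "free_gpus": 4},
    {"host": "10.0.0.3", "idle": False, "free_gpus": 8},
]
target_cluster = apply_for_cluster(pool, num_machines=2, gpus_per_machine=4)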
Further, in a specific embodiment, the cluster configuration parameters include a training environment parameter indicating the running environment of the code in the cluster; it should be understood that the running environment of the code in the cluster generally needs to be consistent with the running environment of the code at the client. Correspondingly, configuring the cluster resources may include: the server provides the cluster resources with installation files corresponding to the training environment parameter, so that the cluster resources create the corresponding running environment, or training environment, by running the installation files. In a more specific embodiment, the training environment is a Python environment, and accordingly the installation file provided by the server may be, for example, a conda yaml file.
In another more specific embodiment, the installation file may be a container docker image file provided by the user. Further, the docker image file may indicate that a single training card is to be split into a specified number of virtual cards; accordingly, the cluster resources creating the training environment by running the installation file may include: the cluster resources split one or more of their training cards into the specified number of virtual cards by running the docker image file. For example, a single GPU card is split into 2 virtual GPU pods to simulate 2 training cards. Finer-grained splitting can thus be achieved according to the user configuration, improving concurrency and making the granularity controllable.
As described above, the target cluster is obtained by configuring the applied-for cluster resources.
On the other hand, in an embodiment, the remote training request further includes the identity information of the user bound to the client. Accordingly, this step may include: the server verifies the identity information and applies for and configures the cluster resources only if the verification passes; otherwise the current flow is terminated.
In one embodiment, the remote training request further includes the AST; note that if the AST was encrypted, a decryption operation is needed to obtain the original AST. The server may then start a codeServing service and parse and render the AST to obtain code that can be executed by the remote cluster, which may also be called the cloud cluster or a cluster in the resource pool. Further, the server may provide the resulting executable code to the target cluster. In another embodiment, the remote training request includes the encrypted target training code, in which case the server decrypts it to obtain the target training code, processes it into code executable by the remote cluster, and provides that code to the target cluster.
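A minimal sketch of the codeServing idea, assuming the syntax tree arrives as decrypted Python source (or a tree object that can be compiled directly), could look like the following; error handling and the actual rendering details are omitted.

# Minimal sketch: parse the received code and compile it into an object that
# the target cluster could exec() inside its training environment.
import ast

def render_executable(decrypted_source: str):
    tree = ast.parse(decrypted_source)
    ast.fix_missing_locations(tree)
    return compile(tree, filename="<remote-training>", mode="exec")

code_obj = render_executable("result = sum(range(10))")
namespace = {}
exec(code_obj, namespace)   # the target cluster would run this remotely
assert namespace["result"] == 45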
As described above, a configured target cluster is obtained. In step S150, the client generates an RMI call request while running the target training code, and in step S160 the target cluster executes a computation task in the model training task in response to the RMI call request.
For step S150, in an embodiment, by running the target training code the client encapsulates the task data involved in the model training task into an object of an RMI class (RMI-Class) and, based on that object, calls a target method of a remote object in the target cluster. Note that the RMI class is pre-registered with the RMI registry by the server.
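The text describes the RMI flow generically: the server registers a class, and the client calls a target method of a remote object. As a hedged analogue only, and not the mechanism of this specification, the flow can be illustrated with the Python standard library's xmlrpc.

# Hedged analogue: a server registers a service instance (the "RMI class") and
# a client calls a target method of the remote object with its task data.
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy
import threading

class TrainingService:
    def run_step(self, task_data):
        # stand-in for the compute task executed on the target cluster
        return {"epoch": 1, "loss": sum(task_data) / len(task_data)}

def start_registry():
    server = SimpleXMLRPCServer(("127.0.0.1", 8000), allow_none=True, logRequests=False)
    server.register_instance(TrainingService())   # "registering" the class
    threading.Thread(target=server.serve_forever, daemon=True).start()

start_registry()
proxy = ServerProxy("http://127.0.0.1:8000", allow_none=True)
print(proxy.run_step([0.3, 0.5, 0.7]))   # client-side call of the target method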
Further, in a specific embodiment, the task data includes model parameters, which may accordingly be instantiated as objects of the RMI class. In another specific embodiment, the task data includes training sample data, which may be encapsulated as an object of the RMI class; alternatively, the feature data and the label data in the training samples may each be encapsulated as objects of the RMI class.
In one example, the cluster parameters input by the user indicate that CPUs in the resource pool are not to be used; correspondingly, encapsulating the task data involved in the model training task as an object of the RMI class may include: the client performs predetermined processing on the training sample data using its local CPU and encapsulates the processed training sample data into an object of the RMI class. The predetermined processing may include data preprocessing (e.g., sample feature alignment), sparse computation (e.g., one-hot coding) and the like. In another example, the cluster parameters indicate that CPUs in the resource pool are to be used; correspondingly, the target cluster may include a remote CPU, applied for from the resource pool by the server, that performs the predetermined processing on the training sample data, and the processed data may likewise be encapsulated as an object of the RMI class for the GPU side of the target cluster to access. CPU/GPU decoupling in remote training can thus be achieved, and separating CPU and GPU instances gives fine-grained control.
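A simple sketch of the predetermined processing mentioned above, here one-hot coding performed with numpy on the client's local CPU, is shown below; the wrapping call at the end is a hypothetical RDL helper, not a real API.

# Sketch of local-CPU preprocessing before the result is handed to the remote side.
import numpy as np

def one_hot(labels: np.ndarray, num_classes: int) -> np.ndarray:
    encoded = np.zeros((labels.shape[0], num_classes), dtype=np.float32)
    encoded[np.arange(labels.shape[0]), labels] = 1.0
    return encoded

labels = np.array([0, 2, 1, 2])
features = one_hot(labels, num_classes=3)
# remote_obj = wrap_as_rmi_object(features)  # hypothetical RDL helper, for illustration only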
On the other hand, in a specific embodiment, the target training code is derived from an adaptation modification of the original training code, the adaptation modification including adding a modifier to the original training code; correspondingly, encapsulating the task data involved in the model training task into an object of the RMI class includes: the modifier dynamically creates the RMI class and encapsulates the task data as objects of the RMI class. It should be understood that the created RMI class is essentially an RMI client of the RMI server on the server side, and communication encryption can be embedded in it to guarantee data privacy. In one example, a model wrapper encapsulates the model parameters as objects of the RMI class; in another example, a tensor wrapper encapsulates the tensor data corresponding to the training sample data as objects of the RMI class. By adding such modifiers in the RDL adaptation code package, the complexity of modifying the original training code can be greatly reduced.
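Reading the modifier as a Python decorator-style wrapper, a hedged sketch of a tensor wrapper that transparently replaces local data with a remote reference might look as follows; RemoteRef and the surrounding names are hypothetical stand-ins, not the RDL library's real API.

# Sketch under the assumption that the "modifier" is a decorator-style wrapper.
import functools

class RemoteRef:
    """Hypothetical handle to an object registered on the remote side."""
    def __init__(self, payload):
        self.payload = payload  # in a real system this would live in the cluster

def tensor_wrapper(fn):
    """Decorator: whatever tensor data fn returns is pushed to the remote side."""
    @functools.wraps(fn)
    def wrapped(*args, **kwargs):
        local_tensor = fn(*args, **kwargs)
        return RemoteRef(local_tensor)   # encapsulate as an RMI-style object
    return wrapped

@tensor_wrapper
def load_batch():
    return [[0.1, 0.2], [0.3, 0.4]]   # stand-in for real training sample data

batch_ref = load_batch()
print(type(batch_ref).__name__)   # RemoteRef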
Accordingly, the client generates an RMI request based on the objects of the RMI class, and in step S160 the target cluster runs the invoked target method, using the task data carried by the RMI request, to execute the computation task in the model training. In one embodiment, the target cluster executes the computation task by running the executable code provided by the server as described above. This computation task typically involves intensive computation, such as matrix multiplication and activation function processing. In addition, in one embodiment, the target cluster may fetch the task data by prefetching, thereby ensuring that IO and the network are not bottlenecks.
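A minimal sketch of the prefetching idea, a background thread keeping a bounded queue of batches filled so the consumer never stalls on IO or the network, is given below; the batch source is a stand-in.

# Minimal prefetch sketch: a producer thread fills a bounded queue while the
# consumer (the GPU-side training loop in the text) pulls batches from it.
import queue
import threading

def batch_source(num_batches):
    for i in range(num_batches):
        yield [i] * 4   # stand-in for a fetched tensor batch

def prefetch(generator, depth=2):
    q = queue.Queue(maxsize=depth)
    sentinel = object()

    def producer():
        for item in generator:
            q.put(item)
        q.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is sentinel:
            break
        yield item

for batch in prefetch(batch_source(3)):
    pass  # the real loop would run the compute task on this batch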
In one embodiment, step S160 may further include: the target cluster provides the execution state information of the computation task to the client as the return result of the remote call. Illustratively, the execution state information may include the current training round, the training loss, the number of batch samples corresponding to the training round, an evaluation index value of the training effect (such as accuracy), and the like. The client can therefore view the process and state information of the remote training in real time and obtain the final training result through the RMI call.
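Purely for illustration, the return payload of such an RMI call might look like the following; the keys mirror the state items listed above but the exact schema is an assumption.

# Illustrative payload only; not the actual return structure of this specification.
execution_state = {
    "epoch": 7,          # current training round
    "loss": 0.1384,      # training loss of this round
    "batch_size": 256,   # number of batch samples corresponding to the round
    "accuracy": 0.912,   # evaluation index value of the training effect
}
# The client receives this as the return result of its RMI call and can display it in real time.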
On the other hand, as to the execution state information and the training result of the computation task, in one embodiment the user may choose to store them in the cloud, or designate part of the information and results to be stored in the cloud, specifically in the user's Object Storage Service (OSS), thereby realizing result backup; in another embodiment the user may fetch them locally via RMI for custom processing and saving.
In another aspect, after execution of the model training task ends, if the user chooses debugging, the local debugging method can be executed directly on the client, and the remote end feeds the debugging information back to the local side for display.
Further, when the user no longer uses the target cluster, the server revokes the user's access rights to the target cluster and clears the resource environment, so that the corresponding cluster resources can be applied for and used again.
Therefore, with the method for realizing remote training disclosed in the embodiments of this specification, genuine remote training can be realized: the client can view the training progress in real time while the actual training computation is performed at the remote end, processes such as purchasing and deploying instances and relocating data are eliminated, and cluster resources are provided on demand by the resource pool, which effectively reduces customer cost and improves the utilization of clusters on the cloud.
For ease of understanding, an exemplary implementation flow of the above remote training method is described below with reference to FIG. 2, which illustrates a remote training call flow according to one embodiment. As shown in FIG. 2, the flow comprises the following steps:
1. The client receives parameters input by the user, including: the number of cluster instances, the number of GPU cards, whether a remote CPU is to be used, the environment configuration for training, the user's identity information, and so on.
2. Training acceleration code adaptation. Training acceleration code adaptation is performed on the client's original training code; it mainly uses communication optimization to reduce the communication loss of distributed training across machines and improve cost performance.
3. RDL code adaptation. RDL code adaptation is performed on the code obtained after the training acceleration adaptation; the code is briefly modified to support remote execution, and relying on the RDL library implementation greatly reduces the amount of code that has to be changed.
4. Submit the model training task. The training task is submitted through the launch script of the communication framework in the training acceleration adaptation code package.
5. Dynamically generate the syntax tree. Submitting the model task triggers an RDL call locally and thus the sending of the remote training request; at this point the result of the local run of the code is recursively traversed, an AST is constructed through the call paths, and the AST is encrypted and sent to the server (also called the RDL server).
6. Start codeServing. After receiving the submitted task, the server starts a codeServing service, parses and renders the AST, and generates code that the GPU side can execute.
7. Dynamically apply for the cluster. After the server backend verifies the user's identity information, it applies for resources according to the supplied cluster demand configuration, preferring clusters in the same region and availability zone.
8. Configure the training cluster. After the cluster is selected, it creates the training environment according to the docker image provided by the user or a conda yaml file.
9. Decide whether to use a remote CPU. If the user chooses to use a remote CPU, the data preprocessing and sparse computation parts, for example, can be assigned to that CPU; otherwise the client's local CPU resources are used. The computed results are uniformly packaged as RMI remote objects for the GPU cluster to access.
10. Use the remote GPU. After the cluster is configured, the GPU side enters the training stage: it first dynamically fetches the tensor data on the CPU side remotely, which can be done by prefetching so that IO and the network are not bottlenecks; after the data is obtained, the real training is carried out; after training, the result is sent back to the client for dynamic display.
11. Local summary. The local client can see the training state and the final result in real time, such as the loss, batch and accuracy of the training, achieving the effect of remote training.
12. Save to the cloud. If the user chooses to save to the cloud, the data and models can be designated to be written to the user's OSS to back up the results.
13. Save locally. The user can also choose to fetch the remotely computed result locally via RMI for custom processing and saving.
14. End. After training ends, if the user chooses debugging, local debugging can be performed directly on the client, and the remote end feeds the debugging information back to the local side for display; after the cluster is no longer used, its access rights are reclaimed and the resource environment is cleared.
Therefore, by implementing the remote training call flow, the following technical effects can be produced:
1) Training offloading: the client genuinely sees the execution effect while the real training takes place at the remote end, which reduces the customer's maintenance cost; a distributed training task can be executed from the local side merely by providing the algorithm and funds, with no need for instance purchase, deployment, data relocation and similar processes.
2) Instance splitting: CPU/GPU instances are separated according to the task, achieving instance-level fine-grained control, reducing the task's dependence on specific machine types, and allowing different CPU/GPU instances to be freely combined for task scheduling under resource recombination.
3) Non-intrusive modification: multiple frameworks are abstracted uniformly, and the user's calls to framework-side models and tensors are intercepted for RMI remote dynamic registration, greatly reducing the cost of manual adaptation.
4) New selling model: a new form of cloud resource is provided. Through permission isolation, the training cluster is not fully controllable by the user and the instances are only used to execute training tasks, so the sale of resources can shift from single fixed instances to the output of training compute power, reducing resource consumption, lowering customer cost, and improving the utilization of clusters on the cloud.
The scheme disclosed in the embodiments of this specification realizes remote training in software, making a new mode of selling training compute power possible, reducing the maintenance cost and scheduling difficulty of scattered clusters, and improving the utilization of the overall cloud cluster, so that algorithm developers are freed from problems such as the deployment and maintenance of Infrastructure as a Service (IaaS) layer resources and can perform distributed training more efficiently.
According to an embodiment of another aspect, corresponding to the above remote training method, the embodiments of this specification further disclose a remote training system. FIG. 3 is a schematic structural diagram of a system for implementing remote training disclosed in an embodiment of this specification; as shown in FIG. 3, the system 300 includes:
a client 310, configured to run target training code for executing a model training task, and to generate, in response to a call to the server 320 resulting from running the target training code, a remote training request and send it to the server 320, the remote training request including cluster parameters input for the model training task; a server 320, configured to apply for and configure cluster resources in a resource pool according to the cluster parameters to obtain a target cluster 330; and the target cluster 330, configured to execute a computation task in the model training task in response to a remote method invocation (RMI) generated by the client 310 running the target training code.
In one embodiment, the target training code is derived from an adaptation modification of original training code, and the call to the server 320 and the RMI are implemented based on the adaptation modification.
In an embodiment, the client 310 being configured to generate the remote training request specifically includes: generating the remote training request based on the cluster parameters and the identity information of the user bound to the client 310. The server 320 being configured to apply for and configure cluster resources in the resource pool specifically includes: applying for and configuring the cluster resources if the identity information passes verification.
In one embodiment, the remote training request includes a syntax tree constructed based on the target training code; the server 320 is further configured to parse and render the syntax tree to generate code executable by the cluster; the target cluster 330 being configured to execute the computation task in the model training task specifically includes: executing the computation task by executing the executable code.
In one embodiment, the cluster parameters include a cluster demand parameter and a cluster configuration parameter; the server 320 being configured to apply for and configure cluster resources in the resource pool according to the cluster parameters specifically includes: applying for cluster resources in the resource pool that match the cluster demand parameter; and configuring the cluster resources according to the cluster configuration parameter to obtain the target cluster 330.
In a specific embodiment, the cluster demand parameter includes one or more of the following: the number of occupied machines, the number of graphics processing unit (GPU) cards, and the number of central processing unit (CPU) cards.
In another specific embodiment, the cluster configuration parameter indicates the training environment of the cluster; the server 320 being configured to configure the cluster resources according to the cluster configuration parameter specifically includes: the server 320 provides the cluster resources with an installation file corresponding to the training environment, so that the cluster resources create the training environment by running the installation file.
Further, in one example, the installation file comprises a container docker image file indicating that a single training card is to be split into a specified number of virtual cards; the cluster resources creating the training environment by running the installation file includes: the cluster resources split a training card among them into the specified number of virtual cards by running the container docker image file.
In one embodiment, the client 310 being configured to generate the RMI call specifically includes: by running the target training code, the client 310 encapsulates the task data involved in the model training task into an object of an RMI class, the RMI class being pre-registered with the RMI registry by the server 320; and the client 310 calls, based on the object, a target method of a remote object in the target cluster 330. Correspondingly, the target cluster 330 being configured to execute the computation task in the model training task specifically includes: the target cluster 330 runs the target method, based on the task data, to execute the computation task.
In a specific embodiment, the target training code is derived from an adaptation modification of the original training code, the adaptation modification including adding a modifier to the original training code; the client 310 being configured to encapsulate the task data involved in the model training task as an object of the RMI class specifically includes: the client 310 creates the RMI class through the modifier and encapsulates the task data as objects of the RMI class through the modifier.
In a specific embodiment, the task data includes model parameters and/or training sample data.
In another specific embodiment, the cluster parameter indicates that CPUs in the resource pool are not to be used; the client 310 being configured to encapsulate the task data involved in the model training task as an object of the RMI class specifically includes: performing predetermined processing on the training sample data using the local CPU of the terminal where the client 310 is located; and encapsulating the training sample data after the predetermined processing into an object of the RMI class.
In one embodiment, the original training code includes a specific compute unified device architecture (CUDA) operation; the adaptation modification comprises: automatically recognizing the CUDA operation, intercepting the call path of the application programming interface (API) corresponding to the CUDA operation, and setting it to be executed in the server 320.
In one embodiment, the target cluster 330 is further configured to provide execution state information of the computation task to the client 310 as the return result of the RMI call, the execution state information including one or more of the following: the current training round, the training loss, the number of batch samples corresponding to the training round, and the evaluation index value of the training effect.
According to an embodiment of another aspect, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in connection with fig. 1 or fig. 2.
According to an embodiment of yet another aspect, there is also provided a computing device comprising a memory having stored therein executable code, and a processor that, when executing the executable code, implements the method described in connection with fig. 1 or fig. 2.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The foregoing describes the objects, technical solutions and advantages of the present invention in further detail through specific embodiments. It should be understood that the above are only exemplary embodiments of the present invention and are not intended to limit its scope; any modification, equivalent substitution, improvement or the like made on the basis of the technical solutions of the present invention shall fall within the scope of the present invention.

Claims (14)

1. A method of implementing remote training, comprising:
the client runs a target training code for executing a model training task;
the client, in response to a call to the server generated by running the target training code, generates a remote training request and sends it to the server, wherein the remote training request comprises cluster parameters input for the model training task;
the server applies for and configures cluster resources in a resource pool according to the cluster parameters to obtain a target cluster;
and the target cluster responds to a remote method call RMI generated by the client running the target training code and executes a calculation task in the model training task.
2. The method of claim 1, wherein the target training code is derived based on an adaptation modification to original training code, the invocation of the server and the RMI being implemented based on the adaptation modification.
3. The method of claim 1, wherein generating a remote training request comprises:
generating the remote training request based on the identity information of the user bound to the client and the cluster parameters;
the applying and configuring of the cluster resources in the resource pool includes:
and applying for and configuring the cluster resources under the condition that the identity information passes the verification.
4. The method of claim 1, wherein the remote training request includes a syntax tree constructed based on the target training code;
prior to performing a computational task of the model training tasks, the method further comprises:
the server analyzes and renders the syntax tree to generate executable codes of the cluster;
wherein, executing the calculation task in the model training task comprises:
executing the computing task by executing the executable code.
5. The method of claim 1, wherein the cluster parameters include a cluster demand parameter and a cluster configuration parameter; wherein the server applying for and configuring cluster resources in a resource pool according to the cluster parameters comprises:
the server applies for the cluster resources matched with the cluster demand parameters in the resource pool;
and the server performs the configuration on the cluster resources according to the cluster configuration parameters to obtain a target cluster.
6. The method of claim 5, wherein the cluster demand parameters include one or more of the following: the number of occupied machines, the number of graphics processing unit (GPU) cards, and the number of central processing unit (CPU) cards.
7. The method of claim 5, wherein the cluster configuration parameter indicates a training environment of a cluster; wherein, the server performs the configuration on the cluster resource according to the cluster configuration parameter, including:
the server provides an installation file corresponding to the training environment for the cluster resource;
and the cluster resource creates the training environment by operating the installation file.
8. The method of claim 7, wherein the installation file comprises a container docker image file indicating a split of a single training card into a specified number of virtual cards;
the cluster resource creates the training environment by running the installation file, including:
and the cluster resource divides a certain training card in the cluster resource into a plurality of virtual cards with the specified quantity by operating the container docker image file.
9. The method of claim 1, wherein the generation of the RMI call comprises:
the client encapsulates the task data related to the model training task into an object in an RMI class by operating the target training code, and the RMI class is registered in an RMI registration center by the server in advance; and the client calls a target method of a remote object in the target cluster based on the object;
wherein, executing the calculation task in the model training task comprises:
based on the task data, the target method is executed to perform the computing task.
10. The method of claim 9, wherein the target training code is derived based on an adaptive modification to an original training code, the adaptive modification including adding a modifier in the original training code; encapsulating task data related to the model training task into an object in an RMI class, wherein the encapsulating comprises the following steps:
the modifier creates the RMI class;
the modifier encapsulates the task data as objects in the RMI class.
11. The method of claim 9, wherein the task data comprises model parameters, and/or training sample data.
12. The method of claim 9, wherein the cluster parameter indicates: not using the CPU in the resource pool; encapsulating task data related to the model training task into an object in an RMI class, wherein the encapsulating comprises the following steps:
performing preset processing on training sample data by using a local CPU of a terminal where the client is located;
and packaging the training sample data after the preset processing into an object in the RMI class.
13. The method of claim 2, wherein the original training code includes a specific compute unified device architecture (CUDA) operation therein; the adaptation modification comprises:
automatically identifying the CUDA operation, intercepting the call path of an application programming interface (API) corresponding to the CUDA operation, and setting the call path to be executed in the server.
14. The method of claim 1, wherein the method further comprises:
the target cluster provides the execution state information of the computing task as a return result of the RMI call to the client, wherein the execution state information comprises one or more of the following items: the current training round, the training loss, the number of batch samples corresponding to the training round and the evaluation index value of the training effect.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111545018.6A CN114265690A (en) 2021-12-16 2021-12-16 Method and device for realizing remote training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111545018.6A CN114265690A (en) 2021-12-16 2021-12-16 Method and device for realizing remote training

Publications (1)

Publication Number Publication Date
CN114265690A true CN114265690A (en) 2022-04-01

Family

ID=80827570

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111545018.6A Pending CN114265690A (en) 2021-12-16 2021-12-16 Method and device for realizing remote training

Country Status (1)

Country Link
CN (1) CN114265690A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117149360A (en) * 2023-10-30 2023-12-01 天津市天河计算机技术有限公司 Remote training method, device and storage medium
CN117149360B (en) * 2023-10-30 2024-01-12 天津市天河计算机技术有限公司 Remote training method, device and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination