CN116775093A - Distributed training method, device and equipment for codes - Google Patents

Distributed training method, device and equipment for codes

Info

Publication number
CN116775093A
CN116775093A
Authority
CN
China
Prior art keywords
training
user
target
codes
distributed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210215609.5A
Other languages
Chinese (zh)
Inventor
闫晓瑞
武文博
王斌
冯俊兰
邓超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Communications Ltd Research Institute
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Communications Ltd Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Communications Ltd Research Institute
Priority to CN202210215609.5A
Publication of CN116775093A
Legal status: Pending

Abstract

The invention provides a distributed training method, a device and equipment for codes, wherein the distributed training method for the codes comprises the following steps: acquiring code and a training instruction written by a user through a preset application; obtaining, according to the training instruction, a target resource specified by the user for training the code; creating a resource object according to the target resource; scheduling a target training container in the resource object to train the code and obtain a training result; and returning the training result to the user through the application. The scheme of the invention requires no invasive modification of the code, isolates data from computation, effectively improves resource utilization, and provides strong interactivity.

Description

Distributed training method, device and equipment for codes
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method, an apparatus, and a device for distributed training of codes.
Background
With the rapid development of machine learning and cloud computing, traditional distributed training methods can no longer meet current requirements. Existing distributed training requires users to write Kubeflow resource object files and to modify the training code invasively. Users cannot interact well with the code, and both running the code and reading the results require relevant expertise. In addition, because different storage systems each have their own call interfaces, distributed training requires modifying a large amount of code when reading data sets on different storage systems. At the same time, reading remote storage during distributed training may cause I/O bottlenecks, leaving resources underutilized.
Disclosure of Invention
The invention provides a distributed training method, device and equipment for codes, which improve resource utilization and data interactivity.
In order to solve the technical problems, the technical scheme of the invention is as follows:
a method of distributed training of codes, the method comprising:
acquiring codes and training instructions written by a user;
according to the training instruction, obtaining target resources designated by a user and used for training the codes;
creating a resource object according to the target resource;
scheduling a target training container in the resource object to carry out code training to obtain a training result;
and returning the training result to the user through the application.
Optionally, acquiring the code and the training instruction written by the user includes:
and receiving user-written codes and training instructions sent by the kernel of the Jupyterhub application.
Optionally, according to the training instruction, obtaining a target resource specified by a user and used for training the code includes:
parsing the training instruction to obtain a parsing result;
if the parsing result includes a training framework and a number of training containers specified by the user for training the code, taking the user-specified training framework and number of training containers as the target resource;
and if the parsing result does not include a user-specified training framework and number of training containers, taking a default training framework and number of training containers as the target resource.
Optionally, the distributed training method of the code further includes:
acquiring a data orchestration request in the training instruction;
interacting with a virtual distributed storage system according to the data orchestration request, and mounting target data on the virtual distributed storage system, wherein the virtual distributed storage system is a middleware layer built between an underlying distributed file system and an upper-layer distributed computing framework.
Optionally, creating a resource object according to the target resource includes:
acquiring a local storage path to which the target data in the virtual distributed storage system is mounted;
and creating the resource object according to the target resource and the local storage path, wherein the resource object comprises a target training framework and a target training container.
Optionally, scheduling the target training container in the resource object to train the code and obtain a training result includes:
obtaining the training result produced by the target training container training the code stored in the resource object according to the target training framework.
Optionally, returning the training result to the user through the application includes:
and sending the training result to the front end of the Jupyterhub application through the kernel of the Jupyterhub application, and sending the training result to a user through the front end of the Jupyterhub application.
The embodiment of the invention also provides a distributed training device for codes, which comprises:
the first acquisition module is used for acquiring codes and training instructions written by a user;
the second acquisition module is used for acquiring target resources which are designated by a user and used for training the codes according to the training instruction;
the processing module is used for creating a resource object according to the target resource; scheduling a target training container in the resource object to carry out code training to obtain a training result; and returning the training result to the user through the application.
The present invention also provides a computing device comprising: a processor, a memory storing a computer program which, when executed by the processor, performs the method as described above.
The invention also provides a computer readable storage medium storing instructions that when run on a computer cause the computer to perform a method as described above.
The scheme of the invention at least comprises the following beneficial effects:
according to the scheme, codes and training instructions written by a user are obtained through a preset application; according to the training instruction, obtaining target resources designated by a user and used for training the codes; creating a resource object according to the target resource; performing code training according to the resource object scheduling target training container to obtain a training result; and returning the training result to the user through the application. The method solves the problem that when the existing distributed training needs a user to modify the distributed training code and cannot solve the problem that the I/O bottleneck appears to influence the resource utilization rate when the data set is read, realizes the isolation of the data from the calculation, effectively improves the resource utilization rate, and has strong interactivity.
Drawings
FIG. 1 is a flow chart of a distributed training method for codes provided by an embodiment of the present invention;
FIG. 2 is an architecture diagram of a distributed training system for code provided by an embodiment of the present invention;
FIG. 3 is a flow chart of distributed training based on JupyterHub provided by an embodiment of the present invention;
FIG. 4 is a flow chart of a data orchestration module provided by an embodiment of the present invention;
FIG. 5 is a flow chart of a container management module provided by an embodiment of the present invention;
FIG. 6 is a block diagram of a distributed training apparatus for code provided by an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
As shown in fig. 1, an embodiment of the present invention provides a distributed training method for codes, the method including:
step 11, acquiring codes and training instructions written by a user; in specific implementation, codes and training instructions written by a user can be obtained through a preset application; the preset application may be a JupyterHub application;
step 12, according to the training instruction, obtaining a target resource designated by the user and used for training the code; the target resources here may include: at least one Pod (training container) and its corresponding resources, such as a GPU (graphics processing unit) or a CPU (central processing unit);
step 13, creating a resource object according to the target resource;
step 14, scheduling a target training container in the resource object to carry out code training to obtain a training result;
and step 15, returning the training result to the user through the application.
In this embodiment, the code and training instruction written by the user are obtained through the preset application; the target resource specified by the user for training the code is obtained according to the training instruction; a resource object is created according to the target resource; and the target training container in the resource object is scheduled to train the code and obtain a training result, which is returned to the user through the application. As a result, the code does not need to be modified invasively, data is isolated from computation, resource utilization is effectively improved, and interactivity is strong.
In yet another alternative embodiment of the present invention, step 11 includes:
and receiving user-written codes and training instructions sent by the kernel of the Jupyterhub application.
In this embodiment, the code written by the user is sent to the kernel of the JupyterHub application through the front end of the JupyterHub application, and is then forwarded by the kernel to the Kubeflow Master (host) for container management. In this way, the code written by the user in JupyterHub is relayed through the Jupyter kernel and stored in the Kubernetes resource object ConfigMap, so that the user code is persisted and distributed training can be performed without rebuilding an image.
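By way of illustration only, and not as part of the claimed embodiment, a minimal sketch of this code-persistence step using the official Kubernetes Python client might look as follows; the function name store_user_code and the file name train.py are assumptions introduced for the example:

from kubernetes import client, config

def store_user_code(namespace, job_name, code_text):
    # Load cluster credentials (from kubeconfig here; inside a cluster, load_incluster_config() applies).
    config.load_kube_config()
    core_v1 = client.CoreV1Api()
    # Persist the notebook code as a file entry in a ConfigMap named after the training job.
    config_map = client.V1ConfigMap(
        metadata=client.V1ObjectMeta(name=job_name, namespace=namespace),
        data={"train.py": code_text},
    )
    core_v1.create_namespaced_config_map(namespace=namespace, body=config_map)

Naming the ConfigMap after the job keeps it consistent with the Kubeflow resource object that later mounts it, as described below.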
As shown in FIG. 2, an alternative embodiment of the present invention provides a distributed training system for code. The core of the system comprises two parts: a training container management module and a data orchestration module. The training container management module includes a container management sub-module and a user code storage sub-module; the data orchestration module includes a file mounting sub-module and a data management sub-module.
The functions of the modules are as follows:
the training container management module can receive information transmitted by a preset application Jupyterhub kernel, create a resource object for distributed training, and a user can select a Tensorflow or Pytorch distributed training framework, specify a Master (host), the number of PS and workbench and resources allocated to the container through the container management sub-module. The code written by the user is transmitted through the Jupyterhub kernel and stored in the resource object ConfigMap (namely the code storage sub-module) of the Kunetes Master, so that the user code can be stored in a lasting mode, and meanwhile, the distributed training can be carried out without reconstructing a mirror image.
The training container management sub-module mainly creates and runs the target Kubeflow resource object according to the user's training request, and directly returns and displays the running result of the Kubeflow resource object to the user.
The user defines the required distributed training framework at the beginning of the code through magic commands, and specifies the numbers of Master, PS and Worker nodes and the resources allocated to each container. The resources that may be allocated are GPU, CPU and memory.
In this system, magic commands are commands beginning with the symbol %, for example:
"%framework=tensorflow" indicates that the user-defined distributed training framework is TensorFlow. The Kubeflow container management sub-module makes Kubeflow usable from JupyterHub, removing the need for the user to modify the code invasively for Kubeflow or to have Kubernetes knowledge; at the same time, the user can see the training result on JupyterHub after submitting code there, which solves the lack of interactivity in distributed training with Kubeflow.
The functions of the user code storage sub-module include:
the user code storage sub-module can store training codes written by a user on JupyterHub in a resource object ConfigMap. The resource object ConfigMap is stored under Namespace appointed by a user, and the names of the Kubeflow resource objects created by the names are consistent. After storing the file in the ConfigMap, the kubelow resource object reads and runs the code using the file mount function of Kubernetes. The training codes and the environment are not required to be reconstructed into images by the user, and the images are uploaded to an image warehouse. The user code storage submodule is mainly based on the resource object ConfigMap of the Kubernetes and the container file mounting function of the Kubernetes, so that the user code is stored in a lasting mode, and the problem that a training image needs to be reconstructed is solved.
The data orchestration module interacts with Alluxio through the information passed from the JupyterHub kernel. Alluxio is a memory-based distributed file system: a middleware layer built between an underlying distributed file system and an upper-layer distributed computing framework, whose main role is to provide data access services, in the form of files, from memory or other storage facilities. With distributed training based on the Alluxio data orchestration system, data sets on different storage systems can be read through a single interface, and reading files from Alluxio is faster, which alleviates the I/O bottleneck of reading data sets in distributed training and improves resource utilization. The data orchestration module lets the user interact with the Alluxio system through magic commands and perform data orchestration through JupyterHub, so the user does not need Alluxio knowledge and the data is separated from the distributed computing framework.
The file mounting sub-module mounts the storage path of a data set into the Alluxio file system according to mounting rules defined by the user on JupyterHub. The user defines the mounting rules with magic commands, for example: "%mount /training-data/imagenet hdfs://IP:Port", which means that the files on the specified HDFS are mounted under the /training-data/imagenet directory in the Alluxio file system. A container running a distributed training task needs to read either a local data set or a remotely stored data set. To read a locally stored data set, the local files are copied into Alluxio through the copyFromLocal command; to read a remotely stored data set, the files of the remote storage system are first mounted into Alluxio through the mount command. Finally, the data required by the Kubeflow resource object is mounted onto the Alluxio file system, so that the code in the container reads the data set faster and resource utilization is improved. In addition, when the user's data set is migrated, neither the data-reading part of the code nor the configuration environment needs to be modified; only the magic command needs to be modified to re-mount the file, which avoids a large amount of invasive code modification. The file mounting sub-module thus realizes data orchestration with JupyterHub and accelerates data-set reading, so that training time is not prolonged by I/O performance bottlenecks, resource utilization is improved, and the problem of extensive code changes caused by data migration is solved.
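Purely as an illustration of how the file mounting sub-module could drive Alluxio, the magic commands above may be translated into calls to the standard Alluxio shell (alluxio fs mkdir / mount / copyFromLocal); the wrapper names below are hypothetical and error handling is omitted:

import subprocess

def alluxio_mkdir(alluxio_dir):
    # "%mkdir /training-data/imagenet" -> create the target directory in the Alluxio namespace.
    subprocess.run(["alluxio", "fs", "mkdir", alluxio_dir], check=True)

def alluxio_mount(alluxio_dir, ufs_uri):
    # "%mount /training-data/imagenet hdfs://IP:Port" -> mount a remote store into Alluxio.
    subprocess.run(["alluxio", "fs", "mount", alluxio_dir, ufs_uri], check=True)

def alluxio_copy_from_local(local_path, alluxio_dir):
    # Copy a locally stored data set into Alluxio (copyFromLocal).
    subprocess.run(["alluxio", "fs", "copyFromLocal", local_path, alluxio_dir], check=True)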
The data management sub-module manages different data storage systems on JupyterHub. Based on the Alluxio system, different storage systems are operated with magic commands. This module separates data from computation, and the user can interact with different storage systems on JupyterHub without configuring different environments or learning each storage system.
In yet another alternative embodiment of the present invention, step 12 includes:
step 121, parsing the training instruction to obtain a parsing result;
step 122, if the parsing result includes a training framework (e.g. TensorFlow) and a number of training containers (Pods) specified by the user for training the code, taking the user-specified training framework and number of training containers as the target resource;
and step 123, if the parsing result does not include a user-specified training framework and number of training containers, taking a default training framework and number of training containers as the target resource.
In this embodiment, the obtained training instruction is parsed to obtain a parsing result. If the parsing result includes a training framework and a number of training containers specified by the user, training is performed with the user-specified framework and containers; otherwise, training is performed with the default training framework and number of training containers.
In yet another alternative embodiment of the present invention, step 12 may further include:
step 124, obtaining a data orchestration request in the training instruction;
step 125, interacting with a virtual distributed storage system according to the data orchestration request, and mounting target data on the virtual distributed storage system, wherein the virtual distributed storage system is a middleware layer built between an underlying distributed file system and an upper-layer distributed computing framework.
In this embodiment, according to the obtained data orchestration request, the target data may be mounted on a local storage path of the virtual distributed storage system Alluxio, so that the user can access the data in Alluxio through the local path; in this way, the virtual distributed storage system Alluxio isolates data from computation.
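One possible way to expose the mounted Alluxio data under a local path is the Alluxio POSIX (FUSE) integration; the sketch below is an assumption about that deployment detail, and the exact command and argument order depend on the Alluxio version in use:

import subprocess

def mount_alluxio_locally(local_mount_point, alluxio_path="/"):
    # Expose the Alluxio namespace as an ordinary local directory via the Alluxio FUSE integration,
    # so that the training code can read the data set through a plain file path.
    subprocess.run(["alluxio-fuse", "mount", local_mount_point, alluxio_path], check=True)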
As shown in fig. 3 and 4, in an alternative embodiment of the present invention, the user uses JupyterHub for data orchestration and distributed training as follows:
process 1: the JupyterHub front end passes a request to the Jupyter kernel through ZeroMQ, and the kernel parses the command passed by the user from the JupyterHub front end, where ZeroMQ is a messaging tool;
process 2: judging whether the command contains a data orchestration request; if a data orchestration request is parsed, calling the relevant Alluxio API to interact with Alluxio, and storing, updating or deleting data in Alluxio;
otherwise, entering the container management module directly;
process 3: judging whether the command contains a mounting request; if a data mounting request is parsed, mounting the data in Alluxio onto a specified local path, so that the user can access the data in Alluxio through the local path;
process 4: after the data orchestration request is processed, finally entering the container management module to create and run the container (a compressed sketch of this dispatch flow is given below).
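A compressed, illustrative sketch of this dispatch flow is given below; the three callables passed in stand for the data orchestration and container management modules and are hypothetical names, not part of the disclosed implementation:

def dispatch(spec, code, orchestrate_data, mount_data_locally, manage_containers):
    # spec: the parsed magic commands; code: the remaining training code (process 1).
    if "mkdir" in spec or "copyFromLocal" in spec:
        orchestrate_data(spec)                 # process 2: create/update/delete data in Alluxio
    if "mount" in spec:
        mount_data_locally(spec)               # process 3: mount Alluxio data onto a local path
    return manage_containers(spec, code)       # process 4: create and run the training containers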
In yet another alternative embodiment of the present invention, step 13 includes:
step 131, acquiring a local storage path to which the target data in the virtual distributed storage system is mounted;
and step 132, creating the resource object according to the target resource and the local storage path, wherein the resource object comprises a target training framework and a target training container.
In this embodiment, the resource object may be further created from a local storage path, which enables isolation of data from computation in JupyterHub.
In yet another alternative embodiment of the present invention, step 14 includes:
step 141, obtaining the training result produced by the target training container training the code stored in the resource object according to the target training framework.
In this embodiment, the target training container trains the code stored in the user code storage sub-module according to the training framework indicated by the user, so as to obtain the training result and improve resource utilization.
In yet another alternative embodiment of the present invention, step 15 includes:
and step 151, sending the training result to the front end of the JupyterHub application through the kernel of the JupyterHub application, and sending the training result to the user through the front end of the JupyterHub application.
In this embodiment, the training result is returned to the user through the front end of the JupyterHub application, so that interactivity is increased.
In yet another alternative embodiment of the present invention, as shown in FIG. 5, the workflow of the container management module is as follows:
scheme 1: the Jupiter hub front end transmits a request to the Jupiter hub kernel through the ZeroMQ, and the Jupiter hub kernel analyzes a command transmitted by a user from the Jupiter hub front end;
process 2: judging whether a user designates a training frame or not, if not, prompting the user to designate the training frame, otherwise, judging whether the user designates the amount of resources required to be allocated to a training container or not;
process 3: if the user designates the resource size allocated to the training container, acquiring the resource size designated by the user, otherwise, representing that the resource size of the training container is not limited;
process 4: the basic mirror image is used as a training mirror image, and the user-specified resource quantity size and the data set mounting path are local paths for mounting the Alluxio data in the data arrangement module through training codes in the ConfigMap;
process 5: and running training codes by using the training container, waiting for the completion of the running of the training container, and returning a training result or a related prompt to the Jupyterhub front end through the ZeroMQ.
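Results travel back over the Jupyter messaging channel (ZeroMQ). As a sketch only, a wrapper kernel built on ipykernel could stream the training log to the front end as follows; the class name and the run_distributed_training hook are illustrative assumptions:

from ipykernel.kernelbase import Kernel

def run_distributed_training(code):
    # Hypothetical hook into the container management module; returns the training log text.
    return "submitted training for %d lines of code" % len(code.splitlines())

class DistTrainKernel(Kernel):
    implementation = "dist-train"
    implementation_version = "0.1"
    banner = "Distributed training kernel (sketch)"
    language_info = {"name": "python", "mimetype": "text/x-python", "file_extension": ".py"}

    def do_execute(self, code, silent, store_history=True,
                   user_expressions=None, allow_stdin=False):
        result_text = run_distributed_training(code)
        if not silent:
            # Stream the training result (or a related prompt) back to the JupyterHub front end.
            self.send_response(self.iopub_socket, "stream",
                               {"name": "stdout", "text": result_text})
        return {"status": "ok", "execution_count": self.execution_count,
                "payload": [], "user_expressions": {}}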
The process of performing distributed training using the container management module and the data orchestration module is as follows:
the magic commands taking the one-time distributed training using the Tensorflow framework as an example are:
%framework=tensorflow
%ps=1;%worker=3;%cpu=2;%memory=400
%cleanPolicy=none
%mkdir /training-data/imagenet
%mount /training-data/imagenet hdfs://IP:Port
process 1: the container management module receives, from the JupyterHub kernel, a distributed training request that uses the TensorFlow framework;
(data set mounting stage)
Process 2: the data orchestration module mounts the remote data set;
the file mounting sub-module uses the mounting information defined by the magic commands to mount the remote data set into Alluxio, which stores data in memory to accelerate access. In this embodiment, the /training-data/imagenet folder is first created, the files on the distributed storage system HDFS are then mounted into /training-data/imagenet, and the training container can subsequently obtain the target data set by mounting /training-data/imagenet of Alluxio;
process 3: the data management sub-module manages data sets on different distributed storage systems through the user's magic commands; when one training model needs to use data sets on different distributed storage systems, only the magic command needs to be modified, not the data-set access interface in the code;
(code management and container creation run phase)
Process 4: the user code storage sub-module is responsible for storing training codes of users in a resource object ConfigMap, and acquiring and running the training codes in a training container in a mounting mode;
process 5: the Kubeflow container management sub-module creates the corresponding resource object TFJob using the resource information defined by the user's magic commands. In this embodiment, the magic commands define the training framework of the containers as TensorFlow, with 1 parameter server node and 3 worker nodes, each allocated 2 CPUs and 400Mi of memory, where Mi is a Kubernetes resource unit. The user does not need to write the resource object file or set environment variables. The code in the ConfigMap runs in the target training containers, and the data-set path read in the training containers is mounted onto /training-data/imagenet of Alluxio. Communication between the target training containers is governed by the TF-Operator of Kubeflow (a hedged sketch of such a TFJob creation is given after this walkthrough);
(return training results stage)
Process 6: after the target training containers finish training, the JupyterHub kernel reads the running log of the target training containers and returns the training result in the log to the JupyterHub front end. According to the cleanPolicy strategy defined by the magic command, it is decided whether to clean up the completed target training containers: none means the containers are not cleaned up after completion, while running and all mean they are cleaned up;
process 7: the user has completed one TensorFlow distributed training run, and the training result is displayed to the user on the JupyterHub front end.
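Pulling the worked example together, and only as an illustrative assumption about how process 5 could be implemented, a TFJob matching the magic commands above (1 PS, 3 Workers, 2 CPUs and 400Mi of memory each) could be created with the Kubernetes CustomObjectsApi roughly as follows; the image name and the container command are placeholders:

from kubernetes import client, config

def create_tfjob(namespace, job_name, ps_replicas=1, worker_replicas=3,
                 cpu="2", memory="400Mi",
                 image="tensorflow/tensorflow:latest"):
    # Credentials from the local kubeconfig; inside a cluster, load_incluster_config() would be used.
    config.load_kube_config()
    container = {
        "name": "tensorflow",                          # default container name expected by TFJob
        "image": image,
        "command": ["python", "/workspace/train.py"],  # user code mounted from the ConfigMap
        "resources": {"limits": {"cpu": cpu, "memory": memory}},
    }
    def replica(n):
        # One replica spec (PS or Worker) sharing the same pod template.
        return {"replicas": n, "template": {"spec": {"containers": [container]}}}
    tfjob = {
        "apiVersion": "kubeflow.org/v1",
        "kind": "TFJob",
        "metadata": {"name": job_name, "namespace": namespace},
        "spec": {"tfReplicaSpecs": {"PS": replica(ps_replicas),
                                    "Worker": replica(worker_replicas)}},
    }
    client.CustomObjectsApi().create_namespaced_custom_object(
        group="kubeflow.org", version="v1", namespace=namespace,
        plural="tfjobs", body=tfjob)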
In the embodiment of the invention, data orchestration alleviates the I/O performance bottleneck of reading data sets, thereby improving resource utilization. Whether distributed training is performed with Kubeflow or data migration occurs, no additional invasive modification of the code is needed. By using the Alluxio data orchestration system, data is isolated from computation within JupyterHub; the user can write training code with JupyterHub and directly obtain its execution result, which enhances interactivity.
As shown in fig. 6, an embodiment of the present invention further provides a distributed training apparatus 60 for codes, the apparatus 60 including:
a first obtaining module 61, configured to obtain a code and a training instruction written by a user;
a second obtaining module 62, configured to obtain, according to the training instruction, a target resource specified by a user for training the code;
a processing module 63, configured to create a resource object according to the target resource; scheduling a target training container in the resource object to carry out code training to obtain a training result; and returning the training result to the user through the application.
Optionally, acquiring the code and training instruction written by the user includes:
receiving the user-written code and training instruction sent by the kernel of the JupyterHub application.
Optionally, according to the training instruction, obtaining a target resource specified by a user and used for training the code includes:
parsing the training instruction to obtain a parsing result;
if the parsing result includes a training framework and a number of training containers specified by the user for training the code, taking the user-specified training framework and number of training containers as the target resource;
and if the parsing result does not include a user-specified training framework and number of training containers, taking a default training framework and number of training containers as the target resource.
Optionally, the second obtaining module 62 may further be configured to:
acquiring a data orchestration request in the training instruction;
interacting with a virtual distributed storage system according to the data orchestration request, and mounting target data on the virtual distributed storage system, wherein the virtual distributed storage system is a middleware layer built between an underlying distributed file system and an upper-layer distributed computing framework.
Optionally, creating a resource object according to the target resource includes:
acquiring a local storage path to which the target data in the virtual distributed storage system is mounted;
and creating the resource object according to the target resource and the local storage path, wherein the resource object comprises a target training framework and a target training container.
Optionally, scheduling the target training container in the resource object to train the code and obtain a training result includes:
obtaining the training result produced by the target training container training the code stored in the resource object according to the target training framework.
Optionally, returning the training result to the user through the application includes:
and sending the training result to the front end of the JupyterHub application through the kernel of the JupyterHub application, and sending the training result to the user through the front end of the JupyterHub application.
It should be noted that, the device is a device corresponding to the above method, and all implementation manners in the above method embodiments are applicable to the embodiment of the device, so that the same technical effects can be achieved.
Embodiments of the present invention also provide a computing device comprising: a processor, a memory storing a computer program which, when executed by the processor, performs the method as described above. All the implementation manners in the method embodiment are applicable to the embodiment, and the same technical effect can be achieved.
Embodiments of the present invention also provide a computer-readable storage medium comprising storing instructions that, when executed on a computer, cause the computer to perform a method as described above. All the implementation manners in the method embodiment are applicable to the embodiment, and the same technical effect can be achieved.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a usb disk, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk, etc.
Furthermore, it should be noted that in the apparatus and method of the present invention, it is apparent that the components or steps may be disassembled and/or assembled. Such decomposition and/or recombination should be considered as equivalent aspects of the present invention. Also, the steps of performing the series of processes described above may naturally be performed in chronological order in the order of description, but are not necessarily performed in chronological order, and some steps may be performed in parallel or independently of each other. It will be appreciated by those of ordinary skill in the art that all or any of the steps or components of the methods and apparatus of the present invention may be implemented in hardware, firmware, software, or a combination thereof in any computing device (including processors, storage media, etc.) or network of computing devices, as would be apparent to one of ordinary skill in the art after reading this description of the invention.
The object of the invention can thus also be achieved by running a program or a set of programs on any computing device. The computing device may be a well-known general purpose device. The object of the invention can thus also be achieved by merely providing a program product containing program code for implementing said method or apparatus. That is, such a program product also constitutes the present invention, and a storage medium storing such a program product also constitutes the present invention. It is apparent that the storage medium may be any known storage medium or any storage medium developed in the future. It should also be noted that in the apparatus and method of the present invention, it is apparent that the components or steps may be disassembled and/or assembled. Such decomposition and/or recombination should be considered as equivalent aspects of the present invention. The steps of executing the series of processes may naturally be executed in chronological order in the order described, but are not necessarily executed in chronological order. Some steps may be performed in parallel or independently of each other.
While the foregoing is directed to the preferred embodiments of the present invention, it will be appreciated by those skilled in the art that various modifications and adaptations can be made without departing from the principles of the present invention, and such modifications and adaptations are intended to be comprehended within the scope of the present invention.

Claims (10)

1. A method of distributed training of codes, the method comprising:
acquiring codes and training instructions written by a user;
according to the training instruction, obtaining target resources designated by a user and used for training the codes;
creating a resource object according to the target resource;
scheduling a target training container in the resource object to carry out code training to obtain a training result;
and returning the training result to the user through the application.
2. The method of claim 1, wherein obtaining the user-written code and training instructions comprises:
and receiving user-written codes and training instructions sent by the kernel of the Jupyterhub application.
3. The method for distributed training of codes according to claim 1, wherein obtaining target resources specified by a user for training the codes according to the training instructions comprises:
parsing the training instruction to obtain a parsing result;
if the parsing result includes a training framework and a number of training containers specified by the user for training the code, taking the user-specified training framework and number of training containers as the target resource;
and if the parsing result does not include a user-specified training framework and number of training containers, taking a default training framework and number of training containers as the target resource.
4. The distributed training method of codes according to claim 3, further comprising:
acquiring a data orchestration request in the training instruction;
interacting with a virtual distributed storage system according to the data orchestration request, and mounting target data on the virtual distributed storage system, wherein the virtual distributed storage system is a middleware layer built between an underlying distributed file system and an upper-layer distributed computing framework.
5. The method of distributed training of code of claim 4, wherein creating a resource object from the target resource comprises:
acquiring a local storage path to which the target data in the virtual distributed storage system is mounted;
and creating the resource object according to the target resource and the local storage path, wherein the resource object comprises a target training framework and a target training container.
6. The method of claim 5, wherein scheduling the target training container in the resource object to train the code and obtain a training result comprises:
obtaining the training result produced by the target training container training the code stored in the resource object according to the target training framework.
7. The distributed training method of codes according to claim 2, wherein returning said training results to a user through an application comprises:
and sending the training result to the front end of the Jupyterhub application through the kernel of the Jupyterhub application, and sending the training result to a user through the front end of the Jupyterhub application.
8. A distributed training apparatus for codes, the apparatus comprising:
the first acquisition module is used for acquiring codes and training instructions written by a user;
the second acquisition module is used for acquiring target resources which are designated by a user and used for training the codes according to the training instruction;
the processing module is used for creating a resource object according to the target resource; scheduling a target training container in the resource object to carry out code training to obtain a training result; and returning the training result to the user through the application.
9. A computing device, comprising: a processor, a memory storing a computer program which, when executed by the processor, performs the method of any one of claims 1 to 7.
10. A computer readable storage medium storing instructions which, when run on a computer, cause the computer to perform the method of any one of claims 1 to 7.
CN202210215609.5A 2022-03-07 2022-03-07 Distributed training method, device and equipment for codes Pending CN116775093A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210215609.5A CN116775093A (en) 2022-03-07 2022-03-07 Distributed training method, device and equipment for codes

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210215609.5A CN116775093A (en) 2022-03-07 2022-03-07 Distributed training method, device and equipment for codes

Publications (1)

Publication Number Publication Date
CN116775093A true CN116775093A (en) 2023-09-19

Family

ID=88012083

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210215609.5A Pending CN116775093A (en) 2022-03-07 2022-03-07 Distributed training method, device and equipment for codes

Country Status (1)

Country Link
CN (1) CN116775093A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination