CN112000473A - Distributed training method and device for deep learning model

Info

Publication number
CN112000473A
Authority
CN
China
Prior art keywords
training
nodes
node
strategy
elastic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010806430.8A
Other languages
Chinese (zh)
Inventor
乔萧雅
刘国宝
周雍恺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Unionpay Co Ltd
Original Assignee
China Unionpay Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Unionpay Co Ltd filed Critical China Unionpay Co Ltd
Priority to CN202010806430.8A priority Critical patent/CN112000473A/en
Publication of CN112000473A publication Critical patent/CN112000473A/en
Priority to TW110102762A priority patent/TWI783355B/en
Priority to PCT/CN2021/080496 priority patent/WO2022033024A1/en
Pending legal-status Critical Current

Classifications

    • G06F 9/5027: Allocation of resources, e.g. of the central processing unit [CPU], to service a request where the resource is a machine, e.g. CPUs, servers, terminals
    • G06F 9/5066: Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs
    • G06N 3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods

Abstract

The application discloses a distributed training method and apparatus for a deep learning model. The specific implementation scheme is as follows: the method comprises the following steps: acquiring training state data corresponding to a training task sent by a deep learning platform; generating an elastic scaling strategy according to the cluster resource requirements corresponding to the training task; dynamically adjusting the number of training nodes corresponding to the training task by adopting the elastic scaling strategy; and executing the training task according to the training state data and the adjusted training nodes. This improves adaptability to the cluster resource requirements of training tasks, improves the utilization of GPU or CPU resources, and allows the training task to be executed correctly and efficiently with the adjusted training nodes even when training nodes are added or deleted at any time.

Description

Distributed training method and device for deep learning model
Technical Field
The present application relates to the field of deep learning, and more particularly to the field of distributed training.
Background
The deep learning framework/platform supports a distributed training mode: multiple devices are used, each device may be equipped with multiple GPUs (Graphics Processing Units), and the deep learning model is trained in parallel on the GPUs in each device. Existing deep learning frameworks/platforms, such as the PS (Parameter Server) architecture native to TensorFlow (a dataflow programming framework), support an asynchronous training mode. At run time, the deep learning framework/platform is deployed onto a specific physical cluster, and the nodes in a TensorFlow cluster are divided into two types: parameter servers and work servers (workers). The parameter server stores the parameters of the model, while the work server, which is assigned GPUs, is responsible for computing the gradients of the parameters. In each iteration, the work server obtains the parameters from the parameter server and then returns the computed gradients to the parameter server; the parameter server aggregates the gradients returned by the work servers, updates the parameters, and broadcasts the new parameters to the work servers.
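For illustration, the following minimal sketch shows one asynchronous parameter-server iteration of the kind described above. The class and function names (ParameterServer, Worker, compute_gradients) are illustrative assumptions, not TensorFlow APIs.

```python
# Minimal sketch of one asynchronous parameter-server iteration (illustrative only).
import numpy as np

def compute_gradients(params, batch):
    # toy loss: mean squared error of a linear model y = x @ w
    x, y = batch
    pred = x @ params["w"]
    return {"w": 2.0 * x.T @ (pred - y) / len(y)}

class ParameterServer:
    def __init__(self, params, lr=0.01):
        self.params = params              # model parameters held by the server
        self.lr = lr

    def pull(self):
        return self.params                # a worker fetches the current parameters

    def push(self, grads):
        for name, g in grads.items():     # aggregate gradients and update parameters
            self.params[name] -= self.lr * g

class Worker:
    def __init__(self, ps):
        self.ps = ps

    def step(self, batch):
        params = self.ps.pull()                    # 1. obtain parameters
        grads = compute_gradients(params, batch)   # 2. compute gradients on this worker
        self.ps.push(grads)                        # 3. return gradients to the server

# one asynchronous iteration on a single worker
ps = ParameterServer({"w": np.zeros(3)})
Worker(ps).step((np.ones((4, 3)), np.ones(4)))
```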
However, when the deep learning model is used for different training tasks, some training tasks need many GPUs while others need only a few, and some special training tasks show periodic patterns in GPU usage, with peaks and valleys, so the GPUs sit idle during parts of some training tasks. Because the number of work servers cannot be adjusted adaptively for different training tasks, the utilization of the GPU cluster is low.
Disclosure of Invention
The embodiments of the present application provide a distributed training method and a distributed training apparatus for a deep learning model, intended to solve the above problems in the related art. The technical scheme is as follows:
in a first aspect, an embodiment of the present application provides a distributed training method for a deep learning model, including:
acquiring training state data corresponding to a training task sent by a deep learning platform;
generating an elastic scaling strategy according to the cluster resource requirements corresponding to the training task;
dynamically adjusting the number of training nodes corresponding to the training task by adopting the elastic scaling strategy;
and executing the training task according to the training state data and the adjusted training nodes.
In one embodiment, acquiring training state data corresponding to a training task sent by a deep learning platform includes:
acquiring a first application program interface sent by a deep learning platform, wherein the first application program interface is obtained by calling the deep learning platform according to a calling instruction sent by a user;
the first application program interface comprises the training state data, the training state data comprise gradients and an update round number N (N ≥ 1), and the gradients are calculated after the training nodes before adjustment complete the N-th round of parameter updates.
In one embodiment, generating an elastic scaling strategy according to cluster resource requirements corresponding to a training task includes:
and generating a first elastic scaling strategy according to the cluster resource requirement sent by the user, wherein the first elastic scaling strategy comprises increasing or decreasing the number of training nodes.
In one embodiment, generating an elastic scaling strategy according to cluster resource requirements corresponding to a training task includes:
and monitoring cluster resources, determining idle resources, and generating a second elastic scaling strategy according to the idle resources, wherein the second elastic scaling strategy comprises increasing the number of training nodes.
In one embodiment, generating an elastic scaling strategy according to cluster resource requirements corresponding to a training task includes:
and monitoring the training nodes corresponding to the training task, determining fault nodes, and generating a third elastic scaling strategy according to the cluster resources released by the fault nodes, wherein the third elastic scaling strategy comprises deleting the fault nodes.
In one embodiment, generating an elastic scaling strategy according to cluster resource requirements corresponding to a training task includes:
and generating a fourth elastic scaling strategy according to the cluster resources required by the training tasks with the priority greater than the threshold, wherein the fourth elastic scaling strategy comprises reducing the number of the training nodes.
In one embodiment, dynamically adjusting the number of training nodes corresponding to a training task by using an elastic scaling strategy includes:
calling a second application program interface, and sending the second application program interface to the computing power platform, so that the computing power platform dynamically adjusts the number of training nodes corresponding to the training task by adopting the elastic scaling strategy; the second application program interface comprises at least one of the first elastic scaling strategy, the second elastic scaling strategy, the third elastic scaling strategy and the fourth elastic scaling strategy.
In one embodiment, for training state data, performing a training task according to the training state data and the adjusted training nodes includes:
and under the condition that the training task is started, controlling the training node before adjustment to execute a master node election process so as to determine the master node, and controlling the master node to execute the step of constructing the communication topology of the training node before adjustment.
In one embodiment, performing a training task based on training state data and adjusted training nodes includes:
in the case of adding training nodes, controlling the master node to execute the step of constructing a new communication topology based on the communication topology of the training nodes before adjustment and the newly added training nodes, and synchronizing the new communication topology and the training state data to the adjusted training nodes;
and the computing power platform schedules the added training nodes to corresponding physical nodes according to the elastic scaling strategy, so that the physical nodes execute the training task on the training state data according to the new communication topology.
In one embodiment, performing a training task based on training state data and adjusted training nodes includes:
in the case of reducing the number of training nodes, controlling the master node to construct a new communication topology based on the adjusted training nodes and to synchronize the new communication topology to the adjusted training nodes;
and the computing power platform schedules the reduced training nodes to corresponding physical nodes according to the elastic scaling strategy, so that the physical nodes execute the training task on the training state data according to the new communication topology.
In one embodiment, the method further comprises:
controlling the main node to store the training state data in a database;
and under the condition that the training node fails to execute the training task, restarting the training node, and loading the training state data in the database to recover to execute the training task.
In one embodiment, the method further comprises:
the control main node establishes partition indexes aiming at a plurality of data partitions, and the data partitions are obtained by dividing a training metadata set required in a training process;
and under the condition that the main node receives a data reading request sent by the training node, controlling the main node to execute a step of configuring a data partition for the training node by using the partition index.
In one embodiment, the method further comprises:
and recording the read times of each data partition, and distributing the data partitions with the read times smaller than the threshold value when the training node executes the training task.
In a second aspect, an embodiment of the present application provides a distributed training apparatus for a deep learning model, including:
the training state data acquisition module is used for acquiring training state data corresponding to a training task sent by the deep learning platform;
the elastic scaling strategy generation module is used for generating an elastic scaling strategy according to the cluster resource requirements corresponding to the training task;
the training node number adjustment module is used for dynamically adjusting the number of training nodes corresponding to the training task by adopting the elastic scaling strategy;
and the training task execution module is used for executing the training task according to the training state data and the adjusted training nodes.
In one embodiment, a training state data acquisition module includes:
the first application program sending submodule is used for obtaining a first application program interface sent by the deep learning platform, and the first application program interface is obtained by calling the deep learning platform according to a calling instruction sent by a user;
the first application program interface comprises the training state data, the training state data comprise gradients and an update round number N (N ≥ 1), and the gradients are calculated after the training nodes before adjustment complete the N-th round of parameter updates.
In one embodiment, the elastic scaling strategy generation module comprises:
and the first strategy generation submodule is used for generating a first elastic scaling strategy according to the cluster resource requirement sent by the user, and the first elastic scaling strategy comprises increasing or decreasing the number of training nodes.
In one embodiment, the elastic scaling strategy generation module comprises:
and the second strategy generation submodule is used for monitoring the cluster resources, determining idle resources and generating a second elastic scaling strategy according to the idle resources, wherein the second elastic scaling strategy comprises increasing the number of training nodes.
In one embodiment, the elastic scaling strategy generation module comprises:
and the third strategy generation submodule is used for monitoring the training nodes corresponding to the training task, determining fault nodes and generating a third elastic scaling strategy according to the cluster resources released by the fault nodes, wherein the third elastic scaling strategy comprises deleting the fault nodes.
In one embodiment, the elastic scaling strategy generation module comprises:
and the fourth strategy generation submodule is used for generating a fourth elastic scaling strategy according to the cluster resources required by the training tasks with the priorities larger than the threshold, wherein the fourth elastic scaling strategy comprises the step of reducing the number of the training nodes.
In one embodiment, the training node number adjusting module includes:
the second application program sending submodule is used for calling a second application program interface and sending the second application program interface to the computing power platform, so that the computing power platform dynamically adjusts the number of training nodes corresponding to the training task by adopting the elastic scaling strategy; the second application program interface comprises at least one of the first elastic scaling strategy, the second elastic scaling strategy, the third elastic scaling strategy and the fourth elastic scaling strategy.
In one embodiment, the training task execution module includes:
and the master node election sub-module is used for controlling the training node before adjustment to execute a master node election process under the condition that the training task is started so as to determine the master node and controlling the master node to execute the step of constructing the communication topology of the training node before adjustment.
In one embodiment, a training task execution module includes:
the first communication topology reconstruction submodule is used for controlling the master node to execute the steps of constructing a new communication topology based on the communication topology of the training nodes before adjustment and the newly added training nodes, and synchronizing the new communication topology and the training state data to the adjusted training nodes;
and the first training task execution submodule is used for the computing power platform to schedule the added training nodes to corresponding physical nodes according to the elastic scaling strategy, so that the physical nodes execute the training task on the training state data according to the new communication topology.
In one embodiment, a training task execution module includes:
the second communication topology reconstruction submodule is used for controlling the master node to construct a new communication topology based on the reduced training nodes and to synchronize the new communication topology to the adjusted training nodes when the number of training nodes is reduced;
and the second training task execution submodule is used for the computing power platform to schedule the reduced training nodes to corresponding physical nodes according to the elastic scaling strategy, so that the physical nodes execute the training task on the training state data according to the new communication topology.
In one embodiment, the method further comprises:
the data storage module is used for controlling the main node to store the training state data in the database;
and the fault-tolerant recovery module is used for restarting the training nodes and loading the training state data in the database to recover the training task under the condition that the training nodes fail to execute the training task.
In one embodiment, the method further comprises:
the index establishing module is used for controlling the main node to establish partition indexes aiming at a plurality of data partitions, and the data partitions are obtained by dividing a training metadata set required in the training process;
and the data partition configuration module is used for controlling the main node to execute the step of configuring the data partition for the training node by using the partition index under the condition that the main node receives the data reading request sent by the training node.
In one embodiment, the method further comprises:
and the data management module is used for recording the read times of each data partition and distributing the data partitions with the read times smaller than the threshold value when the training node executes the training task.
In a third aspect, an electronic device is provided, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the above.
In a fourth aspect, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any of the above.
One embodiment in the above application has the following advantages or benefits: because the elastic scaling strategy is determined according to the cluster resource requirements corresponding to the training task, adaptability to the cluster resource requirements of the training task is improved, and the utilization of GPU or CPU resources is improved. The number of training nodes corresponding to the training task is dynamically adjusted by adopting the elastic scaling strategy, and the training task is executed according to the training state data and the adjusted training nodes, so that the training task can be executed correctly and efficiently with the adjusted training nodes even when training nodes are added or deleted at any time.
Other effects of the above-described alternative will be described below with reference to specific embodiments.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is a schematic diagram of a distributed training method for deep learning models according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a scenario of a distributed training method for a deep learning model according to an embodiment of the present application;
FIG. 3 is a schematic diagram of another method of distributed training of deep learning models according to an embodiment of the present application;
FIG. 4 is a schematic diagram of another method of distributed training of deep learning models according to an embodiment of the present application;
FIG. 5 is a schematic diagram of another method of distributed training of deep learning models according to an embodiment of the present application;
FIG. 6 is a schematic diagram of another distributed training apparatus for deep learning models, according to an embodiment of the present application;
FIG. 7 is a schematic diagram of another distributed training apparatus for deep learning models, according to an embodiment of the present application;
FIG. 8 is a schematic diagram of another distributed training apparatus for deep learning models, according to an embodiment of the present application;
FIG. 9 is a schematic diagram of another distributed training apparatus for deep learning models, according to an embodiment of the present application;
FIG. 10 is a block diagram of an electronic device for implementing a distributed training method for deep learning models according to an embodiment of the present application.
Detailed Description
The following description of the exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments of the application for the understanding of the same, which are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
As shown in fig. 1, the present embodiment provides a distributed training method for a deep learning model, including the following steps:
step S110: acquiring training state data corresponding to a training task sent by a deep learning platform;
step S120: generating an elastic scaling strategy according to the cluster resource requirements corresponding to the training task;
step S130: dynamically adjusting the number of training nodes corresponding to the training task by adopting the elastic scaling strategy;
step S140: and executing the training task according to the training state data and the adjusted training nodes.
In one example, the deep learning platform may include the PS (Parameter Server) architecture native to TensorFlow (a dataflow programming framework), PaddlePaddle, Caffe (Convolutional Architecture for Fast Feature Embedding), and the like. The deep learning platform selects a training task and a pre-trained neural network model, and trains the pre-trained neural network model multiple times using training metadata to obtain the trained neural network model. The training metadata includes: parameters of the pre-trained neural network model, data set indexes, update rounds of the parameters, and the like. For example, a plurality of mini-batches (small batch data sets) are created, and for each mini-batch the parameters (such as weight parameters) are updated multiple times using gradient descent, finally yielding the trained parameters and thus the trained neural network model. In the training process, after the N-th round of parameter updating is completed, the gradient is calculated, and the training state data comprise the gradient and the update round. The deep learning platform (TensorFlow) sends the training state data corresponding to the training task to the elastic scaling device.
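A minimal sketch of what such training state data might look like when handed from the deep learning platform to the elastic scaling device follows; the class and field names are assumptions for illustration, not part of the patent or of TensorFlow.

```python
from dataclasses import dataclass
from typing import Dict
import numpy as np

@dataclass
class TrainingState:
    """Training state reported after the N-th round of parameter updates (illustrative)."""
    task_id: str                        # which training task this state belongs to
    update_round: int                   # N, the completed update round (N >= 1)
    gradients: Dict[str, np.ndarray]    # per-parameter gradients computed in that round
    dataset_offset: int = 0             # where the next mini-batch starts reading

# example: state produced after finishing round 12 of a hypothetical task
state = TrainingState(
    task_id="resnet50-job",
    update_round=12,
    gradients={"conv1/weight": np.zeros((64, 3, 7, 7))},
    dataset_offset=4096,
)
```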
As shown in fig. 2, the elastic scaling device includes a training task management module, an elastic scaling scheduling module, a fault-tolerant recovery module, a data transmission module, and a data management module. Cluster resources refer to CPU resources or GPU resources. The cluster resource requirements may include resource requirements sent by a user, idle resources, resources required by a higher-priority training task, resources released by a failed node, and the like. The cluster resource requirements of different training tasks may differ, and the requirements of the same training task may also differ at different stages or under different conditions. Therefore, the elastic scaling scheduling module generates an elastic scaling strategy according to the cluster resource requirements corresponding to the training task. The elastic scaling strategy comprises a strategy of adding or removing training nodes, and different cluster resource requirements correspond to different elastic scaling strategies.
All computations in TensorFlow are converted into nodes on the computational graph, i.e., training nodes. For example, the training nodes in a TensorFlow cluster include parameter servers and work servers (workers). The training task management module manages all training nodes, and for each training task a master node is elected from the training nodes executing that task. The master node manages the remaining training nodes: for example, it monitors the work servers executing the training task and their training progress, establishes the communication topology between the work servers, and stores the training metadata of model training on the different work servers. The training metadata includes model parameters, data set indexes, model parameter update rounds, and the like.
The elastic scaling scheduling module acquires the training state data sent by the deep learning platform (TensorFlow) and sends the training state data and the elastic scaling strategy to the master node corresponding to the training task. The training task management module controls the master node to execute the step of reconstructing the computation graph according to the adjusted number of training nodes. For example, when the elastic scaling strategy adds training nodes, new training nodes are added and the computation graph is reconstructed without stopping the original computation graph; when the elastic scaling strategy reduces training nodes, the training nodes are deleted and the computation graph is reconstructed, likewise without stopping the original computation graph. The time required for reconstructing the computation graph can be preset, which effectively reduces the delay caused by dynamically adjusting the number of training nodes with the elastic scaling strategy.
The computing power platform dynamically adjusts the number of training nodes corresponding to the training task by adopting the elastic scaling strategy. The computing power platform may include a GPU cluster scheduling and management system such as the K8S system (Kubernetes, a container cluster management system). The K8S system allocates CPU or GPU cluster resources for the adjusted training nodes and dynamically adds or deletes training nodes using the elastic scaling strategy. The physical nodes of the K8S system comprise a master node and nodes (computing nodes), and the scheduler on the master node schedules the adjusted training nodes (workers) onto the nodes according to the elastic scaling strategy. Since the new computation graph corresponding to the adjusted training nodes has already been determined, the training task is executed on the nodes for the training state data according to the new communication topology.
In the present embodiment, the elastic scaling device is designed as a module between TensorFlow and K8S. The elastic scaling device determines the elastic scaling strategy according to the cluster resource requirements corresponding to the training task, which improves adaptability to the cluster resource requirements of the training task and improves GPU resource utilization. The number of training nodes corresponding to the training task is dynamically adjusted by adopting the elastic scaling strategy, and the training task is executed according to the training state data and the adjusted training nodes, so that the training task can be executed correctly and efficiently with the adjusted training nodes even when training nodes are added or deleted at any time.
In one embodiment, as shown in fig. 3, step S110 includes:
step S111: acquiring a first application program interface sent by a deep learning platform, wherein the first application program interface is obtained by calling the deep learning platform according to a calling instruction sent by a user;
the first application program interface comprises the training state data, the training state data comprise gradients and an update round number N (N ≥ 1), and the gradients are calculated after the training nodes before adjustment complete the N-th round of parameter updates.
In one example, the elastic scaling device is imported into Python as a library. The deep learning platform calls the first application program interface according to a calling instruction sent by the user and sends the first application program interface to the elastic scaling device, so that the elastic scaling scheduling module receives it. This approach is highly portable: it provides a simple API to both the cluster and the deep learning framework, and can adapt to various GPU cluster management and scheduling schemes and deep learning frameworks.
In this implementation, the deep learning platform sends the training state data to the elastic scaling device by calling the first application program interface, which effectively avoids intrusive modification of the deep learning platform.
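As a hypothetical sketch of this "first API" hand-off, the training script could report the training state at the end of each update round as shown below; the class and function names are assumptions made for illustration only, not a published library interface.

```python
# Hypothetical first-API sketch: the deep learning platform hands training state
# to the elastic scaling device at the end of each update round.
class ElasticScalingClient:
    """Stand-in for the library that the elastic scaling device exposes to Python."""

    def report_training_state(self, task_id, update_round, gradients):
        # In a real deployment this would forward the state to the elastic scaling
        # scheduling module (e.g. over RPC); here it is simply recorded.
        self.last_report = {
            "task_id": task_id,
            "update_round": update_round,
            "gradients": gradients,
        }

client = ElasticScalingClient()

def on_round_finished(task_id, round_n, gradients):
    # called by the deep learning platform after the N-th round of parameter updates
    client.report_training_state(task_id, round_n, gradients)

on_round_finished("resnet50-job", 12, {"conv1/weight": [0.0]})
```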
In one embodiment, as shown in fig. 3, step S120 includes:
step S121: and generating a first elastic scaling strategy according to the cluster resource requirement sent by the user, wherein the first elastic scaling strategy comprises increasing or decreasing the number of training nodes.
In one example, the display interface provides a setting button for cluster resource requirements, and the user can set the cluster resource requirements adaptively according to the training task. In response to the user clicking the button, the elastic scaling scheduling module generates a first elastic scaling strategy according to the cluster resource requirements sent by the user, wherein the first elastic scaling strategy comprises increasing or decreasing the number of training nodes. The elastic scaling scheduling module sends the first elastic scaling strategy to the computing power platform, such as a K8S system, and the computing power platform allocates CPU or GPU resources for the added training nodes, or reclaims the CPU or GPU resources released by the removed training nodes, according to the first elastic scaling strategy.
In the embodiment, the number of training nodes is increased or decreased according to the cluster resource requirements sent by the user, so that the flexibility and adaptability to the cluster resource requirements in the distributed training process of the deep learning model are realized.
In one embodiment, as shown in fig. 3, step S120 includes:
step S122: and monitoring cluster resources, determining idle resources, and generating a second elastic scaling strategy according to the idle resources, wherein the second elastic scaling strategy comprises increasing the number of training nodes.
In this embodiment, the elastic scaling scheduling module monitors the cluster resources, determines the idle resources, and increases the number of training nodes according to the size of the idle resources to obtain the second elastic scaling strategy. The elastic scaling scheduling module sends the second elastic scaling strategy to the computing power platform, such as a K8S system, and the computing power platform allocates the idle CPU or GPU resources to the added training nodes according to the second elastic scaling strategy, which effectively improves the utilization of idle resources.
In one embodiment, as shown in fig. 3, step S120 includes:
step S123: and monitoring the training nodes corresponding to the training task, determining fault nodes, and generating a third elastic scaling strategy according to the cluster resources released by the fault nodes, wherein the third elastic scaling strategy comprises deleting the fault nodes.
In this embodiment, since a fault node cannot execute the training task, the fault node is deleted to release the corresponding cluster resources, yielding the third elastic scaling strategy. The elastic scaling scheduling module sends the third elastic scaling strategy to the computing power platform, such as a K8S system, and the computing power platform returns the released CPU or GPU resources to the deep learning platform according to the third elastic scaling strategy so that other training tasks can reuse them, which effectively improves the utilization of CPU or GPU resources.
In one embodiment, as shown in fig. 3, step S120 includes:
step S124: and generating a fourth elastic scaling strategy according to the cluster resources required by the training tasks with the priority greater than the threshold, wherein the fourth elastic scaling strategy comprises reducing the number of the training nodes.
In this embodiment, since a training task with a higher priority may need a large amount of cluster resources, the number of training nodes is reduced according to the cluster resources needed by training tasks whose priority is greater than the threshold, and the CPU or GPU resources corresponding to the removed training nodes are released. The elastic scaling scheduling module sends the fourth elastic scaling strategy to the computing power platform, such as a K8S system, and the computing power platform returns the released CPU or GPU resources to the deep learning platform according to the fourth elastic scaling strategy for use by the higher-priority training tasks, which effectively improves the adaptability of CPU or GPU resource allocation.
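Taken together, the four policy sources discussed above (user requirement, idle resources, fault nodes, and high-priority preemption) can be summarized in one simplified decision routine. The sketch below is illustrative only, under assumed inputs and data structures; it is not the patented scheduler itself.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ScalingPolicy:
    kind: str                                   # "user", "idle", "fault" or "priority"
    delta_nodes: int                            # >0: add training nodes, <0: remove
    nodes_to_delete: List[str] = field(default_factory=list)

def generate_policies(user_delta, idle_gpus, failed_nodes, preempted_nodes,
                      gpus_per_node=1):
    """Return one elastic scaling policy per observed cluster condition (simplified)."""
    policies = []
    if user_delta:                               # first policy: user-sent requirement
        policies.append(ScalingPolicy("user", user_delta))
    if idle_gpus >= gpus_per_node:               # second policy: idle resources
        policies.append(ScalingPolicy("idle", idle_gpus // gpus_per_node))
    if failed_nodes:                             # third policy: delete fault nodes
        policies.append(ScalingPolicy("fault", -len(failed_nodes), list(failed_nodes)))
    if preempted_nodes:                          # fourth policy: yield to higher priority
        policies.append(ScalingPolicy("priority", -preempted_nodes))
    return policies

print(generate_policies(user_delta=0, idle_gpus=2,
                        failed_nodes=["worker-3"], preempted_nodes=1))
```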
In one embodiment, as shown in fig. 3, step S130 includes:
step S131: calling a second application program interface, and sending the training state data and the second application program interface to the computing power platform, so that the computing power platform dynamically adjusts the number of training nodes corresponding to the training task by adopting the elastic scaling strategy; the second application program interface comprises at least one of the first elastic scaling strategy, the second elastic scaling strategy, the third elastic scaling strategy and the fourth elastic scaling strategy.
In one example, the elastic scaling scheduling module calls the second application program interface and sends it to the master node among the training nodes, and the master node forwards the second application program interface to the computing power platform. The computing power platform adds or deletes training nodes according to the elastic scaling strategy in the second application program interface. The adjustment is essentially imperceptible to users, and the approach is highly portable and can adapt to various GPU cluster management and scheduling systems and deep learning platforms.
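A hedged sketch of how the computing power platform side might apply such a policy is given below; the method names are assumed interfaces, not Kubernetes or TensorFlow APIs.

```python
def apply_scaling(compute_platform, task_id, policy):
    """Illustrative second-API handler on the computing power platform side."""
    if policy.delta_nodes > 0:
        # scale out: ask the platform to schedule additional worker processes/pods
        compute_platform.add_workers(task_id, count=policy.delta_nodes)
    elif policy.delta_nodes < 0:
        # scale in: remove the named nodes (failed or preempted), releasing GPU/CPU
        compute_platform.remove_workers(task_id,
                                        nodes=policy.nodes_to_delete,
                                        count=-policy.delta_nodes)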
In one embodiment, as shown in FIG. 4, step 140, comprises:
step 141: and under the condition that the training task is started, controlling the training node before adjustment to execute a master node election process so as to determine the master node, and controlling the master node to execute the step of constructing the communication topology of the training node before adjustment.
In one example, when a training task is started, the training task management module controls each training node before adjustment to execute a master node election process to determine the master node. Each training node that joins or exits must notify the master node. The address of the master node serves as the connection information, and a training node connects to the master node by requesting this information. For a ring distributed training structure (ring all-reduce), each work server forms a connection by registering with its neighboring work server, and only the address information of the initial work server needs to be stored each time. For a distributed training structure with parameter management (parameter server / worker), all work servers register with the parameter server and store the address information of the parameter server, which is written into the ETCD, and that node serves as the master node.
The master node needs to refresh its address information periodically; if it does not, the address information automatically expires. If the connection information is invalid or expired, the training nodes elect a master node again: one training node may be selected at random, its address is written into the ETCD (a distributed, consistent key-value store), and that randomly selected training node serves as the master node.
A training node acting as a master node may exit because of the expansion or contraction of the computational graph. Thus, a master node discovery/election process is run in each training node, which initiates election to create a new master node when the master node is not visible to all training nodes. After selecting a master node, other training nodes will connect to the master node and send registration messages to join the training task. During the training task execution, the master node will infer the activity of the training node process from the gradient synchronization request after each mini-batch training.
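The following simplified sketch shows the election/heartbeat logic described above. Here `kv` stands for an ETCD-like key-value client with TTL support; the helper methods put_if_absent, refresh and get are assumptions for illustration, not the actual etcd client API.

```python
import time
import uuid

NODE_ADDR = f"node-{uuid.uuid4().hex[:8]}:2222"
MASTER_KEY = "/train/job-1/master"
TTL = 10  # seconds; the master must refresh within this window or the key expires

def election_loop(kv):
    """Run in every training node: elect a new master whenever the key is missing."""
    while True:
        if kv.get(MASTER_KEY) is None:
            # master address expired or never written: try to become the master
            if kv.put_if_absent(MASTER_KEY, NODE_ADDR, ttl=TTL):
                heartbeat(kv)          # we won the election, keep the key alive
        time.sleep(TTL / 2)

def heartbeat(kv):
    # the elected master periodically refreshes its address; if it stops,
    # the key expires and the remaining training nodes elect a new master
    while kv.refresh(MASTER_KEY, NODE_ADDR, ttl=TTL):
        time.sleep(TTL / 2)
```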
In one embodiment, as shown in FIG. 4, step 140, comprises:
step 142: in the case of adding training nodes, controlling the master node to execute the step of constructing a new communication topology based on the communication topology of the training nodes before adjustment and the newly added training nodes, and synchronizing the new communication topology and the training state data to the adjusted training nodes;
step 143: the computing power platform schedules the added training nodes to corresponding physical nodes according to the elastic scaling strategy, so that the physical nodes execute the training task on the training state data according to the new communication topology.
In one example, to reduce the time spent adjusting training nodes, the elastic scaling scheduling module allows the computation graph to be rebuilt without stopping the ongoing computation, and a new training node process is added when cluster resources are sufficient. In the case of adding training nodes, adding a new training node process to a running training task requires three steps: first, preparing the execution context; second, constructing the communication topology; third, preparing the model. Execution context preparation includes loading dynamic libraries such as cuDNN (the GPU-accelerated library for deep neural networks) and cuBLAS (the CUDA linear algebra library; CUDA, Compute Unified Device Architecture, a general-purpose parallel computing architecture), preparing the training state data, allocating space in GPU memory and CPU memory, and so on, and takes the longest time. A declarative framework like TensorFlow also requires construction and optimization of the computation graph. For communication topology construction, the newly added training node process needs to be linked to the master node, and all training nodes need to form a new ring topology to synchronize the model parameters. The new training node also needs to synchronize to the latest model before starting training.
When a new training node is added, the processes of the training nodes before adjustment do not need to be stopped. Each new training node process starts two independent threads, a main thread and a background thread. The main thread performs context preparation, while the background thread runs the master node election process and sends a registration request to the master node. After the master node receives the registration request of a new training node, the training task management module controls the master node to construct a new communication topology and broadcast it to all training nodes. At this point the communication topology of the training nodes before adjustment is not destroyed, so they can continue training unaffected. When the new training node process has completed context preparation and received the new communication topology, it sends a preparation-complete message to the master node. After the master node receives this message, it monitors the communication topology of the training nodes before adjustment; after these training nodes complete t rounds of mini-batch training, the master node randomly selects one training node to synchronize its parameters to the new training node, notifies all training nodes of the training task to organize themselves according to the new communication topology, and the training task is executed.
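An illustrative sketch of this join sequence follows: the main thread prepares the execution context while a background thread registers with the master node and waits for the new communication topology. All object and method names (master, new_worker and their methods) are assumptions made for illustration.

```python
import threading

def join_training(master, new_worker):
    context_ready = threading.Event()

    def prepare_context():
        new_worker.load_libraries()     # load cuDNN / cuBLAS, allocate GPU and CPU memory
        new_worker.build_graph()        # declarative frameworks also build/optimize the graph
        context_ready.set()

    def register_and_sync():
        master.register(new_worker.address)        # registration request to the master node
        topology = master.wait_for_topology()      # master broadcasts the new ring topology
        context_ready.wait()                       # context preparation must be finished
        master.notify_ready(new_worker.address)    # "preparation complete" message
        params = master.sync_parameters()          # copy the latest model from a chosen peer
        new_worker.start_training(topology, params)

    threading.Thread(target=prepare_context, daemon=True).start()
    threading.Thread(target=register_and_sync, daemon=True).start()
```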
In one embodiment, as shown in FIG. 4, step 140, comprises:
step 144: in the case of reducing the number of training nodes, controlling the master node to construct a new communication topology based on the adjusted training nodes and to synchronize the new communication topology to the adjusted training nodes;
step 145: the computing power platform schedules the reduced training nodes to corresponding physical nodes according to the elastic scaling strategy, so that the physical nodes execute the training task on the training state data according to the new communication topology.
In one example, when the number of training nodes is reduced, for example by deleting fault nodes or removing training nodes according to user requirements, the K8S system terminates and evicts the training nodes once a round of mini-batch training finishes, ensuring that the task execution of the training nodes before adjustment is not affected. When the master node receives the request to reduce training nodes, the training task management module controls the master node to construct a new communication topology and broadcast it to the remaining training nodes. At the same time, after the master node observes that the training nodes before adjustment have completed t rounds of training, the departing training nodes are allowed to exit, and the remaining training nodes start training according to the new communication topology. If the master node itself leaves at this point, it deletes the address through which the training nodes connect to it, so that the training nodes can elect a new master node. The old master node sends the training metadata to the new master node before exiting, and all remaining training nodes connect to the new master node at the scheduled time. In the case of a normal exit, the remaining training nodes do not need to stop and wait for the departing training nodes to exit, so the delay is negligible.
In one embodiment, as shown in fig. 4, the method further includes:
step S161: controlling the main node to store the training state data in a database;
step S162: and under the condition that the training node fails to execute the training task, restarting the training node, and loading the training state data in the database to recover to execute the training task.
In one example, the training task management module controls the master node to periodically write the training state data to a database, such as a persistent store (ETCD). If one training node process fails before all synchronization is completed, the models on the other training node processes can only complete a partial update. Therefore, when training fails, the fault-tolerant recovery module can recover the training task by loading the training state data.
In this embodiment, a consistency recovery strategy is proposed to recover training from a fault; the periodically stored training state data ensures the consistency of model updates to a certain extent. Approximate recovery constructs a new communication topology based on the existing training nodes and continues training.
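A minimal sketch of this checkpoint-and-restore pattern is shown below; the key-value client `kv` and the worker methods are assumed interfaces used only for illustration.

```python
import pickle

STATE_KEY = "/train/job-1/state"

def save_state(kv, training_state):
    # called periodically by the master node, e.g. after every few mini-batch rounds
    kv.put(STATE_KEY, pickle.dumps(training_state))

def recover_after_failure(kv, worker):
    raw = kv.get(STATE_KEY)
    if raw is None:
        return False                       # nothing persisted yet, cannot recover
    worker.restart()                       # restart the failed training node process
    worker.load_state(pickle.loads(raw))   # resume from the last saved update round
    return True
```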
In one embodiment, as shown in fig. 5, the method further includes:
step S171: the control main node establishes partition indexes aiming at a plurality of data partitions, and the data partitions are obtained by dividing a training metadata set required in a training process;
step S172: and under the condition that the main node receives a data reading request sent by the training node, controlling the main node to execute a step of configuring a data partition for the training node by using the partition index.
In one example, the data set is logically divided into a plurality of data partitions, and the training task management module controls the master node to establish a partition index of the data set. When a training node needs a new data partition, it sends a data reading request to the master node, and the master node responds with the metadata (file size, path, offset and the like) of an unallocated partition, so that the training node reads the data of that partition. Data partitions are dynamically allocated to training nodes according to the training needs of the different training nodes. The data transmission module keeps the GPU or CPU training in a relatively saturated state at all times, and it ensures that each epoch traverses the data set exactly once.
To track the progress of data allocation, each training node records an offset in its current task indicating where the next batch should start reading from the current data partition. The training node synchronizes the offset and the model parameters to the master node at the end of each round of mini-batch training. When new training nodes join a training task, the master node only needs to assign them some unprocessed partition data. When a training node process leaves under normal exit conditions, it reports the metadata of its current partition and its offset within that partition to the master node, so that the master node can assign the unprocessed data remaining in the partition to another training node process. If the master node needs to leave, it sends the partition index list and the training progress of all training nodes to the new master node before it exits. The master node also stores the partition index list and the data usage progress of the different training nodes in a persistent store (ETCD), so that the training task can be recovered promptly when training fails.
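An illustrative sketch of the master-side partition index and offset tracking follows; the Partition fields and the PartitionIndex methods are assumptions for illustration, not the patented implementation.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Partition:
    path: str         # file path of the partition
    size: int         # file size in bytes
    offset: int = 0   # where the next batch should start reading

class PartitionIndex:
    def __init__(self, partitions: List[Partition]):
        self.free: List[Partition] = list(partitions)
        self.assigned: Dict[str, Partition] = {}      # worker address -> partition

    def request(self, worker_addr: str) -> Partition:
        # respond to a data reading request with an unallocated partition's metadata
        part = self.free.pop(0)
        self.assigned[worker_addr] = part
        return part

    def report_progress(self, worker_addr: str, offset: int):
        # workers sync their offset (and model parameters) after each mini-batch round
        self.assigned[worker_addr].offset = offset

    def release(self, worker_addr: str, offset: int):
        # a normally exiting worker returns its partially read partition,
        # so the unprocessed remainder can be reassigned to another worker
        part = self.assigned.pop(worker_addr)
        part.offset = offset
        self.free.append(part)

index = PartitionIndex([Partition("/data/part-000", 1 << 20),
                        Partition("/data/part-001", 1 << 20)])
p = index.request("worker-0:2222")
index.release("worker-0:2222", offset=4096)
```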
In one embodiment, as shown in fig. 5, the method further includes:
step S180: and recording the read times of each data partition, and distributing the data partitions with the read times smaller than the threshold value when the training node executes the training task.
In one example, to track the progress of data allocation, a dedicated data management module is provided and is monitored by the elastic scaling scheduling module. It mainly records how each data block is read, such as the number of times it has been read, and is updated as the training nodes read data. When a new training node joins a training task, the data management module allocates to it the partition data that has been read fewer times.
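A tiny sketch of this read-count-based allocation is shown below; the data and function name are purely illustrative assumptions.

```python
def pick_partition(read_counts, threshold=2):
    # prefer partitions read fewer times than the threshold; otherwise take the least-read one
    candidates = [p for p, n in read_counts.items() if n < threshold]
    if not candidates:
        candidates = list(read_counts)
    chosen = min(candidates, key=read_counts.get)
    read_counts[chosen] += 1                  # updated as the training node reads data
    return chosen

counts = {"part-000": 3, "part-001": 1, "part-002": 0}
print(pick_partition(counts))   # -> "part-002"
```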
In another embodiment, as shown in fig. 6, a distributed training apparatus for deep learning model is provided, including:
a training state data obtaining module 110, configured to obtain training state data corresponding to a training task sent by a deep learning platform;
an elastic scaling strategy generation module 120, configured to generate an elastic scaling strategy according to the cluster resource requirements corresponding to the training task;
a training node number adjusting module 130, configured to dynamically adjust the number of training nodes corresponding to the training task by using an elastic scaling strategy;
and a training task executing module 140, configured to execute a training task according to the training state data and the adjusted training node.
In one embodiment, as shown in fig. 7, the training state data acquisition module 110 includes:
the first application program sending submodule 111 is used for obtaining a first application program interface sent by the deep learning platform, and the first application program interface is obtained by calling the deep learning platform according to a calling instruction sent by a user;
the first application program interface comprises training state data, the training state data comprise gradients and an updating round number N (N is larger than or equal to 1), and the gradients are calculated after the training nodes before adjustment complete the updating of the Nth round parameters.
In one embodiment, as shown in fig. 7, the elastic scaling strategy generation module 120 includes:
the first policy generation submodule 121 is configured to generate a first elastic scaling policy according to a cluster resource requirement sent by a user, where the first elastic scaling policy includes increasing or decreasing the number of training nodes.
In one embodiment, the elastic scaling strategy generation module 120 includes:
and the second policy generation sub-module 122 is configured to monitor the cluster resources, determine idle resources, and generate a second elastic scaling policy according to the idle resources, where the second elastic scaling policy includes increasing the number of training nodes.
In one embodiment, as shown in fig. 7, the elastic scaling strategy generation module 120 includes:
and the third policy generation sub-module 123 is configured to monitor the training nodes corresponding to the training tasks, determine a fault node, and generate a third elastic scaling policy according to the cluster resource released by the fault node, where the third elastic scaling policy includes deleting the fault node.
In one embodiment, as shown in fig. 7, the elastic scaling strategy generation module 120 includes:
and the fourth policy generation sub-module 124 is configured to generate a fourth elastic scaling policy according to the cluster resources required by the training tasks with the priorities greater than the threshold, where the fourth elastic scaling policy includes reducing the number of training nodes.
In one embodiment, as shown in fig. 7, the training node number adjusting module 130 includes:
the second application program sending submodule 131 is configured to call a second application program interface and send the second application program interface to the computing power platform, so that the computing power platform dynamically adjusts the number of training nodes corresponding to the training task by using the elastic scaling strategy; the second application program interface includes at least one of the first elastic scaling strategy, the second elastic scaling strategy, the third elastic scaling strategy and the fourth elastic scaling strategy.
In one embodiment, as shown in FIG. 8, the training task execution module 140 includes:
the master node election sub-module 141, when the training task is started, controls the training node before adjustment to execute a master node election process to determine the master node, and controls the master node to execute the step of constructing the communication topology of the training node before adjustment.
In one embodiment, as shown in FIG. 8, the training task execution module 140 includes:
a first communication topology reconstruction sub-module 142, configured to execute, by the master node, a step of constructing a new communication topology based on the communication topology of the training node before adjustment and the newly added training node, and synchronizing the new communication topology and the training state data with the adjusted training node;
and the first training task execution submodule 143 is configured to schedule the increased training nodes to corresponding physical nodes by the computing platform according to an elastic scaling strategy, so that the physical nodes execute the training tasks according to the new communication topology for the training state data.
In one embodiment, as shown in FIG. 8, the training task execution module 140 includes:
The second communication topology rebuilding submodule 144 is configured to, when the number of training nodes is reduced, control the master node to construct a new communication topology based on the reduced training nodes and synchronize the new communication topology to the adjusted training nodes;
The second training task execution submodule 145 is configured to cause the computing power platform to schedule the reduced training nodes to corresponding physical nodes according to the elastic scaling strategy, so that the physical nodes execute the training task on the training state data according to the new communication topology.
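Both rebuild paths above (nodes added, nodes removed) can be expressed as one master-side routine: compute the post-adjustment node set, derive a new topology, and return it together with the training state data (gradient and update round) so that it can be synchronized to every remaining node. The sketch reuses the helpers from the previous sketch; how the broadcast to the nodes is actually performed is not specified here and is left as an assumption.

    from typing import Dict, List, Tuple

    def rebuild_communication_topology(old_nodes: List[str],
                                       added_nodes: List[str],
                                       removed_nodes: List[str],
                                       training_state: dict) -> Tuple[Dict[str, str], dict]:
        # Post-adjustment node set: drop removed or fault nodes, then append new ones.
        remaining = [n for n in old_nodes if n not in set(removed_nodes)]
        new_nodes = remaining + [n for n in added_nodes if n not in remaining]
        # New ring topology built by the master (see build_ring_topology above).
        topology = build_ring_topology(new_nodes)
        # Newly added nodes also receive the gradient and update round N, so training
        # resumes from the current step instead of restarting from scratch.
        return topology, dict(training_state)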
In one embodiment, as shown in fig. 8, the apparatus further includes:
The data storage module 161 is configured to control the master node to store the training state data in a database;
The fault-tolerant recovery module 162 is configured to, when a training node fails while executing the training task, restart the training node and load the training state data from the database to resume the training task.
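A minimal fault-tolerance sketch follows, assuming the database is a local SQLite file and the training state is JSON-serializable; both choices are illustrative rather than part of the described apparatus. The master saves the state; a restarted training node reloads it to resume the task.

    import json
    import sqlite3

    def save_training_state(db_path: str, task_id: str, state: dict) -> None:
        # Master node: persist gradient, update round N, etc. for the training task.
        con = sqlite3.connect(db_path)
        con.execute("CREATE TABLE IF NOT EXISTS training_state "
                    "(task_id TEXT PRIMARY KEY, state TEXT)")
        con.execute("INSERT OR REPLACE INTO training_state VALUES (?, ?)",
                    (task_id, json.dumps(state)))
        con.commit()
        con.close()

    def load_training_state(db_path: str, task_id: str) -> dict:
        # Restarted training node: reload the last saved state and resume training.
        con = sqlite3.connect(db_path)
        row = con.execute("SELECT state FROM training_state WHERE task_id = ?",
                          (task_id,)).fetchone()
        con.close()
        return json.loads(row[0]) if row else {}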
In one embodiment, as shown in fig. 9, the apparatus further includes:
The index establishing module 171 is configured to control the master node to establish partition indexes for multiple data partitions, where the data partitions are obtained by dividing the training metadata set required in the training process;
The data partition configuring module 172 is configured to, when the master node receives a data reading request sent by a training node, control the master node to configure a data partition for the training node by using the partition index.
In one embodiment, as shown in fig. 9, the apparatus further includes:
The data management module 180 is configured to record the number of times each data partition is read, and to allocate, when the training node executes the training task, data partitions whose read counts are less than the threshold.
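The partition index and the read-count rule above can be combined into one small structure on the master: each partition's read count is tracked, and a node requesting data is handed a partition whose count is still below the threshold, so partitions are consumed evenly. The class and method names below are assumptions for illustration. For example, PartitionIndex(["part-0", "part-1"]).assign("node-a") would return "part-0" on the first request.

    from typing import List, Optional

    class PartitionIndex:
        def __init__(self, partition_ids: List[str], read_threshold: int = 1):
            # Partition index built by the master over the divided training metadata set.
            self.read_counts = {pid: 0 for pid in partition_ids}
            self.read_threshold = read_threshold

        def assign(self, node_id: str) -> Optional[str]:
            # Serve a data reading request: pick the least-read partition that is
            # still under the threshold, and record the read.
            candidates = [p for p, c in self.read_counts.items() if c < self.read_threshold]
            if not candidates:
                return None  # every partition has reached the threshold for this pass
            chosen = min(candidates, key=lambda p: self.read_counts[p])
            self.read_counts[chosen] += 1
            return chosen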
For the functions of the modules in the apparatuses of the above embodiments, reference may be made to the corresponding description of the method above; details are not repeated here.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 10 is a block diagram of an electronic device for a distributed training method of a deep learning model according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 10, the electronic apparatus includes: one or more processors 1001, memory 1002, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used together with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 10, one processor 1001 is taken as an example.
The memory 1002 is a non-transitory computer readable storage medium provided herein. The memory stores instructions executable by the at least one processor to cause the at least one processor to perform a method for distributed training of deep learning models provided herein. A non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform a method of distributed training of a deep learning model as provided herein.
The memory 1002 may be used as a non-transitory computer readable storage medium for storing non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the distributed training method of the deep learning model in the embodiment of the present application (for example, the training state data obtaining module 110, the elastic scaling strategy generating module 120, the training node number adjusting module 130, and the training task executing module 140 shown in fig. 6). The processor 1001 executes various functional applications of the server and data processing by running non-transitory software programs, instructions, and modules stored in the memory 1002, that is, implements a distributed training method of a deep learning model in the above method embodiments.
The memory 1002 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created by use of an electronic device according to a distributed training method of a deep learning model, and the like. Further, the memory 1002 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 1002 may optionally include memory located remotely from the processor 1001, which may be connected over a network to an electronic device for a distributed training method for deep learning models. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
An electronic device for the distributed training method of a deep learning model may further include: an input device 1003 and an output device 1004. The processor 1001, the memory 1002, the input device 1003, and the output device 1004 may be connected by a bus or other means; in fig. 10, connection by a bus is taken as an example.
The input device 1003 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device, and may be, for example, a touch screen, keypad, mouse, track pad, touch pad, pointing stick, one or more mouse buttons, track ball, joystick, or other input device. The output device 1004 may include a display device, auxiliary lighting devices (e.g., LEDs), tactile feedback devices (e.g., vibrating motors), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light-emitting diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present application can be achieved; no limitation is imposed herein.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (28)

1. A distributed training method of a deep learning model is characterized by comprising the following steps:
acquiring training state data corresponding to a training task sent by a deep learning platform;
generating an elastic scaling strategy according to the cluster resource requirement corresponding to the training task;
dynamically adjusting the number of training nodes corresponding to the training task by adopting the elastic scaling strategy;
and executing the training task according to the training state data and the adjusted training nodes.
2. The method according to claim 1, wherein the acquiring training state data corresponding to the training task sent by the deep learning platform comprises:
acquiring a first application program interface sent by the deep learning platform, wherein the first application program interface is obtained by calling the deep learning platform according to a calling instruction sent by a user;
the first application program interface comprises the training state data, the training state data comprise a gradient and an update round N (N is greater than or equal to 1), and the gradient is calculated after the training node before adjustment completes the N-th round of parameter updating.
3. The method of claim 1, wherein generating an elastic scaling strategy according to cluster resource requirements corresponding to the training task comprises:
and generating a first elastic scaling strategy according to the cluster resource requirement sent by the user, wherein the first elastic scaling strategy comprises increasing or decreasing the number of the training nodes.
4. The method of claim 1, wherein generating an elastic scaling strategy according to cluster resource requirements corresponding to the training task comprises:
monitoring the cluster resources, determining idle resources, and generating a second elastic scaling strategy according to the idle resources, wherein the second elastic scaling strategy comprises increasing the number of the training nodes.
5. The method of claim 1, wherein generating an elastic scaling strategy according to cluster resource requirements corresponding to the training task comprises:
monitoring training nodes corresponding to the training tasks, determining fault nodes, and generating a third elastic scaling strategy according to cluster resources released by the fault nodes, wherein the third elastic scaling strategy comprises deleting the fault nodes.
6. The method of claim 1, wherein generating an elastic scaling strategy according to cluster resource requirements corresponding to the training task comprises:
and generating a fourth elastic scaling strategy according to the cluster resources required by the training tasks with the priorities larger than the threshold, wherein the fourth elastic scaling strategy comprises reducing the number of the training nodes.
7. The method according to any one of claims 3-6, wherein dynamically adjusting the number of training nodes corresponding to the training task using the elastic scaling strategy comprises:
calling a second application program interface, and sending the second application program interface to a computing power platform, so that the computing power platform dynamically adjusts the number of training nodes corresponding to the training task by adopting the elastic scaling strategy; wherein the second application program interface comprises at least one of the first elastic scaling strategy, the second elastic scaling strategy, the third elastic scaling strategy, and the fourth elastic scaling strategy.
8. The method of claim 7, wherein performing the training task based on the training state data and the adjusted training nodes comprises:
and under the condition that the training task is started, controlling the training node before adjustment to execute a master node election process so as to determine a master node, and controlling the master node to execute the step of constructing the communication topology of the training node before adjustment.
9. The method of claim 8, wherein performing the training task based on the training state data and the adjusted training nodes comprises:
under the condition of adding the training node, controlling the main node to execute a step of constructing a new communication topology based on the communication topology of the training node before adjustment and the newly added training node, and synchronizing the new communication topology and the training state data with the adjusted training node;
and the computing power platform schedules the added training nodes to corresponding physical nodes according to the elastic scaling strategy, so that the physical nodes execute the training tasks according to the new communication topology for the training state data.
10. The method of claim 8, wherein performing the training task based on the training state data and the adjusted training nodes comprises:
under the condition of reducing the number of the training nodes, controlling the main node to construct a new communication topology based on the adjusted training nodes, and synchronizing the new communication topology with the adjusted training nodes;
and the computing power platform dispatches the reduced training nodes to corresponding physical nodes according to the elastic scaling strategy, so that the physical nodes execute the training tasks according to the new communication topology for the training state data.
11. The method of claim 8, further comprising:
controlling the master node to store the training state data in a database;
and under the condition that the training node fails to execute the training task, restarting the training node, and loading training state data in the database to recover to execute the training task.
12. The method of claim 1, further comprising:
controlling the master node to establish partition indexes for a plurality of data partitions, wherein the data partitions are obtained by dividing a training metadata set required in a training process;
and under the condition that the main node receives a data reading request sent by the training node, controlling the main node to execute a step of configuring data partitions for the training node by using the partition indexes.
13. The method of claim 11, further comprising:
and recording the read times of each data partition, and distributing the data partitions with the read times smaller than a threshold value when the training node executes the training task.
14. A distributed training apparatus for deep learning models, comprising:
the training state data acquisition module is used for acquiring training state data corresponding to a training task sent by the deep learning platform;
the elastic scaling strategy generating module is used for generating an elastic scaling strategy according to the cluster resource requirement corresponding to the training task;
the training node number adjusting module is used for dynamically adjusting the number of training nodes corresponding to the training task by adopting the elastic scaling strategy;
and the training task execution module is used for executing the training task according to the training state data and the adjusted training nodes.
15. The apparatus of claim 14, wherein the training status data acquisition module comprises:
the first application program sending submodule is used for obtaining a first application program interface sent by the deep learning platform, and the first application program interface is obtained by calling the deep learning platform according to a calling instruction sent by a user;
the first application program interface comprises the training state data, the training state data comprise a gradient and an update round N (N is greater than or equal to 1), and the gradient is calculated after the training node before adjustment completes the N-th round of parameter updating.
16. The apparatus of claim 14, wherein the elastic scaling strategy generation module comprises:
and the first strategy generation submodule is used for generating a first elastic scaling strategy according to the cluster resource requirement sent by the user, wherein the first elastic scaling strategy comprises increasing or decreasing the number of the training nodes.
17. The apparatus of claim 14, wherein the elastic scaling strategy generation module comprises:
and the second strategy generation submodule is used for monitoring the cluster resources, determining idle resources and generating a second elastic scaling strategy according to the idle resources, wherein the second elastic scaling strategy comprises the step of increasing the number of the training nodes.
18. The apparatus of claim 14, wherein the elastic scaling strategy generation module comprises:
and the third strategy generation submodule is used for monitoring the training nodes corresponding to the training tasks, determining fault nodes and generating a third elastic scaling strategy according to the cluster resources released by the fault nodes, wherein the third elastic scaling strategy comprises deleting the fault nodes.
19. The apparatus of claim 14, wherein the elastic scaling strategy generation module comprises:
and the fourth strategy generation submodule is used for generating a fourth elastic scaling strategy according to the cluster resources required by the training tasks with the priorities larger than the threshold, wherein the fourth elastic scaling strategy comprises the step of reducing the number of the training nodes.
20. The apparatus of any one of claims 16-19, wherein the training node number adjustment module comprises:
the second application program sending submodule is used for calling a second application program interface and sending the second application program interface to a computing power platform, so that the computing power platform dynamically adjusts the number of training nodes corresponding to the training tasks by adopting the elastic scaling strategy; wherein the second application program interface comprises at least one of the first elastic scaling strategy, the second elastic scaling strategy, the third elastic scaling strategy, and the fourth elastic scaling strategy.
21. The apparatus of claim 20, wherein the training task performing module comprises:
and the master node election sub-module is used for controlling the training node before adjustment to execute a master node election process under the condition that the training task is started so as to determine the master node and controlling the master node to execute the step of constructing the communication topology of the training node before adjustment.
22. The apparatus of claim 21, wherein the training task performing module comprises:
a first communication topology rebuilding sub-module, configured to, in a case where the training node is added, execute a step of building a new communication topology based on the communication topology of the training node before adjustment and the newly added training node, and synchronize the new communication topology and the training state data with the adjusted training node;
and the first training task execution submodule is used for dispatching the added training nodes to corresponding physical nodes by the computing power platform according to the elastic scaling strategy, so as to enable the physical nodes to execute the training tasks according to the new communication topology for the training state data.
23. The apparatus of claim 21, wherein the training task performing module comprises:
the second communication topology rebuilding submodule is used for controlling the main node to build a new communication topology based on the reduced training nodes and synchronizing the new communication topology to the adjusted training nodes under the condition that the number of the training nodes is reduced;
and the second training task execution submodule is used for dispatching the reduced training nodes to corresponding physical nodes by the computing power platform according to the elastic scaling strategy, so as to enable the physical nodes to execute the training tasks according to the new communication topology for the training state data.
24. The apparatus of claim 21, further comprising:
the data storage module is used for controlling the main node to store the training state data into a database;
and the fault-tolerant recovery module is used for restarting the training nodes and loading the training state data in the database to recover and execute the training tasks under the condition that the training nodes fail to execute the training tasks.
25. The apparatus of claim 21, further comprising:
the index establishing module is used for controlling the main node to establish partition indexes aiming at a plurality of data partitions, and the data partitions are obtained by dividing a training metadata set required in a training process;
and the data partition configuration module is used for controlling the main node to execute the step of configuring the data partition for the training node by using the partition index under the condition that the main node receives the data reading request sent by the training node.
26. The apparatus of claim 24, further comprising:
and the data management module is used for recording the read times of the data partitions and distributing the data partitions with the read times smaller than a threshold value when the training node executes the training task.
27. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-13.
28. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-13.
CN202010806430.8A 2020-08-12 2020-08-12 Distributed training method and device for deep learning model Pending CN112000473A (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN202010806430.8A CN112000473A (en) 2020-08-12 2020-08-12 Distributed training method and device for deep learning model
TW110102762A TWI783355B (en) 2020-08-12 2021-01-26 Distributed training method and apparatus of deep learning model
PCT/CN2021/080496 WO2022033024A1 (en) 2020-08-12 2021-03-12 Distributed training method and apparatus of deep learning model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010806430.8A CN112000473A (en) 2020-08-12 2020-08-12 Distributed training method and device for deep learning model

Publications (1)

Publication Number Publication Date
CN112000473A true CN112000473A (en) 2020-11-27

Family

ID=73463950

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010806430.8A Pending CN112000473A (en) 2020-08-12 2020-08-12 Distributed training method and device for deep learning model

Country Status (3)

Country Link
CN (1) CN112000473A (en)
TW (1) TWI783355B (en)
WO (1) WO2022033024A1 (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114675965B (en) * 2022-03-10 2023-05-02 北京百度网讯科技有限公司 Federal learning method, apparatus, device and medium
US11811676B2 (en) * 2022-03-30 2023-11-07 International Business Machines Corporation Proactive auto-scaling
CN114741175A (en) * 2022-04-15 2022-07-12 支付宝(杭州)信息技术有限公司 Task execution method and device, central node and downstream node equipment
CN114764601B (en) * 2022-05-05 2024-01-30 北京瑞莱智慧科技有限公司 Gradient data fusion method, device and storage medium
CN115031363B (en) * 2022-05-27 2023-11-28 约克广州空调冷冻设备有限公司 Method and device for predicting air conditioner performance
CN115048216B (en) * 2022-05-31 2024-06-04 苏州浪潮智能科技有限公司 Resource management scheduling method, device and equipment of artificial intelligent cluster
CN114756385B (en) * 2022-06-16 2022-09-02 合肥中科类脑智能技术有限公司 Elastic distributed training method under deep learning scene
CN114971079B (en) * 2022-06-29 2024-05-28 中国工商银行股份有限公司 Second killing type transaction processing optimization method and device
CN115829053B (en) * 2022-11-25 2023-09-19 北京百度网讯科技有限公司 Model operation strategy determination method and device, electronic equipment and storage medium
CN116483546B (en) * 2023-06-21 2023-09-05 苏州浪潮智能科技有限公司 Distributed training task scheduling method, device, equipment and storage medium
CN116541338B (en) * 2023-06-27 2023-11-03 苏州浪潮智能科技有限公司 Computing system, model training method, device and product
CN116934572B (en) * 2023-09-18 2024-03-01 荣耀终端有限公司 Image processing method and apparatus
CN116954873B (en) * 2023-09-21 2024-01-23 浪潮电子信息产业股份有限公司 Heterogeneous computing system, and method, device, equipment and medium for selecting power nodes of heterogeneous computing system
CN117332878B (en) * 2023-10-31 2024-04-16 慧之安信息技术股份有限公司 Model training method and system based on ad hoc network system
CN117785446A (en) * 2023-12-18 2024-03-29 慧之安信息技术股份有限公司 K8s storage resource allocation method and system based on elastic resource allocation strategy

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11087234B2 (en) * 2016-01-29 2021-08-10 Verizon Media Inc. Method and system for distributed deep machine learning
CN108874779B (en) * 2018-06-21 2021-09-21 东北大学 Control method of graph-based poetry writing system established based on K8s cluster
CN110659127A (en) * 2018-06-29 2020-01-07 杭州海康威视数字技术股份有限公司 Method, device and system for processing task
CN110738322B (en) * 2018-07-03 2023-06-02 杭州海康威视数字技术股份有限公司 Distributed training method, device, equipment and system
CN109189401A (en) * 2018-07-06 2019-01-11 曙光信息产业(北京)有限公司 A kind of dispositions method and system of deep learning frame
US10922138B2 (en) * 2018-10-30 2021-02-16 Google Llc Resource conservation for containerized systems
CN109559734B (en) * 2018-12-18 2022-02-18 百度在线网络技术(北京)有限公司 Acceleration method and device for acoustic model training
CN110502340A (en) * 2019-08-09 2019-11-26 广东浪潮大数据研究有限公司 A kind of resource dynamic regulation method, device, equipment and storage medium
CN111090456A (en) * 2019-12-06 2020-05-01 浪潮(北京)电子信息产业有限公司 Construction method, device, equipment and medium for deep learning development environment
CN112000473A (en) * 2020-08-12 2020-11-27 中国银联股份有限公司 Distributed training method and device for deep learning model

Cited By (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022033024A1 (en) * 2020-08-12 2022-02-17 中国银联股份有限公司 Distributed training method and apparatus of deep learning model
CN112463056A (en) * 2020-11-28 2021-03-09 苏州浪潮智能科技有限公司 Multi-node distributed training method, device, equipment and readable medium
US11954527B2 (en) 2020-12-09 2024-04-09 Industrial Technology Research Institute Machine learning system and resource allocation method thereof
TWI756974B (en) * 2020-12-09 2022-03-01 財團法人工業技術研究院 Machine learning system and resource allocation method thereof
CN112463340A (en) * 2020-12-10 2021-03-09 武汉工程大学 Tensorflow-based multi-task flexible scheduling method and system
CN112416602B (en) * 2020-12-10 2022-09-16 清华大学 Distributed data stream resource elastic expansion enhancing plug-in and enhancing method
CN112416602A (en) * 2020-12-10 2021-02-26 清华大学 Distributed data stream resource elastic expansion enhancing plug-in and enhancing method
CN112596863A (en) * 2020-12-28 2021-04-02 南方电网深圳数字电网研究院有限公司 Method, system and computer storage medium for monitoring training tasks
CN112596863B (en) * 2020-12-28 2024-06-07 南方电网数字平台科技(广东)有限公司 Method, system and computer storage medium for monitoring training task
CN112860400A (en) * 2021-02-09 2021-05-28 山东英信计算机技术有限公司 Method, system, device and medium for processing distributed training task
EP3955174A3 (en) * 2021-03-10 2022-05-04 Beijing Baidu Netcom Science And Technology Co. Ltd. Method, apparatus and storage medium for training a deep learning framework
WO2022228060A1 (en) * 2021-04-29 2022-11-03 华为技术有限公司 Data processing method, apparatus, and system
CN113505520A (en) * 2021-05-17 2021-10-15 京东科技控股股份有限公司 Method, device and system for supporting heterogeneous federated learning
WO2022246833A1 (en) * 2021-05-28 2022-12-01 Huawei Cloud Computing Technologies Co., Ltd. System, method, and medium for elastic allocation of resources for deep learning jobs
CN113504966A (en) * 2021-06-22 2021-10-15 中国科学院计算技术研究所 GPU cluster scheduling strategy simulation method and GPU cluster simulator
CN113504966B (en) * 2021-06-22 2023-10-31 中国科学院计算技术研究所 GPU cluster scheduling strategy simulation method and GPU cluster simulator
CN113326116A (en) * 2021-06-30 2021-08-31 北京九章云极科技有限公司 Data processing method and system
CN113703980A (en) * 2021-08-31 2021-11-26 西安电子科技大学 Distributed machine learning system and communication scheduling method suitable for same
WO2023029632A1 (en) * 2021-09-02 2023-03-09 华为技术有限公司 Model training method and system, and server and chip
CN114169427B (en) * 2021-12-06 2022-10-04 北京百度网讯科技有限公司 Distributed training method, device and equipment based on end-to-end self-adaptation
CN114169427A (en) * 2021-12-06 2022-03-11 北京百度网讯科技有限公司 Distributed training method, device and equipment based on end-to-end self-adaptation
CN114327886A (en) * 2021-12-24 2022-04-12 国家石油天然气管网集团有限公司 Dynamic resource scheduling method based on big data deep learning
CN114816669A (en) * 2022-04-29 2022-07-29 北京百度网讯科技有限公司 Distributed training method and data processing method of model
CN114979141A (en) * 2022-05-13 2022-08-30 北京百度网讯科技有限公司 Task processing method, device, equipment and storage medium
CN114979141B (en) * 2022-05-13 2024-04-26 北京百度网讯科技有限公司 Task processing method, device, equipment and storage medium
CN114820279A (en) * 2022-05-18 2022-07-29 北京百度网讯科技有限公司 Distributed deep learning method and device based on multiple GPUs and electronic equipment
WO2023226284A1 (en) * 2022-05-26 2023-11-30 鹏城实验室 Deep learning model training method and apparatus, device and storage medium
CN116089477A (en) * 2023-04-10 2023-05-09 荣耀终端有限公司 Distributed training method and system
CN116089477B (en) * 2023-04-10 2023-08-08 荣耀终端有限公司 Distributed training method and system
CN116523030A (en) * 2023-06-30 2023-08-01 支付宝(杭州)信息技术有限公司 Method and device for training resources by dynamic scheduling model
CN116523030B (en) * 2023-06-30 2023-09-15 支付宝(杭州)信息技术有限公司 Method and device for training resources by dynamic scheduling model

Also Published As

Publication number Publication date
TWI783355B (en) 2022-11-11
WO2022033024A1 (en) 2022-02-17
TW202207030A (en) 2022-02-16

Similar Documents

Publication Publication Date Title
CN112000473A (en) Distributed training method and device for deep learning model
CN112379995B (en) DAG-based unitized distributed scheduling system and method
CN106919445B (en) Method and device for scheduling containers in cluster in parallel
US10649806B2 (en) Elastic management of machine learning computing
CN106575247B (en) Fault-tolerant federation of computing clusters
CN109412874B (en) Equipment resource configuration method, device, server and storage medium
CN113569987A (en) Model training method and device
CN114020470B (en) Resource allocation method and device, readable medium and electronic equipment
CN111562969B (en) Intelligent contract implementation method, device, equipment and medium for block chain
JPWO2007072544A1 (en) Information processing apparatus, computer, resource allocation method, and resource allocation program
CN111182019B (en) Cross-platform communication method and device and electronic equipment
KR20210036874A (en) Method and apparatus for processing development machine operation task, device and storage medium
CN112445615A (en) Thread scheduling system, computer equipment and storage medium
CN111427675A (en) Data processing method and device and computer readable storage medium
CN115686346A (en) Data storage method and device and computer readable storage medium
CN116562054A (en) Construction method and device of multi-entity collaborative real-time simulation system
US11656914B2 (en) Anticipating future resource consumption based on user sessions
CN112527451A (en) Management method, device, equipment and storage medium of container resource pool
US10635336B1 (en) Cache-based partition allocation
CN114564305A (en) Control method, device and equipment for distributed inference and readable storage medium
CN114090201A (en) Resource scheduling method, device, equipment and storage medium
CN110874256B (en) Computing cluster updating system, method and device
CN113515356A (en) Lightweight distributed resource management and task scheduler and method
CN114780170B (en) Container resource configuration method, device, equipment and storage medium
CN111045804A (en) Task allocation method, device, medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40033641

Country of ref document: HK