CN116450312A - Scheduling strategy determination method and system for pipeline parallel training

Info

Publication number: CN116450312A
Application number: CN202310222902.9A
Authority: CN (China)
Prior art keywords: training, computing node, candidate, computing, scheduling strategy
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 王思宇, 刁岚松, 曹宗雁, 佀畅, 林伟
Current Assignee: Alibaba China Co Ltd
Original Assignee: Alibaba China Co Ltd
Application filed by Alibaba China Co Ltd
Priority to CN202310222902.9A
Publication of CN116450312A

Classifications

    • G06F9/4881: Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G06F9/5011: Allocation of resources to service a request, the resources being hardware resources other than CPUs, servers and terminals
    • G06F9/5027: Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G06N3/08: Neural network learning methods
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a scheduling strategy determining method and system for pipeline parallel training, and belongs to the technical field of cloud computing. The method comprises the following steps: determining a plurality of candidate scheduling strategies according to the training sample set and the attribute information of each computing node, wherein each candidate scheduling strategy comprises a sample splitting scheme for uniformly splitting the training sample set into a plurality of training sample subsets and a corresponding maximum K value, and the maximum K value indicates the maximum number of subsets on which each computing node performs one forward or backward computation under the corresponding sample splitting scheme; generating, based on the sample splitting scheme in each candidate scheduling strategy and the association relation between the computing nodes, a task graph corresponding to each candidate scheduling strategy; and determining the target scheduling strategy with the shortest training duration according to the current network state, the task graph corresponding to each candidate scheduling strategy, and the maximum K value. The method and system can improve the training efficiency of the model under network resource preemption.

Description

Scheduling strategy determination method and system for pipeline parallel training
Technical Field
The application relates to the technical field of cloud computing, in particular to a scheduling policy determination method and system for pipeline parallel training.
Background
With the continuous development of deep learning, the scale of deep learning models keeps growing, and distributed training is particularly critical for large-model training. Pipeline parallel training, an important distributed training mode, divides a deep learning model by layers into a plurality of sequentially connected training stages (stages), deploys the stages on a plurality of computing nodes, and controls the computation order of the computing nodes with a reasonable scheduling strategy, so as to complete the training task of the deep learning model.
At present, when training in the pipeline parallel mode, the training sample set for each round of training is generally split into a plurality of training sample subsets, and each computing node trains according to a 1F1B scheduling strategy. Here F (forward) refers to the forward computation phase of deep learning training and B (backward) refers to the backward computation phase. The 1F1B scheduling policy means that the number of training sample subsets on which each computing node performs one forward or backward computation is 1. Under the 1F1B scheduling strategy, network resource occupation is high, and model training efficiency is low in network resource preemption scenarios.
Therefore, a new scheduling policy determination method for pipeline parallel training needs to be provided, so that higher model training efficiency can be obtained in network resource preemption scenarios.
Disclosure of Invention
The embodiments of the application provide a scheduling strategy determining method and system for pipeline parallel training, which can obtain higher model training efficiency in network resource preemption scenarios. The technical scheme is as follows:
in a first aspect, a scheduling policy determining method for pipeline parallel training is provided, where the method is applied to a network resource preemption scenario, and the method includes:
determining multiple candidate scheduling strategies according to a training sample set of any batch and attribute information of each computing node, wherein each computing node is deployed with a training stage obtained by segmenting a deep learning model, the candidate scheduling strategies are used for indicating the training mode in which each computing node trains the deep learning model based on the training sample set, each candidate scheduling strategy comprises a sample splitting scheme for uniformly splitting the training sample set into multiple training sample subsets and a corresponding maximum K value, the maximum K value is used for indicating the maximum number of subsets on which each computing node performs one forward computation or backward computation under the corresponding sample splitting scheme, and the maximum K value is greater than or equal to 2;
generating, based on the sample splitting scheme in each candidate scheduling strategy and the association relation between the computing nodes, a task graph corresponding to each candidate scheduling strategy;
and determining, according to the current network state, the task graph corresponding to each candidate scheduling strategy and the maximum K value, a target scheduling strategy with the shortest training duration from the multiple candidate scheduling strategies, wherein the target scheduling strategy is used for controlling the computation process of each computing node.
In a second aspect, a scheduling policy determining apparatus for pipeline parallel training is provided, where the apparatus is applied to a network resource preemption scenario, and the apparatus includes:
a first determining module, configured to determine multiple candidate scheduling strategies according to a training sample set of any batch and attribute information of each computing node, wherein each computing node is deployed with a training stage obtained by segmenting a deep learning model, the candidate scheduling strategies are used for indicating the training mode in which each computing node trains the deep learning model based on the training sample set, each candidate scheduling strategy comprises a sample splitting scheme for uniformly splitting the training sample set into multiple training sample subsets and a corresponding maximum K value, the maximum K value is used for indicating the maximum number of subsets on which each computing node performs one forward computation or backward computation under the corresponding sample splitting scheme, and the maximum K value is greater than or equal to 2;
a generation module, configured to generate a task graph corresponding to each candidate scheduling strategy based on the sample splitting scheme in each candidate scheduling strategy and the association relation between the computing nodes;
and a second determining module, configured to determine, according to the current network state, the task graph corresponding to each candidate scheduling strategy and the maximum K value, a target scheduling strategy with the shortest training duration from the multiple candidate scheduling strategies, where the target scheduling strategy is used for controlling the computation process of each computing node.
In a third aspect, a pipeline parallel training system is provided, where the system includes a scheduling policy determining device for pipeline parallel training and a plurality of computing nodes;
the scheduling policy determining device is configured to execute the scheduling policy determining method for pipeline parallel training according to the first aspect;
and each computing node is used for receiving the target scheduling strategy sent by the scheduling strategy determining device and performing model training according to the target scheduling strategy.
In a fourth aspect, a computing device is provided that includes a processor and a memory; the memory stores at least one piece of program code; the at least one program code is configured to be invoked and executed by the processor to implement the scheduling policy determination method for pipeline parallel training as described in the first aspect.
In a fifth aspect, a computer readable storage medium is provided, having stored therein at least one computer program, which, when executed by a processor, implements the scheduling policy determining method for pipeline parallel training according to the first aspect.
In a sixth aspect, a computer program product is provided, comprising a computer program, which, when executed by a processor, implements the scheduling policy determination method for pipeline parallel training according to the first aspect.
The beneficial effects that technical scheme that this application embodiment provided brought are:
when the deep learning model is trained in the pipeline parallel mode, a fixed 1F1B scheduling strategy is not used. Instead, in each round of training, the training sample set for that round is split according to its sample number to obtain a plurality of sample splitting schemes, which differ in the number of subsets produced and in the number of samples per subset. Within the maximum storage space of each computing node, for the same sample splitting scheme the number of samples in each split subset is fixed, and the more training sample subsets each computing node computes in parallel, the higher its computing efficiency; across different sample splitting schemes, the more samples in each split subset, the fewer subsets a computing node can compute in parallel, while the more subsets are computed in parallel, the fewer communications are needed between computing nodes and the lower the dependence on the network state. The embodiments of the application take the maximum storage space of each computing node as the constraint and search the corresponding maximum K value for each sample splitting scheme to obtain multiple candidate scheduling strategies, and then obtain the training duration of each candidate scheduling strategy under the current network state. On the premise of comprehensively considering the network state and the computing efficiency of each computing node, the influence of the network state on training efficiency is reduced, so that when the target scheduling strategy with the shortest training duration is selected to control the training process of each computing node, higher model training efficiency is obtained in network resource preemption scenarios, and the whole model training process maintains a high training efficiency.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic structural diagram of a scheduling policy determining device for pipeline parallel training according to an embodiment of the present application;
FIG. 2 is a flowchart of a scheduling policy determination method for pipeline-oriented parallel training provided in an embodiment of the present application;
FIG. 3 is a schematic diagram of a memory limit curve according to an embodiment of the present disclosure;
fig. 4 is a comparison graph showing the advantage, under a preemption network, of the scheduling policy determining method for pipeline parallel training provided by an embodiment of the present application;
fig. 5 is a schematic structural diagram of a scheduling policy determining device for pipeline parallel training according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a pipelined parallel training system provided in an embodiment of the present application;
FIG. 7 illustrates a block diagram of a computing device provided in an exemplary embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
It will be understood that, as used in the embodiments of the present application, "plurality" means two or more, "each" refers to every one of the corresponding plurality, and "any" refers to any one of the corresponding plurality. For example, if a plurality of words includes 10 words, "each word" refers to every one of the 10 words, and "any word" refers to any one of the 10 words.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or fully authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region, and provide corresponding operation entries for the user to select authorization or rejection.
Before describing the embodiments of the present application, the terms involved in the embodiments of the present application are first explained.
Forward (abbreviated as F) and backward (abbreviated as B) computation are the two computation stages of deep learning. Forward computation refers to the sequential computation process from the input of training samples to the output. Backward computation refers to back-propagating the forward computation result from the output toward the input. In essence, backward computation sequentially calculates the gradient of each layer's parameters by the chain rule; the calculated gradients are used for gradient descent on the parameters of each layer, so that the model reaches a convergence state by updating the parameters of each layer.
A computing node is one unit of computation when training the deep learning model in a distributed manner. Each computing node comprises at least one computing device, and each computing device is provided with a computing card (such as a GPU (Graphics Processing Unit)). Distributed training modes include data parallelism and model parallelism. In a data parallel training scenario, a complete deep learning model is deployed on each computing node, and each computing node is configured to process part of the training samples. In a model parallel scenario, the deep learning model is segmented by layers, partial layers of the deep learning model are deployed on each computing node, and each computing node is used for computing all training samples.
A task graph is a graph that describes the relationship between tasks. A task graph is typically a directed graph, where nodes represent tasks, node values represent the computation amount of the tasks, directed edges represent dependency or communication relationships between tasks, and the weights of the directed edges represent traffic.
A batch is the training sample set for one round of training in model training, and can be obtained by splitting the full training set. Each batch comprises a plurality of training samples; the training samples are independent of each other and participate in the same round of training. When all the training samples in one batch have completed computation, one round of training of the model is said to be complete.
A micro_batch is a finer-grained batch obtained by further splitting the training samples in a batch. In brief, micro_batches are subsets of a batch: one batch includes multiple micro_batches, and different micro_batches belonging to the same batch participate in the same round of training. The micro_batch is the minimum unit of one forward or backward computation by a computing node, i.e., a computing node performs forward or backward computation on training samples in units of micro_batches.
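As a hedged illustration of the batch/micro_batch relationship (not the patent's code), the following Python sketch splits one batch, represented simply as a list of samples, into uniform micro_batches; the function name and data representation are assumptions for illustration.

```python
# A minimal sketch (not the patent's code) of splitting one batch into
# equally sized micro_batches; `batch` is assumed to be a list of samples.
def split_into_micro_batches(batch, micro_batch_size):
    assert len(batch) % micro_batch_size == 0, "splitting must be uniform"
    return [batch[i:i + micro_batch_size]
            for i in range(0, len(batch), micro_batch_size)]

# e.g. a batch of 8 samples split into 4 micro_batches of 2 samples each
micro_batches = split_into_micro_batches(list(range(8)), 2)
```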
A bubble is an index for measuring the computational efficiency of pipeline parallel training. Pipeline parallel training typically lets multiple computing devices compute in parallel over time; when a computing device is idle, a blank (a bubble) appears on its timeline. The computational efficiency of pipeline parallel training can be obtained by calculating the ratio of bubbles within one round: in general, the higher the bubble ratio, the lower the computational efficiency of pipeline parallel training.
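As an illustrative sketch only (the timeline representation below is an assumption, not from the application), the bubble ratio can be computed from per-device busy/idle timelines:

```python
# Illustrative only: compute the bubble ratio from per-device timelines,
# where each timeline is a list of (busy, duration) slots.
def bubble_ratio(timelines):
    total = sum(duration for timeline in timelines for _, duration in timeline)
    idle = sum(duration for timeline in timelines
               for busy, duration in timeline if not busy)
    return idle / total  # higher ratio -> lower pipeline efficiency

# two devices; the second is idle for 2 of its 10 time units
print(bubble_ratio([[(True, 10)], [(True, 8), (False, 2)]]))  # 0.1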
A stage is a computational unit of pipeline parallel training. When pipeline parallel training is adopted, the deep learning model needs to be segmented by layers into a plurality of sequentially connected stages. Training of the deep learning model includes a forward computation stage and a backward computation stage; the forward computation and backward computation use the same model layers but propagate data in opposite directions. To facilitate control of the two computation stages, the deep learning model is generally segmented separately for the two directions, yielding a plurality of forward stages and a plurality of backward stages, where a forward stage and a backward stage that comprise the same layers of the deep learning model correspond to each other. The corresponding forward stage and backward stage are deployed on the same computing node, so that the forward stage and backward stage on each computing node can be scheduled.
The 1F1B scheduling policy is short for the scheduling policy of 1 forward stage + 1 backward stage. Under the 1F1B scheduling policy, each computing node is deployed with one forward stage and one backward stage, and the model training task is completed by alternately scheduling the deployed forward stage and backward stage. Since one forward stage and one backward stage are deployed on each computing node, and one scheduling of a forward stage (or backward stage) performs forward (or backward) computation on only one micro_batch, the 1F1B scheduling policy limits to 1 the number of micro_batches on which each computing node performs one forward or backward computation. Under the 1F1B scheduling policy, the forward stage and backward stage deployed on each computing node are alternately scheduled. For the last computing node, the micro_batches computed by the alternately scheduled forward stage and backward stage belong to the same micro_batch: after the forward stage on the last computing node forward-computes a micro_batch, the backward stage on that node backward-computes the same micro_batch. For other computing nodes, the micro_batches computed by the alternately scheduled forward stage and backward stage belong to different micro_batches: after the forward stage on such a node forward-computes a micro_batch, the next time the node computes, its backward stage computes a different micro_batch.
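The following is a simplified sketch of a 1F1B per-node schedule order, an illustration under assumed conventions rather than the patent's implementation: after a warm-up of forward steps, each node alternates one forward (F) and one backward (B) micro_batch until the round completes.

```python
# A simplified sketch (an illustration, not the patent's implementation) of a
# 1F1B order per node: after a warm-up of forward steps, each node alternates
# one forward (F) and one backward (B) micro_batch until the round completes.
def one_f_one_b_schedule(node_rank, num_nodes, num_micro_batches):
    warmup = num_nodes - node_rank - 1          # the last node has no warm-up
    schedule = [("F", i) for i in range(min(warmup, num_micro_batches))]
    f, b = len(schedule), 0
    while b < num_micro_batches:
        if f < num_micro_batches:
            schedule.append(("F", f)); f += 1
        schedule.append(("B", b)); b += 1
    return schedule

# On the last node every F is immediately followed by the B of the same
# micro_batch; on earlier nodes the alternating F and B differ in micro_batch.
print(one_f_one_b_schedule(node_rank=3, num_nodes=4, num_micro_batches=4))
```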
Cloud Computing refers to the delivery and usage model of IT (Information Technology) infrastructure, in which required resources are obtained over a network in an on-demand, easily scalable manner; generalized cloud computing refers to the delivery and usage model of services, in which required services are obtained over a network in an on-demand, easily scalable manner. Such services may be IT, software or Internet related, or other services. Cloud computing is a product of the fusion of traditional computer and network technologies such as Grid Computing, Distributed Computing, Parallel Computing, Utility Computing, Network Storage Technologies, Virtualization and Load Balancing. By operation mode, cloud computing is classified into Public Clouds, Private Clouds, Hybrid Clouds, and so on. A public cloud generally refers to a cloud provided by a third-party provider for users; it can generally be used through the Internet, may be free or low-cost, and its core attribute is shared resource service, i.e., services are provided over an open public network. With the development of cloud computing technology, more and more deep learning models are trained using computing resources provided by public clouds.
Deep learning is a generic term for a class of pattern analysis methods that use multiple layers of nonlinear information in a deep neural network to perform supervised or unsupervised feature extraction and transformation. With the continuous development of deep learning, the scale of deep learning models keeps growing. To shorten model training time and improve training efficiency, pipeline parallel training is generally adopted. Pipeline parallel training requires segmenting the deep learning model into a plurality of stages and deploying the stages on a plurality of computing nodes. In forward computation, the computation on a later forward stage depends on the computation result of the preceding forward stage; in backward computation, the computation on a later backward stage depends on the computation result of the preceding backward stage, so data needs to be transferred across computing nodes. Thus, the training efficiency of the deep learning model depends not only on the computational efficiency of each computing node (which is related to the degree of uniformity among the stages and the amount of data transferred between the stages), but also on the network bandwidth between the computing nodes.
However, in public cloud scenarios, network resources are shared by the various computing devices and applications on the cloud and cannot be exclusively allocated to the computing nodes training the deep learning model, which may make the network bandwidth between computing nodes lower than that between the computing cards within each computing node. When data is transferred across computing nodes, if other computing devices or applications use the network resources, network congestion results, slowing the communication between the computing nodes. With the 1F1B scheduling strategy, although the amount of data each computing node needs to cache is small, and the cached forward computation result can be released promptly after a micro_batch finishes forward computation, relieving the GPU memory pressure of each computing node, the preempted network resources cause many bubbles at the computing nodes during pipeline parallel training, so the model training efficiency is low. In addition, because the network state changes frequently, the network state over the whole model training process cannot be estimated in advance, so the 1F1B scheduling strategy cannot achieve high training efficiency.
In order to obtain higher training efficiency in network resource preemption scenarios, when the deep learning model is trained in the pipeline parallel mode, the embodiments of the application improve the original 1F1B scheduling strategy and provide a KFKB scheduling strategy. The KFKB scheduling policy alternately schedules K forward computations or K backward computations as one basic scheduling unit (the groups of K are indivisible). Since the KFKB scheduling policy schedules K forward or K backward computations as one group, K micro_batches are computed each time a computing node performs forward or backward computation. Here K is the group value, i.e., the number of micro_batches on which each computing node performs one forward or backward computation. The value of K is greater than or equal to 1, and different K values correspond to different scheduling strategies; for example, when K=1 the KFKB scheduling strategy is the 1F1B scheduling strategy, and when K=2 it is the 2F2B scheduling strategy. With the KFKB scheduling strategy (when K is greater than or equal to 2), each computing node does not need to wait for all the data it depends on to arrive; it can compute based on data that has no dependencies or whose dependencies are already prepared, so that the computing node stays in a computing state as much as possible. This relieves the dependence on network resources, reduces the bandwidth pressure on the computing node, and thereby improves the computational efficiency of the computing node.
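A hedged sketch of the KFKB idea follows; the structure and names are illustrative assumptions, generalizing the 1F1B sketch above so that micro_batches are scheduled in indivisible groups of K forward or K backward computations. With K = 1 this reduces to 1F1B.

```python
# A hedged sketch of the KFKB idea: micro_batches are scheduled in indivisible
# groups of K forward or K backward computations; with K = 1 this reduces to
# 1F1B. Structure and names are illustrative assumptions.
def kfkb_schedule(node_rank, num_nodes, num_micro_batches, k):
    assert num_micro_batches % k == 0
    groups = num_micro_batches // k             # one step handles K micro_batches
    warmup = min(num_nodes - node_rank - 1, groups)
    schedule = [("F", g) for g in range(warmup)]
    f, b = warmup, 0
    while b < groups:
        if f < groups:
            schedule.append(("F", f)); f += 1   # K forward computations
        schedule.append(("B", b)); b += 1       # K backward computations
    return schedule

print(kfkb_schedule(node_rank=0, num_nodes=2, num_micro_batches=8, k=2))
```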
Based on the proposed KFKB scheduling strategy, the embodiments of the application train the deep learning model in the pipeline parallel mode through the following two processes:
the first process: determination of scheduling policy
When a certain round of training is performed on the deep learning model, multiple candidate scheduling strategies are determined according to the training sample set of that batch and the attribute information of each computing node, where each candidate scheduling strategy comprises a sample splitting scheme for uniformly splitting the training sample set into a plurality of training sample subsets and a corresponding maximum K value; the target scheduling strategy with the shortest training duration under the current network state is then selected according to the current network state.
The second process: adjustment process for scheduling policy based on network status
Based on the target scheduling strategy determined in the first process, in each subsequent round of training, the network state of the current network is monitored, and the scheduling strategy of each round is adaptively adjusted according to the current network state, so that high pipeline performance is always maintained in network preemption scenarios.
Fig. 1 shows a scheduling policy determining apparatus for pipeline parallel training provided in an embodiment of the present application. The apparatus includes an adaptive grouper, a task graph generator and an adaptive group scheduler. The output of the adaptive grouper is connected to the input of the task graph generator, and the output of the task graph generator is connected to the input of the adaptive group scheduler.
The adaptive grouper (Adaptive Grouper Pass) is used for determining multiple candidate scheduling strategies according to the training sample set and the attribute information of each computing node. The attribute information of a computing node includes the number of computing devices in the node, the storage space of each computing device (such as GPU memory), the stages deployed on the node, and so on. When determining the candidate scheduling strategies, the adaptive grouper uniformly splits the training sample set according to its sample number to obtain multiple sample splitting schemes, where each sample splitting scheme comprises the number of split training sample subsets and the number of samples within each subset; it then searches, according to the storage space of each computing node, the maximum K value corresponding to each splitting scheme (the maximum K value being greater than or equal to 2), and composes one sample splitting scheme and its corresponding maximum K value into one candidate scheduling strategy. Generally, the greater the number of samples within the training sample subsets of a sample splitting scheme, the smaller its corresponding maximum K value; the fewer the samples within the subsets, the greater its corresponding maximum K value. For example, in FIG. 1, MBS (short for micro batch size) is inversely related to K: when MBS is 6, K is 1; when MBS is 4, K is 2; when MBS is 2, K is 4.
The task graph generator (Task Graph Builder) is used for generating the task graph corresponding to each candidate scheduling strategy based on the sample splitting scheme in each candidate scheduling strategy and the association relation between the computing nodes.
The adaptive group scheduler (Adaptive Grouper Scheduler) is used for determining, according to the current network state, the task graph corresponding to each candidate scheduling strategy and the maximum K value, a target scheduling strategy with the shortest training duration from the multiple candidate scheduling strategies, and sending the determined target scheduling strategy to each computing node, so that each computing node performs model training according to the target scheduling strategy.
In an embodiment of the present application, the adaptive group scheduler includes a scheduling plan generating module (Schedule Planner), an automatic adjusting module (AutoTuner), a coordinating module (Coordinator), and so on. The input of the scheduling plan generating module is connected to the output of the task graph generator, the output of the scheduling plan generating module is connected to the input of the automatic adjusting module, and the output of the automatic adjusting module is connected to the input of the coordinating module. The scheduling plan generating module is used for generating, according to the task graph corresponding to each candidate scheduling strategy and the maximum K value, a candidate scheduling plan corresponding to each candidate scheduling strategy; the candidate scheduling plan is the training plan each computing node follows when executing the model training task under the corresponding candidate scheduling strategy. The automatic adjusting module is used for simulating, according to the current network state, the model training process of each computing node under the candidate scheduling plan corresponding to each candidate scheduling strategy, so as to obtain the training duration of each candidate scheduling strategy, and then selecting the target scheduling strategy with the shortest training duration from the multiple candidate scheduling strategies. The coordinating module is used for sending the target scheduling strategy to each computing node.
In an embodiment of the present application, referring to fig. 1, the apparatus further includes a model slicer (Auto Parallel Pass), where the output of the model slicer is connected to the input of the adaptive grouper. The model slicer is configured to determine multiple model segmentation schemes according to the model information of the deep learning model and the attribute information of each computing device in the cluster, select from them a target segmentation scheme with the shortest training duration, and then determine the attribute information of each computing node based on the target segmentation scheme.
In an embodiment of the present application, the apparatus further includes a network state monitoring module, whose output is connected to the input of the automatic adjusting module. When model training is performed based on the training sample set of the next batch, the network state monitoring module, in response to a monitoring instruction sent by the automatic adjusting module, monitors the network state of the current network and sends the monitored network state to the automatic adjusting module. The automatic adjusting module reselects, according to the monitored network state, the target scheduling strategy with the shortest training duration from the multiple candidate scheduling strategies, and sends the reselected target scheduling strategy to the coordinating module; the coordinating module receives the reselected target scheduling strategy and sends it to each computing node, so that each computing node performs model training according to the reselected target scheduling strategy.
Based on the apparatus shown in fig. 1, an embodiment of the present application provides a scheduling policy determining method for pipeline parallel training, where the method is applied to a network preemption scenario. Referring to fig. 2, the method flow provided by the embodiment of the present application includes:
201. Determine the attribute information of each computing node based on the model information of the deep learning model and the attribute information of the computing devices used for model training.
The model information of the deep learning model comprises the number of model layers, the association relations among the layers, the number of convolution kernels in each layer, and so on. The attribute information of the computing devices used for model training includes the number of computing devices, the GPU memory size of each computing device, the network bandwidth between the computing devices, and so on. The attribute information of a computing node includes the number of computing devices included in the node, the GPU memory size of each device, the stages deployed on the node, and so on. Based on the model information of the deep learning model and the attribute information of each computing device used for model training, the attribute information of each computing node may be determined as follows:
2011. Determine multiple model segmentation schemes based on the model information of the deep learning model and the attribute information of each computing device used for model training.
The model segmentation scheme is used for indicating a segmentation mode of the deep learning model and computing nodes corresponding to a training stage after segmentation. The method comprises the following steps:
First, according to the model information of the deep learning model, the deep learning model is segmented into a plurality of stages.
This step can segment the deep learning model according to its number of layers. Since the deep learning model has at least two layers, there can be multiple segmentation modes, and each segmentation mode yields a plurality of stages.
According to the training characteristics of the deep learning model, the training stages comprise a forward computation stage and a backward computation stage, so that gradient descent can subsequently be performed based on the forward and backward computation results to update the model parameters. Therefore, after the deep learning model is sequentially segmented by layers to obtain a plurality of forward stages, it is reversely segmented according to the obtained forward stages to obtain a plurality of backward stages. Each forward stage has a corresponding backward stage, and the deep learning layers included in corresponding forward and backward stages are in a mirror relation. The corresponding forward stage and backward stage are usually deployed on the same computing node, which is friendly to the storage and update of model parameters.
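As an illustrative sketch under stated assumptions (the model is represented simply as an ordered list of layer names, and stages are contiguous slices), layer-wise segmentation with mirrored backward stages might look like this:

```python
# Illustrative sketch of splitting a model by layers: `layers` is assumed to
# be an ordered list of layer names; each backward stage mirrors a forward
# stage, so the two lists below are reverses of each other.
def split_into_stages(layers, num_stages):
    n = len(layers)
    bounds = [round(i * n / num_stages) for i in range(num_stages + 1)]
    forward_stages = [layers[bounds[i]:bounds[i + 1]] for i in range(num_stages)]
    backward_stages = list(reversed(forward_stages))    # mirror relation
    return forward_stages, backward_stages

fwd, bwd = split_into_stages([f"layer{i}" for i in range(8)], num_stages=4)
```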
Second, for each segmentation mode, allocate corresponding computing devices to each stage according to the attribute information of each computing device used for model training and the segmented stages.
In one possible implementation, if the number of computing devices is an integer multiple of the number of divided stages, a corresponding computing device may be allocated to each stage on average according to the number of divided stages. For example, 8 computing devices for model training, if the number of divided stages is 2, 4 computing devices are allocated to each stage; if there are 4 split stages, 2 computing devices are allocated to each stage.
In another possible implementation, each stage may be allocated corresponding computing devices according to the computing resources that need to be consumed on each stage and the storage space (e.g., the GPU memory size) of the respective computing devices. For example, if a stage corresponds to more layers or the amount of computation on its layers is greater, it is determined that the stage needs to consume more computing resources, so more computing resources are allocated to it, for example computing devices with larger storage space, or more computing devices. For example, if 8 computing devices are used for model training, there are 2 split stages, and stage1 needs to consume more computing resources than stage2, then 5 computing devices may be allocated to stage1 and 3 computing devices to stage2.
Considering that the segmented stages include forward stages and backward stages, the proportion of computing resources consumed by each forward stage within the forward computation stage is consistent with the proportion consumed by its corresponding backward stage within the backward computation stage; that is, if a forward stage needs to consume more computing resources, the backward stage corresponding to it also needs to consume more. Therefore, when allocating computing devices to the stages, the allocation can be based either on the computing resources consumed by each stage in the forward computation stage or on those consumed in the backward computation stage.
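A hedged sketch of the proportional allocation described above follows; `stage_costs` is an assumed input (e.g., from profiling the forward stages), not a quantity defined by the application.

```python
# A hedged sketch: devices are assigned to stages in proportion to each
# stage's estimated compute cost. `stage_costs` is an assumed input.
def allocate_devices(num_devices, stage_costs):
    total = sum(stage_costs)
    alloc = [max(1, round(num_devices * cost / total)) for cost in stage_costs]
    while sum(alloc) > num_devices:                 # fix rounding drift
        alloc[alloc.index(max(alloc))] -= 1
    while sum(alloc) < num_devices:
        alloc[alloc.index(min(alloc))] += 1
    return alloc

print(allocate_devices(8, [5, 3]))  # -> [5, 3], matching the example above
```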
Third, the computing devices corresponding to each stage in each segmentation mode are composed into a computing node, and each segmentation mode together with the computing nodes corresponding to its stages is taken as one model segmentation scheme.
2012. Select a target segmentation scheme with the shortest training duration from the multiple model segmentation schemes.
Different segmentation modes produce different stages, different stages correspond to different computing nodes, and the subsequent training efficiency of the deep learning model also differs; therefore, the segmentation scheme with the highest training efficiency needs to be selected from the multiple model segmentation schemes. Training efficiency is mainly reflected in the training duration: the longer the training duration, the lower the training efficiency; the shorter the training duration, the higher the training efficiency. The target segmentation scheme selected based on training duration is thus the model segmentation scheme with the highest training efficiency. The training duration of each model segmentation scheme can be obtained by simulating the training process of the model under the different model segmentation schemes.
2013. Based on the target segmentation scheme, attribute information of each computing node is determined.
After the target segmentation scheme is determined, the attribute information of each computing node under the target segmentation scheme can be determined based on the corresponding relation between each stage and the computing node in the target segmentation scheme.
In the embodiment of the application, the deep learning model may be described by HLO (High-Level Optimization) instructions, so that each stage corresponding to a computing node under the target segmentation scheme may be represented by an HLO Computation.
202. Determine multiple candidate scheduling strategies according to the training sample set of any batch and the attribute information of each computing node.
The candidate scheduling strategy is used for indicating the training mode in which each computing node trains the deep learning model based on the training sample set. A candidate scheduling strategy comprises a sample splitting scheme for uniformly splitting the training sample set into a plurality of training sample subsets and a corresponding maximum K value. The maximum K value indicates the maximum number of subsets on which each computing node performs one forward or backward computation under the corresponding sample splitting scheme. According to the training sample set of any batch and the attribute information of each computing node, the multiple candidate scheduling strategies may be determined as follows:
2021. Uniformly split the training sample set according to the number of samples it includes, to obtain multiple sample splitting schemes.
Each sample splitting scheme includes the number of split training sample subsets and the number of samples within each subset. Uniformly splitting the training sample set by factorizing the number of samples it includes yields the various sample splitting schemes. Uniform splitting means that the number of samples within each split subset is the same. For example, a training sample set of 100 samples can be uniformly split in the following ways: into 100 subsets of 1 sample each; into 50 subsets of 2 samples each; into 25 subsets of 4 samples each; into 20 subsets of 5 samples each; into 10 subsets of 10 samples each; into 5 subsets of 20 samples each; into 4 subsets of 25 samples each; or into 2 subsets of 50 samples each.
For each sample splitting scheme, the product of the number of split subsets and the number of samples per subset equals the number of samples in the training sample set. Let the number of samples in the training sample set be batch_size, the number of split subsets be num_micro_batches, and the number of samples within each subset be micro_batch_size; then micro_batch_size × num_micro_batches = batch_size.
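A minimal sketch of this factorization step follows (function name and output format are illustrative assumptions); it enumerates every uniform splitting scheme satisfying micro_batch_size × num_micro_batches = batch_size:

```python
# A minimal sketch of step 2021: enumerate all uniform splitting schemes by
# factorizing batch_size, so micro_batch_size * num_micro_batches == batch_size.
def sample_splitting_schemes(batch_size):
    schemes = []
    for micro_batch_size in range(1, batch_size + 1):
        if batch_size % micro_batch_size == 0:
            num_micro_batches = batch_size // micro_batch_size
            if num_micro_batches >= 2:   # keep at least two subsets to pipeline
                schemes.append((num_micro_batches, micro_batch_size))
    return schemes

print(sample_splitting_schemes(100))  # (100,1), (50,2), (25,4), ... (2,50)
```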
2022. Search the maximum K value corresponding to each sample splitting scheme, taking the maximum storage space of each computing node as the constraint condition, to obtain the candidate scheduling strategies.
For different sample splitting schemes, the greater the number of samples within the training sample subsets, the larger the input tensor, and the more GPU memory a computing node occupies when computing a subset. For the same sample splitting scheme, the number of samples per subset (micro_batch_size) is fixed, and the more subsets each computing node computes in one forward or backward computation (i.e., the larger the K value), the greater the storage pressure on the node. Since the storage space of each computing node is limited, for different sample splitting schemes the maximum K value each computing node can bear when invoking a forward stage or backward stage differs, and the resulting candidate scheduling strategies also differ. The maximum K value corresponding to each sample splitting scheme is searched with the maximum storage space of each computing node as the constraint condition.
Specifically, each candidate scheduling strategy is determined as follows. For each sample splitting scheme, based on the storage space of each computing node and the network bandwidth between different computing nodes, the processing of the training samples in each subset by each computing node is simulated, and the number of subsets processed by each computing node is continuously increased until the maximum storage space of some computing node is reached. The number of subsets processed by each computing node when the maximum storage space is reached is taken as the maximum K value corresponding to that sample splitting scheme, i.e., the maximum K value corresponding to the number of samples per subset under that scheme. One sample splitting scheme and its corresponding maximum K value then compose one candidate scheduling strategy.
During the search for the maximum K value, if for some computing node the computation amount in the previous search round did not reach the maximum storage space while the next round would exceed it, the number of subsets processed by that computing node in the previous round is taken as the maximum K value.
For example, for a sample splitting scheme in which each training sample subset has 8 samples, K=1 is selected at the start of the search, and the 1F1B scheduling strategy is used to simulate each computing node's computation on the subsets, i.e., the number of subsets per forward or backward computation is 1. If the maximum storage space of no computing node is reached, K=2 is selected and the 2F2B scheduling strategy is used for the simulation, i.e., the number of subsets per forward or backward computation is 2. If the maximum storage space of some computing node is then reached, the maximum K value corresponding to this sample splitting scheme is determined to be 2.
Since the number of samples per subset is fixed for each sample splitting scheme, the search may equivalently fix the K value and search for the maximum number of samples per subset corresponding to that K value. The method of searching, for a fixed K value, the maximum number of samples in the training sample subset is the same as the method of searching the maximum K value for each sample splitting scheme, as described above, and is not repeated here.
The above method of searching the maximum K value corresponding to each sample splitting scheme is also called the Pareto frontier pruning method. FIG. 3 shows a memory limit curve obtained by the Pareto frontier pruning method. The integer pairs on the curve are solutions of the scheduling policy at which the maximum storage space of the computing node is reached; for example, point C in the figure lies on the curve, and the integer pair at point C is a solution reaching the GPU memory limit. The integer pairs in the area below the curve are feasible solutions of candidate scheduling strategies that do not fully utilize the GPU memory; for example, point A lies below the curve, so micro_batch_size or K should be further increased to fully utilize the memory. The integer pairs in the area above the curve are not feasible solutions: they exceed the maximum GPU memory of the computing node, so the computing node cannot carry out the computation.
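A hedged sketch of the search loop in step 2022 follows; `fits_in_memory` stands in for the simulated per-node memory check and is an assumption for illustration, not an API from this application.

```python
# A hedged sketch of the maximum-K search: `fits_in_memory` stands in for the
# simulated per-node memory check and is an assumption. K grows until some
# node's maximum storage space would be exceeded.
def search_max_k(scheme, nodes, fits_in_memory, k_upper=64):
    max_k = 0
    for k in range(1, k_upper + 1):
        if all(fits_in_memory(node, scheme, k) for node in nodes):
            max_k = k            # this round still fits on every node
        else:
            break                # the next round would exceed the limit
    return max_k if max_k >= 2 else None   # candidate strategies need K >= 2
```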
It should be noted that, when searching the maximum K value corresponding to each sample splitting scheme, the method provided by the embodiment of the application does not have to enumerate all feasible solutions, which relieves pressure for the subsequent generation of task graphs, evaluation of candidate scheduling strategies, and testing of network performance.
203. Generate the task graph corresponding to each candidate scheduling strategy based on the sample splitting scheme in each candidate scheduling strategy and the association relation between the computing nodes.
Each computing node is taken as a task node on the task graph, and a corresponding subtask graph is generated for each training sample subset under each candidate scheduling strategy based on the association relation among the computing nodes, yielding a plurality of subtask graphs for each candidate scheduling strategy. The number of subtask graphs equals the number of split training sample subsets under the candidate scheduling strategy, and the subtask graphs are identical. Since one task graph is generally used to represent the computation process of all computing nodes for a training task, after the subtask graphs corresponding to each candidate scheduling strategy are obtained, they need to be fused to obtain the task graph corresponding to that candidate scheduling strategy.
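An illustrative structure for this step is sketched below; the dict-of-edges representation and the assumption that fusing amounts to replicating the per-subset edges are simplifications for illustration, not the application's exact construction.

```python
# Illustrative structure for step 203: one subtask graph per training sample
# subset is built from the inter-node dependencies, and the copies are fused
# into a single task graph. The dict-of-edges representation is an assumption.
def build_task_graph(node_edges, num_micro_batches):
    task_graph = {}
    for m in range(num_micro_batches):              # one subtask graph per subset
        for (src, dst), traffic in node_edges.items():
            task_graph[((src, m), (dst, m))] = traffic  # edge weight = traffic
    return task_graph

edges = {("node0", "node1"): 64, ("node1", "node2"): 64}   # assumed traffic
graph = build_task_graph(edges, num_micro_batches=4)
```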
204. Determine the target scheduling policy with the shortest training duration from the multiple candidate scheduling policies according to the current network state and the task image and maximum K value corresponding to each candidate scheduling policy.
The training duration of a model generally reflects its training efficiency: the longer the training duration, the lower the training efficiency, and the shorter the training duration, the higher the training efficiency. Selecting, based on the training duration under each candidate scheduling policy, the candidate with the shortest training duration therefore yields the scheduling policy with the highest training efficiency. Specifically, the target scheduling policy with the shortest training duration may be determined from the multiple candidate scheduling policies as follows:
2041. Generate the candidate scheduling plan corresponding to each candidate scheduling policy according to that policy's task image and maximum K value.
A candidate scheduling policy only indicates the direction of data flow when the computing nodes compute on the training samples in the training sample subsets; it gives neither the number of subsets each computing node computes at a time nor the computation order. A training plan for executing the concrete model training task must therefore be made for each computing node according to the task image and the maximum K value of each candidate scheduling policy. For the task image of any candidate scheduling policy, if the maximum K value of that policy is 2, the candidate scheduling plan instructs each computing node, in the forward computation stage, to invoke the forward stage to compute two training sample subsets (or the two results of the previous forward stage) in the order given by the task image, and, in the backward computation stage, to invoke two backward stages to compute two forward results (or the two results of the previous backward stage).
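For illustration only, a sketch of plan generation for such a KFKB schedule follows. It uses a deliberately simplified per-stage ordering (all forward groups before all backward groups); the plan format and the ordering are assumptions of this sketch, and the actual plan interleaves forward and backward groups in the generalized 1F1B manner described above.

```python
def kfkb_plan(num_micro_batches: int, k: int, num_stages: int) -> dict:
    """Group micro-batches into groups of K; each step computes one group."""
    groups = [list(range(i, min(i + k, num_micro_batches)))
              for i in range(0, num_micro_batches, k)]
    plan = {stage: [] for stage in range(num_stages)}
    for stage in range(num_stages):
        for g in groups:
            plan[stage].append(("F", g))    # forward: K subsets at a time
        for g in reversed(groups):
            plan[stage].append(("B", g))    # backward: K results at a time
    return plan

print(kfkb_plan(num_micro_batches=8, k=2, num_stages=4)[0])
# -> [('F', [0, 1]), ('F', [2, 3]), ..., ('B', [2, 3]), ('B', [0, 1])]
```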
2042. Obtain the training duration of each candidate scheduling policy by simulating, according to the current network state, the process in which each computing node performs model training according to the candidate scheduling plan of that policy.
Based on the maximum storage space of each computing node and the current network state, the process in which each computing node performs model training according to the candidate scheduling plan of each candidate scheduling policy is simulated. From this simulated training process, the computation duration of each training sample under each candidate scheduling policy is estimated. The communication of each computing node with the other computing nodes while executing the candidate scheduling plan of each candidate scheduling policy under the current network state is measured, giving the communication duration between each computing node and the other computing nodes under each candidate scheduling policy. An overhead model is then invoked to process the computation duration of each training sample and the communication durations under each candidate scheduling policy, producing the training duration of each candidate scheduling policy.
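Continuing the sketches above, a crude additive overhead model might look as follows; the `overlap` parameter and the cost constants are assumptions made for illustration, not the patent's actual overhead model.

```python
def estimate_training_duration(plan: dict, compute_per_group: float,
                               bytes_per_group: float, bandwidth: float,
                               overlap: float = 0.5) -> float:
    """Approximate wall-clock time as compute plus the non-hidden communication.

    The critical path is approximated by the busiest stage; `overlap` is the
    assumed fraction of communication hidden behind computation.
    """
    busiest = max(len(ops) for ops in plan.values())
    compute = busiest * compute_per_group
    comm = busiest * bytes_per_group / bandwidth
    return compute + (1.0 - overlap) * comm

# Evaluate the 2F2B plan from the previous sketch under an assumed network state.
plan = kfkb_plan(num_micro_batches=8, k=2, num_stages=4)
print(estimate_training_duration(plan, compute_per_group=1.0,
                                 bytes_per_group=1e8, bandwidth=1e9))  # 8.4
```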
2043. Select the target scheduling policy with the shortest training duration from the multiple candidate scheduling policies.
Based on the training duration determined for each candidate scheduling policy, the target scheduling policy with the shortest training duration can be selected from the multiple candidate scheduling policies.
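Under the same illustrative assumptions as the sketches above, this selection step reduces to an argmin over the estimated durations; the candidate names and numbers below are hypothetical.

```python
# Assumed evaluator output for two hypothetical candidate policies.
durations = {"micro_batch_8_K2": 8.4, "micro_batch_4_K4": 9.1}
target_policy = min(durations, key=durations.get)
print(target_policy)  # -> "micro_batch_8_K2"
```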
Fig. 4 shows a performance comparison between the KFKB scheduling policy provided by the embodiment of the present application (with K=2 selected) and the existing 1F1B scheduling policy in pipeline parallel training. The abscissa in Fig. 4 represents training time, and its total length is the training duration of the pipeline under each scheduling policy; the ordinate represents the different computing nodes; the number in each small box identifies a micro batch in the training sample set; the blanks (bubbles) in the figure represent computing nodes in an idle state.
As can be seen from Fig. 4, training with the 1F1B scheduling policy takes longer than training with the 2F2B scheduling policy and produces more bubbles. The reason is that, in a network preemption scenario, the network communication time between computing nodes is not negligible; communication between nodes can be overlapped only by keeping computation that does not depend on pending communication running, so that the network and the computing nodes are, as far as possible, busy at the same time.
Under the 1F1B scheduling policy, because the inter-node communication time is not negligible, each micro_batch finished on one computing node must wait a while to be transmitted to the next computing node, which is idle in the meantime because it has no computation task. For example, after micro batch 0 finishes on device 0 in Fig. 4, device 1 must wait for a period of time to receive the result of micro_batch 0 from device 0 before it can compute further. For device 0, the micro_batches have no dependencies in the forward computation stage, so any not-yet-computed micro_batch can be computed whenever the device would otherwise be idle, hiding part of the inter-node communication time. In the backward computation stage, however, the 1F1B rule forces most backward stages to wait for the network transmission to finish before starting, forming a large number of bubbles. For the other devices this holds in both the forward and the backward computation stages, where a large number of bubbles form.
Under the 2F2B scheduling policy, 2 forward stages or 2 backward stages form a group. Compared with the 1F1B policy, even when a backward computation cannot be performed because the input data it depends on has not arrived, the computing node can select other input data that is ready or has no pending dependency and compute on it instead. Each computing node is thus kept busy, the communication time between computing nodes is hidden by overlap, and the number of bubbles on each computing device during training is reduced.
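The effect can be illustrated with a toy steady-state model (not from the patent): a stage computes for time t per group and waits up to time c for its next input unless other independent work is queued, and grouping K micro-batches keeps up to K-1 extra inputs ready.

```python
def bubble_fraction(t: float, c: float, k: int) -> float:
    """Fraction of time a stage idles under this crude model (all values assumed)."""
    idle = max(c - (k - 1) * t, 0.0)  # waiting not hidden by queued work
    return idle / (idle + k * t)

for k in (1, 2, 4):
    print(k, round(bubble_fraction(t=1.0, c=1.5, k=k), 3))
# -> 1 0.6, 2 0.2, 4 0.0: larger K hides more of the communication time
```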
Further, after determining the target scheduling policy with the shortest training duration, the embodiment of the present application further sends a target scheduling plan corresponding to the target scheduling policy to each computing node, so that each computing node performs model training according to the target scheduling plan.
Considering that the network state changes frequently in public cloud scenarios, the target scheduling plan determined in the above steps was selected according to the network state during the previous round of training; as the network state changes, training with that plan may no longer be efficient in later rounds, so the training plan needs to be adjusted in each round of training according to the current network state. If the training sample set of each batch contains the same number of samples, the sample splitting schemes are unchanged and the attribute information of each computing node is unchanged, so the candidate scheduling plans corresponding to the candidate scheduling policies do not need to be determined again; selection can be made directly from the candidate scheduling plans determined in the above steps. Specifically, when training the deep learning model based on the training sample set of the next batch: monitor the network state of the current network; according to the maximum storage space of each computing node and the current network state, simulate the model training of each computing node under the candidate scheduling plan of each candidate scheduling policy; estimate from the simulated training process the computation duration of each training sample under each candidate scheduling policy; measure the communication of each computing node with the other computing nodes while executing each candidate scheduling plan under the current network state to obtain the communication duration of each computing node under each candidate scheduling policy; invoke the overhead model to process the computation durations and communication durations into the training duration of each candidate scheduling policy; and, based on these training durations, reselect the target scheduling policy with the shortest training duration from the multiple candidate scheduling policies.
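A hedged sketch of this per-round adaptation loop follows; the monitoring hook, the cost constants, and the candidate names are illustrative stand-ins for the quantities described above, not the patent's interfaces.

```python
import random

CANDIDATES = {                        # policy -> (compute seconds, bytes communicated)
    "micro_batch_8_K2": (8.0, 8e8),   # assumed: more compute, fewer transfers
    "micro_batch_4_K4": (6.0, 2e9),   # assumed: less compute, more traffic
}

def monitor_bandwidth() -> float:
    """Stand-in for real network monitoring in a preemption scenario."""
    return random.uniform(1e8, 1e9)   # bytes per second

def training_duration(costs, bandwidth: float) -> float:
    compute, comm_bytes = costs
    return compute + comm_bytes / bandwidth

def train_rounds(num_batches: int) -> None:
    for batch in range(num_batches):
        bw = monitor_bandwidth()      # re-evaluate under the fresh network state
        target = min(CANDIDATES, key=lambda p: training_duration(CANDIDATES[p], bw))
        # here: send the plan of `target` to every computing node and run the round
        print(f"batch {batch}: selected {target} at {bw:.2e} B/s")

train_rounds(3)
```

Note how, under this toy model, a high-bandwidth round selects the communication-heavy candidate while a congested round falls back to the candidate with fewer transfers.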
Of course, if the number of samples in the training sample set differs from round to round, the target scheduling policy with the shortest training duration may be reselected for each round of training according to the method shown in steps 202 to 204, which is not repeated here.
The method provided by the embodiment of the present application adjusts the scheduling policy in time for each round of training according to that round's network state instead of adopting a fixed scheduling policy, and can therefore obtain high model training efficiency over the whole model training process regardless of the current network state. Especially in network preemption scenarios, when a large model is trained in a distributed manner in the pipeline parallel mode, the present application maintains a good training level without applying for a supercomputing cluster, saving cost and making it possible to train large-scale models on public cloud service products.
According to the method provided by the embodiment of the present application, when the deep learning model is trained in the pipeline parallel mode, a fixed 1F1B scheduling policy is not adopted. Instead, in each round of training, the training sample set is split according to its number of samples for that round, yielding multiple sample splitting schemes that differ in the number of training sample subsets and in the number of samples per subset. Within the maximum storage space of each computing node: for a given sample splitting scheme, the number of samples per split subset is fixed, and the more training sample subsets each computing node computes in parallel, the higher its computing efficiency; across different sample splitting schemes, the more samples per split subset, the fewer subsets a computing node can compute in parallel, while the more subsets are computed in parallel at a time, the fewer communications are needed between computing nodes and the lower the dependence on the network state. The present application takes the maximum storage space of each computing node as the constraint, searches for the maximum K value corresponding to each sample splitting scheme to obtain multiple candidate scheduling policies, and then obtains the training duration of each candidate scheduling policy under the current network state. By jointly considering the network state and the computing efficiency of each computing node, the influence of the network state on training efficiency is reduced, so that when the target scheduling policy with the shortest training duration is selected to control the computation of each computing node, high model training efficiency is obtained even in network resource preemption scenarios, and a high training level is maintained throughout the model training process.
Referring to Fig. 5, an embodiment of the present application provides a schematic structural diagram of a scheduling policy determining apparatus for pipeline parallel training. The apparatus is used in network resource preemption scenarios and may be implemented by software, hardware, or a combination of both as all or part of a computing device. The apparatus comprises:
a first determining module 501, configured to determine multiple candidate scheduling policies according to a training sample set of any batch and the attribute information of each computing node, where each computing node is deployed with a training stage obtained by segmenting the deep learning model; a candidate scheduling policy is used to indicate the manner in which the computing nodes train the deep learning model based on the training sample set, and includes a sample splitting scheme for uniformly splitting the training sample set into multiple training sample subsets together with the corresponding maximum K value, where the maximum K value indicates the maximum number of subsets each computing node processes in one forward computation or backward computation under the corresponding sample splitting scheme, and the maximum K value is greater than or equal to 2;
the generating module 502 is configured to generate a task image corresponding to each candidate scheduling policy based on a sample splitting scheme in each candidate scheduling policy and an association relationship between each computing node;
a second determining module 503, configured to determine, from the multiple candidate scheduling policies, the target scheduling policy with the shortest training duration according to the current network state and the task image and maximum K value corresponding to each candidate scheduling policy, where the target scheduling policy is used to control the computation process of each computing node.
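Purely for illustration, the three core modules could be wired together as below; the Python structure and names are assumptions of this sketch, not the patent's implementation, and the reference numerals 501-503 appear only in comments.

```python
class SchedulingPolicyDeterminer:
    """Illustrative skeleton of the apparatus in Fig. 5 (assumed structure)."""

    def __init__(self, determine_candidates, generate_task_image, determine_target):
        self.determine_candidates = determine_candidates  # module 501
        self.generate_task_image = generate_task_image    # module 502
        self.determine_target = determine_target          # module 503

    def run(self, sample_set, node_attrs, network_state):
        policies = self.determine_candidates(sample_set, node_attrs)
        images = {name: self.generate_task_image(p) for name, p in policies.items()}
        return self.determine_target(policies, images, network_state)
```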
In another embodiment of the present application, the first determining module 501 is configured to uniformly split the training sample set according to the number of samples included in the training sample set, so as to obtain multiple sample splitting schemes, where each sample splitting scheme includes the number of subsets of the split training sample subsets and the number of samples in the training sample subsets; and searching a maximum K value corresponding to each sample splitting scheme by taking the maximum storage space of each computing node as a constraint condition to obtain each candidate scheduling strategy.
In another embodiment of the present application, the first determining module 501 is configured to: in the process of simulating each computing node processing the training samples in each training sample subset under each sample splitting scheme, continuously increase the number of training sample subsets processed in parallel by each computing node until the maximum storage space of some computing node is reached; take the number of subsets being computed by the computing nodes when the maximum storage space is reached under each sample splitting scheme as the maximum K value corresponding to that sample splitting scheme; and form one candidate scheduling policy from each sample splitting scheme and its corresponding maximum K value.
In another embodiment of the present application, the generating module 502 is configured to generate, for each training sample subset under each candidate scheduling policy, a corresponding subtask image based on the association relationships among the computing nodes, obtaining a plurality of subtask images corresponding to each candidate scheduling policy, where the number of subtask images equals the number of split training sample subsets under that candidate scheduling policy; and to fuse the plurality of subtask images corresponding to each candidate scheduling policy into the task image corresponding to that policy.
In another embodiment of the present application, the second determining module 503 is configured to generate, according to the task image and the maximum K value corresponding to each candidate scheduling policy, a candidate scheduling plan corresponding to each candidate scheduling policy, where the candidate scheduling plan is a training plan when each computing node performs a specific model training task according to the corresponding candidate scheduling policy; according to the current network state, training time length of each candidate scheduling strategy is obtained by simulating a process that each computing node carries out model training according to a candidate scheduling plan corresponding to each candidate scheduling strategy; and selecting a target scheduling strategy with the shortest training time from a plurality of candidate scheduling strategies.
In another embodiment of the present application, the second determining module 503 is configured to estimate, according to a maximum storage space of each computing node, a computation duration of each training sample under each candidate scheduling policy in a process of simulating each computing node to perform model training according to a candidate scheduling plan corresponding to each candidate scheduling policy; under the current network state, the communication time length between each computing node and other computing nodes under each candidate scheduling strategy is obtained by simulating the communication process between each computing node and other computing nodes when each computing node executes the candidate scheduling plan corresponding to each candidate scheduling strategy; and calling an overhead model, and processing the calculated time length of each training sample and the communication time length of each calculation node under each candidate scheduling strategy to obtain the training time length of each candidate scheduling strategy.
In another embodiment of the present application, the apparatus further comprises:
and the sending module is used for sending the target scheduling plan corresponding to the target scheduling strategy to each computing node so that each computing node carries out model training according to the target scheduling plan.
In another embodiment of the present application, the apparatus further comprises:
the third determining module is used for determining a plurality of model segmentation schemes based on model information of the deep learning model and attribute information of each computing device used for model training, wherein the model segmentation schemes are used for indicating the segmentation modes of the deep learning model and computing nodes corresponding to training stages after segmentation, and each computing node comprises at least one computing device;
The first selection module is used for selecting a target segmentation scheme with the shortest training duration from a plurality of model segmentation schemes;
and a fourth determining module, configured to determine attribute information of each computing node based on the target segmentation scheme.
In another embodiment of the present application, the apparatus further comprises:
the monitoring module is used for monitoring the network state of the current network;
the second selection module is used for reselecting a target scheduling strategy with the shortest training time from a plurality of candidate scheduling strategies according to the monitored network state;
and the sending module is used for sending the re-selected target scheduling strategy to each computing node, wherein the re-selected target scheduling strategy is used for controlling the computing process of each computing node when the deep learning model is trained based on the training sample set of the next batch.
It should be noted that the scheduling policy determining apparatuses for pipeline parallel training shown in Fig. 1 and Fig. 5 can both select a scheduling policy suited to the current network in a pipeline parallel training scenario, so as to obtain higher model training efficiency in network preemption scenarios. Since their functions are the same, there is necessarily a correspondence between the devices included in Fig. 1 and the modules included in Fig. 5. Specifically, the adaptive packetizer in Fig. 1 corresponds to the first determining module in Fig. 5 and may include or be that module; the task image generator in Fig. 1 corresponds to the generating module in Fig. 5 and may include or be that module; the adaptive packet scheduler in Fig. 1 corresponds to the second determining module and the sending module in Fig. 5 and may include both; and the model slicer in Fig. 1 corresponds to the third determining module, the first selecting module, and the fourth determining module in Fig. 5 and may include all three.
According to the apparatus provided by the embodiment of the present application, when the deep learning model is trained in the pipeline parallel mode, a fixed 1F1B scheduling policy is not adopted. Instead, in each round of training, the training sample set is split according to its number of samples for that round, yielding multiple sample splitting schemes that differ in the number of training sample subsets and in the number of samples per subset. Within the maximum storage space of each computing node: for a given sample splitting scheme, the number of samples per split subset is fixed, and the more training sample subsets each computing node computes in parallel, the higher its computing efficiency; across different sample splitting schemes, the more samples per split subset, the fewer subsets a computing node can compute in parallel, while the more subsets are computed in parallel at a time, the fewer communications are needed between computing nodes and the lower the dependence on the network state. The present application takes the maximum storage space of each computing node as the constraint, searches for the maximum K value corresponding to each sample splitting scheme to obtain multiple candidate scheduling policies, and then obtains the training duration of each candidate scheduling policy under the current network state. By jointly considering the network state and the computing efficiency of each computing node, the influence of the network state on training efficiency is reduced, so that when the target scheduling policy with the shortest training duration is selected to control the computation of each computing node, high model training efficiency is obtained even in network resource preemption scenarios, and a high training level is maintained throughout the model training process.
An embodiment of the present application provides a pipeline parallel training system. Referring to Fig. 6, the system includes a scheduling policy determining apparatus 601 for pipeline parallel training and a plurality of computing nodes 602;
the scheduling policy determining apparatus 601 is an apparatus as shown in fig. 5, and is configured to execute the scheduling policy determining method for pipeline-oriented parallel training;
each computing node 602 is configured to receive the target scheduling policy sent by the scheduling policy determining apparatus 601, and perform model training according to the target scheduling policy.
Fig. 7 illustrates a block diagram of a computing device 700 provided in an exemplary embodiment of the present application. In general, the computing device 700 includes: a processor 701 and a memory 702.
The processor 701 may be implemented in at least one hardware form of a DSP (Digital Signal Processor), an FPGA (Field-Programmable Gate Array), or a PLA (Programmable Logic Array). The processor 701 may also include a main processor and a coprocessor: the main processor processes data in the awake state, while the coprocessor is a low-power processor for processing data in the standby state. In some embodiments, the processor 701 may integrate a GPU (Graphics Processing Unit) responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 701 may also include an artificial intelligence processor for processing computing operations related to machine learning.
The memory 702 may include one or more computer-readable storage media, which may be non-transitory, such as a CD-ROM (Compact Disc Read-Only Memory), a ROM, a RAM (Random Access Memory), a magnetic tape, a floppy disk, or an optical data storage device. The computer-readable storage medium stores at least one computer program that, when executed, implements the scheduling policy determination method for pipeline parallel training.
Of course, the computing device described above may also include other components, such as input/output interfaces and communication components. The input/output interface provides an interface between the processor and peripheral interface modules, which may be output devices, input devices, and the like. The communication component is configured to facilitate wired or wireless communication between the computing device and other devices.
Those skilled in the art will appreciate that the structure shown in fig. 7 is not limiting of computing device 700 and may include more or fewer components than shown, or may combine certain components, or may employ a different arrangement of components.
The embodiment of the application provides a computer readable storage medium, wherein at least one computer program is stored in the computer readable storage medium, and the scheduling policy determining method for pipeline-oriented parallel training can be realized when the at least one computer program is executed by a processor.
Embodiments of the present application provide a computer program product comprising a computer program that, when executed by a processor, enables a scheduling policy determination method for pipeline-oriented parallel training.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
The above embodiments are only for illustrating the technical solution of the present application, and are not limiting thereof; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions.

Claims (13)

1. A scheduling policy determination method for pipeline-oriented parallel training, wherein the method is applied to a network resource preemption scenario, the method comprising:
determining multiple candidate scheduling strategies according to a training sample set of any batch and attribute information of each computing node, wherein each computing node is deployed with a training stage obtained by segmenting a deep learning model, the candidate scheduling strategies are used for indicating a training mode in which each computing node trains the deep learning model based on the training sample set, each candidate scheduling strategy comprises a sample splitting scheme for uniformly splitting the training sample set into multiple training sample subsets and a corresponding maximum K value, the maximum K value is used for indicating the maximum number of subsets when each computing node performs one forward computation or backward computation under the corresponding sample splitting scheme, and the maximum K value is greater than or equal to 2;
based on a sample splitting scheme in each candidate scheduling strategy and an association relation between each computing node, generating task images corresponding to each candidate scheduling strategy;
and determining a target scheduling strategy with the shortest training time length from a plurality of candidate scheduling strategies according to the current network state, the task image corresponding to each candidate scheduling strategy and the maximum K value, wherein the target scheduling strategy is used for controlling the calculation process of each calculation node.
2. The method of claim 1, wherein determining a plurality of candidate scheduling policies based on the training sample set of any batch and the attribute information of each computing node comprises:
according to the number of samples included in the training sample set, uniformly splitting the training sample set to obtain a plurality of sample splitting schemes, wherein each sample splitting scheme comprises the number of subsets of split training sample subsets and the number of samples in the training sample subsets;
and searching a maximum K value corresponding to each sample splitting scheme by taking the maximum storage space of each computing node as a constraint condition to obtain each candidate scheduling strategy.
3. The method of claim 2, wherein searching for a maximum K value corresponding to each sample splitting scheme with a maximum storage space of each computing node as a constraint condition, to obtain each candidate scheduling policy includes:
in the process of simulating each computing node to process training samples in each training sample subset under each sample splitting scheme, the number of the training sample subsets processed by each computing node in parallel is continuously increased until the maximum storage space of any computing node is reached;
taking the number of subsets being computed by the computing nodes when the maximum storage space is reached under each sample splitting scheme as the maximum K value corresponding to each sample splitting scheme;
and forming a candidate scheduling strategy by a sample splitting scheme and a corresponding maximum K value.
4. The method according to claim 2, wherein the generating task images corresponding to each candidate scheduling policy based on the sample splitting scheme in each candidate scheduling policy and the association relationship between the computing nodes includes:
generating corresponding subtask images for each training sample subset under each candidate scheduling strategy based on the association relation among the computing nodes, and obtaining a plurality of subtask images corresponding to each candidate scheduling strategy, wherein the number of the subtask images is the same as the number of the split training sample subsets under each candidate scheduling strategy;
and fusing the multiple subtask images corresponding to each candidate scheduling strategy to obtain the task image corresponding to each candidate scheduling strategy.
5. The method according to claim 1, wherein the determining, according to the current network state and the task image and the maximum K value corresponding to each candidate scheduling policy, the target scheduling policy with the shortest training duration from the plurality of candidate scheduling policies includes:
Generating a candidate scheduling plan corresponding to each candidate scheduling strategy according to the task image corresponding to each candidate scheduling strategy and the maximum K value, wherein the candidate scheduling plan is a training plan when each computing node executes a model training task according to the corresponding candidate scheduling strategy;
according to the current network state, training time length of each candidate scheduling strategy is obtained by simulating a process that each computing node carries out model training according to a candidate scheduling plan corresponding to each candidate scheduling strategy;
and selecting a target scheduling strategy with the shortest training time from a plurality of candidate scheduling strategies.
6. The method according to claim 5, wherein the obtaining the training duration of each candidate scheduling policy by simulating the process that each computing node performs model training according to the candidate scheduling plan corresponding to each candidate scheduling policy according to the current network state includes:
in the process of simulating each calculation node to carry out model training according to a candidate scheduling plan corresponding to each candidate scheduling strategy, estimating the calculation time length of each training sample under each candidate scheduling strategy according to the maximum storage space of each calculation node;
Under the current network state, the communication time length between each computing node and other computing nodes under each candidate scheduling strategy is obtained by simulating the communication process between each computing node and other computing nodes when executing the candidate scheduling plan corresponding to each candidate scheduling strategy;
and calling an overhead model, and processing the calculation time length of each training sample and the communication time length of each calculation node and other calculation nodes under each candidate scheduling strategy to obtain the training time length of each candidate scheduling strategy.
7. The method according to claim 5, wherein after determining the target scheduling policy with the shortest training duration from the plurality of candidate scheduling policies according to the current network state and the task image and the maximum K value corresponding to each candidate scheduling policy, the method further comprises:
and sending the target scheduling plan corresponding to the target scheduling strategy to each computing node so that each computing node carries out model training according to the target scheduling plan.
8. The method according to any one of claims 1 to 7, further comprising, before determining a plurality of candidate scheduling policies from the training sample set of any batch and the attribute information of each computing node:
Determining a plurality of model segmentation schemes based on model information of the deep learning model and attribute information of each computing device for model training, wherein the model segmentation schemes are used for indicating computing nodes corresponding to segmentation modes of the deep learning model and training stages after segmentation, and each computing node comprises at least one computing device;
selecting a target segmentation scheme with the shortest training time from a plurality of model segmentation schemes;
and determining attribute information of each computing node based on the target segmentation scheme.
9. The method according to any one of claims 1 to 7, wherein after determining the target scheduling policy with the shortest training duration from the multiple candidate scheduling policies according to the current network state and the task image and the maximum K value corresponding to each candidate scheduling policy, the method further comprises:
monitoring the network state of the current network;
according to the monitored network state, re-selecting a target scheduling strategy with the shortest training time from a plurality of candidate scheduling strategies;
and sending a re-selected target scheduling strategy to each computing node, wherein the re-selected target scheduling strategy is used for controlling the computing process of each computing node when the deep learning model is trained based on the training sample set of the next batch.
10. A pipeline parallel training system, characterized by comprising a scheduling policy determining apparatus for pipeline parallel training and a plurality of computing nodes;
the scheduling policy determining device is configured to perform the scheduling policy determining method for pipeline-oriented parallel training according to any one of claims 1 to 9;
and each computing node is used for receiving the target scheduling strategy sent by the scheduling strategy determining device and performing model training according to the target scheduling strategy.
11. A computing device comprising a processor and a memory; the memory stores at least one piece of program code; the at least one program code is configured to be invoked and executed by the processor to implement a pipeline parallel training oriented scheduling policy determination method as claimed in any one of claims 1 to 9.
12. A computer readable storage medium, characterized in that at least one computer program is stored in the computer readable storage medium, which at least one computer program, when being executed by a processor, is capable of implementing a scheduling policy determination method for pipeline-oriented parallel training according to any one of claims 1 to 9.
13. A computer program product, characterized in that the computer program product comprises a computer program which, when executed by a processor, is capable of implementing a scheduling policy determination method for pipeline-oriented parallel training according to any one of claims 1 to 9.
CN202310222902.9A 2023-03-02 2023-03-02 Scheduling strategy determination method and system for pipeline parallel training Pending CN116450312A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310222902.9A CN116450312A (en) 2023-03-02 2023-03-02 Scheduling strategy determination method and system for pipeline parallel training

Publications (1)

Publication Number Publication Date
CN116450312A true CN116450312A (en) 2023-07-18



Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116974654A (en) * 2023-09-21 2023-10-31 浙江大华技术股份有限公司 Image data processing method and device, electronic equipment and storage medium
CN116974654B (en) * 2023-09-21 2023-12-19 浙江大华技术股份有限公司 Image data processing method and device, electronic equipment and storage medium
CN117057411A (en) * 2023-10-11 2023-11-14 北京燧原智能科技有限公司 Large language model training method, device, equipment and storage medium
CN117057411B (en) * 2023-10-11 2024-01-09 北京燧原智能科技有限公司 Large language model training method, device, equipment and storage medium
CN117112145A (en) * 2023-10-16 2023-11-24 之江实验室 Training model distribution method, training model distribution device, computer equipment and storage medium
CN117112145B (en) * 2023-10-16 2024-02-13 之江实验室 Training model distribution method, training model distribution device, computer equipment and storage medium
CN117311998A (en) * 2023-11-30 2023-12-29 卓世未来(天津)科技有限公司 Large model deployment method and system
CN117311998B (en) * 2023-11-30 2024-03-05 卓世未来(天津)科技有限公司 Large model deployment method and system
CN118069375A (en) * 2024-04-18 2024-05-24 清华大学 Pipeline parallel optimization method and device for large model training of data center


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination