CN115033388A - Method and system for configuring a pipeline-parallel GPU (graphics processing unit) in an artificial intelligence system


Info

Publication number
CN115033388A
Authority
CN
China
Prior art keywords
gpu
training
network
partition
neural network
Prior art date
Legal status
Pending
Application number
CN202210797455.5A
Other languages
Chinese (zh)
Inventor
胡晋彬
贺蔓
刘颖
王进
Current Assignee
Changsha University of Science and Technology
Original Assignee
Changsha University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Changsha University of Science and Technology filed Critical Changsha University of Science and Technology
Priority to CN202210797455.5A
Publication of CN115033388A
Legal status: Pending


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a pipeline-parallel GPU (graphics processing unit) configuration method and system for an artificial intelligence system, aimed at shared GPU clusters and applied to distributed neural network training. To solve the problem that the GPU configuration cannot be adjusted dynamically because the GPU allocation scheme is fixed in pipeline parallelism on a shared GPU cluster, several new work partitions are generated from static indexes and dynamic indexes before the next training iteration, and the available bandwidth of each GPU is included in the dynamic indexes so that the new work partitions reflect the dynamically available GPU resources; a meta-network is then introduced to predict the training speed of each work partition and screen the candidates, and reinforcement learning is introduced to decide whether the current work partition should be replaced.

Description

Method and system for configuring a pipeline-parallel GPU (graphics processing unit) in an artificial intelligence system
Technical Field
The invention belongs to the technical field of deep neural network training in artificial intelligence, and in particular relates to a method and a system for configuring a pipeline-parallel GPU (graphics processing unit) in an artificial intelligence system, where the GPU configuration method is applied to distributed neural network training.
Background
In the field of Artificial Intelligence (AI), accelerators are usually used to perform the computation required to train Deep Neural Networks (DNNs), and the computation generally consists of forward propagation, backward propagation and gradient descent. By iterating these three steps continuously, the neural network is eventually driven to a converged state. However, as data sets and deep learning models keep growing, the number of model layers increases, so that training a DNN consumes large amounts of compute and storage resources, i.e., the required computing power and GPU memory footprint keep increasing. To reduce deep learning training time, distributed deep learning, which performs parallel computation on a cluster of machines, has gradually become the focus of technical innovation and development.
Common techniques for distributed deep learning are data parallelism, model parallelism and pipeline parallelism. In data parallelism the data set is partitioned, i.e., each GPU (graphics processing unit) computes on a portion of the data set and a copy of the model parameters is maintained on every GPU. In model parallelism the model is partitioned, i.e., each GPU holds different layers of the model and a copy of the data set is maintained on every GPU. Pipeline parallelism makes up for the shortcomings of data parallelism and model parallelism: it is based on model parallelism, partitions the model into layers, and assigns a GPU to each layer to process that layer's data. Unlike model parallelism, where only one GPU works at any moment, in pipeline parallelism the GPUs of upper and lower layers can process different data at the same time, which reduces the waste of resources. For example, a data subset is split into several micro-batches (groups of data); during forward propagation, after one GPU finishes the computation of one group of data, it does not wait for backward propagation but continues with the forward propagation of the next group, so that the GPUs of upper and lower layers process different data simultaneously, the idle time of the GPUs is greatly reduced, and parallel efficiency is improved.
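For illustration only (this sketch is not part of the patent), the following comparison shows why overlapping micro-batches reduces GPU idle time: it contrasts the forward-pass time of one mini-batch under pure model parallelism with a GPipe-style pipelined schedule, using a hypothetical unit cost per micro-batch per stage.

```python
# Illustrative sketch only: forward-pass time of one mini-batch under pure model
# parallelism versus a pipelined schedule over micro-batches. "t" is a hypothetical
# per-stage, per-micro-batch unit cost.

def model_parallel_time(num_stages: int, num_microbatches: int, t: float = 1.0) -> float:
    # Without pipelining, the whole batch passes through the stages one after
    # another, so only one GPU is busy at any moment.
    return num_stages * num_microbatches * t

def pipeline_parallel_time(num_stages: int, num_microbatches: int, t: float = 1.0) -> float:
    # With pipelining, after a fill phase of (num_stages - 1) steps all stages
    # work on different micro-batches at the same time.
    return (num_stages + num_microbatches - 1) * t

if __name__ == "__main__":
    print(model_parallel_time(4, 8))     # 32.0 time units
    print(pipeline_parallel_time(4, 8))  # 11.0 time units
```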
From the above, the performance of pipeline parallelism in distributed training is highly dependent on the work partition. The work partition determines how the computation of each layer is allocated to the GPUs, i.e., a GPU is configured for each layer of the model, and the same GPU can be responsible for several layers. In the existing pipeline-parallel techniques, however, the work partition is computed once according to the available GPUs and bandwidth when training starts, and is then kept fixed. In practice, DNN training is time-consuming and often takes hours or days; meanwhile, in current DNN training the GPU cluster is often shared with other jobs, so the available resources of a GPU are likely to fluctuate during training: other jobs sharing the GPUs may start, finish or pause, causing GPU resources to fluctuate, and jobs other than DNN training may also cause bandwidth to fluctuate. This greatly affects the runtime performance of pipeline parallelism; in particular, when the available resources can no longer satisfy the previously computed work partition, that partition becomes outdated, the allocation cannot be adjusted in time to the actual GPU resources, and the performance of the pipeline-parallel method suffers.
Disclosure of Invention
The invention aims to solve the following problems in a shared-GPU-cluster scenario: because the GPU allocation scheme in pipeline parallelism is fixed in a distributed training system, GPU allocation cannot be adjusted in time to the actual GPU resources, GPU resource utilization is low, and both the performance of the pipeline-parallel method and the efficiency of neural network training are affected. The invention therefore provides a pipeline-parallel GPU configuration method and system for an artificial intelligence system. In the GPU configuration method of this technical scheme, before the next neural network training iteration several new work partition schemes are generated from static indexes and dynamic indexes, the available bandwidth of each GPU is included in the dynamic indexes, and the newly generated work partitions dynamically reflect the available GPU resources; a meta-network is further introduced to predict the training speed of each work partition scheme and screen the partitions, and reinforcement learning is introduced to decide whether the screened partition should replace the current work partition. In summary, the technical scheme of the invention effectively solves the problem of a fixed GPU allocation scheme through the GPU configuration method, achieves more reasonable distributed training, and effectively improves GPU resource utilization, thereby guaranteeing the training efficiency of the subsequent neural network.
In one aspect, the present invention provides a method for configuring a pipeline-parallel GPU in an artificial intelligence system. The GPU configuration method is applied to a shared GPU cluster, i.e., a GPU cluster in which the GPUs also host other jobs, and is used to update or maintain the current configuration relationship between GPUs and network layers before the next training iteration of the neural network. The GPU configuration method includes the following steps:
step 1: acquiring a current static index and a current dynamic index in a distributed training system;
the distributed training system adopts a pipeline-parallel mode: a GPU is configured for each network layer of the neural network, and the same GPU may be responsible for training one or more network layers; the static indexes comprise: the number of network layers of the neural network, the number of GPUs, and the training characteristics of each network layer; the dynamic indexes comprise: the available bandwidth of each GPU, and the forward propagation time and backward propagation time of the network layers each GPU is responsible for;
step 2: generating a plurality of new work partitions according to the static indexes and the dynamic indexes, wherein the work partitions represent the configuration relationship between the GPU and the network layer;
step 3: taking the static indexes, the dynamic indexes and the new work partitions as input, and obtaining a predicted training speed for each new work partition by using a work partition training speed prediction model; the inputs of the work partition training speed prediction model are the static indexes, the dynamic indexes and a new work partition, and its output is the predicted training speed of that new work partition;
step 4: screening the new work partitions based on the predicted training speed of each new work partition and the number of GPUs whose configuration relationship changes in each new work partition;
step 5: taking the static indexes, the dynamic indexes, the screened new work partition and the current work partition as input, and using a screening model based on reinforcement learning to determine whether to replace the current work partition, i.e. whether to update or maintain the current configuration relationship between GPUs and network layers.
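For illustration only, steps 1 to 5 can be pictured as the following sketch; collect_metrics, generate_partitions, predict_speed and rl_select are hypothetical placeholders for the components described above, injected as callables, and are not implementations provided by the patent.

```python
# Illustrative sketch only of steps 1 to 5 of the GPU configuration method.

def num_changed_gpus(candidate, current):
    # Partitions are assumed to be in an array form with one entry per GPU,
    # so a GPU "changes" when its assigned layers differ between the two partitions.
    return sum(1 for a, b in zip(candidate, current) if a != b)

def configure_before_next_iteration(current_partition, current_speed,
                                    collect_metrics, generate_partitions,
                                    predict_speed, rl_select):
    static, dynamic = collect_metrics()                          # step 1: static/dynamic indexes
    candidates = generate_partitions(static, dynamic)            # step 2: new work partitions
    scored = [(p, predict_speed(static, dynamic, p))             # step 3: meta-network prediction
              for p in candidates]
    shortlisted = [p for p, v in scored                          # step 4: screening rules
                   if num_changed_gpus(p, current_partition) == 2
                   and v > current_speed]
    # Step 5: the reinforcement-learning screening model decides whether to switch.
    return rl_select(static, dynamic, shortlisted, current_partition)
```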
In the GPU cluster provided by this technical scheme, the GPUs execute the DNN training job and other jobs simultaneously, so the available resources of the GPUs are likely to fluctuate considerably during DNN training, which ultimately affects GPU utilization. Therefore, before every training iteration except the first, if the available GPU resources have changed, the GPU configuration method of this technical scheme either updates the configuration relationship between GPUs and network layers or keeps the current relationship unchanged; that is, the configuration relationship is optimized before the next training iteration of the neural network so that it is dynamically matched to the available GPU resources, thereby improving GPU utilization. A work partition training speed prediction model is built to obtain a predicted training speed for each work partition, and this prediction is used as a screening criterion, which ensures that the training speed of the screened work partitions is high; the number of GPUs whose configuration relationship changes is used as a further screening criterion, which effectively narrows the search range and reduces the time complexity. Finally, reinforcement learning is introduced to judge whether the new work partition or the original work partition is better suited to the current environment, so that the better work partition is selected for the next training iteration of the neural network, and migration to the better partition is decided by the reinforcement learning model.
For the work partition training speed prediction model, the invention uses the static indexes, the dynamic indexes and the new work partition together as model inputs. To monitor changes in GPU computing capacity, and considering that a GPU may be responsible for training several network layers and for different network layers in different training tasks, the forward propagation time and backward propagation time of each layer are embedded into the feature space as dynamic indexes, which makes the trained work partition training speed prediction model more robust. Furthermore, to capture GPU computing capability comprehensively, the available bandwidth of each GPU is also embedded into the feature space; combining dynamic and static indexes effectively improves the reliability of the evaluation result.
In conclusion, the technical scheme of the invention takes the dynamic change of available GPU resources into account, so that the GPU allocation scheme of each training iteration is not fixed but dynamically matched to the currently available GPU resources, which effectively improves GPU utilization and makes distributed training more reasonable. Moreover, a meta-network is used to obtain predicted training speeds so that work partitions with higher expected training speed are screened out, and reinforcement learning automatically judges whether the work partition needs to be replaced, further ensuring that the work partition selected for the next training iteration of the neural network is more reasonable and suitable.
Further optionally, the work partition training speed prediction model in step 3 is constructed based on a meta-network, where the meta-network includes: an embedding layer for each group of static indexes and dynamic indexes, an LSTM network corresponding to the dynamic indexes, and a fully connected layer; the embedding layer corresponding to the dynamic indexes is connected to the LSTM network, and the embedding layer corresponding to the static indexes and the LSTM network are connected to the fully connected layer;
the dynamic indexes are fed into their embedding layer, and the resulting output is fed into the LSTM network to obtain the sequence features of the dynamic indexes;
the static indexes are fed into their embedding layer, and the resulting output, the sequence features and the new work partition are taken as the input of the fully connected layer, whose output is the predicted training speed of the new work partition.
In the technical scheme of the invention, the embedding layer, the LSTM network and the fully connected layer are all existing network architectures, which the invention does not describe in detail; it should be understood that the size of each network is determined by the size of each input feature group, so as to ensure that the output of the final fully connected layer can be used as the predicted training speed.
Further optionally, the work partition training speed prediction model and the screening model are both constructed by offline training;
the goal of the reward function in the screening model is that the training speed corresponding to the work partition selected by the screening model is greater than the training speed corresponding to the previous work partition.
It should be understood that how a reinforcement learning reward function participates in training and how it is set are prior art; the invention does not optimize this but, when applying it to this scenario, sets its goal according to the requirements of the application, namely that the training speed corresponding to the work partition selected by the screening model should be greater than the training speed corresponding to the previous work partition. In other words, when the training speed of the selected work partition is greater than that of the previous work partition, the output of the screening model is regarded as positive; the concrete meaning of the reward function in reinforcement learning can therefore be defined adaptively according to this goal.
Further optionally, when new work partitions are screened in step 4, the following two rules need to be satisfied simultaneously:
rule 1: in a screened new work partition, the configuration of only 2 GPUs changes, where the configuration is the configuration relationship between GPUs and network layers;
rule 2: the predicted training speed of a screened new work partition is higher than the training speed corresponding to the current work partition.
Further optionally, the training features of each network layer include an output activation size, a weight parameter, and a gradient of the network layer.
In a second aspect, the present invention provides a neural network distributed training method based on a GPU configuration method, which includes the following steps:
step S1: loading a neural network to be trained and a data set into a distributed training system, and dividing the neural network into a plurality of network layers;
step S2: initializing a work subarea and carrying out first training of a neural network;
a GPU is configured for each network layer of the neural network, and the same GPU is responsible for training one or more network layers; that is, the configuration relationship between GPUs and network layers is determined according to the initialization work partition, the GPUs use the data set to train the neural network, and the distributed training system uses a distributed communication mechanism for its communication connections;
step S3: before the next training iteration of the neural network, determining the work partition for the next iteration in the manner of steps 1 to 5 and then training; if a new work partition is obtained, updating the configuration relationship between GPUs and network layers according to the new work partition;
step S4: judging whether the iterative training termination condition of the neural network is met, if not, returning to the step S3 to continue training; otherwise, completing the training of the neural network.
In a third aspect, the present invention provides a GPU allocation apparatus based on the GPU configuration method, which includes:
the dynamic and static index acquisition module is used for acquiring the current static index and dynamic index in the distributed training system;
the distributed training system adopts a pipeline parallel mode, a GPU is configured for each network layer of the neural network, the same GPU is responsible for training one or more network layers, and the static indexes comprise: the number of network layers of the neural network, the number of GPUs and training characteristics of each network layer; the dynamic index includes: available bandwidth of each GPU, and forward propagation time and backward propagation time of the network layer where each GPU is responsible;
the configuration module is used for generating a plurality of new work partitions according to the static indexes and the dynamic indexes; the new work partition represents the configuration relation between the GPU and the network layer;
the training speed prediction value acquisition module is used for taking the static index, the dynamic index and the new work partition as input and obtaining a training speed prediction value corresponding to each new work partition by using a work partition training speed prediction model;
the screening module is used for screening the new work partitions based on the training speed predicted value of each new work partition and the number of GPUs with changed configuration relations in each new work partition;
and the decision module is used for taking the static indexes, the dynamic indexes, the screened new work partitions and the current work partitions as input, and determining whether to replace the current work partitions by using a screening model based on reinforcement learning, namely updating or maintaining the configuration relationship between the current GPU and the network layer.
In a fourth aspect, the present invention provides a distributed training system based on the GPU configuration method or the training method, where the distributed training system at least includes a plurality of GPU servers, and each GPU server is provided with a GPU, a CPU, a memory, a network card and a switch;
the GPU is used for realizing network training; the memory is used for storing data; the CPU, the network card and the switch are used for realizing data transmission, and distributed communication is adopted among the GPU servers.
In a fifth aspect, the present invention provides an electronic device, comprising:
one or more processors;
a memory storing one or more computer programs;
wherein the processor invokes the computer program to implement:
the steps of the pipeline-parallel GPU configuration method in an artificial intelligence system, or the steps of the neural network distributed training method based on the GPU configuration method.
In a sixth aspect, the present invention provides a readable storage medium storing a computer program for invocation by a processor to implement:
the steps of the pipeline-parallel GPU configuration method in an artificial intelligence system, or the steps of the neural network distributed training method based on the GPU configuration method.
Advantageous effects
1. According to the pipeline-parallel GPU configuration method for artificial intelligence systems provided by the invention, before every training iteration except the first, if the available GPU resources have changed, the configuration relationship between GPUs and network layers is updated or the current relationship is kept unchanged; that is, the configuration relationship is optimized before the next training iteration of the neural network, so that it is dynamically matched to the available GPU resources and GPU utilization is improved. In the concrete implementation, the available bandwidth of each GPU and the forward and backward propagation times of the network layers each GPU is responsible for are included in the dynamic indexes; several work partitions are generated by combining them with the static indexes; the predicted training speed of each new work partition is computed from the dynamic and static indexes and used as a screening criterion to select new work partitions with higher training speed; finally, reinforcement learning judges whether the screened new work partition is better suited to the current environment than the current work partition, so that a better work partition is determined for the next training iteration. This guarantees GPU utilization, lays a foundation for accelerating neural network training, and effectively addresses the technical challenge posed by the dynamic change of available GPU resources in a shared GPU cluster during distributed training.
2. When the GPU configuration method of this technical scheme is applied to neural network training, the GPU allocation scheme for every iteration except the first is determined by the GPU configuration method, so that the whole training process of the neural network is closely and dynamically matched to the available GPU resources, and the training speed and efficiency of the neural network are improved. The core of the method is to solve the problem that GPU allocation in distributed training cannot be dynamically adjusted to the available GPU resources; this technical idea is applicable to the training of any type of neural network that uses distributed pipeline parallelism on a shared GPU cluster, and therefore has a wide range of application.
Drawings
Fig. 1 is a schematic flowchart of a method for configuring a GPU in a pipeline parallel manner in an artificial intelligence system according to an embodiment of the present invention;
fig. 2 is a schematic flowchart of a neural network distributed training method based on a GPU configuration method according to an embodiment of the present invention;
FIG. 3 is a schematic network structure diagram of a training speed prediction model of a work partition constructed based on a meta-network according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a network architecture of a reinforcement learning-based screening model according to an embodiment of the present invention;
FIG. 5 shows how the training speed of three deep neural network training schemes varies with network bandwidth for different deep neural network models and two different communication modes; wherein (a) shows the training speeds of the three training methods under PS, TensorFlow and ResNet50; (b) under PS, TensorFlow and VGG16; (c) under PS, TensorFlow and AlexNet; (d) under PS, MXNet and ResNet50; (e) under PS, MXNet and VGG16; (f) under PS, MXNet and AlexNet; (g) under Ring-AllReduce, PyTorch and ResNet50; (h) under Ring-AllReduce, PyTorch and VGG16; and (i) under Ring-AllReduce, PyTorch and AlexNet.
Detailed Description
Existing distributed pipeline-parallel methods do not consider the problem that fluctuations in the available resources of GPUs in a shared GPU cluster affect GPU utilization: the GPU configuration for each network layer of the neural network is determined before training starts and kept unchanged during training. To solve this technical problem, the invention provides a pipeline-parallel GPU configuration method, a neural network distributed training method and a system for artificial intelligence systems. The available bandwidth of each GPU and the forward and backward propagation times of each GPU on different model layers are monitored and used as dynamic indexes; then several work partitions suitable for the current environment are generated from the dynamic indexes and the recorded static indexes, and the training speed of each work partition is predicted by a meta-network so that more reasonable new work partitions are screened out; finally, reinforcement learning decides whether to switch to a new work partition. The present invention is further described with reference to the following examples.
Example 1
The pipeline-parallel GPU configuration method in an artificial intelligence system provided by this embodiment is used to update or maintain the current configuration relationship between GPUs and network layers before the next training iteration of the neural network, and specifically comprises the following steps:
Step 1: acquiring the current static indexes and dynamic indexes in the distributed training system.
In this embodiment, the distributed training system adopts pipeline parallelism, which is based on model parallelism: the model is partitioned, and the data of one batch is split into multiple micro-batches, so that the GPUs of upper and lower layers can process different micro-batches at the same time. In other words, in this embodiment GPU allocation means that a GPU is configured for each network layer of the neural network, and the same GPU may be responsible for training one or more network layers.
In this embodiment, the static indexes include the number of network layers of the neural network, the number of GPUs, and the training characteristics of each network layer, where the training characteristics of a network layer are its output activation size O_i, weight parameters P_i and gradient G_i; in other possible embodiments, the training characteristics of a network layer can also be set to the weight parameters P_i and gradient G_i, i.e., they are adjusted adaptively according to the precision requirement. Let N denote the number of network layers of the neural network and M the number of GPUs; the training characteristics of all network layers are represented as vectors:
O = [O_1, ..., O_i, ..., O_N]
P = [P_1, ..., P_i, ..., P_N]
G = [G_1, ..., G_i, ..., G_N]
where O, P and G denote the output activation size vector, the weight parameter vector and the gradient vector over all network layers of the neural network, respectively.
The dynamic indexes in this embodiment include: the available bandwidth B_j of each GPU, and the forward propagation time FP_i,j and backward propagation time BP_i,j of the network layers each GPU is responsible for, where j is the GPU index and i is the network layer index. The available bandwidth vector B, the forward propagation time vector FP and the backward propagation time vector BP over all GPUs are:
B = [B_1, ..., B_j, ..., B_M]
FP = [FP_1, ..., FP_j, ..., FP_M], with FP_i,j ∈ FP_j
BP = [BP_1, ..., BP_j, ..., BP_M], with BP_i,j ∈ BP_j
where FP_j denotes the forward propagation time vector of the network layers the j-th GPU is responsible for, and BP_j denotes the backward propagation time vector of the network layers the j-th GPU is responsible for.
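For illustration, the static and dynamic indexes gathered in step 1 can be pictured as the following data structures; the class and field names are assumptions made for this sketch, not structures defined by the patent.

```python
# Illustrative data structures only; N is the number of network layers, M the number of GPUs.
from dataclasses import dataclass
from typing import List

@dataclass
class StaticIndex:
    num_layers: int                    # N
    num_gpus: int                      # M
    activation_size: List[float]       # O = [O_1, ..., O_N]
    weight_params: List[float]         # P = [P_1, ..., P_N]
    gradients: List[float]             # G = [G_1, ..., G_N]

@dataclass
class DynamicIndex:
    bandwidth: List[float]             # B = [B_1, ..., B_M], available bandwidth per GPU
    forward_time: List[List[float]]    # FP[j] = forward times of the layers on GPU j
    backward_time: List[List[float]]   # BP[j] = backward times of the layers on GPU j
```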
Step 2: generating a plurality of new work partitions according to the static indexes and the dynamic indexes.
For example, in this embodiment the means of generating new work partitions can follow the prior art, including the partitioning algorithm described in the publicly available paper "Generalized Pipeline Parallelism for DNN Training", which is used to generate several new work partitions. That paper discloses a partitioning algorithm, i.e., a series of new work partitions is generated from the static and dynamic indexes using that method; however, this embodiment does not adopt the dynamic-programming screening disclosed in that paper, but screens the work partitions using the technique below.
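The cited paper's partitioning algorithm is not reproduced here; the following simplified, assumed stand-in only illustrates one way of producing candidate partitions in which exactly two GPUs change, by moving one boundary layer between adjacent GPUs.

```python
# Illustrative sketch only: generate neighbouring candidate partitions by shifting one
# boundary layer between adjacent GPUs, so exactly two GPUs change in each candidate.

def neighbouring_partitions(partition):
    """partition[j] = contiguous list of layer indices assigned to GPU j."""
    candidates = []
    for j in range(len(partition) - 1):
        left, right = partition[j], partition[j + 1]
        if len(left) > 1:   # shift the last layer of GPU j onto GPU j+1
            candidates.append(partition[:j] + [left[:-1], [left[-1]] + right] + partition[j + 2:])
        if len(right) > 1:  # shift the first layer of GPU j+1 onto GPU j
            candidates.append(partition[:j] + [left + [right[0]], right[1:]] + partition[j + 2:])
    return candidates

# Example: neighbouring_partitions([[0, 1], [2, 3], [4]]) yields
# [[0], [1, 2, 3], [4]], [[0, 1, 2], [3], [4]] and [[0, 1], [2], [3, 4]].
```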
Step 3: taking the static indexes, the dynamic indexes and the new work partitions as input, and obtaining the predicted training speed of each new work partition by using the work partition training speed prediction model constructed based on a meta-network.
The meta-network is an independent network introduced to obtain the predicted training speed of a work partition. The meta-network of this embodiment is shown in FIG. 3 and comprises an embedding layer for each group of features, an LSTM network for the dynamic features, and a fully connected layer. The dynamic indexes are fed into the embedding layer for the dynamic features, and the resulting outputs are fed into the LSTM network, which learns the dynamic environment and produces the sequence features of the available bandwidth vector B, the forward propagation time vector FP and the backward propagation time vector BP; the static indexes are fed into the embedding layer for the static features, and its output, the sequence features and the new work partition are taken as the input of the fully connected layer, whose output is the predicted training speed of the new work partition.
The embedding layer, the LSTM and the fully connected layer are all existing network architectures, and the functions they realize are the functions these networks inherently provide; for example, an embedding layer converts input data into fixed-size vectors, and the fully connected layer connects the features extracted by all components so as to predict the training speed. The invention combines the embedding layers, the LSTM and the fully connected layer into a meta-network, in which the embedding layer for the dynamic indexes is connected to the LSTM network, and the embedding layer for the static indexes and the LSTM network are connected to the fully connected layer, finally realizing the prediction of the work partition training speed. It should be understood that the network sizes of the embedding layers, the LSTM and the fully connected layer are determined by the input static indexes, dynamic indexes and work partition, so that when the static indexes, dynamic indexes and work partition are combined as model input, the model output is the predicted training speed.
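The following PyTorch sketch shows one possible realization of the meta-network of FIG. 3; the layer sizes, the use of linear projections as the "embedding layers" and the use of the last LSTM hidden state as the sequence feature are illustrative assumptions, not choices prescribed by the patent.

```python
# Illustrative PyTorch sketch of the meta-network (embedding -> LSTM for the dynamic
# indexes, embedding for the static indexes, fully connected layer over everything).
import torch
import torch.nn as nn

class PartitionSpeedPredictor(nn.Module):
    def __init__(self, static_dim: int, dyn_dim: int, part_dim: int,
                 embed_dim: int = 32, hidden: int = 64):
        super().__init__()
        self.static_embed = nn.Linear(static_dim, embed_dim)  # embedding for static indexes
        self.dyn_embed = nn.Linear(dyn_dim, embed_dim)        # embedding for dynamic indexes
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.fc = nn.Sequential(                              # fully connected layer(s)
            nn.Linear(embed_dim + hidden + part_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),                             # predicted training speed V
        )

    def forward(self, static_x, dyn_seq, partition):
        # static_x: (batch, static_dim); dyn_seq: (batch, seq_len, dyn_dim) built from the
        # B, FP and BP vectors; partition: (batch, part_dim) encoding of the partition S.
        s = self.static_embed(static_x)
        d, _ = self.lstm(self.dyn_embed(dyn_seq))
        seq_feat = d[:, -1, :]                                # sequence features of the dynamic indexes
        return self.fc(torch.cat([s, seq_feat, partition], dim=-1))
```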
It should be understood that, with the meta-network architecture constructed according to the invention and its inputs and outputs determined, a sample set is built from work partitions with known training speed together with their dynamic and static indexes, and the work partition training speed prediction model is obtained by training the meta-network offline. The training process can be understood as obtaining the function f(FP, BP, B, N, M, P, G, O, S) → V, where V is the predicted training speed of the work partition and S denotes a new work partition, which is generally expressed in array form, such as S[M] = {S[0], S[1], ..., S[M-1]}: the work partition is described by an array of size M, the array index starts from 0, and each element of the array represents the layers allocated to the corresponding worker. The training process is conventional in the art and is therefore not described in detail.
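For illustration, the offline training of the meta-network could look like the following sketch, assuming the PartitionSpeedPredictor sketch above and a sample set of (static indexes, dynamic index sequence, partition S, measured speed V) tensors gathered from past runs; the optimizer, loss function and tensor shapes are assumptions.

```python
# Illustrative offline training loop for the work partition training speed prediction model.
import torch

def train_speed_predictor(model, samples, epochs: int = 100, lr: float = 1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    for _ in range(epochs):
        for static_x, dyn_seq, partition_s, measured_v in samples:
            opt.zero_grad()
            pred_v = model(static_x, dyn_seq, partition_s)   # f(FP, BP, B, N, M, P, G, O, S) -> V
            loss = loss_fn(pred_v, measured_v)               # measured_v: (batch, 1) tensor
            loss.backward()
            opt.step()
    return model
```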
Step 4: screening the new work partitions based on the predicted training speed of each new work partition and the number of GPUs whose configuration relationship changes in each new work partition.
In this embodiment, when new work partitions are screened, the following two rules need to be satisfied simultaneously:
rule 1: in a screened new work partition, the configuration of only 2 GPUs changes. A change in the configuration relationship between a GPU and the network layers means that the network layers that GPU is responsible for have changed.
rule 2: the predicted training speed of a screened new work partition is higher than the training speed of the current work partition.
Once the predicted training speed of every candidate work partition has been obtained from the meta-network, finding the optimal work partition by enumeration would be time-consuming; the invention therefore preferably first screens out the work partitions in which the configuration relationship of only two GPUs changes, and then selects the faster work partitions according to the predicted training speed.
It should be understood that, in this embodiment, if only one work partition is required after the final screening, the work partition with the largest predicted training speed is selected.
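As an illustration only, the two screening rules and the "keep the single fastest partition" step could be written as the following sketch, assuming partitions in the S[M] array form, plain float speeds, and a dictionary of predictions keyed by the stringified partition (all of these representations are assumptions).

```python
# Illustrative sketch of step 4's screening under rules 1 and 2.

def screen_partitions(candidates, predicted, current_partition, current_speed, keep_one=True):
    survivors = [p for p in candidates
                 if sum(1 for a, b in zip(p, current_partition) if a != b) == 2  # rule 1
                 and predicted[str(p)] > current_speed]                          # rule 2
    if keep_one and survivors:
        # If only one work partition is required, keep the one with the largest prediction.
        return [max(survivors, key=lambda p: predicted[str(p)])]
    return survivors
```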
Step 5: taking the static indexes, the dynamic indexes, the screened new work partition and the current work partition as input, and using a screening model based on reinforcement learning to determine whether to replace the current work partition, i.e., whether to update the configuration relationship between GPUs and network layers or keep it unchanged.
Reinforcement learning differs from deep learning: deep learning trains an algorithm on existing data to find a pattern for solving the corresponding problem and then uses that pattern to make predictions on new data, whereas reinforcement learning adjusts its own actions (outputs) according to the feedback of the results (the reward function) in order to obtain the optimal result. In this embodiment the reward function is based on the training speed: specifically, the training speed of the selected work partition is compared with the training speed of the previous work partition. If the training speed of the selected work partition is greater, the output of the screening model is judged to be correct, i.e., the output is correct in the environment of the current training, and the current partition solution will be preferred the next time this network environment is encountered; otherwise the current output is considered poorer, and the priority of this partition solution is lowered the next time this network environment is encountered. Since the invention does not optimize the reinforcement learning network structure (a fully connected neural network) or the type of reward function, but only uses them to decide whether the work partition should be updated, the screening model is obtained by determining the model inputs and outputs of this application and training the model offline on a sample set containing those inputs and outputs.
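The following sketch illustrates, under stated assumptions, the kind of fully connected selector network and reward signal described above; the two-action formulation (keep vs. switch), the +1/-1 reward values and all dimensions are illustrative choices, not the patent's concrete design.

```python
# Illustrative sketch of the reinforcement-learning screening model and its reward.
import torch
import torch.nn as nn

class PartitionSelector(nn.Module):
    def __init__(self, state_dim: int, hidden: int = 64):
        super().__init__()
        # state = concatenated features of the static indexes, dynamic indexes,
        # the screened new work partition and the current work partition.
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, state):
        return self.net(state)          # logits for [keep current partition, switch to new partition]

def reward(selected_speed: float, previous_speed: float) -> float:
    # Positive feedback when the selected work partition trains faster than the previous one.
    return 1.0 if selected_speed > previous_speed else -1.0
```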
It should be appreciated that, following the above process, a more reasonable GPU configuration, i.e., a work partition better suited to the current environment, can be determined for the next training iteration of the neural network. This process is preferably executed after the current training iteration finishes and before the next training iteration of the neural network starts, so as to determine the work partition for the next iteration.
Example 2:
The GPU configuration method provided in Embodiment 1 is applied to the distributed training of a neural network; accordingly, this embodiment provides a neural network distributed training method based on the GPU configuration method, which includes the following steps:
Step S1: loading the neural network to be trained and the data set into the distributed training system, and splitting the neural network into layers.
Step S2: initializing a work partition and performing the first training iteration. In this embodiment, a work partition from an existing pipeline-parallel scheme is selected as the initialization work partition, i.e., the configuration relationship between GPUs and network layers is determined according to the initialization work partition.
Step S3: before the next training of the neural network, determining a working partition of the next training of the neural network according to the modes of the steps 1 to 5 and then training; and if the new work partition is obtained, updating the configuration relation between the GPU and the network layer according to the new work partition.
Since the implementation of this step can refer to the implementation of embodiment 1, it is not specifically stated.
Step S4: judging whether the iterative training termination condition of the neural network is met, if not, returning to the step S3 to continue training; otherwise, completing the training of the neural network.
A deep neural network model needs many training iterations before it converges (i.e., before its output basically matches the expected result). This embodiment provides a corresponding work partition solution before each training iteration, compares its training speed with that of the work partition solution used in the previous iteration, lets the system decide whether to switch to the newly proposed work partition solution, and then starts the training iteration.
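For illustration, the loop of steps S1 to S4 can be sketched as follows; decide_partition stands for the GPU configuration method of steps 1 to 5, and apply_partition, run_pipeline_iteration and converged are hypothetical stand-ins for the distributed runtime rather than functions defined by the patent.

```python
# Illustrative sketch of the distributed training loop with per-iteration partition decisions.

def distributed_training(layers, dataset, initial_partition, decide_partition,
                         apply_partition, run_pipeline_iteration, converged,
                         max_iterations: int = 10_000):
    partition = initial_partition                        # step S2: existing pipeline-parallel scheme
    apply_partition(partition)
    for iteration in range(max_iterations):              # steps S3 and S4
        if iteration > 0:
            new_partition = decide_partition(partition)  # steps 1-5 before the next iteration
            if new_partition != partition:               # switch only if the screening model chose it
                apply_partition(new_partition)
                partition = new_partition
        run_pipeline_iteration(layers, dataset, partition)   # forward, backward, parameter update
        if converged():
            break
    return partition
```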
It should be understood that a neural network has many network layers, and distributed training of the neural network means that each GPU is responsible for training certain network layers; the work partition of the invention therefore determines which GPU is responsible for which network layers. How a GPU uses the data set to carry out neural network training is prior art, and the invention places no specific restriction on it.
In addition, the neural network distributed training method does not restrict the type or application scenario of the neural network model; the technical scheme of the invention can be applied as long as neural network distributed training is performed under the working condition of a shared GPU cluster. In this embodiment the method is applied to image classification: three deep neural network models, VGG16, ResNet50 and AlexNet, are selected to realize image classification, and the neural networks are then trained with image-classification training data to obtain image classification models. The synthetic training data are set to the format of ImageNet. Other possible application areas include translation (e.g., neural network models for English translation), video captioning, language recognition, and so on.
Example 3:
This embodiment provides a distributed training system that at least comprises several GPU servers, each provided with a GPU, a CPU, a memory, a network card and a switch. The network layers each GPU server is responsible for in each training iteration are determined according to the GPU configuration method of Embodiment 1; equivalently, using the neural network distributed training method of Embodiment 2, all GPU servers in the distributed training system are used to realize the distributed training of the neural network.
The GPU is used to realize network training, i.e., data computation. The memory is used to store data. The CPU, the network card and the switch are used to realize data transmission. The GPU servers communicate in a distributed manner, for example via a PS (Parameter Server) architecture or Ring-AllReduce.
It should be appreciated that in some implementations one GPU server may be selected as the controller and, in addition to performing network training, execute the GPU configuration method of Embodiment 1 to determine the network layers each GPU server is responsible for in each training iteration; in other implementations, a separate external controller may determine the network layers each GPU server is responsible for in each training iteration, and all GPU servers are then used to realize the distributed training of the neural network. The concrete implementation technique depends on the chosen communication mode, and such distributed training systems are prior art, so they are not described in detail.
Example 4:
the present embodiment provides a GPU allocation apparatus of the GPU configuration method, which includes: the device comprises a dynamic and static index acquisition module, a configuration module, a training speed predicted value acquisition module, a screening module and a decision module.
The dynamic and static index acquisition module is used for acquiring the current static index and dynamic index in the distributed training system.
The distributed training system adopts a pipeline parallel mode, a GPU is configured for each network layer of the neural network, the same GPU is responsible for training one or more network layers, and the static indexes comprise: the number of network layers of the neural network, the number of GPUs and training characteristics of each network layer, such as output activation size, weight parameters and gradient; the dynamic index includes: available bandwidth per GPU, forward propagation time and backward propagation time of the network layer in which each GPU is responsible.
The configuration module is used for generating a plurality of new work partitions according to the static indexes and the dynamic indexes, wherein the work partitions represent the configuration relationship between the GPU and the network layer.
And the training speed predicted value acquisition module is used for taking the static indexes, the dynamic indexes and the new working partitions as input and obtaining the training speed predicted value corresponding to each new working partition by utilizing a working partition training speed prediction model constructed based on a meta-network.
And the screening module is used for screening the new work partitions based on the training speed predicted value of each new work partition and the number of the GPUs with changed configuration relations in each new work partition.
And the decision module is used for taking the static indexes, the dynamic indexes, the screened new work partitions and the current work partitions as input, and determining whether to replace the current work partitions by using a screening model based on reinforcement learning, namely updating or maintaining the configuration relationship between the current GPU and the network layer.
For the implementation process of each module, please refer to the content of the above method, which is not described herein again. It should be understood that the above described division of functional blocks is merely a division of logical functions and that in actual implementation there may be additional divisions, for example, where multiple elements or components may be combined or integrated into another system or where some features may be omitted, or not implemented. Meanwhile, the integrated unit can be realized in a hardware form, and can also be realized in a software functional unit form.
Example 5:
This embodiment provides an electronic device, which includes: one or more processors and a memory storing one or more computer programs; wherein the processor invokes the computer program to implement the steps of the pipeline-parallel GPU configuration method in an artificial intelligence system, or the steps of the neural network distributed training method based on the GPU configuration method.
In some implementations, when the processor invokes the computer program to implement the steps of the pipeline-parallel GPU configuration method in an artificial intelligence system, the following steps are specifically performed:
step 1: acquiring a current static index and a current dynamic index in a distributed training system;
the distributed training system adopts a pipeline parallel mode, a GPU is configured for each network layer of the neural network, the same GPU is responsible for training one or more network layers, and the static indexes comprise: the number of network layers of the neural network, the number of GPUs and training characteristics of each network layer; the dynamic index includes: available bandwidth per GPU, forward propagation time and backward propagation time of the network layer in which each GPU is responsible.
Step 2: generating a plurality of new work partitions according to the static indexes and the dynamic indexes; the work partition represents a configuration relationship between the GPU and the network layer.
Step 3: taking the static indexes, the dynamic indexes and the new work partitions as input, and obtaining the predicted training speed of each new work partition by using the work partition training speed prediction model constructed based on a meta-network;
Step 4: screening the new work partitions based on the predicted training speed of each new work partition and the number of GPUs whose configuration relationship changes in each new work partition;
Step 5: taking the static indexes, the dynamic indexes, the screened new work partition and the current work partition as input, and using the screening model based on reinforcement learning to determine whether to replace the current work partition, i.e., whether to update or maintain the current configuration relationship between GPUs and network layers.
In other implementations, when the processor invokes the computer program to implement the steps of the neural network distributed training method based on the GPU configuration method, the following steps are specifically performed:
step S1: loading a neural network to be trained and a data set into a distributed training system, and layering the neural network;
step S2: initializing a work partition and carrying out primary training;
the method comprises the following steps that a GPU is configured for each network layer of a neural network, the same GPU is responsible for training one or more network layers, namely the configuration relation between the GPU and the network layers is determined according to an initialization work partition, and a distributed training system is in communication connection by adopting a distributed communication mechanism;
step S3: before the next training of the neural network, determining a working partition of the next training of the neural network according to the modes of the steps 1 to 5 and then training; if the new work partition is obtained, updating the configuration relation between the GPU and the network layer according to the new work partition;
step S4: judging whether the iterative training termination condition of the neural network is met, if not, returning to the step S3 to continue training; otherwise, completing the training of the neural network.
The memory may include high-speed RAM memory and may also include non-volatile memory, such as at least one disk memory.
If the memory and the processor are implemented independently, the memory, the processor and the communication interface may be connected to each other via a bus and perform communication with each other. The bus may be an industry standard architecture bus, a peripheral device interconnect bus, an extended industry standard architecture bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc.
Optionally, in a specific implementation, if the memory and the processor are integrated on a chip, the memory and the processor may complete communication with each other through an internal interface.
It should be understood that in the embodiments of the present invention, the Processor may be a Central Processing Unit (CPU), and the Processor may also be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, and the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The memory may include both read-only memory and random access memory, and provides instructions and data to the processor. The portion of memory may also include non-volatile random access memory. For example, the memory may also store device type information.
Example 6:
the present embodiments provide a readable storage medium storing a computer program for invocation by a processor to implement: a step of a method for configuring a GPU in parallel in a pipeline in an artificial intelligence system; or realizing a step of a neural network distributed training method based on the GPU configuration method.
Wherein, in some forms, the computer program is invoked by a processor to implement: when the steps of the method for configuring the GPU in parallel in the flow in the artificial intelligence system are disclosed, the following steps are specifically executed:
step 1: and acquiring the current static index and dynamic index in the distributed training system.
The distributed training system adopts a pipeline parallel mode, a GPU is configured for each network layer of the neural network, the same GPU is responsible for training one or more network layers, and the static indexes comprise the number of network layers of the neural network, the number of GPUs and training characteristics of each network layer; the dynamic index includes: available bandwidth of each GPU, and forward propagation time and backward propagation time of the network layer where each GPU is responsible;
step 2: generating a plurality of new work partitions according to the static indexes and the dynamic indexes; the work partition represents a configuration relationship between the GPU and the network layer.
And step 3: and taking the static index, the dynamic index and the new working partition as input, and obtaining a training speed prediction value corresponding to each new working partition by utilizing a working partition training speed prediction model constructed based on a meta-network.
And 4, step 4: and screening the new work partitions based on the training speed predicted value of each new work partition and the number of GPUs with changed configuration relations in each new work partition.
And 5: and taking the static indexes, the dynamic indexes, the screened new work partition and the current work partition as input, and determining whether to replace the current work partition by using a reinforcement learning-based screening model, namely updating or maintaining the configuration relationship between the current GPU and the network layer.
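As a non-limiting illustration of step 3, the sketch below shows one way the meta-network-based work partition training speed prediction model could be organized in PyTorch: embedding layers for the static and dynamic indexes, an LSTM that extracts sequence features from the embedded dynamic indexes, and a fully connected layer that also receives the candidate work partition. All dimensions, the encoding of the indexes and of the partition vector, and the use of linear projections as embedding layers are illustrative assumptions, not the disclosed implementation.

import torch
import torch.nn as nn

class SpeedPredictor(nn.Module):
    # Hypothetical meta-network: embeddings + LSTM + fully connected layer.
    def __init__(self, static_dim, dynamic_dim, partition_dim, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.static_embed = nn.Linear(static_dim, embed_dim)    # embedding of the static indexes
        self.dynamic_embed = nn.Linear(dynamic_dim, embed_dim)  # embedding of the dynamic indexes
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(embed_dim + hidden_dim + partition_dim, 1)

    def forward(self, static_idx, dynamic_seq, partition_vec):
        s = torch.relu(self.static_embed(static_idx))           # (batch, embed_dim)
        d = torch.relu(self.dynamic_embed(dynamic_seq))         # (batch, seq_len, embed_dim)
        _, (h, _) = self.lstm(d)                                 # sequence features of the dynamic indexes
        features = torch.cat([s, h[-1], partition_vec], dim=1)  # static + sequence + candidate partition
        return self.fc(features).squeeze(-1)                     # predicted training speed

model = SpeedPredictor(static_dim=8, dynamic_dim=6, partition_dim=16)
speed = model(torch.randn(4, 8), torch.randn(4, 10, 6), torch.randn(4, 16))
print(speed.shape)  # torch.Size([4]) - one speed prediction per candidate partition

In step 4 these predictions would then be compared against the speed of the current work partition, together with the number of GPUs whose configuration changes, to screen the candidate partitions before the reinforcement-learning-based decision of step 5.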
In other implementations, when the computer program is invoked by the processor to carry out the steps of the neural network distributed training method based on the GPU configuration method, the following steps are specifically executed:
step S1: loading the neural network to be trained and the data set into the distributed training system, and layering the neural network;
step S2: initializing a work partition and carrying out the first training;
wherein a GPU is configured for each network layer of the neural network and the same GPU may be responsible for training one or more network layers, that is, the configuration relationship between the GPUs and the network layers is determined according to the initialized work partition, and the distributed training system is connected for communication by adopting a distributed communication mechanism;
step S3: before the next training of the neural network, determining the work partition for the next training in the manner of steps 1 to 5 and then training; if a new work partition is obtained, updating the configuration relationship between the GPUs and the network layers according to the new work partition;
step S4: judging whether the iterative training termination condition of the neural network is met; if not, returning to step S3 to continue training; otherwise, completing the training of the neural network.
The specific implementation process of each step refers to the explanation of the foregoing method.
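For concreteness, part of the dynamic indexes of step 1, namely the forward propagation time and backward propagation time of the layers a GPU is responsible for, could be measured per iteration roughly as sketched below. The measurement strategy shown here is an assumption made for illustration; the available-bandwidth index would come from the communication layer and is not shown.

import time
import torch
import torch.nn as nn

def time_stage(stage: nn.Module, batch: torch.Tensor):
    # Times one forward and one backward pass of the layers this GPU stage is responsible for.
    if torch.cuda.is_available():
        stage, batch = stage.cuda(), batch.cuda()

    def sync():
        if torch.cuda.is_available():
            torch.cuda.synchronize()

    sync()
    t0 = time.perf_counter()
    out = stage(batch)            # forward propagation of this stage
    sync()
    t1 = time.perf_counter()
    out.sum().backward()          # backward propagation of this stage
    sync()
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1       # (forward time, backward time) in seconds

stage = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10))
fwd, bwd = time_stage(stage, torch.randn(32, 256))
print(f"forward: {fwd * 1e3:.2f} ms, backward: {bwd * 1e3:.2f} ms")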
The readable storage medium is a computer readable storage medium, which may be an internal storage unit of the controller according to any of the foregoing embodiments, for example, a hard disk or a memory of the controller. The readable storage medium may also be an external storage device of the controller, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, provided on the controller. Further, the readable storage medium may also include both an internal storage unit of the controller and an external storage device. The readable storage medium is used for storing the computer program and other programs and data required by the controller. The readable storage medium may also be used to temporarily store data that has been output or is to be output.
Based on such understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product stored in a storage medium and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned readable storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other various media capable of storing program code.
Experimental verification:
Experimental setup: 10 GPU servers are used, each server being provided with an NVIDIA P100 GPU, 40 CPU cores, 128GB of memory and a Mellanox ConnectX-5 100Gbps network card, and the servers are connected through a Mellanox SN2100 switch; the Mellanox driver version is 5.1-0.6.6.0, the operating system of the experiment is Ubuntu 18.04, and the Linux kernel version is 4.15.0-55-generic. The experiment uses three deep neural network models, namely VGG16, ResNet50 and AlexNet, two different parameter synchronization schemes, namely PS (Parameter Server) and Ring All-reduce, and three different machine learning frameworks, namely TensorFlow, MXNet and PyTorch; VGG16 is set to 64, ResNet50 is set to 128 and AlexNet is set to 256, and the link bandwidth ranges from 10Gbps to 100Gbps. The method of the present invention is compared with the existing deep neural network pipeline training scheme PipeDream and with Baseline under these different environments, wherein Baseline is a common deep neural network training scheme.
Fig. 5 shows the training speeds of the three deep neural network training schemes as the network bandwidth changes, under different deep neural network models and two different communication modes. (a)-(c) in Fig. 5 show the training speeds of the three training methods as the network bandwidth changes (10Gbps, 25Gbps, 40Gbps and 100Gbps) under the scenarios of PS, TensorFlow and the three neural network models (ResNet50, VGG16 and AlexNet): (a) shows the training speeds of the three training methods under the scenario of PS, TensorFlow and ResNet50, (b) shows the training speeds under the scenario of PS, TensorFlow and VGG16, and (c) shows the training speeds under the scenario of PS, TensorFlow and AlexNet; the method of the present invention is named AutoPipe. It can be seen from the figure that the performance of AutoPipe is 177% and 89% higher than that of Baseline and PipeDream for ResNet50, 113% and 44% higher for VGG16, and 143% and 70% higher for AlexNet.
(d)-(f) in Fig. 5 show the training speeds of the three training methods as the network bandwidth changes (10Gbps, 25Gbps, 40Gbps and 100Gbps) under the scenarios of PS, MXNet and the three neural network models (ResNet50, VGG16 and AlexNet): (d) shows the training speeds of the three training methods under the scenario of PS, MXNet and ResNet50, (e) shows the training speeds under the scenario of PS, MXNet and VGG16, and (f) shows the training speeds under the scenario of PS, MXNet and AlexNet. It can be seen from the figure that the performance of AutoPipe is 171% and 82% higher than that of Baseline and PipeDream for ResNet50, 104% and 41% higher for VGG16, and 124% and 58% higher for AlexNet.
(g)-(i) in Fig. 5 show the training speeds of the three training methods as the network bandwidth changes (10Gbps, 25Gbps, 40Gbps and 100Gbps) under the scenarios of Ring All-reduce, PyTorch and the three neural network models (ResNet50, VGG16 and AlexNet): (g) shows the training speeds of the three training methods under the scenario of Ring All-reduce, PyTorch and ResNet50, (h) shows the training speeds under the scenario of Ring All-reduce, PyTorch and VGG16, and (i) shows the training speeds under the scenario of Ring All-reduce, PyTorch and AlexNet. It can be seen that for ResNet50 the performance of AutoPipe is 148% and 65% higher than that of Baseline and PipeDream, and for VGG16 the performance of AutoPipe is 143% and 17% higher than that of Baseline and PipeDream.
From the above experiments it can be observed that: 1) AutoPipe outperforms PipeDream in all cases, and in some cases AutoPipe obtains even greater acceleration on top of PipeDream. 2) AutoPipe shows greater acceleration on ResNet50. The reason is that ResNet50 contains more layers than the other two models, so AutoPipe gains more benefit from more accurate modeling and finer-grained switching.
It should be emphasized that the examples described herein are illustrative and not restrictive, and thus the invention is not to be limited to the examples described herein, but rather to other embodiments that may be devised by those skilled in the art based on the teachings herein, and that various modifications, alterations, and substitutions are possible without departing from the spirit and scope of the present invention.

Claims (10)

1. A method for configuring a GPU (graphics processing unit) with parallel flow in an artificial intelligence system, characterized in that: the GPU configuration method is based on a shared GPU cluster and is used for updating or maintaining the configuration relationship between the current GPUs and the network layers before the next training of a neural network, wherein shared GPUs exist in the GPU cluster; the GPU configuration method comprises the following steps:
step 1: acquiring a current static index and a current dynamic index in a distributed training system;
the distributed training system adopts a pipeline parallel mode, a GPU is configured for each network layer of the neural network, and the same GPU is responsible for training one or more network layers; the static indexes comprise: the number of network layers of the neural network, the number of GPUs and the training features of each network layer; the dynamic indexes comprise: the available bandwidth of each GPU, and the forward propagation time and backward propagation time of the network layer(s) each GPU is responsible for;
step 2: generating a plurality of new work partitions according to the static indexes and the dynamic indexes;
the work partition represents the configuration relation between the GPU and the network layer;
step 3: taking the static indexes, the dynamic indexes and the new work partitions as input, and obtaining a training speed prediction value corresponding to each new work partition by using a work partition training speed prediction model;
step 4: screening the new work partitions based on the training speed prediction value of each new work partition and the number of GPUs whose configuration relationship changes in each new work partition;
step 5: taking the static indexes, the dynamic indexes, the screened new work partitions and the current work partition as input, and determining whether to replace the current work partition by using a reinforcement-learning-based screening model, that is, updating or maintaining the configuration relationship between the current GPUs and the network layers.
2. The GPU configuration method according to claim 1, characterized in that: in step 3, the work partition training speed prediction model is constructed based on a meta-network, and the meta-network comprises: embedding layers corresponding to each group of static indexes and to the dynamic indexes, an LSTM network and a fully connected layer, wherein the embedding layer corresponding to the dynamic indexes is connected with the LSTM network, and the embedding layers corresponding to the static indexes and the LSTM network are connected with the fully connected layer;
the dynamic indexes are input into the corresponding embedding layer, and the obtained output result is input into the LSTM network to obtain the sequence features of the dynamic indexes;
the static indexes are input into the corresponding embedding layers, and the obtained output results, the sequence features and a new work partition are taken as the input of the fully connected layer, the output of which is the training speed prediction value of the new work partition.
3. The GPU configuration method according to claim 2, characterized in that: the work partition training speed prediction model and the screening model are constructed through offline training;
the goal of the reward function in the screening model is to make the training speed corresponding to the work partition selected by the screening model greater than the training speed corresponding to the previous work partition.
4. The GPU configuration method of claim 1, wherein: when new work partitions are screened in step 4, the following two rules need to be met simultaneously:
rule 1: the screened new work partition has only 2 GPUs whose configuration changes, the configuration being the configuration relationship between the GPU and the network layer;
rule 2: the training speed prediction value of the screened new work partition is higher than the training speed corresponding to the current work partition.
5. The GPU configuration method of claim 1, wherein: the training features of each network layer include an output activation size, a weight parameter, and a gradient of the network layer.
6. A neural network distributed training method based on the GPU configuration method of claim 1, characterized in that: the method comprises the following steps:
step S1: loading a neural network to be trained and a data set into a distributed training system, wherein the neural network is divided into a plurality of network layers;
step S2: initializing a work partition and carrying out the first training of the neural network;
wherein a GPU is configured for each network layer of the neural network and the same GPU is responsible for training one or more network layers, that is, the configuration relationship between the GPUs and the network layers is determined according to the initialized work partition, the GPUs train the neural network by using the data set, and the distributed training system is connected for communication by adopting a distributed communication mechanism;
step S3: before the next training of the neural network, determining the work partition corresponding to the next training in the manner of steps 1 to 5 and then training; if a new work partition is obtained, updating the configuration relationship between the GPUs and the network layers according to the new work partition;
step S4: judging whether the iterative training termination condition of the neural network is met, if not, returning to the step S3 to continue training; otherwise, completing the training of the neural network.
7. A GPU allocation apparatus based on the GPU configuration method of any one of claims 1 to 5, characterized in that: the apparatus comprises:
the dynamic and static index acquisition module is used for acquiring the current static index and dynamic index in the distributed training system;
the distributed training system adopts a pipeline parallel mode, a GPU is configured for each network layer of the neural network, and the same GPU is responsible for training one or more network layers; the static indexes comprise: the number of network layers of the neural network, the number of GPUs and the training features of each network layer; the dynamic indexes comprise: the available bandwidth of each GPU, and the forward propagation time and backward propagation time of the network layer(s) each GPU is responsible for;
the configuration module is used for generating a plurality of new work partitions according to the static indexes and the dynamic indexes;
the work partition represents the configuration relation between the GPU and the network layer;
the training speed prediction value acquisition module is used for taking the static indexes, the dynamic indexes and the new work partitions as input and obtaining a training speed prediction value corresponding to each new work partition by using a work partition training speed prediction model;
the screening module is used for screening the new work partitions based on the training speed prediction value of each new work partition and the number of GPUs whose configuration relationship changes in each new work partition;
the decision module is used for taking the static indexes, the dynamic indexes, the screened new work partitions and the current work partition as input, and determining whether to replace the current work partition by using a reinforcement-learning-based screening model, that is, updating or maintaining the configuration relationship between the current GPUs and the network layers.
8. A distributed training system based on the GPU configuration method of claim 1 or the training method of claim 6, characterized in that: the distributed training system at least comprises a plurality of GPU servers; each GPU server is provided with a GPU, a CPU, a memory, a network card and a switch;
the GPU is used for realizing neural network training; the memory is used for storing data; the CPU, the network card and the switch are used for realizing data transmission, and distributed communication is adopted among the GPU servers.
9. An electronic device, characterized in that: the electronic device comprises:
one or more processors;
a memory storing one or more computer programs;
wherein the processor invokes the computer program to implement:
the GPU configuration method of claim 1 or the neural network distributed training method of claim 6.
10. A readable storage medium, characterized by: a computer program is stored, which is invoked by a processor to implement:
the GPU configuration method of claim 1 or the neural network distributed training method of claim 6.
CN202210797455.5A 2022-07-08 2022-07-08 Method and system for configuring GPU (graphics processing Unit) with parallel flow in artificial intelligence system Pending CN115033388A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210797455.5A CN115033388A (en) 2022-07-08 2022-07-08 Method and system for configuring GPU (graphics processing Unit) with parallel flow in artificial intelligence system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210797455.5A CN115033388A (en) 2022-07-08 2022-07-08 Method and system for configuring GPU (graphics processing Unit) with parallel flow in artificial intelligence system

Publications (1)

Publication Number Publication Date
CN115033388A true CN115033388A (en) 2022-09-09

Family

ID=83128759

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210797455.5A Pending CN115033388A (en) 2022-07-08 2022-07-08 Method and system for configuring GPU (graphics processing Unit) with parallel flow in artificial intelligence system

Country Status (1)

Country Link
CN (1) CN115033388A (en)

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination