CN115794385A - Container automatic arrangement method for deep learning model distributed training


Info

Publication number
CN115794385A
CN115794385A
Authority
CN
China
Prior art keywords: model, training, operator, deep learning, container
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211426263.XA
Other languages
Chinese (zh)
Inventor
曹春
徐经纬
崔子寒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN202211426263.XA
Publication of CN115794385A
Legal status: Pending


Abstract

The invention discloses an automatic container arrangement method for distributed training of deep learning models. The method acquires the operators of the neural network model to be trained; uses a computation-time prediction model to predict the computation time of each operator, obtaining the forward- and backward-propagation time cost of each layer; for the given hardware devices, runs communication bandwidth tests in parallel and collects the topology information of the devices; partitions the model with a simulated annealing strategy by combining the analysis results of the computation-time prediction model with the hardware topology information; builds container images, creates containers from the images, and orchestrates the containers with Kubernetes; and runs a training process in each container, the containers communicating with one another to complete the training of the model together. For the training of complex neural network models, the invention provides an out-of-the-box automatic model partitioning function, and the partitioned model is trained in parallel on multiple devices, which improves the training efficiency of large models.

Description

Automatic container arrangement method for distributed training of deep learning models
Technical Field
The invention relates to an automatic container arrangement method for distributed training of deep learning models, and belongs to the fields of distributed computing and deep learning training optimization.
Background
As a representative technique in the field of artificial intelligence, deep learning is widely applied in many areas, including computer vision, natural language processing, speech recognition, and autonomous driving, and has achieved leading results. Deep learning uses deep neural network models to extract features from training data and, depending on the task, to classify or recognize them. Common neural network models include the multilayer perceptron (MLP), the convolutional neural network (CNN), and the recurrent neural network (RNN). A neural network model is trained with the back-propagation algorithm: training data are fed into the network, and the model's prediction is obtained through forward propagation. The prediction is compared with the ground truth, and the loss is computed according to the selected loss function. During back propagation, the model solves for the gradient of the parameters with respect to the loss layer by layer, from back to front. After the gradients are obtained, the parameters are updated according to the gradient and the learning rate so as to reduce the loss.
Although deep learning achieves leading results on many tasks, the training process of the model faces many challenges. On the one hand, a model needs a large amount of data to converge, reaching the expected accuracy only after many epochs of training; the large amount of computation involved requires high-performance computing devices such as GPUs, TPUs (tensor processing units), and FPGAs, and the hardware resources used in training must be managed reasonably and efficiently. On the other hand, improvements in model performance are accompanied by increases in model complexity and parameter count; a complex model must retain more intermediate results during training and therefore needs a large amount of memory, so an out-of-memory (OOM) problem can occur on a single device.
For complex models and massive data, existing methods use model parallelism to split the model across multiple devices for distributed training: each device is responsible for training part of the model parameters, and the devices exchange the data required for forward and backward propagation through point-to-point communication. During training, data throughput, defined as the amount of data processed per unit time, is an important measure of training efficiency. However, at the model partitioning stage, an automatic partitioning method that can effectively improve training efficiency is lacking. Some existing methods rely on manual partitioning, requiring an experienced algorithm engineer to analyze the model and then give a partitioning scheme; this requires algorithm researchers to participate in the training process, depends heavily on the developer's experience, and cannot be used out-of-the-box. Other methods measure the computational power required by each layer of the model with coarse-grained indicators such as floating-point operations (FLOPs) and then partition for load balancing. These partitioning methods, on the one hand, do not take into account the communication cost between devices after partitioning and, on the other hand, lack a formal model of the training process and of the relation between the partitioning result and training efficiency. The resulting partitioning schemes can lead to low device utilization and thus reduce training efficiency.
As for training of the partitioned model, if traditional physical machines are used directly, resource utilization is low, isolation between tasks cannot be guaranteed, and tasks interfere with one another.
Disclosure of Invention
Purpose of the invention: in view of the problems and shortcomings of the prior art, the invention provides a Kubernetes-based container orchestration method for distributed deep learning training. A group of physical machines is organized into a Kubernetes cluster for unified resource management and task orchestration; the method provides an out-of-the-box automatic model partitioning function and a Kubernetes-based distributed training function. After the partitioned model is containerized, it is orchestrated with Kubernetes and trained distributively across multiple containers, which can effectively improve the efficiency of distributed deep learning training.
Technical scheme: an automatic container arrangement method for distributed training of deep learning models comprises a general computation-time prediction method, a deep learning model partitioning method based on a simulated annealing strategy, and Kubernetes-based training execution.
1) Container scheduling platform based on Kubernetes: the user's tasks are scheduled on a distributed computer cluster using Kubernetes.
2) A computation-time profile is established for a given deep neural network model by building a computation-time prediction model. This process is performed offline, which effectively saves the overhead of model analysis; its output is used to accelerate the subsequent training process.
3) Automated model partitioning and training: for the model to be trained provided by the user, the model's characteristics are extracted automatically, the model is partitioned in combination with the hardware information, and automatic load balancing and communication optimization are performed. The partitioned model is scheduled using the container scheduling platform of step 1).
The Kubernetes-based container orchestration and scheduling platform comprises:
11) For a given deep learning model to be trained, the training task is encapsulated and isolated in a containerized manner and scheduled using Kubernetes.
12) In the containerized manner, the tasks are allocated resources and isolated from one another; a training process runs in each container, and the containers communicate over the network to complete the training of the model together. This simplifies starting and stopping tasks, resource management in the distributed computer cluster, and the use of dedicated computing resources.
The computation-time prediction model specifically comprises:
21) For a given deep learning model, dynamic analysis of the model yields the operators actually used in the model and the hyper-parameter configuration of each operator. The operator types include: the two-dimensional convolution operator (Conv2d), the linear transformation operator (Linear), the max-pooling operator (MaxPooling), the average-pooling operator (AveragePooling), the rectified linear unit operator (ReLU), and the batch normalization operator (BatchNorm).
22) The computation time of an operator is predicted with the computation-time prediction model: the operator type and its hyper-parameters are taken as input, and the forward-propagation time and backward-propagation time of the operator during deep learning training are predicted.
211) For a given deep learning model, the model is defined using the open-source framework PyTorch. Through dynamic analysis, the model is finally parsed into a directed acyclic graph whose nodes are operators.
221) The computation-time prediction model is modeled with a multilayer neural network, which predicts the computation time of an operator by learning the relation between the operator's computation time and its hyper-parameters.
The automated model partitioning and training in 3) specifically comprises:
31) Generating a placement solution for the deep learning model to be trained. For the given hardware devices, communication bandwidth tests are run in parallel and the topology information of the devices is collected; with the operator computation times obtained from the computation-time prediction model, the model is partitioned in combination with the device topology information, with the goals of minimizing the processing time of a single data batch and maximizing data throughput.
32) Containerizing the partitioned model. Container images are built according to the partitioning result: each model partition, together with the training code related to the model, is packaged with Docker into a general image file for subsequent training.
33) Performing distributed training on the partitioned model. In the Kubernetes cluster, containers are created from the images: according to the image files built in step 32), the corresponding containers are created on Kubernetes, and a model training process runs in each container. The containers are orchestrated and scheduled by Kubernetes and communicate with one another using TCP and the NCCL communication library, completing the distributed training of the model together.
In 31), the topology information of the devices is collected online, and a point-to-point communication bandwidth test is performed with the NCCL communication library on the devices participating in training; the bandwidth test collects the time of point-to-point data transfers between devices and obtains a linear model describing the communication speed by linear fitting.
In 31), the model partitioning adopts an improved simulated annealing strategy to find a better partitioning scheme. During the search, the inter-device communication speed model is used to steer the search towards lower communication cost.
In 33), distributed training is performed on the partitioned model as follows:
321) Kubernetes is used to orchestrate and schedule the training tasks: each model partition, together with the related training code, is packaged into a container through containerization and is orchestrated and scheduled by Kubernetes.
322) A queue-based asynchronous communication mode is used: each container starts four worker threads, responsible respectively for receiving forward-propagation input from the predecessor device, sending forward-propagation results to the successor device, receiving backward-propagation input from the successor device, and sending backward-propagation results to the predecessor device; the four worker threads work in parallel to improve communication efficiency, as sketched after this list.
323) Distributed training uses model parallelism: the containers cooperate to complete the training of one model, communicating with one another over the network to complete the training process.
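As an illustration of this queue-based thread layout, the following is a minimal sketch using Python's threading and queue modules; only the two forward-direction communication threads and the training thread are shown, and send_fn/recv_fn stand in for the actual TCP/NCCL transfers, which the patent does not detail at this level.

```python
import queue
import threading

class StageWorker:
    """Thread layout of one training container: a training thread plus
    communication threads that each drain or fill a blocking queue.
    send_fn/recv_fn are placeholders for the real TCP/NCCL transfers."""

    def __init__(self, send_fn, recv_fn):
        self.fwd_in = queue.Queue()    # activations received from the predecessor
        self.fwd_out = queue.Queue()   # activations to send to the successor
        self.send_fn, self.recv_fn = send_fn, recv_fn

    def _recv_forward(self):
        while True:
            self.fwd_in.put(self.recv_fn())      # blocks until data arrives

    def _send_forward(self):
        while True:
            self.send_fn(self.fwd_out.get())     # blocks until there is output to send

    def _train(self, stage_forward):
        while True:
            x = self.fwd_in.get()                # wait for upstream activations
            self.fwd_out.put(stage_forward(x))   # hand the result to the sender thread

    def start(self, stage_forward):
        for target in (self._recv_forward, self._send_forward):
            threading.Thread(target=target, daemon=True).start()
        # The two backward-direction threads (gradient recv/send) are analogous
        # and omitted here for brevity.
        threading.Thread(target=self._train, args=(stage_forward,), daemon=True).start()
```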
The process of completing model training with the Kubernetes-based distributed deep learning training acceleration method is as follows:
Step 1, automatically analyzing the model to be trained: the operators in the model are obtained by dynamic analysis, and the computation-time prediction model is used to predict the computation time of each operator from its type and hyper-parameter configuration.
Step 2, automatically analyzing the device topology: a communication model between the devices is obtained through parallel communication bandwidth tests.
Step 3, partitioning the model: a better partitioning scheme is searched for with the simulated annealing strategy.
Step 4, preparing for training: the model partitions and the model training code are packaged into images.
Step 5, training: containers are created on Kubernetes from the images packaged in Step 4, and model training is carried out.
Distributed computing cluster: the cluster contains a number of computer devices, each including a memory, a general-purpose central processing unit (CPU), a general-purpose graphics processing unit (GPU), and a computer program stored in the memory and executable on the processor. Within the cluster, one-to-one, one-to-many, and many-to-many communication between the computer devices may take place over the network.
The graphics processing unit (GPU), dedicated to graphics and tensor computation, is installed in the general-purpose computer. It is equipped with many stream processors that accelerate matrix computation in parallel, making it suitable for deep network model training.
Compared with the prior art, the invention has the following characteristics:
1) It provides an out-of-the-box automatic model partitioning function, simplifying the model-parallel training process of existing methods.
2) It provides a computation-time prediction model with which the computation time of each layer of the model to be trained can be obtained offline, avoiding the cost of online testing.
3) A heuristic model partitioning method is designed using a simulated annealing strategy, which can give a better solution in a shorter time and effectively improve model training efficiency.
4) Distributed training is performed based on the container orchestration capability provided by Kubernetes, which effectively provides isolation between tasks and unified management of resources.
5) The method is general: it is decoupled from the underlying system, can be adapted to different deep learning training frameworks, improves the training efficiency for complex models and massive data, reduces total training time, and increases device utilization.
Drawings
FIG. 1 is a method schematic of an embodiment of the invention;
FIG. 2 is a flow chart for performing model partitioning and training.
Detailed Description
The present invention is further illustrated by the following examples, which are purely exemplary and are not intended to limit the scope of the invention; various equivalent modifications that occur to those skilled in the art upon reading the present disclosure fall within the scope of the appended claims.
An automatic container arrangement method for distributed training of deep learning models comprises: a general computation-time prediction method, a deep learning model partitioning method based on a simulated annealing strategy, and Kubernetes-based training execution.
The specific contents are as follows: 1) For each type of operator, a computation-time prediction model is built from the collected relation between the operator's hyper-parameters and its computation time. 2) For the model to be trained, automatic model analysis is performed with the built computation-time prediction models to obtain the forward- and backward-propagation time cost of each layer of the model. 3) For the given hardware devices, communication bandwidth tests are run in parallel and the topology information of the devices is collected. 4) According to the analysis results of the computation-time prediction models, combined with the topology information of the hardware devices, the model is partitioned using a simulated annealing strategy. 5) Image files are built according to the partitioning result of the model. 6) On Kubernetes, containers are created from the image files through customized orchestration logic, a training process is started, and the containers communicate over the network.
1) For each type of operator, a computation-time prediction model is built from the collected relation between the operator's hyper-parameters and its computation time. The prediction model is a multilayer perceptron (MLP) that, trained on the collected data, takes the operator type and hyper-parameters as input and outputs the forward-propagation time and backward-propagation time. A separate computation-time prediction model is built for each type of operator. The operator types include the two-dimensional convolution operator (Conv2d), the linear transformation operator (Linear), the max-pooling operator (MaxPooling), the average-pooling operator (AveragePooling), the rectified linear unit operator (ReLU), and the batch normalization operator (BatchNorm).
For a given operator type, the method first creates multiple sets of hyper-parameter configurations of the operator according to the operator's characteristics, drawing on the common models in torchvision. For each hyper-parameter configuration, the forward-propagation and backward-propagation times of the operator under that configuration are obtained by profiling, producing a computation-time dataset for the operator. On this dataset, a computation-time prediction model for the operator is then trained as a multilayer perceptron.
The method thus builds the computation-time prediction model as a multilayer perceptron. The multilayer perceptron has strong fitting ability, and the activation function layers allow it to handle nonlinear relations. Compared with a static lookup table, it reduces the storage overhead; compared with online profiling, it requires no online execution of the model, only offline prediction, which reduces the online analysis overhead.
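For illustration, a minimal sketch of such a per-operator time predictor in PyTorch is given below; the network width and depth, the feature encoding, and the training step are assumptions not specified by the patent.

```python
import torch
import torch.nn as nn

class OpTimePredictor(nn.Module):
    """Predicts (forward_time, backward_time) of one operator type
    from a numeric encoding of its hyper-parameters."""
    def __init__(self, num_features: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),  # outputs: forward time, backward time
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Example: a predictor for Conv2d with features
# (in_channels, out_channels, kernel_size, stride, input_h, input_w, batch_size).
model = OpTimePredictor(num_features=7)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(features: torch.Tensor, measured_times: torch.Tensor) -> float:
    """One regression step on a batch of profiled (configuration, time) samples."""
    optimizer.zero_grad()
    loss = loss_fn(model(features), measured_times)
    loss.backward()
    optimizer.step()
    return loss.item()
```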
2) The model is analyzed in order to obtain the forward-propagation and backward-propagation time of each layer. The method first obtains the layers of the model and the operators contained in each layer by dynamic analysis. The framework then predicts the computation time of the operators in each layer with the computation-time prediction models from 1). The operators' computation times are aggregated by operator type to obtain the computation time of the layer, namely its forward-propagation time and backward-propagation time.
3) The topology information of the devices is collected. In distributed training, the communication cost between devices is not negligible and strongly affects data throughput. In a non-dedicated cluster, the communication links between devices are usually heterogeneous: devices on the same server may communicate point-to-point over a PCI-E bus or over dedicated hardware such as NVLink, whereas devices on different servers must communicate over the network. Different links therefore have very different bandwidths, so it is necessary to collect the topology information of the devices before training in order to characterize the communication bandwidth between them.
The method tests the bandwidth of the links in parallel, such that no device appears in more than one of the link tests running at the same time. By sampling the time required to transmit data packets of different sizes, a linear communication model is fitted; it takes the amount of data to communicate as input and outputs the expected communication time.
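A minimal sketch of grouping the link tests into parallel rounds so that no device is measured twice at the same time is shown below; the greedy matching used here is an illustrative assumption, as the patent does not specify the exact grouping rule.

```python
from typing import List, Tuple

def schedule_bandwidth_tests(links: List[Tuple[int, int]]) -> List[List[Tuple[int, int]]]:
    """Group device-to-device links into test rounds so that within one round
    no device takes part in more than one measurement (simple greedy matching)."""
    remaining = list(links)
    rounds: List[List[Tuple[int, int]]] = []
    while remaining:
        busy: set = set()
        this_round: List[Tuple[int, int]] = []
        leftover: List[Tuple[int, int]] = []
        for a, b in remaining:
            if a not in busy and b not in busy:
                this_round.append((a, b))
                busy.update((a, b))
            else:
                leftover.append((a, b))
        rounds.append(this_round)
        remaining = leftover
    return rounds

# Example: 4 GPUs, all pairwise links measured in 3 parallel rounds.
all_links = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
print(schedule_bandwidth_tests(all_links))
# [[(0, 1), (2, 3)], [(0, 2), (1, 3)], [(0, 3), (1, 2)]]
```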
4) According to the analysis results of the computation-time prediction models and the topology information of the hardware devices, the model is partitioned using a simulated annealing strategy. Step 2) provides the forward- and backward-propagation time of each layer of the model to be trained, and step 3) provides the bandwidth model between devices; these two parts serve as the input of the partitioning algorithm, which combines them to produce a partitioning scheme.
The method uses a simulated annealing strategy to generate the model partitioning scheme. Simulated annealing is a heuristic search algorithm; by setting a temperature parameter, it accepts worse solutions with a certain probability in the early stage of the search, which avoids getting stuck prematurely in a local minimum. The objective of the search is to minimize the training time of a single data batch, thereby maximizing the number of data batches processed per unit time, i.e. maximizing data throughput.
For a given model partitioning scheme, the method obtains the training time of a single data batch by computing a critical path. For the device numbered i, the time at which it can start the forward propagation of data batch j is the larger of the time at which it finishes the forward propagation of batch j-1 and the time at which device i-1 finishes the forward propagation of batch j. By solving the critical path of the dependency graph formed between the devices, the time at which a given data batch completes training is obtained.
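The recurrence can be illustrated by the following sketch, which computes the critical-path finish time of the forward pipeline over a chain of devices; the way communication time is charged to the receiving device is an assumption made only for illustration.

```python
from typing import List

def pipeline_finish_time(fwd_time: List[float],
                         comm_time: List[float],
                         num_batches: int) -> float:
    """Critical-path time for `num_batches` batches flowing through a chain of
    devices. fwd_time[i] is device i's forward time per batch; comm_time[i] is
    the transfer time from device i-1 to device i (comm_time[0] is unused)."""
    num_devices = len(fwd_time)
    finish = [0.0] * num_devices  # finish[i]: time device i finishes its latest batch
    for _ in range(num_batches):
        for i in range(num_devices):
            upstream = finish[i - 1] + comm_time[i] if i > 0 else 0.0
            start = max(finish[i], upstream)   # wait for own previous batch and for device i-1
            finish[i] = start + fwd_time[i]
    return finish[-1]

# Example: 3 pipeline stages, 8 batches.
print(pipeline_finish_time(fwd_time=[2.0, 3.0, 1.0],
                           comm_time=[0.0, 0.5, 0.5],
                           num_batches=8))
```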
The partitioning method starts from an even partition and searches with the simulated annealing strategy. During the search, a neighborhood is generated by randomly migrating some layers from one device to another. Among all candidates in the generated neighborhood, the communication overhead of each candidate partitioning scheme is computed with the communication model built in 3), and the candidates are weighted by this overhead, so that schemes with low communication overhead are more likely to be selected. The selected scheme is compared with the current one: if it is better, i.e. it has a shorter single-batch training time, it replaces the current scheme; otherwise it replaces the current scheme with a probability determined by the current temperature parameter. As the search progresses, the temperature parameter is gradually reduced.
5) According to the partitioning result of the model, different image files are built for creating containers and for the subsequent training.
6) According to the model partitioning result, the corresponding image files, and the training configuration, containers are created and training is carried out. The training code of the model runs in each container and is responsible for training one part of the model. The hardware used by a container is specified by injecting the environment variable NVIDIA_VISIBLE_DEVICES into the container. The containers communicate over the network to complete the distributed training process.
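For illustration, the sketch below creates one such training container with the official Kubernetes Python client; the pod name, image name, namespace, and GPU index are hypothetical placeholders, not values prescribed by the patent.

```python
from kubernetes import client, config

def create_training_pod(rank: int, image: str, gpu_index: str) -> None:
    """Create one pod that trains one model partition.
    The image name and GPU index passed in are illustrative placeholders."""
    config.load_kube_config()  # or config.load_incluster_config() when run inside the cluster
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name=f"train-worker-{rank}"),
        spec=client.V1PodSpec(
            restart_policy="Never",
            containers=[
                client.V1Container(
                    name="trainer",
                    image=image,
                    # Pin the container to a specific GPU on the node.
                    env=[client.V1EnvVar(name="NVIDIA_VISIBLE_DEVICES",
                                         value=gpu_index)],
                )
            ],
        ),
    )
    client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)

# Example (hypothetical image name):
# create_training_pod(rank=0, image="registry.example.com/model-part-0:latest", gpu_index="0")
```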
The method is realized by the following specific steps:
1) For each type of operator, a computation-time prediction model is built from the collected relation between the operator's hyper-parameters and its computation time.
101) For a given operator type, the hyper-parameters relevant to the computation cost are selected, and a range of common values is chosen for each hyper-parameter. The Cartesian product of the value ranges of all hyper-parameters gives the configuration dataset of the operator.
102) Each configuration in the dataset is run and measured: the forward-propagation time and backward-propagation time of the operator under the given configuration are measured using the hook mechanism of PyTorch, as sketched after this list.
103) Taking the operator's hyper-parameters as features and its forward- and backward-propagation times as targets, a multilayer perceptron is trained to fit the computation-time prediction model of the operator.
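The following sketch illustrates steps 101) and 102) for the Conv2d operator: a configuration grid is built as a Cartesian product and each configuration is timed. The concrete value ranges are illustrative assumptions, a CUDA device is assumed to be available, and for brevity the timing is done directly around the isolated operator rather than through registered hooks.

```python
import itertools
import time
import torch
import torch.nn as nn

def conv2d_config_grid():
    """Cartesian product of common hyper-parameter values for Conv2d
    (the concrete value ranges here are illustrative assumptions)."""
    in_channels  = [3, 64, 128]
    out_channels = [64, 128, 256]
    kernel_size  = [1, 3, 5]
    stride       = [1, 2]
    return itertools.product(in_channels, out_channels, kernel_size, stride)

def profile_conv2d(cin, cout, k, s, input_size=224, batch=32, device="cuda"):
    """Measure forward and backward time (seconds) of one Conv2d configuration."""
    op = nn.Conv2d(cin, cout, kernel_size=k, stride=s, padding=k // 2).to(device)
    x = torch.randn(batch, cin, input_size, input_size, device=device, requires_grad=True)

    torch.cuda.synchronize()
    t0 = time.perf_counter()
    y = op(x)
    torch.cuda.synchronize()
    t1 = time.perf_counter()
    y.sum().backward()
    torch.cuda.synchronize()
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1  # forward time, backward time

# Build the per-operator computation-time dataset used to fit the MLP predictor.
dataset = [(cfg, profile_conv2d(*cfg)) for cfg in conv2d_config_grid()]
```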
2) For the model to be trained, automatic model analysis is performed with the built computation-time prediction models to obtain the forward- and backward-propagation time cost of each layer of the model.
201) The model to be trained is instrumented with the hook mechanism provided by PyTorch and analyzed dynamically, so that all operators used in the model are identified (see the sketch after this list).
202) For each operator, its forward-propagation time and backward-propagation time are predicted with the computation-time prediction model obtained in 1).
203) The computation times of the operators are aggregated by operator type to obtain the computation time of each layer, namely its forward-propagation time and backward-propagation time.
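A minimal sketch of such hook-based dynamic analysis is shown below; it records every leaf module seen during one forward pass, and the particular hyper-parameters extracted are examples rather than the patent's exact feature set.

```python
import torch
import torch.nn as nn
import torchvision

def extract_operators(model: nn.Module, example_input: torch.Tensor):
    """Run one forward pass and record, in execution order, every leaf module
    (operator) together with a few hyper-parameters relevant to its cost."""
    records, handles = [], []

    def make_hook(name):
        def hook(mod, inputs, output):
            records.append({
                "name": name,
                "type": type(mod).__name__,
                "input_shape": tuple(inputs[0].shape),
                # getattr with defaults: not every operator has these attributes.
                "kernel_size": getattr(mod, "kernel_size", None),
                "out_features": getattr(mod, "out_features", None),
            })
        return hook

    for name, module in model.named_modules():
        if len(list(module.children())) == 0:  # leaf modules only
            handles.append(module.register_forward_hook(make_hook(name)))

    with torch.no_grad():
        model(example_input)
    for h in handles:
        h.remove()
    return records

# Example: list the operators of a torchvision ResNet-18.
ops = extract_operators(torchvision.models.resnet18(), torch.randn(1, 3, 224, 224))
print(ops[0])
```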
3) For the given hardware devices, communication bandwidth tests are run in parallel and the topology information of the devices is collected.
301) For each communication link between devices, five data packets of size 100 MB, 300 MB, 500 MB, 700 MB, and 1 GB are selected, and the time to transmit each of the five packets over the link is measured.
302) From the measurements obtained in 301), a linear model between the amount of data communicated and the communication time is solved by the least-squares method, as sketched below.
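The least-squares fit can be sketched as follows; the measured times in the example are placeholders, not measurements from the patent.

```python
import numpy as np

def fit_link_model(sizes_mb, times_s):
    """Least-squares fit of communication time as a linear function of data size:
    time = slope * size + intercept (slope ~ 1/bandwidth, intercept ~ latency)."""
    slope, intercept = np.polyfit(sizes_mb, times_s, deg=1)
    return slope, intercept

def predict_comm_time(size_mb, slope, intercept):
    """Expected communication time for a transfer of size_mb megabytes."""
    return slope * size_mb + intercept

# Example with illustrative measurements (times in seconds are placeholders).
sizes = [100, 300, 500, 700, 1024]
times = [0.011, 0.028, 0.046, 0.063, 0.091]
slope, intercept = fit_link_model(sizes, times)
print(predict_comm_time(2048, slope, intercept))
```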
4) According to the analysis results of the computation-time prediction models and the topology information of the hardware devices, the model is partitioned using a simulated annealing strategy, as follows (a code sketch is given after these steps):
401) Using even partitioning and the computation times obtained in 2), construct an initial solution S0.
402) Select an initial temperature parameter t0 and the minimum temperature tmin at which the search stops.
403) Set the current temperature t = t0 and the current solution S = S0.
404) While t > tmin, execute the following steps:
405) From the current solution S, generate a neighborhood by randomly migrating some layers from one device to another; the neighborhood contains multiple candidate solutions.
406) With the communication model from 3), compute the communication overhead of each candidate and weight the candidates accordingly, so that candidates with low communication overhead are more likely to be selected. Denote the selected candidate by S'.
407) Compute the critical paths of S and S' respectively; if S' is shorter than S, let S = S'.
408) If the critical path of S' is longer than that of S, compute from the current temperature the probability p of accepting S', and let S = S' with probability p. Then decay the temperature: t = t/2.
409) Repeat 405) to 408) until the current temperature t < tmin. Return the S at that point as the final solution.
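The following is a minimal, self-contained sketch of this simulated-annealing search; the neighborhood generator, the acceptance probability exp(-Δ/t), and the simplified per-batch cost are illustrative assumptions (a full implementation would use the fitted communication model and the critical-path computation described above, and would weight candidates by communication overhead).

```python
import math
import random
from typing import List

def batch_time(partition: List[List[float]]) -> float:
    """Simplified per-batch cost: the slowest pipeline stage dominates steady-state
    throughput. A full implementation would add the per-boundary communication
    times from the fitted link model and use the critical-path computation."""
    return max(sum(stage) for stage in partition)

def neighbors(partition: List[List[float]], k: int = 8) -> List[List[List[float]]]:
    """Generate k candidates by moving one boundary layer to an adjacent device."""
    cands = []
    for _ in range(k):
        p = [list(stage) for stage in partition]
        i = random.randrange(len(p) - 1)
        if len(p[i]) > 1 and random.random() < 0.5:
            p[i + 1].insert(0, p[i].pop())      # shift last layer of stage i to stage i+1
        elif len(p[i + 1]) > 1:
            p[i].append(p[i + 1].pop(0))        # shift first layer of stage i+1 to stage i
        cands.append(p)
    return cands

def simulated_annealing(initial, t0=10.0, t_min=1e-3):
    s, t = initial, t0
    while t > t_min:
        candidate = random.choice(neighbors(s))   # weighting by communication overhead omitted
        delta = batch_time(candidate) - batch_time(s)
        if delta < 0 or random.random() < math.exp(-delta / t):
            s = candidate                         # accept better, or worse with probability p
        t /= 2                                    # temperature decay as in step 408)
    return s

# Example: 8 layers with per-layer compute times, initially split evenly over 3 devices.
layers = [4.0, 2.0, 1.0, 3.0, 2.5, 0.5, 1.5, 2.0]
print(simulated_annealing([layers[0:3], layers[3:6], layers[6:8]]))
```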
5) According to the model partitioning result obtained in 4), the corresponding image files are created with Docker.
6) Starting the training containers:
601) According to the image files generated in 5), the corresponding containers are created in Kubernetes and numbered 0, 1, ..., N.
602) Container No. 0 creates a process group, and the other containers send requests to container No. 0 to join it, until all containers have joined the process group.
603) The training code is started in each container. Container k receives the forward-propagation input from container k-1 and sends its forward-propagation result to container k+1 as that container's input. Correspondingly, container k receives the backward-propagation input from container k+1 and sends its backward-propagation result to container k-1 as that container's input.
604) Step 603) is repeated until training is finished. A sketch of this per-container loop is given below.
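A minimal sketch of the per-container training loop with torch.distributed is given below; the fixed tensor shape, the omission of the backward-direction exchange, and the use of environment variables for rendezvous are simplifications, not the patent's exact protocol.

```python
import torch
import torch.distributed as dist

def run_stage(stage: torch.nn.Module, rank: int, world_size: int,
              steps: int, shape: tuple) -> None:
    """One pipeline stage (container No. `rank`): receive activations from the
    predecessor, run its model partition, send the result to the successor.
    Activation shapes are assumed fixed and known on every rank for this sketch."""
    # Rank 0 creates the process group; the others join it (step 602).
    dist.init_process_group(backend="nccl", init_method="env://",
                            rank=rank, world_size=world_size)
    device = torch.device("cuda")
    stage = stage.to(device)

    for _ in range(steps):
        if rank == 0:
            x = torch.randn(shape, device=device)   # placeholder for real input data
        else:
            x = torch.empty(shape, device=device)
            dist.recv(x, src=rank - 1)              # forward input from the predecessor
        y = stage(x)
        if rank < world_size - 1:
            dist.send(y, dst=rank + 1)              # forward output to the successor
        # The backward-direction exchange (sending gradients back to the
        # predecessor with dist.send/dist.recv) is omitted for brevity.
    dist.destroy_process_group()

# Each container calls run_stage() with its own partition; MASTER_ADDR, MASTER_PORT,
# RANK, and WORLD_SIZE are assumed to be provided via environment variables.
```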
A distributed computing cluster: the cluster comprises a number of computer devices, each including a memory, a general-purpose central processing unit (CPU), a general-purpose graphics processing unit (GPU), and a computer program stored in the memory and executable on the processor. Within the cluster, one-to-one, one-to-many, and many-to-many communication between the computer devices may take place over the network.
As shown in FIG. 1, the Kubernetes-based distributed deep learning training acceleration method comprises the following steps:
Step one: the PyTorch-defined model provided by the user is analyzed to obtain the forward- and backward-propagation time of each layer. The layers of the model and the operators contained in each layer are first obtained by dynamic analysis. The framework then predicts the computation time of the operators in each layer with the computation-time prediction models. The operators' computation times are aggregated by operator type to obtain the computation time of the layer, namely its forward-propagation time and backward-propagation time.
Step two: the topology information of the devices is collected. In a non-dedicated cluster, the communication links between devices are usually heterogeneous and therefore have different transmission speeds. The method tests the bandwidth of the links in parallel, such that no device appears in more than one of the link tests running at the same time. By sampling the time required to transmit data packets of different sizes, a linear communication model is fitted; it takes the amount of data to communicate as input and outputs the expected communication time. Specifically, for each communication link between devices, five data packets of size 100 MB, 300 MB, 500 MB, 700 MB, and 1 GB are selected, and the time to transmit each packet over the link is measured, yielding five <packet size, transmission time> tuples. Linear fitting with the least-squares method then gives a linear model of the communication cost.
Step three: according to the analysis results of the computation-time prediction models and the topology information of the hardware devices, the model is partitioned using a simulated annealing strategy. During the search, a neighborhood is generated by randomly migrating some layers from one device to another. Among all candidates in the generated neighborhood, the communication cost of each candidate partitioning scheme is computed with the communication model built above, and the candidates are weighted by this cost, so that schemes with low communication cost are more likely to be selected. The selected scheme is compared with the current one: if it is better, i.e. it has a shorter single-batch training time, it replaces the current scheme; otherwise it replaces the current scheme with a probability determined by the current temperature parameter. As the search progresses, the temperature parameter is gradually reduced.
Step four: each model partition, together with the model training code, is packaged into an image.
Step five: combining the partitioning result, the containers that run the training tasks are orchestrated. In the implementation, the method containerizes the tasks and orchestrates the containerized tasks with the Kubernetes cluster. The model training code runs inside the containers, and each container is responsible for training one part of the model.
The following is the training process inside a container; FIG. 2 shows the operation flow of the training framework:
Step six: the training program in the container starts first and, according to the partitioning result, loads the model part it is responsible for onto the device. It then starts five worker threads, responsible respectively for: model training, receiving results from the predecessor, sending results to the successor, receiving returned gradients from the successor, and sending gradients to the predecessor. Except for the model-training thread, each communication thread maintains a blocking queue, which simplifies the communication process. After the training programs start, the containers perform a synchronous handshake to build the communication group.
Step seven: after synchronization is complete, the first container reads the input data from the dataset and begins training. Training is iterative: the input data are processed stage by stage until they reach the last container. The last container reads the label corresponding to the original input, compares the prediction with the ground truth, and computes the loss according to the selected loss function. During back propagation, the model solves for the gradient of the parameters with respect to the loss layer by layer, from back to front. After the gradients are obtained, the parameters are updated according to the gradient and the learning rate so as to reduce the loss.
It will be apparent to those skilled in the art that the steps of the method of the above embodiments may be implemented by a general-purpose computing device; they may be centralized on a single computing device or distributed over a network of computing devices, and may be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by a computing device, in some cases in an order different from that described, or fabricated separately as individual integrated-circuit modules, or with several of the modules or steps fabricated as a single integrated-circuit module. Thus, embodiments of the invention are not limited to any specific combination of hardware and software.

Claims (10)

1. An automatic container arrangement method for distributed training of deep learning models, characterized by comprising the following steps:
1) container scheduling platform based on Kubernetes: the user's tasks are automatically containerized and then orchestrated on a distributed computer cluster using Kubernetes;
2) a computation-time profile is established for a given deep neural network model by building a computation-time prediction model, this process being performed offline;
3) automated model partitioning and training: for the model to be trained provided by the user, the model's characteristics are extracted automatically, the model is partitioned in combination with the hardware information, and automatic load balancing and communication optimization are performed; the partitioned model is scheduled using the container scheduling platform of 1).
2. The automatic container arrangement method for distributed training of deep learning models according to claim 1, characterized in that the Kubernetes-based container orchestration and scheduling platform comprises:
11) for the training task of a given deep learning model to be trained, the task is encapsulated and isolated in a containerized manner and scheduled using Kubernetes;
12) in the containerized manner, the tasks are allocated resources and isolated from one another; a training process runs in each container, and the containers communicate over the network to complete the training of the model together.
3. The automatic container arrangement method for distributed training of deep learning models according to claim 1, characterized in that the computation-time prediction model specifically comprises:
21) for a given deep learning model, the model is analyzed dynamically to obtain the operators actually used in the model and the hyper-parameter configuration of each operator;
22) the computation time of an operator is predicted with the computation-time prediction model: the operator type and its hyper-parameters are taken as input, and the forward-propagation time and backward-propagation time of the operator during deep learning training are predicted.
4. The automatic container arrangement method for distributed training of deep learning models according to claim 3, characterized in that, for a given deep learning model, the model is defined using the open-source framework PyTorch; through dynamic analysis, the model is finally parsed into a directed acyclic graph whose nodes are operators;
and the computation-time prediction model is modeled with a multilayer neural network, which predicts the computation time of an operator by learning the relation between the operator's computation time and its hyper-parameters.
5. The automatic container arrangement method for distributed training of deep learning models according to claim 1, characterized in that the automated model partitioning and training in 3) specifically comprises:
31) generating a placement solution for the deep learning model to be trained; for the given hardware devices, communication bandwidth tests are run in parallel and the topology information of the devices is collected; with the operator computation times obtained from the computation-time prediction model, the model is partitioned in combination with the device topology information, with the goals of minimizing the processing time of a single data batch and maximizing data throughput;
32) containerizing the partitioned model; container images are built according to the partitioning result; each model partition, together with the training code related to the model, is packaged with Docker into a general image file;
33) performing distributed training on the partitioned model; in the Kubernetes cluster, containers are created from the images; according to the image files built in step 32), the corresponding containers are created on Kubernetes, and a model training process runs in each container; the containers are orchestrated and scheduled by Kubernetes and communicate with one another using TCP and the NCCL communication library, completing the distributed training of the model together.
6. The automatic container arrangement method for distributed training of deep learning models according to claim 1, characterized in that,
in 31), the topology information of the devices is collected online, and a point-to-point communication bandwidth test is performed with the NCCL communication library on the devices participating in training; the bandwidth test collects the time of point-to-point data transfers between devices and obtains a linear model describing the communication speed by linear fitting;
and in 31), the model partitioning adopts an improved simulated annealing strategy to find a better partitioning scheme; during the search, the inter-device communication speed model is used to steer the search towards lower communication cost.
7. The automatic container arrangement method for distributed training of deep learning models according to claim 5, characterized in that, in 33), the distributed training is performed on the partitioned model as follows:
321) Kubernetes is used to orchestrate and schedule the training tasks: each model partition, together with the related training code, is packaged into a container through containerization and is orchestrated and scheduled by Kubernetes;
322) a queue-based asynchronous communication mode is used: each container starts four worker threads, responsible respectively for receiving forward-propagation input from the predecessor device, sending forward-propagation results to the successor device, receiving backward-propagation input from the successor device, and sending backward-propagation results to the predecessor device; the four worker threads work in parallel to improve communication efficiency;
323) distributed training uses model parallelism: the containers cooperate to complete the training of the model, communicating with one another over the network to complete the training process.
8. The automatic container arrangement method for distributed training of deep learning models according to claim 1, characterized in that model training is completed with the Kubernetes-based distributed deep learning training acceleration method as follows:
step 1, automatically analyzing the model to be trained: the operators in the model are obtained by dynamic analysis, and the computation-time prediction model is used to predict the computation time of each operator from its type and hyper-parameter configuration;
step 2, automatically analyzing the device topology: a communication model between the devices is obtained through parallel communication bandwidth tests;
step 3, partitioning the model: a better partitioning scheme is searched for with the simulated annealing strategy;
step 4, preparing for training: the model partitions and the model training code are packaged into images;
step 5, training: containers are created on Kubernetes from the images packaged in step 4, and model training is carried out.
9. The automatic container arrangement method for distributed training of deep learning models according to claim 1, characterized in that the distributed computing cluster comprises a number of computer devices, each computer device including a memory, a general-purpose central processing unit (CPU), a general-purpose graphics processing unit (GPU), and a computer program stored in the memory and executable on the processor;
and within the cluster, one-to-one, one-to-many, and many-to-many communication between the computer devices may take place over the network.
10. The automatic container arrangement method for distributed training of deep learning models according to claim 9, characterized in that a graphics processing unit (GPU), dedicated to graphics and tensor computation, is installed in the general-purpose computer; it is equipped with many stream processors that accelerate matrix computation in parallel, making it suitable for deep network model training.
CN202211426263.XA 2022-11-14 2022-11-14 Container automatic arrangement method for deep learning model distributed training Pending CN115794385A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211426263.XA CN115794385A (en) 2022-11-14 2022-11-14 Container automatic arrangement method for deep learning model distributed training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211426263.XA CN115794385A (en) 2022-11-14 2022-11-14 Container automatic arrangement method for deep learning model distributed training

Publications (1)

Publication Number Publication Date
CN115794385A true CN115794385A (en) 2023-03-14

Family

ID=85437705

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211426263.XA Pending CN115794385A (en) 2022-11-14 2022-11-14 Container automatic arrangement method for deep learning model distributed training

Country Status (1)

Country Link
CN (1) CN115794385A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116991564A (en) * 2023-09-28 2023-11-03 之江实验室 Operator internal parallel acceleration method for heterogeneous dual-core MCU
CN116991564B (en) * 2023-09-28 2024-01-09 之江实验室 Operator internal parallel acceleration method for heterogeneous dual-core MCU

Similar Documents

Publication Publication Date Title
CN107817787B (en) Intelligent production line manipulator fault diagnosis method based on machine learning
CN115543639B (en) Optimization method for performing deep learning tasks in distributed mode and distributed system
Djigal et al. Machine and deep learning for resource allocation in multi-access edge computing: A survey
CN113361680B (en) Neural network architecture searching method, device, equipment and medium
CN111274036B (en) Scheduling method of deep learning task based on speed prediction
CN104765589B (en) Grid parallel computation preprocess method based on MPI
KR20190054449A (en) Method for placing compute node for deep neural network acceleration in heterogeneous cluster
CN112764893B (en) Data processing method and data processing system
Yuan An anomaly data mining method for mass sensor networks using improved PSO algorithm based on spark parallel framework
CN115794385A (en) Container automatic arrangement method for deep learning model distributed training
WO2020227582A2 (en) Method and apparatus for scheduling matrix operations in digital processing systems
Du et al. A distributed in-situ CNN inference system for IoT applications
Ko et al. An in-depth analysis of distributed training of deep neural networks
CN117573328B (en) Parallel task rapid processing method and system based on multi-model driving
Zhang et al. Experimental evaluation of the performance of Gpipe parallelism
CN117474082A (en) Optimization method of deep learning model framework compiler and framework compiler
Tang et al. Energy-efficient and high-throughput CNN inference on embedded CPUs-GPUs MPSoCs
Nsiye et al. A Micro-Discrete Event Simulation Environment for Production Scheduling in Manufacturing Digital Twins
Zhou et al. Training and Serving System of Foundation Models: A Comprehensive Survey
Fomperosa et al. Task scheduler for heterogeneous data centres based on deep reinforcement learning
US12001893B1 (en) Distributed synchronization scheme
Xie et al. SpikeNC: An Accurate and Scalable Simulator for Spiking Neural Network on Multi-Core Neuromorphic Hardware
Alves et al. Reinforcement learning to support meta-level control in air traffic management
Kornelsen et al. Fast heterogeneous task mapping for reducing edge dnn latency
Viebke et al. Performance modelling of deep learning on intel many integrated core architectures

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination