CN117851028A - Training method of distributed model and related equipment - Google Patents

Training method of distributed model and related equipment

Info

Publication number
CN117851028A
Authority
CN
China
Prior art keywords
loop
computing
training
candidate
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311587160.6A
Other languages
Chinese (zh)
Inventor
李亚杰
樊英博
王雅惠
郭佳兴
张�杰
赵永利
王伟
张博鑫
陈玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN202311587160.6A priority Critical patent/CN117851028A/en
Publication of CN117851028A publication Critical patent/CN117851028A/en
Pending legal-status Critical Current


Abstract

The present disclosure provides a distributed model training method and related devices, the method comprising: obtaining the residual bandwidth of links between computing nodes in a computing power network; calculating the amount of computing resources required to be used by a candidate loop formed by computing nodes in the computing power network based on the data quantity of training data; determining a target loop from the candidate loops based on the residual bandwidth and the amount of computing resources that the candidate loops need to use; and performing distributed model training based on the target loop.

Description

Training method of distributed model and related equipment
Technical Field
The disclosure relates to the field of computer technology, and in particular, to a training method for a distributed model and related equipment.
Background
With the development of artificial intelligence and big data technologies, computing power networks (Computing Power Network, CPN) serve as an important infrastructure for handling large-scale data and complex machine learning tasks. With the increasing data scale and model complexity, conventional single-node training can no longer meet the requirements of high-performance computing. Distributed Model Training (DMT) is an effective solution that can fully utilize the computing power of multiple computing nodes in a computing power network, speed up model training, and improve training results. Distributed model training fully exploits the parallel processing of data and greatly shortens training time, with obvious advantages over training on a single computing node. However, the challenges presented by large models and massive data, the high computational demands of deep learning tasks, and the under-utilization of computing resources in the computing power network together result in inefficient training of distributed models in the computing power network.
Disclosure of Invention
The disclosure provides a training method and related equipment for a distributed model, so as to address, at least to some extent, the technical problem of low training efficiency of distributed models in a computing power network.
In a first aspect of the present disclosure, a training method for a distributed model is provided, including:
obtaining the residual bandwidth of links between computing nodes in a computing power network;
calculating the amount of computing resources required to be used by a candidate loop formed by computing nodes in the computing power network based on the data quantity of training data;
determining a target loop from the candidate loops based on the remaining bandwidth and an amount of computing resources that the candidate loops need to use;
and performing distributed model training based on the target loop.
In a second aspect of the present disclosure, there is provided a training apparatus for a distributed model, including:
the acquisition module is used for acquiring the residual bandwidth of the links between the computing nodes in the computing power network;
the computing module is used for computing the number of computing resources required to be used by a candidate loop formed by the computing nodes in the computing power network based on the data quantity of the training data;
a selection module, configured to determine a target loop from the candidate loops based on the remaining bandwidth and an amount of computing resources that the candidate loops need to use;
And the training module is used for carrying out distributed model training based on the target loop.
In a third aspect of the disclosure, an electronic device is provided that includes one or more processors, a memory; and one or more programs, wherein the one or more programs are stored in the memory and executed by the one or more processors, the programs comprising instructions for performing the method of the first or second aspect.
In a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium containing a computer program which, when executed by one or more processors, causes the processors to perform the method of the first or second aspect.
In a fifth aspect of the present disclosure, there is provided a computer program product comprising computer program instructions which, when executed on a computer, cause the computer to perform the method of the first aspect.
From the above, it can be seen that the distributed model training method and related devices provided by the present disclosure, through intelligent resource management and task scheduling strategies, save computing power resources in the tidal computing power network and improve the training efficiency and performance of the distributed model. By optimizing resource allocation and dynamic adjustment strategies, task deployment in different periods and areas can be flexibly managed, the idling and waste of computing power resources are reduced, the overall efficiency of distributed model training tasks is improved, and training cost and energy consumption are reduced.
Drawings
In order to more clearly illustrate the technical solutions of the present disclosure or related art, the drawings required for the embodiments or related art description will be briefly described below, and it is apparent that the drawings in the following description are only embodiments of the present disclosure, and other drawings may be obtained according to these drawings without inventive effort to those of ordinary skill in the art.
Fig. 1 is a schematic diagram of a training architecture of a distributed model in accordance with an embodiment of the present disclosure.
Fig. 2 is a schematic hardware architecture diagram of an exemplary electronic device according to an embodiment of the disclosure.
Fig. 3 is a flow chart of a training method of a distributed model according to an embodiment of the disclosure.
Fig. 4 is a schematic diagram of data parallelism in an embodiment of the disclosure.
Fig. 5 is a schematic diagram of model parallelism in an embodiment of the disclosure.
FIG. 6 is a schematic diagram of pipelined parallelism of an embodiment of the present disclosure.
Fig. 7 is a schematic diagram of a communication manner according to an embodiment of the disclosure.
Fig. 8 is an example of a distributed model training method of an embodiment of the present disclosure.
Fig. 9 is an example of a distributed model training method of an embodiment of the present disclosure.
FIG. 10 is a schematic diagram of a distributed model training apparatus according to an embodiment of the present disclosure.
Detailed Description
For the purposes of promoting an understanding of the principles and advantages of the disclosure, reference will now be made to the embodiments illustrated in the drawings and specific language will be used to describe the same.
It should be noted that unless otherwise defined, technical or scientific terms used in the embodiments of the present disclosure should be given the ordinary meaning as understood by one of ordinary skill in the art to which the present disclosure pertains. The terms "first," "second," and the like, as used in embodiments of the present disclosure, do not denote any order, quantity, or importance, but rather are used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that elements or items preceding the word are included in the element or item listed after the word and equivalents thereof, but does not exclude other elements or items. The terms "connected" or "connected," and the like, are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", etc. are used merely to indicate relative positional relationships, which may also be changed when the absolute position of the object to be described is changed.
It will be appreciated that prior to using the technical solutions disclosed in the embodiments of the present disclosure, the user should be informed and authorized of the type, usage range, usage scenario, etc. of the personal information related to the present disclosure in an appropriate manner according to the relevant legal regulations.
For example, in response to receiving an active request from a user, a prompt is sent to the user to explicitly prompt the user that the operation it is requesting to perform will require personal information to be obtained and used with the user. Thus, the user can autonomously select whether to provide personal information to software or hardware such as an electronic device, an application program, a server or a storage medium for executing the operation of the technical scheme of the present disclosure according to the prompt information.
FIG. 1 illustrates a schematic diagram of a training architecture of a distributed model of an embodiment of the present disclosure. Referring to fig. 1, the training architecture 100 of the distributed model may include a server 110, such as a server of a data center in a computing network. The server 110 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, security services, CDNs, and the like.
It should be appreciated that the number of servers in FIG. 1 is merely illustrative and is not intended to be limiting. There may be any number of servers, as desired for implementation.
Fig. 2 shows a schematic hardware structure of an exemplary electronic device 200 provided by an embodiment of the disclosure. As shown in fig. 2, the electronic device 200 may include: processor 202, memory 204, network module 206, peripheral interface 208, and bus 210. Wherein the processor 202, the memory 204, the network module 206, and the peripheral interface 208 are communicatively coupled to each other within the electronic device 200 via a bus 210.
The processor 202 may be a central processing unit (Central Processing Unit, CPU), a graphics processor (Graphic Processing Unit, GPU), a neural Network Processor (NPU), a Microcontroller (MCU), a programmable logic device, a Digital Signal Processor (DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits. The processor 202 may be used to perform functions related to the techniques described in this disclosure. In some embodiments, processor 202 may also include multiple processors integrated as a single logic component. For example, as shown in fig. 2, the processor 202 may include a plurality of processors 202a, 202b, and 202c.
The memory 204 may be configured to store data (e.g., instructions, computer code, etc.). As shown in fig. 2, the data stored by the memory 204 may include program instructions (e.g., program instructions for implementing a training method of a distributed model of an embodiment of the present disclosure) as well as data to be processed (e.g., the memory may store configuration files of other modules, etc.). The processor 202 may also access program instructions and data stored in the memory 204 and execute the program instructions to perform operations on the data to be processed. The memory 204 may include volatile storage or nonvolatile storage. In some embodiments, memory 204 may include Random Access Memory (RAM), read Only Memory (ROM), optical disks, magnetic disks, hard disks, solid State Disks (SSD), flash memory, memory sticks, and the like.
The network module 206 may be configured to provide the electronic device 200 with communications with other external devices via a network. The network may be any wired or wireless network capable of transmitting and receiving data. For example, the network may be a wired network, a local wireless network (e.g., Bluetooth, WiFi, Near Field Communication (NFC), etc.), a cellular network, the Internet, or a combination of the foregoing. It will be appreciated that the type of network is not limited to the specific examples described above. In some embodiments, the network module 206 may include any combination of any number of Network Interface Controllers (NICs), radio frequency modules, receivers, modems, routers, gateways, adapters, cellular network chips, etc.
Peripheral interface 208 may be configured to connect electronic device 200 with one or more peripheral devices to enable information input and output. For example, the peripheral devices may include input devices such as keyboards, mice, touchpads, touch screens, microphones, various types of sensors, and output devices such as displays, speakers, vibrators, and indicators.
Bus 210 may be configured to transfer information between the various components of electronic device 200 (e.g., processor 202, memory 204, network module 206, and peripheral interface 208), such as an internal bus (e.g., processor-memory bus), an external bus (USB port, PCI-E bus), etc.
It should be noted that, although the architecture of the electronic device 200 described above only shows the processor 202, the memory 204, the network module 206, the peripheral interface 208, and the bus 210, in a specific implementation, the architecture of the electronic device 200 may also include other components necessary to achieve normal execution. Furthermore, those skilled in the art will appreciate that the architecture of the electronic device 200 may also include only the components necessary to implement the embodiments of the present disclosure, and not all of the components shown in the figures.
In a computing power network, distributed model training with data parallelism segments all training data, each computing node trains on its portion of the data using the same model parameters, and all computing nodes communicate synchronously or asynchronously to obtain a global gradient for the next iteration. Data parallelism may employ a Ring All-Reduce (RAR) architecture for communication and parameter synchronization between computing nodes. While RAR is efficient in the parallel training process, it faces challenges with large models and massive data. The high computational demands of deep learning tasks, coupled with the use of large amounts of computing resources in the computing power network, lead to serious waste and under-utilization of computing power resources (e.g., CPU, GPU, TPU, NPU, etc.). Taking the GPU as an example, in large-scale distributed model training, when a large number of GPUs train simultaneously, GPU resources may be wasted and efficiency may decrease. Meanwhile, the service-dependent tidal computing power network exacerbates the huge consumption of computing resources caused by the deployment of distributed model training tasks: periodic traffic changes caused by human activity, i.e., tidal traffic, exist in tidal computing power networks. During office hours, the demand for computing power resources in business areas rises sharply and reaches a peak, while during off-hours the peak occurs in residential areas. When nodes and links are at peak times, the utilization and allocation of computing resources become unbalanced, resulting in increased waste and fragmentation of computing power resources. The existence of tidal traffic thus further exacerbates the significant consumption of computing resources when distributed model training tasks are deployed in a tidal computing power network. Therefore, how to improve the efficiency and performance of distributed model training while saving computing power resources in a tidal computing power network is a technical problem to be solved.
In view of this, embodiments of the present disclosure provide a distributed model training method and related devices. Through intelligent resource management and task scheduling strategies, the efficiency and performance of distributed model training are improved while computing power resources are saved in the tidal computing power network. By optimizing resource allocation and dynamic adjustment strategies, task deployment in different periods and areas can be flexibly managed, the idling and waste of computing power resources are reduced, the overall efficiency of distributed model training tasks is improved, and training cost and energy consumption are reduced.
Referring to fig. 3, fig. 3 shows a flow diagram of a training method of a distributed model according to an embodiment of the present disclosure. Distributed model training may refer to training of models using multiple computing nodes (e.g., GPU, CPU, TPU, NPU, etc.) in concert to speed training, improve model performance, and process large-scale data and complex models. The distributed model training can be applied to various deep learning tasks, such as image classification, voice recognition, natural language processing and the like. In fig. 3, the training method 300 of the distributed model may further specifically include the following steps.
In step S310, the remaining bandwidth of the links between the computing nodes in the computing power network is acquired.
The residual bandwidth of the links between computing nodes and the residual computing resources of the computing nodes in the computing power network can be obtained by probing and querying the links and the resource utilization of the computing nodes, for example by monitoring the amount of data transmitted on each link and the load of each computing node in real time. This information is used in the deployment of the distributed model training method to ensure that the task can perform distributed model training while meeting its resource requirements.
In step S320, the amount of computing resources required by the candidate loops formed by the computing nodes in the computing power network is calculated based on the data amount of the training data.
In some embodiments, the method 300 further comprises:
searching a computing loop formed by the source node and the computing node based on the source node in the computing network as a starting point;
and selecting a preset number of calculation loops with the least number of nodes to determine as the candidate loops.
Specifically, in selecting the candidate loops, all loops starting from the source node may be first searched from the power network, and the number of nodes of each loop may be calculated. K (K is a positive integer) loops with the least number of nodes are selected from the loops to serve as candidates, and the candidate loops are ensured to contain source nodes. Therefore, the loops with fewer nodes and containing the source nodes can be preferentially selected, so that the calculation overhead and the occupied frequency spectrum resources of task deployment are reduced, and meanwhile, the effectiveness and the reliability of the loops are ensured.
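For illustration, the following Python sketch shows one way such a candidate-loop search could be implemented: a depth-first search enumerates simple loops that start and end at the source node, and the K loops with the fewest nodes are kept. The function name `find_candidate_loops`, the adjacency-list representation, and the 6-node topology are assumptions made for this sketch rather than details given in the disclosure.

```python
def find_candidate_loops(adj, source, k):
    """Enumerate simple loops that start and end at `source` via depth-first
    search, then keep the k loops with the fewest nodes (illustrative sketch)."""
    loops = []

    def dfs(node, path, visited):
        for nxt in sorted(adj.get(node, ())):
            if nxt == source and len(path) >= 3:
                loops.append(path + [source])      # closed a loop back to the source
            elif nxt not in visited:
                dfs(nxt, path + [nxt], visited | {nxt})

    dfs(source, [source], {source})

    # Each undirected loop is found in both orientations; keep one of each.
    unique = {}
    for loop in loops:
        key = min(tuple(loop), tuple(reversed(loop)))
        unique.setdefault(key, loop)

    # Prefer loops with fewer nodes to reduce compute and spectrum overhead.
    return sorted(unique.values(), key=len)[:k]

if __name__ == "__main__":
    # Assumed 6-node topology consistent with the loops listed in the embodiment.
    edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1), (2, 5), (3, 5), (2, 6)]
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    for loop in find_candidate_loops(adj, source=2, k=5):
        print(loop)
```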
In some embodiments, calculating the amount of computing resources needed to be used by candidate loops formed by computing nodes in the computational power network based on the amount of data of training data includes:
determining an amount of node data to be processed by each computing node in the candidate loop based on the number of nodes in the candidate loop and the data quantity of the training data;
determining the number of node computing resources required to be used by each computing node based on the number of nodes;
and obtaining the number of computing resources needed to be used by the candidate loop based on the sum of the number of computing resources of each node of the computing nodes.
Here, the amount of computing resources may refer to, for example, the number of GPUs used. When calculating the task attributes of the distributed model training, the data can be segmented according to the nodes of the K selected loops, and the data size that the computing nodes of each loop need to process is calculated. Then, the node computing task of each computing node and its GPU usage are determined according to the computing resources of the computing nodes of each loop and the number of GPUs on the loop. Meanwhile, the number of peak nodes in each of the K loops is counted, which is used to account for the computing demand and the resource scheduling strategy at peak times. By comprehensively considering these attributes, DMT tasks can be effectively deployed, and the resource utilization and task performance of the computing power network are optimized.
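As a rough illustration of this step, the sketch below splits the training data evenly across the computing nodes of one candidate loop and estimates how many GPU boards each node would need to finish before the deadline. The helper name `gpus_needed_for_loop`, the cycles-per-GB cost, and the ceiling-based GPU count are illustrative assumptions; the disclosure does not specify this exact formula.

```python
import math

def gpus_needed_for_loop(loop, data_size_gb, deadline_s,
                         cycles_per_gb=1e9, gpu_rate=700e9):
    """Estimate GPU boards per node for one candidate loop (illustrative).

    loop          -- node sequence, e.g. [2, 6, 1, 2] (last node repeats the source)
    data_size_gb  -- total size of the training data in GB
    deadline_s    -- latest completion time in seconds
    cycles_per_gb -- assumed compute cost per GB of training data (placeholder)
    gpu_rate      -- assumed per-GPU compute power in cycles/s (cf. 700x10^9 in the example)
    """
    workers = loop[:-1]                        # distinct computing nodes on the loop
    per_node_gb = data_size_gb / len(workers)  # even data split across the nodes
    per_node_cycles = per_node_gb * cycles_per_gb
    # Ceiling so each node has enough boards to finish before the deadline.
    # The constants are placeholders; Table 1 of the embodiment uses its own allocation.
    gpus_per_node = math.ceil(per_node_cycles / (gpu_rate * deadline_s))
    return {node: gpus_per_node for node in workers}

if __name__ == "__main__":
    # Example: the 100 GB task from the embodiment, split over loop 2-6-1-2.
    alloc = gpus_needed_for_loop([2, 6, 1, 2], data_size_gb=100, deadline_s=5 * 3600)
    print(alloc, "total GPUs:", sum(alloc.values()))
```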
In step S330, a target loop is determined from the candidate loops based on the remaining bandwidth and the amount of computing resources that the candidate loops need to use.
In some embodiments, determining a target loop from the candidate loops based on the remaining bandwidth and an amount of computing resources that the candidate loops need to use includes:
selecting the candidate loop with the largest quantity of the residual computing resources as a first loop;
judging whether the link bandwidth of the first loop meets training requirements or not based on the residual bandwidth;
and determining that the first loop is the target loop in response to the link bandwidth of the first loop meeting training requirements.
In some embodiments, determining a target loop from the candidate loops based on the remaining bandwidth and an amount of computing resources that the candidate loops need to use further comprises:
repeating the following steps until all the candidate loops are traversed:
removing the first loop in response to the link bandwidth of the first loop not meeting the training requirement;
determining the candidate loop with the largest quantity of the residual computing resources from the residual candidate loops as the first loop;
judging whether the link bandwidth of the first loop meets training requirements or not based on the residual bandwidth;
And determining that the first loop is the target loop in response to the link bandwidth of the first loop meeting training requirements.
In some embodiments, the method 300 may further comprise:
when the number of computing resources needed to be used by a first candidate loop and a second candidate loop in the candidate loops is the same and minimum, acquiring the number of first peak nodes of the first candidate loop and the number of second peak nodes of the second candidate loop;
and in response to the first number of peak nodes being greater than the second number of peak nodes, determining that the second candidate loop is the first loop.
Specifically, the loop with the fewest GPUs and the fewest peak nodes can be found, and it can be checked whether its computing resources and bandwidth resources are sufficient to support the data transmission requirements of the task. By comprehensively considering the resource conditions of the loop nodes, whether the task requirements are met can be judged, so that reasonable resource allocation and deployment decisions are made; if not, the loop is removed and a suboptimal solution is sought. This iterates until the task is deployed successfully, or until all candidate loops have been traversed without meeting the task requirements, in which case the deployment is regarded as a failure. It can be seen that the method of the embodiments of the present disclosure improves the utilization efficiency of computing power resources and reduces the impact of tidal computing power by selecting the loop for distributed model training in consideration of the amount of computing power resources used (e.g., the number of GPU boards) and the number of peak nodes.
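The selection and fallback logic described above can be sketched as follows: candidate loops are ordered by the number of GPU boards they require, ties are broken by the number of peak nodes, and the first loop whose links satisfy the bandwidth requirement becomes the target loop; if no loop qualifies, deployment fails. The class and function names (`CandidateLoop`, `select_target_loop`, `bandwidth_ok`) are hypothetical, and the bandwidth check is left to a caller-supplied predicate.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class CandidateLoop:
    nodes: List[int]      # e.g. [2, 6, 1, 2]
    gpu_count: int        # GPU boards the loop would need
    peak_nodes: int       # how many of its nodes are in peak areas

def select_target_loop(candidates: List[CandidateLoop],
                       bandwidth_ok: Callable[[CandidateLoop], bool]
                       ) -> Optional[CandidateLoop]:
    """Pick the loop with the fewest GPUs (ties broken by fewer peak nodes)
    whose link bandwidth satisfies the training requirement; otherwise drop it
    and try the next-best loop. Returns None if deployment fails."""
    # Best loop first: fewest GPU boards, then fewest peak nodes.
    remaining = sorted(candidates, key=lambda c: (c.gpu_count, c.peak_nodes))
    for loop in remaining:
        if bandwidth_ok(loop):
            return loop          # target loop found
    return None                  # all candidates traversed -> deployment fails

if __name__ == "__main__":
    # Candidate figures taken from Table 1 of the embodiment described later.
    table1 = [
        CandidateLoop([2, 6, 1, 2], 5, 3),
        CandidateLoop([2, 3, 5, 2], 5, 1),
        CandidateLoop([2, 5, 6, 2], 5, 2),
        CandidateLoop([2, 3, 4, 5, 2], 4, 1),
        CandidateLoop([2, 3, 5, 6, 2], 7, 2),
    ]
    chosen = select_target_loop(table1, bandwidth_ok=lambda c: True)
    print("target loop:", chosen.nodes if chosen else "deployment failed")
```

With a permissive bandwidth check, the sketch returns loop 2-3-4-5-2, the candidate requiring the fewest GPU boards, which is consistent with the selection described in the embodiment.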
In step S340, distributed model training is performed based on the target loop.
In some embodiments, distributed model training based on the target loop includes:
dividing the training data into a plurality of subsets based on a number of nodes of the target loop;
for each round of training, each computing node of the target loop performs model training based on the corresponding subset to obtain local model parameters;
and each computing node synchronizes based on the local model parameters, and obtains the model parameters of all computing nodes on the target ring so as to perform the next training until the training requirement is met.
The distributed model training may be based on a computer cluster or cloud platform, and comprises a plurality of computing nodes and one or more parameter servers. During training, the training data is partitioned into multiple sub-data sets, with each compute node being responsible for processing one sub-data set and performing model forward propagation, backward propagation, and parameter updating locally. And each computing node interacts through a communication protocol to realize sharing and synchronization of model parameters, so that consistency of a global model is ensured.
The distributed model training can adopt strategies such as data parallelism, model parallelism or pipeline parallelism. In the data parallel strategy, each computing node processes different data subsets, after the gradient information obtained by calculation is aggregated on the parameter server, the updated parameters are broadcasted to each computing node, as shown in fig. 4, and fig. 4 shows a schematic diagram of data parallel according to an embodiment of the disclosure. In the model parallel strategy, different computing nodes are responsible for processing different parts of the model, gradient information is independently computed and propagated on each node, and finally gradients of the parts are integrated by a parameter server, as shown in fig. 5, and fig. 5 shows a schematic diagram of model parallel according to an embodiment of the disclosure. Pipelining is a method for accelerating model training that combines the concepts of model parallelism and data parallelism. As shown in fig. 6, fig. 6 shows a schematic diagram of pipelined parallelism in accordance with an embodiment of the present disclosure. In this approach, the model may be partitioned into multiple blocks, and each block assigned to a different computing device. These devices process the data sequentially in a certain order. In the forward pass, each device is responsible for processing the computation of one block and passing the computation results to the next block without waiting for the entire model computation to complete. This means that multiple blocks can be forward transferred simultaneously at the same time, thereby speeding up the overall computation. In the backward pass, each device passes the gradient of the input data back to the previous block in order to update the model parameters. This process is also pipelined and does not need to wait for the gradient computation of the entire model to complete. In a word, the pipeline parallelism enables different computing devices to compute simultaneously by dividing the model into blocks, so that the model training efficiency is improved, the computing time is reduced, and the distributed computing resources can be better utilized.
In distributed model training, ring All-Reduce communication is used to synchronize model parameters between compute nodes. In the communication mode, the computing nodes form a ring-shaped communication topology, and each node sequentially transmits own gradient information to the next node until all nodes receive the gradient information of other nodes. Each node then accumulates the received gradients and updates its own model parameters as shown in fig. 7, fig. 7 showing a schematic diagram of a communication scheme according to an embodiment of the disclosure. In this way, all nodes can maintain the consistency of parameters, thereby realizing the effect of distributed training.
In distributed model training, it is often necessary to set a predetermined number of iterations, indicating how many epochs the training process will take. The setting of the number of iterations depends on the size of the training data, the complexity of the model, and the available computational resources. Assume a total of $N_{iter}$ iterations. In each iteration, the parameters $W$ and bias $b$ of the model are updated, denoted $W^{(t)}$ and $b^{(t)}$, where $t$ represents the current iteration round. The parameter update of each iteration is realized by a gradient descent algorithm:

$$W^{(t+1)} = W^{(t)} - \text{learning\_rate}\cdot\frac{\partial \text{Loss}}{\partial W^{(t)}},\qquad b^{(t+1)} = b^{(t)} - \text{learning\_rate}\cdot\frac{\partial \text{Loss}}{\partial b^{(t)}}$$

where learning_rate is the learning rate and Loss is the loss function, representing the difference between the model prediction and the real label.
In addition to setting a fixed number of iterations, it is also possible to determine when to terminate training by means of a convergence criterion. Convergence criteria refers to a condition in the training process that when met, considers the model to have achieved better performance and does not continue training. The convergence criterion may include:
a. verification set performance: during the training process, the data set is typically divided into a training set and a validation set. After each epoch is completed, it is determined whether the model has converged by evaluating the performance of the model, such as accuracy, loss function, etc., on a validation set. When the performance of the model on the validation set no longer improves or begins to fluctuate, the model may be considered to have converged.
b. Loss function: the loss function is an index for measuring the difference between the model predictive result and the real label. During training, the loss function may gradually decrease. When the loss function falls to a small threshold or tends to settle, the model may be considered to have converged.
c. Gradient change: in the model training process, the gradient value represents the variation trend of the model parameters. When the gradient values approach zero or change little, the model can be considered to have converged.
Specifically, whether the model has converged can be determined by the change in the loss function on the validation set. Assume that after the $t$-th iteration, the loss function of the model on the validation set is $\text{Loss}_{val}^{(t)}$. If a convergence threshold $\epsilon$ is set, the model can be considered to have converged when the loss function on the validation set satisfies:

$$\left|\text{Loss}_{val}^{(t)} - \text{Loss}_{val}^{(t-1)}\right| < \epsilon$$

where $|\cdot|$ denotes the absolute value and $\epsilon$ is the convergence threshold. In addition, whether the model has converged can also be judged by the accuracy on the validation set. Assume that after the $t$-th iteration, the accuracy of the model on the validation set is $\text{Acc}_{val}^{(t)}$. If a convergence threshold $\delta$ is set, the model can be considered to have converged when the accuracy on the validation set satisfies:

$$\left|\text{Acc}_{val}^{(t)} - \text{Acc}_{val}^{(t-1)}\right| < \delta$$

where $\delta$ is the convergence threshold. Taken together, the number of iterations and the convergence criteria of the distributed model training are set by adjusting parameters such as the learning rate, the iteration number $N_{iter}$, and the convergence thresholds $\epsilon$ and $\delta$. During training, whether the model has converged is judged by continuously adjusting these parameters and observing the change of the loss function or the accuracy on the validation set, so as to determine the termination condition of training.
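A minimal sketch of such a convergence test is shown below, assuming that per-epoch validation losses and accuracies are recorded; the helper name `has_converged` and the threshold values are illustrative.

```python
def has_converged(val_losses, val_accs, eps=1e-4, delta=1e-3):
    """Return True if either convergence criterion from the text is met:
    the validation loss changed by less than eps between the last two
    epochs, or the validation accuracy changed by less than delta."""
    if len(val_losses) >= 2 and abs(val_losses[-1] - val_losses[-2]) < eps:
        return True
    if len(val_accs) >= 2 and abs(val_accs[-1] - val_accs[-2]) < delta:
        return True
    return False

# Example: stop training once the validation loss plateaus.
losses, accs = [0.92, 0.55, 0.41, 0.40995], [0.61, 0.72, 0.78, 0.79]
print(has_converged(losses, accs))   # True: loss change of 5e-5 is below eps
```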
According to an embodiment of the present disclosure, deployment of a distributed model training method in a computing network may include:
Node selection: and selecting proper nodes to execute the distributed model training task according to the topological structure of the CPN and the computing power of the nodes. The computational performance, storage capacity, and network bandwidth of the node are typically considered.
Dividing data: the training data is divided in a certain way into subsets, each of which is assigned to a node. The goal of the data partitioning is to ensure that each node has enough data samples for training while avoiding redundant transmission of data.
Parameter initialization: each node needs to initialize the parameters of the model before training begins. Typically, the parameters may be initialized randomly or pre-trained model parameters may be used.
Model synchronization: in the training process, each node exchanges model parameters and gradient information through a communication protocol (for example, ring All-Reduce) so as to realize synchronous updating of the model. This ensures that each node has the most up-to-date model parameters globally.
Training and iterating: each node performs local training according to the assigned data subsets and updates model parameters after each training iteration is completed. After one round of training is completed, parameter synchronization is carried out among the nodes, and then the next round of training is continued.
Convergence criteria: whether the training converges is typically determined by monitoring the performance of the model on the validation set. Training may be stopped if the model's loss function or accuracy over the validation set has converged or no longer significantly changed.
Specifically, referring to fig. 8, fig. 8 illustrates an example of a distributed model training method according to an embodiment of the present disclosure. In fig. 8, the remaining GPU resources of the computing nodes and the link usage are obtained, and then K candidate loops containing the source node are found in the computing power network. Judge whether K is greater than 0; if K is not greater than 0, deployment of the distributed model training method fails. If K is greater than 0, the training data is segmented according to the number of computing nodes in the current candidate loop to obtain the data amount D of the training data to be processed on each computing node in the current candidate loop. The computing nodes in the current candidate loop can then compute the required computing resources based on the data amount D of the training data to be processed. The number of GPUs and peak nodes used in the K candidate loops can also be calculated.
Loop determination: judging whether K is larger than 0, if K is not larger than 0, failing to deploy the distributed model training method; if K is larger than 0, selecting the loop with the smallest GPU number from the K loops as the target loop. Wherein, if the number of GPUs is consistent, a loop with a smaller number of peak nodes can be selected as the target loop. Judging whether the link bandwidth in the target loop meets the task requirement, and if so, performing distributed model training based on the target loop. If the task is not satisfied, the target loop is removed, K=K-1 is made, the loop determination step is repeated until the task deployment is successful, or the task requirement cannot be satisfied after all the candidate loops are traversed, and the task deployment is regarded as failure.
Compared with existing schemes, the deployment of distributed model training in a tidal computing power network often does not fully consider the fluctuation of computing power caused by tidal traffic, and the deployment strategy lacks a dynamic adjustment mechanism, which easily causes waste and uneven distribution of computing power resources and affects training efficiency and cost. The lack of an optimization method for computing power resources in tidal scenarios means that changing training requirements cannot be handled effectively, so training tasks cannot be deployed efficiently during peak periods, increasing training time and cost. The distributed model training method of the embodiments of the present disclosure monitors the tidal workload and resource availability in the CPN in real time, adopts a heuristic algorithm to select a suitable loop for deploying each task, dynamically divides the data, and allocates computing resources, thereby achieving efficient utilization of computing resources and reducing model training cost. The method can effectively improve the efficiency and reliability of distributed model training in the complex environment of a tidal computing power network, and provides an optimized solution for the deployment of large-scale batch tasks.
In RAR-based DMT, the CPN can be represented as a directed graph G(V, E, N) based on the computing resources (GPUs, for example) and the communication between nodes, where V and E are the sets of nodes and fiber links, respectively, and N represents the number of GPUs per node. In the network, there are three types of nodes: peak-zone nodes (i.e., high-traffic areas), valley-zone nodes (i.e., low-traffic areas), and composite-zone nodes. The different node types are caused by tidal traffic resulting from the flow of people: for example, during working hours, business areas are peak areas, residential areas are valley areas, and composite areas are traffic, entertainment, and other areas. Each DMT task is denoted $M_i(S_i, D_i, \theta_i, T_i)$, where $S_i$ is the task-generating node, $D_i$ represents the size of the original training data, and $\theta_i$ and $T_i$ represent the required model accuracy and deadline, respectively.
In the CPN, DMT tasks can be deployed in the RAR architecture. All nodes in the CPN are equipped with GPU boards, and each node can serve as a worker node for distributed model training. Each node has the same computing resources, with a computing power of $300\times10^9$ cycles/s, and each node may be dynamically allocated computing resources based on the amount of data it handles. However, in a practical scenario, the cost of newly turning on a GPU is far greater than that of continuing to use a GPU that is already turned on. The CPN adopts an optical-electrical hybrid architecture as the bearer network, and the links between nodes are optical paths. In order to meet the requirements of a task, it is necessary to ensure that the spectrum resources occupied by the task on the links conform to the spectrum consistency and continuity principles. Considering the influence of tidal traffic on the CPN, the tidal computing power network can be modeled as having different task generation probabilities at different moments and in different areas, so that a tidal computing power network model can be effectively constructed. This helps to understand in depth the impact of tidal phenomena on the deployment of distributed model training tasks and provides the basis for an efficient deployment approach that saves GPUs.
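To make the notation concrete, the sketch below represents the CPN as a directed graph G(V, E, N) with per-node GPU counts and zone types, and a DMT task as the tuple M_i(S_i, D_i, θ_i, T_i). The class names, fields, and the 6-node example topology are assumptions made for illustration and are not prescribed by the disclosure.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple

@dataclass
class CPN:
    """Directed graph G(V, E, N): nodes, fiber links, and GPUs per node."""
    gpus_per_node: Dict[int, int]                               # N: node -> GPU boards
    links: Set[Tuple[int, int]] = field(default_factory=set)    # E: fiber links
    zone: Dict[int, str] = field(default_factory=dict)          # 'peak' | 'valley' | 'composite'

    def add_link(self, u: int, v: int) -> None:
        self.links.add((u, v))
        self.links.add((v, u))   # optical link usable in both directions

@dataclass
class DMTTask:
    """DMT task M_i(S_i, D_i, theta_i, T_i)."""
    source: int          # S_i: task-generating node
    data_size_gb: float  # D_i: size of the original training data
    accuracy: float      # theta_i: required model accuracy
    deadline_h: float    # T_i: deadline in hours

if __name__ == "__main__":
    # Example instance mirroring the 6-node scenario described below (zones for 10:00-18:00).
    cpn = CPN(gpus_per_node={n: 2 for n in range(1, 7)},
              zone={1: "peak", 2: "peak", 6: "peak", 3: "valley", 4: "valley", 5: "valley"})
    for u, v in [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1), (2, 5), (3, 5), (2, 6)]:
        cpn.add_link(u, v)
    task = DMTTask(source=2, data_size_gb=100, accuracy=0.3, deadline_h=5)
    print(len(cpn.links) // 2, "links;", task)
```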
Referring to fig. 9, fig. 9 illustrates an example of distributed model training according to an embodiment of the present disclosure. In fig. 9, a distributed model training method may be deployed using Ring All-Reduce over a computing power network comprising 6 computing nodes, where nodes 1, 2, and 6 and their links are regarded as commercial areas, and nodes 3, 4, and 5 are regarded as residential areas. From 10:00 to 18:00, the commercial area is a peak area and the residential area is a valley area. The solid lines represent links. It is assumed that each link has 5 spectrum slots, each node has 2 GPU boards, and the computing power of each GPU is $700\times10^9$ cycles/s. Each node allocates corresponding computing resources according to the data size to be processed. Taking the GPU as an example, the cost for a node to newly turn on a GPU board is much greater than the cost of continuing to use a GPU board that is already in use. The tidal computing power network can be modeled with different task generation rates for different areas over the course of a day, and the change in task generation probability in different areas at different moments can be set according to actual conditions. For example, during office hours, the task generation rate in business areas increases to a peak, while during off-hours, the task generation rate in residential areas increases to form another peak. In this way, the influence of the tidal phenomenon on the computing power network can be reflected more realistically, so that the deployment and resource optimization of distributed model training tasks can be performed more accurately, GPU resources can be saved more effectively in practical applications, training efficiency is improved, and deployment cost is reduced.
Assume that node 2 generates a distributed model training task with a training data size of 100 GB, a training deadline of 5 hours, and a required model accuracy of 0.3. When the task is generated, the algorithm finds the first K loops containing the task-generating node. Assuming K = 5, five loops are found: 2-6-1-2; 2-3-5-2; 2-5-6-2; 2-3-4-5-2; and 2-3-5-6-2. The algorithm then traverses the five loops starting from the first loop and segments the data according to the number of nodes in each loop. For example, loop 2-6-1-2 divides the data into 3 parts, each of size 100/3 ≈ 33.3 GB, while loop 2-3-4-5-2 divides the data evenly into 4 parts, each of size 100/4 = 25 GB. Each node allocates corresponding computing resources according to the size of the divided data and the latest completion time, and the number of GPU boards and peak nodes used by each loop is then calculated, as shown in Table 1:
Loop             | 2-6-1-2 | 2-3-5-2 | 2-5-6-2 | 2-3-4-5-2 | 2-3-5-6-2
GPU boards       | 5       | 5       | 5       | 4         | 7
Peak node count  | 3       | 1       | 2       | 1         | 2
TABLE 1
Therefore, two important strategies can be employed to optimize the deployment of the distributed model training task. First, a heuristic algorithm is adopted to dynamically select, in the tidal computing power network, loops that require the fewest computing resources and contain fewer peak nodes, thereby effectively saving precious computing resources. Preferentially selecting the loop whose computing nodes use the fewest GPU boards minimizes the consumption of GPU resources and improves resource utilization efficiency. Second, when the GPU counts are equal, a loop with fewer peak nodes can be further selected for deployment, so as to reduce the influence of tidal computing power on the model training task; modeling is performed according to the probability change of task generation in the tidal computing power network, so that the method is better adapted to tidal scenarios and its adaptability and stability are improved. After determining that the link resources of the loop meet the task requirements, the loop is selected and RAR is used to perform multiple iterations until the required task accuracy is reached. This helps to complete model training tasks efficiently in the tidal computing power network and ensures that tasks proceed smoothly in a resource-sufficient environment, thereby effectively reducing deployment cost and improving training efficiency.
Therefore, according to the training method of the distributed model of the embodiments of the present disclosure, through intelligent resource management and task scheduling strategies, training efficiency and performance are improved while computing resources are saved in the tidal computing power network. By optimizing resource allocation and dynamic adjustment strategies, task deployment in different periods and areas can be flexibly managed, the idling and waste of computing power resources are reduced, the overall efficiency of training tasks is improved, and training cost and energy consumption are reduced, which helps promote the efficient training and application of large-scale data and complex models.
It should be noted that the method of the embodiments of the present disclosure may be performed by a single device, such as a computer or a server. The method of the embodiment can also be applied to a distributed scene, and is completed by mutually matching a plurality of devices. In the case of such a distributed scenario, one of the devices may perform only one or more steps of the methods of embodiments of the present disclosure, the devices interacting with each other to complete the method.
It should be noted that the foregoing describes some embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments described above and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
Based on the same technical concept, corresponding to the method of any embodiment, the disclosure further provides a training device of a distributed model, referring to fig. 10, where the training device of the distributed model includes:
the acquisition module is used for acquiring the residual bandwidth of the links between the computing nodes in the computing power network;
the computing module is used for computing the number of computing resources required to be used by a candidate loop formed by the computing nodes in the computing power network based on the data quantity of the training data;
a selection module, configured to determine a target loop from the candidate loops based on the remaining bandwidth and an amount of computing resources that the candidate loops need to use;
and the training module is used for carrying out distributed model training based on the target loop.
For convenience of description, the above devices are described as being functionally divided into various modules, respectively. Of course, the functions of the various modules may be implemented in the same one or more pieces of software and/or hardware when implementing the present disclosure.
The device of the foregoing embodiment is configured to implement the corresponding distributed model training method in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which is not described herein.
Based on the same technical concept, corresponding to the method of any embodiment, the disclosure further provides a non-transitory computer readable storage medium, where the non-transitory computer readable storage medium stores computer instructions for causing a computer to perform the method of the distributed model training method of any embodiment.
The computer readable media of the present embodiments, including both permanent and non-permanent, removable and non-removable media, may be used to implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static Random Access Memory (SRAM), dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), read Only Memory (ROM), electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device.
The storage medium of the above embodiment stores computer instructions for causing a computer to perform the distributed model training method of any of the above embodiments, and has the advantages of the corresponding method embodiments, which are not described herein.
Those of ordinary skill in the art will appreciate that: the discussion of any of the embodiments above is merely exemplary and is not intended to suggest that the scope of the disclosure, including the claims, is limited to these examples; the technical features of the above embodiments or in the different embodiments may also be combined under the idea of the present disclosure, the steps may be implemented in any order, and there are many other variations of the different aspects of the embodiments of the present disclosure as above, which are not provided in details for the sake of brevity.
Additionally, well-known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown within the provided figures, in order to simplify the illustration and discussion, and so as not to obscure the embodiments of the present disclosure. Furthermore, the devices may be shown in block diagram form in order to avoid obscuring the embodiments of the present disclosure, and this also accounts for the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform on which the embodiments of the present disclosure are to be implemented (i.e., such specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the disclosure, it should be apparent to one skilled in the art that embodiments of the disclosure can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative in nature and not as restrictive.
While the present disclosure has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of those embodiments will be apparent to those skilled in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may use the embodiments discussed.
The disclosed embodiments are intended to embrace all such alternatives, modifications and variances which fall within the broad scope of the appended claims. Accordingly, any omissions, modifications, equivalents, improvements, and the like, which are within the spirit and principles of the embodiments of the disclosure, are intended to be included within the scope of the disclosure.

Claims (10)

1. A distributed model training method, comprising:
obtaining the residual bandwidth of links between computing nodes in a computing power network;
calculating the amount of computing resources required to be used by a candidate loop formed by computing nodes in the computing power network based on the data quantity of training data;
determining a target loop from the candidate loops based on the remaining bandwidth and an amount of computing resources that the candidate loops need to use;
and performing distributed model training based on the target loop.
2. The method of claim 1, the method further comprising:
searching a computing loop formed by the source node and the computing node based on the source node in the computing network as a starting point;
and selecting a preset number of calculation loops with the least number of nodes to determine as the candidate loops.
3. The method of claim 1, wherein calculating the amount of computing resources needed for use by the candidate loops formed by the computing nodes in the computing power network based on the amount of data of the training data comprises:
Determining an amount of node data to be processed by each computing node in the candidate loop based on the number of nodes in the candidate loop and the data quantity of the training data;
determining the number of node computing resources required to be used by each computing node based on the number of nodes;
and obtaining the number of computing resources needed to be used by the candidate loop based on the sum of the number of computing resources of each node of the computing nodes.
4. The method of claim 1, wherein determining a target loop from the candidate loops based on the remaining bandwidth and an amount of computing resources that the candidate loops need to use comprises:
selecting the candidate loop with the largest quantity of the residual computing resources as a first loop;
judging whether the link bandwidth of the first loop meets training requirements or not based on the residual bandwidth;
and determining that the first loop is the target loop in response to the link bandwidth of the first loop meeting training requirements.
5. The method of claim 4, wherein determining a target loop from the candidate loops based on the remaining bandwidth and an amount of computing resources that the candidate loops need to use, further comprises:
repeating the following steps until all the candidate loops are traversed:
Removing the first loop in response to the link bandwidth of the first loop not meeting the training requirement;
determining the candidate loop with the largest quantity of the residual computing resources from the residual candidate loops as the first loop;
judging whether the link bandwidth of the first loop meets training requirements or not based on the residual bandwidth;
and determining that the first loop is the target loop in response to the link bandwidth of the first loop meeting training requirements.
6. The method of claim 4, further comprising:
when the number of computing resources needed to be used by a first candidate loop and a second candidate loop in the candidate loops is the same and minimum, acquiring the number of first peak nodes of the first candidate loop and the number of second peak nodes of the second candidate loop;
and in response to the first number of peak nodes being greater than the second number of peak nodes, determining that the second candidate loop is the first loop.
7. The method of claim 1, wherein the distributed model training based on the target loop comprises:
dividing the training data into a plurality of subsets based on a number of nodes of the target loop;
for each round of training, each computing node of the target loop performs model training based on the corresponding subset to obtain local model parameters;
And each computing node synchronizes based on the local model parameters, and obtains the model parameters of all computing nodes on the target ring so as to perform the next training until the training requirement is met.
8. A distributed model training apparatus, comprising:
the acquisition module is used for acquiring the residual bandwidth of the links between the computing nodes in the computing power network;
the computing module is used for computing the number of computing resources required to be used by a candidate loop formed by the computing nodes in the computing power network based on the data quantity of the training data;
a selection module, configured to determine a target loop from the candidate loops based on the remaining bandwidth and an amount of computing resources that the candidate loops need to use;
and the training module is used for carrying out distributed model training based on the target loop.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any one of claims 1 to 7 when the program is executed.
10. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 7.
CN202311587160.6A 2023-11-24 2023-11-24 Training method of distributed model and related equipment Pending CN117851028A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311587160.6A CN117851028A (en) 2023-11-24 2023-11-24 Training method of distributed model and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311587160.6A CN117851028A (en) 2023-11-24 2023-11-24 Training method of distributed model and related equipment

Publications (1)

Publication Number Publication Date
CN117851028A true CN117851028A (en) 2024-04-09

Family

ID=90542404

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311587160.6A Pending CN117851028A (en) 2023-11-24 2023-11-24 Training method of distributed model and related equipment

Country Status (1)

Country Link
CN (1) CN117851028A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination