CN115865607A - Distributed training computing node management method and related device - Google Patents

Distributed training computing node management method and related device

Info

Publication number
CN115865607A
CN115865607A
Authority
CN
China
Prior art keywords
computing node
computing
node
similarity
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310180801.XA
Other languages
Chinese (zh)
Inventor
李仁刚
闫瑞栋
郭振华
赵雅倩
刘璐
金良
徐聪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Mass Institute Of Information Technology
Original Assignee
Shandong Mass Institute Of Information Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Mass Institute Of Information Technology filed Critical Shandong Mass Institute Of Information Technology
Priority to CN202310180801.XA
Publication of CN115865607A
Legal status: Pending


Abstract

The application discloses a distributed training computing node management method and a related device, and relates to the technical field of computers, wherein the computing node management method comprises the following steps: acquiring node information of each computing node; grouping all the computing nodes based on the node information of each computing node to obtain a plurality of computing node groups of different types; setting a local decentralized communication architecture for the computing nodes in each computing node group, and setting a global centralized communication architecture between each computing node group; and performing distributed model training in the plurality of computing node groups based on the input model and data to obtain a training result so as to improve the efficiency of the distributed model training.

Description

Distributed training computing node management method and related device
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method, an apparatus, a server, and a computer-readable storage medium for managing compute nodes in distributed training.
Background
With the rapid development of big data, artificial intelligence, high-performance computation and internet technology, massive data and large-scale models generated in various fields are often modeled and solved through a neural network. The storage, calculation and solving processes of the neural network all depend on a distributed training system. A so-called distributed training system is a network formed by a plurality of computing nodes together, and each computing node may be formed by one host or a plurality of hosts.
In the related technology, a deep neural network model or a large data set to be trained is split in a model parallel, data parallel or mixed parallel mode and is distributed to corresponding computing nodes; then, each computing node separately trains the split small-scale data or the sub-model and generates a local or intermediate training result; and finally, the distributed training system aggregates all local training results in a certain mode to obtain a global result and outputs the global training result. However, in practical applications, there are differences between different computing nodes, which results in a decrease in efficiency of the training process of the distributed model, and thus resources of the computing nodes cannot be effectively utilized.
Therefore, how to improve the efficiency of the distributed model training is a key issue of attention for those skilled in the art.
Disclosure of Invention
The application aims to provide a computing node management method, a computing node management device, a server and a computer readable storage medium for distributed training, so as to improve the efficiency of distributed model training.
In order to solve the above technical problem, the present application provides a distributed training method for managing compute nodes, including:
acquiring node information of each computing node;
grouping all the computing nodes based on the node information of each computing node to obtain a plurality of computing node groups of different types;
setting a local decentralized communication architecture for the computing nodes in each computing node group, and setting a global centralized communication architecture between each computing node group;
and carrying out distributed model training in the plurality of computing node groups based on the input model and data to obtain a training result.
Optionally, a synchronous update policy is adopted among the compute nodes in each compute node group, and an asynchronous update policy is adopted among the compute node groups.
Optionally, the obtaining node information of each computing node includes:
when a newly accessed computing node exists, acquiring node information of the newly accessed computing node; wherein the node information includes: hardware information, current load running state information, network connection and bandwidth conditions among the computing nodes;
and recording the node information in a database.
Optionally, grouping all the computing nodes based on the node information of each computing node to obtain multiple computing node groups of different types, where the grouping includes:
similarity calculation is carried out on each computing node based on the node information of each computing node, and the similarity between each computing node is obtained;
and clustering all the computing nodes based on the similarity between each computing node to obtain a plurality of computing node groups.
Optionally, performing similarity calculation on each computing node based on the node information of each computing node to obtain a similarity between each computing node, including:
calculating the firmware similarity between each computing node based on the firmware information of each computing node;
calculating the network structure similarity between each computing node based on the network information of each computing node;
calculating the load similarity between each computing node based on the load information of each computing node;
and determining the similarity between each computing node based on the firmware similarity, the network structure similarity and the load similarity between each computing node.
Optionally, calculating the firmware similarity between each computing node based on the firmware information of each computing node includes:
calculating a hardware index for each of the compute nodes based on the firmware information for each of the compute nodes;
and calculating Euclidean distance between hardware indexes among each computing node, and taking the Euclidean distance as the similarity of the firmware among each computing node.
Optionally, calculating the network structure similarity between each computing node based on the network information of each computing node includes:
calculating a network address distance and a network neighbor index between each computing node based on the network information of each computing node;
and taking the network address distance and the network neighbor index between each computing node as the network structure similarity between each computing node.
Optionally, calculating the load similarity between each computing node based on the load information of each computing node includes:
calculating an equipment load condition index and a network bandwidth condition index of each computing node based on the load information of each computing node;
and taking the equipment load condition index and the network bandwidth condition index as the load similarity of the computing node.
Optionally, determining the similarity between each computing node based on the firmware similarity, the network structure similarity, and the load similarity between each computing node includes:
and performing weighted calculation on the firmware similarity, the network structure similarity and the load similarity among the calculation nodes to obtain the similarity among the calculation nodes.
Optionally, performing distributed model training in the plurality of computing node groups based on the input model and data to obtain a training result, including:
processing the input model and data based on a distributed training format to obtain distributed training data and a distributed training model;
and performing distributed model training based on a synchronous updating strategy among each computing node, an asynchronous updating strategy among each computing node group, distributed training data and a model to obtain the training result.
Optionally, processing the input model and data based on the format of distributed training to obtain the data and model of distributed training, including:
and denoising and standardizing the input model and data based on a distributed training format to obtain the distributed training data and model.
Optionally, the process of asynchronous update policy between each computing node group includes:
and executing an asynchronous updating strategy among each computing node group based on a preset buffer zone.
The application also provides a distributed training computing node management method, which comprises the following steps:
the method comprises the steps that a client sends a model to be trained and data to a server, so that the server can obtain node information of each computing node; grouping all the computing nodes based on the node information of each computing node to obtain a plurality of computing node groups of different types; setting a local decentralized communication architecture for the computing nodes in each computing node group, and setting a global centralized communication architecture between each computing node group; performing distributed model training in the plurality of computing node groups based on the input model and data to obtain and return a training result;
and the client acquires the training result and displays the training result.
The application also provides a distributed training computing node management method, which comprises the following steps:
the server acquires node information of each computing node; grouping all the computing nodes based on the node information of each computing node to obtain a plurality of computing node groups of different types; setting a local decentralized communication architecture for the computing nodes in each computing node group, and setting a global centralized communication architecture between each computing node group; performing distributed model training in the plurality of computing node groups based on the model and data input by the client to obtain a training result;
and the client displays the training result.
The present application further provides a distributed training computing node management apparatus, including:
the node information acquisition module is used for acquiring the node information of each computing node;
the node grouping module is used for grouping all the computing nodes based on the node information of each computing node to obtain a plurality of computing node groups of different types;
the communication architecture setting module is used for setting a local decentralized communication architecture for the computing nodes in each computing node group and setting a global centralized communication architecture between each computing node group;
and the model training module is used for carrying out distributed model training in the plurality of computing node groups based on the input model and data to obtain a training result.
The present application further provides a server, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the computing node management method as described above when executing the computer program.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method of computing node management as described above.
The application provides a distributed training computing node management method, which comprises the following steps: acquiring node information of each computing node; grouping all the computing nodes based on the node information of each computing node to obtain a plurality of computing node groups of different types; setting a local decentralized communication architecture for the computing nodes in each computing node group, and setting a global centralized communication architecture between each computing node group; and carrying out distributed model training in the plurality of computing node groups based on the input model and data to obtain a training result.
The method comprises the steps of grouping all computing nodes to obtain a plurality of computing node groups, and then executing differentiated communication architectures and data updating strategies between the computing nodes in each computing node group and each computing node group, so that the reduction of training efficiency caused by the difference between different computing nodes is avoided, the efficiency of the training process of the distributed model is improved, and the resources of the computing nodes are effectively utilized.
The present application further provides a distributed training computing node management apparatus, a server, and a computer-readable storage medium, which have the above beneficial effects and are not described herein again.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the embodiments or the prior art descriptions will be briefly described below, it is obvious that the drawings in the following description are only the embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a flowchart of a distributed training method for managing computing nodes according to an embodiment of the present disclosure;
FIG. 2 is a flowchart of another distributed training computing node management method according to an embodiment of the present application;
fig. 3 is a schematic diagram of a distributed communication architecture according to an embodiment of the present application;
FIG. 4 is a diagram illustrating a parallel training architecture according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a distributed training computing node management apparatus according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
The core of the application is to provide a distributed training computing node management method, a computing node management device, a server and a computer readable storage medium, so as to improve the efficiency of distributed model training.
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the related technology, a deep neural network model or a large data set to be trained is split in a model parallel, data parallel or mixed parallel mode and is distributed to corresponding computing nodes; then, each computing node separately trains the split small-scale data or the sub-model and generates a local or intermediate training result; and finally, the distributed training system aggregates all local training results in a certain mode to obtain a global result and outputs the global training result. However, in practical applications, there are differences between different computing nodes, which results in a decrease in efficiency of the training process of the distributed model, and thus resources of the computing nodes cannot be effectively utilized.
Therefore, the method for managing the computing nodes in the distributed training is provided, a plurality of computing node groups are obtained by grouping all the computing nodes, and then a differentiated communication architecture and a data updating strategy are executed between the computing nodes in each computing node group and each computing node group, so that the reduction of the training efficiency caused by the difference between different computing nodes is avoided, the efficiency of the training process of the distributed model is improved, and the resources of the computing nodes are effectively utilized.
The following describes a distributed training method for managing compute nodes according to an embodiment.
Referring to fig. 1, fig. 1 is a flowchart of a distributed training computing node management method according to an embodiment of the present disclosure.
In this embodiment, the method may include:
s101, acquiring node information of each computing node;
this step is intended to acquire node information of each computing node.
The computing nodes are nodes for distributed training, and include but are not limited to various heterogeneous computing nodes such as CPUs, GPUs, FPGAs, mobile computing devices and the like. Heterogeneous computing nodes are computing nodes whose hardware structures differ from one another.
Wherein the node information includes: hardware information, current load running state information, network connection between the computing nodes and bandwidth conditions.
Further, the step may include:
and when the newly accessed computing node exists, acquiring the node information of the newly accessed computing node.
It can be seen that the present alternative scheme mainly explains how to acquire new node information. In this alternative, when there is a newly accessed computing node, node information of the newly accessed computing node is acquired.
Further, the step may include:
and acquiring the node information of each computing node, and recording the node information in a database.
It can be seen that the present alternative scheme mainly illustrates how node information is saved. In this alternative, the node information of each computing node is acquired, and the node information is recorded in the database.
S102, grouping all the computing nodes based on the node information of each computing node to obtain a plurality of computing node groups of different types;
on the basis of S101, this step is intended to group all the computing nodes based on the node information of each computing node, resulting in a plurality of computing node groups of different types.
That is, different computing nodes are grouped, and similar computing nodes are divided into the same computing node group, so that performance loss caused by differences among different computing nodes is eliminated.
Further, the step may include:
step1, similarity calculation is carried out on each calculation node based on node information of each calculation node to obtain similarity between each calculation node;
and 2, clustering all the computing nodes based on the similarity between each computing node to obtain a plurality of computing node groups.
It can be seen that the present alternative is primarily illustrative of how the grouping may be performed. In the alternative scheme, similarity calculation is carried out on each computing node based on the node information of each computing node, and the similarity between each computing node is obtained; and clustering all the computing nodes based on the similarity between each computing node to obtain a plurality of computing node groups. And clustering can be performed based on the obtained similarity.
Further, the process of calculating the similarity in the last alternative may include:
step1, calculating the firmware similarity between each computing node based on the firmware information of each computing node;
step2, calculating the network structure similarity between each computing node based on the network information of each computing node;
step 3, calculating the load similarity between each computing node based on the load information of each computing node;
and 4, determining the similarity between each computing node based on the firmware similarity, the network structure similarity and the load similarity between each computing node.
It can be seen that the present alternative scheme mainly explains how to calculate the similarity. In this alternative, the firmware similarity between each computing node is calculated based on the firmware information of each computing node; the network structure similarity between each computing node is calculated based on the network information of each computing node; the load similarity between each computing node is calculated based on the load information of each computing node; and the similarity between each computing node is determined based on the firmware similarity, the network structure similarity and the load similarity between each computing node. Namely, the similarity of the computing nodes is evaluated in the three directions of firmware similarity, network structure similarity and load similarity, so that the accuracy of judging the similarity is improved.
Further, the process of calculating the firmware similarity in the last alternative may include:
step1, calculating a hardware index of each computing node based on firmware information of each computing node;
and 2, calculating the Euclidean distance between hardware indexes of each computing node, and taking the Euclidean distance as the firmware similarity between each computing node.
It can be seen that the present alternative scheme mainly illustrates how the firmware similarity is calculated. In this alternative, the hardware index of each compute node is computed based on the firmware information of each compute node; and calculating Euclidean distance between hardware indexes between each computing node, and taking the Euclidean distance as the firmware similarity between each computing node.
Further, the process of calculating the network structure similarity in the above alternative may include:
step1, calculating a network address distance and a network neighbor index between each computing node based on network information of each computing node;
and 2, taking the network address distance and the network neighbor index between each computing node as the network structure similarity between each computing node.
It can be seen that the alternative scheme mainly explains how to calculate the network structure similarity. In the alternative, the network address distance and the network neighbor index between each computing node are calculated based on the network information of each computing node; and taking the network address distance and the network neighbor index between each computing node as the network structure similarity between each computing node.
Further, the process of calculating the load similarity in the last alternative may include:
step1, calculating equipment load condition indexes and network bandwidth condition indexes of each computing node based on load information of each computing node;
and 2, taking the equipment load condition index and the network bandwidth condition index as the load similarity of the computing node.
It can be seen that the present alternative scheme mainly explains how to calculate the load similarity. In the alternative, the equipment load condition index and the network bandwidth condition index of each computing node are calculated based on the load information of each computing node; and taking the equipment load condition index and the network bandwidth condition index as the load similarity of the computing node.
Further, the process of calculating the total similarity in the last alternative may include:
and performing weighted calculation on the firmware similarity, the network structure similarity and the load similarity among the calculation nodes to obtain the similarity among the calculation nodes.
Further, the process of clustering in the last alternative may include:
and clustering all the computing nodes based on the average value of the similarity between each computing node to obtain a plurality of computing node groups.
S103, setting a local decentralized communication architecture for the computing nodes in each computing node group, and setting a global centralized communication architecture between each computing node group;
on the basis of S102, the step aims to set a local decentralized communication framework for the computing nodes in each computing node group and set a global centralized communication framework between each computing node group. That is, similar nodes exist among nodes in each computing node group, and a local decentralized communication architecture can be adopted to improve the communication efficiency. And because different computing node groups have differences among the computing node groups, a global centralized communication architecture is adopted so as to coordinate the communication of the different computing node groups.
S104, performing distributed model training in a plurality of computing node groups based on the input model and data to obtain a training result; and the computing nodes in each computing node group adopt a synchronous updating strategy, and the computing nodes in each computing node group adopt an asynchronous updating strategy.
On the basis of S103, the step aims at carrying out distributed model training in a plurality of computing node groups based on the input model and data to obtain a training result; and the computing nodes in each computing node group adopt a synchronous updating strategy, and the computing nodes in each computing node group adopt an asynchronous updating strategy.
Further, the step may include:
step1, processing input models and data based on a distributed training format to obtain distributed training data and models;
and 2, performing distributed model training based on a synchronous updating strategy among each computing node, an asynchronous updating strategy among each computing node group, distributed training data and a model to obtain a training result.
It can be seen that this alternative is primarily illustrative of how training can be performed. In the alternative, the input model and data are processed based on the format of distributed training to obtain the data and model of distributed training; and performing distributed model training based on a synchronous updating strategy among each computing node, an asynchronous updating strategy among each computing node group, distributed training data and a model to obtain a training result.
Further, the process of performing data processing in the last alternative may include:
and carrying out denoising processing and standardization processing on the input model and the input data based on the format of the distributed training to obtain the data and the model of the distributed training.
It can be seen that the present alternative is primarily illustrative of how data and models may be processed. In the alternative, the input model and data are denoised and standardized based on the format of distributed training to obtain the data and model of distributed training.
Further, the process of data synchronization between the computing node groups in the last alternative may include:
and executing an asynchronous updating strategy among each computing node group based on a preset buffer zone.
It can be seen that the present alternative scheme is primarily illustrative of how an asynchronous update policy may be implemented. In this alternative, an asynchronous update policy is implemented between each compute node group based on a pre-set buffer.
Further, this embodiment may further include:
step1, performing visualization processing on a training result to obtain a visualization result;
and 2, displaying the visualization result.
It can be seen that the present alternative mainly illustrates that the training result can also be displayed visually. In the alternative scheme, the training result is visualized to obtain a visualized result; and displaying the visualization result.
In summary, in the embodiment, all the computing nodes are grouped to obtain a plurality of computing node groups, and then a differentiated communication architecture and a data updating strategy are executed between the computing nodes in each computing node group and each computing node group, so that reduction of training efficiency caused by differences between different computing nodes is avoided, efficiency of a training process of a distributed model is improved, and resources of the computing nodes are effectively utilized.
The following further describes a distributed training method for managing compute nodes according to another specific embodiment.
Referring to fig. 2, fig. 2 is a flowchart of another distributed training computing node management method according to an embodiment of the present disclosure.
In this embodiment, the following modules are used to implement the distributed training method for managing computing nodes.
1) And the heterogeneous computing node cluster information input module. The module records the hardware information, the current load running state information, the network connection and bandwidth condition between the computing nodes and other information of the computing nodes of the heterogeneous server participating in distributed training.
2) And the grouping module based on heterogeneous computing node clustering. The module quantifies the similarity between heterogeneous computing nodes according to the hardware information of the heterogeneous computing nodes, the current load running state, the importance and urgency of computing tasks and other characteristics. Secondly, the similarity is used as the key basis of clustering, and the clustering operation is carried out on the cluster formed by all the computing nodes. Finally, several "homogeneous" groups are generated by the clustering operation, providing a basis for the subsequent distributed communication architecture.
3) And designing a module based on the distributed communication architecture of the grouping information. The module designs a communication architecture combining local decentralized and global centralized architectures. The same group formed by the homogeneous computing nodes adopts a local decentralized framework; a global centralized architecture is adopted among different groups.
4) A data/model input module. The data/model input module mainly completes the input task of the data set and the model to be processed, processes the input data/model into the format required by the distributed training system and the like, and provides the data/model for the direct reading and calling of the module to be trained later.
5) And a heterogeneous hybrid parallel distributed training scheme module. A first-order optimization algorithm is adopted as a bottom-layer optimizer, a synchronous updating strategy is executed for homogeneous computing nodes in the same group, an asynchronous updating strategy is executed for heterogeneous computing nodes among different groups, a heterogeneous hybrid parallel distributed training scheme combining the synchronous updating strategy and the asynchronous updating strategy is designed, and the accuracy and the effectiveness of the execution of training tasks are guaranteed.
6) And a training result output module. This module is responsible for outputting a global solution to the training task.
In summary, all modules cooperatively complete various complex training tasks in the deep learning field.
Heterogeneous computing node cluster information input module
Computing node hardware devices are a necessary prerequisite for distributed training. With the popularization of technologies such as internet of things, edge computing and cloud computing, various heterogeneous computing nodes such as a CPU (central processing unit), a GPU (graphic processing unit), an FPGA (field programmable gate array) and mobile computing equipment are simultaneously accessed to a network to form a complete distributed training system. The heterogeneous computing node cluster information input module records hardware information of computing nodes of the heterogeneous server participating in distributed training, current load operation state information of each computing node, network connection and bandwidth conditions among the computing nodes and other information (as shown in table 1), and provides important basis for subsequent grouping modules.
Table 1: Schematic table of node information
And the grouping module is based on heterogeneous computing node clustering.
Because various heterogeneous devices exist in the distributed system, their performance differs greatly when computing or processing various tasks, so that efficient cooperation among heterogeneous computing nodes in the distributed system is difficult and the execution efficiency of the system is reduced. To this end, the present embodiment proposes a grouping idea for heterogeneous computing nodes, i.e., grouping "similar" computing devices into the same group to achieve efficient coordination with low overhead. The module quantifies the similarity between heterogeneous computing nodes according to the hardware information of the heterogeneous computing nodes, the current load running state, the importance and urgency of computing tasks and other characteristics. Secondly, the similarity is used as the key basis of clustering, and the clustering operation is carried out on the cluster formed by all the computing nodes. Finally, several "homogeneous" groups are generated by the clustering operation, providing a basis for the subsequent distributed communication architecture.
Among them, two key problems are: how to measure the similarity between heterogeneous computing nodes, and what clustering algorithm to use.
Measuring the similarity between heterogeneous computing nodes: the similarity (Sim) of the heterogeneous computing nodes in this embodiment is defined as: the combination of three similarities of firmware similarity (Dev), network structure similarity (Net) and Load similarity (Load) is as follows:
Sim(X,Y)= a*Dev(X,Y)+b*Net(X,Y)+c*Load(X,Y)。
wherein a, b, c respectively represent coefficients greater than 0, and a + b + c =1. Where Sim (X, Y) represents the similarity between compute node X and compute node Y.
Where Dev(X, Y) represents the firmware similarity between the computing node X and the computing node Y, and is defined as the Euclidean distance over the core computing hardware index A and the storage capacity index B (note that all indexes need to be normalized); its mathematical formula is defined as follows:
Dev(X,Y)=(A^2+B^2)^(1/2)。
the network structure similarity between the computing node X and the computing node Y is expressed by Net (X, Y), and is defined as the combination of a network 1P index C and a network neighbor index D, and the mathematical formula is defined as follows:
Net(X,Y)=(|ComNe1ghbor(X,Y)|/(|Ne1ghbor(X)|+|Ne1ghbor(Y)|))*d(X,Y)。
the network index C is denoted by d (X, Y) and represents the cos distance of the network 1P between the computation nodes, i.e., d (X, Y) = X × Y/(| X | + | Y |).
Wherein, the network neighbor index D is recorded as: (| ComNe1ghbor (X, Y) |/(| Ne1ghbor (X) | + | Ne1ghbor (Y) |)), wherein | ComNe1ghbor (X, Y) | represents the number of common neighbor nodes between the computing node X and the computing node Y, | Ne1ghbor (X) | + | Ne1ghbor (Y) | represents the total number of all neighbor nodes (including itself) of the computing node X and the computing node Y.
Load (X, Y) represents the Load similarity between the computing node X and the computing node Y, and is defined as the combination of the index of the device Load condition E and the index of the network bandwidth occupation condition F, and the specific mathematical formula is defined as follows:
Load(X,Y) = (1/2)*E + (1/2)*F.
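By way of illustration only, the Python sketch below evaluates Sim(X, Y) = a*Dev + b*Net + c*Load as written above. The record fields (compute_index, storage_index, ip_vec, load, bandwidth), the neighbor sets, and the example weights a, b, c are assumptions made for the sketch, as is the reading of the per-node indexes A, B, E, F as normalized scalar values; the disclosure itself does not prescribe these details.

```python
import math

def dev_similarity(x, y):
    # Firmware similarity Dev(X,Y) = sqrt(A^2 + B^2): A and B are taken here as the
    # differences of the normalized compute and storage indexes (an assumption).
    A = x["compute_index"] - y["compute_index"]
    B = x["storage_index"] - y["storage_index"]
    return math.sqrt(A ** 2 + B ** 2)

def net_similarity(x, y, neighbors):
    # Net(X,Y) = (|ComNeighbor| / (|Neighbor(X)| + |Neighbor(Y)|)) * d(X,Y),
    # with d(X,Y) the cosine distance of the network IP vectors.
    common = len(neighbors[x["id"]] & neighbors[y["id"]])
    total = len(neighbors[x["id"]]) + len(neighbors[y["id"]])
    dot = sum(p * q for p, q in zip(x["ip_vec"], y["ip_vec"]))
    norm = (math.sqrt(sum(p * p for p in x["ip_vec"]))
            * math.sqrt(sum(q * q for q in y["ip_vec"])))
    d = dot / norm if norm else 0.0
    return (common / total if total else 0.0) * d

def load_similarity(x, y):
    # Load(X,Y) = (1/2)*E + (1/2)*F: E and F are taken here as the agreement of
    # the normalized device-load and bandwidth-occupation indexes (an assumption).
    E = 1.0 - abs(x["load"] - y["load"])
    F = 1.0 - abs(x["bandwidth"] - y["bandwidth"])
    return 0.5 * E + 0.5 * F

def sim(x, y, neighbors, a=0.4, b=0.3, c=0.3):
    # Overall similarity Sim(X,Y) = a*Dev + b*Net + c*Load with a + b + c = 1.
    return (a * dev_similarity(x, y)
            + b * net_similarity(x, y, neighbors)
            + c * load_similarity(x, y))
```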
similarity-based grouping (clustering) algorithm: based on the above method for calculating the similarity between any two computing nodes (assuming that there are n computing nodes in the distributed training system), the present embodiment proposes the following grouping (clustering) strategy for heterogeneous computing nodes:
the method comprises the following steps: starting from a computing device with the reference number 1, the similarity between the node and all other nodes is calculated, namely Sim (1, 1), sim (1, 2), sim (1, 3), \ 8230;, sim (1, n) is obtained and calculated
Figure SMS_2
(ii) a From the reference numerals The computing device of 2 starts by computing its similarity to all the other nodes, i.e. obtainingSim (2, 2), sim (2, 3), sim (2, 4), \ 8230, sim (2, n), and calculate ^ or ^ based on the measured values>
Figure SMS_3
(ii) a And by analogy, calculating the similarity upper matrix of all the nodes and the average value of the similarity of each row.
Table 2: Schematic table of the similarity measurement
Step two: starting from the computing node numbered 1, the computing nodes numbered 1 to n are examined. If the similarity Sim(1, l) between the computing node numbered 1 and the computing node numbered l is greater than or equal to S1, the computing node numbered l is divided into the same group (cluster) as the computing node numbered 1. Through the above operations, the first group Group1 = {node 1, node 2, …, node k} is obtained. Note that the computing nodes whose similarity is lower than the average similarity S1 will be used in the grouping process for the subsequently numbered nodes.
Step three: starting from the computing node numbered 2, the computing nodes numbered 2 to n are examined, and only the nodes that were culled in step two (i.e., nodes that are not in Group1) are considered. If the similarity Sim(2, l) between the computing node numbered 2 and the computing node numbered l is greater than or equal to S2, they are classified into the same group. Through the above operation, a second group Group2 = {node 2, node l, …} is obtained.
Step four: and repeating the process until the n computing nodes are processed. Finally, m disjoint grouping sets Group1, group2, \ 8230and Group pm without coincident computing nodes can be obtained.
And designing a module based on the distributed communication architecture of the grouping information.
Referring to fig. 3, fig. 3 is a schematic diagram of a distributed communication architecture according to an embodiment of the present disclosure.
Considering the influence of factors such as hardware equipment, network bandwidth and transmission rate, communication among computing nodes of the distributed training system often becomes a bottleneck, and training performance is severely restricted. In this case, the module focuses on designing a communication architecture that combines local decentralized with a global centralized architecture. As shown in fig. 3.
Fig. 3 shows a schematic diagram of a distributed system communication architecture designed by the present embodiment. Specifically, the present embodiment provides a local decentralized architecture and a global centralized architecture. On one hand, a local decentralized framework is adopted in the same group formed by the homogeneous nodes; on the other hand, a global centralized architecture is adopted among different groups.
The main advantages of this architecture are: firstly, the homogeneous nodes have small differences in processing capacity, computing performance and the like, so that the method is suitable for executing the synchronous updating strategy, and the synchronous updating strategy has high execution efficiency; and secondly, the difference between different groups is large, so that a global centralized framework capable of executing an asynchronous updating strategy is designed, and the synchronization and the average of global model parameters of each local group training result are realized.
A data/model input module.
The data/model input module is mainly used for completing the input task of a data set and a model to be processed, processing the input data/model into a format required by a distributed training system, and performing operations such as noise removal, standardization and the like for direct reading and calling of a subsequent training module.
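A minimal sketch of such input processing, assuming plain numeric samples; the specific denoising choice (percentile clipping), the normalization constant and the function name prepare_inputs are illustrative assumptions rather than the module's prescribed behavior.

```python
import numpy as np

def prepare_inputs(samples, num_groups):
    data = np.asarray(samples, dtype=np.float64)
    # Denoising: clip extreme outliers (a simple stand-in for noise removal).
    low, high = np.percentile(data, [1, 99], axis=0)
    data = np.clip(data, low, high)
    # Standardization: zero mean and unit variance per feature.
    data = (data - data.mean(axis=0)) / (data.std(axis=0) + 1e-8)
    # Shard into one partition per computing node group for distributed training.
    return np.array_split(data, num_groups)
```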
Referring to fig. 4, fig. 4 is a schematic diagram of a parallel training architecture according to an embodiment of the present disclosure.
The heterogeneous hybrid parallel distributed training scheme module is shown in fig. 4. The embodiment designs a distributed training method combining global centralization and local decentralization. Specifically, after the computing nodes in the cluster are processed by the clustering grouping module, different computing nodes are divided into different groups. Each group is connected to a particular server node, there is no direct connection between different groups, and the server nodes are directly connected with each other. The computing nodes in the same group adopt a synchronous updating strategy, and different groups adopt an asynchronous updating strategy.
In order to improve the efficiency with which the server nodes perform asynchronous global operations, the present embodiment sets a Buffer for each server node, and the size of the Buffer is dynamically adjusted according to the number of corresponding groups. For example, in fig. 4, Group1 (group 1) is connected to the server node Server1, and the size of Buffer1 at Server1 is 1. Group2 and Group3 are connected to the server node Server2, and the size of Buffer2 at Server2 is 2.
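The buffer rule can be sketched as follows, mirroring the Buffer1 = 1 and Buffer2 = 2 example above; the function name allocate_buffers and the dictionary layout are assumptions.

```python
from collections import defaultdict

def allocate_buffers(group_to_server):
    # One gradient slot per group connected to each server node.
    sizes = defaultdict(int)
    for group, server in group_to_server.items():
        sizes[server] += 1
    return {server: [None] * size for server, size in sizes.items()}

buffers = allocate_buffers({"Group1": "Server1", "Group2": "Server2", "Group3": "Server2"})
# buffers == {"Server1": [None], "Server2": [None, None]}
```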
In summary, the heterogeneous hybrid parallel distributed training scheme includes: the server node workflow and the computing node workflow are two parts.
The server node workflow may include:
inputting: the total iteration number T, the threshold value Q and the learning rate eta.
And (3) outputting: global parameter W (t + 1).
Step1, when the iteration time t =0, setting a Buffer1 at each Server node Server1 as an empty set, and making a local summary parameter Server1 (W) =0 at the Server 1.
Step2, when the iteration time t = m, the Server node Server1 obtains the average gradients Gj and Gk generated by all the groups (e.g., Groupj, Groupk) connected thereto, and calculates the "global gradient" G1 of these groups, i.e., G1 = (1/2)*(Gj + Gk). The global parameters are updated based on the gradient G1, as follows:
W(t+1)=W(t)-η*G1。
and step 3, sending the global parameter W (t + 1) generated at the moment t = m to the corresponding groups Gj and Gk.
Step 4, if the iteration gap Gap = |t1 - t2| between any two Server nodes Server1 and Server2 is greater than the threshold Q, global synchronization between the Server1 and the Server2 is performed to obtain a global gradient G = (1/2)*(G1 + G2), and the global parameter is updated accordingly, as shown in the following formula:
W(t+1)=W(t)-η* G。
where G1 denotes the gradient cached in the Buffer of the Server1 (Server 1), and G2 denotes the gradient cached in the Buffer of the Server2 (Server 2).
Step 5, once t = T, the server node workflow ends.
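The server-node workflow of Steps 1 to 5 can be sketched as below, under explicit assumptions: get_group_gradients(t) stands in for receiving the per-group average gradients at iteration t, and get_peer_state() stands in for querying another server node's iteration count and buffered gradient for the Step 4 check. Both hooks are hypothetical and not APIs from the disclosure.

```python
def server_workflow(get_group_gradients, T, Q, eta, W, get_peer_state=None):
    buffer = []                                # Step 1: the buffer is empty at t = 0
    for t in range(T):                         # Step 5: stop once t reaches T
        grads = get_group_gradients(t)         # Step 2: gradients Gj, Gk, ... from connected groups
        G1 = sum(grads) / len(grads)           # this server's "global gradient" G1
        W = W - eta * G1                       # W(t+1) = W(t) - eta * G1
        buffer = list(grads)                   # cache the group gradients in the buffer
        # Step 3 (not modeled here): send W(t+1) back to the connected groups.
        if get_peer_state is not None:         # Step 4: asynchronous peer synchronization
            peer_t, G2 = get_peer_state()
            if abs(t - peer_t) > Q:            # iteration gap exceeds the threshold Q
                G = 0.5 * (G1 + G2)
                W = W - eta * G                # W(t+1) = W(t) - eta * G
    return W
```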
The computing node workflow may include:
step1, for each Group1, it is assumed that Group1 includes a computing node w1, a computing node w2, and a computing node w3. When t = m, group1 acquires a global parameter W (t) sent by a server node corresponding to the global parameter W (t), and each computing node iteratively computes a new gradient based on W (t) and a local data sample:
node w1 calculates the gradient g1: g1 = ∇f(W(t); ξ1), where ξ1 represents the sample data of node w1;
node w2 calculates the gradient g2: g2 = ∇f(W(t); ξ2), where ξ2 represents the sample data of node w2;
node w3 calculates the gradient g3: g3 = ∇f(W(t); ξ3), where ξ3 represents the sample data of node w3.
The average gradient of the above 3 nodes is G1 = (1/3)*(g1 + g2 + g3);
and 2, sending the new gradient G1 to the corresponding server node.
And Step 3, repeating Step1 and Step 2 until t = T.
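A minimal group-side sketch of Steps 1 to 3: each node computes its gradient on its own samples and the group average G1 is returned for sending to the server node. Here grad_fn stands in for ∇f(W(t); ξ), and the toy quadratic loss in the usage example is an assumption used only to make the sketch runnable.

```python
import numpy as np

def group_step(W, node_samples, grad_fn):
    # Step 1: each node in the group computes g_i = grad_fn(W, xi_i) on its local samples.
    grads = [grad_fn(W, xi) for xi in node_samples]
    # Average over the group, e.g. G1 = (1/3)*(g1 + g2 + g3) for three nodes.
    G1 = sum(grads) / len(grads)
    return G1  # Step 2: G1 is what gets sent to the corresponding server node

# Toy usage with f(W; xi) = 0.5*(W - xi)^2, so grad_fn(W, xi) = W - xi.
samples = [np.array([1.0]), np.array([2.0]), np.array([3.0])]
G1 = group_step(np.array([0.0]), samples, grad_fn=lambda W, xi: W - xi)
```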
And finally, a result output module.
And the result output module is responsible for outputting the final result of the distributed training system and presenting the final result to the user in a visual mode. The user can further modify or adjust the training scheme according to the information, so that the system improvement is facilitated.
The embodiment provides a mixed distributed training method for fusing a centralization framework and a decentralization framework and adopting a synchronous updating strategy and an asynchronous updating strategy based on a grouping (clustering) idea. For this reason, 6 main functional modules are designed: the system comprises a heterogeneous computing node cluster information input module, a grouping module based on heterogeneous computing node clustering, a distributed communication architecture design module based on grouping information, a data/model input module, a heterogeneous hybrid parallel distributed training scheme module and a training result output module. Each module takes its own role and completes the training task cooperatively.
According to the similarity index definition provided by the embodiment, grouping or clustering of heterogeneous computing nodes is realized, so that similar equipment is divided into the same group, and various expenses caused by difference among the equipment are reduced;
meanwhile, a packet-based distributed communication architecture is designed and a flexible synchronous updating strategy is adopted. Aiming at the computing nodes in the same group, a decentralized framework is adopted and a synchronous updating strategy is executed, so that the computing precision and efficiency of the homogeneous nodes are ensured; aiming at the computing nodes of different groups, a centralized framework is adopted and an asynchronous updating strategy is executed, so that the resource utilization rate of the heterogeneous nodes is ensured, and resource idling and waste are avoided.
Therefore, in the embodiment, all the computing nodes are grouped to obtain a plurality of computing node groups, and then a differentiated communication architecture and a data updating strategy are executed between the computing nodes in each computing node group and each computing node group, so that the reduction of training efficiency caused by the difference between different computing nodes is avoided, the efficiency of the training process of the distributed model is improved, and the resources of the computing nodes are effectively utilized.
The following further describes a distributed training method for managing compute nodes according to another embodiment.
In this embodiment, the method may include:
the client sends the model and the data to be trained to the server so that the server can obtain the node information of each computing node; grouping all the computing nodes based on the node information of each computing node to obtain a plurality of computing node groups of different types; setting a local decentralized communication architecture for the computing nodes in each computing node group, and setting a global centralized communication architecture between each computing node group; performing distributed model training in a plurality of computing node groups based on the input model and data to obtain and return a training result; the method comprises the following steps that a synchronous updating strategy is adopted among computing nodes in each computing node group, and an asynchronous updating strategy is adopted among the computing node groups;
and the client acquires the training result and displays the training result.
Therefore, in the embodiment, all the computing nodes are grouped to obtain a plurality of computing node groups, and then a differentiated communication architecture and a data updating strategy are executed between the computing nodes in each computing node group and each computing node group, so that the reduction of training efficiency caused by the difference between different computing nodes is avoided, the efficiency of the training process of the distributed model is improved, and the resources of the computing nodes are effectively utilized.
The following further describes a distributed training method for managing compute nodes according to another embodiment.
In this embodiment, the method may include:
the server acquires node information of each computing node; grouping all the computing nodes based on the node information of each computing node to obtain a plurality of computing node groups of different types; setting a local decentralized communication architecture for the computing nodes in each computing node group, and setting a global centralized communication architecture between each computing node group; performing distributed model training in a plurality of computing node groups based on a model and data input by a client to obtain a training result; the method comprises the following steps that a synchronous updating strategy is adopted among computing nodes in each computing node group, and an asynchronous updating strategy is adopted among the computing node groups;
and the client displays the training result.
Therefore, in the embodiment, all the computing nodes are grouped to obtain a plurality of computing node groups, and then a differentiated communication architecture and a data updating strategy are executed between the computing nodes in each computing node group and each computing node group, so that the reduction of training efficiency caused by the difference between different computing nodes is avoided, the efficiency of the training process of the distributed model is improved, and the resources of the computing nodes are effectively utilized.
In the following, the distributed trained computing node management apparatus provided in the embodiment of the present application is introduced, and the below-described distributed trained computing node management apparatus and the above-described distributed trained computing node management method may be referred to in correspondence with each other.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a distributed training computing node management apparatus according to an embodiment of the present disclosure.
In this embodiment, the apparatus may include:
a node information obtaining module 100, configured to obtain node information of each computing node;
a node grouping module 200, configured to group all computing nodes based on node information of each computing node to obtain multiple computing node groups of different types;
a communication architecture setting module 300, configured to set a local decentralized communication architecture for the compute nodes in each compute node group, and set a global centralized communication architecture between each compute node group;
a model training module 400, configured to perform distributed model training in multiple computing node groups based on an input model and data to obtain a training result; and the computing nodes in each computing node group adopt a synchronous updating strategy, and the computing nodes in each computing node group adopt an asynchronous updating strategy.
Referring to fig. 6, fig. 6 is a schematic structural diagram of a server provided in the embodiment of the present application, where the server may include:
a memory for storing a computer program;
and the processor is used for realizing the steps of the distributed training computing node management method when executing the computer program.
As shown in fig. 6, which is a schematic diagram of a composition structure of a server, the server may include: a processor 10, a memory 2, a communication interface 12 and a communication bus 13. The processor 10, the memory 2 and the communication interface 12 all communicate with each other via a communication bus 13.
In the embodiment of the present application, the processor 10 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), a digital signal processor, a field-programmable gate array (FPGA) or other programmable logic device.
The processor 10 may call a program stored in the memory 2, and in particular, the processor 10 may perform the operations in an embodiment of the distributed training computing node management method.
The memory 2 is used for storing one or more programs, the program may include program codes, the program codes include computer operation instructions, in this embodiment, the memory 2 stores at least the program for implementing the following functions:
acquiring node information of each computing node;
grouping all the computing nodes based on the node information of each computing node to obtain a plurality of computing node groups of different types;
setting a local decentralized communication architecture for the computing nodes in each computing node group, and setting a global centralized communication architecture between each computing node group;
performing distributed model training in a plurality of computing node groups based on the input model and data to obtain a training result; and the computing nodes in each computing node group adopt a synchronous updating strategy, and the computing nodes in each computing node group adopt an asynchronous updating strategy.
In one possible implementation, the memory 2 may include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created during use.
Further, the memory 2 may comprise high speed random access memory, and may also comprise non-volatile memory, such as at least one disk storage device or other volatile solid state storage device.
The communication interface 12 may be an interface of a communication module for connecting with other devices or systems.
Of course, it should be noted that the structure shown in fig. 6 does not constitute a limitation on the server in the embodiment of the present application, and in practical applications, the server may include more or less components than those shown in fig. 6, or some components may be combined.
The present application further provides a computer-readable storage medium, on which a computer program is stored, and when being executed by a processor, the computer program can implement the steps of any one of the above-mentioned distributed training computing node management methods.
The computer-readable storage medium may include: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
For the introduction of the computer-readable storage medium provided in the present application, please refer to the above method embodiments, which are not described herein again.
The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed in the embodiment corresponds to the method disclosed in the embodiment, so that the description is simple, and the relevant points can be referred to the description of the method part.
Those of skill in the art will further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or a combination of both. To clearly illustrate this interchangeability of hardware and software, the various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), Read-Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The present application provides a computing node management method, apparatus, server, and computer-readable storage medium for distributed training. The principles and embodiments of the present application are described herein using specific examples, which are intended only to help understand the method of the present application and its core idea. It should be noted that those skilled in the art may make several improvements and modifications to the present application without departing from the principle of the present application, and such improvements and modifications also fall within the scope of the claims of the present application.

Claims (17)

1. A computing node management method for distributed training, characterized by comprising the following steps:
acquiring node information of each computing node;
grouping all the computing nodes based on the node information of each computing node to obtain a plurality of computing node groups of different types;
setting a local decentralized communication architecture for the computing nodes in each computing node group, and setting a global centralized communication architecture between each computing node group;
and carrying out distributed model training in the plurality of computing node groups based on the input model and data to obtain a training result.
2. The method according to claim 1, wherein a synchronous update policy is applied between the computing nodes in each of the computing node groups, and an asynchronous update policy is applied between the computing node groups.
3. The method for managing computing nodes according to claim 1, wherein obtaining node information of each computing node comprises:
when a newly accessed computing node exists, acquiring node information of the newly accessed computing node; wherein the node information includes: hardware information, current load running state information, network connection and bandwidth conditions among the computing nodes;
and recording the node information in a database.
4. The method according to claim 1, wherein grouping all the computing nodes into groups based on the node information of each computing node to obtain a plurality of computing node groups of different types comprises:
performing similarity calculation on each computing node based on the node information of each computing node to obtain the similarity between each computing node;
and clustering all the computing nodes based on the similarity between each computing node to obtain a plurality of computing node groups.
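Claim 4 leaves the clustering procedure open. Purely as an illustrative sketch under that assumption (not taken from this application), the following Python snippet groups nodes with a simple union-find over a pairwise similarity matrix, merging any two nodes whose similarity reaches a threshold:

```python
from typing import Dict, List

def cluster_by_similarity(sim: List[List[float]], threshold: float = 0.8) -> List[List[int]]:
    """Union-find grouping: nodes whose pairwise similarity is at least the
    threshold end up in the same computing node group."""
    n = len(sim)
    parent = list(range(n))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    def union(i: int, j: int) -> None:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= threshold:
                union(i, j)

    groups: Dict[int, List[int]] = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# Example: three nodes, where nodes 0 and 1 are highly similar.
sim = [[1.0, 0.9, 0.2],
       [0.9, 1.0, 0.1],
       [0.2, 0.1, 1.0]]
print(cluster_by_similarity(sim))   # [[0, 1], [2]]
```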
5. The method for managing computing nodes according to claim 4, wherein performing similarity calculation on each computing node based on the node information of each computing node to obtain a similarity between each computing node includes:
calculating the firmware similarity between each computing node based on the firmware information of each computing node;
calculating the network structure similarity between each computing node based on the network information of each computing node;
calculating the load similarity of each computing node based on the load information of each computing node;
and determining the similarity between each computing node based on the firmware similarity, the network structure similarity and the load similarity between each computing node.
6. The computing node management method of claim 5, wherein computing the firmware similarity between each computing node based on the firmware information of each computing node comprises:
calculating a hardware index for each of the compute nodes based on the firmware information for each of the compute nodes;
and calculating Euclidean distance between hardware indexes among each computing node, and taking the Euclidean distance as the similarity of the firmware among each computing node.
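For claim 6, a minimal sketch of the Euclidean-distance computation might look as follows; the composition of the hardware index vector shown here is a hypothetical example, not a definition from this application:

```python
import math
from typing import Sequence

def firmware_similarity(index_a: Sequence[float], index_b: Sequence[float]) -> float:
    """Euclidean distance between the hardware index vectors of two computing
    nodes, taken directly as the firmware similarity (smaller means more alike)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(index_a, index_b)))

# Hypothetical hardware indexes: (GPU count, memory in hundreds of GB, clock in GHz).
node_a = (8, 2.56, 1.5)
node_b = (8, 5.12, 1.4)
print(firmware_similarity(node_a, node_b))   # about 2.56
```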
7. The method for managing computing nodes according to claim 5, wherein calculating the network structure similarity between each computing node based on the network information of each computing node comprises:
calculating a network address distance and a network neighbor index between each computing node based on the network information of each computing node;
and taking the network address distance and the network neighbor index between each computing node as the network structure similarity between each computing node.
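Claim 7 does not specify how the network address distance or the network neighbor index is computed. The sketch below shows one hypothetical concretization (the highest differing bit of the IPv4 addresses as the address distance, and the Jaccard overlap of neighbor sets as the neighbor index); both choices are illustrative assumptions only:

```python
import ipaddress
from typing import Set

def address_distance(ip_a: str, ip_b: str) -> int:
    """Position of the highest bit in which the two IPv4 addresses differ:
    0 means identical addresses, larger values mean the nodes sit further
    apart in the address hierarchy."""
    xor = int(ipaddress.IPv4Address(ip_a)) ^ int(ipaddress.IPv4Address(ip_b))
    return xor.bit_length()

def neighbor_index(neighbors_a: Set[str], neighbors_b: Set[str]) -> float:
    """Jaccard overlap of the two nodes' directly connected neighbors."""
    if not neighbors_a and not neighbors_b:
        return 1.0
    return len(neighbors_a & neighbors_b) / len(neighbors_a | neighbors_b)

print(address_distance("10.0.1.5", "10.0.1.9"))            # 4 (same subnet, low bits differ)
print(neighbor_index({"n1", "n2", "n3"}, {"n2", "n3"}))    # 2/3, roughly 0.67
```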
8. The method of claim 5, wherein computing the load similarity of each computing node based on the load information of each computing node comprises:
calculating an equipment load condition index and a network bandwidth condition index of each computing node based on the load information of each computing node;
and taking the equipment load condition index and the network bandwidth condition index as the load similarity of the computing node.
9. The method for managing the computing nodes according to claim 6, wherein determining the similarity between each of the computing nodes based on the firmware similarity, the network structure similarity and the load similarity between each of the computing nodes comprises:
and performing weighted calculation on the firmware similarity, the network structure similarity and the load similarity between each of the computing nodes to obtain the similarity between each of the computing nodes.
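A one-line weighted combination in the spirit of claim 9 is sketched below; the specific weight values are illustrative and are not fixed by this application:

```python
def combined_similarity(firmware: float, network: float, load: float,
                        weights: tuple = (0.5, 0.3, 0.2)) -> float:
    """Weighted sum of the three component similarities between two computing nodes."""
    w_fw, w_net, w_load = weights
    return w_fw * firmware + w_net * network + w_load * load

print(round(combined_similarity(0.9, 0.8, 0.6), 2))   # 0.5*0.9 + 0.3*0.8 + 0.2*0.6 = 0.81
```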
10. The method of claim 1, wherein performing distributed model training in the plurality of computing node groups based on the input model and data to obtain a training result comprises:
processing the input model and data based on a distributed training format to obtain distributed training data and a distributed training model;
and performing distributed model training based on a synchronous updating strategy between the computing nodes in each computing node group, an asynchronous updating strategy between the computing node groups, and the distributed training data and model, to obtain the training result.
11. The method of claim 10, wherein processing the input model and data based on the format of distributed training to obtain the data and model of distributed training comprises:
and carrying out denoising processing and standardization processing on the input model and the input data based on the format of the distributed training to obtain the data and the model of the distributed training.
12. The method according to claim 10, wherein the asynchronous updating policy between each of the computing node groups comprises:
and executing an asynchronous updating strategy among each computing node group based on a preset buffer zone.
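Claim 12 only requires that the asynchronous inter-group update go through a preset buffer. A minimal, purely illustrative sketch using a bounded queue and threads (the names and the scalar "parameter" are hypothetical simplifications, not this application's implementation) could look like this:

```python
import queue
import threading

# Preset bounded buffer: each computing node group pushes its locally
# aggregated update here; the global side drains it asynchronously.
update_buffer = queue.Queue(maxsize=8)
global_param = 0.0

def group_worker(group_id: str, local_update: float) -> None:
    """A group finishes its synchronous intra-group step and posts the result."""
    update_buffer.put((group_id, local_update))   # blocks only if the buffer is full

def global_aggregator(num_updates: int) -> None:
    """Apply updates as they arrive, without waiting for all groups."""
    global global_param
    for _ in range(num_updates):
        group_id, update = update_buffer.get()
        global_param += update
        print(f"applied update from {group_id}, param = {global_param:.2f}")

workers = [threading.Thread(target=group_worker, args=(f"group{i}", 0.1 * (i + 1)))
           for i in range(3)]
aggregator = threading.Thread(target=global_aggregator, args=(3,))
aggregator.start()
for w in workers:
    w.start()
for w in workers:
    w.join()
aggregator.join()
```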
13. A computing node management method for distributed training, characterized by comprising the following steps:
a client sends a model to be trained and data to a server, so that the server acquires node information of each computing node, groups all the computing nodes based on the node information of each computing node to obtain a plurality of computing node groups of different types, sets a local decentralized communication architecture for the computing nodes in each computing node group and a global centralized communication architecture between the computing node groups, and performs distributed model training in the plurality of computing node groups based on the input model and data to obtain and return a training result;
and the client acquires the training result and displays the training result.
14. A computing node management method for distributed training, characterized by comprising the following steps:
a server acquires node information of each computing node, groups all the computing nodes based on the node information of each computing node to obtain a plurality of computing node groups of different types, sets a local decentralized communication architecture for the computing nodes in each computing node group and a global centralized communication architecture between the computing node groups, and performs distributed model training in the plurality of computing node groups based on the model and data input by a client to obtain a training result;
and the client displays the training result.
15. A computing node management apparatus for distributed training, characterized by comprising:
the node information acquisition module is used for acquiring the node information of each computing node;
the node grouping module is used for grouping all the computing nodes based on the node information of each computing node to obtain a plurality of computing node groups of different types;
the communication architecture setting module is used for setting a local decentralized communication architecture for the computing nodes in each computing node group and setting a global centralized communication architecture between each computing node group;
and the model training module is used for carrying out distributed model training in the plurality of computing node groups based on the input model and data to obtain a training result.
16. A server, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the computing node management method of any one of claims 1 to 12 when executing the computer program.
17. A computer-readable storage medium, characterized in that a computer program is stored thereon, and the computer program, when executed by a processor, implements the steps of the computing node management method according to any one of claims 1 to 12.
CN202310180801.XA 2023-03-01 2023-03-01 Distributed training computing node management method and related device Pending CN115865607A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310180801.XA CN115865607A (en) 2023-03-01 2023-03-01 Distributed training computing node management method and related device

Publications (1)

Publication Number Publication Date
CN115865607A (en) 2023-03-28

Family

ID=85659392

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310180801.XA Pending CN115865607A (en) 2023-03-01 2023-03-01 Distributed training computing node management method and related device

Country Status (1)

Country Link
CN (1) CN115865607A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020107351A1 (en) * 2018-11-29 2020-06-04 袁振南 Model training method and nodes thereof, network and storage device
CN110969198A (en) * 2019-11-24 2020-04-07 广东浪潮大数据研究有限公司 Distributed training method, device, equipment and storage medium for deep learning model
CN111813858A (en) * 2020-07-10 2020-10-23 电子科技大学 Distributed neural network hybrid synchronous training method based on self-organizing grouping of computing nodes
CN115081620A (en) * 2022-06-20 2022-09-20 上海电力大学 Acceleration distributed training method based on packet asynchronous parallel strategy

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116542324A (en) * 2023-07-06 2023-08-04 之江实验室 Distributed asynchronous protocol method and device for intelligent computing
CN116542324B (en) * 2023-07-06 2023-10-10 之江实验室 Distributed asynchronous protocol method and device for intelligent computing


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20230328