CN111813858B - Distributed neural network hybrid synchronous training method based on self-organizing grouping of computing nodes - Google Patents

Distributed neural network hybrid synchronous training method based on self-organizing grouping of computing nodes

Info

Publication number
CN111813858B
CN111813858B (application number CN202010662415.0A)
Authority
CN
China
Prior art keywords: node, similar, iteration, cluster, efficiency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010662415.0A
Other languages
Chinese (zh)
Other versions
CN111813858A (en)
Inventor
陈爱国
郑旭
罗光春
田玲
谢渊
邹冰洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202010662415.0A
Publication of CN111813858A
Application granted
Publication of CN111813858B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F16/275 Synchronous replication
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention relates to distributed neural network technology and discloses a distributed neural network hybrid synchronous training method based on self-organizing grouping of computing nodes, solving the problem that traditional distributed neural network synchronization algorithms cannot properly balance model accuracy against training efficiency. The method comprises the following steps: S1, predicting the synchronization efficiency of each computing node with Kalman filtering, and analyzing the fluctuation of each node's synchronization efficiency with the mean square error; S2, grouping the computing nodes in a self-organizing, real-time manner based on each node's predicted synchronization efficiency and the similarity of its efficiency fluctuation; and S3, training the model with different synchronization strategies between groups and within groups according to the real-time self-organizing grouping result.

Description

Distributed neural network hybrid synchronous training method based on self-organizing grouping of computing nodes
Technical Field
The invention relates to a distributed neural network technology, in particular to a distributed neural network hybrid synchronous training method based on self-organizing grouping of computing nodes.
Background
When a single-machine system uses very large-scale data to train a very large-scale neural network, training efficiency becomes low, and in severe cases the training process fails outright. Distributed neural networks effectively solve the problems of low training efficiency and training failure that very large-scale data causes on a single machine.
Distributed neural network training is divided into data parallelism and model parallelism, depending on whether the training data or the model data are partitioned; data parallelism is a key technique for improving the efficiency of large-scale data training.
In a data-parallel distributed neural network, the training data are first split across several computing nodes, and each computing node performs single-machine optimization on its share of the data. After each round of single-machine optimization, every computing node sends its gradient parameters to the parameter server for parameter fusion, the model data are updated, and the model data are then redistributed to the computing nodes for the next round of iteration; the system architecture is shown in FIG. 1. In a distributed environment, the computing power of the nodes and the bandwidth from the nodes to the parameter server differ, so the pace at which the nodes synchronize parameter data to the parameter server also differs.
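To make the data-parallel cycle concrete, the parameter-fusion step can be sketched as follows. This is only an illustrative NumPy sketch (the function name, learning rate and averaging-based fusion are assumptions, not details taken from the patent); a real system would exchange the gradients over the network.

```python
import numpy as np


def parameter_server_round(weights, worker_gradients, lr=0.01):
    """One data-parallel round: fuse the computing nodes' gradients and update the model.

    weights: current model parameters; worker_gradients: one gradient array per
    computing node, each computed on that node's shard of the training data.
    """
    fused = np.mean(worker_gradients, axis=0)  # parameter fusion on the parameter server
    return weights - lr * fused                # updated model data, redistributed to the nodes
```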
Traditional distributed neural network synchronization algorithms fall into three types: the synchronous gradient descent algorithm (SSGD), the asynchronous gradient descent algorithm (ASGD), and hybrid synchronous gradient descent. SSGD must wait for all computing nodes to send their gradient parameters to the parameter server before parameter fusion can be carried out, so the parameter server waits for slow nodes. ASGD does not wait for the parameter data of all computing nodes: as soon as the parameter server receives one node's parameters it can fuse them, but this reduces model accuracy. Hybrid synchronous gradient descent lets the parameter server run asynchronously until the difference in iteration rounds between any two computing nodes exceeds a threshold, which balances training efficiency against model accuracy reasonably well, but because a fixed threshold is used the distributed system still cannot reach its highest efficiency.
Therefore, traditional distributed neural network synchronization algorithms cannot properly balance model accuracy and training efficiency.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: to provide a distributed neural network hybrid synchronous training method based on self-organizing grouping of computing nodes, solving the problem that traditional distributed neural network synchronization algorithms cannot properly balance model accuracy and training efficiency.
The technical scheme adopted by the invention for solving the technical problems is as follows:
a distributed neural network hybrid synchronous training method based on self-organizing grouping of computing nodes comprises the following steps:
s1, predicting the synchronization efficiency of the calculation node by adopting Kalman filtering, and analyzing the fluctuation condition of the synchronization efficiency of the calculation node by adopting mean square error;
s2, carrying out self-organizing real-time grouping on the computing nodes based on the predicted synchronization efficiency of each computing node and the similarity of the fluctuation condition of the synchronization efficiency;
and S3, training models by adopting different synchronization strategies between groups and in groups according to the self-organizing real-time grouping result of the computing nodes.
As a further optimization, step S1 specifically includes:
S11, for any computing node i, the parameter server collects the time of each recent iteration to form a time window set T_i of fixed size; each element of T_i records the time taken, in one iteration, from the parameter server sending the latest model data to the node until it finishes receiving the node's latest gradient parameters;
S12, according to the efficiency of the node's last iteration and the state of the Kalman filter, performing an efficiency evaluation and prediction for the next iteration to obtain the node's predicted next synchronization efficiency T_i^predict;
S13, solving the mean square error of the elements in the time window set T_i to analyze the recent fluctuation of the node's synchronization efficiency, recorded as wave_i; these two variables are the output of computing-node evaluation and prediction.
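A minimal sketch of steps S11-S13 follows, assuming a one-dimensional constant-state Kalman model; the patent does not give the filter matrices, so the noise variances and the class and method names below are illustrative assumptions.

```python
from collections import deque
from statistics import mean


class NodeEfficiencyTracker:
    """Per-node time window, Kalman prediction and fluctuation analysis (steps S11-S13)."""

    def __init__(self, window_length=10, process_var=1e-3, measure_var=1e-2):
        self.times = deque(maxlen=window_length)  # time window set T_i
        # Scalar Kalman state: estimated synchronization time and its variance (assumed 1-D model).
        self.t_est = None
        self.p_est = 1.0
        self.q = process_var   # process noise variance (assumption)
        self.r = measure_var   # measurement noise variance (assumption)

    def record_iteration(self, elapsed):
        """S11: time from sending model data to receiving the node's latest gradients."""
        self.times.append(elapsed)

    def predict_next_efficiency(self):
        """S12: predict the next synchronization efficiency T_i^predict (call after record_iteration)."""
        z = self.times[-1]                 # latest measured iteration time
        if self.t_est is None:             # first measurement initializes the filter
            self.t_est = z
            return self.t_est
        p_pred = self.p_est + self.q       # predict step: identity transition (efficiency is inert)
        k = p_pred / (p_pred + self.r)     # Kalman gain
        self.t_est = self.t_est + k * (z - self.t_est)   # correct with the latest measurement
        self.p_est = (1.0 - k) * p_pred
        return self.t_est                  # T_i^predict

    def fluctuation(self):
        """S13: mean square error of the window elements, recorded as wave_i."""
        m = mean(self.times)
        return sum((t - m) ** 2 for t in self.times) / len(self.times)
```

After each iteration the parameter server would call record_iteration, then use predict_next_efficiency and fluctuation as the two outputs that drive the grouping in step S2.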
As a further optimization, step S2 specifically includes:
s21, forming a similar cluster by a plurality of similar computing nodes, and judging whether the node needs to be moved out of the similar cluster according to the node similarity condition when each iteration of any node starts;
s22, removing the nodes needing to be moved out of the similar cluster, and creating a new similar cluster for the nodes;
and S23, recombining and merging the similar clusters meeting the merging condition, and updating the internal information of the similar clusters.
As a further optimization, in step S21, the node similarity condition includes:
any two computing nodes x and y in the same similar cluster group_similar must satisfy
|T_x^predict - T_y^predict| ≤ Threshold_T and |wave_x - wave_y| ≤ Threshold_wave,
where Threshold_T is the threshold for similar node efficiency and Threshold_wave is the threshold for similar recent efficiency fluctuation; T_x^predict and T_y^predict are the predicted synchronization efficiencies of nodes x and y in the similar cluster, and wave_x and wave_y are their recent synchronization efficiency fluctuation values.
As a further optimization, in step S23, the merging condition includes:
any two similar clusters can be merged as long as any two nodes u and v inside the two similar clusters satisfy |T_u^predict - T_v^predict| ≤ Threshold_T and |wave_u - wave_v| ≤ Threshold_wave.
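A small sketch of these two checks, assuming each node is represented by a dict holding its predicted efficiency and fluctuation value; the key and function names are illustrative, and the merging condition is interpreted here as holding for all pairs of nodes drawn from the two clusters.

```python
def is_similar(node_x, node_y, threshold_t, threshold_wave):
    """Similarity condition: predicted efficiencies and recent fluctuations are both close."""
    return (abs(node_x["T_predict"] - node_y["T_predict"]) <= threshold_t
            and abs(node_x["wave"] - node_y["wave"]) <= threshold_wave)


def can_merge(cluster_a, cluster_b, threshold_t, threshold_wave):
    """Merging condition: every node pair drawn from the two similar clusters is similar."""
    return all(is_similar(u, v, threshold_t, threshold_wave)
               for u in cluster_a for v in cluster_b)
```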
As a further optimization, step S3 specifically includes:
S31, when any computing node finishes one round of iteration, computing the maximum iteration round epoch_max inside the similar cluster from wave, the average of the fluctuation values of all nodes in the similar cluster, and nodeNumber, the number of nodes in the similar cluster;
S32, judging whether the node will enter global waiting or cluster waiting; if so, proceeding to step S33, otherwise proceeding to step S34;
S33, if the node enters global waiting, it can continue to operate only after all nodes have reached the same iteration round; if the node enters cluster waiting, it can continue to operate only after all nodes in the same similar cluster have reached the same iteration round;
S34, if the node enters neither global waiting nor cluster waiting, it performs single-machine asynchronous optimization, enters the next round of iteration, and returns to step S31.
As a further optimization, in step S32, judging whether the node will enter global waiting or cluster waiting specifically includes:
(1) judging whether the difference in absolute iteration rounds between the node and any node in the whole cluster is greater than a preset threshold, and entering global waiting if it is;
(2) entering similar-cluster waiting if the difference between the node's relative iteration round and the minimum relative round in its similar cluster is greater than epoch_max;
the absolute iteration round is the total number of iterations the node has completed so far, and the relative iteration round is the node's iteration round within the similar cluster it belongs to.
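This decision can be sketched as follows; the dictionaries, field names and the helper name decide_wait are illustrative assumptions rather than the patent's data structures.

```python
def decide_wait(node, all_nodes, cluster_nodes, threshold_max, epoch_max):
    """Return 'global', 'cluster', or 'none' for a node that just finished an iteration.

    all_nodes: every node in the distributed cluster; cluster_nodes: nodes in the same
    similar cluster as `node` (both lists of dicts with 'epoch_abs' / 'epoch_rela').
    """
    # (1) Global waiting: absolute round difference to the slowest node exceeds the preset threshold.
    min_abs = min(n["epoch_abs"] for n in all_nodes)
    if node["epoch_abs"] - min_abs > threshold_max:
        return "global"
    # (2) Cluster waiting: relative round runs ahead of the slowest cluster member by more than epoch_max.
    min_rela = min(n["epoch_rela"] for n in cluster_nodes)
    if node["epoch_rela"] - min_rela > epoch_max:
        return "cluster"
    # Otherwise continue single-machine asynchronous optimization.
    return "none"
```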
The invention has the beneficial effects that:
the prediction of the node synchronization efficiency and the analysis of the fluctuation condition of the synchronization efficiency are used as the basis of node self-organizing grouping, so that the cluster grouping state of all the computing node sets forming the distributed neural network is dynamically adjusted, and different synchronization strategies are adopted between groups and in the groups to train models according to the real-time grouping result. Because the grouping is carried out by adopting the approximate condition considering the synchronous efficiency of the nodes, the efficiency of each node in the group is relatively close, and when the model training is carried out, the limit of the maximum allowable iteration round difference in the group is used as a node synchronous barrier in the cluster, so that the iterative waiting time can be reduced, and the training efficiency is improved; in addition, the accuracy of the trained model is ensured by using the limitation of the maximum iteration round difference between any nodes as a global synchronization barrier. Therefore, the scheme of the invention can well balance the model accuracy and the training efficiency.
Drawings
FIG. 1 is a diagram of a distributed neural network system architecture for data parallelization;
FIG. 2 is a schematic diagram of a cluster state after node self-organizing grouping according to the present invention;
FIG. 3 is a flowchart of a hybrid synchronous training method for a distributed neural network according to an embodiment of the present invention.
Detailed Description
The invention aims to provide a distributed neural network hybrid synchronous training method based on self-organizing grouping of computing nodes, solving the problem that traditional distributed neural network synchronization algorithms cannot properly balance model accuracy and training efficiency.
The method comprises three major parts: calculating node efficiency evaluation and prediction, node self-organizing grouping and mixed synchronous training.
Firstly, evaluating and predicting the efficiency of a computing node:
For each computing node i, the parameter server collects the time of each recent iteration to form a fixed-size time window set, denoted T_i. T_i is a fixed-size queue whose elements are denoted t_i^j, j = 1, ..., m, where m is the number of elements in the queue; t_i^j records the time taken, in one iteration, from the parameter server sending the latest model data to the node until it finishes receiving the node's latest gradient parameters. Thus t_i^j is the node's synchronization efficiency in one iteration, covering both the node's single-machine optimization efficiency and the efficiency of synchronizing data with the parameter server.
The method uses Kalman filtering to predict the node's efficiency at the next moment, obtaining the efficiency evaluation of the node for the next iteration, recorded as T_i^predict. At the same time, the mean square error of the elements in the time window set T_i is solved to analyze the node's recent efficiency fluctuation, recorded as wave_i. These two variables are the output of computing-node evaluation and prediction.
Secondly, node self-organizing grouping:
the main purpose of the ad hoc grouping of nodes is to group "similar" nodes into one group and "dissimilar" nodes into another group. Similar definition is that the node efficiency predicted in the next round is similar and the recent fluctuation is similar, and we mark the cluster formed by a plurality of "similar" nodes as a similar cluster groupsimilarA plurality of similar clusters form all the computing node sets S of the distributed neural network, as shown in fig. 2; for the same similar cluster groupsimilarEach compute node in (1) needs to satisfy a similar condition:
Figure BDA0002579102400000052
among them, ThresholdTThreshold, which represents a Threshold of similarity in node efficiencywaveRepresenting thresholds at which recent efficiency fluctuates similarly. The node self-organizing grouping algorithm is to dynamically adjust the set S when each iteration of the computing nodes is completed, so as to ensure all groupssimilarSimilar conditions are satisfied.
Specifically, the node ad-hoc grouping process is mainly divided into the following steps:
1) the algorithm is invoked whenever any node i enters a new iteration;
2) if node i can no longer satisfy the similarity condition in its original similar cluster, node i is moved out of that cluster and the information of the set S is updated;
3) all group_similar elements in the set S are traversed and regrouped.
In summary, self-organizing grouping splits and recombines the existing set structure whenever any node enters a new iteration; the process finally outputs a new set S of all computing nodes in which every cluster group_similar satisfies the similarity condition.
Thirdly, mixed synchronous training:
the hybrid synchronous training is to use synchronous algorithms with different strategies between groups and in the self-organized grouping, improve the accuracy of the model as much as possible, and improve the training efficiency of the model under the condition of ensuring the accuracy.
The invention designs two synchronous barriers to limit the communication pace of training so as to balance the model accuracy and the training efficiency.
Global synchronization barrier: to guarantee model accuracy, the maximum iteration delay between any two nodes must be bounded, that is, a maximum iteration difference Threshold_max must be enforced between any two nodes. For any similar cluster group_similar^i, define the maximum iteration difference allowed inside that cluster as Threshold_i; then for any two similar clusters group_similar^i and group_similar^j, the sum of their maximum allowed iteration differences must not exceed Threshold_max, namely:
Threshold_i + Threshold_j ≤ Threshold_max
Similar-cluster synchronization barrier: according to the fluctuation values of the nodes in each similar cluster, the maximum intra-cluster iteration difference Threshold_i is computed. The iteration difference allowed between nodes of a similar cluster, i.e. between its slowest and fastest node, must not exceed Threshold_i; once the iteration difference exceeds Threshold_i, all nodes in the cluster must wait until the slowest node reaches the same iteration round as the fastest node.
Example:
This embodiment describes a preferred implementation algorithm of the invention in detail.
The parameters of the algorithm include the time-statistics window length window_length, the node efficiency similarity threshold Threshold_T, the node fluctuation similarity threshold Threshold_wave, and the threshold Threshold_max on the difference in iteration rounds between any two nodes. Referring to FIG. 3, the distributed neural network hybrid synchronous training method in this embodiment includes the following steps:
step 1, calculating node efficiency evaluation and prediction:
the main purpose of the step is to carry out prediction and evaluation on the efficiency of the computing node, and the method is used as the basis for node self-organizing grouping, and the specific sub-process is as follows:
step 1.1, initialization:
efficiency assessment and prediction data structures are initialized in the parameter server. For any computing node i, the parameter server generates a queue with a length of window _ lengthiAnd the time storage unit is used for storing the time used by each iteration of the computing node for a period of time. Meanwhile, the parameter server generates a Kalman filtering object corresponding to the node, initializes a matrix related to the Kalman filtering object, and comprises a state transition matrix of Kalman filtering
Figure BDA0002579102400000065
System measurement matrix
Figure BDA0002579102400000066
Covariance matrix of process noise with system
Figure BDA0002579102400000067
Step 1.2, detecting the node fluctuation value:
When any computing node i completes one iteration, the parameter server computes the time t_i^j from the moment the parameter server sent the model data until the node finished transmitting its gradient parameter data. The parameter server puts this iteration time t_i^j into the node's queue and, if the queue size exceeds the window length window_length, dequeues the first element. It then computes the mathematical expectation of all elements in the queue and their mean square error, recorded for computing node i as wave_i. With t_i^j denoting the elements in the queue, fluctuation detection returns the following result:
wave_i = Σ_j ω_i^j (t_i^j - E(T_i))²
where ω_i^j is the weighting factor corresponding to element t_i^j and E(T_i) is the mathematical expectation of the elements t_i^j.
Step 1.3, Kalman filtering efficiency prediction:
The efficiency of any node should satisfy an inertia principle: over a short time it transitions smoothly, without abrupt changes, so Kalman filtering can be used to predict it. When an iteration round of any node i begins, the time of its previous iteration is taken as the last measurement and the prediction made for the previous iteration is taken as the last predicted value; when the iteration round finishes, the measured time of this round is used to correct the Kalman filter, which then predicts the time required for the next iteration, T_i^predict.
Step 1.4: pass T_i^predict and wave_i downwards as return values, and update the node grouping state.
Step 2, node self-organizing grouping step:
the main purpose of this step is to combine a plurality of homogenous nodes into a whole, use a kind of synchronization strategy in the whole, use another kind of synchronization strategy outside the whole, its concrete subprocess is as follows:
step 2.1, initialization:
when the system is initialized, dictionary groups (key _ id and value _ list) and map (key _ node and value _ group _ id) are created, keys of the groups represent ids of one similar cluster, and values represent all computing node information contained in the similar cluster. The node structure stores the absolute iteration round epoch of the nodeabsRelative iteration round epochrelaAnd the fluctuation value wave and the efficiency prediction value T, wherein the relative iteration turn refers to the iteration number of the node in the similar cluster where the node is located. map is the backward key of groups, and is used for quickly locating which node the node is located at. At the beginning of the first round, the parameter server treats each compute node as an independent similar cluster, and therefore, a groups dictionary containing the number elements of the compute node is created, and the value of each dictionary contains a compute node information node structure.
Step 2.2: splitting a similar cluster:
similar cluster splitting is attempted before any compute node starts a new iteration. The specific process is as follows:
1) updating node information, wherein the step mainly updates the information contained in the node structure and uses the information returned by the step of evaluating and predicting the efficiency of the computing node
Figure BDA00025791024000000711
waveiAnd updating the efficiency predicted value T and the fluctuation value wave of the node.
2) Judging whether the splitting is needed or not, and judging whether the splitting is needed or not, wherein the splitting is needed for the similar cluster group of the nodesimilarTraversing all node structures to find the wave minimum waveminWith maximum wavemaxMinimum value of efficiency TminAnd a maximum value TmaxJudging the following formula:
Figure BDA00025791024000000712
if any one of the two conditions cannot be met, the node cannot meet the similar condition and needs to be split. If not, go directly to step 2.3.
3) And (4) splitting the computing node, namely independently generating a similar cluster from the computing node, and updating the information of the nodes in groups and map and the id information of the corresponding node in the map.
Step 2.3, the computing nodes are recombined:
that is, existing similar clusters are merged, so that the number of similar clusters is reduced as much as possible.
For any two similar clusters, as long as any two node nodes in the two similar clusters are satisfieduAnd a nodevSatisfy | Tu-Tv|≤ThresholdTAnd | waveu-wavev|≤ThresholdwaveThen merging can be performed, and the specific merging algorithm is as follows:
1) and sorting all similar cluster data efficiency T minimum values from small to large by using a principle that the minimum value of the fluctuation value wave is a main key and the minimum value of the fluctuation value wave is an auxiliary key. The start pointer and compare pointer are moved to the first similar cluster.
2) Moving the compare pointer to compare the T pointed by the start pointerminAnd waveminAnd T pointed to by the compare pointermaxAnd wavemaxWhether or not the difference values of (a) are all less than or equal to the threshold value set therefor.
3) If any condition is not met, all clusters from the start pointer to the front of the comparison pointer are merged to form a new similar cluster. Moving the start pointer to the position of the comparison pointer, and jumping to 2) to run until all similar clusters are compared. When the similar clusters are combined, the relative iteration round of each node structure in the clusters needs to be updated, and the updating principle is that the minimum relative iteration round value is subtracted from the relative iteration round of the two similar clusters respectively.
Step 2.4: pass the dictionary structures groups and map downwards as return values.
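A sketch of the pointer-based merging of step 2.3 over per-cluster summaries (T_min, T_max, wave_min, wave_max); the relative-round update performed on merge is omitted, and the function and key names are illustrative.

```python
def merge_similar_clusters(clusters, threshold_t, threshold_wave):
    """Greedy merge of adjacent clusters after sorting by (T_min, wave_min).

    clusters: list of lists of node dicts with keys 'T' and 'wave'.
    Returns a new list of merged clusters.
    """
    def summary(cluster):
        ts = [n["T"] for n in cluster]
        ws = [n["wave"] for n in cluster]
        return min(ts), max(ts), min(ws), max(ws)

    # 1) Sort by minimum efficiency, then minimum fluctuation.
    ordered = sorted(clusters, key=lambda c: (summary(c)[0], summary(c)[2]))

    merged, start = [], 0
    for compare in range(1, len(ordered) + 1):
        if compare < len(ordered):
            t_min, _, w_min, _ = summary(ordered[start])
            _, t_max, _, w_max = summary(ordered[compare])
            # 2) Keep extending the run while the spread stays within both thresholds.
            if t_max - t_min <= threshold_t and w_max - w_min <= threshold_wave:
                continue
        # 3) Close the current run [start, compare) as one merged cluster.
        merged.append([node for c in ordered[start:compare] for node in c])
        start = compare
    return merged
```

With the three-node example given below (Threshold_T = Threshold_wave = 0.5), this procedure merges node1 and node2 and leaves node3 in its own cluster.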
For greater clarity, the node self-organizing grouping process is illustrated with the following example. Suppose a cluster of three nodes with Threshold_T = 0.5 and Threshold_wave = 0.5. Initially:
node1: {wave=1, T=2.2, epoch_abs=1, epoch_rela=1},
node2: {wave=3, T=4.1, epoch_abs=1, epoch_rela=1},
node3: {wave=4.6, T=8.3, epoch_abs=1, epoch_rela=1},
groups = {1: node1; 2: node2; 3: node3}.
After node2 completes one round of iteration, its data is updated to node2: {wave=1.1, T=2.4, epoch_abs=2, epoch_rela=2}; node1 and node2 are then merged, groups = {1: {node1, node2}; 3: node3}, and the relative iteration rounds of all nodes with id 1 in groups are updated, finally giving node1: {wave=1, T=2.2, epoch_abs=1, epoch_rela=1} and node2: {wave=1.1, T=2.4, epoch_abs=2, epoch_rela=1}.
Step 3, mixed synchronous training:
the main purpose of the step is to train the model according to different synchronous strategies by utilizing the self-organizing grouping result, and the specific sub-processes are as follows:
step 3.1, when one round of iteration of any computing node is finished, calculating the maximum iteration round inside the similar cluster according to the following formula:
Figure BDA0002579102400000091
wherein wave represents the average value of all node fluctuations in the similar cluster, and nodeNumber represents the number of nodes in the similar cluster.
Step 3.2, judge, according to the cluster state, whether the next step must enter a waiting process, specifically as follows:
1) Judge whether the absolute iteration round difference between this node and any node in the whole cluster is greater than the threshold Threshold_max; if so, the node enters global waiting.
2) Judge, according to the relative iteration rounds of this node and of all nodes in the same similar cluster in groups, whether waiting is needed: if the difference between this node's relative iteration round and the minimum relative round in the similar cluster is greater than epoch_max, the node enters similar-cluster waiting.
3) Otherwise, the distributed training does not enter a waiting process.
Step 3.3: the waiting process is as follows:
1) If no waiting process was entered in step 3.2, the distributed neural network continues to the next iteration in the manner of step 3.1.
2) If global waiting was entered in step 3.2, the node can enter the next round of iteration only after all computing nodes have iterated to the same absolute round as this node; when global waiting exits, the relative iteration rounds of all nodes are reset to 0.
3) If similar-cluster waiting was entered in step 3.2, the node must wait until all nodes in its similar cluster have reached the same relative round as this node before performing the next iteration; when similar-cluster waiting exits, the relative iteration rounds of all nodes in the similar cluster are updated.
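A sketch of the waiting and bookkeeping in steps 3.2 and 3.3, reusing the decide_wait sketch given earlier; the polling wait_until helper stands in for a real synchronization barrier, and the re-normalization of relative rounds on cluster-wait exit is an assumption, since the exact update rule is not spelled out.

```python
import time


def wait_until(condition, poll=0.05):
    """Block until the condition becomes true (polling stand-in for a real barrier)."""
    while not condition():
        time.sleep(poll)


def after_iteration(node, all_nodes, cluster_nodes, threshold_max, epoch_max):
    """Advance the node's rounds, then apply global or similar-cluster waiting."""
    node["epoch_abs"] += 1
    node["epoch_rela"] += 1

    state = decide_wait(node, all_nodes, cluster_nodes, threshold_max, epoch_max)
    if state == "global":
        # Wait until every node reaches this absolute round, then reset all relative rounds to 0.
        wait_until(lambda: all(n["epoch_abs"] >= node["epoch_abs"] for n in all_nodes))
        for n in all_nodes:
            n["epoch_rela"] = 0
    elif state == "cluster":
        # Wait until every node in the similar cluster reaches this relative round,
        # then update the relative rounds (subtracting the minimum is an assumed rule).
        wait_until(lambda: all(n["epoch_rela"] >= node["epoch_rela"] for n in cluster_nodes))
        base = min(n["epoch_rela"] for n in cluster_nodes)
        for n in cluster_nodes:
            n["epoch_rela"] -= base
    # state == "none": continue single-machine asynchronous optimization immediately.
```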

Claims (5)

1. A distributed neural network hybrid synchronous training method based on self-organizing grouping of computing nodes is characterized in that,
the method comprises the following steps:
s1, predicting the synchronization efficiency of the calculation node by adopting Kalman filtering, and analyzing the fluctuation condition of the synchronization efficiency of the calculation node by adopting mean square error;
s2, carrying out self-organizing real-time grouping on the computing nodes based on the predicted synchronization efficiency of each computing node and the similarity degree of the fluctuation condition of the synchronization efficiency;
s3, according to the self-organizing real-time grouping result of the computing nodes, different synchronization strategies are adopted between groups and in the groups to train the models;
step S3 specifically includes:
s31, when one round of iteration of any computing node is finished, computing the maximum iteration round epoch in the similar clustermax
wherein epoch_max is computed from wave, the average of the fluctuation values of all nodes in the similar cluster, and nodeNumber, the number of nodes in the similar cluster;
s32, judging whether the node can enter global waiting or cluster waiting, if so, entering a step S33, otherwise, entering a step S34;
s33, if the node enters global waiting, the node can continue to operate after all nodes have the same iteration turns; if the node enters cluster waiting, the node can continue to operate after all nodes in the same similar cluster have the same iteration round;
s34, if the node does not enter global waiting or cluster waiting, the node carries out single-machine asynchronous optimization, enters the next iteration and returns to the step S31;
in step S32, the determining whether the node will enter global waiting or cluster waiting specifically includes:
(1) judging whether the absolute iteration round difference between the node and any node in the cluster is greater than a preset threshold, and entering global waiting if it is;
(2) entering similar-cluster waiting if the difference between the node's relative iteration round and the minimum relative round in the similar cluster is larger than epoch_max;
the absolute iteration round is the total number of iterations the node has completed so far, and the relative iteration round is the node's iteration round within the similar cluster it belongs to.
2. The distributed neural network hybrid synchronous training method based on the self-organizing grouping of the computing nodes as claimed in claim 1, wherein step S1 specifically includes:
S11, for any computing node i, the parameter server collects the time of each recent iteration to form a time window set T_i of fixed size; each element in T_i records the time taken by the parameter server, in one iteration, from sending the latest model data to the node to receiving the node's latest gradient parameters;
S12, according to the efficiency of the node's last iteration and the state of the Kalman filter, performing an efficiency evaluation and prediction for the next iteration to obtain the node's predicted next synchronization efficiency T_i^predict;
S13, solving the mean square error of the elements in the time window set T_i to analyze the recent fluctuation of the node's synchronization efficiency, recorded as wave_i; these two variables are used as the output of computing-node evaluation and prediction.
3. The distributed neural network hybrid synchronous training method based on the self-organizing grouping of the compute nodes as claimed in claim 1, wherein the step S2 specifically includes:
s21, forming a similar cluster by a plurality of similar computing nodes, and judging whether the node needs to be moved out of the similar cluster according to the node similarity condition when each iteration of any node starts;
s22, removing the nodes needing to be moved out of the similar cluster, and creating a new similar cluster for the nodes;
and S23, recombining and merging the similar clusters meeting the merging condition, and updating the internal information of the similar clusters.
4. The distributed neural network hybrid synchronous training method based on self-organizing grouping of computing nodes as claimed in claim 3, wherein in step S21, the node similarity condition includes:
any two computing nodes x and y in the same similar cluster group_similar must satisfy
|T_x^predict - T_y^predict| ≤ Threshold_T and |wave_x - wave_y| ≤ Threshold_wave,
where Threshold_T is the threshold for similar node efficiency and Threshold_wave is the threshold for similar recent efficiency fluctuation; T_x^predict and T_y^predict are the predicted synchronization efficiencies of nodes x and y in the similar cluster, and wave_x and wave_y are their recent synchronization efficiency fluctuation values.
5. The distributed neural network hybrid synchronous training method based on self-organizing grouping of computing nodes as claimed in claim 3, wherein in step S23, the merging condition comprises:
any two similar clusters can be merged as long as any two nodes u and v inside the two similar clusters satisfy |T_u^predict - T_v^predict| ≤ Threshold_T and |wave_u - wave_v| ≤ Threshold_wave.
CN202010662415.0A 2020-07-10 2020-07-10 Distributed neural network hybrid synchronous training method based on self-organizing grouping of computing nodes Active CN111813858B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010662415.0A CN111813858B (en) 2020-07-10 2020-07-10 Distributed neural network hybrid synchronous training method based on self-organizing grouping of computing nodes

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010662415.0A CN111813858B (en) 2020-07-10 2020-07-10 Distributed neural network hybrid synchronous training method based on self-organizing grouping of computing nodes

Publications (2)

Publication Number Publication Date
CN111813858A (en) 2020-10-23
CN111813858B (en) 2022-06-24

Family

ID=72841719

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010662415.0A Active CN111813858B (en) 2020-07-10 2020-07-10 Distributed neural network hybrid synchronous training method based on self-organizing grouping of computing nodes

Country Status (1)

Country Link
CN (1) CN111813858B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112633480B (en) * 2020-12-31 2024-01-23 中山大学 Calculation optimization method and system of semi-asynchronous parallel neural network
CN113035349B (en) * 2021-03-25 2024-01-05 浙江大学 Neural network dynamic fusion method for multi-center screening of genetic metabolic diseases
CN115865607A (en) * 2023-03-01 2023-03-28 山东海量信息技术研究院 Distributed training computing node management method and related device
CN117155928B (en) * 2023-10-31 2024-02-09 浪潮电子信息产业股份有限公司 Communication task processing method, system, equipment, cluster and readable storage medium


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6892194B2 (en) * 2001-06-05 2005-05-10 Basf Corporation System and method for organizing color values using an artificial intelligence based cluster model
US20190273510A1 (en) * 2018-03-01 2019-09-05 Crowdstrike, Inc. Classification of source data by neural network processing

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103914735A (en) * 2014-04-17 2014-07-09 北京泰乐德信息技术有限公司 Failure recognition method and system based on neural network self-learning
CN106570563A (en) * 2015-10-13 2017-04-19 中国石油天然气股份有限公司 Deformation prediction method and apparatus based on Kalman filtering and BP neural network
CN110135575A (en) * 2017-12-29 2019-08-16 英特尔公司 Communication optimization for distributed machines study
CN108366386A (en) * 2018-05-11 2018-08-03 东南大学 A method of using neural fusion wireless network fault detect

Non-Patent Citations (11)

* Cited by examiner, † Cited by third party
Title
Convergence Study in Extended Kalman Filter-Based Training of Recurrent Neural Networks; Xiaoyu Wang et al.; Neural Networks; 2011-04-10; Vol. 22, No. 4; pp. 588-600 *
The ART of adaptive pattern recognition by a self-organizing neural network; G.A. Carpenter et al.; Computer; 1988-03-31; Vol. 21, No. 3; pp. 77-88 *
Unsupervised Learning with Self-Organizing Spiking Neural Networks; Hananel Hazan et al.; Neural Networks; 2018-10-15; pp. 1-6 *
Polynomial function model compensation control for uncertain chaotic systems; 曾喆昭 et al.; Acta Physica Sinica; 2013-07-04; Vol. 62, No. 15; pp. 74-81 *
Research and implementation of trajectory prediction algorithms in distributed environments; 谢渊; China Master's Theses Full-text Database (Information Science and Technology); 2020-07-15; I140-87 *
A Kalman-filter-based neural network learning algorithm and its application; 田晓宇 et al.; Computer & Digital Engineering; 2005-12-31; No. 2; pp. 40-42, 104 *
A neural network training algorithm based on cubature Kalman filtering; 胡振涛 et al.; Control and Decision; 2015-10-19; Vol. 31, No. 2; pp. 355-360 *
Neural network training based on the extended Kalman particle filter algorithm; 王法胜 et al.; Computer Engineering & Science; 2010-05-15; Vol. 32, No. 5; pp. 48-50 *
Distributed intrusion detection implemented with self-organizing feature map neural network technology; 杨森 et al.; Computer Applications; 2003-08-28; pp. 54-57 *
Research on the principles and applications of self-organizing feature map neural networks; 李春华 et al.; Journal of Beijing Normal University (Natural Science); 2006-10-30; Vol. 42, No. 5; pp. 543-547 *
A dynamic load balancing algorithm for big data distributed storage; 张栗粽 et al.; Computer Science; 2017-05-15; Vol. 44; pp. 178-183 *

Also Published As

Publication number Publication date
CN111813858A (en) 2020-10-23

Similar Documents

Publication Publication Date Title
CN111813858B (en) Distributed neural network hybrid synchronous training method based on self-organizing grouping of computing nodes
CN110223517B (en) Short-term traffic flow prediction method based on space-time correlation
CN110610242B (en) Method and device for setting weights of participants in federal learning
CN108335487B (en) Road traffic state prediction system based on traffic state time sequence
CN110995487B (en) Multi-service quality prediction method and device, computer equipment and readable storage medium
CN106933649B (en) Virtual machine load prediction method and system based on moving average and neural network
CN105471631B (en) Network flow prediction method based on traffic trends
CN113852432B (en) Spectrum Prediction Sensing Method Based on RCS-GRU Model
CN108650065B (en) Window-based streaming data missing processing method
CN113778691B (en) Task migration decision method, device and system
CN109067583A (en) A kind of resource prediction method and system based on edge calculations
CN114585006B (en) Edge computing task unloading and resource allocation method based on deep learning
CN113760511B (en) Vehicle edge calculation task unloading method based on depth certainty strategy
CN104539601A (en) Reliability analysis method and system for dynamic network attack process
CN113887748B (en) Online federal learning task allocation method and device, and federal learning method and system
Zhao et al. Adaptive swarm intelligent offloading based on digital twin-assisted prediction in VEC
CN102130955B (en) System and method for generating alternative service set of composite service based on collaborative filtering
CN113382066A (en) Vehicle user selection method and system based on federal edge platform
Yang et al. Energy scheduling for DoS attack over multi-hop networks: Deep reinforcement learning approach
CN115691140B (en) Analysis and prediction method for space-time distribution of automobile charging demand
Cui et al. The learning stimulated sensing-transmission coordination via age of updates in distributed UAV swarm
CN115865914A (en) Task unloading method based on federal deep reinforcement learning in vehicle edge calculation
CN114693141A (en) Transformer substation inspection method based on end edge cooperation
CN108599834B (en) Method and system for analyzing utilization rate of satellite communication network link
CN116957166B (en) Tunnel traffic condition prediction method and system based on Hongmon system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant