CN111813858B - Distributed neural network hybrid synchronous training method based on self-organizing grouping of computing nodes - Google Patents

Distributed neural network hybrid synchronous training method based on self-organizing grouping of computing nodes

Info

Publication number
CN111813858B
CN111813858B (application number CN202010662415.0A)
Authority
CN
China
Prior art keywords: node, similar, iteration, cluster, efficiency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010662415.0A
Other languages
Chinese (zh)
Other versions
CN111813858A (en)
Inventor
陈爱国
郑旭
罗光春
田玲
谢渊
邹冰洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202010662415.0A
Publication of CN111813858A
Application granted
Publication of CN111813858B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F16/275 Synchronous replication
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention relates to distributed neural network technology and discloses a distributed neural network hybrid synchronous training method based on self-organizing grouping of computing nodes, solving the problem that traditional distributed neural network synchronization algorithms cannot properly balance model accuracy against training efficiency. The method comprises the following steps: S1, predicting the synchronization efficiency of each computing node with Kalman filtering, and analyzing the fluctuation of each node's synchronization efficiency with the mean square error; S2, grouping the computing nodes in a self-organizing, real-time manner based on each node's predicted synchronization efficiency and the similarity of its efficiency fluctuation; and S3, training the model with different synchronization strategies between groups and within groups according to the real-time self-organizing grouping result.

Description

Distributed neural network hybrid synchronous training method based on self-organizing grouping of computing nodes
Technical Field
The invention relates to a distributed neural network technology, in particular to a distributed neural network hybrid synchronous training method based on self-organizing grouping of computing nodes.
Background
When a single-machine system uses very large-scale data to train a very large-scale neural network, training efficiency becomes low, and in severe cases the training process fails outright. Distributed neural networks effectively solve the problems of low training efficiency and training failure that very large-scale data causes on a single machine.
Distributed neural network training is divided into data parallelism and model parallelism, depending on whether the training data or the model data are partitioned; data parallelism is a key technique for improving the efficiency of large-scale data training.
In a data-parallel distributed neural network, the training data are first split across several computing nodes, and each computing node performs single-machine optimization on its share of the data. After each round of single-machine optimization, every computing node sends its gradient parameters to the parameter server for parameter fusion, the model data are updated, and the model data are then redistributed to the computing nodes for the next round of iteration; the system architecture is shown in FIG. 1. In a distributed environment, the computing power of the nodes and the bandwidth from the nodes to the parameter server differ, so the pace at which the nodes synchronize parameter data to the parameter server also differs.
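To make the data-parallel cycle concrete, the parameter-fusion step can be sketched as follows. This is only an illustrative NumPy sketch (the function name, learning rate and averaging-based fusion are assumptions, not details taken from the patent); a real system would exchange the gradients over the network.

```python
import numpy as np


def parameter_server_round(weights, worker_gradients, lr=0.01):
    """One data-parallel round: fuse the computing nodes' gradients and update the model.

    weights: current model parameters; worker_gradients: one gradient array per
    computing node, each computed on that node's shard of the training data.
    """
    fused = np.mean(worker_gradients, axis=0)  # parameter fusion on the parameter server
    return weights - lr * fused                # updated model data, redistributed to the nodes
```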
Traditional distributed neural network synchronization algorithms fall into three types: the synchronous gradient descent algorithm (SSGD), the asynchronous gradient descent algorithm (ASGD), and hybrid synchronous gradient descent. SSGD must wait for all computing nodes to send their gradient parameters to the parameter server before parameter fusion can be carried out, so the parameter server waits for slow nodes. ASGD does not wait for the parameter data of all computing nodes: as soon as the parameter server receives one node's parameters it can fuse them, but this reduces model accuracy. Hybrid synchronous gradient descent lets the parameter server run asynchronously until the difference in iteration rounds between any two computing nodes exceeds a threshold, which balances training efficiency against model accuracy reasonably well, but because a fixed threshold is used the distributed system still cannot reach its highest efficiency.
Therefore, traditional distributed neural network synchronization algorithms cannot properly balance model accuracy and training efficiency.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: to provide a distributed neural network hybrid synchronous training method based on self-organizing grouping of computing nodes, solving the problem that traditional distributed neural network synchronization algorithms cannot properly balance model accuracy and training efficiency.
The technical scheme adopted by the invention for solving the technical problems is as follows:
a distributed neural network hybrid synchronous training method based on self-organizing grouping of computing nodes comprises the following steps:
s1, predicting the synchronization efficiency of the calculation node by adopting Kalman filtering, and analyzing the fluctuation condition of the synchronization efficiency of the calculation node by adopting mean square error;
s2, carrying out self-organizing real-time grouping on the computing nodes based on the predicted synchronization efficiency of each computing node and the similarity of the fluctuation condition of the synchronization efficiency;
and S3, training models by adopting different synchronization strategies between groups and in groups according to the self-organizing real-time grouping result of the computing nodes.
As a further optimization, step S1 specifically includes:
S11, for any computing node i, the parameter server collects the time of each recent iteration to form a time window set T_i of fixed size; each element of T_i records the time taken, in one iteration, from the parameter server sending the latest model data to the node until it finishes receiving the node's latest gradient parameters;
S12, according to the efficiency of the node's last iteration and the state of the Kalman filter, performing an efficiency evaluation and prediction for the next iteration to obtain the node's predicted next synchronization efficiency T_i^predict;
S13, solving the mean square error of the elements in the time window set T_i to analyze the recent fluctuation of the node's synchronization efficiency, recorded as wave_i; these two variables are the output of computing-node evaluation and prediction.
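A minimal sketch of steps S11-S13 follows, assuming a one-dimensional constant-state Kalman model; the patent does not give the filter matrices, so the noise variances and the class and method names below are illustrative assumptions.

```python
from collections import deque
from statistics import mean


class NodeEfficiencyTracker:
    """Per-node time window, Kalman prediction and fluctuation analysis (steps S11-S13)."""

    def __init__(self, window_length=10, process_var=1e-3, measure_var=1e-2):
        self.times = deque(maxlen=window_length)  # time window set T_i
        # Scalar Kalman state: estimated synchronization time and its variance (assumed 1-D model).
        self.t_est = None
        self.p_est = 1.0
        self.q = process_var   # process noise variance (assumption)
        self.r = measure_var   # measurement noise variance (assumption)

    def record_iteration(self, elapsed):
        """S11: time from sending model data to receiving the node's latest gradients."""
        self.times.append(elapsed)

    def predict_next_efficiency(self):
        """S12: predict the next synchronization efficiency T_i^predict (call after record_iteration)."""
        z = self.times[-1]                 # latest measured iteration time
        if self.t_est is None:             # first measurement initializes the filter
            self.t_est = z
            return self.t_est
        p_pred = self.p_est + self.q       # predict step: identity transition (efficiency is inert)
        k = p_pred / (p_pred + self.r)     # Kalman gain
        self.t_est = self.t_est + k * (z - self.t_est)   # correct with the latest measurement
        self.p_est = (1.0 - k) * p_pred
        return self.t_est                  # T_i^predict

    def fluctuation(self):
        """S13: mean square error of the window elements, recorded as wave_i."""
        m = mean(self.times)
        return sum((t - m) ** 2 for t in self.times) / len(self.times)
```

After each iteration the parameter server would call record_iteration, then use predict_next_efficiency and fluctuation as the two outputs that drive the grouping in step S2.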
As a further optimization, step S2 specifically includes:
s21, forming a similar cluster by a plurality of similar computing nodes, and judging whether the node needs to be moved out of the similar cluster according to the node similarity condition when each iteration of any node starts;
s22, removing the nodes needing to be moved out of the similar cluster, and creating a new similar cluster for the nodes;
and S23, recombining and merging the similar clusters meeting the merging condition, and updating the internal information of the similar clusters.
As a further optimization, in step S21, the node similarity condition includes:
any two computing nodes x and y in the same similar cluster group_similar must satisfy
|T_x^predict - T_y^predict| ≤ Threshold_T and |wave_x - wave_y| ≤ Threshold_wave,
where Threshold_T is the threshold for similar node efficiency and Threshold_wave is the threshold for similar recent efficiency fluctuation; T_x^predict and T_y^predict are the predicted synchronization efficiencies of nodes x and y in the similar cluster, and wave_x and wave_y are their recent synchronization efficiency fluctuation values.
As a further optimization, in step S23, the merging condition includes:
any two similar clusters can be merged as long as any two nodes u and v inside the two similar clusters satisfy |T_u^predict - T_v^predict| ≤ Threshold_T and |wave_u - wave_v| ≤ Threshold_wave.
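A small sketch of these two checks, assuming each node is represented by a dict holding its predicted efficiency and fluctuation value; the key and function names are illustrative, and the merging condition is interpreted here as holding for all pairs of nodes drawn from the two clusters.

```python
def is_similar(node_x, node_y, threshold_t, threshold_wave):
    """Similarity condition: predicted efficiencies and recent fluctuations are both close."""
    return (abs(node_x["T_predict"] - node_y["T_predict"]) <= threshold_t
            and abs(node_x["wave"] - node_y["wave"]) <= threshold_wave)


def can_merge(cluster_a, cluster_b, threshold_t, threshold_wave):
    """Merging condition: every node pair drawn from the two similar clusters is similar."""
    return all(is_similar(u, v, threshold_t, threshold_wave)
               for u in cluster_a for v in cluster_b)
```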
As a further optimization, step S3 specifically includes:
S31, when any computing node finishes one round of iteration, computing the maximum iteration round epoch_max inside the similar cluster from wave, the average of the fluctuation values of all nodes in the similar cluster, and nodeNumber, the number of nodes in the similar cluster;
S32, judging whether the node will enter global waiting or cluster waiting; if so, proceeding to step S33, otherwise proceeding to step S34;
S33, if the node enters global waiting, it can continue to operate only after all nodes have reached the same iteration round; if the node enters cluster waiting, it can continue to operate only after all nodes in the same similar cluster have reached the same iteration round;
S34, if the node enters neither global waiting nor cluster waiting, it performs single-machine asynchronous optimization, enters the next round of iteration, and returns to step S31.
As a further optimization, in step S32, judging whether the node will enter global waiting or cluster waiting specifically includes:
(1) judging whether the difference in absolute iteration rounds between the node and any node in the whole cluster is greater than a preset threshold, and entering global waiting if it is;
(2) entering similar-cluster waiting if the difference between the node's relative iteration round and the minimum relative round in its similar cluster is greater than epoch_max;
the absolute iteration round is the total number of iterations the node has completed so far, and the relative iteration round is the node's iteration round within the similar cluster it belongs to.
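This decision can be sketched as follows; the dictionaries, field names and the helper name decide_wait are illustrative assumptions rather than the patent's data structures.

```python
def decide_wait(node, all_nodes, cluster_nodes, threshold_max, epoch_max):
    """Return 'global', 'cluster', or 'none' for a node that just finished an iteration.

    all_nodes: every node in the distributed cluster; cluster_nodes: nodes in the same
    similar cluster as `node` (both lists of dicts with 'epoch_abs' / 'epoch_rela').
    """
    # (1) Global waiting: absolute round difference to the slowest node exceeds the preset threshold.
    min_abs = min(n["epoch_abs"] for n in all_nodes)
    if node["epoch_abs"] - min_abs > threshold_max:
        return "global"
    # (2) Cluster waiting: relative round runs ahead of the slowest cluster member by more than epoch_max.
    min_rela = min(n["epoch_rela"] for n in cluster_nodes)
    if node["epoch_rela"] - min_rela > epoch_max:
        return "cluster"
    # Otherwise continue single-machine asynchronous optimization.
    return "none"
```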
The invention has the beneficial effects that:
the prediction of the node synchronization efficiency and the analysis of the fluctuation condition of the synchronization efficiency are used as the basis of node self-organizing grouping, so that the cluster grouping state of all the computing node sets forming the distributed neural network is dynamically adjusted, and different synchronization strategies are adopted between groups and in the groups to train models according to the real-time grouping result. Because the grouping is carried out by adopting the approximate condition considering the synchronous efficiency of the nodes, the efficiency of each node in the group is relatively close, and when the model training is carried out, the limit of the maximum allowable iteration round difference in the group is used as a node synchronous barrier in the cluster, so that the iterative waiting time can be reduced, and the training efficiency is improved; in addition, the accuracy of the trained model is ensured by using the limitation of the maximum iteration round difference between any nodes as a global synchronization barrier. Therefore, the scheme of the invention can well balance the model accuracy and the training efficiency.
Drawings
FIG. 1 is a diagram of a distributed neural network system architecture for data parallelization;
FIG. 2 is a schematic diagram of a cluster state after node self-organizing grouping according to the present invention;
FIG. 3 is a flowchart of a hybrid synchronous training method for a distributed neural network according to an embodiment of the present invention.
Detailed Description
The invention aims to provide a distributed neural network hybrid synchronous training method based on self-organizing grouping of computing nodes, solving the problem that traditional distributed neural network synchronization algorithms cannot properly balance model accuracy and training efficiency.
The method comprises three major parts: calculating node efficiency evaluation and prediction, node self-organizing grouping and mixed synchronous training.
Firstly, evaluating and predicting the efficiency of a computing node:
For each computing node i, the parameter server collects the time of each recent iteration to form a fixed-size time window set, denoted T_i. T_i is a fixed-size queue whose elements are denoted t_i^j, j = 1, ..., m, where m is the number of elements in the queue; t_i^j records the time taken, in one iteration, from the parameter server sending the latest model data to the node until it finishes receiving the node's latest gradient parameters. Thus t_i^j is the node's synchronization efficiency in one iteration, covering both the node's single-machine optimization efficiency and the efficiency of synchronizing data with the parameter server.
The method uses Kalman filtering to predict the node's efficiency at the next moment, obtaining the efficiency evaluation of the node for the next iteration, recorded as T_i^predict. At the same time, the mean square error of the elements in the time window set T_i is solved to analyze the node's recent efficiency fluctuation, recorded as wave_i. These two variables are the output of computing-node evaluation and prediction.
Secondly, node self-organizing grouping:
the main purpose of the ad hoc grouping of nodes is to group "similar" nodes into one group and "dissimilar" nodes into another group. Similar definition is that the node efficiency predicted in the next round is similar and the recent fluctuation is similar, and we mark the cluster formed by a plurality of "similar" nodes as a similar cluster groupsimilarA plurality of similar clusters form all the computing node sets S of the distributed neural network, as shown in fig. 2; for the same similar cluster groupsimilarEach compute node in (1) needs to satisfy a similar condition:
Figure BDA0002579102400000052
among them, ThresholdTThreshold, which represents a Threshold of similarity in node efficiencywaveRepresenting thresholds at which recent efficiency fluctuates similarly. The node self-organizing grouping algorithm is to dynamically adjust the set S when each iteration of the computing nodes is completed, so as to ensure all groupssimilarSimilar conditions are satisfied.
Specifically, the node ad-hoc grouping process is mainly divided into the following steps:
1) the algorithm is invoked whenever any node i enters a new iteration;
2) if node i can no longer satisfy the similarity condition in its original similar cluster, node i is moved out of that cluster and the information of the set S is updated;
3) all group_similar elements in the set S are traversed and regrouped.
In summary, self-organizing grouping splits and recombines the existing set structure whenever any node enters a new iteration; the process finally outputs a new set S of all computing nodes in which every cluster group_similar satisfies the similarity condition.
Thirdly, mixed synchronous training:
the hybrid synchronous training is to use synchronous algorithms with different strategies between groups and in the self-organized grouping, improve the accuracy of the model as much as possible, and improve the training efficiency of the model under the condition of ensuring the accuracy.
The invention designs two synchronous barriers to limit the communication pace of training so as to balance the model accuracy and the training efficiency.
Global synchronization barrier: to guarantee model accuracy, the maximum iteration delay between any two nodes must be bounded, that is, a maximum iteration difference Threshold_max must be enforced between any two nodes. For any similar cluster group_similar^i, define the maximum iteration difference allowed inside that cluster as Threshold_i; then for any two similar clusters group_similar^i and group_similar^j, the sum of their maximum allowed iteration differences must not exceed Threshold_max, namely:
Threshold_i + Threshold_j ≤ Threshold_max
Similar-cluster synchronization barrier: according to the fluctuation values of the nodes in each similar cluster, the maximum intra-cluster iteration difference Threshold_i is computed. The iteration difference allowed between nodes of a similar cluster, i.e. between its slowest and fastest node, must not exceed Threshold_i; once the iteration difference exceeds Threshold_i, all nodes in the cluster must wait until the slowest node reaches the same iteration round as the fastest node.
Example:
This embodiment describes a preferred implementation algorithm of the invention in detail.
The parameters of the algorithm include the time-statistics window length window_length, the node efficiency similarity threshold Threshold_T, the node fluctuation similarity threshold Threshold_wave, and the threshold Threshold_max on the difference in iteration rounds between any two nodes. Referring to FIG. 3, the distributed neural network hybrid synchronous training method in this embodiment includes the following steps:
step 1, calculating node efficiency evaluation and prediction:
the main purpose of the step is to carry out prediction and evaluation on the efficiency of the computing node, and the method is used as the basis for node self-organizing grouping, and the specific sub-process is as follows:
step 1.1, initialization:
efficiency assessment and prediction data structures are initialized in the parameter server. For any computing node i, the parameter server generates a queue with a length of window _ lengthiAnd the time storage unit is used for storing the time used by each iteration of the computing node for a period of time. Meanwhile, the parameter server generates a Kalman filtering object corresponding to the node, initializes a matrix related to the Kalman filtering object, and comprises a state transition matrix of Kalman filtering
Figure BDA0002579102400000065
System measurement matrix
Figure BDA0002579102400000066
Covariance matrix of process noise with system
Figure BDA0002579102400000067
Step 1.2, detecting the node fluctuation value:
When any computing node i completes one iteration, the parameter server computes the time t_i^j from the moment the parameter server sent the model data until the node finished transmitting its gradient parameter data. The parameter server puts this iteration time t_i^j into the node's queue and, if the queue size exceeds the window length window_length, dequeues the first element. It then computes the mathematical expectation of all elements in the queue and their mean square error, recorded for computing node i as wave_i. With t_i^j denoting the elements in the queue, fluctuation detection returns the following result:
wave_i = Σ_j ω_i^j (t_i^j - E(T_i))²
where ω_i^j is the weighting factor corresponding to element t_i^j and E(T_i) is the mathematical expectation of the elements t_i^j.
Step 1.3, Kalman filtering efficiency prediction:
The efficiency of any node should satisfy an inertia principle: over a short time it transitions smoothly, without abrupt changes, so Kalman filtering can be used to predict it. When an iteration round of any node i begins, the time of its previous iteration is taken as the last measurement and the prediction made for the previous iteration is taken as the last predicted value; when the iteration round finishes, the measured time of this round is used to correct the Kalman filter, which then predicts the time required for the next iteration, T_i^predict.
Step 1.4: pass T_i^predict and wave_i downwards as return values, and update the node grouping state.
Step 2, node self-organizing grouping step:
the main purpose of this step is to combine a plurality of homogenous nodes into a whole, use a kind of synchronization strategy in the whole, use another kind of synchronization strategy outside the whole, its concrete subprocess is as follows:
step 2.1, initialization:
when the system is initialized, dictionary groups (key _ id and value _ list) and map (key _ node and value _ group _ id) are created, keys of the groups represent ids of one similar cluster, and values represent all computing node information contained in the similar cluster. The node structure stores the absolute iteration round epoch of the nodeabsRelative iteration round epochrelaAnd the fluctuation value wave and the efficiency prediction value T, wherein the relative iteration turn refers to the iteration number of the node in the similar cluster where the node is located. map is the backward key of groups, and is used for quickly locating which node the node is located at. At the beginning of the first round, the parameter server treats each compute node as an independent similar cluster, and therefore, a groups dictionary containing the number elements of the compute node is created, and the value of each dictionary contains a compute node information node structure.
Step 2.2: splitting a similar cluster:
similar cluster splitting is attempted before any compute node starts a new iteration. The specific process is as follows:
1) updating node information, wherein the step mainly updates the information contained in the node structure and uses the information returned by the step of evaluating and predicting the efficiency of the computing node
Figure BDA00025791024000000711
waveiAnd updating the efficiency predicted value T and the fluctuation value wave of the node.
2) Judging whether the splitting is needed or not, and judging whether the splitting is needed or not, wherein the splitting is needed for the similar cluster group of the nodesimilarTraversing all node structures to find the wave minimum waveminWith maximum wavemaxMinimum value of efficiency TminAnd a maximum value TmaxJudging the following formula:
Figure BDA00025791024000000712
if any one of the two conditions cannot be met, the node cannot meet the similar condition and needs to be split. If not, go directly to step 2.3.
3) And (4) splitting the computing node, namely independently generating a similar cluster from the computing node, and updating the information of the nodes in groups and map and the id information of the corresponding node in the map.
Step 2.3, the computing nodes are recombined:
that is, existing similar clusters are merged, so that the number of similar clusters is reduced as much as possible.
For any two similar clusters, as long as any two node nodes in the two similar clusters are satisfieduAnd a nodevSatisfy | Tu-Tv|≤ThresholdTAnd | waveu-wavev|≤ThresholdwaveThen merging can be performed, and the specific merging algorithm is as follows:
1) and sorting all similar cluster data efficiency T minimum values from small to large by using a principle that the minimum value of the fluctuation value wave is a main key and the minimum value of the fluctuation value wave is an auxiliary key. The start pointer and compare pointer are moved to the first similar cluster.
2) Moving the compare pointer to compare the T pointed by the start pointerminAnd waveminAnd T pointed to by the compare pointermaxAnd wavemaxWhether or not the difference values of (a) are all less than or equal to the threshold value set therefor.
3) If any condition is not met, all clusters from the start pointer to the front of the comparison pointer are merged to form a new similar cluster. Moving the start pointer to the position of the comparison pointer, and jumping to 2) to run until all similar clusters are compared. When the similar clusters are combined, the relative iteration round of each node structure in the clusters needs to be updated, and the updating principle is that the minimum relative iteration round value is subtracted from the relative iteration round of the two similar clusters respectively.
Step 2.4: pass the dictionary structures groups and map downwards as return values.
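A sketch of the pointer-based merging of step 2.3 over per-cluster summaries (T_min, T_max, wave_min, wave_max); the relative-round update performed on merge is omitted, and the function and key names are illustrative.

```python
def merge_similar_clusters(clusters, threshold_t, threshold_wave):
    """Greedy merge of adjacent clusters after sorting by (T_min, wave_min).

    clusters: list of lists of node dicts with keys 'T' and 'wave'.
    Returns a new list of merged clusters.
    """
    def summary(cluster):
        ts = [n["T"] for n in cluster]
        ws = [n["wave"] for n in cluster]
        return min(ts), max(ts), min(ws), max(ws)

    # 1) Sort by minimum efficiency, then minimum fluctuation.
    ordered = sorted(clusters, key=lambda c: (summary(c)[0], summary(c)[2]))

    merged, start = [], 0
    for compare in range(1, len(ordered) + 1):
        if compare < len(ordered):
            t_min, _, w_min, _ = summary(ordered[start])
            _, t_max, _, w_max = summary(ordered[compare])
            # 2) Keep extending the run while the spread stays within both thresholds.
            if t_max - t_min <= threshold_t and w_max - w_min <= threshold_wave:
                continue
        # 3) Close the current run [start, compare) as one merged cluster.
        merged.append([node for c in ordered[start:compare] for node in c])
        start = compare
    return merged
```

With the three-node example given below (Threshold_T = Threshold_wave = 0.5), this procedure merges node1 and node2 and leaves node3 in its own cluster.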
For greater clarity, the node self-organizing grouping process is illustrated with the following example. Suppose a cluster of three nodes with Threshold_T = 0.5 and Threshold_wave = 0.5. Initially:
node1: {wave=1, T=2.2, epoch_abs=1, epoch_rela=1},
node2: {wave=3, T=4.1, epoch_abs=1, epoch_rela=1},
node3: {wave=4.6, T=8.3, epoch_abs=1, epoch_rela=1},
groups = {1: node1; 2: node2; 3: node3}.
After node2 completes one round of iteration, its data is updated to node2: {wave=1.1, T=2.4, epoch_abs=2, epoch_rela=2}; node1 and node2 are then merged, groups = {1: {node1, node2}; 3: node3}, and the relative iteration rounds of all nodes with id 1 in groups are updated, finally giving node1: {wave=1, T=2.2, epoch_abs=1, epoch_rela=1} and node2: {wave=1.1, T=2.4, epoch_abs=2, epoch_rela=1}.
Step 3, mixed synchronous training:
the main purpose of the step is to train the model according to different synchronous strategies by utilizing the self-organizing grouping result, and the specific sub-processes are as follows:
step 3.1, when one round of iteration of any computing node is finished, calculating the maximum iteration round inside the similar cluster according to the following formula:
Figure BDA0002579102400000091
wherein wave represents the average value of all node fluctuations in the similar cluster, and nodeNumber represents the number of nodes in the similar cluster.
Step 3.2, judge, according to the cluster state, whether the next step must enter a waiting process, specifically as follows:
1) Judge whether the absolute iteration round difference between this node and any node in the whole cluster is greater than the threshold Threshold_max; if so, the node enters global waiting.
2) Judge, according to the relative iteration rounds of this node and of all nodes in the same similar cluster in groups, whether waiting is needed: if the difference between this node's relative iteration round and the minimum relative round in the similar cluster is greater than epoch_max, the node enters similar-cluster waiting.
3) Otherwise, the distributed training does not enter a waiting process.
Step 3.3: the waiting process is as follows:
1) If no waiting process was entered in step 3.2, the distributed neural network continues to the next iteration in the manner of step 3.1.
2) If global waiting was entered in step 3.2, the node can enter the next round of iteration only after all computing nodes have iterated to the same absolute round as this node; when global waiting exits, the relative iteration rounds of all nodes are reset to 0.
3) If similar-cluster waiting was entered in step 3.2, the node must wait until all nodes in its similar cluster have reached the same relative round as this node before performing the next iteration; when similar-cluster waiting exits, the relative iteration rounds of all nodes in the similar cluster are updated.
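A sketch of the waiting and bookkeeping in steps 3.2 and 3.3, reusing the decide_wait sketch given earlier; the polling wait_until helper stands in for a real synchronization barrier, and the re-normalization of relative rounds on cluster-wait exit is an assumption, since the exact update rule is not spelled out.

```python
import time


def wait_until(condition, poll=0.05):
    """Block until the condition becomes true (polling stand-in for a real barrier)."""
    while not condition():
        time.sleep(poll)


def after_iteration(node, all_nodes, cluster_nodes, threshold_max, epoch_max):
    """Advance the node's rounds, then apply global or similar-cluster waiting."""
    node["epoch_abs"] += 1
    node["epoch_rela"] += 1

    state = decide_wait(node, all_nodes, cluster_nodes, threshold_max, epoch_max)
    if state == "global":
        # Wait until every node reaches this absolute round, then reset all relative rounds to 0.
        wait_until(lambda: all(n["epoch_abs"] >= node["epoch_abs"] for n in all_nodes))
        for n in all_nodes:
            n["epoch_rela"] = 0
    elif state == "cluster":
        # Wait until every node in the similar cluster reaches this relative round,
        # then update the relative rounds (subtracting the minimum is an assumed rule).
        wait_until(lambda: all(n["epoch_rela"] >= node["epoch_rela"] for n in cluster_nodes))
        base = min(n["epoch_rela"] for n in cluster_nodes)
        for n in cluster_nodes:
            n["epoch_rela"] -= base
    # state == "none": continue single-machine asynchronous optimization immediately.
```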

Claims (5)

1. A distributed neural network hybrid synchronous training method based on self-organizing grouping of computing nodes is characterized in that,
the method comprises the following steps:
s1, predicting the synchronization efficiency of the calculation node by adopting Kalman filtering, and analyzing the fluctuation condition of the synchronization efficiency of the calculation node by adopting mean square error;
s2, carrying out self-organizing real-time grouping on the computing nodes based on the predicted synchronization efficiency of each computing node and the similarity degree of the fluctuation condition of the synchronization efficiency;
s3, according to the self-organizing real-time grouping result of the computing nodes, different synchronization strategies are adopted between groups and in the groups to train the models;
step S3 specifically includes:
s31, when one round of iteration of any computing node is finished, computing the maximum iteration round epoch in the similar clustermax
wherein epoch_max is computed from wave, the average of the fluctuation values of all nodes in the similar cluster, and nodeNumber, the number of nodes in the similar cluster;
s32, judging whether the node can enter global waiting or cluster waiting, if so, entering a step S33, otherwise, entering a step S34;
s33, if the node enters global waiting, the node can continue to operate after all nodes have the same iteration turns; if the node enters cluster waiting, the node can continue to operate after all nodes in the same similar cluster have the same iteration round;
s34, if the node does not enter global waiting or cluster waiting, the node carries out single-machine asynchronous optimization, enters the next iteration and returns to the step S31;
in step S32, the determining whether the node will enter global waiting or cluster waiting specifically includes:
(1) judging whether the absolute iteration round difference between the node and any node in the cluster is greater than a preset threshold, and entering global waiting if it is;
(2) entering similar-cluster waiting if the difference between the node's relative iteration round and the minimum relative round in the similar cluster is larger than epoch_max;
the absolute iteration round is the total number of iterations the node has completed so far, and the relative iteration round is the node's iteration round within the similar cluster it belongs to.
2. The distributed neural network hybrid synchronous training method based on the self-organizing grouping of the computing nodes as claimed in claim 1, wherein step S1 specifically includes:
S11, for any computing node i, the parameter server collects the time of each recent iteration to form a time window set T_i of fixed size; each element in T_i records the time taken by the parameter server, in one iteration, from sending the latest model data to the node to receiving the node's latest gradient parameters;
S12, according to the efficiency of the node's last iteration and the state of the Kalman filter, performing an efficiency evaluation and prediction for the next iteration to obtain the node's predicted next synchronization efficiency T_i^predict;
S13, solving the mean square error of the elements in the time window set T_i to analyze the recent fluctuation of the node's synchronization efficiency, recorded as wave_i; these two variables are used as the output of computing-node evaluation and prediction.
3. The distributed neural network hybrid synchronous training method based on the self-organizing grouping of the compute nodes as claimed in claim 1, wherein the step S2 specifically includes:
s21, forming a similar cluster by a plurality of similar computing nodes, and judging whether the node needs to be moved out of the similar cluster according to the node similarity condition when each iteration of any node starts;
s22, removing the nodes needing to be moved out of the similar cluster, and creating a new similar cluster for the nodes;
and S23, recombining and merging the similar clusters meeting the merging condition, and updating the internal information of the similar clusters.
4. The distributed neural network hybrid synchronous training method based on self-organizing grouping of computing nodes as claimed in claim 3, wherein in step S21, the node similarity condition includes:
any two computing nodes x and y in the same similar cluster group_similar must satisfy
|T_x^predict - T_y^predict| ≤ Threshold_T and |wave_x - wave_y| ≤ Threshold_wave,
where Threshold_T is the threshold for similar node efficiency and Threshold_wave is the threshold for similar recent efficiency fluctuation; T_x^predict and T_y^predict are the predicted synchronization efficiencies of nodes x and y in the similar cluster, and wave_x and wave_y are their recent synchronization efficiency fluctuation values.
5. The distributed neural network hybrid synchronous training method based on self-organizing grouping of computing nodes as claimed in claim 3, wherein in step S23, the merging condition comprises:
any two similar clusters can be merged as long as any two nodes u and v inside the two similar clusters satisfy |T_u^predict - T_v^predict| ≤ Threshold_T and |wave_u - wave_v| ≤ Threshold_wave.
CN202010662415.0A 2020-07-10 2020-07-10 Distributed neural network hybrid synchronous training method based on self-organizing grouping of computing nodes Active CN111813858B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010662415.0A CN111813858B (en) 2020-07-10 2020-07-10 Distributed neural network hybrid synchronous training method based on self-organizing grouping of computing nodes

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010662415.0A CN111813858B (en) 2020-07-10 2020-07-10 Distributed neural network hybrid synchronous training method based on self-organizing grouping of computing nodes

Publications (2)

Publication Number Publication Date
CN111813858A (en) 2020-10-23
CN111813858B (en) 2022-06-24

Family

ID=72841719

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010662415.0A Active CN111813858B (en) 2020-07-10 2020-07-10 Distributed neural network hybrid synchronous training method based on self-organizing grouping of computing nodes

Country Status (1)

Country Link
CN (1) CN111813858B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112633480B (en) * 2020-12-31 2024-01-23 中山大学 Calculation optimization method and system of semi-asynchronous parallel neural network
CN113035349B (en) * 2021-03-25 2024-01-05 浙江大学 Neural network dynamic fusion method for multi-center screening of genetic metabolic diseases
CN115865607A (en) * 2023-03-01 2023-03-28 山东海量信息技术研究院 Distributed training computing node management method and related device
CN117155928B (en) * 2023-10-31 2024-02-09 浪潮电子信息产业股份有限公司 Communication task processing method, system, equipment, cluster and readable storage medium


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6892194B2 (en) * 2001-06-05 2005-05-10 Basf Corporation System and method for organizing color values using an artificial intelligence based cluster model
US20190273510A1 (en) * 2018-03-01 2019-09-05 Crowdstrike, Inc. Classification of source data by neural network processing

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103914735A (en) * 2014-04-17 2014-07-09 北京泰乐德信息技术有限公司 Failure recognition method and system based on neural network self-learning
CN106570563A (en) * 2015-10-13 2017-04-19 中国石油天然气股份有限公司 Deformation prediction method and apparatus based on Kalman filtering and BP neural network
CN110135575A (en) * 2017-12-29 2019-08-16 英特尔公司 Communication optimization for distributed machines study
CN108366386A (en) * 2018-05-11 2018-08-03 东南大学 A method of using neural fusion wireless network fault detect

Non-Patent Citations (11)

* Cited by examiner, † Cited by third party
Title
Convergence Study in Extended Kalman Filter-Based Training of Recurrent Neural Networks; Xiaoyu Wang et al.; Neural Networks; 2011-04-10; Vol. 22, No. 4; pp. 588-600 *
The ART of adaptive pattern recognition by a self-organizing neural network; G.A. Carpenter et al.; Computer; 1988-03-31; Vol. 21, No. 3; pp. 77-88 *
Unsupervised Learning with Self-Organizing Spiking Neural Networks; Hananel Hazan et al.; Neural Networks; 2018-10-15; pp. 1-6 *
Polynomial function model compensation control for uncertain chaotic systems; 曾喆昭 et al.; Acta Physica Sinica; 2013-07-04; Vol. 62, No. 15; pp. 74-81 *
Research and implementation of trajectory prediction algorithms in distributed environments; 谢渊; China Master's Theses Full-text Database (Information Science and Technology); 2020-07-15; I140-87 *
A Kalman-filter-based neural network learning algorithm and its application; 田晓宇 et al.; Computer & Digital Engineering; 2005-12-31; No. 2; pp. 40-42, 104 *
A neural network training algorithm based on cubature Kalman filtering; 胡振涛 et al.; Control and Decision; 2015-10-19; Vol. 31, No. 2; pp. 355-360 *
Neural network training based on the extended Kalman particle filter algorithm; 王法胜 et al.; Computer Engineering & Science; 2010-05-15; Vol. 32, No. 5; pp. 48-50 *
Distributed intrusion detection implemented with self-organizing feature map neural network technology; 杨森 et al.; Computer Applications; 2003-08-28; pp. 54-57 *
Research on the principles and applications of self-organizing feature map neural networks; 李春华 et al.; Journal of Beijing Normal University (Natural Science); 2006-10-30; Vol. 42, No. 5; pp. 543-547 *
A dynamic load balancing algorithm for big data distributed storage; 张栗粽 et al.; Computer Science; 2017-05-15; Vol. 44; pp. 178-183 *

Also Published As

Publication number Publication date
CN111813858A (en) 2020-10-23

Similar Documents

Publication Publication Date Title
CN111813858B (en) Distributed neural network hybrid synchronous training method based on self-organizing grouping of computing nodes
CN110223517B (en) Short-term traffic flow prediction method based on space-time correlation
CN110610242B (en) Method and device for setting weights of participants in federal learning
CN108335487B (en) Road traffic state prediction system based on traffic state time sequence
CN110995487B (en) Multi-service quality prediction method and device, computer equipment and readable storage medium
CN106933649B (en) Virtual machine load prediction method and system based on moving average and neural network
CN105471631B (en) Network flow prediction method based on traffic trends
CN113852432B (en) Spectrum Prediction Sensing Method Based on RCS-GRU Model
CN108650065B (en) Window-based streaming data missing processing method
CN113778691B (en) Task migration decision method, device and system
CN109067583A (en) A kind of resource prediction method and system based on edge calculations
CN114585006B (en) Edge computing task unloading and resource allocation method based on deep learning
CN113760511B (en) Vehicle edge calculation task unloading method based on depth certainty strategy
CN104539601A (en) Reliability analysis method and system for dynamic network attack process
CN113887748B (en) Online federal learning task allocation method and device, and federal learning method and system
Zhao et al. Adaptive swarm intelligent offloading based on digital twin-assisted prediction in VEC
CN102130955B (en) System and method for generating alternative service set of composite service based on collaborative filtering
CN113382066A (en) Vehicle user selection method and system based on federal edge platform
Yang et al. Energy scheduling for DoS attack over multi-hop networks: Deep reinforcement learning approach
CN115691140B (en) Analysis and prediction method for space-time distribution of automobile charging demand
Cui et al. The learning stimulated sensing-transmission coordination via age of updates in distributed UAV swarm
CN115865914A (en) Task unloading method based on federal deep reinforcement learning in vehicle edge calculation
CN114693141A (en) Transformer substation inspection method based on end edge cooperation
CN108599834B (en) Method and system for analyzing utilization rate of satellite communication network link
CN116957166B (en) Tunnel traffic condition prediction method and system based on Hongmon system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant