CN111813858A - Distributed neural network hybrid synchronous training method based on self-organizing grouping of computing nodes


Info

Publication number
CN111813858A
CN111813858A
Authority
CN
China
Prior art keywords
node
similar
iteration
nodes
cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010662415.0A
Other languages
Chinese (zh)
Other versions
CN111813858B (en)
Inventor
陈爱国
郑旭
罗光春
田玲
谢渊
邹冰洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202010662415.0A
Publication of CN111813858A
Application granted
Publication of CN111813858B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 - Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F16/275 - Synchronous replication
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Abstract

The invention relates to distributed neural network technology and discloses a distributed neural network hybrid synchronous training method based on self-organizing grouping of computing nodes, solving the problem that traditional distributed neural network synchronization algorithms cannot properly balance model accuracy against training efficiency. The method comprises the following steps: S1, predicting the synchronization efficiency of each computing node with Kalman filtering, and analyzing the fluctuation of that synchronization efficiency with the mean square error; S2, grouping the computing nodes in a self-organizing, real-time manner based on each node's predicted synchronization efficiency and the similarity of its efficiency fluctuation; and S3, training the model with different synchronization strategies between groups and within groups according to the real-time self-organizing grouping result.

Description

Distributed neural network hybrid synchronous training method based on self-organizing grouping of computing nodes
Technical Field
The invention relates to a distributed neural network technology, in particular to a distributed neural network hybrid synchronous training method based on self-organizing grouping of computing nodes.
Background
When a single-machine system trains a very large neural network on very large-scale data, training efficiency becomes low, and in severe cases the training process fails outright. Distributed neural networks address both the low training efficiency and the training failures that very large-scale data causes on a single machine.
Distributed neural network training is divided into data parallelism and model parallelism, depending on whether the training data or the model data is partitioned; data parallelism is the key technique for improving the efficiency of large-scale data training.
In a data-parallel distributed neural network, the training data is first split across multiple computing nodes, each of which performs single-machine optimization on its share. After each round of single-machine optimization, every computing node sends its gradient parameters to a parameter server for parameter fusion and model update, and the updated model data is then redistributed to the computing nodes for the next iteration; the system architecture is shown in fig. 1. In a distributed environment, the computing nodes differ in computing power and in bandwidth to the parameter server, so the pace at which they synchronize parameter data with the parameter server also differs.
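As an illustration only, the following sketch mimics one such data-parallel training round with a central parameter server; the function names (worker_gradient, training_round) and the least-squares objective are assumptions made for this example and are not prescribed by the architecture described above.

```python
# Minimal sketch of a data-parallel training round with a parameter server.
# worker_gradient() stands in for the single-machine optimization each node
# performs on its own data shard; the least-squares model is illustrative only.
import numpy as np

def worker_gradient(model, shard):
    X, y = shard
    return 2 * X.T @ (X @ model - y) / len(y)            # gradient on this node's shard

def training_round(model, shards, lr=0.01):
    grads = [worker_gradient(model, s) for s in shards]  # nodes send gradients
    return model - lr * np.mean(grads, axis=0)           # server fuses and updates

rng = np.random.default_rng(0)
X, y = rng.normal(size=(90, 3)), rng.normal(size=90)
shards = [(X[i::3], y[i::3]) for i in range(3)]          # training data split over 3 nodes
model = np.zeros(3)
for _ in range(10):
    model = training_round(model, shards)                # updated model redistributed each round
```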
Traditional distributed neural network synchronization algorithms fall into three classes: the synchronous gradient descent algorithm (SSGD), the asynchronous gradient descent algorithm (ASGD), and hybrid synchronous gradient descent algorithms. SSGD must wait for every computing node to send its gradient parameters to the parameter server before parameters can be fused, so the parameter server waits on slow nodes. ASGD does not wait for parameter data from all computing nodes; the parameter server fuses parameters as soon as it receives any one node's data, but this reduces model accuracy. Hybrid synchronous gradient descent lets the parameter server run asynchronously until the difference in iteration rounds between any two computing nodes exceeds a threshold, which balances training efficiency against model accuracy, but because the threshold is fixed the distributed system still cannot reach its highest efficiency.
Therefore, traditional distributed neural network synchronization algorithms cannot properly balance model accuracy and training efficiency.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a distributed neural network hybrid synchronous training method based on self-organizing grouping of computing nodes, addressing the inability of traditional distributed neural network synchronization algorithms to balance model accuracy and training efficiency.
The technical scheme adopted by the invention for solving the technical problems is as follows:
a distributed neural network hybrid synchronous training method based on self-organizing grouping of computing nodes comprises the following steps:
s1, predicting the synchronization efficiency of the calculation node by adopting Kalman filtering, and analyzing the fluctuation condition of the synchronization efficiency of the calculation node by adopting mean square error;
s2, carrying out self-organizing real-time grouping on the computing nodes based on the predicted synchronization efficiency of each computing node and the similarity of the fluctuation condition of the synchronization efficiency;
and S3, training models by adopting different synchronization strategies between groups and in groups according to the self-organizing real-time grouping result of the computing nodes.
As a further optimization, step S1 specifically includes:
S11, for any computing node i, the parameter server collects the time of each recent iteration round to form a fixed-size time window set T_i; each element of the time window set T_i records the time taken, within one iteration, from the parameter server sending the latest model data to the node until it finishes receiving the node's latest gradient parameters;
S12, according to the efficiency of the node's previous iteration and the state of the Kalman filter, an efficiency evaluation prediction is made for the next iteration, giving the node's predicted next synchronization efficiency T_i^predict;
S13, the mean square error of the elements in the time window set T_i is solved to analyze the node's recent synchronization-efficiency fluctuation, recorded as wave_i; these two variables serve as the output of compute-node evaluation and prediction.
As a further optimization, step S2 specifically includes:
s21, forming a similar cluster by a plurality of similar computing nodes, and judging whether the node needs to be moved out of the similar cluster according to the node similarity condition when each iteration of any node starts;
s22, removing the nodes needing to be moved out of the similar cluster, and creating a new similar cluster for the nodes;
and S23, recombining and merging the similar clusters meeting the merging condition, and updating the internal information of the similar clusters.
As a further optimization, in step S21, the node similarity condition includes:
for any two computing nodes x and y in the same similar cluster group_similar, the following must hold:
|T_x^predict - T_y^predict| ≤ Threshold_T and |wave_x - wave_y| ≤ Threshold_wave
where Threshold_T is the threshold for node efficiencies being similar and Threshold_wave is the threshold for recent efficiency fluctuations being similar; T_x^predict is the predicted synchronization efficiency of node x in the similar cluster, T_y^predict is the predicted synchronization efficiency of node y in the similar cluster, wave_x is the recent synchronization-efficiency fluctuation value of node x in the similar cluster, and wave_y is the recent synchronization-efficiency fluctuation value of node y in the similar cluster.
As a further optimization, in step S23, the merging condition includes:
for any two similar clusters, merging can be performed as long as any two nodes u and v from the two clusters satisfy
|T_u^predict - T_v^predict| ≤ Threshold_T and |wave_u - wave_v| ≤ Threshold_wave.
As a further optimization, step S3 specifically includes:
S31, when a round of iteration of any computing node finishes, compute the maximum iteration round epoch_max allowed inside the similar cluster, where epoch_max is determined by wave, the average fluctuation value of all nodes in the similar cluster, and nodeNumber, the number of nodes in the similar cluster;
s32, judging whether the node enters global waiting or cluster waiting, if so, entering a step S33, otherwise, entering a step S34;
S33, if the node enters global waiting, it can continue to operate only after all nodes have reached the same iteration round; if the node enters cluster waiting, it can continue to operate only after all nodes in the same similar cluster have reached the same iteration round;
and S34, if the node does not enter global waiting or cluster waiting, the node performs single-machine asynchronous optimization, enters the next iteration and returns to the step S31.
As a further optimization, in step S32, the determining whether the node will enter global waiting or cluster waiting specifically includes:
(1) judge whether the absolute iteration-round difference between this node and all the nodes in the cluster is greater than a preset threshold; if so, enter global waiting;
(2) if the difference between this node's relative iteration round and the minimum relative round in its similar cluster is greater than epoch_max, enter similar-cluster waiting;
here the absolute iteration round is the total number of iterations the node has completed so far, and the relative iteration round is the node's iteration round within the similar cluster it belongs to.
The invention has the beneficial effects that:
the prediction of the node synchronization efficiency and the analysis of the fluctuation condition of the synchronization efficiency are used as the basis of node self-organizing grouping, so that the cluster grouping state of all the computing node sets forming the distributed neural network is dynamically adjusted, and different synchronization strategies are adopted between groups and in the groups to train models according to the real-time grouping result. Because the grouping is carried out by adopting the approximate condition considering the synchronous efficiency of the nodes, the efficiency of each node in the group is relatively close, and when the model training is carried out, the limit of the maximum allowable iteration round difference in the group is used as a node synchronous barrier in the cluster, so that the iterative waiting time can be reduced, and the training efficiency is improved; in addition, the accuracy of the trained model is ensured by using the limitation of the maximum iteration turn difference value between any nodes as a global synchronization barrier. Therefore, the scheme of the invention can well balance the model accuracy and the training efficiency.
Drawings
FIG. 1 is a diagram of a distributed neural network system architecture for data parallelization;
FIG. 2 is a schematic diagram of a cluster state after node self-organizing grouping according to the present invention;
fig. 3 is a flowchart of a hybrid synchronous training method for a distributed neural network according to an embodiment of the present invention.
Detailed Description
The invention aims to provide a distributed neural network hybrid synchronous training method based on self-organizing grouping of computing nodes, and solves the problem that the accuracy rate and the training efficiency of a model cannot be well balanced by a traditional distributed neural network synchronous algorithm.
The method comprises three major parts: calculating node efficiency evaluation and prediction, node self-organizing grouping and mixed synchronous training.
Firstly, evaluating and predicting the efficiency of a computing node:
For each computing node i, the parameter server collects the time of each recent iteration to form a fixed-size time window set, recorded as T_i. T_i is a fixed-size queue whose elements are denoted t_i^j, j = 1, ..., m, where m is the number of elements in the queue; t_i^j records the time taken, within one iteration, from the parameter server sending the latest model data to the node until it finishes receiving the node's latest gradient parameters. Therefore, within one iteration, t_i^j captures the computing node's synchronization efficiency, comprising both the node's single-machine optimization efficiency and the efficiency of synchronizing data with the parameter server.
The method uses Kalman filtering to predict the node's efficiency at the next moment, obtaining the node's efficiency evaluation for the next iteration, recorded as T_i^predict. At the same time, the mean square error of the elements in the time window set T_i is solved to analyze the node's recent efficiency fluctuation, recorded as wave_i; these two variables serve as the output of compute-node evaluation and prediction.
Secondly, node self-organizing grouping:
the main purpose of the ad hoc grouping of nodes is to group "similar" nodes into one group and "dissimilar" nodes into another group. Similar definition is that the node efficiency of the next prediction round is similar and the recent fluctuation is similar, and weMarking the cluster formed by a plurality of similar nodes as a similar cluster groupsimilarA plurality of similar clusters form all the computing node sets S of the distributed neural network, as shown in fig. 2; for the same similar cluster groupsimilarEach compute node in (1) needs to satisfy a similar condition:
Figure BDA0002579102400000052
among them, ThresholdTThreshold, representing a similar node efficiencywaveRepresenting thresholds at which recent efficiency fluctuates similarly. The node self-organizing grouping algorithm is to dynamically adjust the set S when each iteration of the computing nodes is completed, so as to ensure all groupssimilarSimilar conditions are satisfied.
Specifically, the node ad-hoc grouping process is mainly divided into the following steps:
1) calling the algorithm when any node i enters a new iteration;
2) if the node i cannot meet the similar conditions in the original similar cluster, moving the node i out of the original similar cluster and updating the information of the set S;
3) traverse all group_similar elements in the set S and regroup them;
In summary, self-organizing grouping is a process of splitting and recombining the existing set structure whenever any node enters a new iteration; the process finally outputs a new set S of all computing nodes in which every cluster group_similar satisfies the similar condition.
Thirdly, mixed synchronous training:
the hybrid synchronous training is to use synchronous algorithms with different strategies between groups and in the self-organized grouping, improve the accuracy of the model as much as possible, and improve the training efficiency of the model under the condition of ensuring the accuracy.
The invention designs two synchronous barriers to limit the communication pace of training so as to balance the model accuracy and the training efficiency.
Global synchronization barrier: to guarantee model accuracy, the maximum iteration delay between any two nodes must be bounded, i.e. a maximum iteration difference Threshold_max must hold between any two nodes. For any similar cluster group_similar^i, define the maximum iteration difference allowed inside that cluster as Threshold_i; then for any two similar clusters group_similar^i and group_similar^j, the sum of their maximum allowed iteration differences cannot exceed Threshold_max, namely:
Threshold_i + Threshold_j ≤ Threshold_max
Similar-cluster synchronization barrier: according to the fluctuation values of the nodes in each similar cluster, the maximum iteration difference Threshold_i allowed inside the cluster is computed; the iteration difference between the slowest and the fastest node of a similar cluster must remain at or below Threshold_i. Once the iteration difference exceeds Threshold_i, the nodes inside the cluster must wait until the slowest and the fastest node are at the same iteration round.
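Expressed as predicates, the two barriers amount to the checks sketched below; the function names are assumptions introduced here for illustration, not terms from the method itself.

```python
def global_bound_respected(threshold_i, threshold_j, threshold_max):
    # Global synchronization barrier: the per-cluster allowances Threshold_i and
    # Threshold_j of any two similar clusters may not sum to more than Threshold_max.
    return threshold_i + threshold_j <= threshold_max

def cluster_must_wait(iteration_rounds, threshold_i):
    # Similar-cluster synchronization barrier: once the gap between the fastest
    # and the slowest node of the cluster exceeds Threshold_i, the cluster's
    # nodes wait until the iteration rounds equalize.
    return max(iteration_rounds) - min(iteration_rounds) > threshold_i
```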
Example (b):
the present embodiment specifically describes the present invention with a preferred implementation algorithm.
The parameters of the algorithm comprise the time statistics window length window_length, the node efficiency similarity threshold Threshold_T, the node fluctuation similarity threshold Threshold_wave, and the threshold Threshold_max on the iteration-round difference between any two nodes. Referring to fig. 3, the distributed neural network hybrid synchronous training method in this embodiment includes the following steps:
step 1, calculating node efficiency evaluation and prediction:
the main purpose of the step is to carry out prediction and evaluation on the efficiency of the computing node, and the method is used as the basis for node self-organizing grouping, and the specific sub-process is as follows:
step 1.1, initialization:
Efficiency assessment and prediction data structures are initialized in the parameter server. For any computing node i, the parameter server generates a queue queue_i of length window_length, used to store the time taken by each of the node's iterations over a recent period. At the same time, the parameter server generates a Kalman filter object for the node and initializes the matrices associated with it, including the Kalman filter's state transition matrix, the system measurement matrix, and the covariance matrix of the system process noise.
Step 1.2, detecting a node fluctuation value:
When any computing node i completes an iteration, the parameter server computes the time t_i^j elapsed from the parameter server sending the model data until the node finishes transmitting its gradient parameter data. The parameter server puts this iteration time t_i^j into queue_i, dequeuing the first element if the size of the queue exceeds the window length window_length. It then calculates the mathematical expectation of all elements in the queue and their mean square error; for a specific computing node i this mean square error is recorded as wave_i. Taking t_i^j to represent the elements of the queue, the fluctuation detection returns the following result:
wave_i = Σ_{j=1..m} α_j (t_i^j - E(t_i^j))²
where α_j is the weighting factor corresponding to element t_i^j, and E(t_i^j) is the mathematical expectation of the element t_i^j.
Step 1.3, Kalman filtering efficiency prediction:
Considering that the efficiency of any node should obey the inertia principle, i.e. it should change smoothly over a short time without abrupt jumps, the efficiency can be predicted using Kalman filtering. When an iteration round of any node i begins, the iteration time of the previous round is counted as the latest measurement, and the prediction made for the previous iteration is taken as the latest predicted measurement. When the iteration round finishes, the measured time of this round is used to correct the Kalman filter, and the Kalman filter then predicts the time T_i^predict required by the next iteration.
Step 1.4: pass T_i^predict and wave_i downward as return values and update the node grouping state.
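A minimal sketch of step 1 is given below. It assumes a one-dimensional random-walk Kalman model with scalar noise terms and an unweighted mean square error over the window; the Kalman matrices and the weighting factors appear in the original only as figures, so these choices are assumptions rather than the method's exact formulation.

```python
from collections import deque
import numpy as np

class NodeEfficiencyTracker:
    """Per-node iteration-time window plus a scalar Kalman filter (sketch)."""

    def __init__(self, window_length=8, q=1e-3, r=1e-2):
        self.times = deque(maxlen=window_length)  # queue_i of recent iteration times
        self.x = None                             # filter estimate of the iteration time
        self.p = 1.0                              # estimate covariance
        self.q, self.r = q, r                     # process / measurement noise (assumed scalars)

    def update(self, t_measured):
        """Call when node i finishes an iteration that took t_measured seconds."""
        self.times.append(t_measured)
        if self.x is None:
            self.x = t_measured                   # initialize from the first measurement
        p_pred = self.p + self.q                  # predict (identity state transition)
        k = p_pred / (p_pred + self.r)            # Kalman gain
        self.x = self.x + k * (t_measured - self.x)   # correct with this round's time
        self.p = (1.0 - k) * p_pred
        return self.predict(), self.fluctuation()

    def predict(self):
        # T_i^predict: predicted time of the next iteration
        return self.x

    def fluctuation(self):
        # wave_i: mean square error of the elements in the time window
        t = np.asarray(self.times)
        return float(np.mean((t - t.mean()) ** 2)) if t.size else 0.0

# Example: feed a few measured iteration times for one compute node.
tracker = NodeEfficiencyTracker()
for t in [2.1, 2.3, 2.0, 2.4]:
    t_predict, wave = tracker.update(t)
```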
Step 2, node self-organizing grouping step:
The main purpose of this step is to combine several similar nodes into a whole, using one synchronization strategy inside the whole and another synchronization strategy outside it. The specific sub-process is as follows:
step 2.1, initialization:
When the system is initialized, a dictionary groups(key_id, value_list) and a map(key_node, value_group_id) are created; a key of groups is the id of one similar cluster, and its value holds the information of all computing nodes contained in that similar cluster. The node structure stores the node's absolute iteration round epoch_abs, relative iteration round epoch_rela, fluctuation value wave, and efficiency prediction value T, where the relative iteration round is the number of iterations the node has performed within the similar cluster it currently belongs to. map is the reverse index of groups, used to quickly locate which similar cluster a node belongs to. At the start of the first round, the parameter server treats every computing node as an independent similar cluster; it therefore creates a groups dictionary with as many entries as there are computing nodes, the value of each entry containing one node structure with that computing node's information.
Step 2.2: splitting a similar cluster:
similar cluster splitting is attempted before any compute node begins a new iteration. The specific process is as follows:
1) Update node information: this step mainly updates the information contained in the node structure, using the values T_i^predict and wave_i returned by the computing-node efficiency evaluation and prediction step to update the node's efficiency prediction value T and fluctuation value wave.
2) Determine whether a split is required: in the node's similar cluster group_similar, traverse all node structures to find the minimum fluctuation value wave_min, the maximum fluctuation value wave_max, the minimum efficiency value T_min, and the maximum efficiency value T_max, and judge the following conditions:
T_max - T_min ≤ Threshold_T and wave_max - wave_min ≤ Threshold_wave
If either of the two conditions cannot be met, the node no longer satisfies the similar condition and must be split out; otherwise go directly to step 2.3.
3) Split the computing node: generate an independent similar cluster containing only this node, and update the node information in groups as well as the corresponding node's cluster id in map.
Step 2.3, the computing nodes are recombined:
that is, existing similar clusters are merged, so that the number of similar clusters is reduced as much as possible.
For any two similar clusters, as long as any two nodes node_u and node_v from the two clusters satisfy |T_u - T_v| ≤ Threshold_T and |wave_u - wave_v| ≤ Threshold_wave, the clusters can be merged. The specific merging algorithm is as follows:
1) Sort all similar clusters in ascending order, using the cluster's minimum efficiency value T_min as the primary key and its minimum fluctuation value wave_min as the secondary key. Move the start pointer and the compare pointer to the first similar cluster.
2) Advance the compare pointer, and check whether the differences between the T_min and wave_min of the cluster at the start pointer and the T_max and wave_max of the cluster at the compare pointer are both less than or equal to their respective thresholds.
3) If either condition is not met, all clusters from the start pointer up to (but not including) the compare pointer are merged into a new similar cluster; the start pointer is then moved to the position of the compare pointer, and the procedure jumps back to 2) until all similar clusters have been compared. When similar clusters are merged, the relative iteration round of every node structure in the clusters must be updated; the updating rule is that the minimum relative iteration round is subtracted from the relative iteration rounds of the two similar clusters respectively.
Step 2.4: pass the dictionary structures groups and map downward as return values.
To make this clearer, the node self-organizing grouping procedure is illustrated with the following example. Consider a cluster of three nodes whose Threshold_T and Threshold_wave are 0.5 and 0.5. Initially,
node_1: {wave=1, T=2.2, epoch_abs=1, epoch_rela=1},
node_2: {wave=3, T=4.1, epoch_abs=1, epoch_rela=1},
node_3: {wave=4.6, T=8.3, epoch_abs=1, epoch_rela=1},
groups = {1: node_1; 2: node_2; 3: node_3}.
After node_2 completes one round of iteration, its data is updated to node_2: {wave=1.1, T=2.4, epoch_abs=2, epoch_rela=2}. node_1 and node_2 are then merged, giving groups = {1: {node_1, node_2}; 3: node_3}, and the relative iteration rounds of all nodes with id 1 in groups are updated, finally giving node_1: {wave=1, T=2.2, epoch_abs=1, epoch_rela=1} and node_2: {wave=1.1, T=2.4, epoch_abs=2, epoch_rela=1}.
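The split check and one recombination pass of step 2 can be sketched as follows, reusing the example thresholds of 0.5 and 0.5. The node layout and the rule used here to renormalize relative iteration rounds after a merge are simplifying assumptions; the pass below mirrors the start/compare-pointer merge by joining consecutive runs of the sorted clusters.

```python
from dataclasses import dataclass

THRESHOLD_T, THRESHOLD_WAVE = 0.5, 0.5   # example similarity thresholds

@dataclass
class Node:
    wave: float       # recent fluctuation value
    T: float          # predicted synchronization efficiency
    epoch_abs: int    # absolute iteration round
    epoch_rela: int   # relative iteration round inside its similar cluster

def violates_similarity(cluster):
    """True if the similar cluster no longer satisfies the similar condition,
    so the node that just updated must be split out into its own cluster."""
    ts = [n.T for n in cluster]
    waves = [n.wave for n in cluster]
    return (max(ts) - min(ts) > THRESHOLD_T
            or max(waves) - min(waves) > THRESHOLD_WAVE)

def merge_pass(clusters):
    """One recombination pass: sort clusters by minimum T (primary key) and
    minimum wave (secondary key), then merge consecutive clusters whose spread
    stays within both thresholds."""
    order = sorted(clusters, key=lambda c: (min(n.T for n in c),
                                            min(n.wave for n in c)))
    merged, current = [], list(order[0])
    for cluster in order[1:]:
        t_min = min(n.T for n in current)
        w_min = min(n.wave for n in current)
        t_max = max(n.T for n in cluster)
        w_max = max(n.wave for n in cluster)
        if t_max - t_min <= THRESHOLD_T and w_max - w_min <= THRESHOLD_WAVE:
            current.extend(cluster)
        else:
            merged.append(current)
            current = list(cluster)
    merged.append(current)
    for c in merged:                              # renormalization rule: an assumption
        base = min(n.epoch_rela for n in c)
        for n in c:
            n.epoch_rela -= base
    return merged

# Example mirroring the three-node scenario above, after node_2's update.
groups = [[Node(1.0, 2.2, 1, 1)], [Node(1.1, 2.4, 2, 2)], [Node(4.6, 8.3, 1, 1)]]
groups = merge_pass(groups)   # the first two clusters merge; the third stays separate
```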
Step 3, mixed synchronous training:
the main purpose of the step is to train the model according to different synchronous strategies by utilizing the self-organizing grouping result, and the specific sub-processes are as follows:
Step 3.1: after a round of iteration of any computing node finishes, compute the maximum iteration round epoch_max allowed inside the similar cluster, where epoch_max is determined by wave, the average of the fluctuation values of all nodes in the similar cluster, and nodeNumber, the number of nodes in the similar cluster.
Step 3.2, judging whether the next step needs to enter a waiting process according to the cluster state, specifically as follows:
1) Judge whether the absolute iteration-round difference between this node and all the nodes in the cluster is greater than the threshold Threshold_max; if so, the node enters global waiting.
2) Judge, from the relative iteration rounds of this node and of all nodes in the same similar cluster in groups, whether waiting is needed: if the difference between this node's relative iteration round and the minimum relative round in the similar cluster is greater than epoch_max, the node enters similar-cluster waiting.
3) Otherwise, the distributed training does not enter a waiting process.
Step 3.3: the waiting process is specifically as follows:
1) if no waiting process is entered in step 3.2, the distributed neural network continues to the next iteration in the manner of step 3.1.
2) If the global waiting process is entered in step 3.2, the node can enter the next round of iteration only after all computing nodes have iterated to the same absolute round as this node; when the global waiting process exits, the relative iteration rounds of all nodes are reset to 0.
3) If the similar-cluster waiting process is entered in step 3.2, the node must wait until all nodes in its similar cluster reach the same relative round as this node before performing the next iteration; when the similar-cluster waiting process exits, the relative iteration rounds of all nodes in the similar cluster are updated.
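A sketch of the waiting decision of steps 3.2 and 3.3 follows. It assumes that the global check compares the node against every computing node of the system, and that epoch_max is supplied by the caller, since its formula appears in the original only as a figure; the node structures are assumed to expose epoch_abs and epoch_rela fields as in the step-2 sketch.

```python
def wait_state(node, cluster, all_nodes, threshold_max, epoch_max):
    """Return 'global', 'cluster', or 'run' for a node that just finished a round.

    node      -- the node structure that just completed an iteration
    cluster   -- the node structures of its similar cluster (including itself)
    all_nodes -- every node structure in the distributed system
    """
    # Global synchronization barrier: absolute iteration gap to the slowest node.
    if node.epoch_abs - min(n.epoch_abs for n in all_nodes) > threshold_max:
        return 'global'
    # Similar-cluster barrier: relative iteration gap inside the node's cluster.
    if node.epoch_rela - min(n.epoch_rela for n in cluster) > epoch_max:
        return 'cluster'
    # Otherwise the node keeps optimizing asynchronously and starts the next round.
    return 'run'

# Example with the merged groups from the step-2 sketch:
# state = wait_state(groups[0][1], groups[0],
#                    [n for c in groups for n in c],
#                    threshold_max=8, epoch_max=1)
```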

Claims (7)

1. A distributed neural network hybrid synchronous training method based on self-organizing grouping of computing nodes is characterized in that,
the method comprises the following steps:
s1, predicting the synchronization efficiency of the calculation node by adopting Kalman filtering, and analyzing the fluctuation condition of the synchronization efficiency of the calculation node by adopting mean square error;
s2, carrying out self-organizing real-time grouping on the computing nodes based on the predicted synchronization efficiency of each computing node and the similarity of the fluctuation condition of the synchronization efficiency;
and S3, training models by adopting different synchronization strategies between groups and in groups according to the self-organizing real-time grouping result of the computing nodes.
2. The distributed neural network hybrid synchronous training method based on the self-organizing grouping of the compute nodes as claimed in claim 1, wherein the step S1 specifically includes:
S11, for any computing node i, the parameter server collects the time of each recent iteration round to form a fixed-size time window set T_i; each element of the time window set T_i records the time taken, within one iteration, from the parameter server sending the latest model data to the node until it finishes receiving the node's latest gradient parameters;
S12, according to the efficiency of the node's previous iteration and the state of the Kalman filter, an efficiency evaluation prediction is made for the next iteration, giving the node's predicted next synchronization efficiency T_i^predict;
S13, the mean square error of the elements in the time window set T_i is solved to analyze the node's recent synchronization-efficiency fluctuation, recorded as wave_i; the two variables are used as the output of compute-node evaluation and prediction.
3. The distributed neural network hybrid synchronous training method based on the self-organizing grouping of the compute nodes as claimed in claim 1, wherein the step S2 specifically includes:
s21, forming a similar cluster by a plurality of similar computing nodes, and judging whether the node needs to be moved out of the similar cluster according to the node similarity condition when each iteration of any node starts;
s22, removing the nodes needing to be moved out of the similar cluster, and creating a new similar cluster for the nodes;
and S23, recombining and merging the similar clusters meeting the merging condition, and updating the internal information of the similar clusters.
4. The distributed neural network hybrid synchronous training method based on self-organizing grouping of computing nodes as claimed in claim 3, wherein in step S21, the node similarity condition includes:
for any two computing nodes x and y in the same similar cluster group_similar, the following must hold:
|T_x^predict - T_y^predict| ≤ Threshold_T and |wave_x - wave_y| ≤ Threshold_wave
where Threshold_T is the threshold for node efficiencies being similar and Threshold_wave is the threshold for recent efficiency fluctuations being similar; T_x^predict is the predicted synchronization efficiency of node x in the similar cluster, T_y^predict is the predicted synchronization efficiency of node y in the similar cluster, wave_x is the recent synchronization-efficiency fluctuation value of node x in the similar cluster, and wave_y is the recent synchronization-efficiency fluctuation value of node y in the similar cluster.
5. The distributed neural network hybrid synchronous training method based on self-organizing grouping of computing nodes as claimed in claim 3, wherein in step S23, the merging condition comprises:
for any two similar clusters, merging can be performed as long as any two nodes u and v from the two similar clusters satisfy
|T_u^predict - T_v^predict| ≤ Threshold_T and |wave_u - wave_v| ≤ Threshold_wave.
6. The distributed neural network hybrid synchronous training method based on the self-organizing grouping of the compute nodes as claimed in claim 1, wherein the step S3 specifically includes:
S31, when a round of iteration of any computing node finishes, compute the maximum iteration round epoch_max allowed inside the similar cluster, where epoch_max is determined by wave, the average fluctuation value of all nodes in the similar cluster, and nodeNumber, the number of nodes in the similar cluster;
s32, judging whether the node enters global waiting or cluster waiting, if so, entering a step S33, otherwise, entering a step S34;
S33, if the node enters global waiting, it can continue to operate only after all nodes have reached the same iteration round; if the node enters cluster waiting, it can continue to operate only after all nodes in the same similar cluster have reached the same iteration round;
and S34, if the node does not enter global waiting or cluster waiting, the node performs single-machine asynchronous optimization, enters the next iteration and returns to the step S31.
7. The distributed neural network hybrid synchronous training method based on self-organizing grouping of computing nodes as claimed in claim 6, wherein in step S32, the determining whether the node will enter global waiting or cluster waiting specifically includes:
(1) judging whether the absolute iteration-round difference between the node and all nodes in the cluster is greater than a preset threshold, and entering global waiting if it is;
(2) entering similar-cluster waiting if the difference between the node's relative iteration round and the minimum relative round in the similar cluster is greater than epoch_max;
the absolute iteration round refers to the total number of iterations the node has completed so far, and the relative iteration round refers to the node's iteration round within the similar cluster in which it is located.
CN202010662415.0A 2020-07-10 2020-07-10 Distributed neural network hybrid synchronous training method based on self-organizing grouping of computing nodes Active CN111813858B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010662415.0A CN111813858B (en) 2020-07-10 2020-07-10 Distributed neural network hybrid synchronous training method based on self-organizing grouping of computing nodes

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010662415.0A CN111813858B (en) 2020-07-10 2020-07-10 Distributed neural network hybrid synchronous training method based on self-organizing grouping of computing nodes

Publications (2)

Publication Number Publication Date
CN111813858A (en) 2020-10-23
CN111813858B CN111813858B (en) 2022-06-24

Family

ID=72841719

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010662415.0A Active CN111813858B (en) 2020-07-10 2020-07-10 Distributed neural network hybrid synchronous training method based on self-organizing grouping of computing nodes

Country Status (1)

Country Link
CN (1) CN111813858B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112633480A (en) * 2020-12-31 2021-04-09 中山大学 Calculation optimization method and system of semi-asynchronous parallel neural network
CN113035349A (en) * 2021-03-25 2021-06-25 浙江大学 Neural network dynamic fusion method for genetic metabolic disease multi-center screening
CN115865607A (en) * 2023-03-01 2023-03-28 山东海量信息技术研究院 Distributed training computing node management method and related device
CN117155928A (en) * 2023-10-31 2023-12-01 浪潮电子信息产业股份有限公司 Communication task processing method, system, equipment, cluster and readable storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020184171A1 (en) * 2001-06-05 2002-12-05 Mcclanahan Craig J. System and method for organizing color values using an artificial intelligence based cluster model
CN103914735A (en) * 2014-04-17 2014-07-09 北京泰乐德信息技术有限公司 Failure recognition method and system based on neural network self-learning
CN106570563A (en) * 2015-10-13 2017-04-19 中国石油天然气股份有限公司 Deformation prediction method and apparatus based on Kalman filtering and BP neural network
CN108366386A (en) * 2018-05-11 2018-08-03 东南大学 A method of using neural fusion wireless network fault detect
CN110135575A (en) * 2017-12-29 2019-08-16 英特尔公司 Communication optimization for distributed machines study
US20190273510A1 (en) * 2018-03-01 2019-09-05 Crowdstrike, Inc. Classification of source data by neural network processing

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020184171A1 (en) * 2001-06-05 2002-12-05 Mcclanahan Craig J. System and method for organizing color values using an artificial intelligence based cluster model
CN103914735A (en) * 2014-04-17 2014-07-09 北京泰乐德信息技术有限公司 Failure recognition method and system based on neural network self-learning
CN106570563A (en) * 2015-10-13 2017-04-19 中国石油天然气股份有限公司 Deformation prediction method and apparatus based on Kalman filtering and BP neural network
CN110135575A (en) * 2017-12-29 2019-08-16 英特尔公司 Communication optimization for distributed machines study
US20190273510A1 (en) * 2018-03-01 2019-09-05 Crowdstrike, Inc. Classification of source data by neural network processing
CN108366386A (en) * 2018-05-11 2018-08-03 东南大学 A method of using neural fusion wireless network fault detect

Non-Patent Citations (11)

* Cited by examiner, † Cited by third party
Title
G.A. CARPENTER 等: "The ART of adaptive pattern recognition by a self-organizing neural network", 《COMPUTER》 *
HANANEL HAZAN 等: "Unsupervised Learning with Self-Organizing Spiking Neural Networks", 《NEURAL NETWORKS》 *
XIAOYU WANG 等: "Convergence Study in Extended Kalman Filter-Based Training of Recurrent Neural Networks", 《NEURAL NETWORKS》 *
张栗粽 等: "面向大数据分布式存储的动态负载均衡算法", 《计算机科学》 *
曾喆昭 等: "不确定混沌系统的多项式函数模型补偿控制", 《物理学报》 *
李春华 等: "自组织特征映射神经网络原理和应用研究", 《北京师范大学学报(自然科学版)》 *
杨森 等: "应用自组织特征映射神经网络技术实现分布式入侵检测", 《计算机应用》 *
王法胜 等: "基于扩展卡尔曼粒子滤波算法的神经网络训练", 《算机工程与科学》 *
田晓宇 等: "基于Kalman滤波的神经网络学习算法及其应用", 《计算机与数字工程》 *
胡振涛 等: "基于容积卡尔曼滤波的神经网络训练算法", 《控制与决策》 *
谢渊: "分布式环境轨迹预测算法研究与实现", 《中国优秀博硕士学位论文全文数据库(硕士)信息科技辑》 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112633480A (en) * 2020-12-31 2021-04-09 中山大学 Calculation optimization method and system of semi-asynchronous parallel neural network
CN112633480B (en) * 2020-12-31 2024-01-23 中山大学 Calculation optimization method and system of semi-asynchronous parallel neural network
CN113035349A (en) * 2021-03-25 2021-06-25 浙江大学 Neural network dynamic fusion method for genetic metabolic disease multi-center screening
CN113035349B (en) * 2021-03-25 2024-01-05 浙江大学 Neural network dynamic fusion method for multi-center screening of genetic metabolic diseases
CN115865607A (en) * 2023-03-01 2023-03-28 山东海量信息技术研究院 Distributed training computing node management method and related device
CN117155928A (en) * 2023-10-31 2023-12-01 浪潮电子信息产业股份有限公司 Communication task processing method, system, equipment, cluster and readable storage medium
CN117155928B (en) * 2023-10-31 2024-02-09 浪潮电子信息产业股份有限公司 Communication task processing method, system, equipment, cluster and readable storage medium

Also Published As

Publication number Publication date
CN111813858B (en) 2022-06-24

Similar Documents

Publication Publication Date Title
CN111813858B (en) Distributed neural network hybrid synchronous training method based on self-organizing grouping of computing nodes
CN110223517B (en) Short-term traffic flow prediction method based on space-time correlation
CN110610242B (en) Method and device for setting weights of participants in federal learning
CN110460880B (en) Industrial wireless streaming media self-adaptive transmission method based on particle swarm and neural network
CN110995487B (en) Multi-service quality prediction method and device, computer equipment and readable storage medium
CN106933649B (en) Virtual machine load prediction method and system based on moving average and neural network
CN105471631B (en) Network flow prediction method based on traffic trends
CN108335487B (en) Road traffic state prediction system based on traffic state time sequence
CN113852432B (en) Spectrum Prediction Sensing Method Based on RCS-GRU Model
CN113778691B (en) Task migration decision method, device and system
CN112949828A (en) Graph convolution neural network traffic prediction method and system based on graph learning
CN108650065B (en) Window-based streaming data missing processing method
CN109067583A (en) A kind of resource prediction method and system based on edge calculations
CN113760511B (en) Vehicle edge calculation task unloading method based on depth certainty strategy
CN113887748B (en) Online federal learning task allocation method and device, and federal learning method and system
CN115051929A (en) Network fault prediction method and device based on self-supervision target perception neural network
CN112905436B (en) Quality evaluation prediction method for complex software
CN102130955B (en) System and method for generating alternative service set of composite service based on collaborative filtering
Yuan Jitter buffer control algorithm and simulation based on network traffic prediction
CN115865914A (en) Task unloading method based on federal deep reinforcement learning in vehicle edge calculation
Cui et al. The learning stimulated sensing-transmission coordination via age of updates in distributed uav swarm
CN113037648B (en) Data transmission method and device
CN115766475A (en) Semi-asynchronous power federal learning network based on communication efficiency and communication method thereof
CN114815755A (en) Method for establishing distributed real-time intelligent monitoring system based on intelligent cooperative reasoning
CN112001571B (en) Markov chain-based block chain performance analysis method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant