CN109032630A - The update method of global parameter in a kind of parameter server - Google Patents

The update method of global parameter in a kind of parameter server Download PDF

Info

Publication number
CN109032630A
CN109032630A (application CN201810695184.6A; granted as CN109032630B)
Authority
CN
China
Prior art keywords
parameter
global
working
working node
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810695184.6A
Other languages
Chinese (zh)
Other versions
CN109032630B (en)
Inventor
徐杰
唐淳
田野
盛纾纬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN201810695184.6A
Publication of CN109032630A
Application granted
Publication of CN109032630B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00: Arrangements for software engineering
    • G06F 8/60: Software deployment
    • G06F 8/65: Updates
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/061: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using biological neurons, e.g. biological neurons connected to an integrated circuit

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Molecular Biology (AREA)
  • Neurology (AREA)
  • Computer Security & Cryptography (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Computer And Data Communications (AREA)

Abstract

The invention discloses a method for updating global parameters in a parameter server. Relevant parameters of the parameter server are first set and the global parameters are initialized; training data are then downloaded from a database and preprocessed; the local parameters of each working node are computed from the preprocessed training data and the initialized global parameters; finally, the local parameters are returned to the parameter server, which iteratively updates the global parameters.

Description

Method for updating global parameters in parameter server
Technical Field
The invention belongs to the technical field of optical communication, and particularly relates to a method for updating global parameters in a parameter server.
Background
The goal in designing a distributed machine learning system is acceleration; in the ideal case the acceleration is linear, i.e., each additional compute node should yield roughly one additional unit of speedup over a single machine. However, synchronizing computational tasks or parameters across nodes usually introduces extra overhead, which may equal or even be several times the computational cost. If the system is poorly designed, this overhead can prevent the training from being accelerated on multiple machines at all; in the worst case, a training program that uses many times the computing resources of a single machine runs slower than on that single machine.
The parameter server (PS) architecture is comparable to the client-server (CS) architecture and abstracts two main concepts: the parameter server and the working node. The server holds data, and a computing node can send data to the server or request that the server send data back. With these two concepts, the computation flow of distributed machine learning can be mapped onto the two PS modules, the server and the working nodes: the server side of the PS maintains the globally shared model parameters w_t, and the clients correspond to the working nodes that execute the computing tasks. The server side exposes two main APIs to the clients: push and pull.
At the start of each iteration, every client calls pull to request the latest model parameters from the server; after receiving them, each computing node copies the latest parameters over its old ones and then performs its computation to obtain the gradient update values. In other words, the pull operation of the PS ensures that each compute node obtains a copy of the latest parameters before computation starts.
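The pull/push exchange described above can be summarized in a short sketch. This is only an illustrative Python outline; the class and method names (ParameterServer, Worker, run_iteration) are hypothetical and not taken from the patent.

```python
# Minimal sketch of the pull/push exchange in a parameter-server architecture.
import numpy as np

class ParameterServer:
    def __init__(self, dim):
        self.w = np.zeros(dim)        # globally shared model parameters w_t

    def pull(self):
        return self.w.copy()          # hand each worker a copy of the latest parameters

    def push(self, update):
        self.w += update              # apply an update sent by a working node


class Worker:
    def __init__(self, server):
        self.server = server

    def run_iteration(self, compute_gradient, lr=0.01):
        w = self.server.pull()        # pull: copy the latest parameters over the old ones
        grad = compute_gradient(w)    # local computation on this node's mini-batch
        self.server.push(-lr * grad)  # push: send the gradient update back to the server
```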
In practice, a distributed cluster environment suffers from problems such as network delay, and the machines in the cluster differ in performance. When a distributed deep learning algorithm runs in such a heterogeneous cluster environment, the stability of the algorithm degrades and, in severe cases, the model fails to converge. This defeats the original purpose of using distributed clusters for model training and fails to accelerate the neural network training process.
For the asynchronous stochastic gradient descent (SGD) algorithm, delay in the system harms effective convergence: when a fast working node has finished several iterations and its updates have already been applied to the global parameters, the parameter server may then receive a delayed update from a slow working node. If that delayed update is applied in the same way, the global parameters drift away from the direction of the optimal solution, slowing the convergence of the whole model.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a method for updating global parameters in a parameter server that dynamically updates the global parameters according to the delay degree of the weight parameters, thereby reducing the influence of high delay on the algorithm.
In order to achieve the above object, the present invention provides a method for updating global parameters in a parameter server, comprising the following steps:
(1) global parameter initialization
Setting a global timestamp t and a maximum number T of training iterations, where t = 0, 1, 2, …, T-1; when t = 0, the parameter server randomly initializes the global parameters and sends the initialized global parameters w_0 to all working nodes;
(2) training data preprocessing
Downloading a plurality of training data from a database, equally dividing the training data into m parts according to the number of the working nodes, respectively distributing the training data to the m working nodes, and storing the training data in local data blocks of the working nodes;
(3) each working node trains on the basis of the global parameters w_0 to obtain its local parameters;
(3.1) at the t-th timestamp, each working node randomly extracts n sample data from its local data block, where n is less than m;
(3.2) using the Mini-batch algorithm, train the n sample data with the global parameters w_0 to obtain the training output value of each sample on each node, where i = 1, 2, …, n, j = 1, 2, …, m, and m is the number of working nodes;
(3.3) calculate the loss function value L_j of the jth working node, where the loss compares each training output value with the expected output value of the jth working node for the ith training sample;
(3.4) compute the gradient value ∇L_j from the loss function value L_j;
(3.5) calculate the local parameter of the jth working node at the t-th timestamp, where η denotes the learning rate;
(4) global parameter update
(4.1) the parameter server receives the local parameters transmitted by each working node in turn and determines the delay degree of each working node according to the order of arrival: d_j = t - t_τ, where d_j is the delay degree of the jth working node and t_τ is the timestamp of the global parameters used by the jth node in its previous update round;
(4.2) from the delay degree d_j of the jth working node, calculate the parameter α_j of the jth working node, where c is a constant;
(4.3) the parameter server updates the global parameters of the jth working node; similarly, the parameter server updates the global parameters of the remaining working nodes at the t-th timestamp in turn;
(4.4) after the global parameter updates of all working nodes at the t-th timestamp are completed, let t = t + 1; the parameter server returns the updated global parameters w_{t+1} to the corresponding working nodes, and the procedure returns to step (3). This is repeated until the global timestamp t reaches the maximum number of iterations, at which point the iteration ends and the global parameter update is complete.
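A minimal sketch of steps (3) and (4) follows, assuming a plain SGD step for the local parameters, the exponential form α_j = c·e^(-d_j/m) (which matches the worked example in the detailed description, where d_j = 8, m = 4 and c = 1 give α ≈ 0.13), and a linear interpolation for step (4.3). The exact formulas appear only as figures in the source, so these forms are assumptions, not the claimed equations.

```python
# Sketch of steps (3)-(4) under the stated assumptions; formulas are reconstructions, not the
# patent's literal equations.
import numpy as np

def local_update(w_global, grad_fn, eta=0.01):
    """Step (3): a working node computes its local parameters from the received global ones."""
    grad = grad_fn(w_global)              # gradient of the node's mini-batch loss L_j
    return w_global - eta * grad          # assumed SGD step for the local parameters

def delay_degree(t, t_tau):
    """Step (4.1): d_j = t - t_tau, where t_tau is the timestamp of the global
    parameters the node used in its previous update round."""
    return t - t_tau

def mixing_coefficient(d_j, m, c=1.0):
    """Step (4.2), assumed form: the exponential keeps alpha_j within (0, 1] for c = 1."""
    return c * np.exp(-d_j / m)

def server_update(w_global, w_local, alpha_j):
    """Step (4.3), assumed linear interpolation between the old global parameters
    and the node's local parameters."""
    return (1.0 - alpha_j) * w_global + alpha_j * w_local
```

For example, server_update(w, w_local, mixing_coefficient(8, 4)) reproduces the heavily delayed case discussed later: the stale local parameters contribute only about 13% of the new global parameters.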
The invention aims to realize the following steps:
the invention relates to a method for updating global parameters in a parameter server, which comprises the steps of firstly setting relevant parameters of the parameter server, initializing the global parameters, then downloading training data from a database, preprocessing the training data, calculating local parameters of each working node by using the preprocessed training parameters and the initialized global parameters, finally returning the local parameters to the parameter server, and iteratively updating the global parameters through the parameter server.
Meanwhile, the method for updating the global parameters in the parameter server has the following beneficial effects:
(1) The method replaces the transmission of gradient values over the network with the direct transmission of weight parameters, and then performs a linear interpolation over all parameters on the parameter server. This solves the non-convergence of asynchronous-protocol algorithms on large data sets and achieves better results on image classification problems.
(2) In a heterogeneous cluster, the method can sense the delay degree of the weight parameters transmitted by each working node and dynamically determine the global parameter update according to that delay degree, effectively reducing the influence of high delay on the global parameters, so that the algorithm is more stable in heterogeneous clusters.
Drawings
FIG. 1 is a flowchart of a method for updating global parameters in a parameter server according to the present invention;
FIG. 2 is a schematic diagram of an asynchronous SGD algorithm update;
FIG. 3 is a schematic diagram of a first step of the asynchronous SGD algorithm;
FIG. 4 is a schematic diagram of a second step of the asynchronous SGD algorithm;
FIG. 5 is a schematic diagram of an asynchronous parameter-transferring SGD algorithm update;
FIG. 6 is a schematic diagram illustrating a first operation step of the asynchronous parameter-transferring SGD algorithm;
FIG. 7 is a schematic diagram illustrating a second operation of the asynchronous parameter-transferring SGD algorithm;
FIG. 8 is a schematic diagram of the effect of high latency update values on global parameters;
FIG. 9 is a schematic diagram of the handling of high latency using dynamic α;
FIG. 10 is a statistical histogram of the number of iterations in convergence;
FIG. 11 is a statistical histogram of the average calculated time of the work nodes;
FIG. 12 is a statistical histogram of the time used at convergence.
Detailed Description
The following description of embodiments of the present invention is provided with reference to the accompanying drawings so that those skilled in the art can better understand the present invention. It is expressly noted that in the following description, detailed descriptions of known functions and designs are omitted where they would obscure the subject matter of the present invention.
Examples
FIG. 1 is a flowchart of a method for updating global parameters in a parameter server according to the present invention.
In this embodiment, as shown in fig. 1, a method for updating global parameters in a parameter server according to the present invention includes the following steps:
s1, initializing global parameters
Setting a global timestamp t and a maximum number T of training iterations, where t = 0, 1, 2, …, T-1; when t = 0, the parameter server randomly initializes the global parameters and sends the initialized global parameters w_0 to all working nodes;
s2, preprocessing training data
Downloading a plurality of training data from a database, equally dividing the training data into m parts according to the number of the working nodes, respectively distributing the training data to the m working nodes, and storing the training data in local data blocks of the working nodes;
S3, each working node trains on the basis of the global parameters w_0 to obtain its local parameters;
S3.1, at the t-th timestamp, each working node randomly extracts n sample data from its local data block, where n is less than m;
S3.2, using the Mini-batch algorithm, train the n sample data with the global parameters w_0 to obtain the training output value of each sample on each node, where i = 1, 2, …, n, j = 1, 2, …, m, and m is the number of working nodes;
S3.3, calculate the loss function value L_j of the jth working node, where the loss compares each training output value with the expected output value of the jth working node for the ith training sample;
S3.4, compute the gradient value ∇L_j from the loss function value L_j;
Here, when the sample data are trained with the Mini-batch algorithm, the weight between neuron a and neuron b is w_ab, the output of this layer is x, and the output of the previous layer is v, satisfying x = v · w_ab;
S3.5, calculate the local parameter of the jth working node at the t-th timestamp, where η denotes the learning rate;
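A compact illustration of steps S3.1 through S3.5 for a single linear layer (using the relation x = v · w_ab quoted above) is given below. The patent's loss function is not reproduced in this text, so a mean-squared-error loss is assumed purely for illustration; the function and variable names are the editor's.

```python
# One pass over S3.1-S3.5 for a single linear layer; the MSE loss is an assumption.
import numpy as np

def worker_minibatch_step(w, data_block, labels, n, eta=0.01, rng=None):
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(data_block), size=n, replace=False)  # S3.1: draw n samples
    v, y = data_block[idx], labels[idx]
    x = v @ w                                                  # S3.2: layer output, x = v . w_ab
    loss = 0.5 * np.mean((x - y) ** 2)                         # S3.3: assumed MSE loss L_j
    grad = v.T @ (x - y) / n                                   # S3.4: gradient of L_j w.r.t. w
    w_local = w - eta * grad                                   # S3.5: local parameters of this node
    return w_local, loss
```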
In this embodiment, note that in the conventional setting the value transmitted between a working node and the parameter server is a gradient value: the working node computes the gradient value of the current mini-batch and transmits it to the parameter server. After receiving the gradient values, the parameter server updates the global parameters according to a given algorithm; a synchronous protocol averages all gradient values and adds the average to the global parameters, whereas an asynchronous protocol adds each gradient value to the global parameters directly.
However, updating the global parameters with gradient values causes problems in the asynchronous case. Because the global parameters are updated every time any working node transmits a gradient value to the parameter server, many versions of the global parameters exist, which greatly harms the effective convergence of the algorithm. As shown in FIG. 2, the first row shows the update process of the global parameters and the second row the corresponding working nodes. When the algorithm starts, W1, W2 and W3 take the initial global parameters w_0. Assuming the global parameters are updated in the order in which the working nodes are arranged in the figure, when W1 transmits gradient value a to the parameter server, the latest global parameters w_a are obtained; the gradient value b of W2 is then received by the parameter server, and the computed global parameters w_{a,b} are returned to W2 as the initial parameters for its next iteration. The parameter exchange based on the asynchronous protocol continues in this way by analogy.
The problem can be seen by taking W1 as an example. In the first step of the algorithm, the initial parameters used for training are w_0; the gradient value a is computed and transmitted to the parameter server, where the stored value is still w_0, so the latest parameters w_a are computed and passed back to W1 as the initial parameters for the next iteration. When W1 completes its second iteration and transmits the resulting gradient value d to the parameter server, the latest global parameters in the parameter server are already w_{a,b,c}. The gradient d was computed from w_a but is ultimately applied to w_{a,b,c}. In this case the update of the global parameters looks like a random walk, and when the number of machines increases this randomness leads to more serious consequences; FIGS. 3 and 4 illustrate the problem more intuitively.
Assume the working nodes always update the parameters in the same order; in the first step, all working nodes compute their gradients using the same initial global parameters. As shown in FIG. 4, the solid arrows represent the gradient vectors from the respective working nodes. The updates are applied in the order 1, 2, 3, and when the third machine finishes its update, the global parameters arrive at the end position 4 shown in FIG. 4. Ideally, after the update, working node 1 would take the value at point 1, working node 2 the global parameters at point 2, and working node 3 the value at point 3, after which the second-step computation would be performed, as shown in FIG. 4.
The update produced by working node 1 was obtained by training from the parameters at point 1, but after it is transmitted to the parameter server the gradient value is applied at the starting position in the figure; similarly, the gradient values obtained by working nodes 2 and 3 are also applied to the global parameters, so the real global parameters end up far from the locally computed parameters. With more machines, and taking delay into account, the difference becomes even larger, and the update of the global parameters resembles a random walk.
S4, global parameter updating
S4.1, the concept of delay degree is introduced and quantified. Since the update value of every working node carries some delay, the parameter server can use the delay degree to decide whether a value should be used to update the global parameters. The delay degree is calculated as follows:
the parameter server receives the local parameters transmitted by each working node in turn and determines the delay degree of each working node according to the order of arrival: d_j = t - t_τ, where d_j is the delay degree of the jth working node and t_τ is the timestamp of the global parameters used by the jth node in its previous update round;
S4.2, from the delay degree d_j of the jth working node, calculate the parameter α_j of the jth working node, where c is a constant; an exponential function is used here so that the value of α_j remains in the range 0 to 1.
S4.3, the parameter server updates the global parameters of the jth working node; similarly, the parameter server updates the global parameters of the remaining working nodes at the t-th timestamp in turn;
S4.4, after the global parameter updates of all working nodes at the t-th timestamp are completed, let t = t + 1; the parameter server returns the updated global parameters w_{t+1} to the corresponding working nodes and the procedure returns to step S3. This is repeated until the global timestamp t reaches the maximum number of iterations, at which point the iteration ends and the global parameter update is complete.
In this embodiment, as shown in FIG. 5, the update value passed by each working node directly updates the global parameters. When the performance of the working nodes in the cluster differs greatly, the update sequences of worker1 and worker2 in FIG. 5 show that by the time worker1 passes its first update value, worker2 is already performing its third update and the global parameters in the parameter server have already been updated four times. At this point the delay of worker1's update value is already very large, yet under the previous update mechanism of the asynchronous SGD algorithm that value still directly updates the global parameters, which is unreasonable. From the foregoing analysis, it is concluded here that the impact of high-latency update values on the global parameters is a major cause of the degradation in performance and efficiency of algorithms in a distributed environment.
However, in a distributed environment with multiple working nodes, a working node should not completely replace the parameters on the server at each update, because the parameters passed by one working node represent the result of that node only. A global parameter update should incorporate the parameter values of all working nodes; that is, when the global parameters are updated, both the original global parameter values and the parameter values transmitted by the working node must be retained. There are many ways to achieve this; a linear interpolation method is chosen here.
When updating the global parameters, the difference between updating with local parameters and updating with gradient values is similar to moving a point to a new location using coordinates versus using a direction. When a direction is used, the destination reached differs depending on the starting point; when coordinates are used, the destination is the same regardless of the starting position, as shown in FIGS. 6 and 7.
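A tiny numeric example of this "direction versus coordinates" distinction, with arbitrary illustrative values (the value α = 0.5 below is chosen only for illustration):

```python
# Direction-style (gradient) updates depend on the starting point; coordinate-style
# (parameter) updates do not. The last loop reflects the linear interpolation chosen above.
import numpy as np

starts = [np.array([0.0, 0.0]), np.array([0.8, 0.8])]   # two different starting points

direction = np.array([-0.5, -0.5])                       # gradient-style update: a direction
for s in starts:
    print(s + direction)          # destinations differ: [-0.5 -0.5] vs [0.3 0.3]

target = np.array([0.3, 0.3])                            # parameter-style update: coordinates
for s in starts:
    print(target)                 # destination is the same regardless of the start

alpha = 0.5                                              # illustrative interpolation weight
for s in starts:
    print((1 - alpha) * s + alpha * target)              # each start moves part-way to the target
```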
As shown in FIG. 6, points a, b, and c represent the weight parameter values transmitted by the working nodes. Assuming the update order is fixed in each round, i.e., the global parameters are updated in the order a, b, c, the leftmost black point represents the global parameter value in the parameter server in the initial state, and the other black points 1, 2, and 3 represent the values obtained after updating, labeled in update order. As shown, the result of the first step of updating is the position of the black point labeled 3.
The second step is analyzed in the same way, as shown in FIG. 7. After the global parameters have been updated in two steps, i.e., after two complete iterations, the final position is closer to the optimal point than the initial starting value, and no random jump away from the optimal point occurs.
In a heterogeneous environment, high latency in the cluster has a large impact on the global parameters. This is illustrated with an example in FIG. 8. The rightmost black point represents the latest global parameters in the current parameter server, which are already very close to the optimal point. The parameter server then receives a heavily delayed update parameter, the position indicated by point 1. Under the parameter update mechanism of the asynchronous SGD algorithm, the new global parameters end up approximately at point 2 in the figure. Point 2 deviates from the optimal direction, and the deviation is very large relative to the initial value; this problem is even more pronounced in a cluster with severe delay, causing large fluctuations of the global parameters.
As shown in FIG. 9, the rightmost black point represents the latest global parameters in the current parameter server, and point 1 is the delayed update parameter. Assume its delay degree is 8, the number of working nodes m in the whole cluster is 4, and c is set to 1; the value of α is then 0.13, and the new global parameters are obtained at the position shown by point 2 in FIG. 9.
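Assuming the exponential form α = c·e^(-d/m) (the exact formula appears only as a figure in the source), the quoted numbers can be checked directly:

```python
# Numeric check of the worked example above under the assumed exponential form.
import math

d, m, c = 8, 4, 1
alpha = c * math.exp(-d / m)   # assumed form, not the patent's literal formula
print(round(alpha, 3))         # 0.135, consistent with the value of 0.13 quoted above
```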
Examples of the invention
To test the distributed algorithm, a convenient distributed cluster environment was built using one server and four working nodes; the hardware and software configuration information is shown in Table 1.
In this experiment, the logical main server and the parameter server run on the same physical server, which also hosts the data server and Redis. Because all working nodes are multi-core processors, each working node in the experimental environment can simulate several computing nodes simply by opening several tab pages in a browser.
Table 1 is an experimental environment configuration information table;
TABLE 1
All experiments in this embodiment address the image classification problem, and the efficiency of the algorithm is analyzed through indexes such as the training error rate during training. Two classical image data sets, MNIST and CIFAR10, are selected, and the new algorithm is tested on both so that each function of the algorithm can be effectively examined. MNIST is a grayscale handwritten digit image data set of size 28×28×1. The training data set contains 50000 images in 10 classes. For the MNIST data set, the structure of the CNN can be configured directly through the user interface of MLitB; the configuration parameters are shown in Table 2.
Table 2 is a table of CNN network parameter information on MNIST data set
Network layer indexing Network layer type Parameter information
1 Input layer size=(28,28,1)
2 Convolutional layer size=(5,5),stride=1,filters=8,actFun=relu
3 Pooling layer size=(2,2),stride=2
4 Convolutional layer size=(5,5),stride=1,filters=16,actFun=relu
5 Pooling layer size=(3,3),stride=3
6 Full connection layer neurons=10,actFun=softmax
TABLE 2
For all experiments, the mini-batch size N_c is set to 100 and the learning rate η is set to 0.01; each experiment is run 5 times and the averaged results are plotted.
In addition, in this embodiment, the heterogeneity degree HL is set to 1 and 2, and the performance of the synchronous SGD algorithm, the asynchronous SGD algorithm, and the asynchronous parameter-transferring SGD algorithm on the MNIST data set is tested for each setting.
When the heterogeneity degree is 1, the computation time of all working nodes is kept around 1 second by adding a delay; when the heterogeneity degree is 2, the delay of half of the working nodes is increased to 2 seconds. In the code on the parameter server side, a variable step records the number of iterations of the whole system. The average computation time of a working node is obtained by taking system timestamps before and after each computation on each working node, printing them to the browser console, and averaging.
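The averaging of per-iteration timestamps described above could be collected with a helper like the following. This is only an illustrative Python sketch (the experiments themselves run in MLitB in a browser), and the function name is hypothetical.

```python
# Illustrative measurement of the average per-iteration computation time of a working node.
import time

def measure_average_compute_time(compute_step, iterations):
    durations = []
    for _ in range(iterations):
        start = time.time()                      # timestamp before the computation
        compute_step()                           # one local training step on the working node
        durations.append(time.time() - start)    # timestamp after, minus before
    return sum(durations) / len(durations)       # averaged over all iterations
```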
FIG. 10 shows how the number of iterations needed by the several algorithms to reach a specified error rate changes when the heterogeneity degree is increased from 1 to 2. The significance of this index is that, at convergence, a smaller iteration count indicates that each iteration produces more effective updates, i.e., the global parameters move closer to the optimal solution, whereas a larger iteration count indicates that the algorithm produces fewer effective updates over the whole training process. When the heterogeneity degree is increased to 2, the number of iterations needed by the asynchronous SGD algorithm to converge increases markedly, showing that the random-jump behavior seriously harms the convergence of the model. The number of iterations of the asynchronous parameter-transferring SGD algorithm increases much less, showing that the algorithm can sense delay in the cluster, reduce the influence of high delay on the global parameters, and reduce invalid updates.
FIG. 11 shows the increase in the average computation time of the several algorithms when the heterogeneity degree is increased from 1 to 2. When the heterogeneity in the cluster increases, an average computation time that grows linearly with the heterogeneity degree indicates that slow nodes strongly affect fast nodes during the operation of the algorithm. Ideally, the average computation time should grow slowly rather than linearly as the heterogeneity degree increases; such a result indicates that slow nodes in the cluster have less influence on fast nodes and that the utilization of computing resources improves. The average computation time of the synchronous SGD algorithm grows almost linearly: when the heterogeneity degree is 2, the average computation time doubles, because the strict synchronization mechanism makes each round's computation time be determined by the slowest working node, and the experimental result agrees with the theoretical analysis. The average computation time of the algorithms using the asynchronous protocol increases much less, and by almost the same amount; combining the two indexes of iteration count and average computation time shows that the asynchronous protocol allows the computing resources in the cluster to be fully utilized.
FIG. 12 shows how the time needed by the several algorithms to reach a specified error rate changes when the heterogeneity degree is increased from 1 to 2. The overall running time of an algorithm as cluster heterogeneity increases gives a direct view of its speed; combined with the two previous indexes, it indicates the overall performance of the algorithm and, crucially, whether the algorithm can converge within an effective time. The synchronous SGD algorithm does not perform well on average computation time, but its updates are the most effective, so its overall running time is acceptable. The asynchronous SGD algorithm has obvious disadvantages in both average computation time and iteration count; when the heterogeneity degree increases, its overall running time grows greatly, so its efficiency drops noticeably. The asynchronous parameter-transferring SGD algorithm still performs best on overall running time. In conclusion, the asynchronous parameter-transferring SGD algorithm remains very stable as the heterogeneity degree increases.
In summary, the method provided by the invention changes the value transmitted between the parameter server and the working nodes from a gradient value to the weight parameters, and the algorithm dynamically adjusts the update mechanism of the global parameters according to the delay of the weight parameters, thereby reducing the influence of high delay on the algorithm. Experiments show that the algorithm achieves good results on image classification problems and runs stably in a heterogeneous environment.
Although illustrative embodiments of the present invention have been described above to help those skilled in the art understand the present invention, it should be understood that the present invention is not limited to the scope of these embodiments. To those skilled in the art, various changes are permissible as long as they remain within the spirit and scope of the present invention as defined by the appended claims, and all inventions that make use of the inventive concept are protected.

Claims (1)

1. A method for updating global parameters in a parameter server is characterized by comprising the following steps:
(1) global parameter initialization
Setting a global timestamp t and a maximum number T of training iterations, where t = 0, 1, 2, …, T-1; when t = 0, the parameter server randomly initializes the global parameters and sends the initialized global parameters w_0 to all working nodes;
(2) training data preprocessing
Downloading a plurality of training data from a database, equally dividing the training data into m parts according to the number of the working nodes, respectively distributing the training data to the m working nodes, and storing the training data in local data blocks of the working nodes;
(3) each working node trains on the basis of the global parameters w_0 to obtain its local parameters;
(3.1) at the t-th timestamp, each working node randomly extracts n sample data from its local data block, where n is less than m;
(3.2) using the Mini-batch algorithm, train the n sample data with the global parameters w_0 to obtain the training output value of each sample on each node, where i = 1, 2, …, n, j = 1, 2, …, m, and m is the number of working nodes;
(3.3) calculate the loss function value L_j of the jth working node, where the loss compares each training output value with the expected output value of the jth working node for the ith training sample;
(3.4) compute the gradient value ∇L_j from the loss function value L_j;
(3.5) calculate the local parameter of the jth working node at the t-th timestamp, where η denotes the learning rate;
(4) global parameter update
(4.1) the parameter server receives the local parameters transmitted by each working node in turn and determines the delay degree of each working node according to the order of arrival: d_j = t - t_τ, where d_j is the delay degree of the jth working node and t_τ is the timestamp of the global parameters used by the jth node in its previous update round;
(4.2) from the delay degree d_j of the jth working node, calculate the parameter α_j of the jth working node, where c is a constant;
(4.3) the parameter server updates the global parameters of the jth working node; similarly, the parameter server updates the global parameters of the remaining working nodes at the t-th timestamp in turn;
(4.4) after the global parameter updates of all working nodes at the t-th timestamp are completed, let t = t + 1; the parameter server returns the updated global parameters w_{t+1} to the corresponding working nodes, and the procedure returns to step (3). This is repeated until the global timestamp t reaches the maximum number of iterations, at which point the iteration ends and the global parameter update is complete.
CN201810695184.6A 2018-06-29 2018-06-29 Method for updating global parameters in parameter server Active CN109032630B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810695184.6A CN109032630B (en) 2018-06-29 2018-06-29 Method for updating global parameters in parameter server

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810695184.6A CN109032630B (en) 2018-06-29 2018-06-29 Method for updating global parameters in parameter server

Publications (2)

Publication Number Publication Date
CN109032630A true CN109032630A (en) 2018-12-18
CN109032630B CN109032630B (en) 2021-05-14

Family

ID=65520873

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810695184.6A Active CN109032630B (en) 2018-06-29 2018-06-29 Method for updating global parameters in parameter server

Country Status (1)

Country Link
CN (1) CN109032630B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109710289A (en) * 2018-12-21 2019-05-03 南京邮电大学 The update method of distributed parameters server based on deeply learning algorithm
CN110929878A (en) * 2019-10-30 2020-03-27 同济大学 Distributed random gradient descent method
CN111461207A (en) * 2020-03-30 2020-07-28 北京奇艺世纪科技有限公司 Picture recognition model training system and method
CN112990422A (en) * 2019-12-12 2021-06-18 中科寒武纪科技股份有限公司 Parameter server, client and weight parameter processing method and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140067738A1 (en) * 2012-08-28 2014-03-06 International Business Machines Corporation Training Deep Neural Network Acoustic Models Using Distributed Hessian-Free Optimization
CN105630882A (en) * 2015-12-18 2016-06-01 哈尔滨工业大学深圳研究生院 Remote sensing data deep learning based offshore pollutant identifying and tracking method
CN106709565A (en) * 2016-11-16 2017-05-24 广州视源电子科技股份有限公司 Neural network optimization method and device
CN107784364A (en) * 2016-08-25 2018-03-09 微软技术许可有限责任公司 The asynchronous training of machine learning model

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140067738A1 (en) * 2012-08-28 2014-03-06 International Business Machines Corporation Training Deep Neural Network Acoustic Models Using Distributed Hessian-Free Optimization
CN105630882A (en) * 2015-12-18 2016-06-01 哈尔滨工业大学深圳研究生院 Remote sensing data deep learning based offshore pollutant identifying and tracking method
CN107784364A (en) * 2016-08-25 2018-03-09 微软技术许可有限责任公司 The asynchronous training of machine learning model
CN106709565A (en) * 2016-11-16 2017-05-24 广州视源电子科技股份有限公司 Neural network optimization method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
夏亚峰 et al.: "Bayesian inference of the loss function and risk function in lognormal parameter estimation", Journal of Lanzhou University of Technology *
肖红 et al.: "Process neural network training based on piecewise linear interpolation", Computer Engineering *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109710289A (en) * 2018-12-21 2019-05-03 南京邮电大学 The update method of distributed parameters server based on deeply learning algorithm
CN110929878A (en) * 2019-10-30 2020-03-27 同济大学 Distributed random gradient descent method
CN110929878B (en) * 2019-10-30 2023-07-04 同济大学 Distributed random gradient descent method
CN112990422A (en) * 2019-12-12 2021-06-18 中科寒武纪科技股份有限公司 Parameter server, client and weight parameter processing method and system
CN111461207A (en) * 2020-03-30 2020-07-28 北京奇艺世纪科技有限公司 Picture recognition model training system and method

Also Published As

Publication number Publication date
CN109032630B (en) 2021-05-14

Similar Documents

Publication Publication Date Title
CN109032630B (en) Method for updating global parameters in parameter server
CN111382844B (en) Training method and device for deep learning model
CN110533183B (en) Task placement method for heterogeneous network perception in pipeline distributed deep learning
CN108875955B (en) Gradient lifting decision tree implementation method based on parameter server and related equipment
CN108446761B (en) Neural network accelerator and data processing method
CN109951438A (en) A kind of communication optimization method and system of distribution deep learning
CN110659678B (en) User behavior classification method, system and storage medium
CN108108233B (en) Cluster job scheduling method and system for task multi-copy execution
CN107103359A (en) The online Reliability Prediction Method of big service system based on convolutional neural networks
CN112434789B (en) Distributed neural network model partitioning method for edge video analysis
Xiong et al. Straggler-resilient distributed machine learning with dynamic backup workers
CN114116995B (en) Session recommendation method, system and medium based on enhanced graph neural network
Zhang et al. Optimizing execution for pipelined‐based distributed deep learning in a heterogeneously networked GPU cluster
CN117675823A (en) Task processing method and device of computing power network, electronic equipment and storage medium
CN117042184A (en) Calculation unloading and resource allocation method based on deep reinforcement learning
CN116776969A (en) Federal learning method and apparatus, and computer-readable storage medium
WO2020037512A1 (en) Neural network calculation method and device
JP2020003860A (en) Learning system, processing device, processing method, and program
CN115529350A (en) Parameter optimization method and device, electronic equipment and readable storage medium
Ji et al. Performance prediction for distributed graph computing
CN116016212B (en) Decentralised federation learning method and device for bandwidth perception
Kazemi et al. Asynchronous delay-aware accelerated proximal coordinate descent for nonconvex nonsmooth problems
Yokoyama et al. Efficient distributed machine learning for large-scale models by reducing redundant communication
CN113641905B (en) Model training method, information pushing method, device, equipment and storage medium
US20240111607A1 (en) Similarity-based quantization selection for federated learning with heterogeneous edge devices

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant