CN109032630A - The update method of global parameter in a kind of parameter server - Google Patents

The update method of global parameter in a kind of parameter server Download PDF

Info

Publication number
CN109032630A
CN109032630A (application CN201810695184.6A; granted as CN109032630B)
Authority
CN
China
Prior art keywords
parameter
global
working
working node
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810695184.6A
Other languages
Chinese (zh)
Other versions
CN109032630B (en)
Inventor
徐杰
唐淳
田野
盛纾纬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN201810695184.6A
Publication of CN109032630A
Application granted
Publication of CN109032630B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00: Arrangements for software engineering
    • G06F 8/60: Software deployment
    • G06F 8/65: Updates
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/061: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using biological neurons, e.g. biological neurons connected to an integrated circuit

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Molecular Biology (AREA)
  • Neurology (AREA)
  • Computer Security & Cryptography (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Computer And Data Communications (AREA)

Abstract

The invention discloses a method for updating global parameters in a parameter server. Relevant parameters of the parameter server are first set and the global parameters are initialized; training data are then downloaded from a database and preprocessed; the local parameters of each working node are computed from the preprocessed training data and the initialized global parameters; finally, the local parameters are returned to the parameter server, which iteratively updates the global parameters.

Description

Method for updating global parameters in parameter server
Technical Field
The invention belongs to the technical field of optical communication, and particularly relates to a method for updating global parameters in a parameter server.
Background
The goal in designing a distributed machine learning system is acceleration; in the ideal case the acceleration is linear, i.e., each additional compute node should yield roughly one additional unit of speedup over a single machine. However, synchronizing computational tasks or parameters across nodes usually introduces extra overhead, which may equal or even be several times the computational cost. If the system is poorly designed, this overhead can prevent the training from being accelerated on multiple machines at all; in the worst case, a training program that uses many times the computing resources of a single machine runs slower than on that single machine.
The parameter server (PS) architecture is comparable to the client-server (CS) architecture and abstracts two main concepts: the parameter server and the working node. The server holds data, and a computing node can send data to the server or request that the server send data back. With these two concepts, the computation flow of distributed machine learning can be mapped onto the two PS modules, the server and the working nodes: the server side of the PS maintains the globally shared model parameters w_t, and the clients correspond to the working nodes that execute the computing tasks. The server side exposes two main APIs to the clients: push and pull.
At the start of each iteration, every client calls pull to request the latest model parameters from the server; after receiving them, each computing node copies the latest parameters over its old ones and then performs its computation to obtain the gradient update values. In other words, the pull operation of the PS ensures that each compute node obtains a copy of the latest parameters before computation starts.
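The pull/push exchange described above can be summarized in a short sketch. This is only an illustrative Python outline; the class and method names (ParameterServer, Worker, run_iteration) are hypothetical and not taken from the patent.

```python
# Minimal sketch of the pull/push exchange in a parameter-server architecture.
import numpy as np

class ParameterServer:
    def __init__(self, dim):
        self.w = np.zeros(dim)        # globally shared model parameters w_t

    def pull(self):
        return self.w.copy()          # hand each worker a copy of the latest parameters

    def push(self, update):
        self.w += update              # apply an update sent by a working node


class Worker:
    def __init__(self, server):
        self.server = server

    def run_iteration(self, compute_gradient, lr=0.01):
        w = self.server.pull()        # pull: copy the latest parameters over the old ones
        grad = compute_gradient(w)    # local computation on this node's mini-batch
        self.server.push(-lr * grad)  # push: send the gradient update back to the server
```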
In practice, a distributed cluster environment suffers from problems such as network delay, and the machines in the cluster differ in performance. When a distributed deep learning algorithm runs in such a heterogeneous cluster environment, the stability of the algorithm degrades and, in severe cases, the model fails to converge. This defeats the original purpose of using distributed clusters for model training and fails to accelerate the neural network training process.
For the asynchronous stochastic gradient descent (SGD) algorithm, delay in the system harms effective convergence: when a fast working node has finished several iterations and its updates have already been applied to the global parameters, the parameter server may then receive a delayed update from a slow working node. If that delayed update is applied in the same way, the global parameters drift away from the direction of the optimal solution, slowing the convergence of the whole model.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a method for updating global parameters in a parameter server that dynamically updates the global parameters according to the delay degree of the weight parameters, thereby reducing the influence of high delay on the algorithm.
In order to achieve the above object, the present invention provides a method for updating global parameters in a parameter server, comprising the following steps:
(1) global parameter initialization
Setting a global timestamp t and a maximum number T of training iterations, where t = 0, 1, 2, …, T-1; when t = 0, the parameter server randomly initializes the global parameters and sends the initialized global parameters w_0 to all working nodes;
(2) training data preprocessing
Downloading a plurality of training data from a database, equally dividing the training data into m parts according to the number of the working nodes, respectively distributing the training data to the m working nodes, and storing the training data in local data blocks of the working nodes;
(3) each working node trains on the basis of the global parameters w_0 to obtain its local parameters;
(3.1) at the t-th timestamp, each working node randomly extracts n sample data from its local data block, where n is less than m;
(3.2) using the Mini-batch algorithm, train the n sample data with the global parameters w_0 to obtain the training output value of each sample on each node, where i = 1, 2, …, n, j = 1, 2, …, m, and m is the number of working nodes;
(3.3) calculate the loss function value L_j of the jth working node, where the loss compares each training output value with the expected output value of the jth working node for the ith training sample;
(3.4) compute the gradient value ∇L_j from the loss function value L_j;
(3.5) calculate the local parameter of the jth working node at the t-th timestamp, where η denotes the learning rate;
(4) global parameter update
(4.1) the parameter server receives the local parameters transmitted by each working node in turn and determines the delay degree of each working node according to the order of arrival: d_j = t - t_τ, where d_j is the delay degree of the jth working node and t_τ is the timestamp of the global parameters used by the jth node in its previous update round;
(4.2) from the delay degree d_j of the jth working node, calculate the parameter α_j of the jth working node, where c is a constant;
(4.3) the parameter server updates the global parameters of the jth working node; similarly, the parameter server updates the global parameters of the remaining working nodes at the t-th timestamp in turn;
(4.4) after the global parameter updates of all working nodes at the t-th timestamp are completed, let t = t + 1; the parameter server returns the updated global parameters w_{t+1} to the corresponding working nodes, and the procedure returns to step (3). This is repeated until the global timestamp t reaches the maximum number of iterations, at which point the iteration ends and the global parameter update is complete.
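A minimal sketch of steps (3) and (4) follows, assuming a plain SGD step for the local parameters, the exponential form α_j = c·e^(-d_j/m) (which matches the worked example in the detailed description, where d_j = 8, m = 4 and c = 1 give α ≈ 0.13), and a linear interpolation for step (4.3). The exact formulas appear only as figures in the source, so these forms are assumptions, not the claimed equations.

```python
# Sketch of steps (3)-(4) under the stated assumptions; formulas are reconstructions, not the
# patent's literal equations.
import numpy as np

def local_update(w_global, grad_fn, eta=0.01):
    """Step (3): a working node computes its local parameters from the received global ones."""
    grad = grad_fn(w_global)              # gradient of the node's mini-batch loss L_j
    return w_global - eta * grad          # assumed SGD step for the local parameters

def delay_degree(t, t_tau):
    """Step (4.1): d_j = t - t_tau, where t_tau is the timestamp of the global
    parameters the node used in its previous update round."""
    return t - t_tau

def mixing_coefficient(d_j, m, c=1.0):
    """Step (4.2), assumed form: the exponential keeps alpha_j within (0, 1] for c = 1."""
    return c * np.exp(-d_j / m)

def server_update(w_global, w_local, alpha_j):
    """Step (4.3), assumed linear interpolation between the old global parameters
    and the node's local parameters."""
    return (1.0 - alpha_j) * w_global + alpha_j * w_local
```

For example, server_update(w, w_local, mixing_coefficient(8, 4)) reproduces the heavily delayed case discussed later: the stale local parameters contribute only about 13% of the new global parameters.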
The invention aims to realize the following steps:
the invention relates to a method for updating global parameters in a parameter server, which comprises the steps of firstly setting relevant parameters of the parameter server, initializing the global parameters, then downloading training data from a database, preprocessing the training data, calculating local parameters of each working node by using the preprocessed training parameters and the initialized global parameters, finally returning the local parameters to the parameter server, and iteratively updating the global parameters through the parameter server.
Meanwhile, the method for updating the global parameters in the parameter server has the following beneficial effects:
(1) The method replaces the transmission of gradient values over the network with the direct transmission of weight parameters, and then performs a linear interpolation over all parameters on the parameter server. This solves the non-convergence of asynchronous-protocol algorithms on large data sets and achieves better results on image classification problems.
(2) In a heterogeneous cluster, the method can sense the delay degree of the weight parameters transmitted by each working node and dynamically determine the global parameter update according to that delay degree, effectively reducing the influence of high delay on the global parameters, so that the algorithm is more stable in heterogeneous clusters.
Drawings
FIG. 1 is a flowchart of a method for updating global parameters in a parameter server according to the present invention;
FIG. 2 is a schematic diagram of an asynchronous SGD algorithm update;
FIG. 3 is a schematic diagram of a first step of the asynchronous SGD algorithm;
FIG. 4 is a schematic diagram of a second step of the asynchronous SGD algorithm;
FIG. 5 is a schematic diagram of an asynchronous parameter-transferring SGD algorithm update;
FIG. 6 is a schematic diagram illustrating a first operation step of the asynchronous parameter-transferring SGD algorithm;
FIG. 7 is a schematic diagram illustrating a second operation of the asynchronous parameter-transferring SGD algorithm;
FIG. 8 is a schematic diagram of the effect of high latency update values on global parameters;
FIG. 9 is a schematic diagram of the handling of high latency using dynamic α;
FIG. 10 is a statistical histogram of the number of iterations in convergence;
FIG. 11 is a statistical histogram of the average calculated time of the work nodes;
FIG. 12 is a statistical histogram of the time used at convergence.
Detailed Description
The following description of embodiments of the present invention is provided with reference to the accompanying drawings so that those skilled in the art can better understand the present invention. It is expressly noted that in the following description, detailed descriptions of known functions and designs are omitted where they would obscure the subject matter of the present invention.
Examples
FIG. 1 is a flowchart of a method for updating global parameters in a parameter server according to the present invention.
In this embodiment, as shown in fig. 1, a method for updating global parameters in a parameter server according to the present invention includes the following steps:
s1, initializing global parameters
Setting a global timestamp t and a maximum number T of training iterations, where t = 0, 1, 2, …, T-1; when t = 0, the parameter server randomly initializes the global parameters and sends the initialized global parameters w_0 to all working nodes;
s2, preprocessing training data
Downloading a plurality of training data from a database, equally dividing the training data into m parts according to the number of the working nodes, respectively distributing the training data to the m working nodes, and storing the training data in local data blocks of the working nodes;
S3, each working node trains on the basis of the global parameters w_0 to obtain its local parameters;
S3.1, at the t-th timestamp, each working node randomly extracts n sample data from its local data block, where n is less than m;
S3.2, using the Mini-batch algorithm, train the n sample data with the global parameters w_0 to obtain the training output value of each sample on each node, where i = 1, 2, …, n, j = 1, 2, …, m, and m is the number of working nodes;
S3.3, calculate the loss function value L_j of the jth working node, where the loss compares each training output value with the expected output value of the jth working node for the ith training sample;
S3.4, compute the gradient value ∇L_j from the loss function value L_j;
Here, when the sample data are trained with the Mini-batch algorithm, the weight between neuron a and neuron b is w_ab, the output of this layer is x, and the output of the previous layer is v, satisfying x = v · w_ab;
S3.5, calculate the local parameter of the jth working node at the t-th timestamp, where η denotes the learning rate;
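A compact illustration of steps S3.1 through S3.5 for a single linear layer (using the relation x = v · w_ab quoted above) is given below. The patent's loss function is not reproduced in this text, so a mean-squared-error loss is assumed purely for illustration; the function and variable names are the editor's.

```python
# One pass over S3.1-S3.5 for a single linear layer; the MSE loss is an assumption.
import numpy as np

def worker_minibatch_step(w, data_block, labels, n, eta=0.01, rng=None):
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(data_block), size=n, replace=False)  # S3.1: draw n samples
    v, y = data_block[idx], labels[idx]
    x = v @ w                                                  # S3.2: layer output, x = v . w_ab
    loss = 0.5 * np.mean((x - y) ** 2)                         # S3.3: assumed MSE loss L_j
    grad = v.T @ (x - y) / n                                   # S3.4: gradient of L_j w.r.t. w
    w_local = w - eta * grad                                   # S3.5: local parameters of this node
    return w_local, loss
```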
In this embodiment, note that in the conventional setting the value transmitted between a working node and the parameter server is a gradient value: the working node computes the gradient value of the current mini-batch and transmits it to the parameter server. After receiving the gradient values, the parameter server updates the global parameters according to a given algorithm; a synchronous protocol averages all gradient values and adds the average to the global parameters, whereas an asynchronous protocol adds each gradient value to the global parameters directly.
However, updating the global parameters with gradient values causes problems in the asynchronous case. Because the global parameters are updated every time any working node transmits a gradient value to the parameter server, many versions of the global parameters exist, which greatly harms the effective convergence of the algorithm. As shown in FIG. 2, the first row shows the update process of the global parameters and the second row the corresponding working nodes. When the algorithm starts, W1, W2 and W3 take the initial global parameters w_0. Assuming the global parameters are updated in the order in which the working nodes are arranged in the figure, when W1 transmits gradient value a to the parameter server, the latest global parameters w_a are obtained; the gradient value b of W2 is then received by the parameter server, and the computed global parameters w_{a,b} are returned to W2 as the initial parameters for its next iteration. The parameter exchange based on the asynchronous protocol continues in this way by analogy.
The problem can be seen by taking W1 as an example. In the first step of the algorithm, the initial parameters used for training are w_0; the gradient value a is computed and transmitted to the parameter server, where the stored value is still w_0, so the latest parameters w_a are computed and passed back to W1 as the initial parameters for the next iteration. When W1 completes its second iteration and transmits the resulting gradient value d to the parameter server, the latest global parameters in the parameter server are already w_{a,b,c}. The gradient d was computed from w_a but is ultimately applied to w_{a,b,c}. In this case the update of the global parameters looks like a random walk, and when the number of machines increases this randomness leads to more serious consequences; FIGS. 3 and 4 illustrate the problem more intuitively.
Assume the working nodes always update the parameters in the same order; in the first step, all working nodes compute their gradients using the same initial global parameters. As shown in FIG. 4, the solid arrows represent the gradient vectors from the respective working nodes. The updates are applied in the order 1, 2, 3, and when the third machine finishes its update, the global parameters arrive at the end position 4 shown in FIG. 4. Ideally, after the update, working node 1 would take the value at point 1, working node 2 the global parameters at point 2, and working node 3 the value at point 3, after which the second-step computation would be performed, as shown in FIG. 4.
The update produced by working node 1 was obtained by training from the parameters at point 1, but after it is transmitted to the parameter server the gradient value is applied at the starting position in the figure; similarly, the gradient values obtained by working nodes 2 and 3 are also applied to the global parameters, so the real global parameters end up far from the locally computed parameters. With more machines, and taking delay into account, the difference becomes even larger, and the update of the global parameters resembles a random walk.
S4, global parameter updating
S4.1, the concept of delay degree is introduced and quantified. Since the update value of every working node carries some delay, the parameter server can use the delay degree to decide whether a value should be used to update the global parameters. The delay degree is calculated as follows:
the parameter server receives the local parameters transmitted by each working node in turn and determines the delay degree of each working node according to the order of arrival: d_j = t - t_τ, where d_j is the delay degree of the jth working node and t_τ is the timestamp of the global parameters used by the jth node in its previous update round;
S4.2, from the delay degree d_j of the jth working node, calculate the parameter α_j of the jth working node, where c is a constant; an exponential function is used here so that the value of α_j remains in the range 0 to 1.
S4.3, the parameter server updates the global parameters of the jth working node; similarly, the parameter server updates the global parameters of the remaining working nodes at the t-th timestamp in turn;
S4.4, after the global parameter updates of all working nodes at the t-th timestamp are completed, let t = t + 1; the parameter server returns the updated global parameters w_{t+1} to the corresponding working nodes and the procedure returns to step S3. This is repeated until the global timestamp t reaches the maximum number of iterations, at which point the iteration ends and the global parameter update is complete.
In this embodiment, as shown in FIG. 5, the update value passed by each working node directly updates the global parameters. When the performance of the working nodes in the cluster differs greatly, the update sequences of worker1 and worker2 in FIG. 5 show that by the time worker1 passes its first update value, worker2 is already performing its third update and the global parameters in the parameter server have already been updated four times. At this point the delay of worker1's update value is already very large, yet under the previous update mechanism of the asynchronous SGD algorithm that value still directly updates the global parameters, which is unreasonable. From the foregoing analysis, it is concluded here that the impact of high-latency update values on the global parameters is a major cause of the degradation in performance and efficiency of algorithms in a distributed environment.
However, in a distributed environment with multiple working nodes, a working node should not completely replace the parameters on the server at each update, because the parameters passed by one working node represent the result of that node only. A global parameter update should incorporate the parameter values of all working nodes; that is, when the global parameters are updated, both the original global parameter values and the parameter values transmitted by the working node must be retained. There are many ways to achieve this; a linear interpolation method is chosen here.
When updating the global parameters, the difference between updating with local parameters and updating with gradient values is similar to moving a point to a new location using coordinates versus using a direction. When a direction is used, the destination reached differs depending on the starting point; when coordinates are used, the destination is the same regardless of the starting position, as shown in FIGS. 6 and 7.
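A tiny numeric example of this "direction versus coordinates" distinction, with arbitrary illustrative values (the value α = 0.5 below is chosen only for illustration):

```python
# Direction-style (gradient) updates depend on the starting point; coordinate-style
# (parameter) updates do not. The last loop reflects the linear interpolation chosen above.
import numpy as np

starts = [np.array([0.0, 0.0]), np.array([0.8, 0.8])]   # two different starting points

direction = np.array([-0.5, -0.5])                       # gradient-style update: a direction
for s in starts:
    print(s + direction)          # destinations differ: [-0.5 -0.5] vs [0.3 0.3]

target = np.array([0.3, 0.3])                            # parameter-style update: coordinates
for s in starts:
    print(target)                 # destination is the same regardless of the start

alpha = 0.5                                              # illustrative interpolation weight
for s in starts:
    print((1 - alpha) * s + alpha * target)              # each start moves part-way to the target
```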
As shown in FIG. 6, points a, b, and c represent the weight parameter values transmitted by the working nodes. Assuming the update order is fixed in each round, i.e., the global parameters are updated in the order a, b, c, the leftmost black point represents the global parameter value in the parameter server in the initial state, and the other black points 1, 2, and 3 represent the values obtained after updating, labeled in update order. As shown, the result of the first step of updating is the position of the black point labeled 3.
The second step is analyzed in the same way, as shown in FIG. 7. After the global parameters have been updated in two steps, i.e., after two complete iterations, the final position is closer to the optimal point than the initial starting value, and no random jump away from the optimal point occurs.
In a heterogeneous environment, high latency in the cluster has a large impact on the global parameters. This is illustrated with an example in FIG. 8. The rightmost black point represents the latest global parameters in the current parameter server, which are already very close to the optimal point. The parameter server then receives a heavily delayed update parameter, the position indicated by point 1. Under the parameter update mechanism of the asynchronous SGD algorithm, the new global parameters end up approximately at point 2 in the figure. Point 2 deviates from the optimal direction, and the deviation is very large relative to the initial value; this problem is even more pronounced in a cluster with severe delay, causing large fluctuations of the global parameters.
As shown in FIG. 9, the rightmost black point represents the latest global parameters in the current parameter server, and point 1 is the delayed update parameter. Assume its delay degree is 8, the number of working nodes m in the whole cluster is 4, and c is set to 1; the value of α is then 0.13, and the new global parameters are obtained at the position shown by point 2 in FIG. 9.
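Assuming the exponential form α = c·e^(-d/m) (the exact formula appears only as a figure in the source), the quoted numbers can be checked directly:

```python
# Numeric check of the worked example above under the assumed exponential form.
import math

d, m, c = 8, 4, 1
alpha = c * math.exp(-d / m)   # assumed form, not the patent's literal formula
print(round(alpha, 3))         # 0.135, consistent with the value of 0.13 quoted above
```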
Examples of the invention
To test the distributed algorithm, a convenient distributed cluster environment was built using one server and four working nodes; the hardware and software configuration information is shown in Table 1.
In this experiment, the logical main server and the parameter server run on the same physical server, which also hosts the data server and Redis. Because all working nodes are multi-core processors, each working node in the experimental environment can simulate several computing nodes simply by opening several tab pages in a browser.
Table 1 is an experimental environment configuration information table;
TABLE 1
All experiments in this embodiment address the image classification problem, and the efficiency of the algorithm is analyzed through indexes such as the training error rate during training. Two classical image data sets, MNIST and CIFAR10, are selected, and the new algorithm is tested on both so that each function of the algorithm can be effectively examined. MNIST is a grayscale handwritten digit image data set of size 28×28×1. The training data set contains 50000 images in 10 classes. For the MNIST data set, the structure of the CNN can be configured directly through the user interface of MLitB; the configuration parameters are shown in Table 2.
Table 2 is a table of CNN network parameter information on MNIST data set
Network layer indexing Network layer type Parameter information
1 Input layer size=(28,28,1)
2 Convolutional layer size=(5,5),stride=1,filters=8,actFun=relu
3 Pooling layer size=(2,2),stride=2
4 Convolutional layer size=(5,5),stride=1,filters=16,actFun=relu
5 Pooling layer size=(3,3),stride=3
6 Full connection layer neurons=10,actFun=softmax
TABLE 2
For all experiments, the mini-batch size N_c is set to 100 and the learning rate η is set to 0.01; each experiment is run 5 times and the averaged results are plotted.
In addition, in this embodiment, the heterogeneity degree HL is set to 1 and 2, and the performance of the synchronous SGD algorithm, the asynchronous SGD algorithm, and the asynchronous parameter-transferring SGD algorithm on the MNIST data set is tested for each setting.
When the heterogeneity degree is 1, the computation time of all working nodes is kept around 1 second by adding a delay; when the heterogeneity degree is 2, the delay of half of the working nodes is increased to 2 seconds. In the code on the parameter server side, a variable step records the number of iterations of the whole system. The average computation time of a working node is obtained by taking system timestamps before and after each computation on each working node, printing them to the browser console, and averaging.
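The averaging of per-iteration timestamps described above could be collected with a helper like the following. This is only an illustrative Python sketch (the experiments themselves run in MLitB in a browser), and the function name is hypothetical.

```python
# Illustrative measurement of the average per-iteration computation time of a working node.
import time

def measure_average_compute_time(compute_step, iterations):
    durations = []
    for _ in range(iterations):
        start = time.time()                      # timestamp before the computation
        compute_step()                           # one local training step on the working node
        durations.append(time.time() - start)    # timestamp after, minus before
    return sum(durations) / len(durations)       # averaged over all iterations
```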
FIG. 10 shows how the number of iterations needed by the several algorithms to reach a specified error rate changes when the heterogeneity degree is increased from 1 to 2. The significance of this index is that, at convergence, a smaller iteration count indicates that each iteration produces more effective updates, i.e., the global parameters move closer to the optimal solution, whereas a larger iteration count indicates that the algorithm produces fewer effective updates over the whole training process. When the heterogeneity degree is increased to 2, the number of iterations needed by the asynchronous SGD algorithm to converge increases markedly, showing that the random-jump behavior seriously harms the convergence of the model. The number of iterations of the asynchronous parameter-transferring SGD algorithm increases much less, showing that the algorithm can sense delay in the cluster, reduce the influence of high delay on the global parameters, and reduce invalid updates.
FIG. 11 shows the increase in the average computation time of the several algorithms when the heterogeneity degree is increased from 1 to 2. When the heterogeneity in the cluster increases, an average computation time that grows linearly with the heterogeneity degree indicates that slow nodes strongly affect fast nodes during the operation of the algorithm. Ideally, the average computation time should grow slowly rather than linearly as the heterogeneity degree increases; such a result indicates that slow nodes in the cluster have less influence on fast nodes and that the utilization of computing resources improves. The average computation time of the synchronous SGD algorithm grows almost linearly: when the heterogeneity degree is 2, the average computation time doubles, because the strict synchronization mechanism makes each round's computation time be determined by the slowest working node, and the experimental result agrees with the theoretical analysis. The average computation time of the algorithms using the asynchronous protocol increases much less, and by almost the same amount; combining the two indexes of iteration count and average computation time shows that the asynchronous protocol allows the computing resources in the cluster to be fully utilized.
FIG. 12 shows how the time needed by the several algorithms to reach a specified error rate changes when the heterogeneity degree is increased from 1 to 2. The overall running time of an algorithm as cluster heterogeneity increases gives a direct view of its speed; combined with the two previous indexes, it indicates the overall performance of the algorithm and, crucially, whether the algorithm can converge within an effective time. The synchronous SGD algorithm does not perform well on average computation time, but its updates are the most effective, so its overall running time is acceptable. The asynchronous SGD algorithm has obvious disadvantages in both average computation time and iteration count; when the heterogeneity degree increases, its overall running time grows greatly, so its efficiency drops noticeably. The asynchronous parameter-transferring SGD algorithm still performs best on overall running time. In conclusion, the asynchronous parameter-transferring SGD algorithm remains very stable as the heterogeneity degree increases.
In summary, the method provided by the invention changes the value transmitted between the parameter server and the working nodes from a gradient value to the weight parameters, and the algorithm dynamically adjusts the update mechanism of the global parameters according to the delay of the weight parameters, thereby reducing the influence of high delay on the algorithm. Experiments show that the algorithm achieves good results on image classification problems and runs stably in a heterogeneous environment.
Although illustrative embodiments of the present invention have been described above to help those skilled in the art understand the present invention, it should be understood that the present invention is not limited to the scope of these embodiments. To those skilled in the art, various changes are permissible as long as they remain within the spirit and scope of the present invention as defined by the appended claims, and all inventions that make use of the inventive concept are protected.

Claims (1)

1. A method for updating global parameters in a parameter server is characterized by comprising the following steps:
(1) global parameter initialization
Setting a global timestamp t and a maximum number T of training iterations, where t = 0, 1, 2, …, T-1; when t = 0, the parameter server randomly initializes the global parameters and sends the initialized global parameters w_0 to all working nodes;
(2) training data preprocessing
Downloading a plurality of training data from a database, equally dividing the training data into m parts according to the number of the working nodes, respectively distributing the training data to the m working nodes, and storing the training data in local data blocks of the working nodes;
(3) each working node trains on the basis of the global parameters w_0 to obtain its local parameters;
(3.1) at the t-th timestamp, each working node randomly extracts n sample data from its local data block, where n is less than m;
(3.2) using the Mini-batch algorithm, train the n sample data with the global parameters w_0 to obtain the training output value of each sample on each node, where i = 1, 2, …, n, j = 1, 2, …, m, and m is the number of working nodes;
(3.3) calculate the loss function value L_j of the jth working node, where the loss compares each training output value with the expected output value of the jth working node for the ith training sample;
(3.4) compute the gradient value ∇L_j from the loss function value L_j;
(3.5) calculate the local parameter of the jth working node at the t-th timestamp, where η denotes the learning rate;
(4) global parameter update
(4.1) the parameter server receives the local parameters transmitted by each working node in turn and determines the delay degree of each working node according to the order of arrival: d_j = t - t_τ, where d_j is the delay degree of the jth working node and t_τ is the timestamp of the global parameters used by the jth node in its previous update round;
(4.2) from the delay degree d_j of the jth working node, calculate the parameter α_j of the jth working node, where c is a constant;
(4.3) the parameter server updates the global parameters of the jth working node; similarly, the parameter server updates the global parameters of the remaining working nodes at the t-th timestamp in turn;
(4.4) after the global parameter updates of all working nodes at the t-th timestamp are completed, let t = t + 1; the parameter server returns the updated global parameters w_{t+1} to the corresponding working nodes, and the procedure returns to step (3). This is repeated until the global timestamp t reaches the maximum number of iterations, at which point the iteration ends and the global parameter update is complete.
CN201810695184.6A 2018-06-29 2018-06-29 Method for updating global parameters in parameter server Active CN109032630B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810695184.6A CN109032630B (en) 2018-06-29 2018-06-29 Method for updating global parameters in parameter server

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810695184.6A CN109032630B (en) 2018-06-29 2018-06-29 Method for updating global parameters in parameter server

Publications (2)

Publication Number Publication Date
CN109032630A true CN109032630A (en) 2018-12-18
CN109032630B CN109032630B (en) 2021-05-14

Family

ID=65520873

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810695184.6A Active CN109032630B (en) 2018-06-29 2018-06-29 Method for updating global parameters in parameter server

Country Status (1)

Country Link
CN (1) CN109032630B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109710289A (en) * 2018-12-21 2019-05-03 南京邮电大学 The update method of distributed parameters server based on deeply learning algorithm
CN110929878A (en) * 2019-10-30 2020-03-27 同济大学 Distributed random gradient descent method
CN111461207A (en) * 2020-03-30 2020-07-28 北京奇艺世纪科技有限公司 Picture recognition model training system and method
CN112990422A (en) * 2019-12-12 2021-06-18 中科寒武纪科技股份有限公司 Parameter server, client and weight parameter processing method and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140067738A1 (en) * 2012-08-28 2014-03-06 International Business Machines Corporation Training Deep Neural Network Acoustic Models Using Distributed Hessian-Free Optimization
CN105630882A (en) * 2015-12-18 2016-06-01 哈尔滨工业大学深圳研究生院 Remote sensing data deep learning based offshore pollutant identifying and tracking method
CN106709565A (en) * 2016-11-16 2017-05-24 广州视源电子科技股份有限公司 Neural network optimization method and device
CN107784364A (en) * 2016-08-25 2018-03-09 微软技术许可有限责任公司 The asynchronous training of machine learning model

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140067738A1 (en) * 2012-08-28 2014-03-06 International Business Machines Corporation Training Deep Neural Network Acoustic Models Using Distributed Hessian-Free Optimization
CN105630882A (en) * 2015-12-18 2016-06-01 哈尔滨工业大学深圳研究生院 Remote sensing data deep learning based offshore pollutant identifying and tracking method
CN107784364A (en) * 2016-08-25 2018-03-09 微软技术许可有限责任公司 The asynchronous training of machine learning model
CN106709565A (en) * 2016-11-16 2017-05-24 广州视源电子科技股份有限公司 Neural network optimization method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
夏亚峰 et al.: "Bayesian inference of the loss function and risk function in lognormal parameter estimation", Journal of Lanzhou University of Technology *
肖红 et al.: "Process neural network training based on piecewise linear interpolation", Computer Engineering *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109710289A (en) * 2018-12-21 2019-05-03 南京邮电大学 The update method of distributed parameters server based on deeply learning algorithm
CN110929878A (en) * 2019-10-30 2020-03-27 同济大学 Distributed random gradient descent method
CN110929878B (en) * 2019-10-30 2023-07-04 同济大学 Distributed random gradient descent method
CN112990422A (en) * 2019-12-12 2021-06-18 中科寒武纪科技股份有限公司 Parameter server, client and weight parameter processing method and system
CN111461207A (en) * 2020-03-30 2020-07-28 北京奇艺世纪科技有限公司 Picture recognition model training system and method

Also Published As

Publication number Publication date
CN109032630B (en) 2021-05-14

Similar Documents

Publication Publication Date Title
CN109032630B (en) Method for updating global parameters in parameter server
CN111382844B (en) Training method and device for deep learning model
CN110533183B (en) Task placement method for heterogeneous network perception in pipeline distributed deep learning
CN108875955B (en) Gradient lifting decision tree implementation method based on parameter server and related equipment
CN108446761B (en) Neural network accelerator and data processing method
CN109951438A (en) A kind of communication optimization method and system of distribution deep learning
CN110659678B (en) User behavior classification method, system and storage medium
CN108108233B (en) Cluster job scheduling method and system for task multi-copy execution
CN107103359A (en) The online Reliability Prediction Method of big service system based on convolutional neural networks
CN112434789B (en) Distributed neural network model partitioning method for edge video analysis
Xiong et al. Straggler-resilient distributed machine learning with dynamic backup workers
CN114116995B (en) Session recommendation method, system and medium based on enhanced graph neural network
Zhang et al. Optimizing execution for pipelined‐based distributed deep learning in a heterogeneously networked GPU cluster
CN117675823A (en) Task processing method and device of computing power network, electronic equipment and storage medium
CN117042184A (en) Calculation unloading and resource allocation method based on deep reinforcement learning
CN116776969A (en) Federal learning method and apparatus, and computer-readable storage medium
WO2020037512A1 (en) Neural network calculation method and device
JP2020003860A (en) Learning system, processing device, processing method, and program
CN115529350A (en) Parameter optimization method and device, electronic equipment and readable storage medium
Ji et al. Performance prediction for distributed graph computing
CN116016212B (en) Decentralised federation learning method and device for bandwidth perception
Kazemi et al. Asynchronous delay-aware accelerated proximal coordinate descent for nonconvex nonsmooth problems
Yokoyama et al. Efficient distributed machine learning for large-scale models by reducing redundant communication
CN113641905B (en) Model training method, information pushing method, device, equipment and storage medium
US20240111607A1 (en) Similarity-based quantization selection for federated learning with heterogeneous edge devices

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant