AU2020101959A4 - Decentralized optimization algorithm for machine learning tasks in networks: Resource efficient - Google Patents

Decentralized optimization algorithm for machine learning tasks in networks: Resource efficient

Info

Publication number
AU2020101959A4
Authority
AU
Australia
Prior art keywords
node
triggering
event
estimator
gradient
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
AU2020101959A
Inventor
Zhengran Cao
Qingguo Lü
Keke ZHANG
Lifeng Zheng
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest University
Original Assignee
Southwest University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest University filed Critical Southwest University
Priority to AU2020101959A priority Critical patent/AU2020101959A4/en
Application granted granted Critical
Publication of AU2020101959A4 publication Critical patent/AU2020101959A4/en
Ceased legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/20Ensemble learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/10Machine learning using kernel methods, e.g. support vector machines [SVM]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

Through utilizing the momentum acceleration mechanism, a novel decentralized accelerated stochastic double-efficient algorithm based on the stochastic gradient-tracking technique is proposed to solve the problem of decentralized optimization, i.e., to minimize a finite sum of convex cost functions over the nodes of a network, where each cost function is further considered as the average of several constituent functions. The algorithm mainly includes five parts: variable initialization, parameter selection, information exchange, an event-triggering communication strategy, and variable updates. By adopting the event-triggering strategy and the variance-reduction technique, the proposed algorithm set forth in the present invention realizes better communication efficiency and achieves higher computation efficiency in comparison with related works. Presuming smooth and strongly convex cost functions, the proposed algorithm with a well-selected constant step-size converges in mean to the exact optimal solution. At the same time, a linear convergence rate is achieved if each constituent function is strongly convex and smooth. Furthermore, under certain conditions, the time interval between two successive triggering instants for each node is proved to be larger than the iteration interval. The present invention plays an important role in many large-scale machine learning tasks where the samples of a training dataset are randomly decentralized across multiple computing nodes.

Description

[Figure 1 (flowchart of the proposed algorithm): Start; each node sets t = 0 and the maximum number of iterations t_max; each node initializes local variables and selects the fixed step-size as well as the momentum coefficient according to the network and the problem; each node updates the decision estimator and the accelerated estimator according to the event-triggering strategy and the momentum acceleration mechanism; each node computes the stochastic gradient according to the variance-reduction technique; each node updates the gradient auxiliary estimator according to the stochastic gradient-tracking method; each node tests the triggering condition, broadcasts the decision estimator and the gradient auxiliary estimator, and updates the latest triggering time; each node sets t = t + 1 and repeats until t > t_max.]
1. Technical Field
[0001] The present invention relates to a field of machine learning.
2. Background
[0002] The Internet of Things (IoT) and artificial intelligence have promoted the emergence of networked control systems, which urgently need efficient communication and computation. Due to the limited computational and storage capacity of individual nodes, centralized processing of large-scale tasks on a single computing node becomes unrealistic. Decentralized optimization addresses problems in which multiple nodes interact over a network, and it is of significance in many areas such as machine learning, resource allocation, data analysis, privacy masking, and signal processing, owing to its ability to parallelize computation and to prevent agents from sharing information considered private. A decentralized algorithm usually follows an iterative process in which nodes preserve certain estimators of the decision vectors, exchange this information with neighboring nodes over a communication network, and update their estimators according to the received information. Designing efficient decentralized algorithms for broad classes of optimization problems is in general challenging.

Some of the literature on such decentralized schemes includes the early work on decentralized gradient descent (DGD) and its various extensions for achieving efficiency, handling constraints, applying to complex networks, or performing acceleration. These methods successfully demonstrated their effectiveness for solving optimization problems in a decentralized manner over networks. Nonetheless, although these methods were intuitive and flexible with respect to cost functions and networks, their computing speeds were particularly slow in comparison with those of their centralized counterparts. Besides, DGD-based methods with constant step-sizes can only achieve linear convergence rates up to a sub-optimality gap. Therefore, from an optimization point of view, it is always a priority to propose and analyze methods that are comparable in performance to centralized counterparts in terms of convergence rate. In a recent stream of literature, decentralized gradient methods that resolve this exactness-rate dilemma have been proposed, which achieve an exact linear convergence rate for smooth and strongly convex cost functions. Instances of such methods, including methods based on gradient tracking, methods based on Lagrangian multipliers, and methods based on dual decomposition, are characterized by various mechanisms.

Towards practical optimization models, momentum acceleration approaches have been successfully and widely applied as optimization techniques that facilitate the convergence of gradient-based methods. First-order optimization methods based on momentum acceleration have been of significance in the machine learning community due to their scalability to large-scale problems (including deep learning, federated learning, etc.) and good performance in practice. To improve communication efficiency while maintaining the desired performance of the network, various types of strategies have recently been proposed and gained popularity in existing works. The emergence of the event-triggering strategy provides a new perspective for collecting and transmitting information. The main idea behind the event-triggering strategy is that nodes only take actions when necessary, that is, only when a measurement of the local node's state error reaches a specified threshold. Its superiority is that some desired properties of the network can still be maintained efficiently.
With the advent of the big-data era, the amount of data that the nodes in the network need to process is becoming larger and more complicated. Consequently, general full-gradient methods can be computationally very expensive, because the nodes in the network need to cope with large amounts of heterogeneous data. Based on this, we assume that at each iteration the proposed algorithm only evaluates the gradient of one randomly selected constituent function, and employs the unbiased stochastic average gradient (obtained as the average of the most recent stochastic gradients) to estimate the local gradient. Moreover, the proposed algorithm integrates the event-triggering strategy and the variance-reduction technique with the distributed accelerated gradient-tracking method to converge linearly in mean to the exact optimal solution.
3. Notation
[0003] In this invention, all vectors are column vectors by default. Let $\mathbb{R}$, $\mathbb{R}^p$ and $\mathbb{R}^{p \times q}$ denote the set of real numbers, $p$-dimensional real column vectors, and $p \times q$ real matrices, respectively. The inner product of vectors $c$ and $d$ is represented by $\langle c, d \rangle$. Given a random estimator $x$, $\mathbb{E}[x]$ denotes its expectation. The spectral radius of a matrix $A$ is denoted by $\rho(A)$. For two matrices $A, B \in \mathbb{R}^{p \times p}$, $A \otimes B$ represents their Kronecker product. The transposes of a vector $x$ and a matrix $A$ are represented by the symbols $x^T$ and $A^T$. We denote $\|\cdot\|$ as the Euclidean norm of a vector. The notation $\nabla f(y)$ denotes the gradient of the function $f$ at $y$. The matrix $I_p$ is the $p \times p$ identity matrix, whereas $\mathbf{1}$ and $\mathbf{0}$ (of appropriate dimension) are column vectors of all ones and all zeros, respectively. A quantity (possibly a vector) allocated to node $i$ is indexed by a superscript $i$; e.g., $x_t^i$ is node $i$'s estimator at time $t$.
4. Network Model
[0004] Throughout this invention, the intrinsic interconnections among nodes in the network are modeled as an undirected graph $\mathcal{G} = \{\mathcal{V}, \mathcal{S}, A\}$ involving the node set $\mathcal{V} = \{1, 2, \ldots, m\}$, the edge set $\mathcal{S} \subseteq \mathcal{V} \times \mathcal{V}$, and the weight matrix $A = [a_{ij}] \in \mathbb{R}^{m \times m}$. An edge $(i,j) \in \mathcal{S}$ implies that node $i$ can directly exchange data with node $j$. The connection weight between nodes $i$ and $j$ in graph $\mathcal{G}$ satisfies $a_{ij} = a_{ji} > 0$ if $(i,j) \in \mathcal{S}$, and otherwise $a_{ij} = a_{ji} = 0$. The neighbor set of node $i$ is denoted by $\mathcal{V}^i = \{j \mid a_{ij} > 0\}$. A path is a series of consecutive edges. If there is at least one path between any pair of distinct nodes in the graph $\mathcal{G}$, then the graph $\mathcal{G}$ is connected. Without loss of generality, $a_{ii} = 0$, i.e., there are no self-connections in the graph. The degree of node $i \in \mathcal{V}$ is represented by $d^i = \sum_{j=1}^m a_{ij}$, and the degree matrix $D_{\mathcal{G}} = \mathrm{diag}\{d^1, d^2, \ldots, d^m\}$ is a diagonal matrix. The Laplacian matrix of graph $\mathcal{G}$ is defined as $L_{\mathcal{G}} = D_{\mathcal{G}} - A$, which is symmetric and positive semi-definite since the graph $\mathcal{G}$ is undirected.
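To make the weight assumptions concrete, the following Python sketch (numpy only; the ring graph and the helper name are illustrative, not part of the invention) builds a symmetric, doubly stochastic mixing matrix with the Metropolis-Hastings rule and checks that its second largest singular value is below 1. Note that, unlike the adjacency weights above, such a mixing matrix carries positive self-weights so that its rows and columns sum to one.

import numpy as np

def metropolis_weights(edges, m):
    # Build a symmetric, doubly stochastic weight matrix A for an
    # undirected graph with nodes {0, ..., m-1} and the given edge list.
    A = np.zeros((m, m))
    deg = np.zeros(m, dtype=int)
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    for i, j in edges:
        # Metropolis-Hastings rule: a_ij = 1 / (1 + max(d_i, d_j)).
        A[i, j] = A[j, i] = 1.0 / (1 + max(deg[i], deg[j]))
    # Self-weights make every row (and, by symmetry, column) sum to one.
    np.fill_diagonal(A, 1.0 - A.sum(axis=1))
    return A

# Example: a ring of m = 5 nodes.
m = 5
edges = [(i, (i + 1) % m) for i in range(m)]
A = metropolis_weights(edges, m)
assert np.allclose(A.sum(axis=0), 1) and np.allclose(A.sum(axis=1), 1)
# kappa_3: second largest singular value, i.e. ||A - (1/m) 1 1^T||.
kappa3 = np.linalg.norm(A - np.ones((m, m)) / m, ord=2)
print(kappa3 < 1)  # True for a connected graph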
5. Problem Formulation
[0005] This invention focuses on optimizing a finite sum of cost functions, commonly encountered in machine learning, which can be formulated as:
$$\min_{x \in \mathbb{R}^p} f(x) = \frac{1}{m}\sum_{i=1}^{m} f^i(x), \qquad f^i(x) = \frac{1}{n^i}\sum_{j=1}^{n^i} f^{i,j}(x),$$
where $x \in \mathbb{R}^p$ is the optimization estimator (decision vector) and $f^i : \mathbb{R}^p \to \mathbb{R}$ is a convex function viewed as the private cost of node $i$, which is represented as the average of $n^i$ constituent functions $f^{i,j}$. Assume that $x^*$ is an optimal solution to the above problem. In addition, we make the following assumptions on the constituent functions: 1) Each local constituent function $f^{i,j}$, $i \in \{1,\ldots,m\}$, $j \in \{1,\ldots,n^i\}$, is $\kappa_1$-strongly convex and $\kappa_2$-smooth, i.e., for any $a, b \in \mathbb{R}^p$: (i) $f^{i,j}(a+b) \ge f^{i,j}(a) + \nabla f^{i,j}(a)^T b + (\kappa_1/2)\|b\|^2$; (ii) $\|\nabla f^{i,j}(a+b) - \nabla f^{i,j}(a)\| \le \kappa_2 \|b\|$. 2) The undirected communication network $\mathcal{G}$ is connected, and the weight matrix $A = [a_{ij}] \in \mathbb{R}^{m \times m}$ corresponding to the network $\mathcal{G}$ is primitive and doubly stochastic, which implies that the second largest singular value $\kappa_3$ of $A$ is less than 1, i.e., $\kappa_3 = \|A - (1/m)\mathbf{1}_m \mathbf{1}_m^T\| < 1$.

The above formulated problem can be frequently found in machine learning models, such as empirical risk minimization, logistic regression, support vector machines, deep neural networks, etc. The machine learning model contains large-scale training samples that are randomly scattered over multiple computing nodes. These computing nodes focus on collectively training a model $x \in \mathbb{R}^p$ by utilizing the neighboring nodes' data. Although the accuracy of the machine learning model can be improved when the local data batch at a single computing node is very large, i.e., $n^i \gg 1$, the limited memory of the computing node causes a significant increase in training time as well as in the amount of communication and computation. However, it is expensive to improve the computing and communication capabilities of a single piece of hardware. Hence, designing a novel decentralized accelerated stochastic double-efficient algorithm will have far-reaching implications.
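As a minimal illustration of this nested finite-sum structure (all names, sizes, and the quadratic constituents below are hypothetical choices, used only because the optimum can be checked by hand), one may write:

import numpy as np

rng = np.random.default_rng(0)
m, p = 4, 3                      # m nodes, decision dimension p
n = [5, 7, 6, 8]                 # n^i: local batch size of node i
# Hypothetical quadratic constituents f^{i,j}(x) = 0.5 ||x - c^{i,j}||^2.
C = [rng.standard_normal((n[i], p)) for i in range(m)]

def f_ij(i, j, x):               # constituent function f^{i,j}
    return 0.5 * np.sum((x - C[i][j]) ** 2)

def f_i(i, x):                   # local cost: average of n^i constituents
    return np.mean([f_ij(i, j, x) for j in range(n[i])])

def f(x):                        # global cost: average over the m nodes
    return np.mean([f_i(i, x) for i in range(m)])

# For these constituents, the optimum is the average of the per-node
# center means, so each node contributes equally regardless of n^i.
x_star = np.mean([C[i].mean(axis=0) for i in range(m)], axis=0)
# Sanity check: the global cost is minimized at x_star.
eps = 1e-4 * rng.standard_normal(p)
assert f(x_star) <= f(x_star + eps)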
6. Detailed Implementation Description
[0006] Figure 1 is the flowchart of the proposed algorithm in the present invention. As shown in Figure 1, the decentralized accelerated stochastic double-efficient algorithm comprises the following steps:
6.1. Variable initialization
[0007] Step 1: Each node $i \in \mathcal{V}$ sets $t = 0$ and the maximum number of iterations, $t_{\max}$.
[0008] Step 2: Each node $i$ starts with $\tilde{x}_0^i = x_0^i \in \mathbb{R}^p$, $s_0^i \in \mathbb{R}^p$, $e_0^{i,j} = s_0^i$, $\forall j \in \{1,\ldots,n^i\}$, and $\tilde{y}_0^i = y_0^i = g_0^i = \nabla f^i(s_0^i) \in \mathbb{R}^p$.
6.2. Parameters selection
[0009] Step 3: According to the definition of the constituent functions, we denote $\bar{n} = \max_{i \in \mathcal{V}}\{n^i\}$ and $\underline{n} = \min_{i \in \mathcal{V}}\{n^i\}$. Moreover, we select an arbitrary parameter $\omega_1 > 0$, select $\omega_3$ according to $\omega_3 > 8\omega_1$, choose $\omega_2$ following from $2\kappa_1\omega_2 > 8\omega_1 - 8\kappa_1\alpha^2\omega_3$ with $0 < \alpha < \sqrt{\omega_1/\omega_3}$, pick $\omega_4$ such that $2\bar{n}\omega_1 + 2\bar{n}\omega_2 < \omega_4\underline{n}$, and opt for $\omega_5$ satisfying $4960\omega_1 + 1064\omega_2 + 752\omega_3 + 168\omega_4 < \omega_5(1-\kappa_3)^2$.
[0010] Step 4: According to the parameters denoted in Step 3, the momentum coefficient $\alpha$ is selected from the interval $0 < \alpha < \sqrt{\omega_1/\omega_3}$, and the constant step-size $\eta$ is selected from the interval

$$0 < \eta < \min\left\{\frac{\underline{n}\,\omega_2}{2\sqrt{2}\,\kappa_2\omega_4},\ \frac{\underline{n}\,\omega_1}{8\,\kappa_2\omega_4},\ \frac{\underline{n}\,\alpha^2\omega_3}{\kappa_1\omega_4},\ \frac{1-\kappa_3}{99\,\kappa_2},\ \frac{(1-\kappa_3)\sqrt{\omega_3 - 8\omega_1}}{\kappa_2\sqrt{160\,\omega_3 + 96\,\omega_1 + 32\,\omega_4 + 16\,\omega_5}}\right\}.$$
6.3. Information exchange
[0011] Step 5: According to the weights of the communication network, each node $i \in \mathcal{V}$ exchanges the variables (information) $x_t^i$ and $y_t^i$ with its neighboring nodes $j \in \mathcal{V}^i$. Then, each node $i \in \mathcal{V}$ computes the weighted sums $\sum_{j \in \mathcal{V}^i} a_{ij}(x_t^i - \tilde{x}_t^j)$ and $\sum_{j \in \mathcal{V}^i} a_{ij}(y_t^i - \tilde{y}_t^j)$ for $t \ge 0$, where $a_{ij} > 0$ is the weight between node $j$ and node $i$.
6.4. Event-triggering communication strategy
[0012] Step 6: The event-triggering strategy provides a new perspective for information sampling and transmission. Before introducing it, we first define $t_k^i$ as the $k$-th triggering time of node $i$, where $i \in \mathcal{V}$. In methods based on the event-triggering strategy, the local estimators of node $i$ at time $t$ are determined by its own estimators and the latest information sent from its neighbors $j \in \mathcal{V}^i$ (at the latest triggering time of node $j$ before $t$). Assume that $\tilde{x}_t^i$ and $\tilde{y}_t^i$ are the information that node $i$ transmitted to its neighbors at its latest triggering time before time $t$, i.e.,

$$\tilde{x}_t^i = x_{t_{k(i,t)}^i}^i, \quad \tilde{y}_t^i = y_{t_{k(i,t)}^i}^i, \quad \text{for } t_{k(i,t)}^i \le t < t_{k(i,t)+1}^i, \qquad (1)$$

where $x_t^i$ and $y_t^i$ are the two estimators of node $i$. Moreover, we suppose that all nodes broadcast their estimators $x_0^i$ and $y_0^i$ at the initial time, i.e., $\tilde{x}_0^i = x_0^i$ and $\tilde{y}_0^i = y_0^i$ for all $i \in \mathcal{V}$. In addition, the next triggering time $t_{k(i,t)+1}^i$ after $t$ for node $i \in \mathcal{V}$ is decided by

$$t_{k(i,t)+1}^i = \inf\{t' > t_{k(i,t)}^i : \|E_{t'}^{i,x}\|^2 + \|E_{t'}^{i,y}\|^2 > C\kappa_4^{t'}\}, \qquad (2)$$

where $C\kappa_4^t$ is the event-triggering threshold with parameters $C > 0$ and $0 < \kappa_4 < 1$, and $E_t^{i,x}$, $E_t^{i,y}$ are the measurement errors defined by

$$E_t^{i,x} = \tilde{x}_t^i - x_t^i, \qquad E_t^{i,y} = \tilde{y}_t^i - y_t^i. \qquad (3)$$
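A minimal Python sketch of the triggering test (2) with the measurement errors (3) is given below; the function and variable names (should_trigger, x_broadcast, and so on) are hypothetical, and the threshold follows the definition $C\kappa_4^t$ above.

import numpy as np

def should_trigger(x_t, y_t, x_broadcast, y_broadcast, C, kappa4, t):
    # Measurement errors (3): gap between the last broadcast values
    # and the current local estimators.
    E_x = x_broadcast - x_t
    E_y = y_broadcast - y_t
    # Triggering condition (2): fire when the squared error exceeds
    # the decaying threshold C * kappa4**t (C > 0, 0 < kappa4 < 1).
    return np.sum(E_x ** 2) + np.sum(E_y ** 2) > C * kappa4 ** t

# When the condition holds, node i broadcasts (x_t, y_t) to its
# neighbors and resets its broadcast copies:
# x_broadcast, y_broadcast = x_t.copy(), y_t.copy()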
6.5. Variables update
[0013] Step 7: According to the event-triggering strategy and the momentum acceleration mechanism, each node $i$ first updates the local decision estimator $x_{t+1}^i$ and the local accelerated estimator $s_{t+1}^i$, i.e.,

$$x_{t+1}^i = s_t^i - \sum_{j \in \mathcal{V}^i} a_{ij}(x_t^i - \tilde{x}_t^j) - \eta\,y_t^i, \qquad (4)$$
$$s_{t+1}^i = x_{t+1}^i + \alpha\,(x_{t+1}^i - x_t^i). \qquad (5)$$
[0014] Step 8: Subsequently, each node $i$ maintains a gradient table that stores all local constituent gradients $\nabla f^{i,j}(e_t^{i,j})$, where $e_t^{i,j}$ is the most recent estimator at which the constituent gradient $\nabla f^{i,j}$ was evaluated. At each iteration $t+1$, each node $i$ uniformly and randomly selects one constituent function indexed by $\chi_{t+1}^i \in \{1,\ldots,n^i\}$ from its own local data batch, and then generates the local stochastic gradient $g_{t+1}^i$ as

$$g_{t+1}^i = \nabla f^{i,\chi_{t+1}^i}(s_{t+1}^i) - \nabla f^{i,\chi_{t+1}^i}(e_t^{i,\chi_{t+1}^i}) + \frac{1}{n^i}\sum_{j=1}^{n^i} \nabla f^{i,j}(e_t^{i,j}). \qquad (6)$$

After generating $g_{t+1}^i$, the entry $\nabla f^{i,\chi_{t+1}^i}(e_t^{i,\chi_{t+1}^i})$ is replaced by the newly computed constituent gradient $\nabla f^{i,\chi_{t+1}^i}(s_{t+1}^i)$, while the other entries remain the same. That is to say, if $j = \chi_{t+1}^i$, then store $\nabla f^{i,j}(e_{t+1}^{i,j}) = \nabla f^{i,j}(s_{t+1}^i)$; else $\nabla f^{i,j}(e_{t+1}^{i,j}) = \nabla f^{i,j}(e_t^{i,j})$.
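The bookkeeping in Step 8 is a SAGA-style gradient table. A hedged Python sketch (grad_fn, grad_table, and the function name are hypothetical) of generating (6) and refreshing the table:

import numpy as np

def saga_gradient(grad_fn, s_next, grad_table, rng):
    # grad_fn(j, x): gradient of constituent f^{i,j} at x.
    # grad_table[j]: stored gradient evaluated at the point e^{i,j}.
    n_i = len(grad_table)
    chi = rng.integers(n_i)                 # uniformly random index chi
    g_new = grad_fn(chi, s_next)            # fresh constituent gradient
    # Unbiased stochastic average gradient, eq. (6):
    g = g_new - grad_table[chi] + np.mean(grad_table, axis=0)
    grad_table[chi] = g_new                 # replace only entry chi
    return g

Taking the expectation over the uniformly random index chi gives $\mathbb{E}[g] = \nabla f^i(s_{t+1}^i)$, matching the unbiasedness of the stochastic average gradient claimed in the Background.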
[0015] Step 9: Based on Step 8, each node $i \in \mathcal{V}$ updates the local gradient auxiliary estimator $y_{t+1}^i$ according to

$$y_{t+1}^i = y_t^i - \sum_{j \in \mathcal{V}^i} a_{ij}(y_t^i - \tilde{y}_t^j) + g_{t+1}^i - g_t^i. \qquad (7)$$
[0016] Step 10: Based on the above updates, each node $i \in \mathcal{V}$ calculates the measurement errors $E_t^{i,x}$, $E_t^{i,y}$ in (3), and then tests the triggering condition in (2). If the triggering condition in (2) is satisfied, then node $i \in \mathcal{V}$ broadcasts $x_{t+1}^i$ and $y_{t+1}^i$ to its neighbors $j \in \mathcal{V}^i$ and updates the latest triggering time.
[0017] Step 11: Each node $i \in \mathcal{V}$ sets $t = t + 1$ and goes to Step 7 until a certain stopping criterion is satisfied, e.g., $t > t_{\max}$, where $t_{\max}$ is the maximum number of iterations.
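Putting Steps 7-11 together, the sketch below shows one iteration from the viewpoint of a single node. It is a simplified illustration under the reconstructed updates (4)-(7) and reuses the saga_gradient sketch above; the node dictionary, message passing, and field names are hypothetical, not the definitive implementation.

import numpy as np

def node_iteration(node, A, eta, alpha, C, kappa4, t, rng):
    i = node["id"]
    # Step 5: weighted disagreement with the latest broadcast values
    # x~_t^j, y~_t^j received from the neighbors.
    cons_x = sum(A[i, j] * (node["x"] - node["x_tilde"][j])
                 for j in node["neighbors"])
    cons_y = sum(A[i, j] * (node["y"] - node["y_tilde"][j])
                 for j in node["neighbors"])
    # Step 7: decision estimator (4) and accelerated estimator (5).
    x_new = node["s"] - cons_x - eta * node["y"]
    s_new = x_new + alpha * (x_new - node["x"])
    # Step 8: SAGA-style stochastic gradient (6), reusing the sketch above.
    g_new = saga_gradient(node["grad_fn"], s_new, node["table"], rng)
    # Step 9: gradient-tracking update (7).
    y_new = node["y"] - cons_y + g_new - node["g"]
    # Step 10: event-triggering test (2)-(3); broadcast only if it fires.
    err = (np.sum((node["x_bcast"] - x_new) ** 2)
           + np.sum((node["y_bcast"] - y_new) ** 2))
    if err > C * kappa4 ** t:
        node["x_bcast"], node["y_bcast"] = x_new.copy(), y_new.copy()
        # ... here the node would send (x_new, y_new) to its neighbors ...
    # Step 11: commit the local state for iteration t + 1.
    node["x"], node["s"], node["y"], node["g"] = x_new, s_new, y_new, g_new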
7. Innovation
[0018] 1) Leverage the event-triggering strategy to realize better communication efficiency and the variance-reduction technique to achieve higher computation efficiency, which may reduce energy consumption and extend the useful life of the network.
[0019] 2) Incorporate the momentum acceleration mechanism into the event-triggered decentralized stochastic gradient-tracking method with a well-selected constant step-size to converge linearly in mean to the exact optimal solution.
[0020] 3) Propose a novel decentralized accelerated stochastic double-efficient algorithm for solving the problem of decentralized optimization, which has broad application in many large-scale machine learning tasks where the samples of a training dataset are randomly decentralized across multiple computing nodes.
8. Simulations
8.1. Logistic regression
[0021] First, the proposed algorithm, named DE-SDAA, is leveraged to deal with a binary classification problem via logistic regression using two real datasets from the UCI Machine Learning Repository. In this example, we use the Breast Cancer Wisconsin (Diagnostic) dataset to examine the performance of DE-SDAA, and use the Mushroom dataset as well as the Breast Cancer Wisconsin (Diagnostic) dataset for the comparison with other related decentralized methods. In the Breast Cancer Wisconsin (Diagnostic) dataset, we adopt n = 200 samples as training data, where each training sample has dimension p = 9. In the Mushroom dataset, we employ n = 6000 samples as training data, where each training sample has dimension p = 112. All the features have been preprocessed and normalized to the unit vector for each dataset. For the network, we generate a randomly connected network with m = 10 nodes utilizing an Erdos-Renyi model with connection probability 0.4. The decentralized logistic regression problem can be formally described as
$$\min_{x \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^{m}\sum_{j=1}^{n^i} \ln\!\left(1 + \exp\!\left(-b^{i,j}(c^{i,j})^T x\right)\right) + \frac{\pi}{2}\|x\|^2, \qquad (8)$$
with the local objective function $f^i(x)$ being

$$f^i(x) = \frac{1}{n^i}\sum_{j=1}^{n^i} \ln\!\left(1 + \exp\!\left(-b^{i,j}(c^{i,j})^T x\right)\right) + \frac{\pi}{2}\|x\|^2, \qquad (9)$$
where $b^{i,j} \in \{-1, 1\}$ and $c^{i,j} \in \mathbb{R}^p$ are local data kept by node $i$ for $j \in \{1,\ldots,n^i\}$. The regularization term $(\pi/2)\|x\|^2$ is added to avoid overfitting. In the simulation, we assign the data randomly to each local node, i.e., $\sum_{i=1}^m n^i = n$. We set the regularization parameter $\pi = 40$. Moreover, the step-size $\eta$ of each algorithm is selected to ensure the best possible convergence. The Breast Cancer Wisconsin (Diagnostic) dataset is applied to examine the performance of DE-SDAA. Firstly, the transient behaviors of three dimensions (randomly selected) of the state estimator $x$ are shown in Figure 2, which illustrates that the state estimator $x$ in DE-SDAA can achieve consensus in mean at the global optimal solution (the test accuracy is 97.72%). Secondly, the triggering times of five nodes (randomly selected) for their neighbors under DE-SDAA are shown in Figure 3, which, combined with Figure 2, implies that DE-SDAA with the event-triggering communication strategy can achieve the expected results with fewer communications compared to a time-triggered algorithm. Thirdly, comparing DE-SDAA with its variants under other specific scenarios, the appealing features of DE-SDAA are shown in Figure 4, where the residual $(1/m)\log_{10}(\sum_{i=1}^m \|x_t^i - x^*\|)$ is treated as the comparison metric. Figure 4 shows that DE-SDAA can achieve accelerated linear convergence compared to the algorithms without the momentum acceleration mechanism.
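For reference, a hedged Python sketch of the local objective (9) and of the gradient of one constituent term, i.e., what node i would feed into the stochastic gradient (6); the names local_loss, constituent_grad, and reg (standing in for the parameter pi) are hypothetical:

import numpy as np

def local_loss(x, b, c, reg):
    # f^i(x) = (1/n^i) sum_j ln(1 + exp(-b_j c_j^T x)) + (reg/2)||x||^2,
    # with b in {-1, +1}^{n_i} and c of shape (n_i, p).
    margins = -b * (c @ x)
    return np.mean(np.log1p(np.exp(margins))) + 0.5 * reg * x @ x

def constituent_grad(x, b_j, c_j, reg):
    # Gradient of one constituent f^{i,j}: the term fed into eq. (6).
    sigma = 1.0 / (1.0 + np.exp(b_j * (c_j @ x)))   # logistic factor
    return -b_j * sigma * c_j + reg * x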
8.2. Energy-based source localization
[0022] Second, we further verify the application behavior of the proposed algorithm with numerical simulations for the energy-based source localization problem over a network of m sensors. Estimating the location of an energy-based source is an important issue in the military field. Assume that there is a stationary acoustic source $x^* \in \mathbb{R}^2$ located at an unknown position that we aim to estimate within the sensor network. The source emits an isotropic signal, and we want to use the energy measurements of the signal received by each sensor to estimate the location of the source. In this example, we suppose that each sensor is placed at a random spatial location denoted $a^i \in \mathbb{R}^2$, $i = 1,\ldots,m$, which is known privately to itself, and each sensor collects $n^i$ measurements. Then, an isotropic energy propagation model is applied to measure the $j$-th received signal strength at sensor $i$, which is represented by $s^{i,j} = c/\|x^* - a^i\|^d + b^{i,j}$, where $c > 0$ is a constant and $d > 1$ is an attenuation characteristic; $\|x^* - a^i\| > 1$, and $b^{i,j}$ is independent and identically distributed sample noise following the zero-mean Gaussian distribution with variance $\sigma^2$. The maximum-likelihood estimator for the source's location is found by solving the following problem:
$$\min_{x \in \mathbb{R}^2} \frac{1}{m}\sum_{i=1}^{m} \frac{1}{n^i}\sum_{j=1}^{n^i}\left(s^{i,j} - \frac{c}{\|x - a^i\|^d}\right)^2, \qquad (10)$$
with the local objective function $f^i(x)$ being

$$f^i(x) = \frac{1}{n^i}\sum_{j=1}^{n^i}\left(s^{i,j} - \frac{c}{\|x - a^i\|^d}\right)^2. \qquad (11)$$
According to the analysis, it suffices to verify that solving the nonlinear least-squares problem (11) is equivalent to finding the optimal estimator $x$ of the following transformed problem:

$$\min_{x \in \mathbb{R}^2} \frac{1}{m}\sum_{i=1}^{m} \frac{1}{n^i}\sum_{j=1}^{n^i} \|x - P_{Q^{i,j}}(x)\|^2, \qquad (12)$$

where $Q^{i,j} = \{x \in \mathbb{R}^2 \mid \|x - a^i\| \le c/s^{i,j}\}$ and $P_{Q^{i,j}}(x)$ is the orthogonal projection of $x$ onto $Q^{i,j}$. Specifically, we consider $m = 50$ sensors uniformly distributed in a $100 \times 100$ square, with the source location randomly chosen from the square. The source emits a signal with strength $c = 100$, and each sensor has $n^i = 100$ measurements. Based on the above, seven randomly selected paths taken by DE-SDAA are shown in Figure 5, plotted on top of the contours of the log-likelihood. Figure 5 illustrates that DE-SDAA can successfully find the exact source location like other verified effective algorithms, which makes it suitable for the practical energy-based source localization problem.
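A minimal sketch of the projection $P_{Q^{i,j}}$ used in (12), taking the ball radius $c/s^{i,j}$ literally from the definition above (function and variable names are hypothetical):

import numpy as np

def project_onto_ball(x, center, radius):
    # Orthogonal projection of x onto Q = {z : ||z - center|| <= radius}.
    v = x - center
    dist = np.linalg.norm(v)
    if dist <= radius:
        return x                       # x already lies in the ball
    return center + radius * v / dist  # scale back onto the boundary

def constituent_residual(x, a_i, s_ij, c):
    # One term of (12): squared distance from x to Q^{i,j}.
    p = project_onto_ball(x, a_i, c / s_ij)
    return np.sum((x - p) ** 2)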
9. Brief Description of The Drawings
[0023] Figure 1 is a flowchart of the decentralized accelerated stochastic double-efficient algorithm.
[0024] Figure 2 shows the transient behaviors of three dimensions (randomly selected) of state estimator x in the proposed algorithm.
[0025] Figure 3 shows the triggering times of five nodes (randomly selected) for their neighbors under the proposed algorithm.
[0026] Figure 4 shows the comparisons between DE-SDAA and its variants under other specific scenarios.
[0027] Figure 5 shows seven randomly selected paths displayed on top of the contours of the log-likelihood function.

Claims (5)

The claims defining the invention are as follows:
1. A decentralized accelerated stochastic double-efficient algorithm, comprising:
1.1. Variable initialization
Step 1: Each node $i \in \mathcal{V}$ sets $t = 0$ and the maximum number of iterations, $t_{\max}$. Step 2: Each node $i$ starts with $\tilde{x}_0^i = x_0^i \in \mathbb{R}^p$, $s_0^i \in \mathbb{R}^p$, $e_0^{i,j} = s_0^i$, $\forall j \in \{1,\ldots,n^i\}$, and $\tilde{y}_0^i = y_0^i = g_0^i = \nabla f^i(s_0^i) \in \mathbb{R}^p$.
1.2. Parameters selection
Step 3: According to the definition of the constituent functions, we denote $\bar{n} = \max_{i \in \mathcal{V}}\{n^i\}$ and $\underline{n} = \min_{i \in \mathcal{V}}\{n^i\}$. Moreover, we select an arbitrary parameter $\omega_1 > 0$, select $\omega_3$ according to $\omega_3 > 8\omega_1$, choose $\omega_2$ following from $2\kappa_1\omega_2 > 8\omega_1 - 8\kappa_1\alpha^2\omega_3$ with $0 < \alpha < \sqrt{\omega_1/\omega_3}$, pick $\omega_4$ such that $2\bar{n}\omega_1 + 2\bar{n}\omega_2 < \omega_4\underline{n}$, and opt for $\omega_5$ satisfying $4960\omega_1 + 1064\omega_2 + 752\omega_3 + 168\omega_4 < \omega_5(1-\kappa_3)^2$. Step 4: According to the parameters denoted in Step 3, the momentum coefficient $\alpha$ is selected from the interval $0 < \alpha < \sqrt{\omega_1/\omega_3}$, and the constant step-size $\eta$ is selected from the interval

$$0 < \eta < \min\left\{\frac{\underline{n}\,\omega_2}{2\sqrt{2}\,\kappa_2\omega_4},\ \frac{\underline{n}\,\omega_1}{8\,\kappa_2\omega_4},\ \frac{\underline{n}\,\alpha^2\omega_3}{\kappa_1\omega_4},\ \frac{1-\kappa_3}{99\,\kappa_2},\ \frac{(1-\kappa_3)\sqrt{\omega_3 - 8\omega_1}}{\kappa_2\sqrt{160\,\omega_3 + 96\,\omega_1 + 32\,\omega_4 + 16\,\omega_5}}\right\}.$$
1.3. Information exchange
Step 5: According to the weights of the communication network, each node $i \in \mathcal{V}$ exchanges the variables (information) $x_t^i$ and $y_t^i$ with its neighboring nodes $j \in \mathcal{V}^i$. Then, each node $i \in \mathcal{V}$ computes the weighted sums $\sum_{j \in \mathcal{V}^i} a_{ij}(x_t^i - \tilde{x}_t^j)$ and $\sum_{j \in \mathcal{V}^i} a_{ij}(y_t^i - \tilde{y}_t^j)$ for $t \ge 0$, where $a_{ij} > 0$ is the weight between node $j$ and node $i$.
1.4. Event-triggering communication strategy
Step 6: The event-triggering strategy provides a new perspective for information sampling and transmission. Before introducing it, we first define $t_k^i$ as the $k$-th triggering time of node $i$, where $i \in \mathcal{V}$. In methods based on the event-triggering strategy, the local estimators of node $i$ at time $t$ are determined by its own estimators and the latest information sent from its neighbors $j \in \mathcal{V}^i$ (at the latest triggering time of node $j$ before $t$). Assume that $\tilde{x}_t^i$ and $\tilde{y}_t^i$ are the information that node $i$ transmitted to its neighbors at its latest triggering time before time $t$, i.e.,

$$\tilde{x}_t^i = x_{t_{k(i,t)}^i}^i, \quad \tilde{y}_t^i = y_{t_{k(i,t)}^i}^i, \quad \text{for } t_{k(i,t)}^i \le t < t_{k(i,t)+1}^i, \qquad (1)$$

where $x_t^i$ and $y_t^i$ are the two estimators of node $i$. Moreover, we suppose that all nodes broadcast their estimators $x_0^i$ and $y_0^i$ at the initial time, i.e., $\tilde{x}_0^i = x_0^i$ and $\tilde{y}_0^i = y_0^i$ for all $i \in \mathcal{V}$. In addition, the next triggering time $t_{k(i,t)+1}^i$ after $t$ for node $i \in \mathcal{V}$ is decided by

$$t_{k(i,t)+1}^i = \inf\{t' > t_{k(i,t)}^i : \|E_{t'}^{i,x}\|^2 + \|E_{t'}^{i,y}\|^2 > C\kappa_4^{t'}\}, \qquad (2)$$

where $C\kappa_4^t$ is the event-triggering threshold with parameters $C > 0$ and $0 < \kappa_4 < 1$, and $E_t^{i,x}$, $E_t^{i,y}$ are the measurement errors defined by

$$E_t^{i,x} = \tilde{x}_t^i - x_t^i, \qquad E_t^{i,y} = \tilde{y}_t^i - y_t^i. \qquad (3)$$
1.5. Variables update
Step 7: According to the event-triggering strategy and the momentum acceleration mechanism, each node $i$ first updates the local decision estimator $x_{t+1}^i$ and the local accelerated estimator $s_{t+1}^i$, i.e.,

$$x_{t+1}^i = s_t^i - \sum_{j \in \mathcal{V}^i} a_{ij}(x_t^i - \tilde{x}_t^j) - \eta\,y_t^i, \qquad (4)$$
$$s_{t+1}^i = x_{t+1}^i + \alpha\,(x_{t+1}^i - x_t^i). \qquad (5)$$

Step 8: Subsequently, each node $i$ maintains a gradient table that stores all local constituent gradients $\nabla f^{i,j}(e_t^{i,j})$, where $e_t^{i,j}$ is the most recent estimator at which the constituent gradient $\nabla f^{i,j}$ was evaluated. At each iteration $t+1$, each node $i$ uniformly and randomly selects one constituent function indexed by $\chi_{t+1}^i \in \{1,\ldots,n^i\}$ from its own local data batch, and then generates the local stochastic gradient $g_{t+1}^i$ as

$$g_{t+1}^i = \nabla f^{i,\chi_{t+1}^i}(s_{t+1}^i) - \nabla f^{i,\chi_{t+1}^i}(e_t^{i,\chi_{t+1}^i}) + \frac{1}{n^i}\sum_{j=1}^{n^i} \nabla f^{i,j}(e_t^{i,j}). \qquad (6)$$

After generating $g_{t+1}^i$, the entry $\nabla f^{i,\chi_{t+1}^i}(e_t^{i,\chi_{t+1}^i})$ is replaced by the newly computed constituent gradient $\nabla f^{i,\chi_{t+1}^i}(s_{t+1}^i)$, while the other entries remain the same. That is to say, if $j = \chi_{t+1}^i$, then store $\nabla f^{i,j}(e_{t+1}^{i,j}) = \nabla f^{i,j}(s_{t+1}^i)$; else $\nabla f^{i,j}(e_{t+1}^{i,j}) = \nabla f^{i,j}(e_t^{i,j})$.

Step 9: Based on Step 8, each node $i \in \mathcal{V}$ updates the local gradient auxiliary estimator $y_{t+1}^i$ according to

$$y_{t+1}^i = y_t^i - \sum_{j \in \mathcal{V}^i} a_{ij}(y_t^i - \tilde{y}_t^j) + g_{t+1}^i - g_t^i. \qquad (7)$$

Step 10: Based on the above updates, each node $i \in \mathcal{V}$ calculates the measurement errors $E_t^{i,x}$, $E_t^{i,y}$ in (3), and then tests the triggering condition in (2). If the triggering condition in (2) is satisfied, then node $i \in \mathcal{V}$ broadcasts $x_{t+1}^i$ and $y_{t+1}^i$ to its neighbors $j \in \mathcal{V}^i$ and updates the latest triggering time. Step 11: Each node $i \in \mathcal{V}$ sets $t = t + 1$ and goes to Step 7 until a certain stopping criterion is satisfied, e.g., $t > t_{\max}$, where $t_{\max}$ is the maximum number of iterations.
AU2020101959A 2020-08-24 2020-08-24 Decentralized optimization algorithm for machine learning tasks in networks: Resource efficient Ceased AU2020101959A4 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2020101959A AU2020101959A4 (en) 2020-08-24 2020-08-24 Decentralized optimization algorithm for machine learning tasks in networks: Resource efficient

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
AU2020101959A AU2020101959A4 (en) 2020-08-24 2020-08-24 Decentralized optimization algorithm for machine learning tasks in networks: Resource efficient

Publications (1)

Publication Number Publication Date
AU2020101959A4 true AU2020101959A4 (en) 2020-10-01

Family

ID=72608234

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2020101959A Ceased AU2020101959A4 (en) 2020-08-24 2020-08-24 Decentralized optimization algorithm for machine learning tasks in networks: Resource efficient

Country Status (1)

Country Link
AU (1) AU2020101959A4 (en)


Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112508199A (en) * 2020-11-30 2021-03-16 同盾控股有限公司 Feature selection method, device and related equipment for cross-feature federated learning
CN113065252A (en) * 2021-04-01 2021-07-02 南京航空航天大学 Method for establishing likelihood function of cutting stability experimental data about model parameters
CN113211446A (en) * 2021-05-20 2021-08-06 长春工业大学 Event trigger-neural dynamic programming mechanical arm decentralized tracking control method
CN113211446B (en) * 2021-05-20 2023-12-08 长春工业大学 Mechanical arm decentralized tracking control method for event triggering-nerve dynamic programming
CN113344214A (en) * 2021-05-31 2021-09-03 北京百度网讯科技有限公司 Training method and device of data processing model, electronic equipment and storage medium
CN113344214B (en) * 2021-05-31 2022-06-14 北京百度网讯科技有限公司 Training method and device of data processing model, electronic equipment and storage medium
CN114123173A (en) * 2021-11-15 2022-03-01 南京邮电大学 Micro-grid elastic energy management method based on event trigger mechanism under network attack
CN114123173B (en) * 2021-11-15 2024-05-14 南京邮电大学 Micro-grid elastic energy management method based on event triggering mechanism under network attack
CN114040425A (en) * 2021-11-17 2022-02-11 中国电信集团系统集成有限责任公司 Resource allocation method based on global resource availability optimization
CN114040425B (en) * 2021-11-17 2024-03-15 中电信数智科技有限公司 Resource allocation method based on global resource utility rate optimization

Similar Documents

Publication Publication Date Title
AU2020101959A4 (en) Decentralized optimization algorithm for machine learning tasks in networks: Resource efficient
Liang et al. Deep-learning-based wireless resource allocation with application to vehicular networks
Lee et al. Stochastic dual averaging for decentralized online optimization on time-varying communication graphs
Srivastava et al. Distributed min-max optimization in networks
Xie et al. A novel relay node placement and energy efficient routing method for heterogeneous wireless sensor networks
CN104079576A (en) Dynamic cooperation alliance structure forming method based on Bayes alliance game
Bouton et al. Coordinated reinforcement learning for optimizing mobile networks
Thien et al. A transfer games actor–critic learning framework for anti-jamming in multi-channel cognitive radio networks
Wang et al. Decentralized multi-agent power control in wireless networks with frequency reuse
Song et al. Optimizing DoS attack energy with imperfect acknowledgments and energy harvesting constraints in cyber-physical systems
CN103916969A (en) Combined authorized user perception and link state estimation method and device
Tuyishimire et al. Modelling and analysis of interference diffusion in the internet of things: An epidemic model
CN109412661A (en) A kind of user cluster-dividing method under extensive mimo system
CN116862021A (en) anti-Bayesian-busy attack decentralization learning method and system based on reputation evaluation
Huang et al. Stochastic approximation based consensus dynamics over Markovian networks
Alpcan Noncooperative games for control of networked systems
CN109548032B (en) Distributed cooperative spectrum cognition method for dense network full-band detection
Bianchi et al. Performance analysis of a distributed Robbins-Monro algorithm for sensor networks
Kirti et al. Scalable distributed Kalman filtering through consensus
Ge et al. Networked Kalman filtering with combined constraints of bandwidth and random delay
Wang et al. An adaptive location estimator based on alpha-beta filtering for wireless sensor networks
Håkansson et al. Optimal scheduling policy for spatio-temporally dependent observations using age-of-information
CN108736991B (en) Group intelligent frequency spectrum switching method based on classification
Banerjee Resource allocation and optimization in cognitive radio using cascaded machine learning algorithm
Wang et al. Convergence Time Minimization for Federated Reinforcement Learning over Wireless Networks

Legal Events

Date Code Title Description
FGI Letters patent sealed or granted (innovation patent)
MK22 Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry