AU2020101237A4 - A computation-efficient distributed algorithm for convex constrained optimization problem - Google Patents

A computation-efficient distributed algorithm for convex constrained optimization problem Download PDF

Info

Publication number
AU2020101237A4
Authority
AU
Australia
Prior art keywords
node
variable
gradient
computation
vfi
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
AU2020101237A
Inventor
Zhengran Cao
Qingguo Lü
Keke ZHANG
Yunhang Zhu
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest University
Original Assignee
Southwest University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest University filed Critical Southwest University
Priority to AU2020101237A priority Critical patent/AU2020101237A4/en
Application granted granted Critical
Publication of AU2020101237A4 publication Critical patent/AU2020101237A4/en
Ceased legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 30/00 Computer-aided design [CAD]
    • G06F 30/20 Design optimisation, verification or simulation
    • G06F 30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/11 Complex mathematical operations for solving equations, e.g. nonlinear equations, general mathematical optimization problems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Complex Calculations (AREA)

Abstract

A computation-efficient distributed algorithm based on the stochastic gradient projection method is proposed over an undirected network for solving convex constrained optimization problems with a sum of smooth convex functions and non-smooth regularization terms subject to locally general constraints. The algorithm mainly includes five parts: variable initialization, parameter definition and selection, information exchange, stochastic gradient computation, and variable update. By adopting the variance reduction technique, the proposed algorithm set forth in the present invention reduces the amount of computation in comparison with related works. Assuming smooth and strongly convex cost functions, the proposed algorithm finds the exact optimal solution in expectation provided that the constant step-size is less than an explicitly estimated upper bound. Compared with existing distributed schemes, the proposed algorithm is more convenient for general constrained optimization problems and possesses low computation complexity in terms of the total number of local gradient evaluations. The present invention has broad application in modern large-scale information processing problems in machine learning, where the samples of a training dataset are randomly distributed across multiple computing nodes and each smooth objective function is further considered as the average of several constituent functions.

Description

[Figure 1, sheet 1/2] Flowchart of the proposed algorithm: each node sets k = 0 and the maximum number of iterations k_max; initializes its local variables and selects the fixed step-size as well as the fixed tunable parameter according to the network and the problem; computes the stochastic gradient according to the variance reduction technique; updates the main variable according to the stochastic gradient projection method; updates the Lagrangian multipliers according to the main variable and the projection operator; then sets k = k + 1 and repeats until k > k_max.
Figure 1
1. Technical Field
[0001] The present invention relates to the field of machine learning.
2. Background
[0002] Due to the limited computational and storage capacity of individual nodes, centralized processing of large-scale tasks on a single computing node becomes unrealistic. Distributed optimization has long been a classical topic, yet it has recently sparked considerable interest in many emerging applications (large-scale tasks), including but not limited to parameter estimation, network attack, machine learning, and IoT networks. This resurgence of interest is facilitated by at least two facts: a) the latest development of high-performance computing platforms enables us to adopt distributed resources to significantly promote computation efficiency; b) the size of datasets often far exceeds the storage capacity of one machine and requires coordination among multiple machines. In distributed optimization (no centralized coordination), each node is only allowed to interact with its neighbors over a locally connected network. Designing computation-efficient distributed algorithms for broad classes of optimization problems is in general more challenging. Distributed optimization methods that depend only on gradient information have become a core interest in processing large-scale tasks due to their excellent scalability. Many known methods, including distributed gradient descent (DGD), dual averaging, EXTRA, ADMM, adaptive diffusion, and gradient tracking, have been studied in the literature. Moreover, quite a few efficient methods for dealing with various practical problems, such as privacy security, machine learning, and online distributed learning, have emerged. Simultaneously, much attention has been paid to distributed continuous-time optimization methods, mainly due to their flexible application in continuous-time physical systems and hardware implementation, and because the well-developed continuous-time control technology is helpful for analysis. During this period, a number of distributed continuous-time optimization methods that adopt first-order gradient information or second-order Hessian information have been investigated for various kinds of problems. With the advent of the big-data era, the amount of data that nodes in the network need to process is getting larger and more complicated. Therefore, the above methods can be computationally very demanding because each iteration needs a full gradient evaluation of the local objective functions. This may make these methods practically infeasible when dealing with large-scale tasks, mainly because the nodes in the network need to cope with large amounts of various data. Based on this, we assume that at each iteration the proposed algorithm only evaluates the gradient of one randomly selected constituent function, and employs the unbiased stochastic average gradient (obtained as the average of all most recent stochastic gradients) to estimate the local gradients. Moreover, we integrate the variance reduction technique with the distributed stochastic gradient projection method with constant step-size to achieve the exact optimal solution in expectation.
3. Notation
[0003] In this invention, all vectors default to column vectors. Let R, R^n, and R^{m×n} be the set of real numbers, n-dimensional real column vectors, and m×n real matrices, respectively. The matrix I_n is the n×n identity matrix, whereas 1 and 0 (of appropriate dimensions) are the column vectors of all ones and all zeros, respectively. A quantity (possibly a vector) allocated to node i is indexed by a subscript i; e.g., x_i^k is node i's estimate at time k. For a real symmetric matrix A, we use λ_max(A) and λ_min(A) to represent the largest and the smallest eigenvalues of A, respectively. The transposes of a vector x and a matrix A are represented by x^T and A^T. We denote ||·|| and ||·||_1 as the Euclidean norm (for vectors) and the 1-norm, respectively. For a positive semi-definite matrix A ∈ R^{n×n}, we use ||x||_A = sqrt(x^T A x). The symbols ⊗ and ∏ denote the Kronecker product and the Cartesian product, respectively. Given a random estimator x, E[x] denotes its expectation. For a vector x = [x_1, x_2, ..., x_n]^T, we use Z = diag{x} to represent the diagonal matrix satisfying Z_ii = x_i for all i = 1, ..., n and Z_ij = 0 for all i ≠ j. Denote (·)_+ = max{0, ·}. Let P_Ω : R^{nd} → Ω and P_[-1,1] : R^m → [-1,1]^m be two projection operators, onto the constraint set Ω and onto the box [-1,1]^m, respectively.
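Because the operators (·)_+ and P_[-1,1] appear repeatedly in the update rules below, a minimal numpy sketch of these elementwise operations is given here for illustration only; the function names are assumptions, not part of the invention.

```python
import numpy as np

def plus(z):
    # (z)_+ = max{0, z}, applied elementwise
    return np.maximum(z, 0.0)

def proj_box(z, lo, hi):
    # Projection of z onto the box [lo, hi]^d, applied elementwise;
    # P_[-1,1] is the special case lo = -1, hi = 1.
    return np.clip(z, lo, hi)

# Example: apply (.)_+ and P_[-1,1] to a vector
z = np.array([-2.3, 0.4, 1.7])
print(plus(z))              # [0.  0.4 1.7]
print(proj_box(z, -1, 1))   # [-1.  0.4 1. ]
```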
4. Network Model
[0004] The intrinsic interconnections among the nodes in the network are modeled as an undirected graph G = {V, E, A} involving the node set V = {1, 2, ..., n}, the edge set E ⊆ V × V, and the adjacency matrix A = [a_ij] ∈ R^{n×n}. An edge (i, j) ∈ E implies that node i can directly exchange data with node j. The connection weight between nodes i and j in graph G satisfies a_ij = a_ji > 0 if (i, j) ∈ E and a_ij = a_ji = 0 otherwise. Without loss of generality, a_ii = 0, i.e., there are no self-connections in the graph. The degree of node i ∈ V is d_i = Σ_{j=1}^{n} a_ij, and the degree matrix D_G = diag{d_1, d_2, ..., d_n} is a diagonal matrix. The Laplacian matrix of graph G is defined as L_G = D_G - A, which is symmetric and positive semi-definite since the graph G is undirected. A path is a series of consecutive edges. If there is at least one path between any pair of distinct nodes in the graph G, then the graph G is connected.
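For illustration only (not part of the claimed subject matter), the following numpy sketch builds the adjacency, degree, and Laplacian matrices of a small ring network with unit weights and checks the stated properties of L_G:

```python
import numpy as np

n = 5
A = np.zeros((n, n))
for i in range(n):                     # ring: node i connected to i-1 and i+1
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0

D = np.diag(A.sum(axis=1))             # degree matrix D_G
L = D - A                              # Laplacian L_G = D_G - A

eigvals = np.linalg.eigvalsh(L)
print(np.allclose(L, L.T))             # True: symmetric
print(np.all(eigvals >= -1e-12))       # True: positive semi-definite
print(np.sum(eigvals < 1e-9) == 1)     # True: one zero eigenvalue, so the graph is connected
```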
5. Problem Formulation
[0005] Consider the convex constrained optimization problem of the following form:

\min_{\tilde{x} \in R^{d}} J(\tilde{x}) = \sum_{i=1}^{n}\Big(f_{i}(\tilde{x}) + \|P_{i}\tilde{x} - q_{i}\|_{1}\Big), \quad \text{s.t.}\ B_{i}\tilde{x} = c_{i},\ D_{i}\tilde{x} \le s_{i},\ \tilde{x} \in \Omega_{i},\ \forall i \in V \qquad (1)

where \tilde{x} ∈ R^d is the optimization variable, f_i(\tilde{x}) = (1/e_i) Σ_{j=1}^{e_i} f_{i,j}(\tilde{x}) is the local objective function of node i, and ||P_i \tilde{x} - q_i||_1 is a non-smooth L1-regularization term of node i; P_i ∈ R^{m_i×d} (m_i > 0), q_i ∈ R^{m_i}, B_i ∈ R^{w_i×d} (0 < w_i < d) is full row-rank, c_i ∈ R^{w_i}, D_i ∈ R^{v_i×d} (v_i > 0), s_i ∈ R^{v_i}, and Ω_i ⊆ R^d is a non-empty and closed convex set. Here, we consider that the invention is based on the following assumptions: 1) The network G is undirected and connected; 2) Each local constituent function f_{i,j}, i ∈ V, j ∈ {1, ..., e_i}, is σ-smooth and μ-strongly convex, where σ ≥ μ > 0; 3) The feasible set of (1) is nonempty, i.e., the optimization problem (1) is solvable. Under the above assumptions, problem (1) can be equivalently reformulated as the following form:

\min_{x} J(x) = \sum_{i=1}^{n} f_{i}(x_{i}) + \|Px - q\|_{1}, \quad \text{s.t.}\ Bx = c,\ Dx \le s,\ Lx = 0,\ x \in \Omega \qquad (2)

where f_i(x_i) = (1/e_i) Σ_{j=1}^{e_i} f_{i,j}(x_i) and the other stacked parameters are defined below (see Step 3). The formulated problem is frequently found in machine learning (such as modern large-scale information processing problems, reinforcement learning problems, etc.) with large-scale training samples randomly decentralized across multiple computing nodes which focus on collectively training a model utilizing the neighboring nodes' data.
6. Detailed Implementation Description
[0006] Figure 1 is the flowchart of the proposed algorithm in the present invention. As shown in Figure 1, the computation-efficient distributed algorithm comprises the following steps:
6.1. Variable initialization
[0007] Step 1: Each node i ∈ V sets k = 0 and the maximum number of iterations k_max.
[0008] Step 2: Each node i starts with x_i^0 ∈ R^d, α_i^0 ∈ R^{m_i}, β_i^0 ∈ R^{w_i}, λ_i^0 ∈ R^{v_i}, and y_i^0 ∈ R^d.
6.2. Parameters definition and selection
[0009] Step 3: According to the convex constrained optimization problem (1) and the reformulated optimization problem (2), we define x as the vector that stacks all the local estimators x_i, i ∈ V (i.e., x = vec[x_1, ..., x_n] ∈ R^{nd}). Let P, B and D be the block diagonal matrices of P_1 to P_n (i.e., P = blkdiag{P_1, ..., P_n} ∈ R^{m×nd}), B_1 to B_n (i.e., B = blkdiag{B_1, ..., B_n} ∈ R^{w×nd}), and D_1 to D_n (i.e., D = blkdiag{D_1, ..., D_n} ∈ R^{v×nd}), respectively, where m = Σ_{i=1}^{n} m_i, w = Σ_{i=1}^{n} w_i, and v = Σ_{i=1}^{n} v_i. Denote q = [q_1^T, ..., q_n^T]^T ∈ R^m, c = [c_1^T, ..., c_n^T]^T ∈ R^w, s = [s_1^T, ..., s_n^T]^T ∈ R^v, Ω = ∏_{i=1}^{n} Ω_i, and L = L_G ⊗ I_d.
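The stacking in Step 3 can be illustrated with a short numpy/scipy sketch; the dimensions below are arbitrary examples chosen for the illustration, not values prescribed by the invention.

```python
import numpy as np
from scipy.linalg import block_diag

# Illustrative per-node data: n = 3 nodes, local dimension d = 2
n, d = 3, 2
P_i = [np.random.randn(2, d) for _ in range(n)]   # m_i = 2
B_i = [np.random.randn(1, d) for _ in range(n)]   # w_i = 1
D_i = [np.random.randn(2, d) for _ in range(n)]   # v_i = 2

P = block_diag(*P_i)      # P = blkdiag{P_1, ..., P_n} in R^{m x nd}, m = sum of m_i
B = block_diag(*B_i)      # B = blkdiag{B_1, ..., B_n} in R^{w x nd}
D = block_diag(*D_i)      # D = blkdiag{D_1, ..., D_n} in R^{v x nd}

# Laplacian of a ring graph and its lifted version L = L_G kron I_d
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
L_G = np.diag(A.sum(axis=1)) - A
L = np.kron(L_G, np.eye(d))

print(P.shape, B.shape, D.shape, L.shape)   # (6, 6) (3, 6) (6, 6) (6, 6)
```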
[0010] Step 4: According to the definition of the constituent functions, we denote \bar{e} = \max_{i \in V}\{e_i\} and \underline{e} = \min_{i \in V}\{e_i\}.
[0011] Step 5: According to the parameters denoted in Step 3 and Step 4, we select the constant step-size η as follows

0 < \eta < \frac{1}{\lambda_{\max}\big(a I_{nd} + 2\sigma I_{nd} + B^{T}B + L + 2D^{T}D + 2P^{T}P\big)} \qquad (3)

where a is a tunable parameter.
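For concreteness, the right-hand side of (3) can be evaluated numerically once the stacked matrices of Step 3 are formed; the numpy sketch below is illustrative only, and the inputs a and sigma are placeholders rather than values prescribed by the invention.

```python
import numpy as np

def step_size_upper_bound(a, sigma, B, L, D, P):
    """Evaluate the right-hand side of (3):
    1 / lambda_max(a*I + 2*sigma*I + B^T B + L + 2 D^T D + 2 P^T P)."""
    nd = L.shape[0]
    M = (a + 2.0 * sigma) * np.eye(nd) + B.T @ B + L + 2.0 * D.T @ D + 2.0 * P.T @ P
    return 1.0 / np.linalg.eigvalsh(M).max()   # M is symmetric, so eigvalsh applies

# Any constant step-size eta with 0 < eta < step_size_upper_bound(...) satisfies (3).
```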
Then, we select the tunable parameter b as follows

\frac{4\eta\sigma}{a} < b < \frac{2\eta\lambda_{\min}(B^{T}B)}{\sigma} - \frac{4\sigma\eta(\sigma - \mu)}{a\mu} \qquad (4)

6.3. Information exchange
[0012] Step 6: According to the weights of the communication network, each node i ∈ V exchanges its variable (information) x_i^k with its neighboring nodes j ∈ V. Then, each node i ∈ V computes the weighted sum Σ_{j=1, j≠i}^{n} a_ij(x_i^k - x_j^k) for k ≥ 0, where a_ij ≥ 0 is the weight between node j and node i.
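A minimal numpy sketch of this weighted sum, assuming each node has already received its neighbors' current estimates (the array layout is an illustrative assumption):

```python
import numpy as np

def weighted_disagreement(i, X, A):
    """Compute sum_j a_ij * (x_i - x_j) for node i.
    X: (n, d) array of current estimates x_j^k; A: (n, n) weight matrix with a_ii = 0."""
    diffs = X[i] - X            # row j holds x_i - x_j
    return A[i] @ diffs         # weights a_ij applied row-wise and summed
```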
6.4. Stochastic gradient computation
[0013] Step 7: Each node i maintains a gradient table that stores all local constituent gradients ∇f_{i,j}(t_{i,j}^k), where t_{i,j}^k is the most recent estimate at which the constituent gradient ∇f_{i,j} was evaluated. At each iteration k ≥ 0, each node i uniformly at random selects one constituent function, indexed by χ_i^k ∈ {1, ..., e_i}, from its own local data batch, and then generates the local stochastic gradient g_i^k as

g_{i}^{k} = \nabla f_{i,\chi_{i}^{k}}(x_{i}^{k}) - \nabla f_{i,\chi_{i}^{k}}(t_{i,\chi_{i}^{k}}^{k}) + \frac{1}{e_{i}}\sum_{j=1}^{e_{i}} \nabla f_{i,j}(t_{i,j}^{k}) \qquad (5)

After generating g_i^k, the table entry ∇f_{i,χ_i^k}(t_{i,χ_i^k}^k) is replaced by the newly computed constituent gradient ∇f_{i,χ_i^k}(x_i^k), while the other entries remain the same. That is to say, if j = χ_i^k, then the gradient table position is updated with t_{i,j}^{k+1} = x_i^k, i.e., ∇f_{i,j}(t_{i,j}^{k+1}) = ∇f_{i,j}(x_i^k); else ∇f_{i,j}(t_{i,j}^{k+1}) = ∇f_{i,j}(t_{i,j}^k).
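The gradient table and the update rule (5) can be illustrated with a short single-node numpy sketch; the class name, the callable-based interface, and the running-sum shortcut are illustrative assumptions rather than the claimed implementation.

```python
import numpy as np

class GradientTable:
    """Stores the most recent constituent gradients of one node, as in Step 7."""
    def __init__(self, grad_fns, x0):
        self.grad_fns = grad_fns                       # list of callables: grad of f_{i,j}
        self.table = [g(x0) for g in grad_fns]         # gradients at the initial point
        self.table_sum = np.sum(self.table, axis=0)    # running sum for the average term

    def stochastic_gradient(self, x, rng):
        e_i = len(self.grad_fns)
        j = rng.integers(e_i)                          # uniformly pick chi_i^k
        g_new = self.grad_fns[j](x)
        # Equation (5): new gradient - stored gradient + average of stored gradients
        g = g_new - self.table[j] + self.table_sum / e_i
        # Table update: replace only entry j, keep the other entries unchanged
        self.table_sum += g_new - self.table[j]
        self.table[j] = g_new
        return g
```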
6.5. Variables update
[0014] Step 8: Each node i ∈ V implements the projection step of the estimator x_i^k along the local stochastic gradient g_i^k, i.e.,

x_{i}^{k+1} = P_{\Omega_{i}}\Big[x_{i}^{k} - \eta\Big(g_{i}^{k} + P_{i}^{T} P_{[-1,1]}(\alpha_{i}^{k} + P_{i}x_{i}^{k} - q_{i}) + B_{i}^{T}(\beta_{i}^{k} + B_{i}x_{i}^{k} - c_{i}) + D_{i}^{T}(\lambda_{i}^{k} + D_{i}x_{i}^{k} - s_{i})_{+} + y_{i}^{k} + \sum_{j=1, j \ne i}^{n} a_{ij}(x_{i}^{k} - x_{j}^{k})\Big)\Big] \qquad (6)
[0015] Step 9: Based on the projection operator and Step 8, each node i ∈ V updates the variable α_i^{k+1} according to

\alpha_{i}^{k+1} = P_{[-1,1]}(\alpha_{i}^{k} + P_{i}x_{i}^{k+1} - q_{i}) \qquad (7)
[0016] Step 10: Based on Step 8, each node i ∈ V updates the variable β_i^{k+1} according to

\beta_{i}^{k+1} = \beta_{i}^{k} + B_{i}x_{i}^{k+1} - c_{i} \qquad (8)
[0017] Step 11: Based on the definition of (·)_+ and Step 8, each node i ∈ V updates the variable λ_i^{k+1} according to

\lambda_{i}^{k+1} = (\lambda_{i}^{k} + D_{i}x_{i}^{k+1} - s_{i})_{+} \qquad (9)
[0018] Step 12: Based on Step 8, each node i ∈ V updates the variable y_i^{k+1} according to

y_{i}^{k+1} = y_{i}^{k} + \sum_{j=1, j \ne i}^{n} a_{ij}(x_{i}^{k+1} - x_{j}^{k+1}) \qquad (10)
[0019] Step 13: Each node i ∈ V sets k = k + 1 and goes back to Step 7 until a certain stopping criterion is satisfied, e.g., k > k_max, where k_max is the maximum number of iterations.
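Putting Steps 6-12 together, one iteration of the per-node update can be sketched as follows. This is a minimal illustration that assumes Ω_i is a box [lo, hi]^d (as in the simulation of Section 8.1), reuses the GradientTable helper sketched after Step 7, and uses field and function names that are assumptions rather than values prescribed by the invention.

```python
import numpy as np

def node_iteration(i, X, state, data, eta, A, rng):
    """One iteration of Steps 6-12 for node i (illustrative sketch).
    X: (n, d) array of current estimates; state: dict with this node's
    alpha, beta, lam, y and its GradientTable; data: dict with P, q, B, c,
    D, s and the box bounds (lo, hi) describing Omega_i."""
    x = X[i]
    P_i, q_i = data["P"], data["q"]
    B_i, c_i = data["B"], data["c"]
    D_i, s_i = data["D"], data["s"]

    # Step 6: weighted disagreement with neighbors
    consensus = A[i] @ (x - X)

    # Step 7: variance-reduced stochastic gradient, equation (5)
    g = state["table"].stochastic_gradient(x, rng)

    # Step 8: projected primal update, equation (6)
    grad = (g
            + P_i.T @ np.clip(state["alpha"] + P_i @ x - q_i, -1.0, 1.0)
            + B_i.T @ (state["beta"] + B_i @ x - c_i)
            + D_i.T @ np.maximum(state["lam"] + D_i @ x - s_i, 0.0)
            + state["y"] + consensus)
    x_new = np.clip(x - eta * grad, data["lo"], data["hi"])

    # Steps 9-11: dual variable updates, equations (7)-(9)
    state["alpha"] = np.clip(state["alpha"] + P_i @ x_new - q_i, -1.0, 1.0)
    state["beta"] = state["beta"] + B_i @ x_new - c_i
    state["lam"] = np.maximum(state["lam"] + D_i @ x_new - s_i, 0.0)
    # Step 12: the y-update (10) is applied after the nodes exchange x_new
    return x_new
```

After all nodes have computed x_i^{k+1} and exchanged them with their neighbors, each node finishes the iteration by applying the y-update (10), incrementing k, and checking the stopping criterion of Step 13.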
7. Innovation
[0020] 1) Using the unbiased stochastic average gradient greatly reduces the expense of full gradient evaluations, which may lower energy consumption and extend the useful life of the network.
[0021] 2) Integrating the variance reduction technique into the distributed stochastic gradient projection method with a well-selected constant step-size achieves the exact optimal solution in expectation.
[0022] 3) A computation-efficient distributed algorithm is proposed for solving the convex constrained optimization problem, which has broad application in modern large-scale information processing problems in machine learning.
8. Simulations
8.1. Performance examination
[0023] First, the proposed algorithm is applied to solve a general distributed minimization problem which is described as follows:

\min_{\tilde{x} \in R^{4}} \sum_{i=1}^{n}\Big(\frac{1}{e_{i}}\sum_{j=1}^{e_{i}}\|C_{i,j}\tilde{x} - b_{i,j}\|^{2} + \|P_{i}\tilde{x} - q_{i}\|_{1}\Big)
\text{s.t.}\ \tilde{x}_{1} + \tilde{x}_{2} + \tilde{x}_{3} + \tilde{x}_{4} = 3,
\tilde{x}_{1} - \tilde{x}_{2} + \tilde{x}_{3} - \tilde{x}_{4} \le 2,
-2 \le \tilde{x}_{l} \le 2,\ l = 1, \ldots, 4, \qquad (11)

where \tilde{x} ∈ R^4, C_{i,j} ∈ R^{1×4}, P_i ∈ R^{1×4}, b_{i,j} ∈ R, and q_i ∈ R for all i, j. Let n = 10 and e_i = 10 for all i. The components of C_{i,j}, b_{i,j}, P_i, and q_i are randomly selected in [0, 2], [-4, 4], [0, 2], and [-4, 4], respectively. The communication among the 10 nodes is modeled as a ring network. Node i is assigned the i-th objective function f_i(\tilde{x}) + ||P_i \tilde{x} - q_i||_1, where f_i(\tilde{x}) = (1/e_i) Σ_{j=1}^{e_i} ||C_{i,j}\tilde{x} - b_{i,j}||^2 with the constituent functions f_{i,j}(\tilde{x}) = ||C_{i,j}\tilde{x} - b_{i,j}||^2, j = 1, ..., e_i. In the simulation, the constant step-size η is set to 0.04 and the initial conditions (x_i^0, α_i^0, β_i^0, λ_i^0, and y_i^0) are randomly generated for the proposed algorithm. Figure 2 depicts the transient behaviors of all dimensions of the state estimate x_i^k. Figure 2 indicates that the state estimates in the proposed algorithm successfully achieve consensus at the global optimal solution in expectation. Figure 3 verifies the privacy-masking properties of a generalized version of the proposed algorithm that uses the differential privacy strategy. Suppose that two datasets Z and Z' differ in exactly one element while all other elements are identical. Figure 3 shows that the two outputs (for one randomly displayed node) x_i and x_i' are almost identical, so that an adversary is unable to obtain personal sensitive information.
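For illustration, the synthetic data of this example can be generated as follows; the random seed, array layout, and the helper grad_f_ij are assumptions made for the sketch, not part of the invention.

```python
import numpy as np

rng = np.random.default_rng(0)
n, e_i, d = 10, 10, 4

# Constituent data f_{i,j}(x) = ||C_ij x - b_ij||^2 and L1 term ||P_i x - q_i||_1
C = rng.uniform(0.0, 2.0, size=(n, e_i, 1, d))    # C_ij components in [0, 2]
b = rng.uniform(-4.0, 4.0, size=(n, e_i, 1))      # b_ij in [-4, 4]
P = rng.uniform(0.0, 2.0, size=(n, 1, d))         # P_i components in [0, 2]
q = rng.uniform(-4.0, 4.0, size=(n, 1))           # q_i in [-4, 4]

# Local constraints common to all nodes in (11)
B_i = np.ones((1, d));                c_i = np.array([3.0])   # x1+x2+x3+x4 = 3
D_i = np.array([[1., -1., 1., -1.]]); s_i = np.array([2.0])   # x1-x2+x3-x4 <= 2
box_lo, box_hi = -2.0, 2.0                                    # Omega_i = [-2, 2]^4

def grad_f_ij(i, j, x):
    # Gradient of ||C_ij x - b_ij||^2 with respect to x
    return 2.0 * C[i, j].T @ (C[i, j] @ x - b[i, j])
```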
8.2. Application behavior
[0024] Second, we further verify the application behavior of the proposed algorithm with numerical simulations on real datasets. We consider the distributed sparse logistic regression problem using the Breast Cancer Wisconsin (Diagnostic) dataset and the Mushroom dataset provided in the UCI Machine Learning Repository. In the Breast Cancer Wisconsin (Diagnostic) dataset, we adopt N = 200 samples as training data, where each training sample has dimension d = 9. In the Mushroom dataset, we employ N = 6000 samples as training data, where each training sample has dimension d = 112. All the features have been preprocessed and normalized to unit vectors for each dataset. For the network, we generate a randomly connected network with n = 10 nodes utilizing an Erdos-Renyi model with connection probability p = 0.4. The distributed sparse logistic regression problem can be formally described as

\min_{\tilde{x} \in R^{d}} \sum_{i=1}^{n} f_{i}(\tilde{x}) + \kappa_{1}\|\tilde{x}\|_{1}, \qquad (12)

with the local objective function f_i(\tilde{x}) being

f_{i}(\tilde{x}) = \frac{1}{e_{i}}\sum_{j=1}^{e_{i}} \ln\big(1 + \exp(-b_{i,j} c_{i,j}^{T}\tilde{x})\big) + \frac{\kappa_{2}}{2}\|\tilde{x}\|^{2},

where b_{i,j} ∈ {-1, 1} and c_{i,j} ∈ R^d are the local data kept by node i for j ∈ {1, ..., e_i}. The regularization term κ_1||\tilde{x}||_1 is applied to impose sparsity of the optimal solution, and (κ_2/2)||\tilde{x}||^2 is added to avoid overfitting. In the simulation, we assign the data randomly to the local nodes, i.e., Σ_{i=1}^{n} e_i = N. We set the regularization parameters κ_1 = 0.1 and κ_2 = 5, respectively. Moreover, the step-size η of each algorithm is selected to ensure the best possible convergence. Figure 4 depicts the evolution of the residuals log_10(||x_i^k - \tilde{x}^*||) for the proposed algorithm and for a distributed method that can handle non-smooth regularization terms, on the two training datasets. From Figure 4, we can find that the proposed algorithm exhibits a linear convergence rate on both training sets.
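For reference, the constituent gradient that would populate each node's gradient table in this experiment can be sketched as follows, under the assumption that every constituent f_{i,j} carries the full ridge term (κ_2/2)||x||^2 so that the average of the e_i constituents recovers f_i; this is an illustrative assumption rather than a detail disclosed above.

```python
import numpy as np

def grad_logistic_constituent(x, c_ij, b_ij, kappa2):
    """Gradient of the constituent f_{i,j}(x) = ln(1 + exp(-b_ij * c_ij^T x))
    + (kappa2 / 2) * ||x||^2, so that (1/e_i) * sum_j f_{i,j} recovers f_i."""
    margin = b_ij * (c_ij @ x)
    sigma = 1.0 / (1.0 + np.exp(margin))     # sigmoid of the negative margin
    return -b_ij * sigma * c_ij + kappa2 * x
```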
9. Brief Description of The Drawings
[0025] Figure 1 is a flowchart of the computation-efficient distributed algorithm.
[0026] Figure 2 shows the convergence of x_i^k for solving the general minimization problem.
[0027] Figure 3 shows the outputs (one randomly displayed node) x_i and x_i' corresponding to the adjacent datasets Z and Z'.
[0028] Figure 4 shows the comparisons between the proposed algorithm and the applicable method on the two real datasets.

Claims (5)

The claims defining the invention are as follows:
1. A computation-efficient distributed algorithm
1.1. Variable initialization
Step 1: Each node i ∈ V sets k = 0 and the maximum number of iterations k_max.
Step 2: Each node i starts with x_i^0 ∈ R^d, α_i^0 ∈ R^{m_i}, β_i^0 ∈ R^{w_i}, λ_i^0 ∈ R^{v_i}, and y_i^0 ∈ R^d.
1.2. Parameters definition and selection
Step 3: According to the convex constrained optimization problem (1) and the reformulated optimization problem (2), we define x as the vector that stacks all the local estimators x_i, i ∈ V (i.e., x = vec[x_1, ..., x_n] ∈ R^{nd}). Let P, B and D be the block diagonal matrices of P_1 to P_n (i.e., P = blkdiag{P_1, ..., P_n} ∈ R^{m×nd}), B_1 to B_n (i.e., B = blkdiag{B_1, ..., B_n} ∈ R^{w×nd}), and D_1 to D_n (i.e., D = blkdiag{D_1, ..., D_n} ∈ R^{v×nd}), respectively, where m = Σ_{i=1}^{n} m_i, w = Σ_{i=1}^{n} w_i, and v = Σ_{i=1}^{n} v_i. Denote q = [q_1^T, ..., q_n^T]^T ∈ R^m, c = [c_1^T, ..., c_n^T]^T ∈ R^w, s = [s_1^T, ..., s_n^T]^T ∈ R^v, Ω = ∏_{i=1}^{n} Ω_i, and L = L_G ⊗ I_d.
Step 4: According to the definition of the constituent functions in (2), we denote \bar{e} = \max_{i \in V}\{e_i\} and \underline{e} = \min_{i \in V}\{e_i\}.
Step 5: According to the parameters denoted in Step 3 and Step 4, we select the constant step-size η as follows

0 < \eta < \frac{1}{\lambda_{\max}\big(a I_{nd} + 2\sigma I_{nd} + B^{T}B + L + 2D^{T}D + 2P^{T}P\big)} \qquad (3)
where a is a tunable parameter. Then, we select the tunable parameter b as follows
\frac{4\eta\sigma}{a} < b < \frac{2\eta\lambda_{\min}(B^{T}B)}{\sigma} - \frac{4\sigma\eta(\sigma - \mu)}{a\mu} \qquad (4)
1.3. Information exchange
Step 6: According to the weights of the communication network, each node i ∈ V exchanges its variable (information) x_i^k with its neighboring nodes j ∈ V. Then, each node i ∈ V computes the weighted sum Σ_{j=1, j≠i}^{n} a_ij(x_i^k - x_j^k) for k ≥ 0, where a_ij ≥ 0 is the weight between node j and node i.
1.4. Stochastic gradient computation
Step 7: Each node i maintains a gradient table that stores all local constituent gradients ∇f_{i,j}(t_{i,j}^k), where t_{i,j}^k is the most recent estimate at which the constituent gradient ∇f_{i,j} was evaluated. At each iteration k ≥ 0, each node i uniformly at random selects one constituent function, indexed by χ_i^k ∈ {1, ..., e_i}, from its own local data batch, and then generates the local stochastic gradient g_i^k as

g_{i}^{k} = \nabla f_{i,\chi_{i}^{k}}(x_{i}^{k}) - \nabla f_{i,\chi_{i}^{k}}(t_{i,\chi_{i}^{k}}^{k}) + \frac{1}{e_{i}}\sum_{j=1}^{e_{i}} \nabla f_{i,j}(t_{i,j}^{k}) \qquad (5)

After generating g_i^k, the table entry ∇f_{i,χ_i^k}(t_{i,χ_i^k}^k) is replaced by the newly computed constituent gradient ∇f_{i,χ_i^k}(x_i^k), while the other entries remain the same. That is to say, if j = χ_i^k, then the gradient table position is updated with t_{i,j}^{k+1} = x_i^k, i.e., ∇f_{i,j}(t_{i,j}^{k+1}) = ∇f_{i,j}(x_i^k); else ∇f_{i,j}(t_{i,j}^{k+1}) = ∇f_{i,j}(t_{i,j}^k).
1.5. Variables update
Step 8: Each node i ∈ V implements the projection step of the estimator x_i^k along the local stochastic gradient g_i^k, i.e.,

x_{i}^{k+1} = P_{\Omega_{i}}\Big[x_{i}^{k} - \eta\Big(g_{i}^{k} + P_{i}^{T} P_{[-1,1]}(\alpha_{i}^{k} + P_{i}x_{i}^{k} - q_{i}) + B_{i}^{T}(\beta_{i}^{k} + B_{i}x_{i}^{k} - c_{i}) + D_{i}^{T}(\lambda_{i}^{k} + D_{i}x_{i}^{k} - s_{i})_{+} + y_{i}^{k} + \sum_{j=1, j \ne i}^{n} a_{ij}(x_{i}^{k} - x_{j}^{k})\Big)\Big] \qquad (6)
Step 9: Based on the projection operator and Step 8, each node i ∈ V updates the variable α_i^{k+1} according to

\alpha_{i}^{k+1} = P_{[-1,1]}(\alpha_{i}^{k} + P_{i}x_{i}^{k+1} - q_{i}) \qquad (7)
Step 10: Based on Step 8, each node i ∈ V updates the variable β_i^{k+1} according to

\beta_{i}^{k+1} = \beta_{i}^{k} + B_{i}x_{i}^{k+1} - c_{i} \qquad (8)
Step 11: Based on the definition of (·)_+ and Step 8, each node i ∈ V updates the variable λ_i^{k+1} according to

\lambda_{i}^{k+1} = (\lambda_{i}^{k} + D_{i}x_{i}^{k+1} - s_{i})_{+} \qquad (9)
Step 12: Based on Step 8, each node i ∈ V updates the variable y_i^{k+1} according to

y_{i}^{k+1} = y_{i}^{k} + \sum_{j=1, j \ne i}^{n} a_{ij}(x_{i}^{k+1} - x_{j}^{k+1}) \qquad (10)
Step 13: Each node i ∈ V sets k = k + 1 and goes back to Step 7 until a certain stopping criterion is satisfied, e.g., k > k_max, where k_max is the maximum number of iterations.
AU2020101237A 2020-07-03 2020-07-03 A computation-efficient distributed algorithm for convex constrained optimization problem Ceased AU2020101237A4 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2020101237A AU2020101237A4 (en) 2020-07-03 2020-07-03 A computation-efficient distributed algorithm for convex constrained optimization problem

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
AU2020101237A AU2020101237A4 (en) 2020-07-03 2020-07-03 A computation-efficient distributed algorithm for convex constrained optimization problem

Publications (1)

Publication Number Publication Date
AU2020101237A4 true AU2020101237A4 (en) 2020-08-06

Family

ID=71833569

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2020101237A Ceased AU2020101237A4 (en) 2020-07-03 2020-07-03 A computation-efficient distributed algorithm for convex constrained optimization problem

Country Status (1)

Country Link
AU (1) AU2020101237A4 (en)


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113064726A (en) * 2021-04-01 2021-07-02 北京理工大学 Distributed image segmentation method based on sparsity and Burer-Monteiro decomposition
CN113064726B (en) * 2021-04-01 2022-07-29 北京理工大学 Distributed image segmentation method based on sparsity and Burer-Monteiro decomposition
CN113268862A (en) * 2021-05-01 2021-08-17 群智未来人工智能科技研究院(无锡)有限公司 Distributed discrete time algorithm for local fully-constrained optimization problem on directed non-equilibrium graph
CN113591290A (en) * 2021-07-20 2021-11-02 上海华虹宏力半导体制造有限公司 OPC model simulation method
CN113591290B (en) * 2021-07-20 2024-02-06 上海华虹宏力半导体制造有限公司 OPC model simulation method

Similar Documents

Publication Publication Date Title
AU2020101237A4 (en) A computation-efficient distributed algorithm for convex constrained optimization problem
Garriga-Alonso et al. Deep convolutional networks as shallow gaussian processes
Latafat et al. A new randomized block-coordinate primal-dual proximal algorithm for distributed optimization
Ghosh et al. On convergence of differential evolution over a class of continuous functions with unique global optimum
Pratama et al. Automatic construction of multi-layer perceptron network from streaming examples
Ishii et al. The PageRank problem, multiagent consensus, and web aggregation: A systems and control viewpoint
CN116034382A (en) Privacy preserving asynchronous federal learning of vertically partitioned data
AU2020101959A4 (en) Decentralized optimization algorithm for machine learning tasks in networks: Resource efficient
Gratton et al. Distributed ridge regression with feature partitioning
Xiong et al. Straggler-resilient distributed machine learning with dynamic backup workers
Wang et al. Unsupervised learning for asynchronous resource allocation in ad-hoc wireless networks
Zhao et al. Spatiotemporal graph convolutional recurrent networks for traffic matrix prediction
CN113228059A (en) Cross-network-oriented representation learning algorithm
Hu et al. The Barzilai–Borwein Method for distributed optimization over unbalanced directed networks
Zhang et al. Gt-storm: Taming sample, communication, and memory complexities in decentralized non-convex learning
Zhang et al. Net-fleet: Achieving linear convergence speedup for fully decentralized federated learning with heterogeneous data
Tu et al. Byzantine-robust distributed sparse learning for M-estimation
Mitropoulou et al. Anomaly Detection in Cloud Computing using Knowledge Graph Embedding and Machine Learning Mechanisms
Lyu et al. Personalized federated learning with multiple known clusters
Devert et al. Robust multi-cellular developmental design
Camisa et al. Distributed constraint-coupled optimization over random time-varying graphs via primal decomposition and block subgradient approaches
Ay Information geometry on complexity and stochastic interaction
He et al. Byzantine-robust stochastic gradient descent for distributed low-rank matrix completion
Olshevsky et al. Asymptotic network independence in distributed optimization for machine learning
Rastegarnia et al. An incremental LMS network with reduced communication delay

Legal Events

Date Code Title Description
FGI Letters patent sealed or granted (innovation patent)
MK22 Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry