AU2020100180A4 - Effective Doubly-Accelerated Distributed Asynchronous Strategy for General Convex Optimization Problem
- Publication number
- AU2020100180A4
- Authority
- AU
- Australia
- Prior art keywords
- agent
- agents
- algorithm
- rpk
- matrix
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/16—Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
- G06F15/163—Interprocessor communication
- G06F15/173—Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/11—Complex mathematical operations for solving equations, e.g. nonlinear equations, general mathematical optimization problems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Pure & Applied Mathematics (AREA)
- Mathematical Optimization (AREA)
- Mathematical Analysis (AREA)
- Data Mining & Analysis (AREA)
- Software Systems (AREA)
- General Engineering & Computer Science (AREA)
- Computational Mathematics (AREA)
- Computer Hardware Design (AREA)
- Operations Research (AREA)
- Algebra (AREA)
- Databases & Information Systems (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
Abstract
With the advent of the large-scale network and data era, traditional synchronous algorithms, due to their requirement of clock synchronization, are not suitable for handling large-scale network tasks. In view of this, this patent presents an effective doubly-accelerated distributed asynchronous algorithm, based on the heavy-ball method and the Nesterov gradient method, for solving general convex optimization problems defined over a fixed directed multi-node network system. The algorithm mainly comprises six stages: variable initialization; picking the delay value and the activated node; eliminating outdated information; computing the gradient; exchanging information; and updating the variables. The algorithm set forth in the present invention adopts a general asynchronous scheme, in which agents can communicate with their in-neighbors at any time without any coordination or scheduling and perform their local computations using outdated information from their in-neighbors. Therefore, the algorithm greatly reduces the idle time of communication links, mitigates congestion of communication and memory access, saves power, and is more fault-tolerant and robust. The present invention has broad application in large-scale machine learning and network information processing.

[Fig. 4: algorithm flowchart — select the global objective function; each node initializes its local variables, sets k = 0, and sets the maximum number of iterations k_max; compute system parameters; pick the delay value and the activated node; eliminate the old variables; select a step-size and a momentum parameter according to the computed parameters; each activated node updates its variables and computes its gradient; each activated node sends variables to its out-neighbor nodes and sets k = k + 1; stop once k > k_max.]
Description
1. Technical Field
The present invention relates to the field of large-scale machine learning and network information processing.
2. Background and Purpose
Early on, centralized algorithms attracted wide attention from researchers because of the excellent performance of master agents. In these algorithms, the master agents run the optimization algorithm by gathering the needed information from slave agents, which only compute their local tasks. One obvious flaw of this kind of algorithm is that once the master agent is damaged, the whole network stops working. With the growth of large-scale networks and data, traditional centralized algorithms are no longer capable of solving large-scale computing problems. On the contrary, distributed optimization algorithms show great potential in large-scale network data computing and have been widely used in network system control, machine learning, network information processing, and resource allocation. Because of this, researchers' interest has shifted to the design of distributed optimization algorithms. In this class of problems, each agent optimizes the global objective function by operating on its local objective function and communicating only with its in-neighbors. Specifically, consider a variable x ∈ R^n and a strongly connected network of m agents which cooperatively solve the following optimization problem:
$$\min_{x\in\mathbb{R}^n} f(x) = \sum_{i=1}^{m} f_i(x), \qquad (1)$$

where each agent i only has access to a local objective function f_i : R^n → R. To solve problem (1), there are two types of distributed algorithms: synchronous and asynchronous algorithms. The distinguishing feature between them is that agents in an asynchronous algorithm do not wait for updates from other agents but simply compute updates using their currently available information. We further use Fig. 2 to elaborate the striking differences between synchronous and asynchronous algorithms, taking the directed graph in Fig. 1 as an example. Obviously, all the agents in a synchronous algorithm need to agree on the update time t(k), where k denotes the number of updates of the network, which usually requires a global clock or synchronization of all nodes. It is worth mentioning that clock synchronization is not easy for a large-scale system and has been studied for quite a long time.
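As a concrete instance of problem (1), the following minimal Python sketch sets up hypothetical quadratic local objectives; the quadratics, `grad_fi`, and `x_star` are illustrative assumptions, not part of the invention:

```python
import numpy as np

# A minimal sketch of problem (1), assuming quadratic local objectives
# f_i(x) = 0.5 x^T Q_i x + b_i^T x (a hypothetical choice for illustration).
rng = np.random.default_rng(0)
m, n = 5, 3                                   # number of agents, dimension
Q = [np.diag(rng.uniform(1.0, 2.0, n)) for _ in range(m)]  # strongly convex
b = [rng.standard_normal(n) for _ in range(m)]

def grad_fi(i, x):
    """Gradient of the i-th local objective: Q_i x + b_i."""
    return Q[i] @ x + b[i]

# The minimizer of f(x) = sum_i f_i(x) solves sum_i (Q_i x + b_i) = 0.
x_star = np.linalg.solve(sum(Q), -sum(b))
```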
However, many of the aforementioned applications give rise to extremely large-scale networks or data. From this point of view, we naturally call for asynchronous solution methods. In fact, compared with synchronous communication, asynchronous communication has the following advantages: it reduces the idle time of communication links, mitigates congestion of communication and memory access, saves power, and makes the algorithm more fault-tolerant and robust. In addition, communication devices that support asynchronous communication are relatively simple and inexpensive. Thus, we develop an effective doubly-accelerated distributed asynchronous algorithm by combining gradient-tracking technologies with sum-push (not push-sum) technologies. In each iteration, activated agents update their variables by communicating with their in-neighbors, and the other agents keep their variables unchanged until the next iteration.
3. Notation
We use lowercase italics to denote column vectors and uppercase italics to denote matrices. Let 1, e_i, I, and O denote the column vector of all ones, the i-th canonical vector, the identity matrix, and the zero matrix, respectively, whose dimensions can be deduced from the context. For two arbitrary matrices X and Y, we use X ⊗ Y to denote their Kronecker product. For an arbitrary set V, let |V| represent its cardinality. Given an arbitrary vector x, let x̄ and x̲ indicate the largest and the smallest element of x, respectively, and let diag(x) denote the diagonal matrix whose diagonal elements equal the entries of x. The spectral radius of a matrix T is represented by ρ(T). For a primitive, row-stochastic matrix A, we denote its left and right eigenvectors corresponding to the eigenvalue 1 by π_r and 1, respectively, such that π_r^⊤ 1 = 1; similarly, for a primitive, column-stochastic matrix B, we denote its left and right eigenvectors corresponding to the eigenvalue 1 by 1 and π_c, respectively, such that π_c^⊤ 1 = 1. For a matrix X, we denote X_∞ = lim_{k→∞} X^k as its infinite power. According to the Perron–Frobenius theorem, we have A_∞ = 1 π_r^⊤ and B_∞ = π_c 1^⊤. We use ‖·‖ for both vectors and matrices; in the former case it represents the Euclidean norm, and in the latter case it is the spectral norm. The set of nonnegative (resp. positive) integers is denoted by N_0 (resp. N).
4. Communication Network Model
Consider a strongly connected directed original graph G = (V, E) of m agents, where V = {1, 2, …, m} is the set of agents and E is the set of directed edges (i, j), i, j ∈ V, such that agent j can receive information from agent i. Let N_i^in = {j | (j, i) ∈ E} denote the set of in-neighbors of agent i and N_i^out = {j | (i, j) ∈ E} denote the set of out-neighbors of agent i. Then, we construct the augmented graph by adding virtual agents to the original graph G = (V, E). Specifically, we add an ordered set of virtual agents, denoted by va_(j,i)^0, va_(j,i)^1, …, va_(j,i)^D, associated to each edge (j, i) ∈ E, where each virtual agent corresponds to a possible delay value. That is to say, these virtual agents store information according to the associated delay value, which implies that the information has been generated by agent j for agent i but not yet used by i. We further use a simple example in Fig. 3 to illustrate this augmented graph. Agents in the original graph G are called computing agents, while the virtual agents are called noncomputing agents. The set of computing and noncomputing agents is defined as V̄ = V ∪ {va_(j,i)^d | (j, i) ∈ E, d = 0, 1, …, D}, and its cardinality is denoted by S = |V̄| = m + (D + 1)|E|. Reconsidering this augmented directed graph, each computing agent j only sends information to the noncomputing agent va_(j,i)^0 with (j, i) ∈ E, and each noncomputing agent can send information to the next noncomputing agent or to the computing agent.
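A small sketch of this augmented-graph construction, assuming a maximum delay D; the tuple encoding of agents and virtual agents is a hypothetical representation chosen for illustration:

```python
# Each edge (j, i) gets an ordered chain of D+1 virtual (noncomputing) agents
# va^0_(j,i), ..., va^D_(j,i) buffering in-flight, possibly delayed messages.
def augment(V, E, D):
    V_aug = [("agent", i) for i in V]
    E_aug = []
    for (j, i) in E:
        chain = [("va", j, i, d) for d in range(D + 1)]
        V_aug += chain
        E_aug.append((("agent", j), chain[0]))   # j feeds the head of the chain
        for d in range(D):                       # va^d -> va^(d+1)
            E_aug.append((chain[d], chain[d + 1]))
        for d in range(D + 1):                   # any va can deliver to agent i
            E_aug.append((chain[d], ("agent", i)))
    return V_aug, E_aug

V, E = [0, 1, 2], [(0, 1), (1, 2), (2, 0)]
V_aug, E_aug = augment(V, E, D=2)
assert len(V_aug) == len(V) + (2 + 1) * len(E)   # |V̄| = m + (D+1)|E|
```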
5. Problem Formulation
We rewrite problem (1) in the following form:

$$\min_{\mathbf{x}} F(\mathbf{x}) = \frac{1}{m}\sum_{i=1}^{m} f_i(x^i), \quad \text{subject to } x^i = x^j,\ \forall\, i, j \in V, \qquad (2)$$

where 𝐱 collects the local copies x^1, …, x^m, and each local function f_i : R^n → R is only known to agent i. Each local objective function f_i, i = 1, 2, …, m, is μ-strongly convex with L_f-Lipschitz continuous gradient. That is, for any i and x_1, x_2 ∈ R^n, f_i(x_1) − f_i(x_2) ≤ ∇f_i(x_1)^⊤(x_1 − x_2) − (μ/2)‖x_1 − x_2‖² and ‖∇f_i(x_1) − ∇f_i(x_2)‖ ≤ L_f ‖x_1 − x_2‖, where L_f ≥ μ > 0. The global optimal solutions to problems (1) and (2) are denoted by x* and x̄*, respectively, where x̄* = 1 ⊗ x*.
6. Detailed Implementation Description
Fig. 4 is an algorithm flowchart of the present invention. As shown in Fig. 4, the distributed asynchronous optimization algorithm comprises the following steps:
6.1. Initializing Variables
Step 1: Each agent i G V sets k = 0 and sets a stopping criterion.
Step 2: Each agent i ∈ V initializes with x_{−1}^i = 0, x_0^i ∈ R^n, s_0^i ∈ R^n, y_0^i = ∇f_i(s_0^i), v_0^i = 0, for all i ∈ V; ρ_0^{i,j} = 0 and ρ̃_0^{i,j} = 0, for all j ∈ N_i^in and i ∈ V; and ρ_t^{i,j} = 0, for all t = −D, …, 0.
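A sketch of this initialization, reusing the hypothetical `grad_fi` from the earlier sketch; the dictionary layout (`x_prev`, `rho_buf`, `x_hist`, …) is an illustrative assumption about how an implementation might store the per-agent state:

```python
import numpy as np

# A sketch of Steps 1-2: one state container per agent.
def init_agent(i, n, D, in_neighbors, grad_fi):
    s0 = np.zeros(n)
    return {
        "x_prev": np.zeros(n),                # x_{-1}^i = 0
        "x": np.zeros(n),                     # x_0^i (any point in R^n)
        "s": s0,                              # s_0^i
        "y": grad_fi(i, s0),                  # y_0^i = grad f_i(s_0^i)
        "v": np.zeros(n),                     # v_0^i = 0
        "rho": {j: np.zeros(n) for j in in_neighbors},      # cumulative masses
        "rho_buf": {j: np.zeros(n) for j in in_neighbors},  # mass buffers
        "x_hist": {t: np.zeros(n) for t in range(-D, 1)},   # delayed copies
    }
```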
6.2. Constructing Augmented Weight Matrices
Step 3: According to the original graph, first construct a row-stochastic matrix A and a column-stochastic matrix B. Meanwhile, introduce the matrix W = {w^{ij}} to denote either A or B. There exists w̲ > 0 such that w^{ii} ≥ w̲ and w^{ij} ≥ w̲, for all i ∈ V and for all (j, i) ∈ E, respectively; otherwise, we set w^{ij} = 0.
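One common way to satisfy Step 3 is uniform weighting; the sketch below is an assumption for illustration (the invention only requires row/column stochasticity with weights bounded below by w̲ > 0):

```python
import numpy as np

# A averages over in-neighbors plus self (row-stochastic); B splits mass over
# out-neighbors plus self (column-stochastic).
def build_weights(m, E):
    A, B = np.zeros((m, m)), np.zeros((m, m))
    in_nb = {i: [j for (j, t) in E if t == i] for i in range(m)}
    out_nb = {j: [t for (s, t) in E if s == j] for j in range(m)}
    for i in range(m):
        w = 1.0 / (len(in_nb[i]) + 1)
        A[i, i] = w
        for j in in_nb[i]:
            A[i, j] = w                       # row i of A sums to one
    for j in range(m):
        w = 1.0 / (len(out_nb[j]) + 1)
        B[j, j] = w
        for i in out_nb[j]:
            B[i, j] = w                       # column j of B sums to one
    return A, B

A, B = build_weights(3, [(0, 1), (1, 2), (2, 0)])
assert np.allclose(A.sum(axis=1), 1) and np.allclose(B.sum(axis=0), 1)
```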
Step 4: Based on the original graph, construct an augmented row-stochastic matrix Ā_k as follows:

$$\bar{a}_k^{pr} = \begin{cases} a^{i_k i_k}, & \text{if } p = r = i_k;\\ a^{i_k j}, & \text{if } p = i_k,\ r = j + (d_k^j + 1)m;\\ 1, & \text{if } p = r \in \{1, 2, \ldots, 2m\} \setminus \{i_k, i_k + m\};\\ 1, & \text{if } p \in \{2m+1, 2m+2, \ldots, (D+2)m\} \cup \{i_k + m\} \text{ and } r = p - m;\\ 0, & \text{otherwise.} \end{cases}$$
Step 5: Construct an augmented column-stochastic matrix B̄_k in two steps. The first establishes the transition matrix of the sum step as follows:

$$\bar{s}_k^{pr} = \begin{cases} 1, & \text{if } r \in \{va_{(j,i_k)}^{d} \mid k - \tau_{k+1}^{i_k,j} \le d \le D\} \text{ and } p = i_k;\\ 1, & \text{if } p \in \bar{V} \setminus \{va_{(j,i_k)}^{d} \mid k - \tau_{k+1}^{i_k,j} \le d \le D\} \text{ and } r = p;\\ 0, & \text{otherwise.} \end{cases}$$

The second establishes the transition matrix of the push step as follows:

$$\bar{p}_k^{pr} = \begin{cases} b^{j i_k}, & \text{if } r = i_k \text{ and } p = va_{(i_k,j)}^{0},\ j \in N_{i_k}^{\mathrm{out}};\\ b^{i_k i_k}, & \text{if } r = p = i_k;\\ 1, & \text{if } r = p \in \bar{V} \setminus \{i_k\};\\ 1, & \text{if } r = va_{(i,j)}^{d},\ p = va_{(i,j)}^{d+1},\ (i,j) \in E,\ 0 \le d \le D-1;\\ 1, & \text{if } r = p = va_{(i,j)}^{D},\ (i,j) \in E;\\ 0, & \text{otherwise.} \end{cases}$$

Thus, B̄_k = P̄_k S̄_k holds by combining the sum step with the push step.
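Whatever the activated agent i_k and delays are, the augmented matrices must stay stochastic; the following small sanity-check sketch (an illustrative aid, not part of the invention) makes that property testable:

```python
import numpy as np

# Rows of the augmented A_k sum to one; columns of B_k = P_k S_k sum to one,
# since products of column-stochastic matrices are column-stochastic.
def is_row_stochastic(M, tol=1e-12):
    return bool(np.all(M >= -tol) and np.allclose(M.sum(axis=1), 1.0))

def is_column_stochastic(M, tol=1e-12):
    return bool(np.all(M >= -tol) and np.allclose(M.sum(axis=0), 1.0))
```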
6.3. Selecting Parameters
Step 6: According to the graph and the properties of the row- and column-stochastic matrices, select 0 < T < ∞ and 0 ≤ D < ∞. Note that within T iterations all agents update at least once, and D represents the maximum delay value. Then, compute the parameters ρ = (1 − w̲^{K_1})^{1/K_1} ∈ (0, 1) with K_1 = (2m − 1)T + mD, ρ̄ = w̲^{K_1} ∈ (0, 1), and positive constants C_1, C_2, C_3, and C_4 = √2 C_3 determined by ρ, ρ̄, and m.
Step 7: Based on the strongly convex coefficient μ and the Lipschitz-continuous coefficient L_f, and using the small-gain theorem, compute the parameters a_i, i = 1, 2, …, 16, as

| a_1 = C_1 m√m L_f | a_5 = √3 + 1 | a_9 = C_4 m L_f λ | a_13 = 2m√m L_f |
| a_2 = (1 + ρ̄) C_1 | a_6 = √3 m√m L_f | a_10 = C_4 m L_f | a_14 = m√m L_f |
| a_3 = C_1 m | a_7 = √3 m | a_11 = μ m² | a_15 = 2m |
| a_4 = C_1 m L_f | a_8 = √3 m L_f | a_12 = m L_f | a_16 = m |

and λ ∈ (max{ρ̄, ρ + C_1 L_f m√m α, (√3 + 1)β, 1 − μ m² α + m L_f Δα}, 1). In addition, define ξ(α, Δα) = 1 − a_11 α + a_12 Δα.
6.4. Computing Step-Size and Momentum Parameter
Step 8: According to the strongly convex coefficient μ and the Lipschitz-continuous coefficient L_f, select the largest step-size α, the gap Δα between the largest and the smallest step-sizes, and the largest momentum parameter β as follows:

$$0 < \alpha \le \min\left\{\frac{\rho\omega_1 - \rho^2\omega_1}{a_1\rho\omega_1 + a_3\rho\omega_3 + a_4\rho\omega_4},\ \frac{\omega_2 - a_5\omega_1}{a_6\omega_1 + a_7\omega_2 + a_8\omega_4},\ \frac{1 - \bar{\rho}}{\rho^2 L_f}\right\},$$

$$0 < \Delta\alpha \le \frac{a_{11}\omega_4\alpha - a_9\omega_1\alpha - a_{10}\omega_3\alpha}{a_{12}\omega_4 + a_{13}\omega_1 + a_{15}\omega_3},$$

$$0 < \beta \le \min\left\{\frac{\rho\omega_1 - \rho^2\omega_1 - a_1\rho\omega_1\alpha - a_3\rho\omega_3\alpha - a_4\rho\omega_4\alpha}{2\rho\omega_1},\ \frac{\omega_2\big(\omega_4(1 - \xi(\alpha,\Delta\alpha)) - a_9\omega_1\alpha\big) - \omega_3 a_{16}\alpha + \omega_3 a_{13}\Delta\alpha - \omega_1 a_{13}\Delta\alpha}{2\omega_2},\ \frac{(1-\bar{\rho})\omega_3 - a_9\omega_2}{a_9\omega_2 + a_{10}\omega_2},\ \frac{\omega_2 - a_5\omega_1 - a_6\omega_1\alpha - a_7\omega_2\alpha - a_8\omega_4\alpha}{a_5\omega_2}\right\},$$

where ω_1, ω_2, ω_3, ω_4 are arbitrary constants satisfying 0 < ω_1 < ω_2/a_5, 0 < ω_2 < (1 − ρ̄)ω_3/a_9, and ω_4 > (a_13 ω_1 + a_15 ω_3)/a_10.
6.5. Selecting Activated Agents and Delay
Step 9: Pick (i_k, d_k) with d_k = (d_k^j)_{j∈N_{i_k}^in}, where i_k indicates that agent i is activated at time k, and d_k^j denotes the delay value, which satisfies 0 ≤ d_k^j ≤ D < ∞.
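A sketch of Step 9 under a purely random activation rule (one of several admissible rules; the uniform draws are an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(1)

def pick_activation(m, in_neighbors, D):
    i_k = int(rng.integers(m))                        # activated agent
    d_k = {j: int(rng.integers(D + 1)) for j in in_neighbors[i_k]}
    return i_k, d_k                                   # one bounded delay per in-edge
```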
6.6. Eliminating Outdated Information
Step 10: Set τ_{k+1}^{i_k,j} = max(τ_k^{i_k,j}, k − d_k^j) to eliminate the outdated information. Note that τ_k^{i_k,j} records the generation time of the cumulative-mass variable ρ^{i_k,j}, where ρ^{i_k,j} with j ∈ N_{i_k}^in captures the cumulative information generated by agent j up to the current time and sent to agent i_k.
6.7. Exchanging Information
Step 11: Each agent i_k updates the variable v_{k+1}^{i_k} according to

$$v_{k+1}^{i_k} = x_k^{i_k} - \alpha^{i_k} y_k^{i_k} + \beta^{i_k}\big(x_k^{i_k} - x_{k-1}^{i_k}\big).$$

Step 12: Each agent i_k updates the variable x_{k+1}^{i_k} according to

$$x_{k+1}^{i_k} = a^{i_k i_k} v_{k+1}^{i_k} + \sum_{j\in N_{i_k}^{\mathrm{in}}} a^{i_k j} x_{k-d_k^j}^{j} + \beta^{i_k}\big(x_k^{i_k} - x_{k-1}^{i_k}\big).$$

Step 13: To accelerate the algorithm, introduce the auxiliary variable s_k, which, for each agent i_k, is updated by

$$s_{k+1}^{i_k} = x_{k+1}^{i_k} + \beta^{i_k}\big(x_{k+1}^{i_k} - x_k^{i_k}\big).$$
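A sketch of Steps 11–13 for the activated agent, assuming `agent` is the state dictionary from the initialization sketch, `delayed_x[j]` holds the possibly outdated state x^j_{k−d_k^j}, and `a_row` maps agent indices to row i_k of A:

```python
def local_update(agent, delayed_x, a_row, alpha, beta, i_k):
    x, x_prev, y = agent["x"], agent["x_prev"], agent["y"]
    # Step 11: heavy-ball step on the local variable.
    v_new = x - alpha * y + beta * (x - x_prev)
    # Step 12: consensus mixing with (possibly outdated) in-neighbor states.
    x_new = (a_row[i_k] * v_new
             + sum(a_row[j] * xj for j, xj in delayed_x.items())
             + beta * (x - x_prev))
    # Step 13: Nesterov-style extrapolation for the gradient-tracking point.
    s_new = x_new + beta * (x_new - x)
    return v_new, x_new, s_new
```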
Step 14: Introduce two auxiliary variables to prevent packet loss. One is the cumulative-mass variable ρ^{i_k,j} with j ∈ N_{i_k}^in, which captures the cumulative information generated by agent j up to the current time and sent to agent i_k. The other is the buffer variable ρ̃^{i_k,j} with j ∈ N_{i_k}^in, which stores the information sent from agent j to agent i and used by agent i in its last update. Then, adopt the sum-push scheme to achieve gradient tracking.

Step 14.1 Sum step:

$$\tilde{y}_k^{i_k} = y_k^{i_k} + \sum_{j\in N_{i_k}^{\mathrm{in}}} \Big(\rho_{\tau_{k+1}^{i_k,j}}^{i_k,j} - \tilde{\rho}_k^{i_k,j}\Big) + \nabla f_{i_k}\big(s_{k+1}^{i_k}\big) - \nabla f_{i_k}\big(s_k^{i_k}\big).$$

Step 14.2 Push step:

$$y_{k+1}^{i_k} = b^{i_k i_k}\, \tilde{y}_k^{i_k}, \qquad \rho_{k+1}^{j,i_k} = \rho_k^{j,i_k} + b^{j i_k}\, \tilde{y}_k^{i_k}, \quad j \in N_{i_k}^{\mathrm{out}}.$$

Step 15: Update the mass buffer as follows:

$$\tilde{\rho}_{k+1}^{i_k,j} = \rho_{\tau_{k+1}^{i_k,j}}^{i_k,j}.$$
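A sketch of Steps 14–15 for agent i_k, assuming `rho_in[j]` holds the freshest received cumulative mass ρ^{i_k,j} and `b_col` maps agent indices to the column-stochastic weights b^{· i_k}; the container names are illustrative:

```python
def sum_push(agent, rho_in, b_col, grad_fi, i_k, s_new):
    # Sum step: fold in newly arrived mass plus the local gradient increment.
    y_tilde = (agent["y"]
               + sum(rho_in[j] - agent["rho_buf"][j] for j in rho_in)
               + grad_fi(i_k, s_new) - grad_fi(i_k, agent["s"]))
    # Push step: keep own share, push weighted shares onto out-edges.
    agent["y"] = b_col[i_k] * y_tilde
    pushed = {j: b_col[j] * y_tilde for j in b_col if j != i_k}  # adds to rho^{j,i_k}
    # Step 15: refresh the mass buffer with the values just consumed.
    for j in rho_in:
        agent["rho_buf"][j] = rho_in[j]
    return pushed
```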
Step 16: The remaining inactive agents keep their values from the last moment. Set k = k + 1 and go to Step 6 until a predefined stopping criterion is satisfied, e.g., k ≥ k_max, where k_max is the maximum number of iterations.
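Stitching the sketches above together, the outer loop might look as follows; `agents`, `in_neighbors`, `tau`, `m`, and `D` are the hypothetical containers introduced earlier, and only the activated agent's state changes per iteration:

```python
# A sketch of the loop over Steps 9-16: one agent wakes per iteration, stale
# mass is discarded, and Steps 11-15 touch agents[i_k] only.
k, k_max = 0, 50_000                 # hypothetical iteration budget
while k < k_max:
    i_k, d_k = pick_activation(m, in_neighbors, D)       # Step 9
    tau = refresh_timestamps(tau, i_k, d_k, k)           # Step 10
    # ... Steps 11-15: local_update(...) and sum_push(...) on agents[i_k] ...
    k += 1                           # Step 16: everyone else keeps its variables
```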
7. Innovation
The innovative points of the present invention are as follows:
1) This patent proposes an effective asynchronous scheme to execute the distributed asynchronous optimization algorithm, which greatly reduces the idle time of communication links, mitigates congestion of communication and memory access, saves power, and makes the algorithm more fault-tolerant and robust.
2) The proposed asynchronous algorithm prevents packet loss and is easily applied to large-scale machine learning and network information processing.
3) The proposed asynchronous algorithm employs uncoordinated constant step-sizes, which increases its flexibility.
4) The proposed asynchronous algorithm converges linearly to the global optimal solution when the step-size and the momentum parameter are positive and do not exceed explicit upper bounds.
8. Simulations
8.1. Binary Classification
To verify the effectiveness of the algorithm, we test it on a robust classification problem of the form

$$\min_{x\in\mathbb{R}^n} \frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{|D_i|} V\big(y^{ij} h_x(u^{ij})\big) + \lambda\,\|\nabla h_x(\cdot)\|^2,$$

where D = ∪_{i=1}^m D_i is the set of the data distributed across the agents, with each agent i owning D_i and satisfying D_i ∩ D_l = ∅ for arbitrary i ≠ l. In addition, the training data u^{ij} and y^{ij} ∈ {−1, 1} are the feature vector and the associated label of the j-th sample in D_i, respectively. In the last term, we set λ = 1, and h_x(·) is a linear function with parameter x. Note that V is the loss function, which reads as follows:
$$V(r) = \begin{cases} 0, & \text{if } r \ge 1;\\[2pt] \tfrac{1}{4}r^3 - \tfrac{3}{4}r + \tfrac{1}{2}, & \text{if } -1 < r < 1;\\[2pt] 1, & \text{if } r \le -1. \end{cases}$$
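The piecewise loss above and its derivative, written out directly in Python; note that the cubic piece meets the flat pieces with matching values and zero slope at r = ±1, which is what the Lipschitz-gradient assumption requires:

```python
import numpy as np

def V(r):
    r = np.asarray(r, dtype=float)
    mid = 0.25 * r**3 - 0.75 * r + 0.5      # smooth bridge on (-1, 1)
    return np.where(r >= 1.0, 0.0, np.where(r <= -1.0, 1.0, mid))

def V_prime(r):
    r = np.asarray(r, dtype=float)
    mid = 0.75 * r**2 - 0.75                # vanishes at r = +/-1
    return np.where(np.abs(r) >= 1.0, 0.0, mid)
```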
Data: We use the Cleveland Heart Disease data set with 14 features, preprocessing it by deleting observations with missing entries, scaling features between 0 and 1, and distributing the data evenly across the agents.
Graph: We consider a directed graph with m = 30 agents, as shown in Fig. 5, where each agent has 7 out-neighbors. One out-neighbor links all the agents into a directed cycle, while the others are chosen uniformly at random.
Asynchronous model: We mainly consider three activation rules: I) agents are awakened according to a cyclic rule in which the order is randomly permuted at the beginning of each round; II) activation lists are generated by concatenating random rounds — we first let each agent appear exactly once and sample agents uniformly for the remaining spots within a round, and then randomly shuffle the agent order of each round; III) agents are activated by a purely random strategy in all iterations. To generate one round, we first sample its length uniformly from the interval [m, T] with T = 90, and each transmitted message has a traveling time sampled uniformly from the interval [0, D_t] with D_t = 90. A sketch of the three rules appears below.
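A sketch of the three activation rules; the round-length draw from [m, T] follows the text, while the generator seeding is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(3)

def rule_I(m):
    # Cyclic: every agent exactly once per round, order reshuffled each round.
    return list(rng.permutation(m))

def rule_II(m, T=90):
    # Concatenated random rounds: each agent appears exactly once, the
    # remaining slots are filled uniformly, then the whole round is shuffled.
    L = int(rng.integers(m, T + 1))
    slots = np.concatenate([np.arange(m), rng.integers(m, size=L - m)])
    return list(rng.permutation(slots))

def rule_III(m, L):
    # Purely random: every activation is an independent uniform draw.
    return list(rng.integers(m, size=L))
```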
Fig. 6 depicts the evolution of the residual $\sqrt{\sum_{i=1}^{m}\|x_k^i - x^*\|^2}$, while Fig. 7 depicts the effects of the three activation rules. Furthermore, Fig. 8 shows the effects of different step-sizes, and Fig. 9 shows the effects of different momentum parameters. According to Fig. 8 and Fig. 9, the practical upper bounds of the constant step-size and the momentum parameter are around α = 0.865 and β = 0.087, respectively. Moreover, the best performance of the proposed algorithm is achieved when α = 0.70 and β = 0.08.
9. Brief Description of the Drawings
Fig. 1 depicts a simple directed graph with 3 computing agents.
Fig. 2 shows the difference between the synchronous algorithm and the asynchronous algorithm.
Fig. 3 shows a simple generating process of augmented graph.
Fig. 4 is a flowchart of the distributed asynchronous optimization algorithm.
Fig. 5 depicts a directed strongly-connected network with 30 agents.
Fig. 6 depicts the evolution of the residual with running times.
Fig. 7 shows the effects of the three activation rules on the proposed algorithm.
Fig. 8 depicts the evolution of residuals at the 45000th iteration with different constant step-sizes.
Fig. 9 depicts the evolution of residuals at the 40000th iteration with different momentum parameter values.
Claims (2)
- The claims defining the invention are as follows:

1. An effective doubly-accelerated distributed asynchronous optimization algorithm, comprising:

1.1. Initializing Variables

Step 1: Each agent i ∈ V sets k = 0 and sets a stopping criterion.

Step 2: Each agent i ∈ V initializes with x_{−1}^i = 0, x_0^i ∈ R^n, s_0^i ∈ R^n, y_0^i = ∇f_i(s_0^i), v_0^i = 0, for all i ∈ V; ρ_0^{i,j} = 0 and ρ̃_0^{i,j} = 0, for all j ∈ N_i^in and i ∈ V; and ρ_t^{i,j} = 0, for all t = −D, …, 0.

1.2. Constructing Augmented Weight Matrices

Step 3: According to the original graph, first construct a row-stochastic matrix A and a column-stochastic matrix B. Meanwhile, introduce the matrix W = {w^{ij}} to denote either A or B. There exists w̲ > 0 such that w^{ii} ≥ w̲ and w^{ij} ≥ w̲, for all i ∈ V and for all (j, i) ∈ E, respectively; otherwise, we set w^{ij} = 0.

Step 4: Based on the original graph, construct an augmented row-stochastic matrix Ā_k as follows:

$$\bar{a}_k^{pr} = \begin{cases} a^{i_k i_k}, & \text{if } p = r = i_k;\\ a^{i_k j}, & \text{if } p = i_k,\ r = j + (d_k^j + 1)m;\\ 1, & \text{if } p = r \in \{1, 2, \ldots, 2m\} \setminus \{i_k, i_k + m\};\\ 1, & \text{if } p \in \{2m+1, 2m+2, \ldots, (D+2)m\} \cup \{i_k + m\} \text{ and } r = p - m;\\ 0, & \text{otherwise.} \end{cases}$$

Step 5: Construct an augmented column-stochastic matrix B̄_k in two steps. The first establishes the transition matrix of the sum step as follows:

$$\bar{s}_k^{pr} = \begin{cases} 1, & \text{if } r \in \{va_{(j,i_k)}^{d} \mid k - \tau_{k+1}^{i_k,j} \le d \le D\} \text{ and } p = i_k;\\ 1, & \text{if } p \in \bar{V} \setminus \{va_{(j,i_k)}^{d} \mid k - \tau_{k+1}^{i_k,j} \le d \le D\} \text{ and } r = p;\\ 0, & \text{otherwise.} \end{cases}$$

The second establishes the transition matrix of the push step as follows:

$$\bar{p}_k^{pr} = \begin{cases} b^{j i_k}, & \text{if } r = i_k \text{ and } p = va_{(i_k,j)}^{0},\ j \in N_{i_k}^{\mathrm{out}};\\ b^{i_k i_k}, & \text{if } r = p = i_k;\\ 1, & \text{if } r = p \in \bar{V} \setminus \{i_k\};\\ 1, & \text{if } r = va_{(i,j)}^{d},\ p = va_{(i,j)}^{d+1},\ (i,j) \in E,\ 0 \le d \le D-1;\\ 1, & \text{if } r = p = va_{(i,j)}^{D},\ (i,j) \in E;\\ 0, & \text{otherwise.} \end{cases}$$

Thus, B̄_k = P̄_k S̄_k holds by combining the sum step with the push step.

1.3. Selecting Parameters

Step 6: According to the graph and the properties of the row- and column-stochastic matrices, select 0 < T < ∞ and 0 ≤ D < ∞. Note that within T iterations all agents update at least once, and D represents the maximum delay value. Then, compute the parameters ρ = (1 − w̲^{K_1})^{1/K_1} ∈ (0, 1) with K_1 = (2m − 1)T + mD, ρ̄ = w̲^{K_1} ∈ (0, 1), and positive constants C_1, C_2, C_3, and C_4 = √2 C_3 determined by ρ, ρ̄, and m.

Step 7: Based on the strongly convex coefficient μ and the Lipschitz-continuous coefficient L_f, and using the small-gain theorem, compute the parameters a_i, i = 1, 2, …, 16, as

| a_1 = C_1 m√m L_f | a_5 = √3 + 1 | a_9 = C_4 m L_f λ | a_13 = 2m√m L_f |
| a_2 = (1 + ρ̄) C_1 | a_6 = √3 m√m L_f | a_10 = C_4 m L_f | a_14 = m√m L_f |
| a_3 = C_1 m | a_7 = √3 m | a_11 = μ m² | a_15 = 2m |
| a_4 = C_1 m L_f | a_8 = √3 m L_f | a_12 = m L_f | a_16 = m |

and λ ∈ (max{ρ̄, ρ + C_1 L_f m√m α, (√3 + 1)β, 1 − μ m² α + m L_f Δα}, 1). In addition, define ξ(α, Δα) = 1 − a_11 α + a_12 Δα.

1.4. Computing Step-Size and Momentum Parameter

Step 8: According to the strongly convex coefficient μ and the Lipschitz-continuous coefficient L_f, select the largest step-size α, the gap Δα between the largest and the smallest step-sizes, and the largest momentum parameter β as follows:

$$0 < \alpha \le \min\left\{\frac{\rho\omega_1 - \rho^2\omega_1}{a_1\rho\omega_1 + a_3\rho\omega_3 + a_4\rho\omega_4},\ \frac{\omega_2 - a_5\omega_1}{a_6\omega_1 + a_7\omega_2 + a_8\omega_4},\ \frac{1 - \bar{\rho}}{\rho^2 L_f}\right\},$$

$$0 < \Delta\alpha \le \frac{a_{11}\omega_4\alpha - a_9\omega_1\alpha - a_{10}\omega_3\alpha}{a_{12}\omega_4 + a_{13}\omega_1 + a_{15}\omega_3},$$

$$0 < \beta \le \min\left\{\frac{\rho\omega_1 - \rho^2\omega_1 - a_1\rho\omega_1\alpha - a_3\rho\omega_3\alpha - a_4\rho\omega_4\alpha}{2\rho\omega_1},\ \frac{\omega_2\big(\omega_4(1 - \xi(\alpha,\Delta\alpha)) - a_9\omega_1\alpha\big) - \omega_3 a_{16}\alpha + \omega_3 a_{13}\Delta\alpha - \omega_1 a_{13}\Delta\alpha}{2\omega_2},\ \frac{(1-\bar{\rho})\omega_3 - a_9\omega_2}{a_9\omega_2 + a_{10}\omega_2},\ \frac{\omega_2 - a_5\omega_1 - a_6\omega_1\alpha - a_7\omega_2\alpha - a_8\omega_4\alpha}{a_5\omega_2}\right\},$$

where ω_1, ω_2, ω_3, ω_4 are arbitrary constants satisfying 0 < ω_1 < ω_2/a_5, 0 < ω_2 < (1 − ρ̄)ω_3/a_9, and ω_4 > (a_13 ω_1 + a_15 ω_3)/a_10.

1.5. Selecting Activated Agents and Delay

Step 9: Pick (i_k, d_k) with d_k = (d_k^j)_{j∈N_{i_k}^in}, where i_k indicates that agent i is activated at time k, and d_k^j denotes the delay value, which satisfies 0 ≤ d_k^j ≤ D < ∞.

1.6. Eliminating Outdated Information

Step 10: Set τ_{k+1}^{i_k,j} = max(τ_k^{i_k,j}, k − d_k^j) to eliminate the outdated information. Note that τ_k^{i_k,j} records the generation time of the cumulative-mass variable ρ^{i_k,j}, where ρ^{i_k,j} with j ∈ N_{i_k}^in captures the cumulative information generated by agent j up to the current time and sent to agent i_k.

1.7. Exchanging Information

Step 11: Each agent i_k updates the variable v_{k+1}^{i_k} according to

$$v_{k+1}^{i_k} = x_k^{i_k} - \alpha^{i_k} y_k^{i_k} + \beta^{i_k}\big(x_k^{i_k} - x_{k-1}^{i_k}\big).$$

Step 12: Each agent i_k updates the variable x_{k+1}^{i_k} according to

$$x_{k+1}^{i_k} = a^{i_k i_k} v_{k+1}^{i_k} + \sum_{j\in N_{i_k}^{\mathrm{in}}} a^{i_k j} x_{k-d_k^j}^{j} + \beta^{i_k}\big(x_k^{i_k} - x_{k-1}^{i_k}\big).$$

Step 13: To accelerate the algorithm, introduce the auxiliary variable s_k, which, for each agent i_k, is updated by

$$s_{k+1}^{i_k} = x_{k+1}^{i_k} + \beta^{i_k}\big(x_{k+1}^{i_k} - x_k^{i_k}\big).$$

Step 14: Introduce two auxiliary variables to prevent packet loss. One is the cumulative-mass variable ρ^{i_k,j} with j ∈ N_{i_k}^in, which captures the cumulative information generated by agent j up to the current time and sent to agent i_k. The other is the buffer variable ρ̃^{i_k,j} with j ∈ N_{i_k}^in, which stores the information sent from agent j to agent i and used by agent i in its last update. Then, adopt the sum-push scheme to achieve gradient tracking.

Step 14.1 Sum step:

$$\tilde{y}_k^{i_k} = y_k^{i_k} + \sum_{j\in N_{i_k}^{\mathrm{in}}} \Big(\rho_{\tau_{k+1}^{i_k,j}}^{i_k,j} - \tilde{\rho}_k^{i_k,j}\Big) + \nabla f_{i_k}\big(s_{k+1}^{i_k}\big) - \nabla f_{i_k}\big(s_k^{i_k}\big).$$

Step 14.2 Push step:

$$y_{k+1}^{i_k} = b^{i_k i_k}\, \tilde{y}_k^{i_k}, \qquad \rho_{k+1}^{j,i_k} = \rho_k^{j,i_k} + b^{j i_k}\, \tilde{y}_k^{i_k}, \quad j \in N_{i_k}^{\mathrm{out}}.$$

Step 15: Update the mass buffer as follows:

$$\tilde{\rho}_{k+1}^{i_k,j} = \rho_{\tau_{k+1}^{i_k,j}}^{i_k,j}.$$

Step 16: The remaining inactive agents keep their values from the last moment. Set k = k + 1 and go to Step 6 until a predefined stopping criterion is satisfied, e.g., k ≥ k_max, where k_max is the maximum number of iterations.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2020100180A AU2020100180A4 (en) | 2020-02-05 | 2020-02-05 | Effective Doubly-Accelerated Distributed Asynchronous Strategy for General Convex Optimization Problem |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2020100180A AU2020100180A4 (en) | 2020-02-05 | 2020-02-05 | Effective Doubly-Accelerated Distributed Asynchronous Strategy for General Convex Optimization Problem |
Publications (1)
Publication Number | Publication Date |
---|---|
AU2020100180A4 true AU2020100180A4 (en) | 2020-03-12 |
Family
ID=69724764
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
AU2020100180A Ceased AU2020100180A4 (en) | 2020-02-05 | 2020-02-05 | Effective Doubly-Accelerated Distributed Asynchronous Strategy for General Convex Optimization Problem |
Country Status (1)
Country | Link |
---|---|
AU (1) | AU2020100180A4 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111582494A (en) * | 2020-04-17 | 2020-08-25 | 浙江大学 | Hybrid distributed machine learning updating method based on delay processing |
CN111582494B (en) * | 2020-04-17 | 2023-07-07 | 浙江大学 | Mixed distributed machine learning updating method based on delay processing |
Similar Documents
Publication | Title |
---|---|
WO2017124809A1 (en) | Particle swarm optimization method and system based on gpu operation of mobile terminal | |
Li et al. | Distributed average consensus control in networks of agents using outdated states | |
Li et al. | A novel intrusion detection method for internet of things | |
AU2020100180A4 (en) | Effective Doubly-Accelerated Distributed Asynchronous Strategy for General Convex Optimization Problem | |
CN111953515B (en) | Double-acceleration distributed asynchronous optimization method based on Nesterov gradient method and gravity method | |
CN104615703A (en) | RDF data distributed parallel inference method combined with Rete algorithm | |
CN112529195A (en) | Quantum entanglement detection method and device, electronic device and storage medium | |
CN116260130A (en) | Micro-grid group power cooperative scheduling method and device | |
Gandhi et al. | Performance comparison of parallel graph coloring algorithms on bsp model using hadoop | |
CN116760762B (en) | Decentralised ad hoc network method and device | |
US20130067113A1 (en) | Method of optimizing routing in a cluster comprising static communication links and computer program implementing that method | |
Klasing et al. | Taking advantage of symmetries: gathering of asynchronous oblivious robots on a ring | |
CN116702925A (en) | Distributed random gradient optimization method and system based on event triggering mechanism | |
Bashir et al. | Minimal supervisory structure for flexible manufacturing systems using Petri nets | |
CN115456184B (en) | Quantum circuit processing method, quantum state preparation device, quantum state preparation equipment and quantum state preparation medium | |
Li et al. | Implementing an attack graph generator in CUDA | |
WO2022057459A1 (en) | Tensorcore-based int4 data type processing method and system, device, and medium | |
CN105550319B (en) | The optimization method of persistence under a kind of cluster Consistency service high concurrent | |
US10824482B1 (en) | Remote operations application programming interface | |
Gassen et al. | Graph color minimization using neural networks | |
Li et al. | Row-Stochastic Matrices Based Distributed Optimization Algorithm With Uncoordinated Step-Sizes | |
KR20190143115A (en) | Method for managing data based on blockchain | |
CN117938543B (en) | Network dynamic defense method and system based on topology difference measurement | |
CN116362341B (en) | Quantum device unitary transformation degree determining method and device, electronic device and medium | |
Shang-Guan et al. | A Fast Distributed Principal Component Analysis with Variance Reduction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
FGI | Letters patent sealed or granted (innovation patent) | ||
MK22 | Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry |