WO2019042571A1 - DISTRIBUTED STOCHASTIC GRADIENT DESCENT FOR ASYNCHRONOUS GRADIENT AVERAGING TRAINING - Google Patents

DISTRIBUTED STOCHASTIC GRADIENT DESCENT FOR ASYNCHRONOUS GRADIENT AVERAGING TRAINING

Info

Publication number
WO2019042571A1
Authority
WO
WIPO (PCT)
Prior art keywords
machine learning
learning model
training
gradient
computing nodes
Prior art date
Application number
PCT/EP2017/072079
Other languages
English (en)
French (fr)
Inventor
Zuguang WU
Roman Talyansky
Natan Peterfreund
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. filed Critical Huawei Technologies Co., Ltd.
Priority to PCT/EP2017/072079 priority Critical patent/WO2019042571A1/en
Priority to CN201780094579.4A priority patent/CN111052155B/zh
Publication of WO2019042571A1 publication Critical patent/WO2019042571A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Definitions

  • the present invention in some embodiments thereof, relates to distributed training of a machine learning model and, more particularly, but not exclusively, to distributed training of a machine learning model by averaging a plurality of models trained locally and asynchronously by a plurality of computing nodes.
  • Machine learning model may be deep learning, support vector machine, decision tree, etc.
  • machine learning models are constantly growing.
  • the machine learning models may present multiple advantages and solutions to a plurality of problems and/or applications which may have a limited solution and/or no solution using standard rule based methods, techniques and/or algorithms.
  • Such machine learning models must be first trained before they may be applied to actual test data. Training the machine learning models may present major obstacles and setbacks due to several reasons, such as, for example, complexity of the models, size of training datasets and/or the like. These challenges may further manifest themselves as the complexity of the models increases to handle high complexity problems and/or applications.
  • the huge training datasets which may be required to train such complex models may further increase the computing resources, for example, processing resources, storage resources, communication resources and/or the like required to train the models.
  • a server connected to a plurality of computing nodes and configured to control a training of a machine learning model in a plurality of training iterations, each of the plurality of iterations comprising:
  • Instructing each of the plurality of computing nodes to train a respective local copy of the machine learning model, locally stored on each respective computing node, by locally computing a respective one of a plurality of cumulative gradients, wherein each of the plurality of cumulative gradients includes one or more gradients.
  • one or more of the plurality of computing nodes computes a new respective cumulative gradient that is merged with the machine learning model in a following training iteration.
  • Distributing the training of the machine learning model, for example, a deep learning model, may significantly reduce the training time which may be significantly long, in particular for large models trained with large training datasets.
  • the convergence rate of the optimized (trained) machine learning model may be significantly increased since the aggregated value may significantly reduce and/or eliminate singular irregularities contributed by one or more of the computing nodes.
  • the convergence rate may further increase since the plurality of computing nodes may be better synchronized with each other as they each start each of the training iterations with a local copy of the same machine learning model.
  • the computing resources utilization may be significantly increased for each of the computing nodes since the local training (cumulative gradient computation) is done asynchronously by each of the computing nodes.
  • the computing nodes may therefore independently compute their respective cumulative gradients at their own speed (according to their available computing resources) without being blocked by slower computing nodes.
  • the communication time in which the server obtains the plurality of cumulative gradients and creates the updated machine learning model may not block one or more of the computing nodes from computing the new cumulative gradient thus reducing the idle time and further increasing the computing resources utilization.
  • a method of distributed training of a machine learning model over a plurality of computing nodes comprising training a machine learning model through a plurality of training iterations, each of the plurality of iterations comprising:
  • each of the plurality of cumulative gradients includes one or more gradients.
  • one or more of the plurality of computing nodes computes a new respective cumulative gradient that is merged with the machine learning model in a following training iteration.
  • the server distributes the respective local copy to each of the plurality of computing nodes, wherein during the distribution one or more of the computing nodes may compute their new respective cumulative gradient.
  • Such deployment may allow adaptation to certain centralized systems in which the server distributes the local copies to one or more of the computing nodes.
  • one or more of the computing nodes may continue computing additional gradients thus further improving their computing resources utilization.
  • each of the plurality of computing nodes fetches the respective local copy from the server, wherein during the fetching, one or more of the computing nodes compute their new respective cumulative gradient.
  • Such deployment may allow adaptation to systems in which the computing nodes upload/download their local copies to/from the server, independently from each other. This may naturally be more efficient than the centralized systems in which the server distributes the local copies. Moreover, one or more of the computing nodes may continue computing additional gradients while downloading the updated local copy thus further improving their computing resources utilization.
  • the one or more gradients computed by each of the plurality of computing nodes are computed by applying a stochastic gradient descent for minimizing a loss function for the respective local copy, the loss function is selected according to the machine learning model. Using optimization methods as known in the art may significantly reduce the implementation and/or integration efforts.
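  • By way of a non-limiting illustration, the following minimal numpy sketch shows how a single computing node might compute one such gradient by applying a stochastic gradient descent step to a squared-error loss on a mini-batch of its local training data; the linear model, the loss and the function names are illustrative assumptions rather than the implementation of the present invention.

```python
import numpy as np

def squared_error_loss_gradient(model, x_batch, y_batch):
    """Gradient of a mean squared error loss for a linear model (illustrative)."""
    predictions = x_batch @ model                      # forward pass
    errors = predictions - y_batch                     # residuals on the mini-batch
    return x_batch.T @ errors / len(x_batch)           # d(loss)/d(model)

def sgd_step(model, x_batch, y_batch, step_size=0.01):
    """One stochastic gradient descent step on the local copy."""
    gradient = squared_error_loss_gradient(model, x_batch, y_batch)
    return model - step_size * gradient, gradient

# toy usage: a worker trains its local copy on its own subset of the training data
rng = np.random.default_rng(0)
local_copy = np.zeros(4)
x_subset, y_subset = rng.normal(size=(32, 4)), rng.normal(size=32)
local_copy, gradient = sgd_step(local_copy, x_subset, y_subset)
```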
  • each of the plurality of computing nodes uses a subset of a training dataset for training the respective local copy. Since the training dataset may be very large, splitting the training set to the plurality of computing nodes which process it in parallel may allow using the entire training dataset and/or its majority while limiting the training session time.
  • the aggregated value is an average of the plurality of cumulative gradients. Averaging the cumulative gradients obtained from the plurality of computing nodes has proved to present a significantly high convergence rate.
  • each of the plurality of computing nodes repeats updating the respective cumulative gradient with one or more additional gradients until exceeding a staleness threshold.
  • the staleness threshold may be applied to prevent one or more of the computing nodes from diverging which may result from computing too many gradients (advancing the machine learning model) without synchronizing with the cumulative gradients provided by the other computing nodes.
  • the one or more computing nodes locally merge the respective copy of the updated machine learning model with the new respective cumulative gradient which was computed during the obtaining and creating phases of the previous training iteration and not merged with the updated machine learning model. This may significantly increase the convergence rate since at the start of each training iteration each computing node first synchronizes the updated (global) machine learning model with the respective new cumulative gradient (computed during the previous training iteration and not yet merged with updated (global) machine learning model).
  • the one or more computing nodes prevent the server from obtaining the new cumulative gradient before the local merge of the new cumulative gradient with the updated machine learning model created in a previous training iteration. This may further increase the convergence rate since a certain cumulative gradient of a certain computing node may not be obtained by the server before first locally merged with the most up to date version of the (global) machine learning model. Only after locally merged and synchronized with the most up to date version of the (global) machine learning model, the server may merge the certain cumulative gradient with the following version of the (global) machine learning model.
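  • The following sketch illustrates one possible way a worker could enforce this rule, releasing its cumulative gradient to the server only after it has been merged locally with the latest global model; the lock-and-buffer protocol and all names (WorkerState, on_global_model, collect_for_server) are hypothetical and not taken from the present invention.

```python
import threading
import numpy as np

class WorkerState:
    """Illustrative worker-side bookkeeping for the merge-before-release rule."""

    def __init__(self, model_size):
        self.lock = threading.Lock()
        self.local_model = np.zeros(model_size)
        self.new_cumulative_gradient = np.zeros(model_size)
        self.ready_for_collection = None  # nothing released to the server yet

    def add_local_gradient(self, gradient):
        """Accumulate a freshly computed gradient into the new cumulative gradient."""
        with self.lock:
            self.new_cumulative_gradient += gradient
            self.local_model += gradient

    def on_global_model(self, updated_global_model):
        """Called when the server publishes an updated global model."""
        with self.lock:
            # First merge the gradients the server has not yet seen into the new version...
            self.local_model = updated_global_model + self.new_cumulative_gradient
            # ...and only now allow the server to obtain them in the following iteration.
            self.ready_for_collection = self.new_cumulative_gradient.copy()
            self.new_cumulative_gradient = np.zeros_like(self.new_cumulative_gradient)

    def collect_for_server(self):
        """Server-side collection: only gradients already merged locally are released."""
        with self.lock:
            gradient, self.ready_for_collection = self.ready_for_collection, None
            return gradient  # None means nothing has been released yet
```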
  • Implementation of the method and/or system of embodiments of the invention can involve performing or completing selected tasks manually, automatically, or a combination thereof. Moreover, according to actual instrumentation and equipment of embodiments of the method and/or system of the invention, several selected tasks could be implemented by hardware, by software or by firmware or by a combination thereof using an operating system.
  • a data processor such as a computing platform for executing a plurality of instructions.
  • the data processor includes a volatile memory for storing instructions and/or data and/or a non- volatile storage, for example, a magnetic hard-disk and/or removable media, for storing instructions and/or data.
  • a network connection is provided as well.
  • a display and/or a user input device such as a keyboard or mouse are optionally provided as well.
  • FIG. 1 is a flowchart of an exemplary process of distributed training of a machine learning model in a distributed system comprising a plurality of computing nodes, according to some embodiments of the present invention
  • FIG. 2 is a schematic illustration of an exemplary distributed system comprising a plurality of computing nodes for distributed training of a machine learning model, according to some embodiments of the present invention
  • FIG. 3 is a schematic illustration of a sequence of an exemplary gradient averaging implementation of distributed training of a machine learning model
  • FIG. 4 is a schematic illustration of a sequence of an exemplary gradient delay implementation of distributed training of a machine learning model
  • FIG. 5 is a schematic illustration of a sequence of an exemplary Stale Synchronous Parallel (SSP) gradient delay implementation of distributed training of a machine learning model
  • FIG. 6 is a schematic illustration of a convergence of an exemplary gradient delay implementation of distributed training of a machine learning model
  • FIG. 7 is a schematic illustration of cumulative gradients locally computed by workers during a distributed training of a machine learning model, according to some embodiments of the present invention
  • FIG. 8 is a schematic illustration of an exemplary merging sequence of a current version of a machine learning model with a plurality of cumulative gradients locally computed at a plurality of computing nodes, according to some embodiments of the present invention
  • FIG. 9 is a schematic illustration of an exemplary local merging sequence of an updated version of a machine learning model at a plurality of computing nodes, according to some embodiments of the present invention.
  • FIG. 10 is a schematic illustration of an exemplary merger prevention measure applied in a distributed training process of training a machine learning model, according to some embodiments of the present invention.
  • the present invention in some embodiments thereof, relates to distributed training of a machine learning model and, more particularly, but not exclusively, to distributed training of a machine learning model by averaging a plurality of models trained locally and asynchronously by a plurality of computing nodes.
  • According to some embodiments of the present invention, there is provided an asynchronous averaging training method for training a machine learning model, for example, a deep learning model, in a distributed system comprising a plurality of computing nodes.
  • the training of the machine learning model is conducted through a plurality of training iterations in which each of the computing nodes computes one or more gradients to optimize a local copy of the machine learning model. While the computing nodes locally compute the gradient(s) asynchronously to each other, a global machine learning model is updated in each of the training iterations with an aggregated value which aggregates the gradients computed by all of the computing nodes.
  • Training the machine learning model in the distributed system may present major challenges, in particular a tradeoff between utilization of the computing resources available at each of the plurality of computing nodes and a convergence rate for optimizing the machine learning model.
  • computing resources utilization becomes critical in system deployments where each of the plurality of computing nodes has different computing resources availability, for example, processing resources (processing power), storage resources, communication resources and/or the like.
  • communication between the server and the computing nodes may also present a limitation in efficiently utilizing the computing resources of the computing nodes as explained hereinafter.
  • the gradient averaging implementation is typically a synchronous iterative process in which a central server also known as Parameter Server (PS) holds a global copy of the machine learning model and controls the distributed training process.
  • each of the plurality of computing nodes obtains (e.g. downloads) a local copy (replica) of the machine learning model from the server.
  • Each computing node may locally train its respective local copy by computing a gradient using one or more techniques as known in the art, for example, applying a stochastic gradient descent to minimize a loss function selected for training the machine learning model.
  • the computing nodes may upload their gradients to the server.
  • the server may then collect the plurality of gradients provided by the computing nodes and averages them to produce an average value which may be merged with the current version of the (global) machine learning model to produce an updated version of the machine learning model. This process may be repeated through a plurality of training iterations.
  • the server updates the (global) machine learning model with the aggregated value averaging the results received from all of the computing nodes thus reducing variance of the average gradient.
  • the computing nodes are synchronized at the beginning of each of the training iterations since they use the same version of the (global) machine learning model created (updated) by the server. This may limit and/or prevent the computing nodes from diverging from each other.
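  • For reference, the synchronous gradient averaging iteration described above can be summarized by the following sketch; the gradient computation is a stand-in and the numeric values carry no significance beyond illustration.

```python
import numpy as np

def worker_gradient(model, rng):
    """Stand-in for one worker's locally computed gradient (illustrative)."""
    return -0.1 * model + rng.normal(scale=0.01, size=model.shape)

def synchronous_gradient_averaging(model, num_workers=3, iterations=10, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(iterations):
        # Every worker computes exactly one gradient from the SAME model version;
        # the server waits for all of them (the slowest worker dictates the pace).
        gradients = [worker_gradient(model, rng) for _ in range(num_workers)]
        model = model + np.mean(gradients, axis=0)   # merge the averaged gradient
    return model

model = synchronous_gradient_averaging(np.ones(4))
```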
  • the gradient averaging implementation may present major limitations with respect to the computing resources utilization at the computing nodes.
  • a first limitation results from the fact that each of the computing nodes may have availability of different computing resources. Therefore the slowest processing node in the system may dictate the duration of the training iteration since the server waits until all of the computing nodes have completed computing their respective gradient. The higher performance computing nodes may therefore wait idle until the beginning of the next training iteration thus wasting valuable computing resources.
  • Each of the training iterations comprises two main phases.
  • the first phase is a local computing phase in which the computing nodes locally compute their respective gradients.
  • the second phase is a communication phase in which the computing nodes upload their respective gradients, the server creates the updated machine learning model (advances the model) and the computing nodes download the updated version of the machine learning model from the server.
  • the computing nodes may also wait idle until the upload and/or download process are complete as they must use the most up to date version of the machine learning model. This may naturally impact utilization of computing resources at the idle computing nodes.
  • the gradient delay implementation is typically an asynchronous iterative process in which each of the computing nodes may locally train its local copy of the machine learning model at its own speed according to the computing resources available to each computing node. Whenever one of the computing nodes completes computing a gradient it may upload it to the server which may merge it with the current version of the (global) machine learning model.
  • the gradient delay implementation may therefore significantly increase utilization of the computing resources at each of the computing nodes since each computing node does not need to wait for the other computing nodes to complete their local computation.
  • the gradient delay implementation may present a major degradation in the convergence rate which may lead to increased time for training the machine learning model and possibly to inability to converge to an optimized trained machine learning model.
  • the convergence limitation may result from the fact that during each of the training iterations only one gradient obtained from a single computing node is merged by the server with the current version of the (global) machine learning model. This may expose the optimization path for training the machine learning model to local irregularities inflicted by single computing nodes which are not regulated with the results of other computing nodes.
  • a certain gradient provided to the server by a respective computing node may be delayed meaning that while the certain gradient is computed according to a certain local copy, the global version of the machine learning model has advanced as it may have been merged with gradients computed by other computing nodes. As result the certain gradient may be merged with a version of the machine learning model which is different from the version of the machine learning model used to compute the certain gradient. Such delayed merging may further limit the convergence.
  • Some gradient delay methods have further evolved such that each of the computing nodes may locally compute multiple gradients before uploading them to merge with the global machine learning model.
  • a staleness threshold is introduced to limit the number of gradients each of the computing nodes may compute before merging with the global machine learning model.
  • the asynchronous gradient averaging implementation introduced in the present invention aims to overcome the limitations of the existing distributed training methods and significantly increase the computation resources utilization of the computing nodes while maintaining a high convergence rate.
  • each of the plurality of computing nodes obtains (e.g. downloads) the local copy of the (global) machine learning model from the server.
  • Each computing node may locally train its respective local copy by computing a respective cumulative gradient.
  • the cumulative gradient may comprise one or more gradients, i.e. the result of several local training iterations conducted locally by a respective computing node to create an updated local copy of the machine learning model. Since each of the computing nodes may train its local copy asynchronously and independently of the other computing nodes, the utilization of the computing resources at each of the computing nodes may be significantly increased.
  • the server may obtain the plurality of cumulative gradients provided by the plurality of computing nodes.
  • the server may then aggregate the plurality of cumulative gradients, for example, average them, to produce an updated version of the machine learning model.
  • a new training iteration may start. This may significantly increase the convergence rate since the global machine learning model is merged with the aggregated value which may regulate irregularities exhibited by one or more of the cumulative gradients.
  • the convergence rate may be further increased since all the computing nodes start the next training iteration with the same version of the machine learning model.
  • the download timing of the updated version of the machine learning model to the plurality of computing nodes is relaxed.
  • the server may notify each of the plurality of computing nodes of availability of the newly updated version of the machine learning model such that each of the computing nodes may obtain the newly updated version at its own timing.
  • the computing nodes may continue training their local copy (which is not yet updated) and compute a new cumulative gradient. This may further increase utilization of the computing resources at each of the computing nodes since the computing nodes are not idle during the communication phase but rather compute additional gradients. Since the new cumulative gradient was not used by the server to update the machine learning model during the previous training iteration, in order to maintain synchronization and effective convergence rate, at the beginning of each training iteration, each computing node may locally merge the local copy of the newly updated version of the machine learning model available from the server with the new cumulative gradient (if exists).
  • each of the computing nodes may continue computing additional gradients for the locally merged local copy.
  • each of the computing nodes prevents the server from obtaining new gradients not used (by the server) to produce the updated version of the machine learning model before first locally merging these new gradients with the local copy of the most updated version of the machine learning model.
  • the staleness threshold may be applied to limit the number of gradients computed by each of the computing nodes from the last global updated model that has been downloaded to each of the computing nodes.
  • the server monitors the network activity to determine utilization of the network. Based on the determined network utilization, the server may define a frequency and/or duration of the training iterations.
  • the asynchronous gradient averaging method may significantly increase utilization of the computation resources at the computing nodes.
  • the asynchronous gradient averaging method may maintain a high convergence rate significantly similar to the state of the art gradient averaging implementation which is typically synchronous.
  • the present invention may be a system, a method, and/or a computer program product.
  • the computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
  • the computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device.
  • the computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • a non- exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing.
  • a computer readable storage medium is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
  • Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network.
  • the network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers.
  • a network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
  • Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages.
  • the computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
  • electronic circuitry including, for example, programmable logic circuitry, field- programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s).
  • the functions noted in the block may occur out of the order noted in the figures.
  • two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • FIG. 1 illustrates a flowchart of an exemplary process of distributed training of a machine learning model in a distributed system comprising a plurality of computing nodes, according to some embodiments of the present invention.
  • An exemplary process 100 may be executed to train a machine learning model, for example, a deep learning model using a distributed system comprising a plurality of computing nodes.
  • the process 100 is based on a plurality of training iterations in which the machine learning model is updated and optimized with an aggregated value of gradients computed locally and asynchronously by a plurality of computing nodes.
  • each of the computing nodes downloads a local copy of the machine learning model from a central server and trains the local copy with a subset of the overall training dataset which is also stored locally at the computing node.
  • Each of the computing nodes trains its respective local copy and computes a respective cumulative gradient comprising one or more gradients computed using a stochastic gradient descent for minimizing (optimizing) a loss function adapted for the machine learning model.
  • the computing nodes may have different computing resources capabilities and/or resources, in particular processing resources, communication resources and/or the like, each of the computing nodes may compute the cumulative gradient at different speeds and asynchronously to each other. Therefore the cumulative gradient of the different computing nodes may include a different number of computed gradients.
  • the server may obtain the cumulative gradients.
  • the server may create an updated machine learning model (advance the model) by merging the current machine learning model with an aggregated value of the cumulative gradients obtained from all of the computing nodes.
  • the aggregated value may be, for example, an average of the cumulative gradients obtained from all of the computing nodes.
  • each of the computing nodes may continue computing the gradients locally to create a new cumulative gradient which is not included in the updated machine learning model created during the current training iteration.
  • the training iterations may be repeated until one or more optimization criteria defined for optimizing the machine learning model are satisfied.
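  • The following compact, single-process sketch simulates the training iterations of the process 100 described above: local copies are distributed, each worker accumulates a different number of gradients at its own pace, and the server merges the model with the average of the cumulative gradients; the per-worker step counts and the gradient stand-in are illustrative assumptions only.

```python
import numpy as np

def local_gradient(model, rng):
    """Stand-in for one locally computed gradient (illustrative)."""
    return -0.1 * model + rng.normal(scale=0.01, size=model.shape)

def asynchronous_gradient_averaging(global_model, steps_per_worker=(1, 2, 3),
                                    iterations=10, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(iterations):
        cumulative_gradients = []
        for steps in steps_per_worker:
            # Each worker trains its local copy at its own speed and accumulates
            # a different number of gradients before the server collects them.
            local_copy = global_model.copy()
            for _ in range(steps):
                local_copy = local_copy + local_gradient(local_copy, rng)
            cumulative_gradients.append(local_copy - global_model)
        # The server merges the current model with the AVERAGE of the cumulative gradients.
        global_model = global_model + np.mean(cumulative_gradients, axis=0)
    return global_model

trained = asynchronous_gradient_averaging(np.ones(4))
```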
  • FIG. 2 is a schematic illustration of an exemplary distributed system comprising a plurality of computing nodes for distributed training of a machine learning model, according to some embodiments of the present invention.
  • a distributed training process such as the process 100 for training a machine learning model may be executed by an exemplary system 200.
  • the system 200 comprises a server 202 communicating with a plurality of computing nodes 204, such as the computing node 204_1 through a computing node 204_N over a network 250 comprising one or more wired and/or wireless networks.
  • the server 202 as well as any of the computing nodes 204 may be, for example, a computer, a server, a cluster of processing nodes and/or any processing device having one or more processors.
  • the server 202 may typically include a network interface 210 for connecting to the network 250, a processor(s) 212 and storage 214.
  • the processor(s) 212, homogeneous or heterogeneous, may include one or more processors arranged for parallel processing, as clusters and/or as one or more multi core processor(s).
  • the storage 214 may include one or more non-transitory persistent storage devices, for example, a hard drive, a Flash array and/or the like.
  • the storage 214 may further comprise one or more network storage devices, for example, a storage server, a network accessible storage (NAS), a network drive, and/or the like.
  • the storage 214 may also include one or more volatile devices, for example, a Random Access Memory (RAM) component and/or the like.
  • Each of the computing nodes 204 may typically include a network interface 220 such as the network interface 210 for connecting to the network 250, a processor(s) 222 such as the processor(s) 212 and a storage 224 such as the storage 214.
  • each of the computing nodes 204 includes its own resources which may typically vary in computing resources, communication resources and/or storage resources.
  • each of the computing nodes 204 is associated with its specific network interface 220, processor(s) 222 and storage 224, for example, the computing node 204_1 is associated with a network interface 220_1, a processor(s) 222_1 and storage 224_1.
  • the computing node 204_N is associated with a network interface 220_N, a processor(s) 222_N and storage 224_N.
  • the server 202 and/or one or more of the computing nodes 204 may further be utilized through one or more virtual machines executed on one or more of the physical processing nodes.
  • virtual machine computing nodes may utilize the hardware resources, i.e. the network interface 210 and/or 220, the processor(s) 212 and/or 222 and the storage 214 and/or 224 of the respective processing node(s) hosting the virtual machine computing node(s).
  • server 202 and/or one or more of the computing nodes 204 may be provided through a cloud computing platform, such as, for example, Amazon Web Service (AWS), Google Cloud, Microsoft Azure and/or the like.
  • the server 202 may execute one or more software modules, for example, a process, an application, an agent, a utility, a script, a plug-in and/or the like.
  • a software module may comprise a plurality of program instructions stored in storage such as the storage 214.
  • the server may execute a training manager 230 which controls and manages the process 100 for training a machine learning model 232 using the distributed system 200.
  • the machine learning model 232 which performs as a global copy of the machine learning model being trained may be stored in the storage 214 of the server 202.
  • each of the computing nodes 204 may execute one or more software modules, for example, an instance of a worker 240 for computing gradients for a local copy 242 of the machine learning model 232.
  • Each of the computing nodes 204 executes its own instance of the worker 240, for example, the computing node 204_1 executes a worker 240_1 to compute gradients for a local copy 242_1 while computing node 204_N executes a worker 240_N to compute gradients for a local copy 242_N.
  • FIG. 3 is a schematic illustration of a sequence of an exemplary gradient averaging implementation of distributed training of a machine learning model.
  • An exemplary gradient averaging implementation for training a machine learning model such as the machine learning model 232 may be conducted in a distributed system such as the system 200 comprising a server such as the server 202 executing a training manager such as the training manager 230 and a plurality of computing nodes such as the computing nodes 204 each executing an instance of a worker such as the worker 240.
  • the machine learning model 232 is trained in a plurality of training iterations where in each of the training iterations, the version of the machine learning model 232 M_t is updated.
  • the initial machine learning model 232 is designated M_0.
  • the training manager 230 distributes a local copy (replica) of the machine learning model 232 M_0 to each of three workers 240 designated w_1, w_2 and w_3.
  • Each of the three workers 240 w_1, w_2 and w_3 may apply a gradient descent loss function for minimizing (optimizing) the machine learning model 232 M_0 to locally compute a single gradient Δ_1, Δ_2 and Δ_3 respectively.
  • the gradient averaging training implementation is synchronous such that the training manager 230 waits for all the workers 240 w_1, w_2 and w_3 to complete computing their gradients Δ_1, Δ_2 and Δ_3 and collects the gradients Δ_1, Δ_2 and Δ_3.
  • the training manager 230 aggregates the gradients Δ_1, Δ_2 and Δ_3, for example, averages them to create an average gradient.
  • the training manager 230 may then merge the machine learning model 232 M_0 with the aggregated value of the gradients Δ_1, Δ_2 and Δ_3 to create an updated machine learning model 232 M_1.
  • the updated machine learning model 232 M_1 may therefore be expressed, for example, as M_1 ← M_0 + Δ_1 + Δ_2 + Δ_3.
  • the training manager 230 distributes a local copy of the updated machine learning model 232 M_1 to each of the workers 240 w_1, w_2 and w_3 which compute the gradients Δ_1, Δ_2 and Δ_3 by optimizing their local copies of the machine learning model 232 M_1.
  • the training manager 230 may collect the gradients Δ_1, Δ_2 and Δ_3 and merge the machine learning model 232 M_1 with an aggregated value of the gradients Δ_1, Δ_2 and Δ_3 to create an updated machine learning model 232 M_2, for example, M_2 ← M_1 + Δ_1 + Δ_2 + Δ_3.
  • the gradient averaging distributed training session may continue through a plurality of additional training iterations until meeting one or more optimization criteria for the machine learning model 232.
  • Each of the training iterations comprises two main phases - a local computing phase conducted by the workers 240 and a communication phase controlled by the training manager 230.
  • the communication phase comprises the obtaining the locally computed gradients from the plurality of workers 240, merging the current machine learning model 232 with the collected gradients to create the updated machine learning model 232 and distributing the machine learning model 232 to the workers 240.
  • the collection and distribution may be utilized in one or more schemes.
  • the workers 240 upload their respective gradient to the server 202 and download their local copy of the updated machine learning model 232 from the server 202.
  • the training manager 230 may retrieve the locally computed gradients from the workers 240 and transmit the updated machine learning model 232 to the workers 240.
  • Since each of the workers 240 computes a single gradient and the aggregated value of the gradients computed by all of the workers 240 is merged with the current version of the machine learning model 232, convergence may be significantly rapid. Moreover, due to the synchronized nature of this implementation, divergence of the gradients computed by the plurality of workers 240 may be significantly reduced.
  • the synchronous implementations may present some limitations and/or drawbacks.
  • the plurality of workers 240 are typically idle as they may be waiting for the training manager 230 to obtain the plurality of locally computed gradients, merge the current machine learning model 232 with the aggregated value of the obtained gradients and distribute the local copies of the updated machine learning model 232 to the plurality of workers 240.
  • the communication phase may further include the communication time required for each of the workers 240 to obtain (download and/or receive) their respective local copies 242 from the server 202. Since different resources, for example, computing resources (e.g. processing power, processing speed, etc.), communication resources (network bandwidth, network availability, etc.) and/or the like may be available for each of the workers 240, the slowest performance worker 240 may dictate the idle time.
  • the idle time during which higher performance workers 240 are idle may be considerable and therefore utilization of the computing and/or processing capabilities of the system 200 may not be optimal and typically significantly low.
  • FIG. 4 is a schematic illustration of a sequence of an exemplary gradient delay implementation of distributed training of a machine learning model.
  • An exemplary gradient delay implementation for training a machine learning model such as the machine learning model 232 may be conducted in a distributed system such as the system 200 comprising a server such as the server 202 executing a training manager such as the training manager 230 and a plurality of computing nodes such as the computing nodes 204 each executing an instance of a worker such as the worker 240.
  • the machine learning model 232 is trained in a plurality of training iterations where in each of the training iterations, the version of the machine learning model 232 M_j is updated.
  • the initial machine learning model 232 is designated M_0.
  • the training manager 230 distributes a local copy (replica) of the machine learning model 232 M_0 to each of three workers 240 designated w_1, w_2 and w_3.
  • Each of the three workers 240 w_1, w_2 and w_3 may apply a gradient descent loss function for minimizing (optimizing) the machine learning model 232 M_0 to locally compute a single gradient Δ_1, Δ_2 and Δ_3 respectively.
  • the gradient delay training implementation is asynchronous such that each of the workers 240 w_1, w_2 and w_3 locally computes its respective gradient Δ_1, Δ_2 and Δ_3 at its own speed (time) which may be dictated by the resources, for example, the computing resources, the communication resources and/or the like available for each of the workers 240.
  • the training manager 230 may obtain the respective gradient Δ_1, Δ_2 and/or Δ_3 and merge the current machine learning model 232 M_j with the obtained gradient.
  • the worker 240 w_1 completes computing its respective gradient Δ_1 and may upload it to the server 202.
  • the training manager 230 may merge the initial machine learning model 232 M_0 with the gradient Δ_1 to create an updated machine learning model 232 M_1 which may be expressed by the equation M_1 ← M_0 + Δ_1.
  • the worker 240 w_1 may then download a copy of the updated machine learning model 232 M_1 from the server 202.
  • the worker 240 w_2 completes computing its respective gradient Δ_2 and may upload it to the server 202.
  • the training manager 230 may merge the machine learning model 232 M_1 with the gradient Δ_2 to create an updated machine learning model 232 M_2 which may be expressed by the equation M_2 ← M_1 + Δ_2.
  • the worker 240 w_2 may then download a copy of the updated machine learning model 232 M_2 from the server 202.
  • the worker 240 w_3 completes computing its respective gradient Δ_3 and may upload it to the server 202.
  • the training manager 230 may merge the machine learning model 232 M_2 with the gradient Δ_3 to create an updated machine learning model 232 M_3 which may be expressed by the equation M_3 ← M_2 + Δ_3.
  • the worker 240 w_3 may then download a copy of the updated machine learning model 232 M_3 from the server 202.
  • the worker 240 w_1 completes computing its next respective gradient Δ_1 and may upload it to the server 202.
  • the training manager 230 may merge the machine learning model 232 M_3 with the gradient Δ_1 to create an updated machine learning model 232 M_4 which may be expressed by the equation M_4 ← M_3 + Δ_1.
  • the worker 240 w_1 may then download a copy of the updated machine learning model 232 M_4 from the server 202.
  • the worker 240 w_2 completes computing its next respective gradient Δ_2 and may upload it to the server 202.
  • the training manager 230 may merge the machine learning model 232 M_4 with the gradient Δ_2 to create an updated machine learning model 232 M_5 which may be expressed by the equation M_5 ← M_4 + Δ_2.
  • the worker 240 w_2 may then download a copy of the updated machine learning model 232 M_5 from the server 202.
  • the gradient delay distributed training session may continue through a plurality of additional training iterations until meeting one or more optimization criteria for the machine learning model 232.
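  • The gradient delay sequence above may be summarized by the following sketch, in which the server merges one gradient at a time and only the uploading worker refreshes its local copy; the round-robin completion order and the gradient stand-in are illustrative assumptions.

```python
import numpy as np

def local_gradient(model, rng):
    """Stand-in for a worker's locally computed gradient (illustrative)."""
    return -0.1 * model + rng.normal(scale=0.01, size=model.shape)

def gradient_delay(global_model, num_workers=3, updates=9, seed=0):
    rng = np.random.default_rng(seed)
    # Each worker starts from the initial model; its copy goes stale between merges.
    local_copies = [global_model.copy() for _ in range(num_workers)]
    for step in range(updates):
        worker = step % num_workers            # whichever worker finishes next (illustrative order)
        delta = local_gradient(local_copies[worker], rng)
        # The server merges a SINGLE, possibly delayed gradient: it was computed
        # against the worker's (older) local copy, not the current global model.
        global_model = global_model + delta
        local_copies[worker] = global_model.copy()   # the worker downloads the new version
    return global_model

model = gradient_delay(np.ones(4))
```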
  • each worker 240 may not wait for other workers 240 to complete their local computation of their respective gradients.
  • the respective worker 240 is still idle while it uploads the gradient to the server 202, waits for the training manager 230 to merge the machine learning model 232 with the uploaded gradient and while the respective worker 240 downloads the updated machine learning model 232.
  • FIG. 5 is a schematic illustration of a sequence of an exemplary Stale Synchronous Parallel (SSP) gradient delay implementation of distributed training of a machine learning model.
  • An exemplary SSP gradient delay implementation for training a machine learning model such as the machine learning model 232 may be conducted in a distributed system such as the system 200 comprising a server such as the server 202 executing a training manager such as the training manager 230 and a plurality of computing nodes such as the computing nodes 204 each executing an instance of a worker such as the worker 240.
  • the machine learning model 232 is trained in a plurality of training iterations where in each of the training iterations, the version of the machine learning model 232 M_t is updated.
  • the SSP gradient delay follows the same implementation as the gradient delay presented herein before. The main difference is that during the communication phase, in which the training manager 230 obtains and merges the currently computed gradient of a certain worker 240 w_i, the certain worker 240 w_i may continue locally computing one or more additional gradients.
  • the SSP gradient delay enforces a staleness threshold N to limit the number of gradients each of the workers 240 may compute using its local copy of the current machine learning model 232 before downloading and/or obtaining an updated version of the machine learning model 232 from the server 202. Enforcing the staleness threshold may prevent divergence of the gradients locally computed by the workers 240. In case no limit is imposed, the gradients locally computed by the workers 240 for an out of date version of the machine learning model 232 may diverge to such an extent that merging them with the (global) version of the machine learning model 232 may lead to divergence of the training process, as the local copies of the model 242 may not be synchronized with the updated version(s) of the machine learning model 232.
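  • A minimal sketch of how a worker might bound its locally accumulated gradients by such a staleness threshold is given below; the threshold value, the stand-in gradient computation and the notification callback are assumptions for illustration only.

```python
import numpy as np

STALENESS_THRESHOLD = 4  # illustrative bound on gradients per downloaded model version

def accumulate_until_stale(local_copy, compute_gradient, updated_model_available):
    """Keep computing gradients until the threshold is hit or a new global model arrives."""
    gradients_since_download = 0
    cumulative_gradient = np.zeros_like(local_copy)
    while gradients_since_download < STALENESS_THRESHOLD and not updated_model_available():
        gradient = compute_gradient(local_copy + cumulative_gradient)
        cumulative_gradient += gradient
        gradients_since_download += 1
    # Once the threshold is reached the worker stops and waits for the updated model.
    return cumulative_gradient

# toy usage with stand-ins for the gradient computation and the server notification
rng = np.random.default_rng(0)
result = accumulate_until_stale(
    np.ones(4),
    compute_gradient=lambda m: -0.1 * m + rng.normal(scale=0.01, size=m.shape),
    updated_model_available=lambda: False,
)
```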
  • both the gradient delay implementation and the SSP gradient delay implementation may present poor convergence rate due to their asynchronous and independent merging scheme.
  • the asynchronous and independent merging scheme may cause delayed updates to the machine learning model 232.
  • the workers 240 may diverge from each other since the synchronization between them, through a joined updated machine learning model 232 is not frequent.
  • FIG. 6 is a schematic illustration of a convergence of an exemplary gradient delay implementation of distributed training of a machine learning model.
  • FIG. 6 presents a convergence, or more accurately an inherent limitation in the convergence of an exemplary gradient delay and/or SSP gradient delay implementations for training a machine learning model such as the machine learning model 232.
  • the machine learning model 232 may be trained in a distributed system such as the system 200 comprising a server such as the server 202 executing a training manager such as the training manager 230 and a plurality of computing nodes such as the computing nodes 204 each executing an instance of a worker such as the worker 240.
  • the machine learning model 232 is trained in a plurality of training iterations where in each of the training iterations, the version of the machine learning model 232 M_t is updated. Following the previous examples, assuming three workers 240 w_1, w_2 and w_3 are executed by three computing nodes 204. Following the first training iteration t_1 in which an initial version of the machine learning model 232 M_0 is merged with the gradient provided by the worker 240 w_1 to create an updated version of the machine learning model 232 M_1, the worker 240 w_1 continues using the machine learning model 232 M_1. However, the machine learning model 232 may advance since it may be updated with gradients provided by the other workers 240 w_2 and/or w_3.
  • the most up to date version of the machine learning model 232 M_3 may be merged with gradients computed by the worker 240 w_1 for the out of date version of the machine learning model 232 M_1 (from t_1). This may significantly reduce the convergence rate for optimizing the machine learning model 232 using the gradient delay implementations.
  • Other asynchronous implementations, for example, Elastic Asynchronous Stochastic Gradient Descent (EASGD), in which a certain worker 240 synchronizes with the global model independently of the other workers 240, may also present the same convergence limitations.
  • the process 100 is an iterative process comprising a plurality of training iterations and may be repeated until satisfying one or more optimization criteria defined for the machine learning model 232. The process 100 may be repeated for each of the training iterations.
  • the training process 100 starts with distributing the local copies 242 of a current version of the machine learning model 232 from the server 202 to the plurality of workers 240.
  • the training manager 230 may notify the workers 240 of availability of a most up to date, typically newly produced version of the machine learning model 232.
  • the workers 240 may access the server 202 to download their local copy 242 to the respective computing node 204.
  • the training manager 230 transmits the local copies 242 to one or more of the workers 240.
  • one or more of the workers 240 control their download timing for obtaining, i.e. downloading, their respective local copies 242 from the server 202.
  • the worker(s) 240 may define the timing of obtaining (downloading) the updated version from the server 202.
  • the worker(s) 240 may define their download timing according to a plurality of parameters, for example, computing resources availability, exceeding of a staleness threshold (as described hereinafter) and/or the like.
  • each of the plurality of workers 240 locally trains its respective local copy 242 using a subset of the overall training dataset.
  • the subsets of training data used by the plurality of workers 240 typically comprise different training data.
  • some training data may overlap in one or more of the subsets assigned to one or more of the workers 240.
  • Each of the workers 240 trains its local copy 242 and computes a respective one of a plurality of cumulative gradients by applying a stochastic gradient descent for minimizing (optimizing) a loss function for the respective local copy 242.
  • the loss function may be selected according to the type and/or one or more characteristics of the machine learning model 232 as known in the art.
  • the cumulative gradient generated by each of workers 240 includes one or more locally computed gradients.
  • Computation of the cumulative gradient may be regarded as a momentum method in which computing the gradients by each of the workers 240 w_i may be regarded as computing a velocity v at a time t, which may be expressed, for example, as v(t) = α·v(t−1) − ε·∇L(M(t)), where:
  • α is a normalization value, typically in a range of [0,1],
  • ε is a step size,
  • L is the loss function, and
  • M is the machine learning model 232.
  • the velocity at a certain time (t) equals the velocity at a previous time (t−1) adjusted with the current acceleration −ε·∇L(M(t)).
  • the resulting velocity v represents an update to the machine learning model, i.e. M ← M + v(t).
  • the cumulative gradient may thus be expressed as M(w_i, t_{i+1}) − M(t_i), where M(t_i) is the local copy 242 at time t_i.
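  • Under the momentum update reconstructed above, a worker's cumulative gradient can be computed as in the following sketch; the quadratic loss, the parameter values and the function names are illustrative assumptions rather than values specified by the present invention.

```python
import numpy as np

def loss_gradient(model):
    """Illustrative stand-in for the loss gradient: L(M) = 0.5 * ||M||^2, so grad = M."""
    return model

def momentum_cumulative_gradient(model, alpha=0.9, step_size=0.1, local_steps=5):
    """Accumulate several momentum updates; returns M(w, t_{i+1}) - M(t_i)."""
    start = model.copy()
    velocity = np.zeros_like(model)
    for _ in range(local_steps):
        velocity = alpha * velocity - step_size * loss_gradient(model)  # v(t) = a*v(t-1) - e*grad L(M)
        model = model + velocity                                        # apply the velocity as the update
    return model - start   # the cumulative gradient uploaded to the server

cumulative = momentum_cumulative_gradient(np.ones(4))
```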
  • the workers 240 may be instructed to start computing their respective cumulative gradient by the training manager 230. However, typically the instruction to start computing the cumulative gradient is not explicit and once a certain worker 240 downloads its respective local copy 242, the certain worker 240 may start computing its respective cumulative gradient.
  • each of the plurality of workers 240 may have different computing resources (e.g. processing resources, communication resources, etc.) at its disposal (available) for locally computing its respective cumulative gradient, the plurality of workers 240 compute their cumulative gradients independently of and asynchronously to each other.
  • computing resources e.g. processing resources, communication resources, etc.
  • a staleness threshold is predefined for the training process 100 to limit the number of gradients computed by each of the workers 240 for a local copy of a certain version of the machine learning model 232. Therefore each of the workers 240 may update its respective cumulative gradient with additional gradients as long as the overall number of gradients does not exceed the staleness threshold. Once a number of gradients locally computed by a certain worker 240 reaches the predefined staleness threshold, the respective worker 240 stops computing additional gradients.
  • FIG. 7 is a schematic illustration of cumulative gradients locally computed by workers during a distributed training of a machine learning model, according to some embodiments of the present invention.
  • a machine learning model such as the machine learning model 232 may be trained in a distributed system such as the system 200 comprising a server such as the server 202 executing a training manager such as the training manager 230 and a plurality of computing nodes such as the computing nodes 204 each executing an instance of a worker such as the worker 240.
  • local copies of an initial version M(t_0) of the machine learning model 232 such as the local copies 242 are distributed to three workers 240 designated w_1, w_2 and w_3.
  • Each of the three workers 240 w_1, w_2 and w_3 locally computes a respective cumulative gradient comprising one or more locally computed gradients, i.e. the worker 240 w_1 computes a respective cumulative gradient comprising its locally computed gradients, as do the workers 240 w_2 and w_3, each possibly comprising a different number of gradients.
  • the number of gradients included in each of the cumulative gradients is bounded and may not exceed the staleness threshold predefined for the process 100.
  • the training manager 230 may check to determine whether each of the plurality of workers 240 has an available respective cumulative gradient, i.e. whether each of the workers 240 completed computing at least one gradient.
  • the training manager 230 may probe each of the plurality of workers 240 to check availability of their respective cumulative gradients.
  • one or more of the workers 240, typically all of the workers 240, may send an availability message to the training manager 230 at completion of locally computing the first gradient.
  • In case the training manager 230 identifies that all of the workers 240 have an available cumulative gradient the process 100 branches to 110, otherwise the process 100 branches to 108.
  • one or more of the other workers 240 may continue computing additional gradients and update their respective cumulative gradient(s). However, as described before, the number of gradients computed by each of the workers 240 and included in its respective cumulative gradient may not exceed the staleness threshold.
  • the training manager 230 obtains the plurality of cumulative gradients typically uploaded by the workers 240 to the server 202. For example, once the training manager 230 identifies that all of the workers 240 have an available cumulative gradient, the training manager 230 may instruct all the workers 240 to upload their respective cumulative gradients to the server 202.
  • the training manager 230 merges the current version of the machine learning model 232 with the plurality of cumulative gradients provided by the plurality of workers 240, specifically with an aggregated value of the plurality of cumulative gradients.
  • the training manager 230 may aggregate the plurality of cumulative gradients to create an averaged value that may be merged with the current version of the machine learning model 232.
  • the training manager 230 may average the plurality of cumulative gradients.
  • the workers 240 may continue training their respective local copies 242 and compute a new cumulative gradient.
  • the new cumulative gradient is not merged with the current version of the machine learning model 232 (step 112) during the current training iteration.
  • the workers 240 may continue training their respective local copies 242 and compute a new cumulative gradient while obtaining their local copy 242 of the newly updated machine learning model 232 which is another segment of the communication phase.
  • FIG. 8 is a schematic illustration of an exemplary merging sequence of a current version of a machine learning model with a plurality of cumulative gradients locally computed at a plurality of computing nodes, according to some embodiments of the present invention.
  • three workers such as the workers 240 designated w_1, w_2 and w_3 may each locally compute a respective cumulative gradient M(w_i, t_1) − M(t_0) comprising one or more locally computed gradients.
  • once a training manager such as the training manager 230 identifies that all of the workers 240 w_1, w_2 and w_3 have an available respective cumulative gradient, the training manager 230 may obtain the cumulative gradients.
  • the training manager 230 may then merge the current version of the machine learning model 232, for example, M(t_0), with an aggregated value, for example, an average of the plurality of cumulative gradients M(w_i, t_1) − M(t_0), to create an updated version of the machine learning model 232, for example, M(t_2).
  • the updated version of the machine learning model 232 may be expressed as:
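  • Assuming the aggregated value is the plain average of the three workers' cumulative gradients M(w_i, t_1) − M(t_0) described above, one consistent form of this expression is (in LaTeX notation):

    M(t_2) = M(t_0) + \frac{1}{3} \sum_{i=1}^{3} \left( M(w_i, t_1) - M(t_0) \right)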
  • the workers 240 w_1, w_2 and w_3 may continue training their respective local copies 242 and compute a new respective cumulative gradient.
  • the new cumulative gradient is not merged with the current version of the machine learning model 232 during the current training iteration.
  • the training manager 230 may check whether one or more optimization criteria predefined for the machine learning model 232 are satisfied. In case the training manager 230 determines that the optimization criteria are satisfied, the process 100 branches to 116; otherwise the process 100 branches back to 102 and a new training iteration is started.
  • the training manager 230 may output the trained machine learning model 232, i.e. the most updated version of the machine learning model 232.
  • each of the workers 240 obtains (e.g. downloads) a local copy 242 of the updated version of the machine learning model 232.
  • each of the workers 240 locally merges the newly acquired local copy 242 with the new cumulative gradient (if one exists) that was not merged with the updated version of the machine learning model 232 during the previous training iteration.
  • the respective worker 240 may continue computing gradients for the locally merged local copy 242.
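  • A minimal sketch of this local merge, assuming each worker keeps its new cumulative gradient as the difference between its latest local parameters and the local parameters at the moment the previous cumulative gradient was extracted (the names are illustrative):

    import numpy as np

    def local_merge(new_global_params, local_params, params_at_extraction):
        """Merge a freshly downloaded global model with the gradients the worker
        computed after its previous cumulative gradient was extracted."""
        # The new cumulative gradient that was not merged by the server.
        pending = local_params - params_at_extraction
        # Apply it on top of the newly downloaded version so no gradient is lost.
        merged_local = new_global_params + pending
        return merged_local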
  • the training manager 230 monitors the activity on the network 250. Based on the status of the network, the training manager 230 may adjust the frequency and/or time intervals between consecutive training iterations. For example, when the network 250 is overloaded, the training manager 230 may reduce the frequency of the training iterations which may increase utilization of the computing resources at the computing nodes 204 since they may not be blocked by the high network traffic during the communication phase. Similarly, when the training manager 230 determines that the network activity is low, the training manager 230 may increase the frequency of the training iterations to expedite the training process 100 and achieve fast convergence.
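  • A hedged sketch of such an adjustment, assuming a measured network utilization in the range [0, 1] and illustrative thresholds and bounds that are not specified by the description:

    def adjust_iteration_interval(interval_s, network_utilization,
                                  high_load=0.8, low_load=0.2,
                                  min_interval_s=1.0, max_interval_s=60.0):
        """Lengthen the interval between training iterations when the network is
        overloaded and shorten it when the network is mostly idle."""
        if network_utilization > high_load:
            interval_s = min(interval_s * 2.0, max_interval_s)
        elif network_utilization < low_load:
            interval_s = max(interval_s / 2.0, min_interval_s)
        return interval_s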
  • FIG. 9 is a schematic illustration of an exemplary local merging sequence of an updated version of a machine learning model at a plurality of computing nodes, according to some embodiments of the present invention.
  • three workers such as the workers 240 designated w_1, w_2 and w_3 may download a local copy such as the local copy 242 of an updated version of the machine learning model 232, for example, M(t_2).
  • Each of the workers 240 w_1, w_2 and w_3 locally merges the downloaded local copy 242 M(t_2) with a new cumulative gradient which was not merged with the updated version M(t_2).
  • each of the workers 240 w_1, w_2 and w_3 may continue training its respective locally merged local copy 242 by computing additional gradients for it.
  • FIG. 10 is a schematic illustration of an exemplary merger prevention measure applied in a distributed training process of training a machine learning model, according to some embodiments of the present invention.
  • three workers such as the workers 240 designated w_1, w_2 and w_3 may continue training a respective local copy such as the local copy 242 of a current version of the machine learning model 232, for example, M(t_0).
  • the workers 240 w_1, w_2 and w_3 may each compute a respective new cumulative gradient which was not merged with the updated version M(t_2).
  • each of the workers 240 may prevent a training manager such as the training manager 230 from obtaining its respective new cumulative gradient before the respective new cumulative gradient is locally merged with a respective local copy 242 of the most updated version of the machine learning model 232, for example, M(t_2). This is done to ensure that the complete gradient history is preserved, i.e. that after the new global model M(t_2) is merged with the new cumulative gradient, none of the gradients already computed by the worker 240 is missing from the resulting local model.
  • the training process 100 is conducted in part by the workers 240 (w_i) and in part by the training manager 230.
  • the operation of each of the workers 240 (w_i) may be expressed by an exemplary pseudocode excerpt 1 below.
  • a certain worker 240 w receives the predefined staleness threshold s and a learning rate parameter (line 1).
  • the worker 240 w initializes a counter i with the predefined staleness threshold s and a counter c to 0 (line 2).
  • the worker 240 w then computes a respective cumulative gradient for its respective local copy 242 M_w of the machine learning model 232 M by minimizing a loss function g() (line 5).
  • the worker 240 w may repeat computing gradients while the number of gradients does not exceed the staleness threshold s (line 4).
  • the worker 240 w stops computing additional gradients and waits for a notification from the training manager 230 on availability of a new machine learning model 232 M (line 8). Once the notification is received, the worker 240 w invokes the function downloadModelAndMerge() (line 9) to download the new machine learning model 232 M and to merge it with the locally available cumulative gradient (lines 11-15).
  • the counter c counts the number of gradients locally computed by the worker 240 w which were not merged with the updated version of the machine learning model 232 in the current training iteration, i.e. the new cumulative gradient (line 6).
  • the counter i is reset to the value of c to indicate that c gradients are locally computed and available at the respective worker 240 w, while the number of additional gradients that each worker 240 w is allowed to compute before hitting the staleness threshold s is s − c.
  • one or more of the workers 240 w may use an event driven implementation to invoke the function downloadModelAndMerge() asynchronously upon reception of the notification (event) from the training manager 230 on availability of the new updated version of the machine learning model 232. This may be done by an event driven invocation of the function preventiveModelUpdate() (lines 20-23).
  • Frequent such asynchronous invocations by the respective worker 240 w may reduce the number of times the respective worker 240 w is forced to wait (idle) for the notification from the training manager 230, and may possibly eliminate the waiting periods altogether.
  • the training manager 230 may invoke the function extractGrads() remotely for each of the workers 240 w to extract the cumulative gradients from the workers 240 w and transfer them to the training manager 230 to be merged with the global machine learning model 232.
  • each worker 240 w computes its respective cumulative gradient as the difference ΔM between the most up-to-date model M_w locally available to the worker 240 w and the downloaded copy M of the updated machine learning model 232 (lines 16-17).
  • the worker 240 w may then provide the cumulative gradient ΔM (line 19) after resetting the counter c to 0 (line 18).
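  • Pseudocode Excerpt 1 itself is not reproduced above; the following Python sketch is reconstructed from the description only, so the exact bookkeeping of the counters i and c, the threading details and all names are assumptions rather than the excerpt itself:

    import threading
    import numpy as np

    class Worker:
        """Sketch of the worker-side behaviour: bounded-staleness local training,
        gradient extraction and preventive download-and-merge."""

        def __init__(self, staleness_threshold, learning_rate, initial_model):
            self.s = staleness_threshold          # predefined staleness threshold
            self.lr = learning_rate               # learning rate parameter
            self.budget = staleness_threshold     # gradients still allowed before hitting s (loosely, the counter i)
            self.c = 0                            # gradients computed since the last extraction
            self.m_local = initial_model.copy()        # local copy M_w of the model
            self.m_base = initial_model.copy()         # last downloaded global version M
            self.m_at_extraction = initial_model.copy()
            self.lock = threading.Lock()          # blocks extraction during a local merge

        def local_gradient(self):
            # Placeholder for a stochastic gradient of the loss g() on a mini-batch.
            return -self.lr * np.random.randn(*self.m_local.shape)

        def train_step(self):
            # Compute one more gradient only while the staleness bound allows it.
            with self.lock:
                if self.budget <= 0:
                    return False                  # must wait for a new global model
                self.m_local += self.local_gradient()
                self.c += 1
                self.budget -= 1
                return True

        def extract_grads(self):
            # Cumulative gradient Delta M = up-to-date local model minus last download.
            with self.lock:
                delta = self.m_local - self.m_base
                self.m_at_extraction = self.m_local.copy()
                self.c = 0
                return delta

        def download_model_and_merge(self, new_global_model):
            # Merge the new global model with the c gradients computed after the
            # extraction, so that no locally computed gradient is lost.
            with self.lock:
                pending = self.m_local - self.m_at_extraction
                self.m_local = new_global_model + pending
                self.m_base = new_global_model.copy()
                self.budget = self.s - self.c     # s - c further gradients allowed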
  • the operation of the training manager 230 may be expressed by an exemplary pseudocode excerpt 2 below.
  • the training manager 230 randomly initializes the machine learning model 232 M (line 1). The training manager 230 then waits for a respective cumulative gradient to be available from each of the workers 240 w ∈ W (line 4). Once the plurality of cumulative gradients are available from all of the workers 240 w, the training manager 230 remotely invokes the function extractGrads() on each of the workers 240 w, which extracts its respective cumulative gradient expressed (as described herein above) as the difference ΔM between the current local version M_w of the machine learning model 232 and the respective local copy 242 downloaded by the respective worker 240 w (line 12 in Pseudocode Excerpt 1). The extracted cumulative gradients are then transferred to the server 202.
  • the training manager 230 then aggregates the plurality of cumulative gradients, for example, averages them to get the aggregated value (line 10).
  • the training manager 230 creates an updated version M of the machine learning model 232 using the aggregated value (line 11).
  • the training manager 230 may then notify the workers 240 w of availability of the newly updated version M of the machine learning model 232 (line 12) to allow the workers 240 w to obtain (e.g. download) the updated version M, for example, using the function downloadModelAndMerge().
  • the training manager 230 executes a loop comprising a plurality of training iterations (line 13), where in each training iteration the training manager 230 waits until all the workers 240 w have the cumulative gradient (at least one gradient) computed, i.e. in each worker c > 0. When this condition is fulfilled, the training manager 230 extracts the cumulative gradients from all the workers 240 w, computes an aggregated value, for example, an average value of the cumulative gradients and uses the aggregated value, for example the averaged gradient to update the version M of the machine learning model 232.
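  • Pseudocode Excerpt 2 is likewise not reproduced above; a minimal sketch of the training-manager loop it describes, reusing the Worker sketch shown earlier and an illustrative stopping criterion, could look as follows:

    import numpy as np

    def train(workers, initial_model, max_iterations=100, tolerance=1e-4):
        """Wait until every worker has at least one gradient, average the
        cumulative gradients, update the global model and notify the workers."""
        model = initial_model.copy()                  # global model M at the server
        for _ in range(max_iterations):
            # In a real deployment the workers train asynchronously on their own
            # nodes; here each worker takes local steps in-process so the sketch
            # runs end to end.
            for w in workers:
                while w.train_step():
                    pass
            # Every worker now has an available cumulative gradient (c > 0).
            grads = [w.extract_grads() for w in workers]   # remote extractGrads()
            # Aggregate the cumulative gradients, for example by averaging them.
            update = np.mean(np.stack(grads, axis=0), axis=0)
            model = model + update                    # merge into the global model
            # Notify the workers of the new version so they download and merge it.
            for w in workers:
                w.download_model_and_merge(model)
            # Illustrative optimization criterion: stop when the update is small.
            if np.linalg.norm(update) < tolerance:
                break
        return model

    # Example usage with three workers and a small parameter vector.
    init = np.zeros(4)
    workers = [Worker(staleness_threshold=4, learning_rate=0.01, initial_model=init)
               for _ in range(3)]
    trained_model = train(workers, init)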
  • a compound or “at least one compound” may include a plurality of compounds, including mixtures thereof.
  • range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the invention. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Debugging And Monitoring (AREA)
PCT/EP2017/072079 2017-09-04 2017-09-04 DESCENT OF STOCHASTIC GRADIENT DISTRIBUTED TO AVERAGE ASYNCHRONOUS GRADIENT FORMATION WO2019042571A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/EP2017/072079 WO2019042571A1 (en) 2017-09-04 2017-09-04 DESCENT OF STOCHASTIC GRADIENT DISTRIBUTED TO AVERAGE ASYNCHRONOUS GRADIENT FORMATION
CN201780094579.4A CN111052155B (zh) Distributed stochastic gradient descent with asynchronous gradient averaging

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2017/072079 WO2019042571A1 (en) 2017-09-04 2017-09-04 DESCENT OF STOCHASTIC GRADIENT DISTRIBUTED TO AVERAGE ASYNCHRONOUS GRADIENT FORMATION

Publications (1)

Publication Number Publication Date
WO2019042571A1 true WO2019042571A1 (en) 2019-03-07

Family

ID=59799368

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2017/072079 WO2019042571A1 (en) 2017-09-04 2017-09-04 DESCENT OF STOCHASTIC GRADIENT DISTRIBUTED TO AVERAGE ASYNCHRONOUS GRADIENT FORMATION

Country Status (2)

Country Link
CN (1) CN111052155B (zh)
WO (1) WO2019042571A1 (zh)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109978177A (zh) * 2019-03-19 2019-07-05 腾讯科技(深圳)有限公司 Model training method, service processing method, device and related equipment
CN110619388A (zh) * 2019-09-20 2019-12-27 北京金山数字娱乐科技有限公司 Gradient synchronization method and device in distributed training
CN111580962A (zh) * 2020-04-29 2020-08-25 安徽理工大学 Distributed adaptive online learning method with weight decay
CN111580970A (zh) * 2020-05-07 2020-08-25 电子科技大学 Transmission scheduling method for model distribution and aggregation in federated learning
EP3754502A1 (en) * 2019-06-21 2020-12-23 Accenture Global Solutions Limited Coordinated multiple worker node causal inference framework
WO2021056043A1 (en) * 2019-09-23 2021-04-01 Presagen Pty Ltd Decentralised artificial intelligence (ai)/machine learning training system
WO2021090323A1 (en) * 2019-11-05 2021-05-14 Technion Research & Development Foundation Limited Gap-aware mitigation of gradient staleness
US20210166117A1 (en) * 2019-12-02 2021-06-03 Waymo Llc Machine learning training platform
CN113128696A (zh) * 2019-12-31 2021-07-16 香港理工大学深圳研究院 Distributed machine learning communication optimization method, device, server and terminal equipment
WO2022038397A1 (en) * 2020-08-19 2022-02-24 Telefonaktiebolaget Lm Ericsson (Publ) Generating a machine learning model
US20220121974A1 (en) * 2020-10-16 2022-04-21 Ford Global Technologies, Llc Automated synchronization of clone directed acyclic graphs
CN116702885A (zh) * 2023-08-02 2023-09-05 浪潮电子信息产业股份有限公司 Synchronous data parallel training control method, system, apparatus, device and medium

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113129200A (zh) * 2019-12-30 2021-07-16 中兴通讯股份有限公司 Deep learning method, device, network equipment and readable storage medium
CN111523686B (zh) * 2020-04-23 2021-08-03 支付宝(杭州)信息技术有限公司 Method and system for joint model training
CN112598118B (zh) * 2021-03-03 2021-06-25 成都晓多科技有限公司 Annotation anomaly processing method, device, storage medium and equipment for supervised learning
CN112861991B (zh) * 2021-03-09 2023-04-14 中山大学 Learning rate adjustment method for asynchronous training of neural networks
CN113327598B (zh) * 2021-06-30 2023-11-14 北京有竹居网络技术有限公司 Model training method, speech recognition method, device, medium and equipment

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8768870B1 (en) * 2012-05-22 2014-07-01 Google Inc. Training a model using parameter server shards

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104463324A (zh) * 2014-11-21 2015-03-25 长沙马沙电子科技有限公司 Convolutional neural network parallel processing method based on a large-scale high-performance cluster
CN106951926B (zh) * 2017-03-29 2020-11-24 山东英特力数据技术有限公司 Deep learning method and device with a hybrid architecture

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8768870B1 (en) * 2012-05-22 2014-07-01 Google Inc. Training a model using parameter server shards

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
AMR AHMED ET AL: "Scalable inference in latent variable models", PROCEEDINGS OF THE FIFTH ACM INTERNATIONAL CONFERENCE ON WEB SEARCH AND DATA MINING, WSDM '12, 8 February 2012 (2012-02-08), New York, New York, USA, pages 123, XP055470119, ISBN: 978-1-4503-0747-5, DOI: 10.1145/2124295.2124312 *
HENGGANG CUI ET AL: "Exploiting bounded staleness to speed up Big Data analytics", 19 June 2014 (2014-06-19), XP055470122, Retrieved from the Internet <URL:http://www.cs.cmu.edu/~seunghak/Cui_etal_ATC14.pdf> [retrieved on 20180424] *
JAMES CIPAR ET AL: "Solving the straggler problem with bounded staleness", USENIX,, 14 May 2013 (2013-05-14), pages 1 - 6, XP061008417 *
ZHONGYANG ZHENG ET AL: "SpeeDO: Parallelizing Stochastic Gradient Descent for Deep Convolutional Neural Network", 25 May 2016 (2016-05-25), XP055472972, Retrieved from the Internet <URL:https://pdfs.semanticscholar.org/d998/2c505ee7af8ca42c911062d32831eb492c1a.pdf> [retrieved on 20180507] *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109978177B (zh) * 2019-03-19 2023-06-23 腾讯科技(深圳)有限公司 Model training method, service processing method, device and related equipment
CN109978177A (zh) * 2019-03-19 2019-07-05 腾讯科技(深圳)有限公司 Model training method, service processing method, device and related equipment
US11574216B2 (en) 2019-06-21 2023-02-07 Accenture Global Solutions Limited Coordinated multiple worker node causal inference framework
EP3754502A1 (en) * 2019-06-21 2020-12-23 Accenture Global Solutions Limited Coordinated multiple worker node causal inference framework
CN110619388A (zh) * 2019-09-20 2019-12-27 北京金山数字娱乐科技有限公司 Gradient synchronization method and device in distributed training
CN110619388B (zh) * 2019-09-20 2024-04-02 北京金山数字娱乐科技有限公司 Gradient synchronization method and device in distributed training
WO2021056043A1 (en) * 2019-09-23 2021-04-01 Presagen Pty Ltd Decentralised artificial intelligence (ai)/machine learning training system
US11631035B2 (en) 2019-11-05 2023-04-18 Technion Research & Development Foundation Limited Gap-aware mitigation of gradient staleness
WO2021090323A1 (en) * 2019-11-05 2021-05-14 Technion Research & Development Foundation Limited Gap-aware mitigation of gradient staleness
US20210166117A1 (en) * 2019-12-02 2021-06-03 Waymo Llc Machine learning training platform
US11941519B2 (en) * 2019-12-02 2024-03-26 Waymo Llc Machine learning training platform
CN113128696A (zh) * 2019-12-31 2021-07-16 香港理工大学深圳研究院 Distributed machine learning communication optimization method, device, server and terminal equipment
CN111580962A (zh) * 2020-04-29 2020-08-25 安徽理工大学 Distributed adaptive online learning method with weight decay
CN111580970A (zh) * 2020-05-07 2020-08-25 电子科技大学 Transmission scheduling method for model distribution and aggregation in federated learning
WO2022038397A1 (en) * 2020-08-19 2022-02-24 Telefonaktiebolaget Lm Ericsson (Publ) Generating a machine learning model
US20220121974A1 (en) * 2020-10-16 2022-04-21 Ford Global Technologies, Llc Automated synchronization of clone directed acyclic graphs
CN116702885A (zh) * 2023-08-02 2023-09-05 浪潮电子信息产业股份有限公司 Synchronous data parallel training control method, system, apparatus, device and medium
CN116702885B (zh) * 2023-08-02 2023-11-07 浪潮电子信息产业股份有限公司 Synchronous data parallel training control method, system, apparatus, device and medium

Also Published As

Publication number Publication date
CN111052155A (zh) 2020-04-21
CN111052155B (zh) 2024-04-16

Similar Documents

Publication Publication Date Title
WO2019042571A1 (en) DESCENT OF STOCHASTIC GRADIENT DISTRIBUTED TO AVERAGE ASYNCHRONOUS GRADIENT FORMATION
US11296923B2 (en) Network fault originator identification for virtual network infrastructure
US10505818B1 (en) Methods for analyzing and load balancing based on server health and devices thereof
US10949746B2 (en) Efficient parallel training of a network model on multiple graphics processing units
US10348825B2 (en) Network platform-as-a-service for creating and inserting virtual network functions into a service provider network
US9747093B2 (en) Device driver aggregation in operating system deployment
US8739157B2 (en) System and method for managing cloud deployment configuration of an application
US10348628B2 (en) Placement of virtual machines in a virtualized computing environment
US9785522B2 (en) Adaptive datacenter topology for distributed frameworks job control through network awareness
US20150006705A1 (en) Network device load balancing in a virtualized computing environment
US9756099B2 (en) Streams optional execution paths depending upon data rates
US11507359B2 (en) Performing firmware updates using blockchain
CN103701661A (zh) 一种实现节点监控的方法及系统
US10355929B2 (en) Mitigating network impact of disruptive device changes
US10218622B2 (en) Placing a network device into a maintenance mode in a virtualized computing environment
Loughran et al. Dynamic cloud deployment of a mapreduce architecture
Mandal et al. Heterogeneous bandwidth provisioning for virtual machine migration over SDN-enabled optical networks
US9866462B2 (en) Information processing system and information processing method
CN104219226A (zh) 一种确定云平台中最优通信代理节点数目的方法
US9280383B2 (en) Checkpointing for a hybrid computing node
US11886861B2 (en) Propagating application properties to multiple instances
CN105868012B (zh) 处理用户请求的方法和装置
US9722897B2 (en) Managing isolation requirements of a multi-node workload application
Ouimet et al. Game servers deployment automation case study
US20160274886A1 (en) Performing code load operations on managed components in a system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17762101

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17762101

Country of ref document: EP

Kind code of ref document: A1