WO2018077236A1 - Distributed machine learning method and system - Google Patents

Distributed machine learning method and system

Info

Publication number
WO2018077236A1
Authority
WO
WIPO (PCT)
Prior art keywords
parameter
current
global
computing node
update
Prior art date
Application number
PCT/CN2017/108036
Other languages
English (en)
French (fr)
Inventor
江佳伟
崔斌
黄明
肖品
胡奔龙
余乐乐
Original Assignee
Tencent Technology (Shenzhen) Company Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Company Limited
Publication of WO2018077236A1 publication Critical patent/WO2018077236A1/zh
Priority to US16/266,559 priority Critical patent/US11263539B2/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 Computing arrangements using knowledge-based models
    • G06N 5/04 Inference or reasoning models
    • G06N 5/043 Distributed expert systems; Blackboards
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F 9/4806 Task transfer initiation or dispatching
    • G06F 9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F 9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/52 Program synchronisation; Mutual exclusion, e.g. by means of semaphores
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Definitions

  • The present application relates to the technical field at the intersection of distributed computing and machine learning, and in particular to a parameter synchronization optimization method and system suitable for distributed machine learning.
  • Known distributed machine learning includes distributed machine learning based on synchronous parallel protocols and distributed machine learning based on asynchronous parallel protocols.
  • a typical distributed machine learning system includes a parameter server and a compute node.
  • Distributed machine learning based on a synchronous parallel protocol means that, in a distributed machine learning task, all computing nodes send their parameter updates to the parameter server after completing the same number of iterations; the parameter server obtains new global parameters from the updates of all computing nodes and broadcasts them to all computing nodes, and a computing node can start the next iteration only after receiving the new global parameters.
  • Distributed machine learning based on an asynchronous parallel protocol means that, in a distributed machine learning task, each computing node sends its parameter update to the parameter server after completing one iteration; the parameter server obtains new global parameters directly from that update, and the computing node fetches the updated global parameters from the parameter server and starts the next iteration without waiting for the other computing nodes.
  • In distributed machine learning under the synchronous parallel protocol, the parameter server can only be a single physical server, which becomes a single-point bottleneck when the model parameters are large; in industrial distributed environments, performance differences between computing nodes and network delays make some computing nodes markedly slower than others, so the speed of the whole system is limited by the slowest computing node.
  • In distributed machine learning under the asynchronous parallel protocol, because speed differences between computing nodes are allowed, inconsistencies arise between the global parameters on the parameter server and the parameter copies on the computing nodes; updates computed by different computing nodes from inconsistent parameter copies disturb the global parameters and make the global convergence of the learning model unstable.
  • a distributed machine learning method comprising:
  • the parameter server receiving a global parameter acquisition instruction of a current computing node;
  • the parameter server determining whether the difference between the current iteration round of the current computing node and the current iteration rounds of other computing nodes is within a preset range;
  • if so, the parameter server sending the global parameter to the current computing node;
  • the parameter server receiving an update parameter sent by the current computing node after it performs the iterative learning of the current iteration round according to the global parameter, calculating a delay parameter from the timestamp of receiving the update parameter and the timestamp at which the current computing node received the global parameter, and updating the global parameter according to the delay parameter and the update parameter to obtain an updated global parameter for storage.
  • a distributed machine learning system comprising: a processor and a memory coupled to the processor; wherein the memory stores an instruction unit executable by the processor, the instruction unit comprising:
  • An instruction receiving module configured to receive a global parameter obtaining instruction of a current computing node
  • a determining module configured to determine whether a difference between a current iteration number of the current computing node and a current number of iterations of other computing nodes is within a preset range
  • a global parameter sending module, configured to send the global parameter to the current computing node when the difference between the current iteration round and the current iteration rounds of the other computing nodes is within a preset range;
  • an update module, configured to receive an update parameter sent by the current computing node after it performs the iterative learning of the current iteration round according to the global parameter, calculate a delay parameter from the timestamp of receiving the update parameter and the timestamp at which the current computing node received the global parameter, and update the global parameter according to the delay parameter and the update parameter to obtain an updated global parameter for storage.
  • a distributed machine learning method comprising:
  • a computing node sending a global parameter acquisition instruction to the parameter server;
  • the computing node receiving the global parameter sent by the parameter server based on a determination of whether the difference between its current iteration round and the current iteration rounds of other computing nodes is within a preset range;
  • the computing node performing the iterative learning of the current iteration round according to the global parameter to obtain an update parameter;
  • the computing node sending the update parameter to the parameter server.
  • a distributed machine learning system comprising: a processor and a memory coupled to the processor; wherein the memory stores an instruction unit executable by the processor,
  • the instruction unit includes:
  • An instruction sending module configured to send a global parameter obtaining instruction to the parameter server
  • a global parameter receiving module configured to receive a global parameter sent by the parameter server according to whether a difference between a current iteration round and a current iteration number of other computing nodes is within a preset range
  • a learning module configured to perform an iterative learning of the current number of iteration rounds according to the global parameter to obtain an update parameter
  • An update parameter sending module is configured to send the update parameter to the parameter server.
  • a storage medium having stored thereon a computer program; the computer program being executable by a processor and implementing a distributed machine learning method of any of the above implementations.
  • Each computing node runs a parallel stochastic gradient descent algorithm on its assigned data subset to iteratively learn and train the machine learning model; parallelism accelerates model training and avoids a single-point bottleneck, ensuring that terabyte-scale and larger data volumes can be processed.
  • The computing node obtains the latest global parameters from the parameter server before starting each iteration.
  • The computing node starts the current round of iterative learning only after receiving the global parameters that the parameter server sends upon determining that the node's iteration speed is within the preset range.
  • FIG. 1 is a system architecture diagram of a distributed machine learning method in an embodiment;
  • FIG. 2 is a schematic diagram of the internal structure of a parameter server in an embodiment;
  • FIG. 3 is a flowchart of a distributed machine learning method in an embodiment;
  • FIG. 4 is a flowchart of a distributed machine learning method in another embodiment;
  • FIG. 5 is a schematic structural diagram of a distributed machine learning system in an embodiment;
  • FIG. 6 is a schematic structural diagram of a distributed machine learning system in another embodiment;
  • FIG. 7 is a schematic diagram of the internal structure of a computing node in an embodiment;
  • FIG. 8 is a flowchart of a distributed machine learning method in another embodiment;
  • FIG. 9 is a flowchart of a distributed machine learning method in yet another embodiment;
  • FIG. 10 is a schematic structural diagram of a distributed machine learning system in another embodiment;
  • FIG. 11 is a schematic structural diagram of a distributed machine learning system in still another embodiment.
  • the distributed machine learning method provided by the embodiment of the present application can be applied to the system shown in FIG. 1.
  • the main control node 100 communicates with the computing node 300 through the parameter server 200 and forms a distributed machine learning system.
  • the master node 100 sends a learning task instruction to the parameter server 200 and monitors the parameter server 200 and the computing node 300.
  • the parameter server 200 sends the global parameters to the computing nodes 300.
  • Each computing node 300 performs iterative learning according to the global parameters and returns the updated parameters to the parameter server 200.
  • The master node 100 can be a smartphone, a tablet, a personal digital assistant (PDA), a personal computer, or the like.
  • the parameter server 200 and the compute node 300 are typically physical server clusters, respectively.
  • the internal structure of the parameter server 200 shown in FIG. 1 is as shown in FIG. 2.
  • the parameter server 200 includes a processor 210, a storage medium 220, a memory 230, and a network interface 240 that are linked by a system bus.
  • the storage medium 220 of the parameter server 200 stores an operating system 221, a database 222, and a distributed machine learning system 223.
  • the database 222 is used to store data, such as global parameters, global parameter timestamps, current iteration rounds of the computing nodes, update parameters of the computing nodes, preset ranges that allow the speed difference of the computing nodes, and the like.
  • the processor 210 of the server 200 is used to provide computing and control capabilities to support the operation of the entire parameter server 200.
  • the memory 230 of the parameter server 200 provides an environment for the operation of the distributed machine learning system 223 in the storage medium 220.
  • The network interface 240 of the parameter server 200 is configured to communicate with the external master node 100 and computing nodes 300 through a network connection, for example receiving learning task instructions sent by the master node 100, sending global parameters to the computing nodes 300, and receiving update parameters sent by the computing nodes 300.
  • a distributed machine learning method is provided in an embodiment of the present application.
  • the method is applicable to the parameter server shown in FIG. 2, and specifically includes the following steps:
  • Step 101 Receive a global parameter acquisition instruction of a current computing node.
  • Distributed machine learning refers to performing a machine learning task in a distributed environment: the training data is split across multiple computing nodes, and each computing node runs a parallel stochastic gradient descent (SGD) algorithm on its assigned data subset to iteratively learn and train the machine learning model.
  • The stochastic gradient descent algorithm is an optimization algorithm commonly used in iterative machine learning and is not described further here.
  • There are multiple computing nodes, and before performing the iterative learning of its current iteration round each computing node sends a global parameter acquisition instruction to the parameter server to obtain the currently latest global parameters.
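  • As a minimal illustration of the local computation, the Python sketch below performs one iteration round of stochastic gradient descent over an assigned data subset and returns the accumulated update; the linear least-squares model, the learning rate, and the function name local_sgd are hypothetical choices made here for illustration, not specified by this application.

    import numpy as np

    def local_sgd(theta, data_subset, learning_rate=0.01):
        theta = np.array(theta, dtype=float)    # local copy of the global parameters
        start = theta.copy()
        for x, y in data_subset:                # traverse the samples assigned to this node
            grad = (np.dot(theta, x) - y) * x   # gradient of 0.5 * (theta . x - y)^2
            theta -= learning_rate * grad       # stochastic gradient step
        return theta - start                    # accumulated update u for this iteration round

    # toy usage: two three-dimensional samples
    subset = [(np.array([1.0, 0.0, 2.0]), 1.0), (np.array([0.5, 1.0, 0.0]), 0.0)]
    u = local_sgd(np.zeros(3), subset)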
  • Step 103 Determine whether the difference between the current number of iterations of the current computing node and the current number of iterations of other computing nodes is within a preset range. If yes, go to step 105.
  • The preset range limits the speed difference between different computing nodes in performing their respective iterative learning so that it does not exceed the corresponding range.
  • It prevents some computing nodes from iterating too fast, which would make the update speeds of the global parameter copies used by different computing nodes diverge too much and cause the update parameters computed from inconsistent global parameter copies to disturb the global parameters; limiting the speed difference between different computing nodes to the preset range forms distributed machine learning under a bounded asynchronous parallel protocol and reduces the disturbance of the global parameters by the updates produced by different computing nodes.
  • In step 105, the global parameter is sent to the current computing node.
  • The currently stored latest global parameter is sent to the current computing node; that is, when the difference between the current iteration round of the current computing node and the current iteration rounds of other computing nodes is within the preset range, the node's current iteration speed meets the requirement and the iterative learning of the current round can start.
  • Alternatively, the global parameter copy of the fastest of all current computing nodes may be sent to the current computing node as the latest global parameter; the fastest node's copy is usually the closest to the global parameter updated in real time, so using it as the latest global parameter can improve training accuracy.
  • Step 107: Receive the update parameter sent by the current computing node after it performs the iterative learning of the current iteration round according to the global parameter, calculate a delay parameter from the timestamp of receiving the update parameter and the timestamp at which the current computing node received the global parameter, and update the global parameter according to the delay parameter and the update parameter to obtain an updated global parameter for storage.
  • Before each round of iterative learning by the current computing node, the parameter server sends it the currently latest global parameter, so that the computing node performs the iterative learning of the current round according to the global parameter, obtains an update parameter, and returns it to the parameter server; the parameter server can update the global parameter once it receives the update parameter.
  • The iterative learning performed after a computing node receives the global parameter, in different rounds or on different computing nodes within the same round, incurs different degrees of delay. A delay parameter is calculated from the timestamp of receiving the update parameter and the timestamp at which the computing node received the global parameter, and the global parameter is updated jointly according to the delay parameter and the update parameter. The delay parameter directly reflects the degree of delay of the computing node; constraining the update with both the delay and the update gives each round of iterative learning a correspondingly different degree of influence on the global parameter, controlling the disturbance of the global parameter by the updates produced by different computing nodes.
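  • Written as a formula, in the notation of the pseudocode given later in this description (t is the global parameter timestamp, r[m] the timestamp at which computing node m last read the global parameter, and u_m its update), the delay-weighted update is

    \[ d = t - r[m], \qquad \theta \leftarrow \theta + \frac{1}{d}\, u_m , \]

    so an update computed from a stale parameter copy (large d) is scaled down before it is applied.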
  • Each computing node runs a parallel stochastic gradient descent algorithm on its assigned data subset to learn and train the machine learning model; parallelism accelerates model training, shortening training times from months to weeks or days.
  • A single-point bottleneck is avoided, ensuring that terabyte-scale and larger data volumes can be processed, and the difference in iteration rounds between different computing nodes is kept within a preset range.
  • For each computing node, a delay parameter is calculated from the timestamp of receiving its update parameter and the timestamp at which it received the global parameter, and the global parameter is updated according to the delay parameter and the update parameter; delay parameters representing different degrees of delay adjust the global parameter update to a corresponding degree, reducing the disturbance of the global parameter by the updates produced by different computing nodes and ensuring the stability of the overall convergence.
  • In one implementation, step 103, determining whether the difference between the current iteration round of the current computing node and the current iteration rounds of other computing nodes is within a preset range, is specifically: determining whether the difference between the current iteration round of the current computing node and the current minimum iteration round among all computing nodes is within a first preset range.
  • If the difference in iteration rounds between different computing nodes is too large, the server and the computing nodes cannot obtain each other's latest parameter information for a long time and part of the update data is lost, which ultimately reduces training accuracy.
  • The current minimum iteration round among all computing nodes represents the real-time progress of the slowest computing node; comparing against the slowest node's progress and determining whether the difference is within the first preset range ensures that the difference in iteration speed between the computing nodes does not exceed the preset range.
  • In another implementation, step 103 is specifically: determining whether the difference between the current iteration round of the current computing node and the current maximum iteration round among all computing nodes is within a second preset range; since the maximum iteration round reflects the progress of the fastest computing node, this likewise bounds the iteration speed difference between all computing nodes.
  • step 105 the step of sending the global parameter to the current computing node includes:
  • the global parameter is sent to the current computing node, and the timestamp of the current computing node receiving the global parameter is obtained and stored.
  • The timestamp at which the current computing node receives the global parameter indicates the time at which the node reads the global parameter before performing the iterative learning of the current round. Since every round of iterative learning produces an update parameter, that is, a new update to the global parameter, this timestamp is obtained and stored.
  • It can then serve as the starting time for measuring the delay of the current round of iterative learning, so that in the subsequent step of calculating the delay parameter from the timestamp of receiving the update parameter and the timestamp at which the current computing node obtained the global parameter, a more accurate measure of the delay of the current round is obtained.
  • On the parameter server, a global parameter θ is saved and a timestamp t of the global parameter is maintained. The parameter server monitors the iteration rounds of the fastest and slowest computing nodes, denoted C max and C min, and monitors the timestamp of each computing node's most recent acquisition of the global parameter, denoted r[]; C max, C min, and r[] are initialized to 0.
  • The parameter server provides a pull function interface and a push function interface for the compute nodes.
  • Before the c-th iteration starts, the computing node issues a global parameter acquisition instruction to the parameter server through the pull function, and the parameter server obtains the global parameter acquisition instruction of computing node m.
  • The specific implementation of determining whether the current iteration round c satisfies the preset range and then sending the global parameter to the current computing node is schematically represented as follows:
  • the current computing node is computing node m
  • the current iteration round is c
  • C min refers to the current minimum iteration round among all computing nodes
  • r[] refers to the timestamps at which the computing nodes obtained the global parameter
  • t refers to the global parameter timestamp
  • θ refers to the global parameter
  • S refers to the preset range.
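  • The pull-side behavior just listed can be sketched in Python roughly as follows. The sketch mirrors the Pull pseudocode given later in this description (state θ, t, C min, r[] and slack S); the class name, the use of a dict for r[], and returning None while the constraint is not met are illustrative choices, not part of the application.

    import numpy as np

    class ParameterServer:
        def __init__(self, theta, slack):
            self.theta = np.asarray(theta, dtype=float)  # global parameter θ
            self.t = 0          # global parameter timestamp
            self.c_min = 0      # current minimum iteration round over all computing nodes
            self.r = {}         # r[m]: timestamp at which node m last obtained θ
            self.slack = slack  # preset range S on the iteration-round difference

        def pull(self, m, c):
            # Return θ to node m for round c only while c <= C_min + S (bounded asynchrony);
            # updating c_min from the rounds reported by the nodes is omitted for brevity.
            if c <= self.c_min + self.slack:
                self.r[m] = self.t   # record when node m read the global parameter
                return self.theta
            return None              # caller waits and retries until within the preset range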
  • The parameter server provides a pull interface for the compute nodes.
  • When starting the c-th iteration, the m-th compute node obtains the latest global parameters from the parameter server through the pull interface.
  • The parameter server compares the current iteration round c with the minimum iteration round among all current nodes and checks whether the difference is within the preset range, so that the c-th iteration starts under the bounded asynchrony constraint; if the c-th iteration can start, the timestamp r[m] at which the current computing node obtained the global parameter for its current round is updated to the global parameter timestamp, that is, the current global parameter timestamp is taken as r[m], and the latest global parameters are returned to the current computing node.
  • In another embodiment, step 107, receiving the update parameter sent by the current computing node after it performs the iterative learning of the current iteration round according to the global parameter, calculating a delay parameter from the timestamp of receiving the update parameter and the timestamp at which the current computing node obtained the global parameter, and updating the global parameter according to the delay parameter and the update parameter to obtain an updated global parameter for storage, specifically includes:
  • Step 1071 Receive an update parameter sent after the iterative learning of the current iteration round by the current computing node according to the global parameter.
  • Step 1072 Obtain a timestamp of receiving the update parameter as a current timestamp of the global parameter, and calculate a difference between a current timestamp of the global parameter and a timestamp of the global parameter obtained by the current computing node as a delay parameter.
  • Step 1073 Update the global parameter by updating the ratio of the parameter to the delay parameter to obtain the updated global parameter for storage.
  • After the computing node completes the iteration of the current round, it sends an update parameter for the global parameter to the parameter server.
  • The parameter server takes the timestamp of receiving the update parameter as the current timestamp of the global parameter and calculates the difference between this current timestamp and the timestamp at which the computing node obtained the global parameter as the delay parameter; this delay parameter corresponds to the current round's update of the current computing node and reflects the degree of delay of the current round of iterative learning.
  • The global parameter is updated by the ratio of the update parameter to the delay parameter: the larger the delay parameter, the smaller the influence of the corresponding update parameter on the global parameter update. This is equivalent to penalizing the update parameter by the delay parameter before updating the global parameter, so that delay in the bounded asynchronous parallel learning process is perceived and the update of the global parameter is controlled according to the degree of delay, further reducing the disturbance of the global parameter by the updates produced by different iteration rounds of different computing nodes within the allowed range of iteration speed differences.
  • The parameter server uses the number of parameter updates as the timestamp of the global parameter, that is, each time an update parameter is received, the parameter server increments the global parameter timestamp by 1 as the current timestamp of the global parameter.
  • After completing the c-th iteration, computing node m sends the update parameter to the parameter server; the parameter server obtains the timestamp of receiving the update parameter, calculates the delay parameter, and updates the global parameter according to the delay parameter and the update parameter.
  • the current computing node is computing node m
  • the current iteration round is c
  • t refers to the global parameter timestamp
  • θ refers to the global parameter
  • r[] refers to the timestamps at which the computing nodes read the global parameter
  • d refers to the delay parameter
  • u refers to the update parameter.
  • The parameter server provides a push interface for the compute nodes. After completing the c-th iteration, the m-th computing node sends the update parameter u produced by this iteration to the parameter server through the push interface. The parameter server increments the global parameter timestamp by 1 as the current timestamp of the global parameter, representing the timestamp of receiving the update parameter.
  • The current timestamp of the global parameter minus the timestamp at which the computing node obtained the global parameter through the Pull interface gives the delay parameter d.
  • The update parameter u is divided by the delay parameter d, as a penalty on the update parameter, and then added to the global parameter to obtain the latest global parameter.
  • In this embodiment, the parameter server uses the number of parameter updates as the current timestamp of the global parameter.
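  • Continuing the hypothetical ParameterServer sketch above, the push side can be written as the method below; it follows the Push pseudocode given later in the description (the count of received updates serves as the current timestamp, d = t - r[m], and θ is increased by u/d).

    # an additional method of the ParameterServer class sketched earlier
    def push(self, m, c, u):
        self.t += 1                       # one more update received: advance the global timestamp
        d = self.t - self.r[m]            # delay since node m last read θ (at least 1)
        self.theta = self.theta + u / d   # penalized update: stale updates (large d) are damped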
  • a distributed machine learning system including an instruction receiving module 10, a judging module 13, a global parameter sending module 15, and an updating module 17.
  • the instruction receiving module 10 is configured to receive a global parameter acquisition instruction of the current computing node.
  • the determining module 13 is configured to determine whether the difference between the current number of iterations of the current computing node and the current number of iterations of the other computing nodes is within a preset range.
  • the global parameter sending module 15 is configured to send the global parameter to the current computing node when the difference between the current number of iterations and the current number of iterations of the other computing nodes is within a preset range.
  • The update module 17 is configured to receive the update parameter sent by the current computing node after it performs the iterative learning of the current iteration round according to the global parameter, calculate a delay parameter from the timestamp of receiving the update parameter and the timestamp at which the current computing node obtained the global parameter, and update the global parameter according to the delay parameter and the update parameter to obtain an updated global parameter for storage.
  • the determining module 13 is specifically configured to determine whether the difference between the current number of iterations of the current computing node and the current minimum number of iterations in all computing nodes is within a first preset range. As another optional embodiment, the determining module 13 is configured to determine whether the difference between the current iteration number of the current computing node and the current maximum number of iteration rounds in all computing nodes is within a second preset range.
  • the global parameter sending module 15 is specifically configured to send the global parameter to the current computing node, and obtain a timestamp of the current computing node to receive the global parameter for storage.
  • the parameter server uses the global parameter timestamp maintained by itself as a timestamp for the computing node to receive the global parameter.
  • The update module 17 specifically includes a receiving unit 171, a computing unit 172, and an updating unit 173.
  • The receiving unit 171 is configured to receive the update parameter sent by the current computing node after it performs the iterative learning of the current iteration round according to the global parameter.
  • The computing unit 172 is configured to take the timestamp of receiving the update parameter as the current timestamp of the global parameter and calculate the difference between the current timestamp of the global parameter and the timestamp at which the current computing node obtained the global parameter as the delay parameter.
  • The updating unit 173 is configured to update the global parameter by the ratio of the update parameter to the delay parameter to obtain an updated global parameter for storage.
  • the internal structure of the computing node 300 shown in FIG. 1 is as shown in FIG. 7.
  • the computing node 300 includes a processor 310, a storage medium 320, a memory 330, and a network interface 340 connected by a system bus.
  • the storage medium 320 of the computing node 300 stores an operating system 321, a database 322, and a distributed machine learning system 323.
  • the database 322 is used to store local data, such as storing global parameters obtained from the parameter server 200 as global parameter copies and the like.
  • The processor 310 of the computing node 300 is used to provide computing and control capabilities to support the operation of the entire computing node within the distributed machine learning system.
  • The memory 330 of the compute node 300 provides an operating environment for the distributed machine learning system in the storage medium.
  • The network interface 340 of the computing node 300 is configured to communicate with the external parameter server 200 through a network connection, for example sending global parameter acquisition instructions to the parameter server 200, receiving global parameters sent by the parameter server 200, and sending update parameters to the parameter server 200.
  • a distributed machine learning method is provided in another embodiment of the present application.
  • the method is applicable to the computing node shown in FIG. 1 , and specifically includes the following steps:
  • Step 201 Send a global parameter acquisition instruction to the parameter server.
  • the parameter server provides a pull interface for the computing node, and the computing node sends a global parameter obtaining instruction to the parameter server through the pull interface to obtain the latest global parameter before starting the iterative learning of the current iteration number.
  • Step 203 Receive a global parameter sent by the parameter server according to whether the difference between the current iteration number and the current iteration number of other computing nodes is within a preset range.
  • The parameter server determines from the computing node's current iteration round whether its iteration speed meets the requirement, and sends the latest global parameter to the computing node only when it does; thus, on top of the asynchronous-parallel-protocol-based distributed machine learning method, the difference in iteration speed between different computing nodes is controlled, implementing distributed machine learning under a bounded asynchronous parallel protocol.
  • A certain speed difference between computing nodes is therefore allowed, and fast computing nodes need not wait for slow ones, which avoids the whole system waiting for the slowest computing node and effectively reduces waiting time.
  • the parameter server determines whether the current iteration speed of the current iteration number satisfies the requirement by determining whether the difference between the current iteration number and the current iteration number of the other computing nodes is within a preset range.
  • the latest global parameter refers to the updated global parameter obtained by the parameter server in real time updating the global parameter according to the update generated by each round of iterative learning of the computing node.
  • In another embodiment, upon receiving the global parameter acquisition instruction of the current computing node for its current round, and upon determining that the iteration speed difference is within range, the parameter server sends the global parameter copy of the fastest of all current computing nodes to the current computing node as the latest global parameter.
  • The fastest node's global parameter copy usually has the smallest gap from the global parameter updated collaboratively in real time, so using it as the latest global parameter can improve training accuracy.
  • Step 205 Perform an iterative learning of the current number of iteration rounds according to the global parameter to obtain an update parameter.
  • the computing node receives the global parameters sent by the parameter server, performs a parallel stochastic gradient descent algorithm with the assigned subset of data, and iteratively learns to train the machine learning model to obtain updated parameters.
  • Step 207 sending the update parameter to the parameter server.
  • the computing node invokes the push interface of the parameter server to send the update parameters to the parameter server to provide the parameter server to update the global parameters.
  • Each computing node runs a parallel stochastic gradient descent algorithm on its assigned data subset to iteratively learn and train the machine learning model, and parallelism accelerates model training.
  • This avoids a single-point bottleneck and ensures that terabyte-scale and larger data volumes can be processed.
  • The computing node obtains the latest global parameters from the parameter server before starting each iteration.
  • The computing node starts the current round of iterative learning only after receiving the global parameters that the parameter server sends upon determining that the node's iteration speed is within the preset range, so the speed difference between different computing nodes is limited to the preset range, forming distributed machine learning under a bounded asynchronous parallel protocol and reducing the disturbance of the global parameters by the updates produced by different computing nodes.
  • In one implementation, after step 203 of receiving the global parameter sent by the parameter server based on the determination of whether the current iteration round and the current iteration rounds of other computing nodes are within the preset range, the method further includes:
  • Step 204: Send the timestamp of receiving the global parameter to the parameter server.
  • The computing node sends the timestamp of receiving the global parameter to the parameter server, which stores it as the starting time for calculating the delay parameter corresponding to the update parameter produced by the current computing node's current round of iterative learning.
  • This enables the parameter server to use that timestamp to calculate the delay parameter corresponding to the computing node for the update parameter of each round of iterative learning, and to apply a corresponding penalty to updates from heavily delayed iteration rounds, preventing such updates from disturbing the global parameter and controlling the disturbance of the global parameter by the updates produced by different computing nodes.
  • the timestamp of the computing node receiving the global parameter may be a global parameter timestamp maintained by the parameter server, and the global parameter timestamp may be a global parameter timestamp determined by the parameter server according to the number of times the update parameter is received.
  • The computing node issues a global parameter acquisition instruction to the parameter server before the c-th iteration starts; computing node m receives the latest global parameters sent by the parameter server after it determines that the current iteration round satisfies the preset range, performs iterative learning, and returns the resulting update parameter to the parameter server. A specific implementation is schematically represented as follows:
  • θm = pull(m, c) // before starting the c-th iteration, obtain the latest global parameters from the parameter server
  • θm represents the copy of the global parameter saved on the compute node
  • C represents the maximum number of iterations
  • um represents the local update parameter of compute node m.
  • Before starting the c-th iteration, the compute node calls the pull interface of the parameter server to obtain the latest global parameters and initializes its local update parameter to 0.
  • The compute node then uses the parallel stochastic gradient descent algorithm to traverse its assigned data subset, iteratively learning and training the machine learning model to obtain the update parameter, and finally calls the push interface of the parameter server to send the update parameter, so that the parameter server can update the global parameters in real time according to it.
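  • Put together, the compute-node loop described above can be sketched in Python as follows, assuming the hypothetical ParameterServer and local_sgd sketches given earlier; retrying pull until it succeeds stands in for waiting under the bounded-asynchrony constraint.

    import time

    def worker(m, server, data_subset, max_rounds):
        for c in range(max_rounds):
            theta_m = server.pull(m, c)              # obtain the latest global parameters before round c
            while theta_m is None:                   # outside the preset range: wait for slower nodes
                time.sleep(0.01)
                theta_m = server.pull(m, c)
            u_m = local_sgd(theta_m, data_subset)    # one round of SGD on the assigned data subset
            server.push(m, c, u_m)                   # send the local update for this round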
  • a distributed machine learning system including an instruction sending module 21, a global parameter receiving module 23, a learning module 25, and an update parameter sending module 27.
  • the instruction sending module 21 is configured to send a global parameter obtaining instruction to the parameter server.
  • the global parameter receiving module 23 is configured to receive a global parameter sent by the parameter server according to whether the difference between the current iteration round number and the current iteration number of other computing nodes is within a preset range.
  • the learning module 25 is configured to perform an iterative learning of the current number of iteration rounds according to the global parameters to obtain an update parameter.
  • the update parameter sending module 27 is configured to send the update parameters to the parameter server.
  • the distributed machine learning system further includes a timestamp sending module 24.
  • The timestamp sending module 24 is configured to send the timestamp at which the global parameter receiving module 23 received the global parameter to the parameter server, after the global parameter receiving module 23 receives the global parameter sent by the parameter server based on the determination of whether the current iteration round and the current iteration rounds of other computing nodes are within the preset range.
  • All or part of the processes in the above method embodiments may be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a computer-readable storage medium and executed by reading it out of the storage medium directly or by installing or copying it into a storage device (such as a hard disk and/or memory) of the data processing device.
  • The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), or a random access memory (RAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Debugging And Monitoring (AREA)
  • Information Transfer Between Computers (AREA)
  • Computer And Data Communications (AREA)

Abstract

A distributed machine learning method, comprising: receiving a global parameter acquisition instruction of a current computing node (101); determining whether the difference between the current iteration round of the current computing node and the current iteration rounds of other computing nodes is within a preset range (103); if so, sending the global parameter to the current computing node (105); receiving an update parameter sent by the current computing node after it performs the iterative learning of the current iteration round according to the global parameter, calculating a delay parameter from the timestamp of receiving the update parameter and the timestamp at which the current computing node received the global parameter, and updating the global parameter according to the delay parameter and the update parameter to obtain an updated global parameter for storage (107). A distributed machine learning system is further provided.

Description

Distributed machine learning method and system
This application claims priority to Chinese Patent Application No. 201610968121.4, entitled "Distributed machine learning method and system" and filed with the Chinese Patent Office on October 31, 2016, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field at the intersection of distributed computing and machine learning, and in particular to a parameter synchronization optimization method and system suitable for distributed machine learning.
Background of the Invention
With the arrival of the big data era, machine learning algorithms, especially deep learning algorithms suited to large-scale data, are receiving ever wider attention and application, including image recognition, recommendation, and user behavior prediction. However, as the input training data (the data used in machine learning to solve a neural network model) and the neural network models grow, training on a single node runs into memory limits and training times of weeks or even months, and distributed machine learning has emerged in response.
Known distributed machine learning includes distributed machine learning based on a synchronous parallel protocol and distributed machine learning based on an asynchronous parallel protocol. A typical distributed machine learning system includes a parameter server and computing nodes. In distributed machine learning based on a synchronous parallel protocol, within one distributed machine learning task all computing nodes send their parameter updates to the parameter server after completing the same number of iterations; the parameter server obtains new global parameters from the updates of all computing nodes and broadcasts the new global parameters to all computing nodes, and a computing node can start the next iteration only after receiving the new global parameters. In distributed machine learning based on an asynchronous parallel protocol, each computing node sends its parameter update to the parameter server after completing one iteration, the parameter server obtains new global parameters directly from that update, and the computing node fetches the updated global parameters from the parameter server and starts the next iteration without waiting for the other computing nodes.
The above distributed machine learning, however, has the following drawbacks:
In distributed machine learning under the synchronous parallel protocol, the parameter server can only be a single physical server and becomes a single-point bottleneck when the model parameters are large; in industrial distributed environments, performance differences between computing nodes and network delays make some computing nodes markedly slower than others, so the speed of the whole system is limited by the slowest computing node. In distributed machine learning under the asynchronous parallel protocol, because speed differences between computing nodes are allowed, inconsistencies arise between the global parameters on the parameter server and the parameter copies on the computing nodes, and updates computed by different computing nodes from inconsistent parameter copies disturb the global parameters, making the global convergence of the learning model unstable.
Summary of the Invention
On this basis, it is necessary to provide a distributed machine learning method and system that have no single-point bottleneck and converge stably.
A distributed machine learning method includes:
a parameter server receiving a global parameter acquisition instruction of a current computing node;
the parameter server determining whether the difference between the current iteration round of the current computing node and the current iteration rounds of other computing nodes is within a preset range;
if so, the parameter server sending a global parameter to the current computing node;
the parameter server receiving an update parameter sent by the current computing node after the current computing node performs the iterative learning of the current iteration round according to the global parameter, calculating a delay parameter from the timestamp of receiving the update parameter and the determined timestamp at which the current computing node received the global parameter, and updating the global parameter according to the delay parameter and the update parameter to obtain an updated global parameter for storage.
A distributed machine learning system includes a processor and a memory connected to the processor, the memory storing instruction units executable by the processor, the instruction units including:
an instruction receiving module, configured to receive a global parameter acquisition instruction of a current computing node;
a determining module, configured to determine whether the difference between the current iteration round of the current computing node and the current iteration rounds of other computing nodes is within a preset range;
a global parameter sending module, configured to send a global parameter to the current computing node when the difference between the current iteration round and the current iteration rounds of other computing nodes is within the preset range;
an update module, configured to receive an update parameter sent by the current computing node after it performs the iterative learning of the current iteration round according to the global parameter, calculate a delay parameter from the timestamp of receiving the update parameter and the timestamp at which the current computing node received the global parameter, and update the global parameter according to the delay parameter and the update parameter to obtain an updated global parameter for storage.
A distributed machine learning method includes:
a computing node sending a global parameter acquisition instruction to a parameter server;
the computing node receiving a global parameter sent by the parameter server based on a determination result of whether the difference between the current iteration round of the computing node and the current iteration rounds of other computing nodes is within a preset range;
the computing node performing the iterative learning of the current iteration round according to the global parameter to obtain an update parameter;
the computing node sending the update parameter to the parameter server.
A distributed machine learning system includes a processor and a memory connected to the processor, the memory storing instruction units executable by the processor, the instruction units including:
an instruction sending module, configured to send a global parameter acquisition instruction to a parameter server;
a global parameter receiving module, configured to receive a global parameter sent by the parameter server based on a determination result of whether the difference between the current iteration round and the current iteration rounds of other computing nodes is within a preset range;
a learning module, configured to perform the iterative learning of the current iteration round according to the global parameter to obtain an update parameter;
an update parameter sending module, configured to send the update parameter to the parameter server.
A storage medium stores a computer program; the computer program can be executed by a processor to implement the distributed machine learning method of any of the above implementations.
In the above distributed machine learning method and system, each computing node runs a parallel stochastic gradient descent algorithm on its assigned data subset to iteratively learn and train the machine learning model; parallelism accelerates model training and avoids a single-point bottleneck, ensuring that terabyte-scale and larger data volumes can be processed. Before starting each iteration a computing node obtains the latest global parameters from the parameter server, and it starts the current round of iterative learning only after receiving the global parameters that the parameter server sends upon determining that the node's iteration speed is within the preset range. The speed difference between different computing nodes is thus limited to the preset range, forming distributed machine learning under a bounded asynchronous parallel protocol, which reduces the disturbance of the global parameters by the updates produced by different computing nodes and ensures stable convergence.
Brief Description of the Drawings
FIG. 1 is a system architecture diagram of a distributed machine learning method in an embodiment;
FIG. 2 is a schematic diagram of the internal structure of a parameter server in an embodiment;
FIG. 3 is a flowchart of a distributed machine learning method in an embodiment;
FIG. 4 is a flowchart of a distributed machine learning method in another embodiment;
FIG. 5 is a schematic structural diagram of a distributed machine learning system in an embodiment;
FIG. 6 is a schematic structural diagram of a distributed machine learning system in another embodiment;
FIG. 7 is a schematic diagram of the internal structure of a computing node in an embodiment;
FIG. 8 is a flowchart of a distributed machine learning method in another embodiment;
FIG. 9 is a flowchart of a distributed machine learning method in yet another embodiment;
FIG. 10 is a schematic structural diagram of a distributed machine learning system in another embodiment;
FIG. 11 is a schematic structural diagram of a distributed machine learning system in still another embodiment.
Modes for Carrying Out the Present Application
To make the objectives, technical solutions, and advantages of the embodiments of the present application clearer, the embodiments of the present application are described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are intended only to explain the embodiments of the present application and are not intended to limit the present application.
Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the technical field of the present application. The terms used in the specification of the present application are for the purpose of describing specific embodiments only and are not intended to limit the present application. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
The distributed machine learning method provided by the embodiments of the present application can be applied to the system shown in FIG. 1. As shown in FIG. 1, a master node 100 communicates with computing nodes 300 through a parameter server 200, together forming a distributed machine learning system. The master node 100 sends learning task instructions to the parameter server 200 and monitors the parameter server 200 and the computing nodes 300. After receiving a learning task instruction, the parameter server 200 sends global parameters to the computing nodes 300, and each computing node 300 performs iterative learning according to the global parameters and returns update parameters to the parameter server 200. The master node 100 may be a smartphone, a tablet, a personal digital assistant (PDA), a personal computer, or the like. The parameter server 200 and the computing nodes 300 are each typically a cluster of physical servers.
The internal structure of the parameter server 200 shown in FIG. 1 is shown in FIG. 2. The parameter server 200 includes a processor 210, a storage medium 220, a memory 230, and a network interface 240 connected by a system bus. The storage medium 220 of the parameter server 200 stores an operating system 221, a database 222, and a distributed machine learning system 223. The database 222 stores data such as the global parameters, the global parameter timestamp, the current iteration rounds of the computing nodes, the update parameters of the computing nodes, and the preset range allowed for the speed difference between computing nodes. The processor 210 of the parameter server 200 provides computing and control capabilities and supports the operation of the entire parameter server 200. The memory 230 of the parameter server 200 provides an environment for running the distributed machine learning system 223 in the storage medium 220. The network interface 240 of the parameter server 200 communicates with the external master node 100 and computing nodes 300 over a network connection, for example receiving learning task instructions sent by the master node 100, sending global parameters to the computing nodes 300, and receiving update parameters sent by the computing nodes 300.
As shown in FIG. 3, an embodiment of the present application provides a distributed machine learning method that can be applied to the parameter server shown in FIG. 2 and specifically includes the following steps:
Step 101: receive a global parameter acquisition instruction of a current computing node.
Distributed machine learning means performing a machine learning task in a distributed environment: the training data is split across multiple computing nodes, and each computing node runs a parallel stochastic gradient descent (SGD) algorithm on its assigned data subset to iteratively learn and train the machine learning model. Stochastic gradient descent is an optimization algorithm commonly used in iterative machine learning and is not described further here. There are multiple computing nodes, and before performing the iterative learning of its current iteration round each computing node sends a global parameter acquisition instruction to the parameter server to obtain the currently latest global parameters.
Step 103: determine whether the difference between the current iteration round of the current computing node and the current iteration rounds of other computing nodes is within a preset range. If so, perform step 105.
The preset range limits the speed difference between different computing nodes in performing their respective iterative learning so that it does not exceed the corresponding range. Limiting the speed difference between different computing nodes to this preset range prevents some computing nodes from iterating too fast, which would make the update speeds of the global parameter copies used by different computing nodes diverge too much and cause the update parameters computed by different computing nodes from inconsistent global parameter copies to disturb the global parameters. Limiting the speed difference between different computing nodes to the preset range forms distributed machine learning under a bounded asynchronous parallel protocol and reduces the disturbance of the global parameters by the updates produced by different computing nodes.
In addition, when the difference between the current iteration round of the current computing node and the current iteration rounds of other computing nodes exceeds the preset range, meaning that the current iteration speed of this computing node is too fast or too slow, the node waits until the condition is satisfied before it can start the iterative learning of the current round.
Step 105: send the global parameters to the current computing node.
When the difference between the current iteration round of the current computing node and the current iteration rounds of other computing nodes is within the preset range, the currently stored latest global parameters are sent to the current computing node. That is, when this difference is within the preset range, the current iteration speed of the computing node meets the requirement and the iterative learning of the current round can start. In another implementation, the global parameter copy of the fastest of all current computing nodes may also be sent to the current computing node as the latest global parameters; the global parameter copy of the fastest computing node usually has the smallest gap from the collaboratively updated global parameters, so using it as the latest global parameters can improve training accuracy.
Step 107: receive an update parameter sent by the current computing node after it performs the iterative learning of the current iteration round according to the global parameters, calculate a delay parameter from the timestamp of receiving the update parameter and the timestamp at which the current computing node received the global parameters, and update the global parameters according to the delay parameter and the update parameter to obtain updated global parameters for storage.
Before each round of iterative learning by the current computing node, the parameter server sends it the currently latest global parameters, so that the computing node performs the iterative learning of the current round according to the global parameters, obtains an update parameter, and returns it to the parameter server. The parameter server can update the global parameters once it receives the update parameter. The iterative learning performed after a computing node receives the global parameters, in different rounds or on different computing nodes within the same round, incurs different degrees of delay. A delay parameter is calculated from the timestamp of receiving the update parameter and the timestamp at which the computing node received the global parameters, and the global parameters are updated jointly according to the delay parameter and the update parameter. The delay parameter directly reflects the degree of delay of the computing node; by constraining the update of the global parameters with both the delay and the update, different degrees of delay in each round of iterative learning have correspondingly different degrees of influence on the update of the global parameters, controlling the disturbance of the global parameters by the updates produced by different computing nodes.
In the above distributed machine learning method, each computing node runs a parallel stochastic gradient descent algorithm on its assigned data subset to iteratively learn and train the machine learning model; parallelism accelerates model training, shortening training times from months to weeks or days and avoiding a single-point bottleneck, ensuring that terabyte-scale and larger data volumes can be processed. By keeping the difference in iteration rounds between different computing nodes within a preset range, calculating for each computing node a delay parameter from the timestamp of receiving its update parameter and the timestamp at which it received the global parameters, and updating the global parameters under the joint constraint of the delay parameter and the update parameter, delay parameters representing different degrees of delay adjust the update of the global parameters to a corresponding degree, reducing the disturbance of the global parameters by the updates produced by different computing nodes and ensuring the stability of the overall convergence.
In one implementation, step 103, determining whether the difference between the current iteration round of the current computing node and the current iteration rounds of other computing nodes is within a preset range, is specifically:
determining whether the difference between the current iteration round of the current computing node and the current minimum iteration round among all computing nodes is within a first preset range.
In distributed machine learning, if the difference in iteration rounds between different computing nodes is too large, the server and the computing nodes cannot obtain each other's latest parameter information for a long time and part of the update data is lost, which ultimately reduces training accuracy. The current minimum iteration round among all computing nodes represents the real-time progress of the slowest computing node; by comparing against the slowest node's real-time progress and determining whether the difference is within the first preset range, it is ensured that the difference in iteration speed between all computing nodes does not exceed the preset range.
In another specific embodiment, step 103, determining whether the difference between the current iteration round of the current computing node and the current iteration rounds of other computing nodes is within a preset range, is specifically:
determining whether the difference between the current iteration round of the current computing node and the current maximum iteration round among all computing nodes is within a second preset range.
Since the current maximum iteration round among all computing nodes represents the real-time progress of the fastest computing node, comparing against the fastest node's real-time progress and determining whether the difference is within the second preset range likewise ensures that the difference in iteration speed between all computing nodes does not exceed the preset range.
In one implementation, step 105, sending the global parameters to the current computing node, includes:
sending the global parameters to the current computing node, and obtaining and storing the timestamp at which the current computing node received the global parameters.
The timestamp at which the current computing node received the global parameters represents the time at which the computing node read the global parameters before performing the iterative learning of the current round. Since every round of iterative learning produces an update parameter, that is, a new update to the global parameters, obtaining and storing this timestamp makes it convenient to use it as the starting time for measuring the delay of the current round of iterative learning, so that in the subsequent step of calculating the delay parameter from the timestamp of receiving the update parameter and the timestamp at which the current computing node obtained the global parameters, a more accurate measure of the delay of the current round is obtained.
In a specific embodiment, the parameter server stores a global parameter θ and maintains a timestamp t of the global parameter. The parameter server monitors the iteration rounds of the fastest and slowest computing nodes, denoted Cmax and Cmin, and monitors the timestamp of each computing node's most recent acquisition of the global parameter, denoted r[]; Cmax, Cmin, and r[] are initialized to 0. The parameter server provides a pull function interface and a push function interface for the computing nodes.
Taking the m-th computing node starting the c-th iteration as an example, before the c-th iteration starts the computing node issues a global parameter acquisition instruction to the parameter server through the pull function; the parameter server obtains the global parameter acquisition instruction of computing node m, determines whether the current iteration round c satisfies the preset range, and then sends the global parameter to the current computing node. A specific implementation is schematically represented as follows:
function Pull(m,c):
If c<=Cmin+S:
r[m]=t
return θ
Here the current computing node is computing node m, the current iteration round is c, Cmin is the current minimum iteration round among all computing nodes, r[] is the timestamps at which the computing nodes obtained the global parameter, t is the global parameter timestamp, θ is the global parameter, and S is the preset range. Understandably, in the above determination of whether the current iteration round c satisfies the preset range, Cmin may also be replaced by the current maximum iteration round Cmax among all computing nodes; the corresponding implementation is schematically represented as follows:
function Pull(m,c):
If c<=Cmax-S:
r[m]=t
return θ
The parameter server provides the pull interface for the computing nodes. When starting the c-th iteration, the m-th computing node obtains the latest global parameters from the parameter server through the pull interface. The parameter server compares the current iteration round c with the minimum iteration round among all current nodes and checks whether the difference in iteration rounds is within the preset range, so that the c-th iteration starts under the bounded asynchrony constraint. If the c-th iteration can start, the timestamp r[m] at which the current computing node obtained the global parameter for its current iteration round is updated to the global parameter timestamp, that is, the current global parameter timestamp is taken as r[m], and the latest global parameters are returned to the current computing node.
Referring to FIG. 4, in another embodiment, step 107, receiving the update parameter sent by the current computing node after it performs the iterative learning of the current iteration round according to the global parameters, calculating a delay parameter from the timestamp of receiving the update parameter and the timestamp at which the current computing node obtained the global parameters, and updating the global parameters according to the delay parameter and the update parameter to obtain updated global parameters for storage, specifically includes:
Step 1071: receive the update parameter sent by the current computing node after it performs the iterative learning of the current iteration round according to the global parameters.
Step 1072: take the timestamp of receiving the update parameter as the current timestamp of the global parameters, and calculate the difference between the current timestamp of the global parameters and the timestamp at which the current computing node obtained the global parameters as the delay parameter.
Step 1073: update the global parameters by the ratio of the update parameter to the delay parameter to obtain updated global parameters for storage.
After the computing node completes the iteration of the current round, it sends an update parameter for the global parameters to the parameter server. The parameter server takes the timestamp of receiving the update parameter as the current timestamp of the global parameters and calculates the difference between this current timestamp and the timestamp at which the computing node obtained the global parameters as the delay parameter. The delay parameter corresponds to the current round's update of the current computing node and reflects the degree of delay of the current round of iterative learning. The global parameters are updated by the ratio of the update parameter to the delay parameter: the larger the delay parameter, the smaller the influence of the corresponding update parameter on the update of the global parameters, and the smaller the delay parameter, the larger the influence. This is equivalent to applying a penalty to the update parameter via the delay parameter before updating the global parameters, so that delay in the bounded asynchronous parallel learning process is intelligently perceived and the update of the global parameters is controlled and adjusted based on the different degrees of delay, further reducing the disturbance of the global parameters by the updates produced by different iteration rounds of different computing nodes within the allowed range of iteration speed differences.
In a specific embodiment, the parameter server uses the number of parameter updates as the current timestamp of the global parameters, that is, each time an update parameter is received, the parameter server increments the global parameter timestamp by 1 as the current timestamp. Still taking the m-th computing node and the c-th iteration as an example, after completing the c-th iteration computing node m sends the update parameter to the parameter server; the parameter server obtains the timestamp of receiving the update parameter, calculates the delay parameter, and updates the global parameters according to the delay parameter and the update parameter. A specific implementation is schematically represented as follows:
function Push(m,c,u):
t=t+1:
d=t-r[m]
θ=θ+1/d*u
Here the current computing node is computing node m, the current iteration round is c, t is the global parameter timestamp, θ is the global parameter, r[] is the timestamps at which the computing nodes read the global parameter, d is the delay parameter, and u is the update parameter. The parameter server provides the push interface for the computing nodes. After completing the c-th iteration, the m-th computing node sends the update parameter u produced by this iteration to the parameter server through the push interface. The parameter server increments the global parameter timestamp by 1 as the current timestamp of the global parameters, representing the timestamp of receiving the update parameter, and subtracts from it the timestamp at which the computing node obtained the global parameters through the Pull interface to obtain the delay parameter d. The update parameter u is divided by the delay parameter d, as a penalty on the update parameter, and then added to the global parameters to obtain the latest global parameters. In this embodiment, the parameter server uses the number of parameter updates as the current timestamp of the global parameters.
As shown in FIG. 5, in one embodiment a distributed machine learning system is provided, including an instruction receiving module 10, a determining module 13, a global parameter sending module 15, and an update module 17. The instruction receiving module 10 is configured to receive a global parameter acquisition instruction of a current computing node. The determining module 13 is configured to determine whether the difference between the current iteration round of the current computing node and the current iteration rounds of other computing nodes is within a preset range. The global parameter sending module 15 is configured to send the global parameters to the current computing node when the difference between the current iteration round and the current iteration rounds of other computing nodes is within the preset range. The update module 17 is configured to receive an update parameter sent by the current computing node after it performs the iterative learning of the current iteration round according to the global parameters, calculate a delay parameter from the timestamp of receiving the update parameter and the timestamp at which the current computing node obtained the global parameters, and update the global parameters according to the delay parameter and the update parameter to obtain updated global parameters for storage.
In one implementation, the determining module 13 is specifically configured to determine whether the difference between the current iteration round of the current computing node and the current minimum iteration round among all computing nodes is within a first preset range. In another optional embodiment, the determining module 13 is configured to determine whether the difference between the current iteration round of the current computing node and the current maximum iteration round among all computing nodes is within a second preset range.
In one implementation, the global parameter sending module 15 is specifically configured to send the global parameters to the current computing node and to obtain and store the timestamp at which the current computing node received the global parameters. In one implementation, the parameter server uses the global parameter timestamp it maintains as the timestamp at which the computing node received the global parameters.
In one implementation, referring to FIG. 6, the update module 17 specifically includes a receiving unit 171, a computing unit 172, and an updating unit 173. The receiving unit 171 is configured to receive the update parameter sent by the current computing node after it performs the iterative learning of the current iteration round according to the global parameters. The computing unit 172 is configured to take the timestamp of receiving the update parameter as the current timestamp of the global parameters and calculate the difference between the current timestamp of the global parameters and the timestamp at which the current computing node obtained the global parameters as the delay parameter. The updating unit 173 is configured to update the global parameters by the ratio of the update parameter to the delay parameter to obtain updated global parameters for storage.
The internal structure of the computing node 300 shown in FIG. 1 is shown in FIG. 7. The computing node 300 includes a processor 310, a storage medium 320, a memory 330, and a network interface 340 connected by a system bus. The storage medium 320 of the computing node 300 stores an operating system 321, a database 322, and a distributed machine learning system 323. The database 322 stores local data, for example storing the global parameters obtained from the parameter server 200 as a global parameter copy. The processor 310 of the computing node 300 provides computing and control capabilities and supports the operation of the entire computing node within the distributed machine learning system. The memory 330 of the computing node 300 provides a running environment for the distributed machine learning system in the storage medium. The network interface 340 of the computing node 300 communicates with the external parameter server 200 over a network connection, for example sending global parameter acquisition instructions to the parameter server 200, receiving global parameters sent by the parameter server 200, and sending update parameters to the parameter server 200.
As shown in FIG. 8, another embodiment of the present application provides a distributed machine learning method that can be applied to the computing node shown in FIG. 1 and specifically includes the following steps:
Step 201: send a global parameter acquisition instruction to the parameter server.
The parameter server provides a pull interface for the computing node; before starting the iterative learning of the current iteration round, the computing node sends a global parameter acquisition instruction to the parameter server through the pull interface to obtain the latest global parameters.
Step 203: receive the global parameters sent by the parameter server based on the determination result of whether the difference between the current iteration round and the current iteration rounds of other computing nodes is within a preset range.
The parameter server determines from the computing node's current iteration round whether its iteration speed meets the requirement, and sends the latest global parameters to the computing node on the basis that the requirement is met; thus, while following the asynchronous-parallel-protocol-based distributed machine learning method, the difference in iteration speed between different computing nodes is controlled, implementing distributed machine learning under a bounded asynchronous parallel protocol. A certain speed difference between computing nodes is therefore allowed, and fast computing nodes need not wait for slow ones, which avoids the whole system waiting for the slowest computing node and effectively reduces waiting time. In this embodiment, the parameter server determines whether the iteration speed of the current iteration round meets the requirement by determining whether the difference between the current iteration round and the current iteration rounds of other computing nodes is within the preset range. The latest global parameters are the updated global parameters obtained by the parameter server updating the global parameters in real time from the updates produced by each round of the computing nodes' iterative learning. In another embodiment, upon the global parameter acquisition instruction of the current computing node for its current round, and upon determining that the iteration speed difference is within range, the parameter server sends the global parameter copy of the fastest of all current computing nodes to the current computing node as the latest global parameters; the fastest node's global parameter copy usually has the smallest gap from the collaboratively updated global parameters, so using it as the latest global parameters can improve training accuracy.
Step 205: perform the iterative learning of the current iteration round according to the global parameters to obtain an update parameter.
The computing node receives the global parameters sent by the parameter server and runs a parallel stochastic gradient descent algorithm on its assigned data subset to iteratively learn and train the machine learning model, obtaining an update parameter.
Step 207: send the update parameter to the parameter server.
After completing the iterative learning of the current round, the computing node calls the push interface of the parameter server to send the update parameter to the parameter server, enabling the parameter server to update the global parameters.
In the above distributed machine learning method, each computing node runs a parallel stochastic gradient descent algorithm on its assigned data subset to iteratively learn and train the machine learning model; parallelism accelerates model training and avoids a single-point bottleneck, ensuring that terabyte-scale and larger data volumes can be processed. Before starting each iteration a computing node obtains the latest global parameters from the parameter server, and it starts the current round of iterative learning only after receiving the global parameters that the parameter server sends upon determining that the node's iteration speed is within the preset range, so that the speed difference between different computing nodes is limited to the preset range, forming distributed machine learning under a bounded asynchronous parallel protocol and reducing the disturbance of the global parameters by the updates produced by different computing nodes.
In one implementation, referring to FIG. 9, after step 203 of receiving the global parameters sent by the parameter server based on the determination result of whether the current iteration round and the current iteration rounds of other computing nodes are within the preset range, the method further includes:
Step 204: send the timestamp of receiving the global parameters to the parameter server.
The computing node sends the timestamp of receiving the global parameters to the parameter server, and the parameter server stores this timestamp as the starting time for calculating the delay parameter corresponding to the update parameter produced by the current computing node's current round of iterative learning. This enables the parameter server, using the timestamp at which the computing node received the global parameters, to calculate the delay parameter corresponding to the computing node for the update parameter of each round of iterative learning, and to apply a corresponding degree of penalty to the updates corresponding to iteration rounds with large delays, preventing updates produced by heavily delayed iteration rounds from disturbing the global parameters and controlling the disturbance of the global parameters by the updates produced by different computing nodes. In one implementation, the timestamp at which the computing node received the global parameters may be the global parameter timestamp maintained by the parameter server, which may be the global parameter timestamp determined by the parameter server from the number of times update parameters have been received.
In a specific embodiment, taking the m-th computing node starting the c-th iteration as an example, before the c-th iteration starts the computing node issues a global parameter acquisition instruction to the parameter server; computing node m receives the latest global parameters sent by the parameter server after it determines that the current iteration round satisfies the preset range, performs iterative learning, obtains an update parameter, and returns it to the parameter server. A specific implementation is schematically represented as follows:
for c=0 to C:
θm=pull(m,c) // before starting the c-th iteration, obtain the latest global parameters from the parameter server
um=0 // initialize the local parameter update to 0
um=SGD(N,θm) // train on the data with SGD to obtain the update parameter
push(m,c,um) // call the Push interface of the parameter server to send the local update
Here θm denotes the global parameter copy stored on the computing node, C denotes the maximum number of iterations, and um denotes the local update parameter of computing node m. Before starting the c-th iteration, the computing node calls the pull interface of the parameter server to obtain the latest global parameters and initializes the local update parameter to 0; the computing node obtains the latest global parameters sent by the parameter server and, with a parallel stochastic gradient descent algorithm, traverses the assigned data subset to iteratively learn and train the machine learning model and obtain the update parameter, then calls the push interface of the parameter server to send the update parameter to the parameter server, so that the parameter server can update the global parameters in real time according to the update parameter.
In one implementation, referring to FIG. 10, a distributed machine learning system is provided, including an instruction sending module 21, a global parameter receiving module 23, a learning module 25, and an update parameter sending module 27. The instruction sending module 21 is configured to send a global parameter acquisition instruction to the parameter server. The global parameter receiving module 23 is configured to receive the global parameters sent by the parameter server based on the determination result of whether the difference between the current iteration round and the current iteration rounds of other computing nodes is within a preset range. The learning module 25 is configured to perform the iterative learning of the current iteration round according to the global parameters to obtain an update parameter. The update parameter sending module 27 is configured to send the update parameter to the parameter server.
In one implementation, referring to FIG. 11, the distributed machine learning system further includes a timestamp sending module 24. The timestamp sending module 24 is configured to send the timestamp at which the global parameter receiving module 23 received the global parameters to the parameter server, after the global parameter receiving module 23 receives the global parameters sent by the parameter server based on the determination result of whether the current iteration round and the current iteration rounds of other computing nodes are within the preset range.
Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments may be implemented by a computer program instructing the relevant hardware; obviously, such a computer program constitutes the present application. The computer program may be stored in a computer-readable storage medium and executed by reading it out of the storage medium directly or by installing or copying it into a storage device (such as a hard disk and/or memory) of a data processing device. When executed, the program can implement the processes of the embodiments of the above methods; such a storage medium therefore also constitutes the present application. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
The above embodiments express only several implementations of the present application and are described in relatively specific detail, but they should not therefore be understood as limiting the scope of the patent. It should be noted that, for those of ordinary skill in the art, several variations and improvements can be made without departing from the concept of the present application, and these all fall within the scope of protection of the present application.

Claims (15)

  1. A distributed machine learning method, comprising:
    a parameter server receiving a global parameter acquisition instruction of a current computing node;
    the parameter server determining whether the difference between the current iteration round of the current computing node and the current iteration rounds of other computing nodes is within a preset range;
    if so, the parameter server sending a global parameter to the current computing node;
    the parameter server receiving an update parameter sent by the current computing node after the current computing node performs the iterative learning of the current iteration round according to the global parameter, calculating a delay parameter from the timestamp of receiving the update parameter and the determined timestamp at which the current computing node received the global parameter, and updating the global parameter according to the delay parameter and the update parameter to obtain an updated global parameter for storage.
  2. The distributed machine learning method according to claim 1, wherein the step of the parameter server determining whether the difference between the current iteration round of the current computing node and the current iteration rounds of other computing nodes is within a preset range comprises:
    the parameter server determining whether the difference between the current iteration round of the current computing node and the current minimum iteration round among all computing nodes is within a first preset range.
  3. The distributed machine learning method according to claim 1, wherein the step of the parameter server determining whether the difference between the current iteration round of the current computing node and the current iteration rounds of other computing nodes is within a preset range comprises:
    the parameter server determining whether the difference between the current iteration round of the current computing node and the current maximum iteration round among all computing nodes is within a second preset range.
  4. The distributed machine learning method according to claim 1, wherein after the parameter server sends the global parameter to the current computing node, the method further comprises:
    the parameter server obtaining the timestamp at which the current computing node received the global parameter.
  5. The distributed machine learning method according to claim 1, wherein the step of the parameter server calculating a delay parameter from the timestamp of receiving the update parameter and the determined timestamp at which the current computing node received the global parameter, and updating the global parameter according to the delay parameter and the update parameter to obtain an updated global parameter for storage comprises:
    the parameter server taking the timestamp of receiving the update parameter as the current timestamp of the global parameter, and calculating the difference between the current timestamp of the global parameter and the determined timestamp at which the current computing node received the global parameter as the delay parameter;
    the parameter server updating the global parameter by the ratio of the update parameter to the delay parameter to obtain an updated global parameter for storage.
  6. A distributed machine learning system, comprising a processor and a memory connected to the processor, wherein the memory stores instruction units executable by the processor, the instruction units comprising:
    an instruction receiving module, configured to receive a global parameter acquisition instruction of a current computing node;
    a determining module, configured to determine whether the difference between the current iteration round of the current computing node and the current iteration rounds of other computing nodes is within a preset range;
    a global parameter sending module, configured to send a global parameter to the current computing node when the difference between the current iteration round and the current iteration rounds of other computing nodes is within the preset range;
    an update module, configured to receive an update parameter sent by the current computing node after it performs the iterative learning of the current iteration round according to the global parameter, calculate a delay parameter from the timestamp of receiving the update parameter and the timestamp at which the current computing node received the global parameter, and update the global parameter according to the delay parameter and the update parameter to obtain an updated global parameter for storage.
  7. The distributed machine learning system according to claim 6, wherein the determining module is specifically configured to determine whether the difference between the current iteration round of the current computing node and the current minimum iteration round among all computing nodes is within a first preset range.
  8. The distributed machine learning system according to claim 6, wherein the determining module is specifically configured to determine whether the difference between the current iteration round of the current computing node and the current maximum iteration round among all computing nodes is within a second preset range.
  9. The distributed machine learning system according to claim 6, wherein the global parameter sending module is further configured to obtain the timestamp at which the current computing node received the global parameter after sending the global parameter to the current computing node.
  10. The distributed machine learning system according to claim 6, wherein the update module specifically comprises:
    a receiving unit, configured to receive the update parameter sent by the current computing node after it performs the iterative learning of the current iteration round according to the global parameter;
    a computing unit, configured to take the timestamp of receiving the update parameter as the current timestamp of the global parameter and calculate the difference between the current timestamp of the global parameter and the timestamp at which the current computing node received the global parameter as the delay parameter;
    an updating unit, configured to update the global parameter by the ratio of the update parameter to the delay parameter to obtain an updated global parameter for storage.
  11. A distributed machine learning method, comprising:
    a computing node sending a global parameter acquisition instruction to a parameter server;
    the computing node receiving a global parameter sent by the parameter server based on a determination result of whether the difference between the current iteration round of the computing node and the current iteration rounds of other computing nodes is within a preset range;
    the computing node performing the iterative learning of the current iteration round according to the global parameter to obtain an update parameter;
    the computing node sending the update parameter to the parameter server.
  12. The distributed machine learning method according to claim 11, wherein after the step of the computing node receiving the global parameter sent by the parameter server based on a determination result of whether the current iteration round of the computing node and the current iteration rounds of other computing nodes are within a preset range, the method further comprises:
    the computing node sending the timestamp of receiving the global parameter to the parameter server.
  13. A distributed machine learning system, comprising a processor and a memory connected to the processor, wherein the memory stores instruction units executable by the processor, the instruction units comprising:
    an instruction sending module, configured to send a global parameter acquisition instruction to a parameter server;
    a global parameter receiving module, configured to receive a global parameter sent by the parameter server based on a determination result of whether the difference between the current iteration round and the current iteration rounds of other computing nodes is within a preset range;
    a learning module, configured to perform the iterative learning of the current iteration round according to the global parameter to obtain an update parameter;
    an update parameter sending module, configured to send the update parameter to the parameter server.
  14. The distributed machine learning system according to claim 13, wherein the instruction units further comprise a timestamp sending module, configured to send the timestamp at which the global parameter receiving module received the global parameter to the parameter server.
  15. A storage medium having a computer program stored thereon, wherein the computer program can be executed by a processor to implement the distributed machine learning method according to any one of claims 1 to 5, 11, and 12.
PCT/CN2017/108036 2016-10-31 2017-10-27 Distributed machine learning method and system WO2018077236A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/266,559 US11263539B2 (en) 2016-10-31 2019-02-04 Distributed machine learning method and system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610968121.4 2016-10-31
CN201610968121.4A CN108009642B (zh) 2016-10-31 2016-10-31 Distributed machine learning method and system

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/266,559 Continuation US11263539B2 (en) 2016-10-31 2019-02-04 Distributed machine learning method and system

Publications (1)

Publication Number Publication Date
WO2018077236A1 true WO2018077236A1 (zh) 2018-05-03

Family

ID=62023145

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/108036 WO2018077236A1 (zh) 2016-10-31 2017-10-27 Distributed machine learning method and system

Country Status (3)

Country Link
US (1) US11263539B2 (zh)
CN (1) CN108009642B (zh)
WO (1) WO2018077236A1 (zh)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108829517B (zh) * 2018-05-31 2021-04-06 Institute of Computing Technology, Chinese Academy of Sciences Training method and system for machine learning in a cluster environment
DE102018209595A1 * 2018-06-14 2019-12-19 Robert Bosch Gmbh Method for automatically determining a road condition
CN109102075A (zh) * 2018-07-26 2018-12-28 Lenovo (Beijing) Co., Ltd. Gradient update method in distributed training and related device
CN109445953A (zh) * 2018-08-30 2019-03-08 Peking University Machine learning model training method for large-scale machine learning systems
US11625640B2 (en) * 2018-10-05 2023-04-11 Cisco Technology, Inc. Distributed random forest training with a predictor trained to balance tasks
CN110400064A (zh) * 2019-07-10 2019-11-01 江苏博子岛智能科技有限公司 Logistics control system and method with artificial intelligence
CN111210020B (zh) * 2019-11-22 2022-12-06 Tsinghua University Method and system for accelerating distributed machine learning
CN111158902B (zh) * 2019-12-09 2022-05-10 Guangdong University of Technology Mobile edge distributed machine learning system and method
CN111144584B (zh) * 2019-12-31 2024-01-19 Shenzhen TCL New Technology Co., Ltd. Parameter tuning method, apparatus, and computer storage medium
CN113094180B (zh) * 2021-05-06 2023-10-10 苏州联电能源发展有限公司 Wireless federated learning scheduling optimization method and apparatus
CN113361598B (zh) * 2021-06-04 2022-10-11 Chongqing University Model training method based on distributed learning, server, and distributed system
CN115034356B (zh) * 2022-05-09 2024-08-23 Shanghai University Model fusion method and system for horizontal federated learning
CN116704296B (zh) * 2023-08-04 2023-11-03 Inspur Electronic Information Industry Co., Ltd. Image processing method, apparatus, system, device, and computer storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130290223A1 (en) * 2012-04-27 2013-10-31 Yahoo! Inc. Method and system for distributed machine learning
CN103745225A (zh) * 2013-12-27 2014-04-23 北京集奥聚合网络技术有限公司 Method and system for distributed CTR prediction model training
CN104714852A (zh) * 2015-03-17 2015-06-17 Huazhong University of Science and Technology Parameter synchronization optimization method and system suitable for distributed machine learning
CN106059972A (zh) * 2016-05-25 2016-10-26 Beijing University of Posts and Telecommunications Modulation recognition method under MIMO correlated channels based on a machine learning algorithm

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8027938B1 (en) * 2007-03-26 2011-09-27 Google Inc. Discriminative training in machine learning
CN100583802C (zh) * 2007-12-26 2010-01-20 Beijing Institute of Technology Replica selection method based on global minimum access cost
US20140279748A1 (en) * 2013-03-15 2014-09-18 Georges Harik Method and program structure for machine learning
US20150324690A1 (en) * 2014-05-08 2015-11-12 Microsoft Corporation Deep Learning Training System
CN104980518B (zh) * 2015-06-26 2018-11-23 Shenzhen Tencent Computer Systems Co., Ltd. Method, apparatus, and system for parallel training of a model by multiple learning agents
CN105956021B (zh) * 2016-04-22 2019-05-21 Huazhong University of Science and Technology Automated task parallelization method and system suitable for distributed machine learning

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130290223A1 (en) * 2012-04-27 2013-10-31 Yahoo! Inc. Method and system for distributed machine learning
CN103745225A (zh) * 2013-12-27 2014-04-23 北京集奥聚合网络技术有限公司 Method and system for distributed CTR prediction model training
CN104714852A (zh) * 2015-03-17 2015-06-17 Huazhong University of Science and Technology Parameter synchronization optimization method and system suitable for distributed machine learning
CN106059972A (zh) * 2016-05-25 2016-10-26 Beijing University of Posts and Telecommunications Modulation recognition method under MIMO correlated channels based on a machine learning algorithm

Also Published As

Publication number Publication date
CN108009642B (zh) 2021-12-14
US20190171952A1 (en) 2019-06-06
CN108009642A (zh) 2018-05-08
US11263539B2 (en) 2022-03-01

Similar Documents

Publication Publication Date Title
WO2018077236A1 (zh) Distributed machine learning method and system
US20220391771A1 (en) Method, apparatus, and computer device and storage medium for distributed training of machine learning model
US20200293838A1 (en) Scheduling computation graphs using neural networks
CN108205442B (zh) Edge computing platform
US11941527B2 (en) Population based training of neural networks
US7453910B1 (en) Synchronization of independent clocks
WO2017223009A1 (en) Multi-domain joint semantic frame parsing
US20230386454A1 (en) Voice command detection and prediction
CN105144102B (zh) Adaptive data synchronization
CN111369009A (zh) Distributed machine learning method tolerant of untrusted nodes
US11727925B2 (en) Cross-device data synchronization based on simultaneous hotword triggers
JP2016526719A (ja) Stream data processing method using time adjustment
US11356334B2 (en) Communication efficient sparse-reduce in distributed machine learning
US11341339B1 (en) Confidence calibration for natural-language understanding models that provides optimal interpretability
CN113434282A (zh) Method and apparatus for publishing and controlling the output of stream computing tasks
CN113821317A (zh) Edge-cloud collaborative microservice scheduling method, apparatus, and device
WO2024104232A1 (zh) Method, apparatus, device, and storage medium for training a neural network
US20220245401A1 (en) Method and apparatus for training model
CN106502842B (zh) Data recovery method and system
CN111464451B (zh) Data stream equi-join optimization method, system, and electronic device
WO2024011908A1 (zh) Network prediction system and method, electronic device, and storage medium
JP7409326B2 (ja) Server and learning system
CN109660310B (zh) Clock synchronization method, apparatus, computing device, and computer storage medium
WO2019150565A1 (ja) Trained model update system, trained model update method, and program
CN113821313A (zh) Task scheduling method, apparatus, and electronic device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17864355

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17864355

Country of ref document: EP

Kind code of ref document: A1