WO2019104713A1 - Machine learning method, master node, working node and system - Google Patents


Info

Publication number
WO2019104713A1
Authority
WO
WIPO (PCT)
Prior art keywords
training
parameter
node
global
result
Prior art date
Application number
PCT/CN2017/114228
Other languages
English (en)
French (fr)
Inventor
张本宇
徐昊
刘亚新
Original Assignee
杭州云脑科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 杭州云脑科技有限公司
Publication of WO2019104713A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Definitions

  • the present invention relates to the field of computer communication technologies, and in particular, to a machine learning method, a master node, a working node, and a distributed machine learning system.
  • the distributed machine learning system includes a master node and a plurality of working nodes.
  • the core objective is that the master node disassembles the computing task into a plurality of small tasks and allocates them to the processors of the plurality of working nodes for calculation. That is to say, different working nodes correspond to the same training model.
  • the training sub-results are fed back to the master node, and the master node merges the training sub-results of all working nodes to obtain the final training result.
  • the sub-training results are usually combined by means of parameter averaging.
  • the training process is specifically: the master node configures global parameters based on the training model, and distributes the global parameters to each working node.
  • at each working node, parameter training is performed based on the global parameters and the corresponding data, the locally trained parameters of the working node are obtained, and those parameters are fed back to the master node.
  • After receiving the parameters fed back by all working nodes, the master node performs weighted averaging, and the resulting average is the updated global parameter.
  • the master node needs to wait for all the working nodes participating in the training to feed back the training sub-results, and then the final update parameter can be determined.
  • Some working nodes have strong processing power and complete their training tasks in a short time, while others have weak processing capability and take a long time to complete their training tasks.
  • Moreover, once a working node is significantly delayed for some reason, the working nodes that complete training first must wait for all the other working nodes to finish before the next round of training can start, which leaves them idle for a long time and causes a large synchronization overhead.
  • Embodiments of the present invention provide a machine learning method, a master node, a work node, and a distributed machine learning system, which are used to reduce synchronization overhead of distributed machine learning.
  • the present invention provides a machine learning method, which is applied to a master node of a distributed machine learning system, where the master node is correspondingly provided with a working node, and the method includes:
  • starting a parameter training process and determining the working nodes that join the parameter training process; sending time information corresponding to the parameter training process, including its end time, to those working nodes; receiving, at the end time, the training sub-results fed back by the working nodes, and updating the global parameters based on the obtained training sub-results.
  • the determining of the working nodes that join the parameter training process includes:
  • obtaining resource occupation information of the working nodes applying to join the parameter training process; when the resource occupation information meets a preset condition, determining the applying working nodes as working nodes that join the parameter training process.
  • the method further includes:
  • the training sub-result is a second difference part between the training result parameters and the global parameters, sent by a working node participating in the parameter training process after it obtains the training result parameters by performing parameter training based on the global parameters.
  • the method further includes:
  • the present invention provides a machine learning method, which is applied to a working node of a distributed machine learning system, where the working node is correspondingly provided with a master node, and the method includes:
  • the method further includes:
  • the parameter training is performed within a time range indicated by the time information, including:
  • the first difference part is the difference between the first global parameters and the global parameters, determined by the master node based on the first identification information and the second identification information of the global parameters stored locally at the master node;
  • parameter training is performed based on the global parameter, and a training sub-result is obtained, and the training sub-result is fed back to the main node.
  • the parameter training is performed based on the global parameter, and the training sub-result is obtained, and the training sub-result is fed back to the main node, including:
  • the sending the second difference part to the primary node includes:
  • the parameter training is performed based on the global parameter, and the training sub-result is obtained, and the training sub-result is fed back to the main node, including:
  • the training sub-result corresponding to the global training model is fed back to the main node, and the training sub-result corresponding to the personalized model is saved locally.
  • an embodiment of the present invention provides a master node of a distributed machine learning system, where the master node is correspondingly provided with a working node, and the master node includes:
  • a first determining module configured to start a parameter training process and determine the working nodes that join the parameter training process;
  • a sending module configured to send time information corresponding to the parameter training process to the working node, where the time information includes an end time of the parameter training process, so that the working node sends its training sub-result to the master node before the end time;
  • an update module configured to receive the training sub-results fed back by the working nodes at the end time, and update the global parameters based on the obtained training sub-results.
  • the first determining module is configured to:
  • obtain resource occupation information of the working nodes applying to join the parameter training process; when the resource occupation information meets a preset condition, determine the applying working nodes as working nodes that join the parameter training process.
  • the sending module is further configured to:
  • the training sub-result is a second difference part between the training result parameters and the global parameters, sent by a working node participating in the parameter training process after it obtains the training result parameters by performing parameter training based on the global parameters.
  • the working node further includes:
  • the training module is configured to determine, after the updating the global parameter, whether the updated global parameter reaches convergence; if not, restart the parameter training process after the preset time interval.
  • an embodiment of the present invention provides a working node of a distributed machine learning system, where the working node is correspondingly provided with a master node, and the working node includes:
  • a receiving module configured to receive, after receiving the notification that the working node joins the parameter training process, the time information corresponding to the parameter training process sent by the primary node, where the time information includes The end time of the parameter training process;
  • a training module configured to perform parameter training within a time range indicated by the time information; if the training has not been completed before the end time, end the training at the end time, obtain a training sub-result, and feed the training sub-result back to the master node; if the training is completed before the end time and the time interval between the completion of the training and the end time is greater than a preset value, control the working node to repeat the training, determine the training sub-result based on the sub-results obtained from the repeated training, and feed the training sub-result back to the master node.
  • the working node further includes:
  • the application module is configured to send application information for applying to join the parameter training process to the primary node, where the application information includes resource occupation information of the working node.
  • the training module is used to:
  • the first difference part is the difference between the first global parameters and the global parameters, determined by the master node based on the first identification information and the second identification information of the global parameters stored locally at the master node;
  • parameter training is performed based on the global parameter, and a training sub-result is obtained, and the training sub-result is fed back to the main node.
  • the training module is used to:
  • the training module is used to:
  • the training module is used to:
  • the training sub-result corresponding to the global training model is fed back to the main node, and the training sub-result corresponding to the personalized model is saved locally.
  • a fifth aspect is a distributed machine learning system, the distributed machine learning system comprising a master node and a work node, including:
  • the master node starts a parameter training process, determines the working nodes that join the parameter training process, and sends time information corresponding to the parameter training process to the working nodes, where the time information includes the end time of the parameter training process;
  • after receiving the notification that the master node has determined that it joins the parameter training process, a working node obtains the time information corresponding to the parameter training process sent by the master node; performs parameter training within a time range indicated by the time information; if the training has not been completed before the end time, ends the training at the end time, obtains a training sub-result, and feeds the training sub-result back to the master node; if the training is completed before the end time and the time interval between the completion of the training and the end time is greater than a preset value, the working node is controlled to repeat the training, the training sub-result is determined based on the sub-results obtained from the repeated training, and the training sub-result is fed back to the master node;
  • after receiving the training sub-results fed back by the working nodes participating in the parameter training process, the master node updates the global parameters based on the obtained training sub-results.
  • the working node after receiving the notification that the working node joins the parameter training process, the working node sends the first identifier information of the stored first global parameter to the primary node;
  • after receiving the first difference part, the working node restores the global parameters based on the first difference part and the first global parameters, performs parameter training based on the global parameters, obtains a training sub-result, and feeds the training sub-result back to the master node.
  • the master node determines a working node that joins the parameter training process, including:
  • obtaining resource occupation information of the working nodes applying to join the parameter training process; when the resource occupation information meets a preset condition, determining the applying working nodes as working nodes that join the parameter training process.
  • the method further includes:
  • the working node is further configured to:
  • the working node performs parameter training based on the global parameter, obtains a training sub-result, and feeds the training sub-result to the main node, including:
  • obtaining the time information corresponding to the parameter training process sent by the master node, where the time information includes the end time of the parameter training process;
  • if the training has not been completed before the end time, ending the training at the end time, obtaining the training sub-result, and feeding the training sub-result back to the master node;
  • if the training is completed before the end time and the time interval between the completion of the training and the end time is greater than a preset value, controlling the working node to repeat the training, determining the training sub-result based on the sub-results obtained from the repeated training, and feeding the training sub-result back to the master node.
  • the working node performs parameter training based on the global parameter, obtains a training sub-result, and feeds the training sub-result to the main node, including:
  • the working node sends the second difference part to the master node, including:
  • the working node performs parameter training based on the global parameter, obtains a training sub-result, and feeds the training sub-result to the main node, including:
  • the training sub-result corresponding to the global training model is fed back to the main node, and the training sub-result corresponding to the personalized model is saved locally.
  • after starting a parameter training process, the master node of the distributed machine learning system determines the working nodes that join the parameter training process; the master node also sets an end time for the parameter training process and sends that end time to each working node participating in the process. After receiving the end time, a working node stops training at the end time and feeds back to the master node the training sub-result obtained by that time. Because the end time of each round of parameter training is bounded in this way, the time at which each working node finishes training can be effectively controlled, which effectively reduces the synchronization overhead caused by the unsynchronized training times of the working nodes.
  • FIG. 1 is a schematic structural diagram of a machine learning system according to a first embodiment of the present invention
  • FIG. 2 is a sequence diagram of a machine learning system for machine learning in a first embodiment of the present invention
  • FIG. 3 is a flowchart of a machine learning method corresponding to a master node in a second embodiment of the present invention
  • FIG. 4 is a flowchart of a machine learning method corresponding to a working node in a third embodiment of the present invention.
  • FIG. 5 is a schematic diagram of a master node in a fourth embodiment of the present invention.
  • Figure 6 is a schematic diagram of a working node in a fifth embodiment of the present invention.
  • Embodiments of the present invention provide a machine learning method, a master node, a work node, and a distributed machine learning system, which are used to reduce synchronization overhead of distributed machine learning.
  • the distributed machine learning system includes a master node and working nodes; the master node starts a parameter training process, determines the working nodes that join the parameter training process, and sends time information corresponding to the parameter training process to the working nodes, where the time information includes an end time of the parameter training process; after receiving the notification that the master node has determined that it joins the parameter training process, a working node obtains the time information corresponding to the parameter training process sent by the master node; performs parameter training within a time range indicated by the time information; if the training has not been completed before the end time, ends the training at the end time, obtains a training sub-result, and feeds the training sub-result back to the master node; if the training is completed before the end time and the time interval between the completion of the training and the end time is greater than a preset value, the working node is controlled to repeat the training, the training sub-result is determined based on the sub-results obtained from the repeated training, and the training sub-result is fed back to the master node;
  • after receiving the training sub-results fed back by the working nodes participating in the parameter training process, the master node updates the global parameters based on the obtained training sub-results.
  • a first embodiment of the present invention provides a machine learning system, where the machine learning system includes a master node and a plurality of working nodes, and the master node and the working node are communicatively connected, including:
  • the master node starts a parameter training process, determines the working nodes that join the parameter training process, and sends time information corresponding to the parameter training process to the working nodes, where the time information includes the end time of the parameter training process;
  • after receiving the notification that the master node has determined that it joins the parameter training process, a working node obtains the time information corresponding to the parameter training process sent by the master node and performs parameter training within the time range indicated by the time information; if the training has not been completed before the end time, the training ends at the end time, a training sub-result is obtained, and the training sub-result is fed back to the master node; if the training is completed before the end time and the time interval between the completion of the training and the end time is greater than a preset value, the working node is controlled to repeat the training, the training sub-result is determined based on the sub-results obtained from the repeated training, and the training sub-result is fed back to the master node;
  • after receiving the notification that the master node has determined that it joins the parameter training process, the working node sends the first identification information of the stored first global parameters to the master node;
  • after receiving the first difference part, the working node restores the global parameters based on the first difference part and the first global parameters, performs parameter training based on the global parameters, obtains a training sub-result, and feeds the training sub-result back to the master node;
  • after receiving the training sub-results fed back by the working nodes participating in the parameter training process, the master node updates the global parameters based on the obtained training sub-results.
  • the corresponding notification information is sent to the working node connected thereto, and after receiving the notification, the working node sends the application information of the application for joining the parameter training process to the primary node.
  • the master node determines the working node that joins the training process, and sends a notification to the corresponding working node to determine the joining parameter training process.
  • after receiving the notification that the master node has determined that it joins the parameter training process, the working node sends the first identification information of the stored first global parameters to the master node; by comparing the first identification information with the second identification information of the locally stored global parameters, the master node determines the first difference part between the first global parameters and the global parameters and sends the first difference part to the corresponding working node. The working node receives the first difference part and restores the global parameters.
  • the master node determines that the working node A participates in the parameter training, and the latest global parameter version number of the master node is V811.
  • the master node also stores the global parameters of the previous versions, including the global parameters of the V810, V809, and V808 versions.
  • the version number of the stored first global parameters sent by working node A is V810, which indicates that the version of the global parameters stored on working node A does not differ much from the version stored on the master node; the master node determines the difference part DA between the V811-version global parameters and the V810-version global parameters and sends the difference part DA to working node A, and working node A can restore the latest-version global parameters on the master node, i.e., the V811-version global parameters, from its locally stored V810-version global parameters and the difference part DA.
  • the working node performs parameter training based on the global parameters, obtains the training sub-results, and feeds the training sub-results back to the main node.
  • the master node After receiving the training sub-results fed back by the working nodes participating in the parameter training process, the master node updates the global parameters based on the obtained training sub-results. In this way, the amount of data transmitted by the master node can be greatly reduced, and the communication overhead of machine learning can be effectively reduced.
  • the application information includes resource occupation information of the working node
  • the master node obtains the resource occupation information of the working nodes applying to join the parameter training process; when the resource occupation information meets a preset condition, it determines the applying working nodes as working nodes that join the parameter training process.
  • the master node decides whether to approve an application according to the resource occupancy status of the working node applying to join the parameter training process.
  • the resource occupation information sent by a working node may include information such as memory usage and remaining battery level, and may of course include other information. If the resource occupation information sent by the working node contains two or more items, a weighted average may be used to determine a composite resource occupation figure; for example, if the working node reports a memory usage of 50% and a remaining battery level of 60%, the composite figure is α*50% + β*60%. The master node then determines, according to the composite resource occupation figure, whether the working node meets the requirements for participating in the parameter training process. In the following embodiment, the resource occupation information is taken to be the memory usage as an example.
  • the master node starts the parameter training process, waits for the working node to join the training, and the master node receives four working nodes A, B, C, and D to apply for parameter training.
  • the memory usage reported by node A is 20%, by node B 28%, by node C 25%, and by node D 50%.
  • the preset condition under which the master node allows a working node to join the parameter training process is that the working node's memory usage must be less than 30%.
  • based on the memory usage of the four working nodes A, B, C, and D, the master node determines that the three nodes A, B, and C meet the requirement, and therefore determines working nodes A, B, and C as the working nodes participating in the parameter training process.
  • further, when sending the first difference part to the working node, the master node also sends the time information corresponding to the parameter training process to the working node, so that the working node returns its training sub-result before the end time;
  • the working node performs parameter training based on the global parameter, obtains a training sub-result, and feeds the training sub-result to the main node, including:
  • obtaining the time information corresponding to the parameter training process sent by the master node, where the time information includes the end time of the parameter training process;
  • if the training has not been completed before the end time, ending the training at the end time, obtaining the training sub-result, and feeding the training sub-result back to the master node;
  • if the training is completed before the end time and the time interval between the completion of the training and the end time is greater than a preset value, controlling the working node to repeat the training, determining the training sub-result based on the sub-results obtained from the repeated training, and feeding the training sub-result back to the master node.
  • the master node also needs to set a time at which the parameter training process ends and send the end time to each working node participating in the parameter training process. After receiving the end time, a working node with weak processing capability, or one delayed for other reasons, stops training at the end time even if the training is not finished, and feeds back to the master node the training sub-result obtained by that time. For a working node with strong processing capability that completes its training task early, if the interval between the completion time and the end time is greater than the preset value, it indicates that the working node has a long idle period, and the working node can be controlled to perform multiple rounds of repeated training; its training sub-result is then determined from the results of those rounds and fed back to the master node.
  • the preset value may be set according to actual needs, and the application does not limit this.
  • node D joins the parameter training process at 1:38 AM. When it is told that the current round of the parameter training process will end at 2:00 AM, it calculates that the remaining training time is 22 minutes. Suppose that, because node D's processor is slow, one round of training takes 30 minutes; node D therefore trains on only 22/30, i.e., 73.3%, of its data to ensure that it can send its result to the master node before the training process ends. Suppose also that node A calculates a remaining training time of 55 minutes. Because node A's processor is faster, one round of training takes 16 minutes, so before the training process ends node A can perform 55/16, i.e., 3.44, rounds of training on the data it owns.
  • the working node performs parameter training based on the global parameter, obtains a training sub-result, and feeds the training sub-result to the main node, including:
  • the working node sends the second difference part to the master node, including:
  • after a working node finishes training based on the global parameters, the training result parameters are obtained.
  • to further reduce the communication overhead, when uploading the training result parameters to the master node, the working node only needs to upload the part that differs from the global parameters;
  • it therefore determines the second difference part between the training result parameters obtained by its training and the global parameters, and uploads the second difference part to the master node as the training sub-result of the working node.
  • an L1 constraint can also be applied when the working node uploads its training sub-result to the master node. Specifically, it is necessary to determine whether the data amount of the second difference part between the training result parameters obtained by the working node's training and the global parameters is greater than a preset threshold; the preset threshold may be set according to actual needs, which is not limited in this application. When the data amount of the second difference part is greater than the preset threshold, it indicates that the training result parameters trained by the working node differ substantially from the global parameters and can be used to update the global parameters.
  • only in that case does the working node upload the second difference part between its training result parameters and the global parameters to the master node.
  • because a working node participating in the parameter training process only needs to upload the second difference part when its data amount is greater than the preset threshold, the amount of data uploaded to the master node is reduced, effectively reducing the communication overhead.
  • the working node feeds back the training sub-result to the main node, including:
  • the training sub-result corresponding to the global training model is fed back to the main node, and the training sub-result corresponding to the personalized model is saved locally.
  • the distributed machine learning system performs parameter training on the premise that the data is randomly distributed on each working node, that is, the data is independently and identically distributed. This is consistent with the independent distribution of data for each worker node in a strongly coupled distributed environment, such as a data center.
  • a strongly coupled distributed environment such as a data center.
  • the premise that data is independent and distributed cannot be met, for example, a distributed environment composed of tens of millions of smartphones.
  • each mobile phone will correspond to some private data, such as user usage habits and interaction behaviors.
  • the distribution of these data varies widely and cannot meet the premise of independent and identical distribution.
  • the working node when the working node performs parameter training, it is first necessary to determine the trained model.
  • the working node may determine the global training model corresponding to the global parameter according to the instruction of the primary node, and further, determine the local corresponding personalized model. That is, each working node uses local data in addition to training the global model, but also trains a local personalized model to characterize the difference in local data distribution.
  • the personalized model can select the appropriate model according to the constraints of the computing node, the memory resource and the storage resource of the computing node, and can be different from the global training model.
  • the working node A participating in the parameter training adds the currently existing global parameter and the first difference part after receiving the first difference part, and obtains the latest global parameter (ie, the global parameter). Furthermore, the global parameters are combined with the local personalized model to obtain a composite model. Then, the composite model is trained with all the data on the working node. Training with a composite model has a global view of the global model, which can ensure faster convergence of training. Moreover, due to the addition of a personalized model, it is possible to converge more quickly in the case of extremely uneven data distribution.
  • the training sub-results corresponding to the global training model and the training sub-results corresponding to the personalized model are obtained. Moreover, when uploading, only the training sub-results corresponding to the global training model are uploaded to the main node, and the training sub-results corresponding to the personalized model are saved locally, and on the basis of saving communication overhead, the convergence of parameter training is also accelerated. speed.
  • the master node After receiving the training sub-results fed back by the working nodes participating in the parameter training process, the master node updates the global parameters based on the obtained training sub-results. After updating the global parameter, it is also determined whether the updated global parameter reaches convergence; if not, the parameter training process is re-opened after the preset time interval.
  • the training sub-results fed back by the working nodes participating in the parameter training process are obtained, i.e., the corresponding second difference parts; the second difference parts uploaded by the working nodes are then weighted and averaged, and the global parameter values are formally updated using the obtained mean of the uploaded second difference parts, yielding the updated global parameters.
  • after updating the global parameters, the master node also needs to determine whether the updated global parameters have converged. Specifically, when making this convergence determination, it needs to determine whether the deviation between the updated global parameters and the global parameters before the update is smaller than a predetermined value. If the deviation is smaller than the predetermined value, the result of the current round of the parameter training process has converged; if the deviation is greater than or equal to the predetermined value, the result of the current round has not yet converged, and the master node may, as needed, start the next round of the parameter training process after a preset time interval to further update the global parameters.
  • the master node randomly initializes the network model parameters based on the model configuration and stores them on the master node as the global parameter values. The master node then starts a parameter training process at fixed time intervals and waits for working nodes to join the training.
  • Each working node sends a request to the primary node, where the request carries the resource information of the working node (eg, computing power, memory, storage, and power resources), and simultaneously informs the master node of the version number of the global parameter currently owned by the working node.
  • the master node selects the working node to join the training according to the training needs and the resource information of each working node.
  • the master node sends only the difference portion of the global parameter to the working node according to the global parameter version number of the selected working node, so as to reduce the traffic and send the end time of the current training process to the working node.
  • the working node participating in the training adds the difference between the currently owned global parameter and the global parameter to obtain the latest global parameter.
  • the working nodes participating in the training, and then the training corresponding to the global parameters and the local personalized model are combined to obtain a composite training model.
  • the composite model is then trained with all the data on the node, and the training sub-results are returned to the primary node before the end of the training process of the primary node.
  • the update of the training results is divided into two parts, one part is the update of the local personalized model. This part of the update does not need to be uploaded. On the basis of saving communication overhead, the convergence speed of the model is accelerated. Another part of the update is the update of the global model, which needs to be uploaded to the master node.
  • the updated value after the L1 constraint is used here, which has lower communication overhead.
  • after the current training process ends, the master node performs a weighted average of the training sub-results uploaded by the working nodes and formally updates the global parameter values with the mean. If the training result has not converged, the master node starts a new training process.
  • a second embodiment of the present invention provides a machine learning method, which is applied to a master node of a distributed machine learning system, where the master node is correspondingly provided with a working node, and the method includes:
  • S301: Start a parameter training process and determine the working nodes that join the parameter training process;
  • S302: Send time information corresponding to the parameter training process to the working nodes;
  • S303: Receive, at the end time, the training sub-results fed back by the working nodes, and update the global parameters based on the obtained training sub-results.
  • the master node determines the working node that joins the parameter training process, including:
  • obtaining resource occupation information of the working nodes applying to join the parameter training process; when the resource occupation information meets a preset condition, determining the applying working nodes as working nodes that join the parameter training process.
  • the method further includes:
  • the master node receives the training sub-results fed back by the working nodes participating in the parameter training process; and updates the global parameters based on the obtained training sub-results.
  • the training sub-result is a second difference between the training result parameter and the global parameter sent after the working node participating in the parameter training process performs parameter training based on the global parameter to obtain a training result parameter.
  • the method further includes:
  • a third embodiment of the present invention provides a machine learning method, which is applied to a working node of a distributed machine learning system, where the working node is correspondingly provided with a master node, and the method includes:
  • S402 Perform parameter training within a time range indicated by the time information.
  • the working node needs to send application information for applying to join the parameter training process to the primary node, where the application information includes resource occupation information of the working node.
  • the parameter training is performed within a time range indicated by the time information, including:
  • receiving the first difference part sent by the master node, and restoring the global parameters based on the first difference part and the first global parameters, where the first difference part is the difference between the first global parameters and the global parameters, determined by the master node based on the first identification information and the second identification information of the global parameters stored locally at the master node;
  • parameter training is performed based on the global parameter, and a training sub-result is obtained, and the training sub-result is fed back to the main node.
  • the working node performs parameter training based on the global parameter, obtains a training sub-result, and feeds the training sub-result to the main node, including:
  • the working node sends the second difference part to the master node, including:
  • the working node performs parameter training based on the global parameter, obtains a training sub-result, and feeds the training sub-result to the main node, including:
  • the training sub-result corresponding to the global training model is fed back to the main node, and the training sub-result corresponding to the personalized model is saved locally.
  • a fourth embodiment of the present invention provides a master node of a distributed machine learning system, where the master node is correspondingly provided with a working node, and the master node includes:
  • the first determining module 501 is configured to start a parameter training process, and determine a working node that joins the parameter training process;
  • the sending module 502 is configured to send time information corresponding to the parameter training process to the working node, where the time information includes an end time of the parameter training process, so that the working node is before the end time Sending a training sub-process result to the primary node;
  • the updating module 503 is configured to receive the training sub-results fed back by the working nodes at the end time, and update the global parameters based on the obtained training sub-results.
  • the sending module is further configured to:
  • the training sub-result is a second difference between the training result parameter and the global parameter sent after the working node participating in the parameter training process performs parameter training based on the global parameter to obtain a training result parameter.
  • the master node further includes:
  • the training module is configured to determine, after the updating the global parameter, whether the updated global parameter reaches convergence; if not, restart the parameter training process after the preset time interval.
  • a fifth embodiment of the present invention provides a working node of a distributed machine learning system, where the working node is correspondingly provided with a master node, and the working node includes:
  • the receiving module 601 is configured to receive, after receiving the notification that the working node joins the parameter training process, the time information corresponding to the parameter training process sent by the primary node, where the time information includes The end time of the parameter training process;
  • the training module 602 is configured to perform parameter training within a time range indicated by the time information. If the training is not completed before the end time, the training ends at the end time, and the training sub-result is obtained, and the training sub- The result is fed back to the primary node; if the training is completed before the end time, and the time interval between the completion of the training and the end time is greater than a preset value, the working node is controlled to repeatedly perform training based on the repeated training. Sub-results, determining the training sub-results, and feeding back the training sub-results to the primary node.
  • the working node further includes:
  • the application module is configured to send application information for applying to join the parameter training process to the primary node, where the application information includes resource occupation information of the working node.
  • the training module is used to:
  • the first difference part is the difference between the first global parameters and the global parameters, determined by the master node based on the first identification information and the second identification information of the global parameters stored locally at the master node;
  • the training sub-results are obtained, and the training sub-results are fed back to the main node.
  • the training module is used to:
  • the training module is used to:
  • the training module is used to:
  • the training sub-result corresponding to the global training model is fed back to the main node, and the training sub-result corresponding to the personalized model is saved locally.
  • after starting a parameter training process, the master node of the distributed machine learning system determines the working nodes that join the parameter training process; the master node also sets an end time for the parameter training process and sends that end time to each working node participating in the process. After receiving the end time, a working node stops training at the end time and feeds back to the master node the training sub-result obtained by that time. Because the end time of each round of parameter training is bounded in this way, the time at which each working node finishes training can be effectively controlled, which effectively reduces the synchronization overhead caused by the unsynchronized training times of the working nodes.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Mobile Radio Communication Systems (AREA)
  • Computer And Data Communications (AREA)

Abstract

A machine learning method, a master node, a working node, and a distributed machine learning system, used to reduce the synchronization overhead of machine learning. The master node of the distributed machine learning system starts a parameter training process and determines the working nodes that join the parameter training process (S301), and sends time information corresponding to the parameter training process to the working nodes (S302), where the time information includes the end time of the parameter training process. After receiving the notification from the master node that it has been determined to join the parameter training process, a working node obtains the time information corresponding to the parameter training process sent by the master node (S401), and performs parameter training within the time range indicated by the time information. After receiving the training sub-results fed back by the working nodes participating in the parameter training process, the master node updates the global parameters based on the obtained training sub-results (S303).

Description

Machine learning method, master node, working node and system

Technical Field
The present invention relates to the field of computer communication technologies, and in particular, to a machine learning method, a master node, a working node, and a distributed machine learning system.
Background Art
With the arrival of the big data era, big data processing technologies have developed rapidly. As the input training data and the data models grow, machine learning training on a single node runs into memory and time limits, so distributed machine learning has emerged. A distributed machine learning system includes a master node and a plurality of working nodes. Its core objective is that the master node splits a computing task into many small tasks and allocates them to the processors of the working nodes for computation. In other words, different working nodes correspond to the same training model; after each working node is assigned different data and performs parameter training, it feeds its training sub-result back to the master node, and the master node merges the training sub-results of all working nodes to obtain the final training result.
In the prior art, the training sub-results are usually merged by parameter averaging. The training process is specifically as follows: the master node configures global parameters based on the training model and distributes the global parameters to each working node. Each working node performs parameter training based on the global parameters and its own data, obtains its locally trained parameters, and feeds those parameters back to the master node. After receiving the parameters fed back by all working nodes, the master node performs weighted averaging, and the resulting average is the updated global parameter.
In the prior art, during a given round of the parameter training process, the master node must wait for all working nodes participating in the training to feed back their training sub-results before the final updated parameters can be determined. Some working nodes have strong processing power and finish their training tasks in a short time, while others have weak processing power and need a long time to finish. Moreover, once one working node is significantly delayed for some reason, the working nodes that finish training first must wait for all other working nodes to finish before the next round of training can start, which leaves them idle for a long time and causes a large synchronization overhead.
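For reference, the prior-art parameter-averaging round described above can be sketched as follows. This is a minimal Python illustration, not part of the patent text; the worker interface (a train method and a num_samples attribute) is an assumption. It makes the synchronization cost explicit: the master cannot update the global parameters until the slowest worker has returned.

```python
import numpy as np

def synchronous_round(global_params, workers):
    """One round of prior-art parameter averaging.

    The master sends the full global parameter vector to every worker,
    waits for ALL of them to finish (the synchronization bottleneck),
    then replaces the global parameters with a weighted average.
    """
    results, weights = [], []
    for worker in workers:                      # blocks until every worker is done
        local_params = worker.train(global_params.copy())
        results.append(local_params)
        weights.append(worker.num_samples)      # weight by local data size

    weights = np.asarray(weights, dtype=float)
    weights /= weights.sum()
    return sum(w * p for w, p in zip(weights, results))
```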
Summary of the Invention
Embodiments of the present invention provide a machine learning method, a master node, a working node, and a distributed machine learning system, which are used to reduce the synchronization overhead of distributed machine learning.
In a first aspect, the present invention provides a machine learning method applied to a master node of a distributed machine learning system, where the master node is correspondingly provided with working nodes, and the method includes:
starting a parameter training process and determining the working nodes that join the parameter training process;
sending time information corresponding to the parameter training process to the working nodes, where the time information includes an end time of the parameter training process, so that the working nodes send their training sub-results to the master node before the end time;
receiving, at the end time, the training sub-results fed back by the working nodes, and updating global parameters based on the obtained training sub-results.
Optionally, the determining the working nodes that join the parameter training process includes:
obtaining resource occupation information of working nodes applying to join the parameter training process;
when the resource occupation information meets a preset condition, determining the applying working nodes as working nodes that join the parameter training process.
Optionally, when sending the time information corresponding to the parameter training process to the working nodes, the method further includes:
obtaining first identification information of first global parameters stored by a working node;
determining a first difference part between the first global parameters and the global parameters based on the first identification information and second identification information of the locally stored global parameters;
sending the first difference part to the working node, so that the working node restores the global parameters based on the first difference part and the first global parameters and performs parameter training based on the global parameters.
Optionally, the training sub-result is a second difference part between the training result parameters and the global parameters, sent by a working node participating in the parameter training process after it obtains the training result parameters by performing parameter training based on the global parameters.
Optionally, after the updating of the global parameters, the method further includes:
determining whether the updated global parameters have converged;
if not, restarting the parameter training process after a preset time interval.
In a second aspect, the present invention provides a machine learning method applied to a working node of a distributed machine learning system, where the working node is correspondingly provided with a master node, and the method includes:
after receiving a notification that the master node has determined that the working node joins a parameter training process, obtaining time information corresponding to the parameter training process sent by the master node, where the time information includes an end time of the parameter training process;
performing parameter training within a time range indicated by the time information; if the training has not been completed before the end time, ending the training at the end time, obtaining a training sub-result, and feeding the training sub-result back to the master node; if the training is completed before the end time and the time interval between the completion of the training and the end time is greater than a preset value, controlling the working node to repeat the training, determining the training sub-result based on the sub-results obtained from the repeated training, and feeding the training sub-result back to the master node.
Optionally, the method further includes:
sending application information for applying to join the parameter training process to the master node, where the application information includes resource occupation information of the working node.
Optionally, the performing parameter training within the time range indicated by the time information includes:
after receiving the notification that the master node has determined that the working node joins the parameter training process, sending first identification information of stored first global parameters to the master node;
receiving a first difference part sent by the master node, and restoring the global parameters based on the first difference part and the first global parameters, where the first difference part is the difference between the first global parameters and the global parameters determined by the master node based on the first identification information and second identification information of the global parameters stored locally at the master node;
performing parameter training based on the global parameters within the time range indicated by the time information, obtaining a training sub-result, and feeding the training sub-result back to the master node.
Optionally, the performing parameter training based on the global parameters, obtaining a training sub-result, and feeding the training sub-result back to the master node includes:
performing parameter training based on the global parameters to obtain training result parameters;
determining a second difference part between the training result parameters and the global parameters, where the second difference part is the training sub-result;
sending the second difference part to the master node.
Optionally, the sending the second difference part to the master node includes:
determining whether the data amount of the second difference part is greater than a preset threshold;
if so, sending the second difference part to the master node.
Optionally, the performing parameter training based on the global parameters, obtaining a training sub-result, and feeding the training sub-result back to the master node includes:
determining a global training model corresponding to the global parameters and a personalized model corresponding to the working node;
compositing the global training model with the personalized model to obtain a composite model;
performing parameter training based on the composite model to obtain a training sub-result corresponding to the global training model and a training sub-result corresponding to the personalized model;
feeding the training sub-result corresponding to the global training model back to the master node, and saving the training sub-result corresponding to the personalized model in local storage.
In a third aspect, an embodiment of the present invention provides a master node of a distributed machine learning system, where the master node is correspondingly provided with working nodes, and the master node includes:
a first determining module configured to start a parameter training process and determine the working nodes that join the parameter training process;
a sending module configured to send time information corresponding to the parameter training process to the working nodes, where the time information includes an end time of the parameter training process, so that the working nodes send their training sub-results to the master node before the end time;
an updating module configured to receive, at the end time, the training sub-results fed back by the working nodes and update global parameters based on the obtained training sub-results.
Optionally, the first determining module is configured to:
obtain resource occupation information of working nodes applying to join the parameter training process;
when the resource occupation information meets a preset condition, determine the applying working nodes as working nodes that join the parameter training process.
Optionally, the sending module is further configured to:
obtain first identification information of first global parameters stored by a working node;
determine a first difference part between the first global parameters and the global parameters based on the first identification information and second identification information of the locally stored global parameters;
send the first difference part to the working node, so that the working node restores the global parameters based on the first difference part and the first global parameters and performs parameter training based on the global parameters.
Optionally, the training sub-result is a second difference part between the training result parameters and the global parameters, sent by a working node participating in the parameter training process after it obtains the training result parameters by performing parameter training based on the global parameters.
Optionally, the master node further includes:
a training module configured to determine, after the updating of the global parameters, whether the updated global parameters have converged, and if not, restart the parameter training process after a preset time interval.
In a fourth aspect, an embodiment of the present invention provides a working node of a distributed machine learning system, where the working node is correspondingly provided with a master node, and the working node includes:
a receiving module configured to receive, after a notification that the master node has determined that the working node joins a parameter training process is received, time information corresponding to the parameter training process sent by the master node, where the time information includes an end time of the parameter training process;
a training module configured to perform parameter training within a time range indicated by the time information; if the training has not been completed before the end time, end the training at the end time, obtain a training sub-result, and feed the training sub-result back to the master node; if the training is completed before the end time and the time interval between the completion of the training and the end time is greater than a preset value, control the working node to repeat the training, determine the training sub-result based on the sub-results obtained from the repeated training, and feed the training sub-result back to the master node.
Optionally, the working node further includes:
an application module configured to send application information for applying to join the parameter training process to the master node, where the application information includes resource occupation information of the working node.
Optionally, the training module is configured to:
after receiving the notification that the master node has determined that the working node joins the parameter training process, send first identification information of stored first global parameters to the master node;
receive a first difference part sent by the master node, and restore the global parameters based on the first difference part and the first global parameters, where the first difference part is the difference between the first global parameters and the global parameters determined by the master node based on the first identification information and second identification information of the global parameters stored locally at the master node;
perform parameter training based on the global parameters within the time range indicated by the time information, obtain a training sub-result, and feed the training sub-result back to the master node.
Optionally, the training module is configured to:
perform parameter training based on the global parameters to obtain training result parameters;
determine a second difference part between the training result parameters and the global parameters, where the second difference part is the training sub-result;
send the second difference part to the master node.
Optionally, the training module is configured to:
determine whether the data amount of the second difference part is greater than a preset threshold;
if so, send the second difference part to the master node.
Optionally, the training module is configured to:
determine a global training model corresponding to the global parameters and a personalized model corresponding to the working node;
composite the global training model with the personalized model to obtain a composite model;
perform parameter training based on the composite model to obtain a training sub-result corresponding to the global training model and a training sub-result corresponding to the personalized model;
feed the training sub-result corresponding to the global training model back to the master node, and save the training sub-result corresponding to the personalized model in local storage.
In a fifth aspect, a distributed machine learning system is provided, the distributed machine learning system including a master node and working nodes, where:
the master node starts a parameter training process, determines the working nodes that join the parameter training process, and sends time information corresponding to the parameter training process to the working nodes, where the time information includes an end time of the parameter training process;
after receiving a notification that the master node has determined that it joins the parameter training process, a working node obtains the time information corresponding to the parameter training process sent by the master node and performs parameter training within a time range indicated by the time information; if the training has not been completed before the end time, the working node ends the training at the end time, obtains a training sub-result, and feeds the training sub-result back to the master node; if the training is completed before the end time and the time interval between the completion of the training and the end time is greater than a preset value, the working node is controlled to repeat the training, the training sub-result is determined based on the sub-results obtained from the repeated training, and the training sub-result is fed back to the master node;
after receiving the training sub-results fed back by the working nodes participating in the parameter training process, the master node updates global parameters based on the obtained training sub-results.
Optionally, after receiving the notification that the master node has determined that it joins the parameter training process, the working node sends first identification information of stored first global parameters to the master node;
the master node determines a first difference part between the first global parameters and the global parameters based on the first identification information and second identification information of the locally stored global parameters, and sends the first difference part to the working node;
after receiving the first difference part, the working node restores the global parameters based on the first difference part and the first global parameters, performs parameter training based on the global parameters, obtains a training sub-result, and feeds the training sub-result back to the master node.
Optionally, the master node determining the working nodes that join the parameter training process includes:
obtaining resource occupation information of working nodes applying to join the parameter training process;
when the resource occupation information meets a preset condition, determining the applying working nodes as working nodes that join the parameter training process.
Optionally, after the master node updates the global parameters, the method further includes:
determining whether the updated global parameters have converged;
if not, restarting the parameter training process after a preset time interval.
Optionally, the working node is further configured to:
send application information for applying to join the parameter training process to the master node, where the application information includes resource occupation information of the working node.
Optionally, the working node performing parameter training based on the global parameters, obtaining a training sub-result, and feeding the training sub-result back to the master node includes:
obtaining the time information corresponding to the parameter training process sent by the master node, where the time information includes the end time of the parameter training process;
if the training has not been completed before the end time, ending the training at the end time, obtaining the training sub-result, and feeding the training sub-result back to the master node;
if the training is completed before the end time and the time interval between the completion of the training and the end time is greater than a preset value, controlling the working node to repeat the training, determining the training sub-result based on the sub-results obtained from the repeated training, and feeding the training sub-result back to the master node.
Optionally, the working node performing parameter training based on the global parameters, obtaining a training sub-result, and feeding the training sub-result back to the master node includes:
performing parameter training based on the global parameters to obtain training result parameters;
determining a second difference part between the training result parameters and the global parameters, where the second difference part is the training sub-result;
sending the second difference part to the master node.
Optionally, the working node sending the second difference part to the master node includes:
determining whether the data amount of the second difference part is greater than a preset threshold;
if so, sending the second difference part to the master node.
Optionally, the working node performing parameter training based on the global parameters, obtaining a training sub-result, and feeding the training sub-result back to the master node includes:
determining a global training model corresponding to the global parameters and a personalized model corresponding to the working node;
compositing the global training model with the personalized model to obtain a composite model;
performing parameter training based on the composite model to obtain a training sub-result corresponding to the global training model and a training sub-result corresponding to the personalized model;
feeding the training sub-result corresponding to the global training model back to the master node, and saving the training sub-result corresponding to the personalized model in local storage.
The one or more technical solutions in the embodiments of the present application have at least one or more of the following technical effects:
In the technical solutions of the embodiments of the present invention, after starting a parameter training process, the master node of the distributed machine learning system determines the working nodes that join the parameter training process; the master node also sets an end time for the parameter training process and sends the end time to each working node participating in the parameter training process. After receiving the end time, a working node stops training at the end time and feeds back to the master node the training sub-result it has obtained by that time. Because the end time of each round of parameter training is bounded in this way, the time at which each working node finishes training can be effectively controlled, which effectively reduces the synchronization overhead caused by the unsynchronized training times of the working nodes.
Brief Description of the Drawings
FIG. 1 is a schematic structural diagram of a machine learning system according to a first embodiment of the present invention;
FIG. 2 is a sequence diagram of machine learning performed by the machine learning system in the first embodiment of the present invention;
FIG. 3 is a flowchart of a machine learning method corresponding to the master node in a second embodiment of the present invention;
FIG. 4 is a flowchart of a machine learning method corresponding to the working node in a third embodiment of the present invention;
FIG. 5 is a schematic diagram of the master node in a fourth embodiment of the present invention;
FIG. 6 is a schematic diagram of the working node in a fifth embodiment of the present invention.
Detailed Description of the Embodiments
Embodiments of the present invention provide a machine learning method, a master node, a working node, and a distributed machine learning system, which are used to reduce the synchronization overhead of distributed machine learning. The distributed machine learning system includes a master node and working nodes. The master node starts a parameter training process, determines the working nodes that join the parameter training process, and sends time information corresponding to the parameter training process to the working nodes, where the time information includes an end time of the parameter training process. After receiving a notification that the master node has determined that it joins the parameter training process, a working node obtains the time information corresponding to the parameter training process sent by the master node and performs parameter training within the time range indicated by the time information. If the training has not been completed before the end time, the working node ends the training at the end time, obtains a training sub-result, and feeds the training sub-result back to the master node; if the training is completed before the end time and the time interval between the completion of the training and the end time is greater than a preset value, the working node is controlled to repeat the training, the training sub-result is determined based on the sub-results obtained from the repeated training, and the training sub-result is fed back to the master node. After receiving the training sub-results fed back by the working nodes participating in the parameter training process, the master node updates global parameters based on the obtained training sub-results.
The technical solutions of the present invention are described in detail below with reference to the accompanying drawings and specific embodiments. It should be understood that the embodiments of the present application and the specific features in the embodiments are detailed descriptions of the technical solutions of the present application rather than limitations on them, and the embodiments and the technical features in the embodiments may be combined with each other without conflict.
The term "and/or" in this document merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate three cases: A alone, both A and B, and B alone. In addition, the character "/" in this document generally indicates an "or" relationship between the associated objects.
Embodiments
Referring to FIG. 1, a first embodiment of the present invention provides a machine learning system. The machine learning system includes a master node and a plurality of working nodes, and the master node and the working nodes are communicatively connected, where:
the master node starts a parameter training process, determines the working nodes that join the parameter training process, and sends time information corresponding to the parameter training process to the working nodes, where the time information includes an end time of the parameter training process;
after receiving a notification that the master node has determined that it joins the parameter training process, a working node obtains the time information corresponding to the parameter training process sent by the master node and performs parameter training within the time range indicated by the time information; if the training has not been completed before the end time, the working node ends the training at the end time, obtains a training sub-result, and feeds the training sub-result back to the master node; if the training is completed before the end time and the time interval between the completion of the training and the end time is greater than a preset value, the working node is controlled to repeat the training, the training sub-result is determined based on the sub-results obtained from the repeated training, and the training sub-result is fed back to the master node;
further, after receiving the notification that the master node has determined that it joins the parameter training process, the working node sends first identification information of stored first global parameters to the master node;
the master node determines a first difference part between the first global parameters and the global parameters based on the first identification information and second identification information of the locally stored global parameters, and sends the first difference part to the working node;
after receiving the first difference part, the working node restores the global parameters based on the first difference part and the first global parameters, performs parameter training based on the global parameters, obtains a training sub-result, and feeds the training sub-result back to the master node;
after receiving the training sub-results fed back by the working nodes participating in the parameter training process, the master node updates the global parameters based on the obtained training sub-results.
Specifically, in this embodiment, after starting the parameter training process, the master node sends a corresponding notification to the working nodes connected to it. After receiving the notification, a working node sends application information for joining the parameter training process to the master node. The master node determines the working nodes that join the training process and sends a notification of confirmed joining to the corresponding working nodes.
After receiving the notification that the master node has determined that it joins the parameter training process, a working node sends the first identification information of its stored first global parameters to the master node. By comparing the first identification information with the second identification information of the locally stored global parameters, the master node determines the first difference part between the first global parameters and the global parameters and sends the first difference part to the corresponding working node. The working node receives the first difference part and restores the global parameters.
For example, suppose the master node determines that working node A participates in the parameter training and the latest global parameter version number on the master node is V811; the master node also stores the global parameters of previous versions, including the V810, V809, and V808 versions. The version number of the stored first global parameters sent by working node A is V810, which indicates that the version of the global parameters stored on working node A does not differ much from the version stored on the master node. The master node determines the difference part DA between the V811-version global parameters and the V810-version global parameters and sends DA to working node A. Working node A can then restore the latest-version global parameters on the master node, i.e., the V811-version global parameters, from its locally stored V810-version global parameters and the difference part DA.
Finally, the working node performs parameter training based on the global parameters, obtains a training sub-result, and feeds the training sub-result back to the master node. After receiving the training sub-results fed back by the working nodes participating in the parameter training process, the master node updates the global parameters based on the obtained training sub-results. In this way, the amount of data transmitted by the master node can be greatly reduced, and the communication overhead of machine learning can be effectively reduced.
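The version-based exchange above can be illustrated with the following minimal Python sketch; the dictionary-based version store and the dense subtraction are illustrative assumptions, since the patent does not prescribe how the difference part is encoded.

```python
import numpy as np

# Master-side store of recent global parameter versions, e.g. {"V810": ..., "V811": ...}
version_store: dict[str, np.ndarray] = {}

def first_difference(latest_version: str, worker_version: str) -> np.ndarray:
    """Master: compute the first difference part between the latest global
    parameters and the version reported by the worker (e.g. V811 - V810)."""
    return version_store[latest_version] - version_store[worker_version]

def restore_global(first_global: np.ndarray, difference: np.ndarray) -> np.ndarray:
    """Worker: restore the latest global parameters from its stored copy
    plus the received difference part (V810 + DA -> V811)."""
    return first_global + difference
```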
Further, when a working node sends application information for joining the parameter training process to the master node, the application information includes resource occupation information of the working node.
The master node then obtains the resource occupation information of the working nodes applying to join the parameter training process; when the resource occupation information meets a preset condition, it determines the applying working nodes as working nodes that join the parameter training process.
Specifically, in this embodiment, the master node decides whether to approve an application according to the resource occupation status of the working node applying to join the parameter training process. The resource occupation information sent by a working node may include information such as memory usage and remaining battery level, and may of course include other information, which is not limited in this application. If the resource occupation information sent by a working node contains two or more items, a weighted average may be used to determine a composite resource occupation figure; for example, if the working node reports a memory usage of 50% and a remaining battery level of 60%, the composite figure is α*50% + β*60%. The master node then determines, according to the composite resource occupation figure, whether the working node meets the requirements for participating in the parameter training process. In the following, the resource occupation information is taken to be the memory usage as an example.
The master node starts the parameter training process and waits for working nodes to join the training. The master node receives applications from four working nodes A, B, C, and D: the memory usage reported by node A is 20%, by node B 28%, by node C 25%, and by node D 50%. The preset condition under which the master node allows a working node to join the parameter training process is that the working node's memory usage must be less than 30%. Based on the memory usage of the four working nodes A, B, C, and D, the master node determines that the three nodes A, B, and C meet the requirement, and therefore determines working nodes A, B, and C as the working nodes participating in the parameter training process.
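A minimal sketch of this admission check is given below (Python). The report format, the weights alpha and beta, and the helper names are assumptions; the 30% threshold and the A/B/C/D figures follow the example above.

```python
def composite_occupation(report: dict, weights: dict) -> float:
    """Weighted-average composite resource occupation, e.g. alpha*memory + beta*battery."""
    return sum(weights[k] * report[k] for k in weights)

def admit_workers(reports: dict, threshold: float = 0.30) -> list:
    """Master: admit the workers whose resource occupation is below the preset threshold."""
    return [node for node, usage in reports.items() if usage < threshold]

# Example from the description, using memory usage only.
reports = {"A": 0.20, "B": 0.28, "C": 0.25, "D": 0.50}
print(admit_workers(reports))  # ['A', 'B', 'C']
```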
Further, in this embodiment, in order to reduce the synchronization overhead of machine learning, when the master node sends the first difference part to the working node, it also:
sends time information corresponding to the parameter training process to the working node, where the time information includes the end time of the parameter training process, so that the working node sends its training sub-result to the master node before the end time.
The working node performing parameter training based on the global parameters, obtaining a training sub-result, and feeding the training sub-result back to the master node includes:
obtaining the time information corresponding to the parameter training process sent by the master node, where the time information includes the end time of the parameter training process;
if the training has not been completed before the end time, ending the training at the end time, obtaining the training sub-result, and feeding the training sub-result back to the master node;
if the training is completed before the end time and the time interval between the completion of the training and the end time is greater than a preset value, controlling the working node to repeat the training, determining the training sub-result based on the sub-results obtained from the repeated training, and feeding the training sub-result back to the master node.
Thus, to reduce the synchronization overhead, in this embodiment the master node also sets an end time for the parameter training process and sends the end time to each working node participating in the process. After receiving the end time, a working node with weak processing capability, or one delayed for other reasons, stops training at the end time even if the training is not finished and feeds back to the master node the training sub-result obtained by that time. For a working node with strong processing capability that finishes its training task early, if the interval between the completion time and the end time is greater than the preset value, it indicates that the working node still has a long idle period; the working node can then be controlled to perform multiple rounds of repeated training, the results of the multiple rounds are combined to determine the working node's training sub-result, and that training sub-result is fed back to the master node. In specific implementations, the preset value may be set according to actual needs, which is not limited in this application.
For example, node D joins the parameter training process at 1:38 AM. When it is told that the current round of the parameter training process will end at 2:00 AM, it calculates that the remaining training time is 22 minutes. Suppose that, because node D's processor is slow, one round of training takes 30 minutes. Node D therefore trains on only 22/30, i.e., 73.3%, of its data to guarantee that it can send its result to the master node before the training process ends. Suppose also that node A calculates a remaining training time of 55 minutes. Because node A's processor is fast, one round of training takes 16 minutes, so before the training process ends node A can perform 55/16, i.e., 3.44, rounds of training on the data it owns.
In this way, because the end time of each round of parameter training is bounded, the time at which each working node finishes training can be effectively controlled, which effectively reduces the synchronization overhead caused by the unsynchronized training times of the working nodes.
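The time-budget arithmetic in the node D / node A example can be written as a small helper (Python). The function name and the choice to express the budget as a data fraction or a round count are illustrative assumptions.

```python
def plan_training(remaining_minutes: float, minutes_per_round: float):
    """Decide how much work fits before the announced end time.

    Returns (data_fraction, num_rounds): a slow node trains on a fraction of
    its data in a single round; a fast node repeats full rounds.
    """
    budget = remaining_minutes / minutes_per_round
    if budget < 1.0:
        return budget, 1           # e.g. node D: 22/30 = 73.3% of its data, one round
    return 1.0, int(budget)        # e.g. node A: 55/16 = 3.44 -> 3 full repeated rounds

print(plan_training(22, 30))  # (0.733..., 1)
print(plan_training(55, 16))  # (1.0, 3)
```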
Further, in order to further reduce the communication overhead, in this embodiment the working node performing parameter training based on the global parameters, obtaining a training sub-result, and feeding the training sub-result back to the master node includes:
performing parameter training based on the global parameters to obtain training result parameters;
determining a second difference part between the training result parameters and the global parameters, where the second difference part is the training sub-result;
sending the second difference part to the master node.
Further, the working node sending the second difference part to the master node includes:
determining whether the data amount of the second difference part is greater than a preset threshold;
if so, sending the second difference part to the master node.
Specifically, in this embodiment, after a working node finishes training based on the global parameters, it obtains the training result parameters. To further reduce the communication overhead, when uploading the training result parameters to the master node, the working node only needs to upload the part that differs from the global parameters; it therefore determines the second difference part between the training result parameters obtained by its training and the global parameters, and uploads the second difference part to the master node as its training sub-result.
An L1 constraint may also be applied when the working node uploads its training sub-result to the master node. Specifically, it is necessary to determine whether the data amount of the second difference part between the training result parameters obtained by the working node's training and the global parameters is greater than a preset threshold; the preset threshold may be set according to actual needs, which is not limited in this application. When the data amount of the second difference part is greater than the preset threshold, it indicates that the training result parameters trained by the working node differ substantially from the global parameters and can be used to update the global parameters. Therefore, only when the data amount of the second difference part between its training result parameters and the global parameters is greater than the preset threshold does the working node upload that second difference part to the master node. Because, in the machine learning system, a working node participating in the parameter training process only needs to upload the second difference part when its data amount exceeds the preset threshold, the amount of data uploaded to the master node is reduced and the communication overhead is effectively reduced.
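One possible reading of this thresholded upload is sketched below (Python). Measuring the "data amount" of the difference with its L1 norm is an assumption made for illustration; the patent only states that an L1 constraint may be applied and that small differences need not be uploaded.

```python
import numpy as np

def second_difference(trained_params: np.ndarray, global_params: np.ndarray,
                      threshold: float):
    """Worker: compute the second difference part and decide whether to upload it.

    The difference is uploaded only if its magnitude (here its L1 norm) exceeds
    the preset threshold; otherwise nothing is sent, which saves bandwidth.
    """
    diff = trained_params - global_params
    if np.abs(diff).sum() > threshold:   # "data amount" measured under an L1 constraint
        return diff                      # training sub-result sent to the master node
    return None                          # too small to be worth uploading
```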
Further, in this embodiment, the working node feeding the training sub-result back to the master node includes:
determining a global training model corresponding to the global parameters and a personalized model corresponding to the working node;
compositing the global training model with the personalized model to obtain a composite model;
performing parameter training based on the composite model to obtain a training sub-result corresponding to the global training model and a training sub-result corresponding to the personalized model;
feeding the training sub-result corresponding to the global training model back to the master node, and saving the training sub-result corresponding to the personalized model in local storage.
In the prior art, a distributed machine learning system performs parameter training on the premise that the data are randomly distributed across the working nodes, that is, the data are independent and identically distributed. This matches the data distribution of the working nodes in a strongly coupled distributed environment, such as a data center. However, in many distributed environments the premise of independent and identically distributed data cannot be met, for example a distributed environment made up of tens of millions of smartphones. In such a loosely coupled distributed environment, each phone holds some private data, such as the user's usage habits and interaction behaviors; these data distributions vary widely and do not satisfy the i.i.d. premise.
At the same time, in a loosely coupled distributed computing scenario, such as a distributed environment made up of tens of millions of smartphones, the number of participating nodes is enormous (tens of millions or more), the data distributions differ greatly from one another, and the data cannot be synchronized because of privacy and transmission bandwidth limitations. Most existing distributed machine learning systems are designed for strongly coupled distributed computing environments, such as a company's data center, support computing nodes on the order of thousands in practice, and assume that the data of each working node are independent and identically distributed, so they are not suitable for loosely coupled distributed computing environments.
Therefore, in this embodiment, when a working node performs parameter training, it first needs to determine the model to be trained. The working node may determine the global training model corresponding to the global parameters according to an instruction from the master node, and further needs to determine its local personalized model. That is, in addition to training the global model with its local data, each working node also trains a local personalized model to characterize the distinctive part of its local data distribution. The personalized model may be chosen appropriately according to the constraints of the computing node's computing, memory, and storage resources, and may differ from the global training model.
For example, after receiving the first difference part, working node A participating in the parameter training adds it to the global parameters it currently holds to obtain the latest global parameters (i.e., the global parameters). It then composites the global parameters with its local personalized model to obtain a composite model, and trains the composite model with all the data on the working node. Training with the composite model retains the global view of the global model, which ensures faster convergence of the training; moreover, because a personalized model is added, the training can converge more quickly even when the data distribution is extremely uneven.
After the training ends, a training sub-result corresponding to the global training model and a training sub-result corresponding to the personalized model are obtained. When uploading, only the training sub-result corresponding to the global training model is uploaded to the master node, while the training sub-result corresponding to the personalized model is kept locally; on top of saving communication overhead, this also speeds up the convergence of the parameter training.
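One way such a composite model could look is sketched below (Python). The additive combination of a shared global part and a per-device personalized correction is an illustrative assumption; the patent leaves the concrete form of the personalized model open.

```python
import numpy as np

class CompositeModel:
    """Composite of the global training model and a local personalized model.

    Prediction combines the shared global weights with a per-node correction.
    After training, only the change to global_w is reported to the master node,
    while personal_w never leaves the working node.
    """
    def __init__(self, global_w: np.ndarray, personal_w: np.ndarray):
        self.global_w = global_w.copy()      # restored from the master (e.g. V811)
        self.personal_w = personal_w.copy()  # kept in local storage

    def predict(self, x: np.ndarray) -> np.ndarray:
        return x @ (self.global_w + self.personal_w)

    def sgd_step(self, x: np.ndarray, y: np.ndarray, lr: float = 0.01):
        # One least-squares gradient step, applied to both parts of the composite.
        grad = x.T @ (self.predict(x) - y) / len(y)
        self.global_w -= lr * grad
        self.personal_w -= lr * grad
```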
Finally, after receiving the training sub-results fed back by the working nodes participating in the parameter training process, the master node updates the global parameters based on the obtained training sub-results. After updating the global parameters, it also needs to determine whether the updated global parameters have converged; if not, the parameter training process is restarted after a preset time interval.
Specifically, in this embodiment, after the current parameter training process ends, the master node obtains the training sub-results fed back by the working nodes participating in the process, i.e., the corresponding second difference parts; it then takes a weighted average of the second difference parts uploaded by the working nodes and formally updates the global parameter values with the obtained mean of those second difference parts, producing the updated global parameters.
After updating the global parameters, the master node also needs to determine whether the updated global parameters have converged. Specifically, when making this convergence determination, it needs to determine whether the deviation between the updated global parameters and the global parameters before the update is smaller than a predetermined value. If the deviation is smaller than the predetermined value, the result of the current round of the parameter training process has converged; if the deviation is greater than or equal to the predetermined value, the result of the current round has not yet converged, and the master node may, as needed, start the next round of the parameter training process after a preset time interval to further update the global parameters.
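Putting the master-side update together, a minimal sketch could look as follows (Python). The norm-based deviation test and the per-node weights are assumptions; the patent only requires a weighted average of the uploaded second difference parts and a convergence check against a predetermined value.

```python
import numpy as np

def master_update(global_params: np.ndarray, second_diffs: list, weights: list,
                  tol: float = 1e-3):
    """Master: weighted-average the uploaded second difference parts, apply the
    mean to the global parameters, and check whether this round has converged."""
    w = np.asarray(weights, dtype=float)
    w /= w.sum()
    mean_diff = sum(wi * d for wi, d in zip(w, second_diffs))
    updated = global_params + mean_diff

    converged = np.linalg.norm(updated - global_params) < tol
    return updated, converged  # if not converged, restart training after a preset interval
```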
To better understand the solution in this embodiment, the machine learning system of this embodiment is described in detail below with a complete example. The sequence in which the machine learning system performs machine learning is shown in FIG. 2.
First, before parameter training starts, the master node randomly initializes the network model parameters based on the model configuration and stores them on the master node as the global parameter values. The master node then opens a parameter training process at fixed time intervals and waits for working nodes to join the training. Each working node sends a request to the master node; the request carries the working node's resource information (such as computing capability, memory, storage, and battery resources) and at the same time informs the master node of the version number of the global parameters the working node currently holds. The master node selects working nodes to join the training according to the training requirements and the resource information of the working nodes. Based on the selected working node's global parameter version number, the master node sends only the difference part of the global parameters to that working node, to reduce the communication volume, and at the same time sends the end time of the current training process to the working node.
The working nodes participating in the training, after receiving the change values of the global parameters, add the difference part of the global parameters to the global parameters they currently hold to obtain the latest global parameters. The participating working nodes then composite the global model corresponding to the global parameters with the local personalized model to obtain a composite training model, train the composite model on all the data on the node, and ensure that the training sub-result is returned to the master node before the master node's training process ends.
During parameter training, if a working node trains slowly and cannot finish training on all of its data, it terminates before the training process ends and sends the corresponding training sub-result to the master node. If a working node trains quickly, it can train on its data for multiple rounds and send the combined value of the multi-round training sub-results to the master node. The update of the training result has two parts. One part is the update of the local personalized model; this part does not need to be uploaded, which saves communication overhead and at the same time speeds up the convergence of the model. The other part is the update of the global model, which needs to be uploaded to the master node; the update value after the L1 constraint is used here, which has a low communication overhead.
Finally, after the current training process ends, the master node computes a weighted average of the training sub-results uploaded by the working nodes and uses the mean to formally update the global parameter values. If the training result has not converged, the master node starts a new training process.
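For orientation only, the master-side round loop that ties these steps together can be sketched as follows. The callbacks select, dispatch, collect and aggregate_and_check, and the parameters open_interval and round_length, are placeholders for the real RPC calls and configuration of a concrete system.

```python
import time

def run_master(init_params, open_interval, round_length, select, dispatch,
               collect, aggregate_and_check):
    """Open rounds at fixed intervals, wait until each round's end time, aggregate, check convergence."""
    params, converged = init_params, False
    while not converged:
        end_time = time.time() + round_length
        workers = select()                          # choose working nodes by resource information
        dispatch(workers, params, end_time)         # send parameter diffs and the end time
        time.sleep(max(end_time - time.time(), 0))  # wait for the current training process to end
        uploads = collect(workers)                  # second difference parts received in time
        params, converged = aggregate_and_check(params, uploads)
        if not converged:
            time.sleep(open_interval)               # reopen the parameter training process later
    return params
```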
Referring to FIG. 3, a second embodiment of the present invention provides a machine learning method applied to a master node of a distributed machine learning system, where the master node is correspondingly provided with working nodes, and the method includes:
S301: opening a parameter training process, and determining the working nodes that join the parameter training process;
S302: sending time information corresponding to the parameter training process to the working nodes;
S303: receiving, at the end time, the training sub-results fed back by the working nodes, and updating the global parameters based on the obtained training sub-results.
Further, the master node determining the working nodes that join the parameter training process includes:
obtaining resource occupancy information of a working node applying to join the parameter training process;
when the resource occupancy information satisfies a preset condition, determining that the working node applying to join the parameter training process is a working node that joins the parameter training process (a sketch of this selection check follows this list).
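As a minimal sketch of the resource-based selection, the specific fields and thresholds below (free memory, battery level) are illustrative assumptions; the embodiment only requires that the resource occupancy information satisfy some preset condition.

```python
def select_workers(applications, min_free_memory_mb=512, min_battery_pct=30):
    """Admit the applicants whose resource occupancy information meets the preset conditions."""
    selected = []
    for node_id, resources in applications.items():
        # `resources` is assumed to carry the fields reported in the node's application message.
        if (resources.get("free_memory_mb", 0) >= min_free_memory_mb and
                resources.get("battery_pct", 100) >= min_battery_pct):
            selected.append(node_id)
    return selected
```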
Further, when sending the time information corresponding to the parameter training process to the working node, the method also includes:
obtaining first identification information of first global parameters stored by the working node;
determining a first difference part between the first global parameters and the global parameters based on the first identification information and second identification information of the locally stored global parameters;
sending the first difference part to the working node, so that the working node restores the global parameters based on the first difference part and the first global parameters and performs parameter training based on the global parameters. A sketch of this version-based exchange follows this list.
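The sketch below reads the identification information as a version number and the global parameters as a dictionary of arrays; the per-version history kept on the master node is an assumption introduced for illustration.

```python
def first_difference_part(param_history, worker_version, current_version):
    """Master side: send only what changed between the worker's version and the current version."""
    old = param_history[worker_version]    # first global parameters (the working node's copy)
    new = param_history[current_version]   # locally stored global parameters (the master's copy)
    return {k: new[k] - old[k] for k in new}

def restore_global_params(first_global_params, first_difference):
    """Working-node side: restore the latest global parameters from its stored copy and the diff."""
    return {k: first_global_params[k] + first_difference[k] for k in first_global_params}
```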
The master node receives the training sub-results fed back by the working nodes participating in the parameter training process, and updates the global parameters based on the obtained training sub-results.
The training sub-result is the second difference part between the training result parameters and the global parameters, sent by a working node participating in the parameter training process after it performs parameter training based on the global parameters and obtains the training result parameters.
Further, after updating the global parameters, the method also includes:
determining whether the updated global parameters have converged;
if not, restarting the parameter training process after a preset time interval.
Specifically, the method by which the master node in this embodiment performs machine learning has been described completely in the first embodiment; reference can be made to the first embodiment, and details are not repeated here.
Referring to FIG. 4, a third embodiment of the present invention provides a machine learning method applied to a working node of a distributed machine learning system, where the working node is correspondingly provided with a master node, and the method includes:
S401: after receiving a notification that the master node has determined that the working node joins the parameter training process, obtaining the time information sent by the master node and corresponding to the parameter training process;
S402: performing parameter training within the time range indicated by the time information.
The working node needs to send application information applying to join the parameter training process to the master node, and the application information includes the resource occupancy information of the working node.
Further, the performing parameter training within the time range indicated by the time information includes:
after receiving the notification that the master node has determined that the working node joins the parameter training process, sending the first identification information of the stored first global parameters to the master node;
receiving the first difference part sent by the master node, and restoring the global parameters based on the first difference part and the first global parameters, where the first difference part is the first difference part between the first global parameters and the global parameters, determined by the master node based on the first identification information and the second identification information of the global parameters stored locally on the master node;
performing parameter training based on the global parameters within the time range indicated by the time information, obtaining a training sub-result, and feeding the training sub-result back to the master node.
Further, the working node performing parameter training based on the global parameters, obtaining a training sub-result, and feeding the training sub-result back to the master node includes:
performing parameter training based on the global parameters to obtain training result parameters;
determining a second difference part between the training result parameters and the global parameters, the second difference part being the training sub-result;
sending the second difference part to the master node.
Further, the working node sending the second difference part to the master node includes:
determining whether the data volume of the second difference part is greater than a preset threshold;
if so, sending the second difference part to the master node.
Further, the working node performing parameter training based on the global parameters, obtaining a training sub-result, and feeding the training sub-result back to the master node includes:
determining a global training model corresponding to the global parameters and a personalized model corresponding to the working node;
compositing the global training model with the personalized model to obtain a composite model;
performing parameter training based on the composite model to obtain a training sub-result corresponding to the global training model and a training sub-result corresponding to the personalized model;
feeding the training sub-result corresponding to the global training model back to the master node, and keeping the training sub-result corresponding to the personalized model in local storage.
Specifically, the method by which the working node in this embodiment performs machine learning has been described completely in the first embodiment; reference can be made to the first embodiment, and details are not repeated here.
Referring to FIG. 5, a fourth embodiment of the present invention provides a master node of a distributed machine learning system, where the master node is correspondingly provided with working nodes, and the master node includes:
a first determining module 501, configured to open a parameter training process and determine the working nodes that join the parameter training process;
a sending module 502, configured to send time information corresponding to the parameter training process to the working nodes, where the time information includes the end time of the parameter training process, so that the working nodes send training sub-results to the master node before the end time;
an updating module 503, configured to receive, at the end time, the training sub-results fed back by the working nodes, and update the global parameters based on the obtained training sub-results.
The sending module is further configured to:
obtain the first identification information of the first global parameters stored by the working node;
determine the first difference part between the first global parameters and the global parameters based on the first identification information and the second identification information of the locally stored global parameters;
send the first difference part to the working node, so that the working node restores the global parameters based on the first difference part and the first global parameters and performs parameter training based on the global parameters.
The training sub-result is the second difference part between the training result parameters and the global parameters, sent by a working node participating in the parameter training process after it performs parameter training based on the global parameters and obtains the training result parameters.
The master node further includes:
a training module, configured to determine, after the global parameters are updated, whether the updated global parameters have converged, and if not, restart the parameter training process after a preset time interval. A skeleton of this module structure is sketched below.
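The skeleton below only shows how the modules of this embodiment could be wired together; the constructor arguments and method bodies are assumptions that delegate to routines like those sketched earlier, not a definitive implementation.

```python
class MasterNode:
    """Skeleton of the master node assembled from the modules of this embodiment."""

    def __init__(self, determining_module, sending_module, updating_module, training_module):
        self.determine = determining_module  # opens the process and selects working nodes
        self.send = sending_module           # sends time information and first difference parts
        self.update = updating_module        # aggregates training sub-results at the end time
        self.check = training_module         # convergence check; reopens the process if needed

    def run_round(self, global_params):
        workers, end_time = self.determine()
        self.send(workers, global_params, end_time)
        new_params = self.update(workers, end_time, global_params)
        return new_params, self.check(global_params, new_params)
```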
Referring to FIG. 6, a fifth embodiment of the present invention provides a working node of a distributed machine learning system, where the working node is correspondingly provided with a master node, and the working node includes:
a receiving module 601, configured to receive, after receiving a notification that the master node has determined that the working node joins the parameter training process, the time information sent by the master node and corresponding to the parameter training process, where the time information includes the end time of the parameter training process;
a training module 602, configured to perform parameter training within the time range indicated by the time information; if the training has not been completed before the end time, end the training at the end time, obtain a training sub-result, and feed the training sub-result back to the master node; and if the training is completed before the end time and the interval between the completion time and the end time is greater than a preset value, control the working node to repeat the training, determine the training sub-result based on the sub-results obtained from the repeated training, and feed the training sub-result back to the master node.
The working node further includes:
an application module, configured to send application information applying to join the parameter training process to the master node, where the application information includes the resource occupancy information of the working node.
Optionally, the training module is configured to:
after receiving the notification that the master node has determined that the working node joins the parameter training process, send the first identification information of the stored first global parameters to the master node;
receive the first difference part sent by the master node, and restore the global parameters based on the first difference part and the first global parameters, where the first difference part is the first difference part between the first global parameters and the global parameters, determined by the master node based on the first identification information and the second identification information of the global parameters stored locally on the master node;
perform parameter training based on the global parameters within the time range indicated by the time information, obtain a training sub-result, and feed the training sub-result back to the master node.
The training module is configured to:
perform parameter training based on the global parameters to obtain training result parameters;
determine a second difference part between the training result parameters and the global parameters, the second difference part being the training sub-result;
send the second difference part to the master node.
The training module is configured to:
determine whether the data volume of the second difference part is greater than a preset threshold;
if so, send the second difference part to the master node.
The training module is configured to:
determine a global training model corresponding to the global parameters and a personalized model corresponding to the working node;
composite the global training model with the personalized model to obtain a composite model;
perform parameter training based on the composite model to obtain a training sub-result corresponding to the global training model and a training sub-result corresponding to the personalized model;
feed the training sub-result corresponding to the global training model back to the master node, and keep the training sub-result corresponding to the personalized model in local storage.
In the technical solutions of the embodiments of the present invention, after opening a parameter training process, the master node of the distributed machine learning system determines the working nodes that join the process. The master node also sets an end time for the parameter training process and sends the end time to each working node participating in the process. After receiving the end time, a working node stops training at the end time and feeds back to the master node the training sub-result obtained at the end time. In this way, because an end time is set for each round of parameter training, the time at which each working node finishes training can be effectively controlled, which effectively reduces the synchronization overhead caused by the unsynchronized training times of the working nodes.
Although preferred embodiments of the present invention have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be interpreted as covering the preferred embodiments and all changes and modifications falling within the scope of the present invention.
Obviously, those skilled in the art can make various changes and variations to the present invention without departing from its spirit and scope. Thus, if these modifications and variations of the present invention fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.

Claims (14)

  1. A machine learning method, applied to a master node of a distributed machine learning system, the master node being correspondingly provided with working nodes, wherein the method comprises:
    opening a parameter training process, and determining the working nodes that join the parameter training process;
    sending time information corresponding to the parameter training process to the working nodes, wherein the time information comprises an end time of the parameter training process, so that the working nodes send training sub-results to the master node before the end time;
    receiving, at the end time, the training sub-results fed back by the working nodes, and updating global parameters based on the obtained training sub-results.
  2. The method according to claim 1, wherein the determining the working nodes that join the parameter training process comprises:
    obtaining resource occupancy information of a working node applying to join the parameter training process;
    when the resource occupancy information satisfies a preset condition, determining that the working node applying to join the parameter training process is a working node that joins the parameter training process.
  3. The method according to claim 1, wherein when sending the time information corresponding to the parameter training process to the working node, the method further comprises:
    obtaining first identification information of first global parameters stored by the working node;
    determining a first difference part between the first global parameters and the global parameters based on the first identification information and second identification information of locally stored global parameters;
    sending the first difference part to the working node, so that the working node restores the global parameters based on the first difference part and the first global parameters and performs parameter training based on the global parameters.
  4. The method according to claim 3, wherein the training sub-result is a second difference part between the training result parameters and the global parameters, sent by a working node participating in the parameter training process after performing parameter training based on the global parameters to obtain the training result parameters.
  5. The method according to claim 1, wherein after the updating global parameters, the method further comprises:
    determining whether the updated global parameters have converged;
    if not, restarting the parameter training process after a preset time interval.
  6. A machine learning method, applied to a working node of a distributed machine learning system, the working node being correspondingly provided with a master node, wherein the method comprises:
    after receiving a notification that the master node has determined that the working node joins a parameter training process, obtaining time information sent by the master node and corresponding to the parameter training process, wherein the time information comprises an end time of the parameter training process;
    performing parameter training within a time range indicated by the time information;
    if the training has not been completed before the end time, ending the training at the end time, obtaining a training sub-result, and feeding the training sub-result back to the master node;
    if the training is completed before the end time and an interval between the completion time and the end time is greater than a preset value, controlling the working node to repeat the training, determining the training sub-result based on sub-results obtained from the repeated training, and feeding the training sub-result back to the master node.
  7. The method according to claim 6, wherein the method further comprises:
    sending application information applying to join the parameter training process to the master node, wherein the application information comprises resource occupancy information of the working node.
  8. The method according to claim 6, wherein the performing parameter training within the time range indicated by the time information comprises:
    after receiving the notification that the master node has determined that the working node joins the parameter training process, sending first identification information of stored first global parameters to the master node;
    receiving a first difference part sent by the master node, and restoring the global parameters based on the first difference part and the first global parameters, wherein the first difference part is a first difference part between the first global parameters and the global parameters, determined by the master node based on the first identification information and second identification information of global parameters stored locally on the master node;
    performing parameter training based on the global parameters within the time range indicated by the time information, obtaining a training sub-result, and feeding the training sub-result back to the master node.
  9. The method according to claim 8, wherein the performing parameter training based on the global parameters, obtaining a training sub-result, and feeding the training sub-result back to the master node comprises:
    performing parameter training based on the global parameters to obtain training result parameters;
    determining a second difference part between the training result parameters and the global parameters, the second difference part being the training sub-result;
    sending the second difference part to the master node.
  10. The method according to claim 9, wherein the sending the second difference part to the master node comprises:
    determining whether a data volume of the second difference part is greater than a preset threshold;
    if so, sending the second difference part to the master node.
  11. The method according to claim 8, wherein the performing parameter training based on the global parameters, obtaining a training sub-result, and feeding the training sub-result back to the master node comprises:
    determining a global training model corresponding to the global parameters and a personalized model corresponding to the working node;
    compositing the global training model with the personalized model to obtain a composite model;
    performing parameter training based on the composite model to obtain a training sub-result corresponding to the global training model and a training sub-result corresponding to the personalized model;
    feeding the training sub-result corresponding to the global training model back to the master node, and keeping the training sub-result corresponding to the personalized model in local storage.
  12. A master node of a distributed machine learning system, the master node being correspondingly provided with working nodes, wherein the master node comprises:
    a first determining module, configured to open a parameter training process and determine the working nodes that join the parameter training process;
    a sending module, configured to send time information corresponding to the parameter training process to the working nodes, wherein the time information comprises an end time of the parameter training process, so that the working nodes send training sub-results to the master node before the end time;
    an updating module, configured to receive, at the end time, the training sub-results fed back by the working nodes, and update global parameters based on the obtained training sub-results.
  13. A working node of a distributed machine learning system, the working node being correspondingly provided with a master node, wherein the working node comprises:
    a receiving module, configured to receive, after receiving a notification that the master node has determined that the working node joins a parameter training process, time information sent by the master node and corresponding to the parameter training process, wherein the time information comprises an end time of the parameter training process;
    a training module, configured to perform parameter training within a time range indicated by the time information; if the training has not been completed before the end time, end the training at the end time, obtain a training sub-result, and feed the training sub-result back to the master node; and if the training is completed before the end time and an interval between the completion time and the end time is greater than a preset value, control the working node to repeat the training, determine the training sub-result based on sub-results obtained from the repeated training, and feed the training sub-result back to the master node.
  14. A distributed machine learning system, the distributed machine learning system comprising a master node and working nodes, wherein:
    the master node opens a parameter training process, determines the working nodes that join the parameter training process, and sends time information corresponding to the parameter training process to the working nodes, wherein the time information comprises an end time of the parameter training process;
    after receiving a notification that the master node has determined that the working node joins the parameter training process, the working node obtains the time information sent by the master node and corresponding to the parameter training process; performs parameter training within a time range indicated by the time information; if the training has not been completed before the end time, ends the training at the end time, obtains a training sub-result, and feeds the training sub-result back to the master node; and if the training is completed before the end time and an interval between the completion time and the end time is greater than a preset value, controls the working node to repeat the training, determines the training sub-result based on sub-results obtained from the repeated training, and feeds the training sub-result back to the master node;
    after receiving the training sub-results fed back by the working nodes participating in the parameter training process, the master node updates the global parameters based on the obtained training sub-results.
PCT/CN2017/114228 2017-11-28 2017-12-01 一种机器学习方法、主节点、工作节点及系统 WO2019104713A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201711214868.1 2017-11-28
CN201711214868.1A CN107944566B (zh) 2017-11-28 2017-11-28 一种机器学习方法、主节点、工作节点及系统

Publications (1)

Publication Number Publication Date
WO2019104713A1 true WO2019104713A1 (zh) 2019-06-06

Family

ID=61949319

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/114228 WO2019104713A1 (zh) 2017-11-28 2017-12-01 一种机器学习方法、主节点、工作节点及系统

Country Status (2)

Country Link
CN (1) CN107944566B (zh)
WO (1) WO2019104713A1 (zh)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108829441B (zh) * 2018-05-14 2022-10-18 中山大学 一种分布式深度学习的参数更新优化系统
CN108924187B (zh) * 2018-06-07 2020-05-08 北京百度网讯科技有限公司 基于机器学习的任务处理方法、装置和终端设备
CN109558909B (zh) * 2018-12-05 2020-10-23 清华大学深圳研究生院 基于数据分布的机器深度学习方法
US20220078637A1 (en) * 2018-12-28 2022-03-10 Telefonaktiebolaget Lm Ericsson (Publ) Wireless device, a network node and methods therein for updating a first instance of a machine learning model
CN110333987B (zh) * 2019-07-04 2020-06-02 湖南大学 设备体检报告生成方法、装置、计算机设备和存储介质
CN110502544A (zh) * 2019-08-12 2019-11-26 北京迈格威科技有限公司 数据整合方法、分布式计算节点及分布式深度学习训练系统
CN110502576A (zh) * 2019-08-12 2019-11-26 北京迈格威科技有限公司 数据整合方法、分布式计算节点及分布式深度学习训练系统
CN110852445A (zh) * 2019-10-28 2020-02-28 广州文远知行科技有限公司 分布式机器学习训练方法、装置、计算机设备和存储介质
CN110990870A (zh) * 2019-11-29 2020-04-10 上海能塔智能科技有限公司 运维、使用模型库的处理方法、装置、设备与介质
CN115734244A (zh) * 2021-08-30 2023-03-03 华为技术有限公司 一种通信方法及装置
CN114997337B (zh) * 2022-07-18 2023-01-13 浪潮电子信息产业股份有限公司 信息融合、数据通信方法、装置及电子设备和存储介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104714852A (zh) * 2015-03-17 2015-06-17 华中科技大学 一种适用于分布式机器学习的参数同步优化方法及其系统
CN105956021A (zh) * 2016-04-22 2016-09-21 华中科技大学 一种适用于分布式机器学习的自动化任务并行的方法及其系统
CN106815644A (zh) * 2017-01-26 2017-06-09 北京航空航天大学 机器学习方法和从节点
CN107025205A (zh) * 2016-01-30 2017-08-08 华为技术有限公司 一种分布式系统中的训练模型的方法及设备

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9633315B2 (en) * 2012-04-27 2017-04-25 Excalibur Ip, Llc Method and system for distributed machine learning
CN106779093A (zh) * 2017-01-06 2017-05-31 中国科学院上海高等研究院 基于滑动窗口采样的分布式机器学习训练方法及其系统

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104714852A (zh) * 2015-03-17 2015-06-17 华中科技大学 一种适用于分布式机器学习的参数同步优化方法及其系统
CN107025205A (zh) * 2016-01-30 2017-08-08 华为技术有限公司 一种分布式系统中的训练模型的方法及设备
CN105956021A (zh) * 2016-04-22 2016-09-21 华中科技大学 一种适用于分布式机器学习的自动化任务并行的方法及其系统
CN106815644A (zh) * 2017-01-26 2017-06-09 北京航空航天大学 机器学习方法和从节点

Also Published As

Publication number Publication date
CN107944566B (zh) 2020-12-22
CN107944566A (zh) 2018-04-20

Similar Documents

Publication Publication Date Title
WO2019104713A1 (zh) 一种机器学习方法、主节点、工作节点及系统
US10764125B2 (en) Method and device for training model in distributed system
CN103235835B (zh) 用于数据库集群的查询实现方法和装置
WO2017125015A1 (zh) 分布式系统工作流处理方法和工作流引擎系统
CN108650667B (zh) 终端调度方法和装置
CN109656690A (zh) 调度系统、方法和存储介质
CN110557416B (zh) 一种多节点协同打块的方法及系统
CN113434282B (zh) 流计算任务的发布、输出控制方法及装置
Huang et al. Enabling DNN acceleration with data and model parallelization over ubiquitous end devices
CN112202877B (zh) 网关联动方法、网关、云服务器及用户终端
CN106230914A (zh) 一种基于订阅信息发布的电子白板数据共享系统
CN114328432A (zh) 一种大数据联邦学习处理方法及系统
CN110955504B (zh) 智能分配渲染任务的方法、服务器、系统及存储介质
CN111049900B (zh) 一种物联网流计算调度方法、装置和电子设备
CN113220459B (zh) 一种任务处理方法及装置
CN103914313B (zh) 一种paxos实例更新方法、设备及系统
CN108566294B (zh) 一种支持计算平面的通信网络系统
CN110233791B (zh) 数据去重方法和装置
CN107959710B (zh) 基于云平台的协同建模方法、建模控制服务器和客户端
CN115361382B (zh) 基于数据群组的数据处理方法、装置、设备和存储介质
KR20170126665A (ko) 복수 단말의 상태 정보를 동기화하는 방법 및 그 장치
CN113821313B (zh) 一种任务调度方法、装置及电子设备
Garibay-Martínez et al. Improved holistic analysis for fork–join distributed real-time tasks supported by the FTT-SE protocol
CN111541759A (zh) 一种云平台通信系统及其通信方法
CN110958240A (zh) 消息订阅系统及方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17933380

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17933380

Country of ref document: EP

Kind code of ref document: A1