CN111858058A - SGD load balancing method and device based on parallel computing and storage medium - Google Patents
- Publication number
- CN111858058A (application number CN202010723846.3A)
- Authority
- CN
- China
- Prior art keywords
- nodes
- node
- load balancing
- sub
- parallel
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5083—Techniques for rebalancing the load in a distributed system
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/54—Interprogram communication
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Abstract
The invention discloses an SGD load balancing method based on parallel computing, comprising the following steps: distributed parallel GPU computation is realized through a design pattern combining model parallelism and data parallelism; a semaphore mechanism is adopted for synchronous communication between the master node and the child nodes, and the optimizer in each child container updates the weights using a stochastic gradient descent algorithm. The master node constructs a minimum spanning tree with the errors recorded in the child nodes' control tables as edge weights, identifies the critical nodes in the graph, sequentially retires the non-critical nodes, and reallocates their hardware resources. The method enables multiple model replicas to process different subsets of the training samples simultaneously and to be merged periodically, optimizing the distributed algorithm. The invention provides a new architectural approach to load-balanced computation that improves model development efficiency and reduces development cost; the algorithm adapts well to varying data scales while dynamically managing asynchronous communication among the child containers.
Description
Technical Field
The invention relates to the field of machine learning, and in particular to an SGD load balancing method, device and storage medium based on parallel computing.
Background
Artificial intelligence has already demonstrated great advantages in many fields. Machine learning, a key component of artificial intelligence, helps people make decisions by modeling and training on massive amounts of data.
However, with the rise of big data, data volumes have grown so large that the storage and computing capacity of a single machine can no longer meet the demand. Distributed machine learning has therefore emerged, and accelerating model convergence with distributed machine learning has become the mainstream approach in industry. Two general methods currently dominate distributed machine learning: model parallelism and data parallelism.
However, current parallel computation suffers from the barrel (straggler) effect: the next round of computation cannot begin until the slowest node finishes. Processing different subsets of the training samples on multiple model replicas simultaneously and periodically merging their results improves computing efficiency on large-scale data, but is technically demanding.
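The cost of the barrel effect can be made concrete with a toy calculation (the step times below are purely illustrative): under synchronous parallelism, every step is paced by the slowest node, whereas a perfectly balanced system would be paced by the average.

```python
# Synchronous data parallelism: each step costs the time of the slowest node.
step_times = [1.0, 1.1, 0.9, 3.5]               # per-node compute time, seconds
sync_step = max(step_times)                      # barrel (straggler) effect
ideal_step = sum(step_times) / len(step_times)   # cost under perfect balance
```

Here one straggler more than doubles the step time, which is the inefficiency the load balancing method targets.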
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art by providing an SGD load balancing method, device and storage medium based on parallel computing that combine model parallelism and data parallelism. Compared with the prior art, the method allows multiple model replicas to process different subsets of the training samples simultaneously, periodically merges the replicas' results, and optimizes the distributed algorithm.
The purpose of the invention is realized by the following technical scheme:
the SGD load balancing method based on parallel computing comprises the following steps:
step 1: constructing a parallel gpu computing architecture, constructing a one-way communication graph by adopting a mode of combining a model parallel mode and a data parallel mode, periodically carrying out model circulation among graph nodes, enabling a model to cover a data set, and preferentially distributing hardware equipment for the graph nodes;
step 2: and dynamically managing node hardware resources, realizing synchronous communication between the main node and the sub-nodes by adopting a semaphore mechanism, and updating the weight by adopting a random gradient descent algorithm in the optimizer in the sub-container.
Specifically, building the parallel GPU computing architecture in step 1 comprises the following sub-steps:
S101, configure a management Node Manager, create N containers deployed on different machines (denoted Node nodes), and create a node control table on each child node recording the node ID, the node's data set, and the current batch error;
S102, establish connections among the child nodes to form a one-way connected graph, build a neural network inside each child node, and set the time slice T of one period;
S103, divide the data samples evenly into N parts and feed them to the nodes in sequence; train with the SGD algorithm on the different nodes, each part of the data samples producing a local gradient value through forward and backward propagation, then update the gradients;
S104, in each training period, traverse the graph by hierarchy level, record the unbiased estimate of the model error, and write the error value into the node control table.
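Sub-steps S101 and S103 can be sketched in a few lines. The control-table layout, the tiny linear model, and all names below are illustrative assumptions, not the patent's actual implementation; the point is only to show a per-node data shard whose local SGD step records its batch error into the control table.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class NodeControlEntry:
    """One row of the node control table (S101): node ID, data shard, batch error."""
    node_id: int
    data: np.ndarray
    targets: np.ndarray
    batch_error: float = float("inf")

def local_sgd_step(w, entry, lr=0.01):
    """S103 sketch: one forward/backward pass on the node's shard; the batch
    error is written back to the control table, and updated weights returned."""
    pred = entry.data @ w                          # forward propagation
    err = pred - entry.targets
    grad = entry.data.T @ err / len(entry.data)    # backward propagation
    entry.batch_error = float(np.mean(err ** 2))   # record current batch error
    return w - lr * grad                           # SGD weight update

# Split the samples evenly into N shards, one per Node (S101/S103).
rng = np.random.default_rng(0)
X, y = rng.normal(size=(40, 3)), rng.normal(size=40)
N = 4
table = [NodeControlEntry(i, X[i::N], y[i::N]) for i in range(N)]
weights = [np.zeros(3) for _ in range(N)]
for _ in range(50):                                # 50 local training steps
    weights = [local_sgd_step(w, e) for w, e in zip(weights, table)]
```

After training, each control-table entry holds the latest batch error that the master node later reads in step 2.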
Specifically, the graph traversal in sub-step S104 proceeds as follows: the parameters output by an upper-layer node, such as weights and biases, are packed into an NN object for transmission; after the current node receives the NN object from the upper-layer node, it trains with the NN object as a hidden layer; if the current node has several upper-layer nodes, the NN objects transmitted from them are merged by taking their mean, and the mean is used as the hidden layer for training.
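The merging rule for multi-parent nodes (element-wise mean of the transmitted parameters) might look like the sketch below; `NNObject` and its fields are hypothetical stand-ins for the patent's NN object.

```python
import numpy as np

class NNObject:
    """Hypothetical container for the parameters a node transmits downstream:
    the weight matrix and bias vector of its output layer."""
    def __init__(self, weights, biases):
        self.weights = np.asarray(weights, dtype=float)
        self.biases = np.asarray(biases, dtype=float)

def merge_nn_objects(objects):
    """Merge NN objects from several upper-layer nodes by element-wise mean,
    as described for multi-parent nodes in S104."""
    return NNObject(
        np.mean([o.weights for o in objects], axis=0),
        np.mean([o.biases for o in objects], axis=0),
    )

a = NNObject([[1.0, 2.0]], [0.0])
b = NNObject([[3.0, 4.0]], [2.0])
merged = merge_nn_objects([a, b])   # weights [[2., 3.]], biases [1.]
```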
Specifically, dynamically managing node hardware resources in step 2 comprises the following sub-steps:
S201, in each period, the master node queries the node control table, constructs a minimum spanning tree with the errors in the table as edge weights, and sorts the weights within the minimum spanning tree;
S202, when the training model is about to converge, the master node ranks the nodes according to each period's minimum spanning tree in the node control table and the weights, and sends a synchronization signal to the critical nodes;
S203, the master node then reclaims, in turn, the tasks of the nodes in the one-way communication graph that did not receive a synchronization signal, and allocates their hardware resources to the adjacent critical nodes to accelerate their computation, until all nodes have finished the training task.
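A minimal sketch of S201's minimum-spanning-tree construction, assuming the control-table errors arrive as per-edge weights in `(error, u, v)` tuples (the exact edge-weight derivation is not specified in the text); Kruskal's algorithm with a small union-find is used here as one standard choice.

```python
def minimum_spanning_tree(n, edges):
    """S201 sketch: Kruskal's algorithm over n nodes, where each edge weight
    is an error read from the node control table."""
    parent = list(range(n))

    def find(x):
        # Union-find with path halving to detect cycles cheaply.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst = []
    for w, u, v in sorted(edges):      # consider edges in increasing error order
        ru, rv = find(u), find(v)
        if ru != rv:                   # joins two components: keep this edge
            parent[ru] = rv
            mst.append((w, u, v))
    return mst

# 4 nodes; edge weights are hypothetical per-edge errors from the control table.
edges = [(0.9, 0, 1), (0.2, 1, 2), (0.5, 0, 2), (0.1, 2, 3)]
tree = minimum_spanning_tree(4, edges)
```

The resulting tree's edges, already sorted by weight, give the master node the ranking it needs for S202.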
A computing device comprising a memory in which computer-executable instructions are stored, and a processor that implements the steps of the above load balancing method when executing those instructions.
A computer-readable storage medium on which a computer program is stored which, when executed by a processor, carries out the steps of the above load balancing method.
The invention has the following beneficial effects: it provides a new architectural approach to load-balanced computation, improves model development efficiency, reduces development cost, makes the algorithm adapt well to varying data scales, and dynamically manages asynchronous communication among the child containers.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a diagram of a parallel computing architecture of the present invention.
FIG. 3 is a diagram of the present invention implementing dynamic management of node hardware resources using a semaphore mechanism.
Detailed Description
To make the technical features, objects and effects of the present invention clearer, embodiments of the invention are now described with reference to the accompanying drawings.
In this embodiment, as shown in FIG. 1, the SGD load balancing method based on parallel computing mainly comprises the following steps:
Step 1: construct a parallel GPU computing architecture: build a one-way communication graph using a combination of model parallelism and data parallelism, circulate the models among the graph nodes periodically so that every model covers the whole data set, and preferentially allocate hardware devices to the graph nodes;
Step 2: dynamically manage node hardware resources: use a semaphore mechanism to realize synchronous communication between the master node and the child nodes, and let the optimizer in each child container update the weights with a stochastic gradient descent algorithm.
In this embodiment, FIG. 2 shows the structure of the SGD load balancing method based on parallel computing. The implementation proceeds as follows. First, a management Node Manager is configured and N containers are created and deployed on different machines, denoted Node nodes; a node control table is created on each child node to record the node ID, the node's data set, and the current batch error. Connections are established among the child nodes to form a one-way connected graph (the graph nodes are GPU hardware devices), a neural network is built inside each child node, and the time slice T of one period is set. The data samples are divided evenly into N parts and fed to the nodes in sequence; the SGD algorithm trains on the different nodes, each part of the data samples producing a local gradient value through forward and backward propagation, after which the gradients are updated. In each training period, the graph is traversed by hierarchy level, the unbiased estimate of the model error is recorded, and the error value is written into the node control table. During traversal, weights and biases must be transmitted between adjacent nodes; because the neural network is complex and has many parameters, the parameters are packed into an NN object for transmission, and after a node receives the NN object from an upper-layer node it trains with the NN object as a hidden layer. If a node has several upper-layer nodes, the NN objects transmitted from them are merged by taking their mean, which is then used as the hidden layer for training. The models circulate periodically, so each model eventually runs over all the data.
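The per-period hierarchy traversal described above can be sketched as a level-order walk of the one-way graph: each level contains the nodes whose upper-layer NN objects have all arrived. The graph encoding and node names below are illustrative only.

```python
from collections import deque

def traverse_by_level(graph, roots):
    """Visit a one-way communication graph level by level: a node is emitted
    only once all of its upper-layer nodes have been visited, mirroring the
    order in which nodes could receive and merge NN objects."""
    indegree = {}
    for u, outs in graph.items():
        indegree.setdefault(u, 0)
        for v in outs:
            indegree[v] = indegree.get(v, 0) + 1

    queue = deque(roots)
    levels, seen = [], set(roots)
    while queue:
        level = list(queue)
        levels.append(level)
        queue = deque()
        for u in level:
            for v in graph.get(u, []):
                indegree[v] -= 1               # one more parent delivered
                if indegree[v] == 0 and v not in seen:
                    seen.add(v)
                    queue.append(v)
    return levels

# Diamond-shaped one-way graph: A feeds B and C, which both feed D.
g = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
order = traverse_by_level(g, ["A"])
```

Node D, which has two upper-layer nodes, only appears in the final level, after both B's and C's parameters are available for merging.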
With the architecture of step 1, after training for some time the error of some nodes decreases very slowly, so reaching convergence takes a very long training time; this greatly hurts training efficiency and also generates a large amount of invalid computation, wasting hardware resources. The invention therefore introduces a semaphore mechanism to realize synchronous communication between the master node and the child nodes and to manage node hardware resources dynamically.
In this embodiment, FIG. 3 illustrates the dynamic management of node hardware resources with the semaphore mechanism. In each period, the master node queries the node control table, constructs a minimum spanning tree with the errors in the table as edge weights, and sorts the weights within the tree. After a certain number of periods (when the model is about to converge), the master node ranks the nodes according to each period's minimum spanning tree and the weights, and sends synchronization signals to the critical nodes. It then reclaims, in turn, the tasks of the nodes that did not receive a synchronization signal and allocates their hardware resources to the adjacent critical nodes, accelerating those nodes' computation and improving the efficiency of the whole model.
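The semaphore handshake between master and children might be sketched as follows; the timeout value, the "retired" bookkeeping, and all names are illustrative assumptions rather than the patent's protocol. The master posts one token per critical node; children that acquire a token continue training, and the rest are retired so their resources can be handed to neighbours.

```python
import threading

def run_period(n_children, critical):
    """Sketch of one period of the semaphore mechanism: the master releases
    one 'sync' token per critical node; a child that acquires a token
    continues, the rest time out and are retired."""
    sync = threading.Semaphore(0)
    results = {}

    def child(node_id):
        # A child blocks until the master grants it a synchronization signal.
        got_signal = sync.acquire(timeout=0.5)
        results[node_id] = "continue" if got_signal else "retired"

    threads = [threading.Thread(target=child, args=(i,)) for i in range(n_children)]
    for t in threads:
        t.start()
    for _ in critical:              # master: one signal per critical node
        sync.release()
    for t in threads:
        t.join()
    return results

status = run_period(n_children=4, critical=[0, 1])
# exactly two children continue and two are retired
# (which two continue depends on thread scheduling)
```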
The architectural approach adopted by the invention effectively reduces the loss value, improves model development efficiency, reduces development cost, and adapts well to varying data scales.
In addition, the invention provides a computing device and a computer-readable storage medium. The computing device comprises a memory in which computer-executable instructions are stored, and a processor that implements all the processes and steps of the load balancing method of the embodiment when executing those instructions. The computer-readable storage medium stores a computer program which, when executed by a processor, carries out all the steps of the above load balancing method.
The foregoing shows and describes the general principles, main features and advantages of the present invention. Those skilled in the art will understand that the invention is not limited to the embodiments described above, which merely illustrate its principle; various changes and modifications may be made without departing from the spirit and scope of the invention, and such changes and modifications fall within the scope claimed. The scope of the invention is defined by the appended claims and their equivalents.
Claims (6)
1. An SGD load balancing method based on parallel computing, characterized by comprising the following steps:
step 1: constructing a parallel GPU computing architecture: building a one-way communication graph using a combination of model parallelism and data parallelism, circulating the models among the graph nodes periodically so that every model covers the whole data set, and preferentially allocating hardware devices to the graph nodes;
step 2: dynamically managing node hardware resources: using a semaphore mechanism to realize synchronous communication between the master node and the child nodes, the optimizer in each child container updating the weights with a stochastic gradient descent algorithm.
2. The SGD load balancing method based on parallel computing according to claim 1, characterized in that building the parallel GPU computing architecture in step 1 comprises the following sub-steps:
S101, configuring a management Node Manager, creating N containers deployed on different machines (denoted Node nodes), and creating a node control table on each child node recording the node ID, the node's data set, and the current batch error;
S102, establishing connections among the child nodes to form a one-way connected graph, building a neural network inside each child node, and setting the time slice T of one period;
S103, dividing the data samples evenly into N parts and feeding them to the nodes in sequence, training with the SGD algorithm on the different nodes, each part of the data samples producing a local gradient value through forward and backward propagation, and updating the gradients; and S104, in each training period, traversing the graph by hierarchy level, recording the unbiased estimate of the model error, and writing the error value into the node control table.
3. The SGD load balancing method according to claim 2, characterized in that the graph traversal in sub-step S104 comprises: packing the parameters output by an upper-layer node, such as weights and biases, into an NN object for transmission; after the current node receives the NN object from the upper-layer node, training with the NN object as a hidden layer; and, if the current node has several upper-layer nodes, merging the NN objects transmitted from them and taking their mean as the hidden layer for training.
4. The SGD load balancing method based on parallel computing according to claim 1, characterized in that dynamically managing node hardware resources in step 2 comprises the following sub-steps:
S201, in each period, querying the node control table through the master node, constructing a minimum spanning tree with the errors in the table as edge weights, and sorting the weights within the minimum spanning tree;
S202, when the training model is about to converge, the master node ranking the nodes according to each period's minimum spanning tree in the node control table and the weights, and sending a synchronization signal to the critical nodes;
and S203, the master node reclaiming, in turn, the tasks of the nodes in the one-way communication graph that did not receive a synchronization signal, and allocating their hardware resources to the adjacent critical nodes to accelerate their computation, until all nodes have finished the training task.
5. A computing device, comprising
a memory having computer-executable instructions stored therein; and
a processor for implementing the steps of the load balancing method according to any one of claims 1 to 4 when executing the instructions.
6. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the load balancing method according to any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010723846.3A CN111858058A (en) | 2020-07-24 | 2020-07-24 | SGD load balancing method and device based on parallel computing and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010723846.3A CN111858058A (en) | 2020-07-24 | 2020-07-24 | SGD load balancing method and device based on parallel computing and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111858058A true CN111858058A (en) | 2020-10-30 |
Family
ID=72950115
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010723846.3A Pending CN111858058A (en) | 2020-07-24 | 2020-07-24 | SGD load balancing method and device based on parallel computing and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111858058A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112598118A (en) * | 2021-03-03 | 2021-04-02 | 成都晓多科技有限公司 | Method, device, storage medium and equipment for processing abnormal labeling in supervised learning |
CN114167828A (en) * | 2021-12-03 | 2022-03-11 | 润电能源科学技术有限公司 | External hanging control method of DCS controller and related device |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106339351A (en) * | 2016-08-30 | 2017-01-18 | 浪潮(北京)电子信息产业有限公司 | SGD (Stochastic Gradient Descent) algorithm optimization system and method |
CN108304918A (en) * | 2018-01-18 | 2018-07-20 | 中兴飞流信息科技有限公司 | A kind of the parameter exchange method and system of the deep learning of data parallel |
CN108921196A (en) * | 2018-06-01 | 2018-11-30 | 南京邮电大学 | A kind of semantic segmentation method for improving full convolutional neural networks |
CN110678843A (en) * | 2017-04-17 | 2020-01-10 | 微软技术许可有限责任公司 | Dynamically partitioning workloads in deep neural network modules to reduce power consumption |
CN110795228A (en) * | 2018-08-03 | 2020-02-14 | 伊姆西Ip控股有限责任公司 | Adaptive batch dataset partitioning for distributed deep learning using accelerator mixture sets |
CN111178486A (en) * | 2019-11-27 | 2020-05-19 | 湖州师范学院 | Hyper-parameter asynchronous parallel search method based on population evolution |
WO2020102526A1 (en) * | 2018-11-14 | 2020-05-22 | North Carolina State University | Deep neural network with compositional grammatical architectures |
US20200175422A1 (en) * | 2018-11-29 | 2020-06-04 | International Business Machines Corporation | Asynchronous gradient weight compression |
- 2020-07-24: CN application CN202010723846.3A filed; published as CN111858058A (status: pending)
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106339351A (en) * | 2016-08-30 | 2017-01-18 | 浪潮(北京)电子信息产业有限公司 | SGD (Stochastic Gradient Descent) algorithm optimization system and method |
CN110678843A (en) * | 2017-04-17 | 2020-01-10 | 微软技术许可有限责任公司 | Dynamically partitioning workloads in deep neural network modules to reduce power consumption |
CN108304918A (en) * | 2018-01-18 | 2018-07-20 | 中兴飞流信息科技有限公司 | A kind of the parameter exchange method and system of the deep learning of data parallel |
CN108921196A (en) * | 2018-06-01 | 2018-11-30 | 南京邮电大学 | A kind of semantic segmentation method for improving full convolutional neural networks |
CN110795228A (en) * | 2018-08-03 | 2020-02-14 | 伊姆西Ip控股有限责任公司 | Adaptive batch dataset partitioning for distributed deep learning using accelerator mixture sets |
WO2020102526A1 (en) * | 2018-11-14 | 2020-05-22 | North Carolina State University | Deep neural network with compositional grammatical architectures |
US20200175422A1 (en) * | 2018-11-29 | 2020-06-04 | International Business Machines Corporation | Asynchronous gradient weight compression |
CN111178486A (en) * | 2019-11-27 | 2020-05-19 | 湖州师范学院 | Hyper-parameter asynchronous parallel search method based on population evolution |
Non-Patent Citations (2)
Title |
---|
CHENG DANING et al.: "Weighted parallel SGD for distributed unbalanced-workload training system", Journal of Parallel and Distributed Computing |
LU Shuxia et al.: "Weighted zeroth-order stochastic gradient descent algorithm with variance reduction" (in Chinese), Journal of Hebei University (Natural Science Edition) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112598118A (en) * | 2021-03-03 | 2021-04-02 | 成都晓多科技有限公司 | Method, device, storage medium and equipment for processing abnormal labeling in supervised learning |
CN112598118B (en) * | 2021-03-03 | 2021-06-25 | 成都晓多科技有限公司 | Method, device, storage medium and equipment for processing abnormal labeling in supervised learning |
CN114167828A (en) * | 2021-12-03 | 2022-03-11 | 润电能源科学技术有限公司 | External hanging control method of DCS controller and related device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110533183B (en) | Task placement method for heterogeneous network perception in pipeline distributed deep learning | |
CN109032671B (en) | Distributed deep learning method and system based on data parallel strategy | |
Sun et al. | Optimizing network performance for distributed dnn training on gpu clusters: Imagenet/alexnet training in 1.5 minutes | |
US10171284B2 (en) | Reachability-based coordination for cyclic dataflow | |
CN114756383B (en) | Distributed computing method, system, equipment and storage medium | |
CN103970580B (en) | A kind of data flow towards multinuclear cluster compiles optimization method | |
CN109754060A (en) | A kind of training method and device of neural network machine learning model | |
CN107330516A (en) | Model parameter training method, apparatus and system | |
CN110222005A (en) | Data processing system and its method for isomery framework | |
CN111858058A (en) | SGD load balancing method and device based on parallel computing and storage medium | |
Van Tendeloo et al. | PythonPDEVS: a distributed parallel DEVS simulator. | |
CN108564164A (en) | A kind of parallelization deep learning method based on SPARK platforms | |
Zhan et al. | Pipe-torch: Pipeline-based distributed deep learning in a gpu cluster with heterogeneous networking | |
Sun et al. | Gradientflow: Optimizing network performance for large-scale distributed dnn training | |
CN111274036A (en) | Deep learning task scheduling method based on speed prediction | |
CN111241301A (en) | Knowledge graph representation learning-oriented distributed framework construction method | |
Sun et al. | Gssp: eliminating stragglers through grouping synchronous for distributed deep learning in heterogeneous cluster | |
Osawa et al. | Pipefisher: Efficient training of large language models using pipelining and fisher information matrices | |
Wu et al. | Rethinking memory and communication cost for efficient large language model training | |
CN110135067B (en) | Helicopter flow field overlapping mixed grid parallel method under double time step method | |
CN116755876A (en) | Large model hybrid parallel training acceleration method and system | |
CN112446484A (en) | Multitask training cluster intelligent network system and cluster network optimization method | |
CN116400963A (en) | Model automatic parallel method, device and storage medium based on load balancing | |
Wang et al. | A coordinated two-stages virtual network embedding algorithm based on reinforcement learning | |
CN115600673A (en) | Method and system for parallel training DNN model for multi-machine multi-card computing system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
AD01 | Patent right deemed abandoned | ||
Effective date of abandoning: 20221209 |