CN111858058A - SGD load balancing method and device based on parallel computing and storage medium - Google Patents
- Publication number
- CN111858058A (application number CN202010723846.3A)
- Authority
- CN
- China
- Prior art keywords
- nodes
- node
- load balancing
- sub
- parallel
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5083—Techniques for rebalancing the load in a distributed system
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/54—Interprogram communication
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Abstract
The invention discloses an SGD load balancing method based on parallel computing, comprising the following steps: distributed parallel GPU computation is realized through a design pattern combining model parallelism and data parallelism; a semaphore mechanism is adopted for synchronous communication between the master node and the child nodes, and the optimizer in each child container updates the weights using a stochastic gradient descent algorithm. The master node constructs a minimum spanning tree with the errors recorded in the child nodes' control tables as edge weights, identifies the critical nodes in the graph, sequentially retires the non-critical nodes, and reallocates their hardware resources. The method enables multiple model replicas to process different subsets of the training samples simultaneously and to be merged periodically, optimizing the distributed algorithm. The invention provides a new architectural approach to load-balanced computation that improves model development efficiency and reduces development cost; the algorithm adapts well to varying data scales while dynamically managing asynchronous communication among the child containers.
Description
Technical Field
The invention relates to the field of machine learning, and in particular to an SGD load balancing method, device and storage medium based on parallel computing.
Background
Artificial intelligence has already demonstrated great advantages in many fields. Machine learning, a key component of artificial intelligence, helps people make decisions by modeling and training on massive amounts of data.
However, with the rise of big data, data volumes have grown so large that the storage and computing capacity of a single machine can no longer meet the demand. Distributed machine learning has therefore emerged, and accelerating model convergence with distributed machine learning has become the mainstream approach in industry. Two general methods currently dominate distributed machine learning: model parallelism and data parallelism.
However, current parallel computation suffers from the barrel (straggler) effect: the next round of computation cannot begin until the slowest node finishes. Processing different subsets of the training samples on multiple model replicas simultaneously and periodically merging their results improves computing efficiency on large-scale data, but is technically demanding.
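The cost of the barrel effect can be made concrete with a toy calculation (the step times below are purely illustrative): under synchronous parallelism, every step is paced by the slowest node, whereas a perfectly balanced system would be paced by the average.

```python
# Synchronous data parallelism: each step costs the time of the slowest node.
step_times = [1.0, 1.1, 0.9, 3.5]               # per-node compute time, seconds
sync_step = max(step_times)                      # barrel (straggler) effect
ideal_step = sum(step_times) / len(step_times)   # cost under perfect balance
```

Here one straggler more than doubles the step time, which is the inefficiency the load balancing method targets.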
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art by providing an SGD load balancing method, device and storage medium based on parallel computing that combine model parallelism and data parallelism. Compared with the prior art, the method allows multiple model replicas to process different subsets of the training samples simultaneously, periodically merges the replicas' results, and optimizes the distributed algorithm.
The purpose of the invention is realized by the following technical scheme:
the SGD load balancing method based on parallel computing comprises the following steps:
step 1: constructing a parallel gpu computing architecture, constructing a one-way communication graph by adopting a mode of combining a model parallel mode and a data parallel mode, periodically carrying out model circulation among graph nodes, enabling a model to cover a data set, and preferentially distributing hardware equipment for the graph nodes;
step 2: and dynamically managing node hardware resources, realizing synchronous communication between the main node and the sub-nodes by adopting a semaphore mechanism, and updating the weight by adopting a random gradient descent algorithm in the optimizer in the sub-container.
Specifically, building the parallel GPU computing architecture in step 1 comprises the following sub-steps:
S101, configure a management Node Manager, create N containers deployed on different machines (denoted Node nodes), and create a node control table on each child node recording the node ID, the node's data set, and the current batch error;
S102, establish connections among the child nodes to form a one-way connected graph, build a neural network inside each child node, and set the time slice T of one period;
S103, divide the data samples evenly into N parts and feed them to the nodes in sequence; train with the SGD algorithm on the different nodes, each part of the data samples producing a local gradient value through forward and backward propagation, then update the gradients;
S104, in each training period, traverse the graph by hierarchy level, record the unbiased estimate of the model error, and write the error value into the node control table.
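Sub-steps S101 and S103 can be sketched in a few lines. The control-table layout, the tiny linear model, and all names below are illustrative assumptions, not the patent's actual implementation; the point is only to show a per-node data shard whose local SGD step records its batch error into the control table.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class NodeControlEntry:
    """One row of the node control table (S101): node ID, data shard, batch error."""
    node_id: int
    data: np.ndarray
    targets: np.ndarray
    batch_error: float = float("inf")

def local_sgd_step(w, entry, lr=0.01):
    """S103 sketch: one forward/backward pass on the node's shard; the batch
    error is written back to the control table, and updated weights returned."""
    pred = entry.data @ w                          # forward propagation
    err = pred - entry.targets
    grad = entry.data.T @ err / len(entry.data)    # backward propagation
    entry.batch_error = float(np.mean(err ** 2))   # record current batch error
    return w - lr * grad                           # SGD weight update

# Split the samples evenly into N shards, one per Node (S101/S103).
rng = np.random.default_rng(0)
X, y = rng.normal(size=(40, 3)), rng.normal(size=40)
N = 4
table = [NodeControlEntry(i, X[i::N], y[i::N]) for i in range(N)]
weights = [np.zeros(3) for _ in range(N)]
for _ in range(50):                                # 50 local training steps
    weights = [local_sgd_step(w, e) for w, e in zip(weights, table)]
```

After training, each control-table entry holds the latest batch error that the master node later reads in step 2.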
Specifically, the graph traversal in sub-step S104 proceeds as follows: the parameters output by an upper-layer node, such as weights and biases, are packed into an NN object for transmission; after the current node receives the NN object from the upper-layer node, it trains with the NN object as a hidden layer; if the current node has several upper-layer nodes, the NN objects transmitted from them are merged by taking their mean, and the mean is used as the hidden layer for training.
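The merging rule for multi-parent nodes (element-wise mean of the transmitted parameters) might look like the sketch below; `NNObject` and its fields are hypothetical stand-ins for the patent's NN object.

```python
import numpy as np

class NNObject:
    """Hypothetical container for the parameters a node transmits downstream:
    the weight matrix and bias vector of its output layer."""
    def __init__(self, weights, biases):
        self.weights = np.asarray(weights, dtype=float)
        self.biases = np.asarray(biases, dtype=float)

def merge_nn_objects(objects):
    """Merge NN objects from several upper-layer nodes by element-wise mean,
    as described for multi-parent nodes in S104."""
    return NNObject(
        np.mean([o.weights for o in objects], axis=0),
        np.mean([o.biases for o in objects], axis=0),
    )

a = NNObject([[1.0, 2.0]], [0.0])
b = NNObject([[3.0, 4.0]], [2.0])
merged = merge_nn_objects([a, b])   # weights [[2., 3.]], biases [1.]
```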
Specifically, dynamically managing node hardware resources in step 2 comprises the following sub-steps:
S201, in each period, the master node queries the node control table, constructs a minimum spanning tree with the errors in the table as edge weights, and sorts the weights within the minimum spanning tree;
S202, when the training model is about to converge, the master node ranks the nodes according to each period's minimum spanning tree in the node control table and the weights, and sends a synchronization signal to the critical nodes;
S203, the master node then reclaims, in turn, the tasks of the nodes in the one-way communication graph that did not receive a synchronization signal, and allocates their hardware resources to the adjacent critical nodes to accelerate their computation, until all nodes have finished the training task.
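A minimal sketch of S201's minimum-spanning-tree construction, assuming the control-table errors arrive as per-edge weights in `(error, u, v)` tuples (the exact edge-weight derivation is not specified in the text); Kruskal's algorithm with a small union-find is used here as one standard choice.

```python
def minimum_spanning_tree(n, edges):
    """S201 sketch: Kruskal's algorithm over n nodes, where each edge weight
    is an error read from the node control table."""
    parent = list(range(n))

    def find(x):
        # Union-find with path halving to detect cycles cheaply.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst = []
    for w, u, v in sorted(edges):      # consider edges in increasing error order
        ru, rv = find(u), find(v)
        if ru != rv:                   # joins two components: keep this edge
            parent[ru] = rv
            mst.append((w, u, v))
    return mst

# 4 nodes; edge weights are hypothetical per-edge errors from the control table.
edges = [(0.9, 0, 1), (0.2, 1, 2), (0.5, 0, 2), (0.1, 2, 3)]
tree = minimum_spanning_tree(4, edges)
```

The resulting tree's edges, already sorted by weight, give the master node the ranking it needs for S202.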
A computing device comprising a memory in which computer-executable instructions are stored, and a processor that implements the steps of the above load balancing method when executing those instructions.
A computer-readable storage medium on which a computer program is stored which, when executed by a processor, carries out the steps of the above load balancing method.
The invention has the following beneficial effects: it provides a new architectural approach to load-balanced computation, improves model development efficiency, reduces development cost, makes the algorithm adapt well to varying data scales, and dynamically manages asynchronous communication among the child containers.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a diagram of a parallel computing architecture of the present invention.
FIG. 3 is a diagram of the present invention implementing dynamic management of node hardware resources using a semaphore mechanism.
Detailed Description
To make the technical features, objects and effects of the present invention clearer, embodiments of the invention are now described with reference to the accompanying drawings.
In this embodiment, as shown in FIG. 1, the SGD load balancing method based on parallel computing mainly comprises the following steps:
Step 1: construct a parallel GPU computing architecture: build a one-way communication graph using a combination of model parallelism and data parallelism, circulate the models among the graph nodes periodically so that every model covers the whole data set, and preferentially allocate hardware devices to the graph nodes;
Step 2: dynamically manage node hardware resources: use a semaphore mechanism to realize synchronous communication between the master node and the child nodes, and let the optimizer in each child container update the weights with a stochastic gradient descent algorithm.
In this embodiment, FIG. 2 shows the structure of the SGD load balancing method based on parallel computing. The implementation proceeds as follows. First, a management Node Manager is configured and N containers are created and deployed on different machines, denoted Node nodes; a node control table is created on each child node to record the node ID, the node's data set, and the current batch error. Connections are established among the child nodes to form a one-way connected graph (the graph nodes are GPU hardware devices), a neural network is built inside each child node, and the time slice T of one period is set. The data samples are divided evenly into N parts and fed to the nodes in sequence; the SGD algorithm trains on the different nodes, each part of the data samples producing a local gradient value through forward and backward propagation, after which the gradients are updated. In each training period, the graph is traversed by hierarchy level, the unbiased estimate of the model error is recorded, and the error value is written into the node control table. During traversal, weights and biases must be transmitted between adjacent nodes; because the neural network is complex and has many parameters, the parameters are packed into an NN object for transmission, and after a node receives the NN object from an upper-layer node it trains with the NN object as a hidden layer. If a node has several upper-layer nodes, the NN objects transmitted from them are merged by taking their mean, which is then used as the hidden layer for training. The models circulate periodically, so each model eventually runs over all the data.
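The per-period hierarchy traversal described above can be sketched as a level-order walk of the one-way graph: each level contains the nodes whose upper-layer NN objects have all arrived. The graph encoding and node names below are illustrative only.

```python
from collections import deque

def traverse_by_level(graph, roots):
    """Visit a one-way communication graph level by level: a node is emitted
    only once all of its upper-layer nodes have been visited, mirroring the
    order in which nodes could receive and merge NN objects."""
    indegree = {}
    for u, outs in graph.items():
        indegree.setdefault(u, 0)
        for v in outs:
            indegree[v] = indegree.get(v, 0) + 1

    queue = deque(roots)
    levels, seen = [], set(roots)
    while queue:
        level = list(queue)
        levels.append(level)
        queue = deque()
        for u in level:
            for v in graph.get(u, []):
                indegree[v] -= 1               # one more parent delivered
                if indegree[v] == 0 and v not in seen:
                    seen.add(v)
                    queue.append(v)
    return levels

# Diamond-shaped one-way graph: A feeds B and C, which both feed D.
g = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
order = traverse_by_level(g, ["A"])
```

Node D, which has two upper-layer nodes, only appears in the final level, after both B's and C's parameters are available for merging.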
With the architecture of step 1, after training for some time the error of some nodes decreases very slowly, so reaching convergence takes a very long training time; this greatly hurts training efficiency and also generates a large amount of invalid computation, wasting hardware resources. The invention therefore introduces a semaphore mechanism to realize synchronous communication between the master node and the child nodes and to manage node hardware resources dynamically.
In this embodiment, FIG. 3 illustrates the dynamic management of node hardware resources with the semaphore mechanism. In each period, the master node queries the node control table, constructs a minimum spanning tree with the errors in the table as edge weights, and sorts the weights within the tree. After a certain number of periods (when the model is about to converge), the master node ranks the nodes according to each period's minimum spanning tree and the weights, and sends synchronization signals to the critical nodes. It then reclaims, in turn, the tasks of the nodes that did not receive a synchronization signal and allocates their hardware resources to the adjacent critical nodes, accelerating those nodes' computation and improving the efficiency of the whole model.
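The semaphore handshake between master and children might be sketched as follows; the timeout value, the "retired" bookkeeping, and all names are illustrative assumptions rather than the patent's protocol. The master posts one token per critical node; children that acquire a token continue training, and the rest are retired so their resources can be handed to neighbours.

```python
import threading

def run_period(n_children, critical):
    """Sketch of one period of the semaphore mechanism: the master releases
    one 'sync' token per critical node; a child that acquires a token
    continues, the rest time out and are retired."""
    sync = threading.Semaphore(0)
    results = {}

    def child(node_id):
        # A child blocks until the master grants it a synchronization signal.
        got_signal = sync.acquire(timeout=0.5)
        results[node_id] = "continue" if got_signal else "retired"

    threads = [threading.Thread(target=child, args=(i,)) for i in range(n_children)]
    for t in threads:
        t.start()
    for _ in critical:              # master: one signal per critical node
        sync.release()
    for t in threads:
        t.join()
    return results

status = run_period(n_children=4, critical=[0, 1])
# exactly two children continue and two are retired
# (which two continue depends on thread scheduling)
```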
The architectural approach adopted by the invention effectively reduces the loss value, improves model development efficiency, reduces development cost, and adapts well to varying data scales.
In addition, the invention provides a computing device and a computer-readable storage medium. The computing device comprises a memory in which computer-executable instructions are stored, and a processor that implements all the processes and steps of the load balancing method of the embodiment when executing those instructions. The computer-readable storage medium stores a computer program which, when executed by a processor, carries out all the steps of the above load balancing method.
The foregoing shows and describes the general principles, main features and advantages of the present invention. Those skilled in the art will understand that the invention is not limited to the embodiments described above, which merely illustrate its principle; various changes and modifications may be made without departing from the spirit and scope of the invention, and such changes and modifications fall within the scope claimed. The scope of the invention is defined by the appended claims and their equivalents.
Claims (6)
1. An SGD load balancing method based on parallel computing, characterized by comprising the following steps:
step 1: constructing a parallel GPU computing architecture: building a one-way communication graph using a combination of model parallelism and data parallelism, circulating the models among the graph nodes periodically so that every model covers the whole data set, and preferentially allocating hardware devices to the graph nodes;
step 2: dynamically managing node hardware resources: using a semaphore mechanism to realize synchronous communication between the master node and the child nodes, the optimizer in each child container updating the weights with a stochastic gradient descent algorithm.
2. The SGD load balancing method based on parallel computing according to claim 1, characterized in that building the parallel GPU computing architecture in step 1 comprises the following sub-steps:
S101, configuring a management Node Manager, creating N containers deployed on different machines (denoted Node nodes), and creating a node control table on each child node recording the node ID, the node's data set, and the current batch error;
S102, establishing connections among the child nodes to form a one-way connected graph, building a neural network inside each child node, and setting the time slice T of one period;
S103, dividing the data samples evenly into N parts and feeding them to the nodes in sequence, training with the SGD algorithm on the different nodes, each part of the data samples producing a local gradient value through forward and backward propagation, and updating the gradients; and S104, in each training period, traversing the graph by hierarchy level, recording the unbiased estimate of the model error, and writing the error value into the node control table.
3. The SGD load balancing method according to claim 2, characterized in that the graph traversal in sub-step S104 comprises: packing the parameters output by an upper-layer node, such as weights and biases, into an NN object for transmission; after the current node receives the NN object from the upper-layer node, training with the NN object as a hidden layer; and, if the current node has several upper-layer nodes, merging the NN objects transmitted from them and taking their mean as the hidden layer for training.
4. The SGD load balancing method based on parallel computing according to claim 1, characterized in that dynamically managing node hardware resources in step 2 comprises the following sub-steps:
S201, in each period, querying the node control table through the master node, constructing a minimum spanning tree with the errors in the table as edge weights, and sorting the weights within the minimum spanning tree;
S202, when the training model is about to converge, the master node ranking the nodes according to each period's minimum spanning tree in the node control table and the weights, and sending a synchronization signal to the critical nodes;
and S203, the master node reclaiming, in turn, the tasks of the nodes in the one-way communication graph that did not receive a synchronization signal, and allocating their hardware resources to the adjacent critical nodes to accelerate their computation, until all nodes have finished the training task.
5. A computing device, comprising
a memory having computer-executable instructions stored therein; and
a processor for implementing the steps of the load balancing method according to any one of claims 1 to 4 when executing the instructions.
6. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the load balancing method according to any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010723846.3A CN111858058A (en) | 2020-07-24 | 2020-07-24 | SGD load balancing method and device based on parallel computing and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010723846.3A CN111858058A (en) | 2020-07-24 | 2020-07-24 | SGD load balancing method and device based on parallel computing and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111858058A true CN111858058A (en) | 2020-10-30 |
Family
ID=72950115
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010723846.3A Pending CN111858058A (en) | 2020-07-24 | 2020-07-24 | SGD load balancing method and device based on parallel computing and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111858058A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112598118A (en) * | 2021-03-03 | 2021-04-02 | 成都晓多科技有限公司 | Method, device, storage medium and equipment for processing abnormal labeling in supervised learning |
CN114167828A (en) * | 2021-12-03 | 2022-03-11 | 润电能源科学技术有限公司 | External hanging control method of DCS controller and related device |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106339351A (en) * | 2016-08-30 | 2017-01-18 | 浪潮(北京)电子信息产业有限公司 | SGD (Stochastic Gradient Descent) algorithm optimization system and method |
CN108304918A (en) * | 2018-01-18 | 2018-07-20 | 中兴飞流信息科技有限公司 | A kind of the parameter exchange method and system of the deep learning of data parallel |
CN108921196A (en) * | 2018-06-01 | 2018-11-30 | 南京邮电大学 | A kind of semantic segmentation method for improving full convolutional neural networks |
CN110678843A (en) * | 2017-04-17 | 2020-01-10 | 微软技术许可有限责任公司 | Dynamically partitioning workloads in deep neural network modules to reduce power consumption |
CN110795228A (en) * | 2018-08-03 | 2020-02-14 | 伊姆西Ip控股有限责任公司 | Adaptive batch dataset partitioning for distributed deep learning using accelerator mixture sets |
CN111178486A (en) * | 2019-11-27 | 2020-05-19 | 湖州师范学院 | Hyper-parameter asynchronous parallel search method based on population evolution |
WO2020102526A1 (en) * | 2018-11-14 | 2020-05-22 | North Carolina State University | Deep neural network with compositional grammatical architectures |
US20200175422A1 (en) * | 2018-11-29 | 2020-06-04 | International Business Machines Corporation | Asynchronous gradient weight compression |
- 2020-07-24: CN application CN202010723846.3A filed; published as CN111858058A (status: pending)
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106339351A (en) * | 2016-08-30 | 2017-01-18 | 浪潮(北京)电子信息产业有限公司 | SGD (Stochastic Gradient Descent) algorithm optimization system and method |
CN110678843A (en) * | 2017-04-17 | 2020-01-10 | 微软技术许可有限责任公司 | Dynamically partitioning workloads in deep neural network modules to reduce power consumption |
CN108304918A (en) * | 2018-01-18 | 2018-07-20 | 中兴飞流信息科技有限公司 | A kind of the parameter exchange method and system of the deep learning of data parallel |
CN108921196A (en) * | 2018-06-01 | 2018-11-30 | 南京邮电大学 | A kind of semantic segmentation method for improving full convolutional neural networks |
CN110795228A (en) * | 2018-08-03 | 2020-02-14 | 伊姆西Ip控股有限责任公司 | Adaptive batch dataset partitioning for distributed deep learning using accelerator mixture sets |
WO2020102526A1 (en) * | 2018-11-14 | 2020-05-22 | North Carolina State University | Deep neural network with compositional grammatical architectures |
US20200175422A1 (en) * | 2018-11-29 | 2020-06-04 | International Business Machines Corporation | Asynchronous gradient weight compression |
CN111178486A (en) * | 2019-11-27 | 2020-05-19 | 湖州师范学院 | Hyper-parameter asynchronous parallel search method based on population evolution |
Non-Patent Citations (2)
Title |
---|
CHENG DANING et al.: "Weighted parallel SGD for distributed unbalanced-workload training system", Journal of Parallel and Distributed Computing |
LU Shuxia et al.: "Weighted zeroth-order stochastic gradient descent algorithm with variance reduction" (in Chinese), Journal of Hebei University (Natural Science Edition) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112598118A (en) * | 2021-03-03 | 2021-04-02 | 成都晓多科技有限公司 | Method, device, storage medium and equipment for processing abnormal labeling in supervised learning |
CN112598118B (en) * | 2021-03-03 | 2021-06-25 | 成都晓多科技有限公司 | Method, device, storage medium and equipment for processing abnormal labeling in supervised learning |
CN114167828A (en) * | 2021-12-03 | 2022-03-11 | 润电能源科学技术有限公司 | External hanging control method of DCS controller and related device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110533183B (en) | Task placement method for heterogeneous network perception in pipeline distributed deep learning | |
CN109032671B (en) | Distributed deep learning method and system based on data parallel strategy | |
Sun et al. | Optimizing network performance for distributed dnn training on gpu clusters: Imagenet/alexnet training in 1.5 minutes | |
US10171284B2 (en) | Reachability-based coordination for cyclic dataflow | |
CN114756383B (en) | Distributed computing method, system, equipment and storage medium | |
CN103970580B (en) | A kind of data flow towards multinuclear cluster compiles optimization method | |
CN109754060A (en) | A kind of training method and device of neural network machine learning model | |
CN107330516A (en) | Model parameter training method, apparatus and system | |
CN110222005A (en) | Data processing system and its method for isomery framework | |
CN111858058A (en) | SGD load balancing method and device based on parallel computing and storage medium | |
Van Tendeloo et al. | PythonPDEVS: a distributed parallel DEVS simulator. | |
CN108564164A (en) | A kind of parallelization deep learning method based on SPARK platforms | |
Zhan et al. | Pipe-torch: Pipeline-based distributed deep learning in a gpu cluster with heterogeneous networking | |
Sun et al. | Gradientflow: Optimizing network performance for large-scale distributed dnn training | |
CN111274036A (en) | Deep learning task scheduling method based on speed prediction | |
CN111241301A (en) | Knowledge graph representation learning-oriented distributed framework construction method | |
Sun et al. | Gssp: eliminating stragglers through grouping synchronous for distributed deep learning in heterogeneous cluster | |
Osawa et al. | Pipefisher: Efficient training of large language models using pipelining and fisher information matrices | |
Wu et al. | Rethinking memory and communication cost for efficient large language model training | |
CN110135067B (en) | Helicopter flow field overlapping mixed grid parallel method under double time step method | |
CN116755876A (en) | Large model hybrid parallel training acceleration method and system | |
CN112446484A (en) | Multitask training cluster intelligent network system and cluster network optimization method | |
CN116400963A (en) | Model automatic parallel method, device and storage medium based on load balancing | |
Wang et al. | A coordinated two-stages virtual network embedding algorithm based on reinforcement learning | |
CN115600673A (en) | Method and system for parallel training DNN model for multi-machine multi-card computing system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
AD01 | Patent right deemed abandoned | ||
Effective date of abandoning: 20221209 |