CN108829517B - Training method and system for machine learning in cluster environment - Google Patents


Info

Publication number
CN108829517B
CN108829517B
Authority
CN
China
Prior art keywords
node
training
computing
data
unit time
Prior art date
Legal status
Active
Application number
CN201810549619.6A
Other languages
Chinese (zh)
Other versions
CN108829517A (en)
Inventor
程大宁 (Cheng Daning)
李士刚 (Li Shigang)
张云泉 (Zhang Yunquan)
Current Assignee
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS
Priority to CN201810549619.6A
Publication of CN108829517A
Application granted
Publication of CN108829517B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/505Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load

Abstract

The invention provides a training method for machine learning in a cluster environment, which comprises the following steps: 1) according to the number of computing nodes in the cluster environment, dividing the data in a training set into a plurality of parts so that the computing nodes can execute the training operation in parallel; 2) training, with each computing node in the cluster environment, the portion of the data set assigned to it, so that each computing node trains a machine learning model in parallel; 3) carrying out a weighted average of the processing results of the individual computing nodes, the weight of each computing node being set such that the decay rates of the variances and/or means of the individual nodes in the cluster environment are consistent with, or close to, each other.

Description

Training method and system for machine learning in cluster environment
Technical Field
The present invention relates to training for machine learning in a cluster environment, and more particularly, to a method and system for training machine learning in a cluster environment with unbalanced load.
Background
With the development of machine learning and artificial intelligence techniques, the amount of computation required to complete tasks of application scenarios has increased dramatically, and in order to increase the computation speed, a large number of hardware devices are required to process data in a parallel manner. When a hardware device is actually used to implement a training process of machine learning, in a cluster environment, computation is generally performed in parallel by using a large number of computing nodes included in a cluster, and parameters or computation results of a machine model corresponding to a training target are obtained by iteratively performing computation.
Many users purchase hardware products according to the amount of computation their tasks require and the computing power of the hardware devices. Because hardware products are updated very quickly, it is very common for a user to accumulate, over a relatively long period, different batches of CPUs and GPUs with different computing capabilities. It is generally considered in the art to be unsuitable to run the same computing task in parallel on different batches of hardware with different computing capabilities as computing nodes of the same cluster. On the one hand, in parallel processing the computation time is determined by the computing node with the slowest processing speed, so if hardware with different computing capabilities is mixed as computing nodes, the advantage of the more capable hardware cannot be exploited. On the other hand, using different hardware within one cluster is inconvenient for unified management. For these two reasons, the field generally builds a cluster only from hardware with the same or similar processing capacity, and lets the nodes of that cluster carry out the iterative machine-learning computation in parallel.
However, for a user who has separately purchased hardware with different computing power several times, this means the different batches cannot be used together for the same computation, which is very uneconomical in terms of usage cost. When the overall computing power needs to be increased, the user must either buy more powerful hardware again or look for new hardware with the same or similar computing power as the hardware already purchased.
Disclosure of Invention
Accordingly, it is an object of the present invention to overcome the above-mentioned drawbacks of the prior art, and to provide a training method for machine learning in a clustered environment, comprising:
1) according to the number of computing nodes in the cluster environment, dividing data in a training set into a plurality of parts for each computing node to execute training operation in parallel;
2) training, with each computing node in the cluster environment, the portion of the data set assigned to it, such that each computing node trains a machine learning model in parallel;
3) carrying out a weighted average of the processing results of the respective computing nodes, the weight of each computing node being set such that the decay of the variance and/or mean of the respective nodes in the cluster environment is consistent or close across nodes.
Preferably, according to the method, the weight of each computing node is set as an exponential function of the difference between the amount of data that the computing node can process per unit time and the amount of data that the computing node with the fastest computing speed in the cluster environment can process per unit time.
Preferably, according to the method, if the loss function used for training of machine learning belongs to a non-strongly convex function, the weight of the computing node i is set as:
weight_{1-λη, i} = (1 − λη)^{T_i} / Σ_{j=1}^{k} (1 − λη)^{T_j}
where λ is the regularization coefficient, η is the step size, the computing nodes are numbered 1 to k, and T_i is the difference between the amount of data that computing node i can process per unit time and the amount of data that the computing node with the fastest computing speed in the cluster environment can process per unit time.
Preferably, according to the method, wherein the loss function is a non-strongly convex function, the number k of computing nodes, the regularization coefficient λ, the step size η, and the difference T_i between the amount of data that computing node i can process per unit time and the amount of data that the computing node with the fastest computing speed in the cluster environment can process per unit time are selected to satisfy the expression:
Figure BDA0001680825560000022
preferably, according to the method, if the loss function used for training of machine learning belongs to a strong convex function, the weight of the computing node i is set as:
weight_{r, i} = r^{T_i} / Σ_{j=1}^{k} r^{T_j}
where r is the convergence rate of the SGD algorithm for the given training data set and training function, the computing nodes are numbered 1 to k, and T_i is the difference between the amount of data that computing node i can process per unit time and the amount of data that the computing node with the fastest computing speed in the cluster environment can process per unit time.
Preferably, according to the method, wherein the loss function is a strongly convex function, the number k of computing nodes, the convergence rate r of the SGD algorithm given a training data set and a training function, and the difference T_i between the amount of data that computing node i can process per unit time and the amount of data that the computing node with the fastest computing speed in the cluster environment can process per unit time are selected to satisfy the expression:
Figure BDA0001680825560000032
preferably, according to the method, the weight of each of the computation nodes is set to be, if the loss function used for training for machine learning belongs to a strong convex function:
and fitting the difference between the data quantity which can be processed by the computing node in unit time and the data quantity which can be processed by the computing node with the highest computing speed in the cluster environment in unit time by using an exponential function according to the convergence condition of multiple iterations in the training process.
Preferably, according to the method, wherein the compute node is a CPU, or a GPU, or a MIC, or a combination thereof; and the cluster environment includes compute nodes of the same and/or different computing capabilities.
The invention also provides a computer-readable storage medium in which a computer program is stored, the computer program, when executed, implementing any one of the methods described above.
And, a training system for machine learning in a clustered environment, comprising:
a storage device and a processor;
wherein the storage means is adapted to store a computer program which, when executed by the processor, is adapted to carry out the method of any of the above.
Compared with the prior art, the invention has the advantages that:
the method is particularly suitable for training machine learning in a cluster environment with unbalanced load, and particularly can achieve the effect similar to that of implementing the Simul Parallel SGD algorithm in a load balancing environment when the iteration is performed for the same times in the load unbalanced environment. Based on the scheme, a user can use hardware devices which are purchased for multiple times and have different computing capabilities together as a computing cluster for executing machine learning in parallel, hardware with stronger computing capabilities does not need to be purchased again or new hardware with the same or similar computing capabilities as the purchased hardware needs to be searched, hardware cost is greatly saved, and the method is particularly suitable for applications which need to use a large number of hardware devices to process data in a parallel mode.
Drawings
Embodiments of the invention are further described below with reference to the accompanying drawings, in which:
Fig. 1 is a schematic diagram of the conventional barrel SGD algorithm.
Fig. 2 is a schematic diagram of the barrel SGD algorithm as improved according to an embodiment of the present invention.
Fig. 3 shows a scheme in which worker nodes are used together with parameter servers in a cluster, according to an embodiment of the present invention.
FIG. 4 is a flow diagram of a training method for machine learning in a clustered environment in accordance with one embodiment of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
As introduced in the Background, when machine-learning training is performed by parallel computation in a cluster environment, the desired result has to be obtained by repeating the computation iteratively. For example, a training objective may be defined as finding the value of the parameter to be determined that minimizes the loss function (also referred to as the objective function) expressed in terms of that parameter. To make the value of the loss function converge into a very small range as quickly as possible through iterative computation during machine-learning training, the most commonly used solution model is Stochastic Gradient Descent (SGD).
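For reference, the following is a minimal sketch of this single-node SGD iteration (function and parameter names are illustrative assumptions, not taken from the patent; `lam` and `eta` play the same roles as the regularization coefficient λ and step size η used later in the text):

```python
import numpy as np

def sgd(grad_fn, data, w0, eta=1e-4, lam=1e-2, epochs=1):
    """Plain SGD with L2 regularization: w <- w - eta * (grad_fn(w, x, y) + lam * w)."""
    w = np.asarray(w0, dtype=float).copy()
    for _ in range(epochs):
        for x, y in data:                 # one (sample, label) pair per step
            g = grad_fn(w, x, y)          # gradient of the per-sample loss at w
            w -= eta * (g + lam * w)      # regularized stochastic gradient step
    return w
```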
Meanwhile, because machine-learning training requires repeated iterative computation over a large amount of training data, the amount of computation is huge. Many computers are therefore often used to train the same machine model with the same loss function in parallel, and the computation results of all the computers are aggregated into the finally output machine model, which accelerates the training process. Such an environment containing many computers is referred to as a cluster environment, and each computer used in parallel is referred to as a computing node.
Conventional parallel SGD algorithms fall into two broad categories: delayed SGD and barrel SGD. A delay-type SGD algorithm updates the current model using the gradient computed on a past model (i.e., using a delayed gradient), where the difference in iteration count between the past model and the current model may not exceed a specified value. Theoretical analysis shows that the convergence rate of delay-type SGD decreases as the delay increases, and this slowdown outpaces the acceleration gained from parallelism. Keeping the delay from growing large inevitably incurs communication overhead, so a delay-type SGD system has to trade off parallelism (i.e., the number of computing nodes in the cluster), delay, and system efficiency.
In actual industrial practice, the engineering realization of the delayed SGD algorithm is the "parameter server". The computing nodes of a parameter-server system are divided into a server node and a number of worker nodes; the workers' main job is to pull the current model from the server node, compute the stochastic gradient corresponding to that model, and deliver the gradient to the server node. After receiving the delayed gradient, the server updates the model with it.
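A minimal, purely sequential sketch of the parameter-server pattern just described (all names are illustrative assumptions; in a real system the workers would run as separate processes and their gradients would arrive with some delay):

```python
import numpy as np

def parameter_server_sgd(grad_fn, worker_shards, w0, eta=1e-4, lam=1e-2, steps=1000):
    """Server holds the model; each 'worker' in turn pulls it, computes a
    stochastic gradient on its own shard, and the server applies the update.
    (The loop here is sequential, so the gradient is never actually stale.)"""
    w = np.asarray(w0, dtype=float).copy()
    rng = np.random.default_rng(0)
    for t in range(steps):
        shard = worker_shards[t % len(worker_shards)]   # round-robin over workers
        x, y = shard[rng.integers(len(shard))]          # worker samples one data point
        g = grad_fn(w, x, y)                            # gradient w.r.t. the pulled model
        w -= eta * (g + lam * w)                        # server-side update
    return w
```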
The other category, barrel SGD, uses the gain in variance to accelerate the SGD algorithm: it averages the models trained by the different nodes to produce the final model. Fig. 1 shows a schematic diagram of this barrel SGD algorithm. In the scheme shown in Fig. 1, each node trains the same machine model on training-set data in parallel, the nodes' data-processing capacities are comparable, and after a period of training each node has performed an equal amount of computation; the outputs of the nodes are then averaged to obtain the final training result. This algorithm incurs very little communication overhead; however, its high performance is not always realized, and the effectiveness of its parallelism depends on the variance of the training data set.
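The barrel scheme of Fig. 1 then amounts to training one model per data shard and averaging the outputs. A minimal sketch, reusing the illustrative `sgd` helper from the earlier snippet:

```python
import numpy as np

def barrel_sgd(grad_fn, shards, w0, eta=1e-4, lam=1e-2, epochs=1):
    """Unweighted barrel SGD: every node trains its own model on its shard,
    and the node models are averaged with equal weights."""
    models = [sgd(grad_fn, shard, w0, eta, lam, epochs) for shard in shards]
    return np.mean(models, axis=0)
```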
The two SGD algorithms each have advantages and disadvantages, but neither is designed for machine learning in a cluster environment with an unbalanced load, and neither solves the problem described in the Background, namely that hardware devices purchased in several batches with different computing capabilities cannot be used together. For a delay-type SGD system, a load-unbalanced cluster environment increases the delay, and in theory the convergence upper bound increases twofold as the delay grows, so the overall efficiency of the system becomes low. For conventional barrel SGD, when the cluster load is unbalanced, the slowest node becomes the performance bottleneck: all nodes must complete an equal amount of work before moving on to the next stage, which means that fast nodes must wait for slow nodes to finish the same amount of work, significantly reducing the computational efficiency of training. As stated in the Background, in actual use a user may purchase hardware devices with different computing capabilities several times, and if such hardware is to be used jointly for machine learning, the cluster formed by it is a load-unbalanced cluster environment. To handle such situations, certain improvements to the SGD algorithm are needed.
To address this, the present invention improves the traditional barrel SGD algorithm: by adopting a weighted parallel scheme, the training time of the parallel SGD algorithm for a machine model in a load-unbalanced cluster environment is significantly reduced without changing the convergence rate. Fig. 2 shows a schematic diagram of the barrel SGD algorithm as improved by the present invention. As shown in Fig. 2, the computing capacities of nodes 1, 2 and 3 differ from one another; when the three nodes are assigned data sets of equal or unequal size, then at a given point in time after training has run for a while, the amounts of computation actually performed by nodes 1, 2 and 3 differ, and the results of the nodes can be assigned corresponding weights according to these differences, weighted-averaged, and output as the computation result.
The inventors explored how to set such weights. After studying the convergence characteristics of SGD-type algorithms, the inventors found that their convergence depends mainly on three quantities: the variance of the model as it converges to the fixed point, the mean of the model as it converges to the fixed point, and the difference between the fixed-point model and the optimal model. The difference between the fixed-point model and the optimal model is determined by the iteration step size; it is an inherent property of the iteration and is fixed before computation, so it is not affected by load imbalance. By contrast, both the variance and the mean decrease exponentially, so in a parallel stochastic gradient descent algorithm the associated coefficients should also vary exponentially. The inventors therefore considered that the weights set for the respective nodes of the cluster can be determined from the decay of the variance and mean of the objective function used for training.
Through derivation and verification, the inventors found that the difference between the mean of the machine-learning model at the current iteration count and the mean of the finally converged model, ‖μ_{D_t^η} − μ_{D^η}‖, and the corresponding variance gap, σ_{D_t^η} − σ_{D^η}, both decrease exponentially as the iterations proceed, with a per-iteration decay factor bounded above by 1 − λη; that is, both quantities shrink on the order of (1 − λη)^t. Here λ is the regularization coefficient, η is the step size, G is the maximum Lipschitz norm of the loss function on the training set, μ is the mean, σ is the variance, t is the number of iterations, D_t^η is the probability-density distribution of the machine-learning model after t iterations with step size η, and D^η is the probability-density distribution after the machine-learning model has converged with step size η.
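For a rough sense of scale (using, purely as an illustration, the regularization coefficient λ = 0.01 and step size η = 0.0001 that appear in the experiment section below): the per-iteration factor is 1 − λη = 1 − 10^-6 = 0.999999, and since (1 − 10^-6)^1,000,000 ≈ e^-1 ≈ 0.37, this bound allows the mean and variance gaps to shrink by roughly a factor of e only after about a million iterations.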
The inventors therefore considered that the distance to convergence can be corrected directly (i.e., by weighted averaging), adjusting the weight of any node whose model has not yet decayed to its final value.
Also, as described above, the inventors consider that the results of the respective nodes may be assigned corresponding weights for weighted averaging according to the differences between the amounts of computation actually performed by the nodes in the cluster. The invention therefore proposes to take as a reference the node with the fastest execution speed, i.e., the node able to process the largest amount of data per unit time, referred to herein as the "fastest node". The weight of a node i is set according to the difference T_i between the amount of data that node i can process per unit time and the amount of data that the fastest node in the cluster can process per unit time. Because the variance and mean of the model after an iteration decrease on an exponential scale during training, the inventors propose that the weight of node i should be an exponential function of T_i.
The process of adjusting the weight of each node can be viewed as making the degrees of decay of the variance and/or mean of the individual nodes in the cluster environment consistent with, or close to, each other. In this way, the gain of each node in variance is fully exploited. Although the specific examples given in the present invention use T_i as the parameter of the weight of node i, the invention is not limited thereto: other parameters may be used to determine the corresponding node weights, as long as the resulting weights can fully exploit each node's gain in variance, and a person skilled in the art can, following this teaching, use the prior art to find weights satisfying the above condition.
The upper bound 1 − λη on the decay rate given above applies to convex functions in the general sense. Although loss functions are all convex, for a strongly convex function the decay factor is much smaller (the decay is much faster), so according to one embodiment of the present invention the weight to be set for each node can be computed from the decay rate of the strongly convex function.
Thus, in some embodiments of the invention, approximate weights for the respective computing nodes can be provided based on whether the objective function is strongly convex. The inventors studied the decay characteristics of the objective function when it is a strongly convex function and when it is a non-strongly convex function, respectively, in order to determine what approximate weights should be provided for the computing nodes in each case, as follows.
The strongly convex and non-strongly convex functions of the present invention are defined as follows:
Among convex functions, if there exists a positive number σ such that the function f satisfies f(x + y) ≥ f(x) + y^T f'(x) + σ‖y‖²  (2), then f is called a σ-strongly convex function; otherwise it is a non-strongly convex function. Experiments show that when σ is very small (e.g., less than 10^-3), the function can also be treated as a non-strongly convex function.
Non-strongly convex function
If the objective function is a non-strongly convex function (for example, the hinge loss function used to train a support vector machine), then under unbalanced cluster load the corresponding models can be weighted-averaged with exponentially decaying weights, and the models generated by nodes with different loads are combined using the following weights (expression (1)):
weight_{1-λη, i} = (1 − λη)^{T_i} / Σ_{j=1}^{k} (1 − λη)^{T_j}    (1)
where weight_{1-λη, i} is the weight assigned to node i, the nodes are numbered 1 to k, and T_i is the difference between the amount of training data processed by node i and the amount processed by the fastest node. The regularization coefficient λ and the step size η are fixed before training and do not change during training, and the value of T_i can be obtained once the training of a single node has finished.
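A short sketch of expression (1) (illustrative names; `T` is the vector of data-count gaps T_i, with T_i = 0 for the fastest node):

```python
import numpy as np

def non_strongly_convex_weights(T, lam=1e-2, eta=1e-4):
    """weight_i proportional to (1 - lam*eta)**T_i, normalized to sum to 1."""
    w = (1.0 - lam * eta) ** np.asarray(T, dtype=float)
    return w / w.sum()

# e.g. non_strongly_convex_weights([0, 0, 400000]) down-weights the lagging node
```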
The theoretical upper convergence limit of the non-strongly convex function is as follows:
Figure BDA0001680825560000082
where c is the objective function, w is the model parameter, D is the probability distribution after model convergence, ‖f‖_Lip is the Lipschitz norm of f, v is the system output result, G is the maximum Lipschitz norm of the loss function on the training set, and t is the iteration number.
The above theoretical result shows that, as the iteration progresses, when the following expression (3) is satisfied the final result produced by the weighted summation shown in Fig. 2 is better than the result produced by the fastest node, i.e., the target-loss-function value of the final result is smaller than that of the result produced by the fastest node.
Figure BDA0001680825560000083
In other words, if one wishes to train a machine-learning model with a cluster whose load is unbalanced and wants the averaged output to be better than the output of the node with the highest computing performance, then the number k of computing nodes in the cluster, the regularization coefficient λ, the step size η, and the difference T_i between the training data amount of the i-th node and that of the fastest node can be chosen so as to satisfy expression (3), thereby achieving the above aim.
Strong convex function
A strongly convex function (e.g., the logistic loss used by a logistic-regression model) converges faster than a non-strongly convex function: the variance and the mean move more quickly toward the stationary point, i.e., the convergence point, and the convergence rate differs from iteration to iteration because the relationship (parallel or perpendicular) between the data sample and the current training model varies during training. These issues need to be handled, for which the present invention proposes the following measures:
in the first step, an exponential function may be used to fit the true convergence rate in actual calculations according to the actual calculation effect. Under the real condition, each iteration is a convergence process, the model is sampled at intervals for each iteration, the loss function value of the machine learning model under the current iteration progress is calculated, an exponential function is used, and y is rx+ b, fitting the true convergence rate.
In the second step, weights can be assigned to the computing nodes according to the true convergence rate using expression (4). In theory, the more strongly convex the function, the smaller its capacity to accommodate slower (more lightly loaded) nodes. By combining the gains in variance of the different nodes, variances converged to different degrees all contribute to the gain, and a better result can be obtained. When the requirement of expression (6) is met, the overall system output is better than that of the fastest node, and expression (6) also shows that the cluster can accommodate a sufficient number of slow nodes.
The weights set for the models produced by the various nodes with different (or the same) loads are:
weight_{r, i} = r^{T_i} / Σ_{j=1}^{k} r^{T_j}    (4)
where weight_{r,i} is the weight assigned to node i, r is the convergence rate of the SGD algorithm for the given training data set and training function, and b is a constant.
The theoretical upper convergence limit of the strongly convex function is:
Figure BDA0001680825560000101
The above equation shows that when the following expression (6) is satisfied, the final result produced by the weighted summation shown in Fig. 2 is better than the result produced by the fastest node; that is, the target-loss-function value of the output result is smaller than that of the result produced by the node with the fastest computation speed. The smaller the loss-function value, the better the model; and the smaller the loss-function value at the same point in time, the faster the convergence can be considered to be.
Figure BDA0001680825560000102
Similarly, if one wishes to train a machine-learning model with a load-unbalanced cluster and wants the averaged output to be better than the output of the node with the highest computing performance, then the number k of computing nodes in the cluster, the convergence rate r of the SGD algorithm for the given training data set and training function, and the difference T_i between the training data amount of node i and that of the fastest node can be chosen so as to satisfy expression (6), thereby achieving the above aim.
Training for machine learning
In conjunction with the above analysis for non-strongly convex and strongly convex functions, the inventors summarize the scheme of the invention as the following pseudo-code.
Figure BDA0001680825560000103
Figure BDA0001680825560000111
The pseudo code of the 1 st line randomly divides data in a training set into a plurality of parts according to the number of nodes in a cluster, so that each node can execute training operation in parallel;
lines 2-11 pseudo code represent training the data portion to which each node in the cluster is assigned, such that each node trains a machine learning model in parallel;
Lines 12-13: according to the difference T_i between the amount of data consumed by node i at the present time (i.e., the amount of data used to train the model so far) and the amount of data consumed by the fastest node, weights are assigned as follows: if the objective function is a non-strongly convex function and a large number of samples are perpendicular to the model during the iterations, the weight of each node is assigned as shown in formula (1); if the objective function is a strongly convex function, the weight of each node is assigned as shown in formula (4);
Line 14 of the pseudo code obtains the target machine-learning model by weighted averaging according to the weight of each node. In this way, a machine-learning model close to the output of the load-balanced Simul Parallel SGD algorithm can be generated in a load-unbalanced cluster environment (i.e., one in which the nodes have different amounts of training data).
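The pseudo code itself appears in the patent only as an image; the following Python sketch is a reconstruction of the steps summarized in lines 1-14 above (all function and variable names are illustrative, not the patent's, and the single-process simulation of the nodes is an assumption made purely for illustration):

```python
import numpy as np

def weighted_parallel_sgd(grad_fn, dataset, k, w0, lam=1e-2, eta=1e-4,
                          strongly_convex=False, r=0.99999, seed=0):
    """Sketch of the weighted barrel SGD steps described above.

    Step 1: randomly split the training data into k parts.
    Steps 2-11: train one model per (simulated) node; node speeds are
        emulated here by how much of its shard each node manages to consume.
    Steps 12-13: compute T_i, the data-count gap to the fastest node, and
        assign weights (1 - lam*eta)**T_i or r**T_i, normalized to sum to 1.
    Step 14: return the weighted average of the node models.
    """
    rng = np.random.default_rng(seed)
    shards = np.array_split(rng.permutation(len(dataset)), k)      # step 1
    speeds = rng.uniform(0.2, 1.0, size=k)                         # emulated (unequal) node speeds
    consumed, models = [], []
    for i in range(k):                                             # steps 2-11
        n_i = int(len(shards[i]) * speeds[i])                      # data actually consumed by node i
        w = np.asarray(w0, dtype=float).copy()
        for idx in shards[i][:n_i]:
            x, y = dataset[idx]
            w -= eta * (grad_fn(w, x, y) + lam * w)                # regularized SGD step
        consumed.append(n_i)
        models.append(w)
    T = max(consumed) - np.asarray(consumed, dtype=float)          # steps 12-13
    base = r if strongly_convex else (1.0 - lam * eta)
    weights = base ** T
    weights /= weights.sum()
    return sum(wt * m for wt, m in zip(weights, models))           # step 14
```

A call such as `weighted_parallel_sgd(grad_fn, data, k=10, w0=np.zeros(dim))` would then play the role of steps 1-4 described next; in a real cluster the unequal speeds come from the hardware itself rather than being simulated.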
In the above embodiments, specific expressions for the respective decay rates were given for the cases in which the loss function is strongly convex and non-strongly convex, in order to determine the weights set for the individual nodes. In other embodiments of the present invention, the decay rate of the loss function therefore has to be determined experimentally. Specifically, to measure the decay rate of the loss function, the loss-function value of the model can be computed once every fixed number of iterations (for example, every 10000 iterations) while each node's model is trained, and after training is finished the decay rate of the loss function can be fitted from the recorded loss-function values. The weight of node i is then determined according to the formula
weight_{r, i} = r^{T_i} / Σ_{j=1}^{k} r^{T_j}
where r is the decay rate to which the loss function has been fitted.
The training method for machine learning in a clustered environment according to the present invention will be described below by way of a specific embodiment. Referring to fig. 4 in conjunction with the cluster shown in fig. 2, according to one embodiment of the invention, the method includes:
step 1, distributing the data volume to be processed for each computing node in the cluster environment.
In the present invention, the amounts of data allocated to the computing nodes need not be equal to one another. Corresponding to line 1 of the pseudo code above, the splitting of the training-set data may be random; for example, the amount of data allocated to each computing node may differ from node to node. This is because the amount of data processed by a node, together with the number of iterations, determines the amount of computation performed by that node, and whether the computing nodes have performed equal amounts of computation at the time of the weighted averaging does not affect the present invention; the amounts of data allocated to the computing nodes therefore do not have to be equal.
Step 2: each computing node in the cluster environment processes the data allocated to it in parallel.
In this step, each computing node performs the corresponding computation based on its loss function and the data assigned to it (e.g., training-set data). The specific computation depends on the application being trained; the nodes process the data distributed to them in parallel.
Step 3: perform a weighted average of the results obtained by the computing nodes. The weight of each computing node is obtained as follows: depending on whether the loss function is a non-strongly convex or a strongly convex function, and on the difference between the amount of data consumed by the computing node and the amount consumed by the fastest computing node, the weight of each computing node is determined:
if loss functionFor non-strongly convex functions, use
Figure BDA0001680825560000122
And allocating weight to the ith node, wherein lambda is a regular coefficient, eta is a step length, and the node numbering range is 1iIs the difference between the training data volume of the ith node and the training data volume of the fastest node;
Preferably, the number k of computing nodes in the cluster, the regularization coefficient λ, the step size η, and the difference T_i between the training data amount of the i-th node and that of the fastest node are chosen to satisfy
Figure BDA0001680825560000131
If the loss function is a strongly convex function, the weight
weight_{r, i} = r^{T_i} / Σ_{j=1}^{k} r^{T_j}
is assigned to node i, where weight_{r,i} is the weight assigned to node i and r is the convergence rate of the SGD algorithm for the given training data set and training function;
Preferably, the number k of computing nodes in the cluster, the convergence rate r of the SGD algorithm for the given training data set and training function, and the difference T_i between the training data amount of node i and that of the fastest node are chosen to satisfy
Figure BDA0001680825560000133
Step 4: output the result obtained by weighted-averaging the results of the computing nodes.
After each node obtains its own calculation result through the iterative process, the results obtained by each node are subjected to a weighted average process to obtain a weighted average result, that is:
v = Σ_{i=1}^{k} weight_i · w_{i,t}
where w_{i,t} is the computation result obtained by node i after t iterations, and weight_i is the weight set for node i.
Through the steps 1-4, the training of machine learning can be efficiently carried out in the cluster environment with unbalanced load.
Mixed use with delayed SGD algorithm
The present invention can also use the improved barrel SGD of the preceding embodiments in combination with a delayed SGD algorithm. The delayed SGD algorithm is a stochastic gradient descent method implemented on parameter servers and is a completely different algorithm from barrel SGD; how the improved barrel SGD algorithm according to the present invention is mixed with it is described in detail below. Specifically, as shown in Fig. 3, the worker nodes connected under one server have performance close to one another (a server may also have only a single worker connected to it), and each group of workers is trained with a separate delay-type SGD method. After the servers finish training on the data set, the output results of the servers are weighted-averaged; the weights used in the weighted average are based on the measured convergence rate r, following expression (4), which yields the final result. According to the theory of delayed SGD, the convergence rate is maximized only when workers of similar efficiency serve one server, so in the present invention it is preferable that the worker nodes belonging to the same server have the same or similar performance.
For the scheme involving parameter servers, under unbalanced load the convergence upper bound of the objective function is the same as the convergence upper bound of the strongly convex case above. The results output by the different servers in Fig. 3 can therefore be combined using the same weights as in the strongly convex case when computing the weighted average.
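A short sketch of this combination step (illustrative names): given the model output by each parameter server and the amount of data its group has consumed, the server models are combined with the r^{T_i}-style weights of expression (4):

```python
import numpy as np

def combine_server_models(server_models, consumed, r):
    """Weighted average of parameter-server outputs; T_i is the data-count
    gap between server i and the server that consumed the most data."""
    consumed = np.asarray(consumed, dtype=float)
    T = consumed.max() - consumed
    weights = r ** T
    weights /= weights.sum()
    return sum(w * np.asarray(m, dtype=float) for w, m in zip(weights, server_models))
```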
Effect testing
As described above, the present invention aims to provide a scheme for efficiently training a machine model in a load-unbalanced cluster environment by using computers with different computing capacities together, so that when training is run in parallel in such an environment, nodes with high processing speed do not have to wait for nodes with low processing speed to complete an equivalent amount of computation. To verify that the invention achieves this aim, the inventors carried out the following effect test on the scheme of the invention, comparing the effect of the WP-SGD scheme of the invention in a load-unbalanced cluster environment with that of the traditional Simul Parallel SGD algorithm in a load-balanced cluster environment. The specific test is as follows.
Experimental environment: a cluster of 10 nodes is used; each node hosts a Xeon(R) CPU E5-2660 v2 @ 2.20 GHz, and each node carries one processor.
Experimental configuration:
in this experiment, the regularization coefficient λ is set to 0.01, the step length η is set to 0.0001, and the estimated convergence rate r is set to 0.99999. Since the final iteration result is very close to 0, we set the initial value of each feature of the machine learning model (i.e. each dimension of the model vector) to 4.0 in order to get more iteration steps.
The load-unbalanced environment is constructed by software simulation and is configured as follows: of the 10 nodes, 8 have the same computation speed and are called normal nodes; the other 2 are slow nodes whose computation speed is 1/5 that of the 8 normal nodes. In this load-unbalanced environment, the experiment tested the scheme according to the present invention as well as the direct-averaging scheme shown in Fig. 1.
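As a back-of-the-envelope illustration of what this configuration means for the weights (the arithmetic is mine, not the patent's): if each normal node consumes N training samples over some time window, each slow node consumes N/5, so the data-count gap is T_i = 0 for the eight normal nodes and T_i = 4N/5 for the two slow nodes; with the non-strongly-convex weights of expression (1), each slow node's model is therefore down-weighted, before normalization, by a factor of (1 − λη)^{4N/5} ≈ e^{−4N/(5·10^6)} relative to a normal node (using λη = 10^-6 from the configuration above).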
For comparison, a load-balanced environment is also constructed, configured as follows: 10 nodes, all with the same performance as the normal nodes in the load-unbalanced environment. In this load-balanced environment, the experiment tests the traditional Simul Parallel SGD algorithm.
The present experiment used the hinge loss function for training the support vector machine as the objective function.
Data source: the KDD Cup 2010 (algebra) data set was used as experimental data; it contains 8,407,752 samples with 20,216,830 features per sample, of which on average 20-40 are non-zero for each sample.
Evaluation method: to show how the hinge-loss value varies with the iterations, each node in this experiment trained a model individually using a portion of the data. During training, so that the comparison is made at common points, all systems stop training every 100000 training steps; the model is stored to disk and evaluated, and training then continues.
Results: Table 1 shows how the objective function of Simul Parallel SGD, of the method according to the invention, and of the direct-averaging method varies as the iterations progress. As can be seen from Table 1, at the same number of iterations the loss-function values of the machine-learning models obtained by Simul Parallel SGD in the load-balanced environment and by the method according to the invention in the load-unbalanced environment are almost identical. Note that in the load-unbalanced environment, the fast and slow nodes of the present invention have gone through different numbers of iterations when the iterated results are output, the fast nodes having performed relatively more iterations. To have a uniform comparison criterion with the method in the load-balanced environment, the number of iterations in the load-unbalanced environment is taken to be the number of iterations performed by the fastest node. Considering that the method is run in the load-unbalanced environment, where fast nodes do not have to wait for slow nodes to finish an equal amount of work, the method achieves an effect similar to Simul Parallel SGD in a load-balanced environment; run in a load-balanced environment it would do even better. Compared with the direct-averaging method, the present method has a faster mathematical convergence rate and is closer to the load-balanced Simul Parallel SGD algorithm: for example, at 500000 iterations the loss-function value of Simul Parallel SGD and that of the present method are roughly the same (both about 185), whereas the direct-averaging method must iterate to about 700000 steps to reach a comparable objective-function value.
The above phenomenon is consistent with the theoretical analysis result.
Table 1: Objective-function values of the model parameters obtained with the different parallel SGD algorithms and optimization methods.
The method is therefore particularly suitable for machine-learning training in a load-unbalanced cluster environment: run for the same number of iterations in a load-unbalanced cluster environment (where, for the present invention, the iteration count is that of the fastest node), it achieves an effect similar to that of the Simul Parallel SGD algorithm in a load-balanced environment. Based on this scheme, a user can jointly use hardware devices purchased at different times and with different computing capabilities as one computing cluster for parallel machine learning, without having to buy more powerful hardware again or to look for new hardware with the same or similar computing power as the hardware already purchased; this greatly reduces hardware cost and is especially suitable for applications that must process data in parallel on a large number of hardware devices.
It should be noted that, all the steps described in the above embodiments are not necessary, and those skilled in the art may make appropriate substitutions, replacements, modifications, and the like according to actual needs.
Also, the computing node described in the present invention corresponds to a process, and may schedule resources such as a CPU, a GPU, a MIC, or any corresponding computing resource, and combinations thereof.
Finally, it should be noted that the above embodiments are only used for illustrating the technical solutions of the present invention and are not limited. Although the present invention has been described in detail with reference to the embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (9)

1. A training method for machine learning based on a stochastic gradient descent method in a clustered environment, comprising:
1) according to the number of computing nodes in the cluster environment, dividing data in a training set into a plurality of parts for each computing node to execute training operation in parallel;
2) training the portion of the assigned data set with each compute node in the cluster environment, such that each compute node trains a machine learning model in parallel;
3) carrying out a weighted average of the processing results of the computing nodes, wherein the weight of each computing node is set so that the decay rates of the variance and/or mean of the computing nodes in the cluster environment are consistent with, or close to, each other, and wherein the calculation of the weight involves the difference between the amount of data that the computing node can process per unit time and the amount of data that the computing node with the fastest computing speed in the cluster environment can process per unit time, the regularization coefficient of the objective function, the step size set in the algorithm, and the convergence rate of the SGD algorithm for the corresponding objective function.
2. The method of claim 1, wherein if the loss function used for training for machine learning belongs to a non-strongly convex function, the weight of the computing node i is set as:
weight_{1-λη, i} = (1 − λη)^{T_i} / Σ_{j=1}^{k} (1 − λη)^{T_j}
wherein λ is a regularization coefficient, η is a step size, the computing nodes are numbered 1 to k, and T_i is the difference between the amount of data that computing node i can process per unit time and the amount of data that the computing node with the fastest computing speed in the cluster environment can process per unit time.
3. The method of claim 2, wherein, for the loss function being a non-strongly convex function, the number k of computing nodes, the regularization coefficient λ, the step size η, and the difference T_i between the amount of data that computing node i can process per unit time and the amount of data that the computing node with the fastest computing speed in the cluster environment can process per unit time are selected to satisfy the expression:
Figure FDA0002881730030000021
4. the method of claim 1, wherein if the loss function used for training for machine learning belongs to a strong convex function, the weight of the computing node i is set as:
weight_{r, i} = r^{T_i} / Σ_{j=1}^{k} r^{T_j}
wherein r is the convergence rate of the SGD algorithm given a training data set and a training function, the computing nodes are numbered 1 to k, and T_i is the difference between the amount of data that computing node i can process per unit time and the amount of data that the computing node with the fastest computing speed in the cluster environment can process per unit time.
5. The method according to claim 4, wherein, for the loss function being a strongly convex function, the number k of computing nodes, the convergence rate r of the SGD algorithm given a training data set and a training function, and the difference T_i between the amount of data that computing node i can process per unit time and the amount of data that the computing node with the fastest computing speed in the cluster environment can process per unit time are selected to satisfy the expression:
Figure FDA0002881730030000023
6. the method of claim 1, wherein the weight of each of the compute nodes is set to include:
if the loss function used for training of machine learning belongs to a non-strong convex function, setting the weight of the computing node i as:
weight_{1-λη, i} = (1 − λη)^{T_i} / Σ_{j=1}^{k} (1 − λη)^{T_j}
wherein λ is a regularization coefficient, η is a step size, the computing nodes are numbered 1 to k, and T_i is the difference between the amount of data that computing node i can process per unit time and the amount of data that the computing node with the fastest computing speed in the cluster environment can process per unit time;
if the loss function used for training of machine learning belongs to a strong convex function, setting the weight of the computing node i as:
weight_{r, i} = r^{T_i} / Σ_{j=1}^{k} r^{T_j}
wherein r is the convergence rate of the SGD algorithm given a training data set and a training function, the computing nodes are numbered 1 to k, and T_i is the difference between the amount of data that computing node i can process per unit time and the amount of data that the computing node with the fastest computing speed in the cluster environment can process per unit time;
and the difference between the amount of data that the computing node can process per unit time and the amount of data that the computing node with the fastest computing speed in the cluster environment can process per unit time is fitted with an exponential function according to the convergence behaviour of multiple iterations during training.
7. The method of any of claims 1-6, wherein the compute node is a CPU, or a GPU, or a Mic, or a combination thereof; and the cluster environment includes compute nodes of the same and/or different computing capabilities.
8. A computer-readable storage medium, in which a computer program is stored which, when being executed, is adapted to carry out the method of any one of claims 1-7.
9. A training system for machine learning in a clustered environment, comprising:
a storage device and a processor;
wherein the storage means is for storing a computer program for implementing the method according to any of claims 1-7 when executed by the processor.
CN201810549619.6A 2018-05-31 2018-05-31 Training method and system for machine learning in cluster environment Active CN108829517B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810549619.6A CN108829517B (en) 2018-05-31 2018-05-31 Training method and system for machine learning in cluster environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810549619.6A CN108829517B (en) 2018-05-31 2018-05-31 Training method and system for machine learning in cluster environment

Publications (2)

Publication Number Publication Date
CN108829517A CN108829517A (en) 2018-11-16
CN108829517B true CN108829517B (en) 2021-04-06

Family

ID=64147210

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810549619.6A Active CN108829517B (en) 2018-05-31 2018-05-31 Training method and system for machine learning in cluster environment

Country Status (1)

Country Link
CN (1) CN108829517B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109522129A (en) * 2018-11-23 2019-03-26 快云信息科技有限公司 A kind of resource method for dynamically balancing, device and relevant device
CN109522436A (en) * 2018-11-29 2019-03-26 厦门美图之家科技有限公司 Similar image lookup method and device
JP7230683B2 (en) * 2019-05-21 2023-03-01 富士通株式会社 Arithmetic processing device, program, and method of controlling arithmetic processing device
CN110705705B (en) * 2019-09-25 2022-04-22 浪潮电子信息产业股份有限公司 Convolutional neural network model synchronous training method, cluster and readable storage medium
CN110929884B (en) * 2019-11-22 2023-05-16 北京大学 Classification method and device for distributed machine learning optimization based on column division
CN113128696A (en) * 2019-12-31 2021-07-16 香港理工大学深圳研究院 Distributed machine learning communication optimization method and device, server and terminal equipment
CN111752713B (en) 2020-06-28 2022-08-05 浪潮电子信息产业股份有限公司 Method, device and equipment for balancing load of model parallel training task and storage medium
CN111782592A (en) * 2020-06-30 2020-10-16 北京百度网讯科技有限公司 Method, device and system for dividing data

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110320391A1 (en) * 2010-06-29 2011-12-29 Nec Laboratories America, Inc. Method and Apparatus for Predicting Application Performance Across Machines with Different Hardware Configurations
CN105843555A (en) * 2016-03-18 2016-08-10 南京邮电大学 Stochastic gradient descent based spectral hashing method in distributed storage
CN106250461A (en) * 2016-07-28 2016-12-21 北京北信源软件股份有限公司 A kind of algorithm utilizing gradient lifting decision tree to carry out data mining based on Spark framework
CN106339351A (en) * 2016-08-30 2017-01-18 浪潮(北京)电子信息产业有限公司 SGD (Stochastic Gradient Descent) algorithm optimization system and method
CN108009642A (en) * 2016-10-31 2018-05-08 腾讯科技(深圳)有限公司 Distributed machines learning method and system
CN106779093A (en) * 2017-01-06 2017-05-31 中国科学院上海高等研究院 Distributed machines learning training method and its system based on sliding window sampling
CN107018184A (en) * 2017-03-28 2017-08-04 华中科技大学 Distributed deep neural network cluster packet synchronization optimization method and system
CN107153843A (en) * 2017-05-03 2017-09-12 西安电子科技大学 Surface subsidence forecasting system and method based on SVMs
CN108090510A (en) * 2017-12-15 2018-05-29 北京大学 A kind of integrated learning approach and device based on interval optimization

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"基于差异合并的分布式随机梯度下降算法";陈振宏 等;《计算机学报》;20151031;第38卷(第10期);正文第2054-2063页 *

Also Published As

Publication number Publication date
CN108829517A (en) 2018-11-16

Similar Documents

Publication Publication Date Title
CN108829517B (en) Training method and system for machine learning in cluster environment
Renggli et al. SparCML: High-performance sparse communication for machine learning
CN110399222B (en) GPU cluster deep learning task parallelization method and device and electronic equipment
US9032416B2 (en) Load balancing using progressive sampling based on load balancing quality targets
US8577816B2 (en) Optimized seeding of evolutionary algorithm based simulations
Esteves et al. Competitive k-means, a new accurate and distributed k-means algorithm for large datasets
Dehnen A fast multipole method for stellar dynamics
US8560472B2 (en) Systems and methods for supporting restricted search in high-dimensional spaces
US9477532B1 (en) Graph-data partitioning for workload-balanced distributed computation with cost estimation functions
Lastovetsky et al. Model-based optimization of EULAG kernel on Intel Xeon Phi through load imbalancing
Gast The power of two choices on graphs: the pair-approximation is accurate?
CN103559205A (en) Parallel feature selection method based on MapReduce
Heene et al. Load balancing for massively parallel computations with the sparse grid combination technique
CN110992432A (en) Depth neural network-based minimum variance gradient quantization compression and image processing method
Stiskalek et al. The scatter in the galaxy–halo connection: a machine learning analysis
CN113391894A (en) Optimization method of optimal hyper-task network based on RBP neural network
US10996976B2 (en) Systems and methods for scheduling neural networks by varying batch sizes
CN117311998B (en) Large model deployment method and system
Aljarah et al. A mapreduce based glowworm swarm optimization approach for multimodal functions
Aviv et al. Learning under delayed feedback: Implicitly adapting to gradient delays
Schmidt et al. Load-balanced parallel constraint-based causal structure learning on multi-core systems for high-dimensional data
Gu et al. Parallelizing machine learning optimization algorithms on distributed data-parallel platforms with parameter server
CN106874215B (en) Serialized storage optimization method based on Spark operator
Mackey et al. Parallel k-means++ for multiple shared-memory architectures
Zheng et al. A randomized heuristic for stochastic workflow scheduling on heterogeneous systems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant