CN108829517B - Training method and system for machine learning in cluster environment - Google Patents


Info

Publication number
CN108829517B
CN108829517B
Authority
CN
China
Prior art keywords
node
training
computing
data
unit time
Prior art date
Legal status
Active
Application number
CN201810549619.6A
Other languages
Chinese (zh)
Other versions
CN108829517A (en)
Inventor
程大宁 (Cheng Daning)
李士刚 (Li Shigang)
张云泉 (Zhang Yunquan)
Current Assignee
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS
Priority to CN201810549619.6A
Publication of CN108829517A
Application granted
Publication of CN108829517B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/505Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load

Abstract

The invention provides a training method for machine learning in a cluster environment, which comprises the following steps: 1) according to the number of computing nodes in the cluster environment, dividing the data in a training set into a plurality of parts so that the computing nodes can execute the training operation in parallel; 2) training, with each computing node in the cluster environment, the portion of the data set assigned to it, so that each computing node trains a machine learning model in parallel; 3) carrying out a weighted average of the processing results of the individual computing nodes, the weight of each computing node being set such that the decay rates of the variances and/or means of the individual nodes in the cluster environment are consistent with, or close to, each other.

Description

Training method and system for machine learning in cluster environment
Technical Field
The present invention relates to training for machine learning in a cluster environment, and more particularly, to a method and system for training machine learning in a cluster environment with unbalanced load.
Background
With the development of machine learning and artificial intelligence techniques, the amount of computation required to complete tasks of application scenarios has increased dramatically, and in order to increase the computation speed, a large number of hardware devices are required to process data in a parallel manner. When a hardware device is actually used to implement a training process of machine learning, in a cluster environment, computation is generally performed in parallel by using a large number of computing nodes included in a cluster, and parameters or computation results of a machine model corresponding to a training target are obtained by iteratively performing computation.
Many users purchase hardware products according to the amount of computation their tasks require and the computing power of the hardware devices. Because hardware products are updated very quickly, it is very common for a user to accumulate, over a relatively long period, different batches of CPUs and GPUs with different computing capabilities. It is generally considered in the art to be unsuitable to run the same computing task in parallel on different batches of hardware with different computing capabilities as computing nodes of the same cluster. On the one hand, in parallel processing the computation time is determined by the computing node with the slowest processing speed, so if hardware with different computing capabilities is mixed as computing nodes, the advantage of the more capable hardware cannot be exploited. On the other hand, using different hardware within one cluster is inconvenient for unified management. For these two reasons, the field generally builds a cluster only from hardware with the same or similar processing capacity, and lets the nodes of that cluster carry out the iterative machine-learning computation in parallel.
However, for a user who has separately purchased hardware with different computing power several times, this means the different batches cannot be used together for the same computation, which is very uneconomical in terms of usage cost. When the overall computing power needs to be increased, the user must either buy more powerful hardware again or look for new hardware with the same or similar computing power as the hardware already purchased.
Disclosure of Invention
Accordingly, it is an object of the present invention to overcome the above-mentioned drawbacks of the prior art, and to provide a training method for machine learning in a clustered environment, comprising:
1) according to the number of computing nodes in the cluster environment, dividing data in a training set into a plurality of parts for each computing node to execute training operation in parallel;
2) training, with each computing node in the cluster environment, the portion of the data set assigned to it, such that each computing node trains a machine learning model in parallel;
3) carrying out a weighted average of the processing results of the respective computing nodes, the weight of each computing node being set such that the decay of the variance and/or mean of the respective nodes in the cluster environment is consistent or close across nodes.
Preferably, according to the method, the weight of each computing node is set as an exponential function of the difference between the amount of data that the computing node can process per unit time and the amount of data that the computing node with the fastest computing speed in the cluster environment can process per unit time.
Preferably, according to the method, if the loss function used for training of machine learning belongs to a non-strongly convex function, the weight of the computing node i is set as:
weight_{1-λη, i} = (1 − λη)^{T_i} / Σ_{j=1}^{k} (1 − λη)^{T_j}
where λ is the regularization coefficient, η is the step size, the computing nodes are numbered 1 to k, and T_i is the difference between the amount of data that computing node i can process per unit time and the amount of data that the computing node with the fastest computing speed in the cluster environment can process per unit time.
Preferably, according to the method, wherein the loss function is a non-strongly convex function, the number k of computing nodes, the regularization coefficient λ, the step size η, and the difference T_i between the amount of data that computing node i can process per unit time and the amount of data that the computing node with the fastest computing speed in the cluster environment can process per unit time are selected to satisfy the expression:
Figure BDA0001680825560000022
preferably, according to the method, if the loss function used for training of machine learning belongs to a strong convex function, the weight of the computing node i is set as:
weight_{r, i} = r^{T_i} / Σ_{j=1}^{k} r^{T_j}
where r is the convergence rate of the SGD algorithm for the given training data set and training function, the computing nodes are numbered 1 to k, and T_i is the difference between the amount of data that computing node i can process per unit time and the amount of data that the computing node with the fastest computing speed in the cluster environment can process per unit time.
Preferably, according to the method, wherein the loss function is a strongly convex function, the number k of computing nodes, the convergence rate r of the SGD algorithm given a training data set and a training function, and the difference T_i between the amount of data that computing node i can process per unit time and the amount of data that the computing node with the fastest computing speed in the cluster environment can process per unit time are selected to satisfy the expression:
Figure BDA0001680825560000032
preferably, according to the method, the weight of each of the computation nodes is set to be, if the loss function used for training for machine learning belongs to a strong convex function:
and fitting the difference between the data quantity which can be processed by the computing node in unit time and the data quantity which can be processed by the computing node with the highest computing speed in the cluster environment in unit time by using an exponential function according to the convergence condition of multiple iterations in the training process.
Preferably, according to the method, wherein the compute node is a CPU, or a GPU, or a MIC, or a combination thereof; and the cluster environment includes compute nodes of the same and/or different computing capabilities.
The invention also provides a computer-readable storage medium in which a computer program is stored, the computer program, when executed, implementing any one of the methods described above.
And, a training system for machine learning in a clustered environment, comprising:
a storage device and a processor;
wherein the storage means is adapted to store a computer program which, when executed by the processor, is adapted to carry out the method of any of the above.
Compared with the prior art, the invention has the advantages that:
the method is particularly suitable for training machine learning in a cluster environment with unbalanced load, and particularly can achieve the effect similar to that of implementing the Simul Parallel SGD algorithm in a load balancing environment when the iteration is performed for the same times in the load unbalanced environment. Based on the scheme, a user can use hardware devices which are purchased for multiple times and have different computing capabilities together as a computing cluster for executing machine learning in parallel, hardware with stronger computing capabilities does not need to be purchased again or new hardware with the same or similar computing capabilities as the purchased hardware needs to be searched, hardware cost is greatly saved, and the method is particularly suitable for applications which need to use a large number of hardware devices to process data in a parallel mode.
Drawings
Embodiments of the invention are further described below with reference to the accompanying drawings, in which:
Fig. 1 is a schematic diagram of the conventional barrel SGD algorithm.
Fig. 2 is a schematic diagram of the barrel SGD algorithm as improved according to an embodiment of the present invention.
Fig. 3 shows a scheme in which worker nodes are used together with parameter servers in a cluster, according to an embodiment of the present invention.
FIG. 4 is a flow diagram of a training method for machine learning in a clustered environment in accordance with one embodiment of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
As introduced in the Background, when machine-learning training is performed by parallel computation in a cluster environment, the desired result has to be obtained by repeating the computation iteratively. For example, a training objective may be defined as finding the value of the parameter to be determined that minimizes the loss function (also referred to as the objective function) expressed in terms of that parameter. To make the value of the loss function converge into a very small range as quickly as possible through iterative computation during machine-learning training, the most commonly used solution model is Stochastic Gradient Descent (SGD).
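For reference, the following is a minimal sketch of this single-node SGD iteration (function and parameter names are illustrative assumptions, not taken from the patent; `lam` and `eta` play the same roles as the regularization coefficient λ and step size η used later in the text):

```python
import numpy as np

def sgd(grad_fn, data, w0, eta=1e-4, lam=1e-2, epochs=1):
    """Plain SGD with L2 regularization: w <- w - eta * (grad_fn(w, x, y) + lam * w)."""
    w = np.asarray(w0, dtype=float).copy()
    for _ in range(epochs):
        for x, y in data:                 # one (sample, label) pair per step
            g = grad_fn(w, x, y)          # gradient of the per-sample loss at w
            w -= eta * (g + lam * w)      # regularized stochastic gradient step
    return w
```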
Meanwhile, because machine-learning training requires repeated iterative computation over a large amount of training data, the amount of computation is huge. Many computers are therefore often used to train the same machine model with the same loss function in parallel, and the computation results of all the computers are aggregated into the finally output machine model, which accelerates the training process. Such an environment containing many computers is referred to as a cluster environment, and each computer used in parallel is referred to as a computing node.
Conventional parallel SGD algorithms fall into two broad categories: delayed SGD and barrel SGD. A delay-type SGD algorithm updates the current model using the gradient computed on a past model (i.e., using a delayed gradient), where the difference in iteration count between the past model and the current model may not exceed a specified value. Theoretical analysis shows that the convergence rate of delay-type SGD decreases as the delay increases, and this slowdown outpaces the acceleration gained from parallelism. Keeping the delay from growing large inevitably incurs communication overhead, so a delay-type SGD system has to trade off parallelism (i.e., the number of computing nodes in the cluster), delay, and system efficiency.
In actual industrial practice, the engineering realization of the delayed SGD algorithm is the "parameter server". The computing nodes of a parameter-server system are divided into a server node and a number of worker nodes; the workers' main job is to pull the current model from the server node, compute the stochastic gradient corresponding to that model, and deliver the gradient to the server node. After receiving the delayed gradient, the server updates the model with it.
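A minimal, purely sequential sketch of the parameter-server pattern just described (all names are illustrative assumptions; in a real system the workers would run as separate processes and their gradients would arrive with some delay):

```python
import numpy as np

def parameter_server_sgd(grad_fn, worker_shards, w0, eta=1e-4, lam=1e-2, steps=1000):
    """Server holds the model; each 'worker' in turn pulls it, computes a
    stochastic gradient on its own shard, and the server applies the update.
    (The loop here is sequential, so the gradient is never actually stale.)"""
    w = np.asarray(w0, dtype=float).copy()
    rng = np.random.default_rng(0)
    for t in range(steps):
        shard = worker_shards[t % len(worker_shards)]   # round-robin over workers
        x, y = shard[rng.integers(len(shard))]          # worker samples one data point
        g = grad_fn(w, x, y)                            # gradient w.r.t. the pulled model
        w -= eta * (g + lam * w)                        # server-side update
    return w
```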
The other category, barrel SGD, uses the gain in variance to accelerate the SGD algorithm: it averages the models trained by the different nodes to produce the final model. Fig. 1 shows a schematic diagram of this barrel SGD algorithm. In the scheme shown in Fig. 1, each node trains the same machine model on training-set data in parallel, the nodes' data-processing capacities are comparable, and after a period of training each node has performed an equal amount of computation; the outputs of the nodes are then averaged to obtain the final training result. This algorithm incurs very little communication overhead; however, its high performance is not always realized, and the effectiveness of its parallelism depends on the variance of the training data set.
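The barrel scheme of Fig. 1 then amounts to training one model per data shard and averaging the outputs. A minimal sketch, reusing the illustrative `sgd` helper from the earlier snippet:

```python
import numpy as np

def barrel_sgd(grad_fn, shards, w0, eta=1e-4, lam=1e-2, epochs=1):
    """Unweighted barrel SGD: every node trains its own model on its shard,
    and the node models are averaged with equal weights."""
    models = [sgd(grad_fn, shard, w0, eta, lam, epochs) for shard in shards]
    return np.mean(models, axis=0)
```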
The two SGD algorithms each have advantages and disadvantages, but neither is designed for machine learning in a cluster environment with an unbalanced load, and neither solves the problem described in the Background, namely that hardware devices purchased in several batches with different computing capabilities cannot be used together. For a delay-type SGD system, a load-unbalanced cluster environment increases the delay, and in theory the convergence upper bound increases twofold as the delay grows, so the overall efficiency of the system becomes low. For conventional barrel SGD, when the cluster load is unbalanced, the slowest node becomes the performance bottleneck: all nodes must complete an equal amount of work before moving on to the next stage, which means that fast nodes must wait for slow nodes to finish the same amount of work, significantly reducing the computational efficiency of training. As stated in the Background, in actual use a user may purchase hardware devices with different computing capabilities several times, and if such hardware is to be used jointly for machine learning, the cluster formed by it is a load-unbalanced cluster environment. To handle such situations, certain improvements to the SGD algorithm are needed.
To address this, the present invention improves the traditional barrel SGD algorithm: by adopting a weighted parallel scheme, the training time of the parallel SGD algorithm for a machine model in a load-unbalanced cluster environment is significantly reduced without changing the convergence rate. Fig. 2 shows a schematic diagram of the barrel SGD algorithm as improved by the present invention. As shown in Fig. 2, the computing capacities of nodes 1, 2 and 3 differ from one another; when the three nodes are assigned data sets of equal or unequal size, then at a given point in time after training has run for a while, the amounts of computation actually performed by nodes 1, 2 and 3 differ, and the results of the nodes can be assigned corresponding weights according to these differences, weighted-averaged, and output as the computation result.
The inventors explored how to set such weights. After studying the convergence characteristics of SGD-type algorithms, the inventors found that their convergence depends mainly on three quantities: the variance of the model as it converges to the fixed point, the mean of the model as it converges to the fixed point, and the difference between the fixed-point model and the optimal model. The difference between the fixed-point model and the optimal model is determined by the iteration step size; it is an inherent property of the iteration and is fixed before computation, so it is not affected by load imbalance. By contrast, both the variance and the mean decrease exponentially, so in a parallel stochastic gradient descent algorithm the associated coefficients should also vary exponentially. The inventors therefore considered that the weights set for the respective nodes of the cluster can be determined from the decay of the variance and mean of the objective function used for training.
Through derivation and verification, the inventors found that the difference between the mean of the machine-learning model at the current iteration count and the mean of the finally converged model, ‖μ_{D_t^η} − μ_{D^η}‖, and the corresponding variance gap, σ_{D_t^η} − σ_{D^η}, both decrease exponentially as the iterations proceed, with a per-iteration decay factor bounded above by 1 − λη; that is, both quantities shrink on the order of (1 − λη)^t. Here λ is the regularization coefficient, η is the step size, G is the maximum Lipschitz norm of the loss function on the training set, μ is the mean, σ is the variance, t is the number of iterations, D_t^η is the probability-density distribution of the machine-learning model after t iterations with step size η, and D^η is the probability-density distribution after the machine-learning model has converged with step size η.
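For a rough sense of scale (using, purely as an illustration, the regularization coefficient λ = 0.01 and step size η = 0.0001 that appear in the experiment section below): the per-iteration factor is 1 − λη = 1 − 10^-6 = 0.999999, and since (1 − 10^-6)^1,000,000 ≈ e^-1 ≈ 0.37, this bound allows the mean and variance gaps to shrink by roughly a factor of e only after about a million iterations.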
The inventors therefore considered that the distance to convergence can be corrected directly (i.e., by weighted averaging), adjusting the weight of any node whose model has not yet decayed to its final value.
Also, as described above, the inventors consider that the results of the respective nodes may be assigned corresponding weights for weighted averaging according to the differences between the amounts of computation actually performed by the nodes in the cluster. The invention therefore proposes to take as a reference the node with the fastest execution speed, i.e., the node able to process the largest amount of data per unit time, referred to herein as the "fastest node". The weight of a node i is set according to the difference T_i between the amount of data that node i can process per unit time and the amount of data that the fastest node in the cluster can process per unit time. Because the variance and mean of the model after an iteration decrease on an exponential scale during training, the inventors propose that the weight of node i should be an exponential function of T_i.
The process of adjusting the weight of each node can be viewed as making the degrees of decay of the variance and/or mean of the individual nodes in the cluster environment consistent with, or close to, each other. In this way, the gain of each node in variance is fully exploited. Although the specific examples given in the present invention use T_i as the parameter of the weight of node i, the invention is not limited thereto: other parameters may be used to determine the corresponding node weights, as long as the resulting weights can fully exploit each node's gain in variance, and a person skilled in the art can, following this teaching, use the prior art to find weights satisfying the above condition.
The upper bound 1 − λη on the decay rate given above applies to convex functions in the general sense. Although loss functions are all convex, for a strongly convex function the decay factor is much smaller (the decay is much faster), so according to one embodiment of the present invention the weight to be set for each node can be computed from the decay rate of the strongly convex function.
Thus, in some embodiments of the invention, approximate weights for the respective computing nodes can be provided based on whether the objective function is strongly convex. The inventors studied the decay characteristics of the objective function when it is a strongly convex function and when it is a non-strongly convex function, respectively, in order to determine what approximate weights should be provided for the computing nodes in each case, as follows.
The strongly convex and non-strongly convex functions of the present invention are defined as follows:
Among convex functions, if there exists a positive number σ such that the function f satisfies f(x + y) ≥ f(x) + y^T f'(x) + σ‖y‖²  (2), then f is called a σ-strongly convex function; otherwise it is a non-strongly convex function. Experiments show that when σ is very small (e.g., less than 10^-3), the function can also be treated as a non-strongly convex function.
Non-strongly convex function
If the objective function is a non-strongly convex function (for example, the hinge loss function used to train a support vector machine), then under unbalanced cluster load the corresponding models can be weighted-averaged with exponentially decaying weights, and the models generated by nodes with different loads are combined using the following weights (expression (1)):
weight_{1-λη, i} = (1 − λη)^{T_i} / Σ_{j=1}^{k} (1 − λη)^{T_j}    (1)
where weight_{1-λη, i} is the weight assigned to node i, the nodes are numbered 1 to k, and T_i is the difference between the amount of training data processed by node i and the amount processed by the fastest node. The regularization coefficient λ and the step size η are fixed before training and do not change during training, and the value of T_i can be obtained once the training of a single node has finished.
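A short sketch of expression (1) (illustrative names; `T` is the vector of data-count gaps T_i, with T_i = 0 for the fastest node):

```python
import numpy as np

def non_strongly_convex_weights(T, lam=1e-2, eta=1e-4):
    """weight_i proportional to (1 - lam*eta)**T_i, normalized to sum to 1."""
    w = (1.0 - lam * eta) ** np.asarray(T, dtype=float)
    return w / w.sum()

# e.g. non_strongly_convex_weights([0, 0, 400000]) down-weights the lagging node
```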
The theoretical upper convergence limit of the non-strongly convex function is as follows:
Figure BDA0001680825560000082
where c is the objective function, w is the model parameter, D is the probability distribution after model convergence, ‖f‖_Lip is the Lipschitz norm of f, v is the system output result, G is the maximum Lipschitz norm of the loss function on the training set, and t is the iteration number.
The above theoretical result shows that, as the iteration progresses, when the following expression (3) is satisfied the final result produced by the weighted summation shown in Fig. 2 is better than the result produced by the fastest node, i.e., the target-loss-function value of the final result is smaller than that of the result produced by the fastest node.
Figure BDA0001680825560000083
In other words, if one wishes to train a machine-learning model with a cluster whose load is unbalanced and wants the averaged output to be better than the output of the node with the highest computing performance, then the number k of computing nodes in the cluster, the regularization coefficient λ, the step size η, and the difference T_i between the training data amount of the i-th node and that of the fastest node can be chosen so as to satisfy expression (3), thereby achieving the above aim.
Strong convex function
A strongly convex function (e.g., the logistic loss used by a logistic-regression model) converges faster than a non-strongly convex function: the variance and the mean move more quickly toward the stationary point, i.e., the convergence point, and the convergence rate differs from iteration to iteration because the relationship (parallel or perpendicular) between the data sample and the current training model varies during training. These issues need to be handled, for which the present invention proposes the following measures:
in the first step, an exponential function may be used to fit the true convergence rate in actual calculations according to the actual calculation effect. Under the real condition, each iteration is a convergence process, the model is sampled at intervals for each iteration, the loss function value of the machine learning model under the current iteration progress is calculated, an exponential function is used, and y is rx+ b, fitting the true convergence rate.
In the second step, weights can be assigned to the computing nodes according to the true convergence rate using expression (4). In theory, the more strongly convex the function, the smaller its capacity to accommodate slower (more lightly loaded) nodes. By combining the gains in variance of the different nodes, variances converged to different degrees all contribute to the gain, and a better result can be obtained. When the requirement of expression (6) is met, the overall system output is better than that of the fastest node, and expression (6) also shows that the cluster can accommodate a sufficient number of slow nodes.
The weights set for the models produced by the various nodes with different (or the same) loads are:
weight_{r, i} = r^{T_i} / Σ_{j=1}^{k} r^{T_j}    (4)
where weight_{r,i} is the weight assigned to node i, r is the convergence rate of the SGD algorithm for the given training data set and training function, and b is a constant.
The theoretical upper convergence limit of the strongly convex function is:
Figure BDA0001680825560000101
The above equation shows that when the following expression (6) is satisfied, the final result produced by the weighted summation shown in Fig. 2 is better than the result produced by the fastest node; that is, the target-loss-function value of the output result is smaller than that of the result produced by the node with the fastest computation speed. The smaller the loss-function value, the better the model; and the smaller the loss-function value at the same point in time, the faster the convergence can be considered to be.
Figure BDA0001680825560000102
Similarly, if one wishes to train a machine-learning model with a load-unbalanced cluster and wants the averaged output to be better than the output of the node with the highest computing performance, then the number k of computing nodes in the cluster, the convergence rate r of the SGD algorithm for the given training data set and training function, and the difference T_i between the training data amount of node i and that of the fastest node can be chosen so as to satisfy expression (6), thereby achieving the above aim.
Training for machine learning
In conjunction with the above analysis for non-strongly convex and strongly convex functions, the inventors summarize the scheme of the invention as the following pseudo-code.
Figure BDA0001680825560000103
Figure BDA0001680825560000111
The pseudo code of the 1 st line randomly divides data in a training set into a plurality of parts according to the number of nodes in a cluster, so that each node can execute training operation in parallel;
lines 2-11 pseudo code represent training the data portion to which each node in the cluster is assigned, such that each node trains a machine learning model in parallel;
Lines 12-13: according to the difference T_i between the amount of data consumed by node i at the present time (i.e., the amount of data used to train the model so far) and the amount of data consumed by the fastest node, weights are assigned as follows: if the objective function is a non-strongly convex function and a large number of samples are perpendicular to the model during the iterations, the weight of each node is assigned as shown in formula (1); if the objective function is a strongly convex function, the weight of each node is assigned as shown in formula (4);
Line 14 of the pseudo code obtains the target machine-learning model by weighted averaging according to the weight of each node. In this way, a machine-learning model close to the output of the load-balanced Simul Parallel SGD algorithm can be generated in a load-unbalanced cluster environment (i.e., one in which the nodes have different amounts of training data).
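The pseudo code itself appears in the patent only as an image; the following Python sketch is a reconstruction of the steps summarized in lines 1-14 above (all function and variable names are illustrative, not the patent's, and the single-process simulation of the nodes is an assumption made purely for illustration):

```python
import numpy as np

def weighted_parallel_sgd(grad_fn, dataset, k, w0, lam=1e-2, eta=1e-4,
                          strongly_convex=False, r=0.99999, seed=0):
    """Sketch of the weighted barrel SGD steps described above.

    Step 1: randomly split the training data into k parts.
    Steps 2-11: train one model per (simulated) node; node speeds are
        emulated here by how much of its shard each node manages to consume.
    Steps 12-13: compute T_i, the data-count gap to the fastest node, and
        assign weights (1 - lam*eta)**T_i or r**T_i, normalized to sum to 1.
    Step 14: return the weighted average of the node models.
    """
    rng = np.random.default_rng(seed)
    shards = np.array_split(rng.permutation(len(dataset)), k)      # step 1
    speeds = rng.uniform(0.2, 1.0, size=k)                         # emulated (unequal) node speeds
    consumed, models = [], []
    for i in range(k):                                             # steps 2-11
        n_i = int(len(shards[i]) * speeds[i])                      # data actually consumed by node i
        w = np.asarray(w0, dtype=float).copy()
        for idx in shards[i][:n_i]:
            x, y = dataset[idx]
            w -= eta * (grad_fn(w, x, y) + lam * w)                # regularized SGD step
        consumed.append(n_i)
        models.append(w)
    T = max(consumed) - np.asarray(consumed, dtype=float)          # steps 12-13
    base = r if strongly_convex else (1.0 - lam * eta)
    weights = base ** T
    weights /= weights.sum()
    return sum(wt * m for wt, m in zip(weights, models))           # step 14
```

A call such as `weighted_parallel_sgd(grad_fn, data, k=10, w0=np.zeros(dim))` would then play the role of steps 1-4 described next; in a real cluster the unequal speeds come from the hardware itself rather than being simulated.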
In the above embodiments, specific expressions for the respective decay rates were given for the cases in which the loss function is strongly convex and non-strongly convex, in order to determine the weights set for the individual nodes. In other embodiments of the present invention, the decay rate of the loss function therefore has to be determined experimentally. Specifically, to measure the decay rate of the loss function, the loss-function value of the model can be computed once every fixed number of iterations (for example, every 10000 iterations) while each node's model is trained, and after training is finished the decay rate of the loss function can be fitted from the recorded loss-function values. The weight of node i is then determined according to the formula
weight_{r, i} = r^{T_i} / Σ_{j=1}^{k} r^{T_j}
where r is the decay rate to which the loss function has been fitted.
The training method for machine learning in a clustered environment according to the present invention will be described below by way of a specific embodiment. Referring to fig. 4 in conjunction with the cluster shown in fig. 2, according to one embodiment of the invention, the method includes:
step 1, distributing the data volume to be processed for each computing node in the cluster environment.
In the present invention, the amounts of data allocated to the computing nodes need not be equal to one another. Corresponding to line 1 of the pseudo code above, the splitting of the training-set data may be random; for example, the amount of data allocated to each computing node may differ from node to node. This is because the amount of data processed by a node, together with the number of iterations, determines the amount of computation performed by that node, and whether the computing nodes have performed equal amounts of computation at the time of the weighted averaging does not affect the present invention; the amounts of data allocated to the computing nodes therefore do not have to be equal.
Step 2: each computing node in the cluster environment processes the data allocated to it in parallel.
In this step, each computing node performs the corresponding computation based on its loss function and the data assigned to it (e.g., training-set data). The specific computation depends on the application being trained; the nodes process the data distributed to them in parallel.
Step 3: perform a weighted average of the results obtained by the computing nodes. The weight of each computing node is obtained as follows: depending on whether the loss function is a non-strongly convex or a strongly convex function, and on the difference between the amount of data consumed by the computing node and the amount consumed by the fastest computing node, the weight of each computing node is determined:
if loss functionFor non-strongly convex functions, use
Figure BDA0001680825560000122
And allocating weight to the ith node, wherein lambda is a regular coefficient, eta is a step length, and the node numbering range is 1iIs the difference between the training data volume of the ith node and the training data volume of the fastest node;
Preferably, the number k of computing nodes in the cluster, the regularization coefficient λ, the step size η, and the difference T_i between the training data amount of the i-th node and that of the fastest node are chosen to satisfy
Figure BDA0001680825560000131
If the loss function is a strongly convex function, the weight
weight_{r, i} = r^{T_i} / Σ_{j=1}^{k} r^{T_j}
is assigned to node i, where weight_{r,i} is the weight assigned to node i and r is the convergence rate of the SGD algorithm for the given training data set and training function;
Preferably, the number k of computing nodes in the cluster, the convergence rate r of the SGD algorithm for the given training data set and training function, and the difference T_i between the training data amount of node i and that of the fastest node are chosen to satisfy
Figure BDA0001680825560000133
Step 4: output the result obtained by weighted-averaging the results of the computing nodes.
After each node obtains its own calculation result through the iterative process, the results obtained by each node are subjected to a weighted average process to obtain a weighted average result, that is:
v = Σ_{i=1}^{k} weight_i · w_{i,t}
where w_{i,t} is the computation result obtained by node i after t iterations, and weight_i is the weight set for node i.
Through the steps 1-4, the training of machine learning can be efficiently carried out in the cluster environment with unbalanced load.
Mixed use with delayed SGD algorithm
The present invention can also use the improved barrel SGD of the preceding embodiments in combination with a delayed SGD algorithm. The delayed SGD algorithm is a stochastic gradient descent method implemented on parameter servers and is a completely different algorithm from barrel SGD; how the improved barrel SGD algorithm according to the present invention is mixed with it is described in detail below. Specifically, as shown in Fig. 3, the worker nodes connected under one server have performance close to one another (a server may also have only a single worker connected to it), and each group of workers is trained with a separate delay-type SGD method. After the servers finish training on the data set, the output results of the servers are weighted-averaged; the weights used in the weighted average are based on the measured convergence rate r, following expression (4), which yields the final result. According to the theory of delayed SGD, the convergence rate is maximized only when workers of similar efficiency serve one server, so in the present invention it is preferable that the worker nodes belonging to the same server have the same or similar performance.
For the scheme involving parameter servers, under unbalanced load the convergence upper bound of the objective function is the same as the convergence upper bound of the strongly convex case above. The results output by the different servers in Fig. 3 can therefore be combined using the same weights as in the strongly convex case when computing the weighted average.
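A short sketch of this combination step (illustrative names): given the model output by each parameter server and the amount of data its group has consumed, the server models are combined with the r^{T_i}-style weights of expression (4):

```python
import numpy as np

def combine_server_models(server_models, consumed, r):
    """Weighted average of parameter-server outputs; T_i is the data-count
    gap between server i and the server that consumed the most data."""
    consumed = np.asarray(consumed, dtype=float)
    T = consumed.max() - consumed
    weights = r ** T
    weights /= weights.sum()
    return sum(w * np.asarray(m, dtype=float) for w, m in zip(weights, server_models))
```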
Effect testing
As described above, the present invention aims to provide a scheme for efficiently training a machine model in a load-unbalanced cluster environment by using computers with different computing capacities together, so that when training is run in parallel in such an environment, nodes with high processing speed do not have to wait for nodes with low processing speed to complete an equivalent amount of computation. To verify that the invention achieves this aim, the inventors carried out the following effect test on the scheme of the invention, comparing the effect of the WP-SGD scheme of the invention in a load-unbalanced cluster environment with that of the traditional Simul Parallel SGD algorithm in a load-balanced cluster environment. The specific test is as follows.
Experimental environment: a cluster of 10 nodes is used; each node hosts a Xeon(R) CPU E5-2660 v2 @ 2.20 GHz, and each node carries one processor.
Experimental configuration:
in this experiment, the regularization coefficient λ is set to 0.01, the step length η is set to 0.0001, and the estimated convergence rate r is set to 0.99999. Since the final iteration result is very close to 0, we set the initial value of each feature of the machine learning model (i.e. each dimension of the model vector) to 4.0 in order to get more iteration steps.
The load-unbalanced environment is constructed by software simulation and is configured as follows: of the 10 nodes, 8 have the same computation speed and are called normal nodes; the other 2 are slow nodes whose computation speed is 1/5 that of the 8 normal nodes. In this load-unbalanced environment, the experiment tested the scheme according to the present invention as well as the direct-averaging scheme shown in Fig. 1.
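As a back-of-the-envelope illustration of what this configuration means for the weights (the arithmetic is mine, not the patent's): if each normal node consumes N training samples over some time window, each slow node consumes N/5, so the data-count gap is T_i = 0 for the eight normal nodes and T_i = 4N/5 for the two slow nodes; with the non-strongly-convex weights of expression (1), each slow node's model is therefore down-weighted, before normalization, by a factor of (1 − λη)^{4N/5} ≈ e^{−4N/(5·10^6)} relative to a normal node (using λη = 10^-6 from the configuration above).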
For comparison, a load-balanced environment is also constructed, configured as follows: 10 nodes, all with the same performance as the normal nodes in the load-unbalanced environment. In this load-balanced environment, the experiment tests the traditional Simul Parallel SGD algorithm.
The present experiment used the hinge loss function for training the support vector machine as the objective function.
Data source: the KDD Cup 2010 (algebra) data set was used as experimental data; it contains 8,407,752 samples with 20,216,830 features per sample, of which on average 20-40 are non-zero for each sample.
Evaluation method: to show how the hinge-loss value varies with the iterations, each node in this experiment trained a model individually using a portion of the data. During training, so that the comparison is made at common points, all systems stop training every 100000 training steps; the model is stored to disk and evaluated, and training then continues.
Results: Table 1 shows how the objective function of Simul Parallel SGD, of the method according to the invention, and of the direct-averaging method varies as the iterations progress. As can be seen from Table 1, at the same number of iterations the loss-function values of the machine-learning models obtained by Simul Parallel SGD in the load-balanced environment and by the method according to the invention in the load-unbalanced environment are almost identical. Note that in the load-unbalanced environment, the fast and slow nodes of the present invention have gone through different numbers of iterations when the iterated results are output, the fast nodes having performed relatively more iterations. To have a uniform comparison criterion with the method in the load-balanced environment, the number of iterations in the load-unbalanced environment is taken to be the number of iterations performed by the fastest node. Considering that the method is run in the load-unbalanced environment, where fast nodes do not have to wait for slow nodes to finish an equal amount of work, the method achieves an effect similar to Simul Parallel SGD in a load-balanced environment; run in a load-balanced environment it would do even better. Compared with the direct-averaging method, the present method has a faster mathematical convergence rate and is closer to the load-balanced Simul Parallel SGD algorithm: for example, at 500000 iterations the loss-function value of Simul Parallel SGD and that of the present method are roughly the same (both about 185), whereas the direct-averaging method must iterate to about 700000 steps to reach a comparable objective-function value.
The above phenomenon is consistent with the theoretical analysis result.
Table 1: Objective-function values of the model parameters obtained with the different parallel SGD algorithms and optimization methods.
The method is therefore particularly suitable for machine-learning training in a load-unbalanced cluster environment: run for the same number of iterations in a load-unbalanced cluster environment (where, for the present invention, the iteration count is that of the fastest node), it achieves an effect similar to that of the Simul Parallel SGD algorithm in a load-balanced environment. Based on this scheme, a user can jointly use hardware devices purchased at different times and with different computing capabilities as one computing cluster for parallel machine learning, without having to buy more powerful hardware again or to look for new hardware with the same or similar computing power as the hardware already purchased; this greatly reduces hardware cost and is especially suitable for applications that must process data in parallel on a large number of hardware devices.
It should be noted that, all the steps described in the above embodiments are not necessary, and those skilled in the art may make appropriate substitutions, replacements, modifications, and the like according to actual needs.
Also, the computing node described in the present invention corresponds to a process, and may schedule resources such as a CPU, a GPU, a MIC, or any corresponding computing resource, and combinations thereof.
Finally, it should be noted that the above embodiments are only used for illustrating the technical solutions of the present invention and are not limited. Although the present invention has been described in detail with reference to the embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (9)

1. A training method for machine learning based on a stochastic gradient descent method in a clustered environment, comprising:
1) according to the number of computing nodes in the cluster environment, dividing data in a training set into a plurality of parts for each computing node to execute training operation in parallel;
2) training the portion of the assigned data set with each compute node in the cluster environment, such that each compute node trains a machine learning model in parallel;
3) carrying out a weighted average of the processing results of the computing nodes, wherein the weight of each computing node is set so that the decay rates of the variance and/or mean of the computing nodes in the cluster environment are consistent with, or close to, each other, and wherein the calculation of the weight involves the difference between the amount of data that the computing node can process per unit time and the amount of data that the computing node with the fastest computing speed in the cluster environment can process per unit time, the regularization coefficient of the objective function, the step size set in the algorithm, and the convergence rate of the SGD algorithm for the corresponding objective function.
2. The method of claim 1, wherein if the loss function used for training for machine learning belongs to a non-strongly convex function, the weight of the computing node i is set as:
weight_{1-λη, i} = (1 − λη)^{T_i} / Σ_{j=1}^{k} (1 − λη)^{T_j}
wherein λ is a regularization coefficient, η is a step size, the computing nodes are numbered 1 to k, and T_i is the difference between the amount of data that computing node i can process per unit time and the amount of data that the computing node with the fastest computing speed in the cluster environment can process per unit time.
3. The method of claim 2, wherein, for the loss function being a non-strongly convex function, the number k of computing nodes, the regularization coefficient λ, the step size η, and the difference T_i between the amount of data that computing node i can process per unit time and the amount of data that the computing node with the fastest computing speed in the cluster environment can process per unit time are selected to satisfy the expression:
Figure FDA0002881730030000021
4. the method of claim 1, wherein if the loss function used for training for machine learning belongs to a strong convex function, the weight of the computing node i is set as:
weight_{r, i} = r^{T_i} / Σ_{j=1}^{k} r^{T_j}
wherein r is the convergence rate of the SGD algorithm given a training data set and a training function, the computing nodes are numbered 1 to k, and T_i is the difference between the amount of data that computing node i can process per unit time and the amount of data that the computing node with the fastest computing speed in the cluster environment can process per unit time.
5. The method according to claim 4, wherein, for the loss function being a strongly convex function, the number k of computing nodes, the convergence rate r of the SGD algorithm given a training data set and a training function, and the difference T_i between the amount of data that computing node i can process per unit time and the amount of data that the computing node with the fastest computing speed in the cluster environment can process per unit time are selected to satisfy the expression:
Figure FDA0002881730030000023
6. the method of claim 1, wherein the weight of each of the compute nodes is set to include:
if the loss function used for training of machine learning belongs to a non-strong convex function, setting the weight of the computing node i as:
weight_{1-λη, i} = (1 − λη)^{T_i} / Σ_{j=1}^{k} (1 − λη)^{T_j}
wherein λ is a regularization coefficient, η is a step size, the computing nodes are numbered 1 to k, and T_i is the difference between the amount of data that computing node i can process per unit time and the amount of data that the computing node with the fastest computing speed in the cluster environment can process per unit time;
if the loss function used for training of machine learning belongs to a strong convex function, setting the weight of the computing node i as:
weight_{r, i} = r^{T_i} / Σ_{j=1}^{k} r^{T_j}
wherein r is the convergence rate of the SGD algorithm given a training data set and a training function, the computing nodes are numbered 1 to k, and T_i is the difference between the amount of data that computing node i can process per unit time and the amount of data that the computing node with the fastest computing speed in the cluster environment can process per unit time;
and the difference between the amount of data that the computing node can process per unit time and the amount of data that the computing node with the fastest computing speed in the cluster environment can process per unit time is fitted with an exponential function according to the convergence behaviour of multiple iterations during training.
7. The method of any of claims 1-6, wherein the compute node is a CPU, or a GPU, or a Mic, or a combination thereof; and the cluster environment includes compute nodes of the same and/or different computing capabilities.
8. A computer-readable storage medium, in which a computer program is stored which, when being executed, is adapted to carry out the method of any one of claims 1-7.
9. A training system for machine learning in a clustered environment, comprising:
a storage device and a processor;
wherein the storage means is for storing a computer program for implementing the method according to any of claims 1-7 when executed by the processor.
CN201810549619.6A 2018-05-31 2018-05-31 Training method and system for machine learning in cluster environment Active CN108829517B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810549619.6A CN108829517B (en) 2018-05-31 2018-05-31 Training method and system for machine learning in cluster environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810549619.6A CN108829517B (en) 2018-05-31 2018-05-31 Training method and system for machine learning in cluster environment

Publications (2)

Publication Number Publication Date
CN108829517A CN108829517A (en) 2018-11-16
CN108829517B true CN108829517B (en) 2021-04-06

Family

ID=64147210

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810549619.6A Active CN108829517B (en) 2018-05-31 2018-05-31 Training method and system for machine learning in cluster environment

Country Status (1)

Country Link
CN (1) CN108829517B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109522129A (en) * 2018-11-23 2019-03-26 快云信息科技有限公司 A kind of resource method for dynamically balancing, device and relevant device
CN109522436A (en) * 2018-11-29 2019-03-26 厦门美图之家科技有限公司 Similar image lookup method and device
JP7230683B2 (en) * 2019-05-21 2023-03-01 富士通株式会社 Arithmetic processing device, program, and method of controlling arithmetic processing device
CN110705705B (en) * 2019-09-25 2022-04-22 浪潮电子信息产业股份有限公司 Convolutional neural network model synchronous training method, cluster and readable storage medium
CN110929884B (en) * 2019-11-22 2023-05-16 北京大学 Classification method and device for distributed machine learning optimization based on column division
CN113128696A (en) * 2019-12-31 2021-07-16 香港理工大学深圳研究院 Distributed machine learning communication optimization method and device, server and terminal equipment
CN111752713B (en) 2020-06-28 2022-08-05 浪潮电子信息产业股份有限公司 Method, device and equipment for balancing load of model parallel training task and storage medium
CN111782592A (en) * 2020-06-30 2020-10-16 北京百度网讯科技有限公司 Method, device and system for dividing data

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110320391A1 (en) * 2010-06-29 2011-12-29 Nec Laboratories America, Inc. Method and Apparatus for Predicting Application Performance Across Machines with Different Hardware Configurations
CN105843555A (en) * 2016-03-18 2016-08-10 南京邮电大学 Stochastic gradient descent based spectral hashing method in distributed storage
CN106250461A (en) * 2016-07-28 2016-12-21 北京北信源软件股份有限公司 A kind of algorithm utilizing gradient lifting decision tree to carry out data mining based on Spark framework
CN106339351A (en) * 2016-08-30 2017-01-18 浪潮(北京)电子信息产业有限公司 SGD (Stochastic Gradient Descent) algorithm optimization system and method
CN108009642A (en) * 2016-10-31 2018-05-08 腾讯科技(深圳)有限公司 Distributed machines learning method and system
CN106779093A (en) * 2017-01-06 2017-05-31 中国科学院上海高等研究院 Distributed machines learning training method and its system based on sliding window sampling
CN107018184A (en) * 2017-03-28 2017-08-04 华中科技大学 Distributed deep neural network cluster packet synchronization optimization method and system
CN107153843A (en) * 2017-05-03 2017-09-12 西安电子科技大学 Surface subsidence forecasting system and method based on SVMs
CN108090510A (en) * 2017-12-15 2018-05-29 北京大学 A kind of integrated learning approach and device based on interval optimization

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"基于差异合并的分布式随机梯度下降算法";陈振宏 等;《计算机学报》;20151031;第38卷(第10期);正文第2054-2063页 *

Also Published As

Publication number Publication date
CN108829517A (en) 2018-11-16

Similar Documents

Publication Publication Date Title
CN108829517B (en) Training method and system for machine learning in cluster environment
Renggli et al. SparCML: High-performance sparse communication for machine learning
CN110399222B (en) GPU cluster deep learning task parallelization method and device and electronic equipment
US9032416B2 (en) Load balancing using progressive sampling based on load balancing quality targets
US8577816B2 (en) Optimized seeding of evolutionary algorithm based simulations
Esteves et al. Competitive k-means, a new accurate and distributed k-means algorithm for large datasets
Dehnen A fast multipole method for stellar dynamics
US8560472B2 (en) Systems and methods for supporting restricted search in high-dimensional spaces
US9477532B1 (en) Graph-data partitioning for workload-balanced distributed computation with cost estimation functions
Lastovetsky et al. Model-based optimization of EULAG kernel on Intel Xeon Phi through load imbalancing
Gast The power of two choices on graphs: the pair-approximation is accurate?
CN103559205A (en) Parallel feature selection method based on MapReduce
Heene et al. Load balancing for massively parallel computations with the sparse grid combination technique
CN110992432A (en) Depth neural network-based minimum variance gradient quantization compression and image processing method
Stiskalek et al. The scatter in the galaxy–halo connection: a machine learning analysis
CN113391894A (en) Optimization method of optimal hyper-task network based on RBP neural network
US10996976B2 (en) Systems and methods for scheduling neural networks by varying batch sizes
CN117311998B (en) Large model deployment method and system
Aljarah et al. A mapreduce based glowworm swarm optimization approach for multimodal functions
Aviv et al. Learning under delayed feedback: Implicitly adapting to gradient delays
Schmidt et al. Load-balanced parallel constraint-based causal structure learning on multi-core systems for high-dimensional data
Gu et al. Parallelizing machine learning optimization algorithms on distributed data-parallel platforms with parameter server
CN106874215B (en) Serialized storage optimization method based on Spark operator
Mackey et al. Parallel k-means++ for multiple shared-memory architectures
Zheng et al. A randomized heuristic for stochastic workflow scheduling on heterogeneous systems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant