CN111444026A - Deep learning training resource allocation prediction method in cloud environment - Google Patents

Deep learning training resource allocation prediction method in cloud environment

Info

Publication number
CN111444026A
CN111444026A (application CN202010313690.1A)
Authority
CN
China
Prior art keywords
training
round
data
model
resource
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010313690.1A
Other languages
Chinese (zh)
Inventor
梁毅
刘明洁
丁毅
丁振兴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202010313690.1A priority Critical patent/CN111444026A/en
Publication of CN111444026A publication Critical patent/CN111444026A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a resource parameter configuration method for deep learning training in a cloud environment, where the parameters comprise the batch size parameter, the resource supply amount and the number of iteration rounds. The method comprises the following steps: collecting performance indexes for each round of the model training process; modeling the mathematical relation among the batch size parameter, the resource supply amount, the number of iteration rounds, the training data scale, the training time consumption and the training accuracy with an order-preserving (isotonic) regression method; constructing constraint relations according to the training cost and accuracy requirements of the deep learning model in the cloud environment; searching for a suitable parameter configuration with an optimized search algorithm; and finally training the model with that parameter configuration. Experiments show that the method can effectively reduce the time cost of deep learning model training while reaching the preset training accuracy, thereby meeting the training requirements.

Description

Deep learning training resource allocation prediction method in cloud environment
Technical Field
The invention belongs to the technical field of cloud platform data processing, and particularly relates to a method for cooperatively setting deep learning model training parameters and resource supply on a cloud platform.
Background
Deep learning is a branch of machine learning. Taking the structure of the human brain as a reference, it builds multilayer neural networks and trains them repeatedly on data, using nonlinear operations such as feature combination and feature discretization to extract data features at higher levels of abstraction. Distributed deep learning integrates a series of algorithms and systems designed to improve model training performance and model prediction accuracy. Cloud computing environments are currently the main deployment environment for deep learning computing platforms. Distributed training in a cloud environment can cope with the growing amount of data by increasing the number of computing nodes, avoiding the performance bottleneck caused by gathering all data on a single node for centralized processing. The deep learning neural networks trained in this way retain the hierarchical structure of traditional neural networks; the difference is that they have far more network nodes and introduce more complex and efficient algorithms.
In general, training a deep learning neural network is sensitive to anomalies in computing resources, and the influence of the resource configuration (the training resources) on training efficiency is pronounced. Meanwhile, in a cloud computing environment, model training usually divides the training data into many small subsets, each containing a number of samples (the batch size parameter), and the training job performs multiple rounds of iterative computation in parallel across multiple worker nodes until the model parameters meet the requirements. Experiments show that changes in the batch size parameter and the resource configuration affect both the time consumed by the training process and the prediction accuracy of the resulting model.
Deep learning computing platforms expose settings for resource usage and the batch size parameter to improve the execution efficiency of model training. When running model training, users can set these manually based on historical experience, or find a suitable resource configuration by trial and error. However, trial and error is time-consuming and laborious. Determining the parameter configuration for a given job also requires knowing the characteristics of that job, which is difficult to estimate accurately without sufficient experience. As a training job keeps running, the drawbacks of a static setting become apparent: the resulting uncertainty not only inconveniences the user but also hinders the efficient utilization of computing resources, and improper parameter settings may prevent the computation from completing as expected. Therefore, during deep learning model training, correct parameter setting means finding the best balance between computational efficiency and resource capacity.
Disclosure of Invention
Aiming at deep learning model training jobs in a cloud computing environment, and based on an analysis of the execution process and operating mechanism of model training, the invention provides a performance prediction method for the cooperative allocation of parameter configuration and resource supply. Under constraints on prediction accuracy and running time, and taking into account the computing resource environment in which the job runs, the method predicts the resource usage, batch size value and iteration count that should be set for a given training job, improving model training efficiency and saving the cost of computing resources. The method first analyses the relations among the batch size parameter, the resource supply, the number of iteration rounds, the training data set size, the training time consumption and the training accuracy; then builds a performance model of these factors with a multidimensional order-preserving regression method; and then searches for parameter combinations with a heuristic algorithm under the constraints of minimizing time cost and maximizing training accuracy to obtain a parameter configuration scheme. In this way, the computing resources in the cloud environment can be used effectively within the limited time of a model training job, training is executed efficiently, a model accuracy that meets the requirements is obtained, and the execution efficiency of model training is improved.
The method of the invention comprises four steps: initialization, performance model modeling, heuristic search, and execution of model training.
The method is realized on a computer according to the following steps:
(1) initialization
1.1) Combining the model training execution environment, construct the cloud environment resource node set DataResources = {dr_i | 1 ≤ i ≤ M}, where any resource node dr_i ∈ DataResources is represented by a triple dr_i = (center_i, capacity_i, memory_i): center_i is the identifier of the resource node, capacity_i is the storage capacity of resource node center_i, memory_i is the memory capacity of resource node center_i, and M is the total number of resource nodes;
1.2) Combine the configuration information collected during model training execution into a data set TrainDataSet = {td_j | 1 ≤ j ≤ N}, where any record td_j ∈ TrainDataSet is represented by a six-tuple td_j = (batchsize_j, res_j, time_j, round_j, datasize_j, acc_j): batchsize_j is the batch size parameter used by record td_j, res_j is the number of resource nodes used, time_j is the time consumed by each training round, round_j is the number of iteration rounds used, datasize_j is the size of the corresponding training data set, acc_j is the training accuracy finally obtained by the training task, and N is the total number of records.
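For illustration, the two data structures above can be written out as follows. This is a minimal sketch; the class and field names are assumptions rather than part of the patent.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ResourceNode:           # dr_i = (center_i, capacity_i, memory_i)
    center: str               # resource node identifier
    capacity_mb: int          # storage capacity of the node
    memory_mb: int            # memory capacity of the node

@dataclass
class TrainRecord:            # td_j = (batchsize_j, res_j, time_j, round_j, datasize_j, acc_j)
    batchsize: int            # batch size parameter used
    res: int                  # number of resource nodes used
    time_per_round: float     # time consumed by each training round
    rounds: int               # number of iteration rounds used
    datasize_gb: float        # size of the corresponding training data set
    acc: float                # training accuracy finally obtained

DataResources: List[ResourceNode] = []   # M resource nodes
TrainDataSet: List[TrainRecord] = []     # N configuration records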
(2) Performance model modeling
2.1) From the training data information of model training runs that have already been executed, construct a data matrix X_v of the form

X_v = [ x_batch1  x_resource1  x_dsize1  x_round1  y_time1  y_acc1 ]
      [ x_batch2  x_resource2  x_dsize2  x_round2  y_time2  y_acc2 ]
      [   ...        ...          ...       ...       ...      ... ]
      [ x_batchn  x_resourcen  x_dsizen  x_roundn  y_timen  y_accn ]

Each row of matrix X_v represents one group of sample data, and each group comprises the batch size parameter, the number of resource nodes, the training data set size, the total number of training rounds, the time consumed by each training round and the model accuracy obtained by training. The batch size parameter is denoted x_batchk, the number of resource nodes x_resourcek, the training data set size x_dsizek, the total number of training rounds x_roundk, the time per training round y_timek and the training accuracy y_acck, where k = 1, 2, …, n and n is the total number of records used to build the training data information.
2.2) Construct the relation between the time consumed by each round and the batch size parameter, the resource supply and the training data size; the mathematical functional relation is
f(batchsize, res, datasize) = t_round
In the above formula, batchsize is the batch size parameter variable, res is the resource supply amount given for the current model training, datasize is the training data size, and t_round is the time consumed by each round. To fit this functional relation, the first, second, third and fifth columns of data matrix X_v are selected. The correspondence between the variables in the relation and the columns of the data matrix is: batchsize in the formula corresponds to x_batchk in the data matrix, datasize corresponds to x_dsizek, res corresponds to x_resourcek, and t_round corresponds to y_timek.
After the data items are established, the order-preserving regression can be computed. Since the required regression is a multidimensional order-preserving regression, a dimension-reduction approach is adopted: the order-preserving regression between one single-dimensional independent variable and the dependent variable is considered at a time. Proceeding in this way, the order-preserving regression for the independent variables of every dimension is completed. Finally, the per-dimension order-preserving regressions are combined to complete the solution of the multidimensional order-preserving regression.
Let Λ_v be the data matrix formed by selecting the first, second, third and fifth columns of data matrix X_v, and let A_v be the vector matrix corresponding to Λ_v, i.e. A_v = (x_batchk, x_resourcek, x_dsizek, y_timek). According to the mathematical functional relation, the order-preserving regression between (x_batchk, x_resourcek, x_dsizek) and y_timek in matrix A_v is solved as follows.
Select the vectors x_batch and y_time in matrix A_v, where the elements of x_batch are (x_batch1, x_batch2, …, x_batchn)^T taken from matrix X_v. Let the weight sequence of the order-preserving regression functional relation to be obtained be w_batch = (w_u)^T, where u takes the same values as k.
Traverse x_batchk. If there exists an index l such that x_batchl > x_batch(l+1), let B = {l, l+1} and pool the pair:
x_B = (w_l*x_batchl + w_(l+1)*x_batch(l+1)) / (w_l + w_(l+1))
w_B = w_l + w_(l+1)
At this point, denote the value sequence and weight sequence obtained by this pass of the traversal in the order-preserving regression process as
(x_batch1, …, x_batch(l-1), x_B, x_batch(l+2), …, x_batchn)
(w_1, …, w_(l-1), w_B, w_(l+2), …, w_n).
If, after some pass, the resulting value sequence is order-preserving, i.e. x_batch1 ≤ … ≤ x_batch(l-1) ≤ x_B ≤ x_batch(l+1) ≤ … ≤ x_batchn, the calculation stops. Otherwise the traversal continues until an order-preserving regression sequence and its corresponding weights are obtained, and the resulting weight sequence is assigned to w_batch.
Select the vectors x_resource and y_time in matrix A_v, where the elements of x_resource are (x_resource1, x_resource2, …, x_resourcen)^T taken from matrix X_v. Let the weight sequence of the order-preserving regression functional relation to be obtained be w_resource = (w_s)^T, where s takes the same values as k.
Traverse x_resourcek. If there exists an index l such that x_resourcel > x_resource(l+1), let B = {l, l+1} and pool the pair:
x_B = (w_l*x_resourcel + w_(l+1)*x_resource(l+1)) / (w_l + w_(l+1))
w_B = w_l + w_(l+1)
At this point, denote the value sequence and weight sequence obtained by this pass of the traversal as
(x_resource1, …, x_resource(l-1), x_B, x_resource(l+2), …, x_resourcen)
(w_1, …, w_(l-1), w_B, w_(l+2), …, w_n).
If, after some pass, the resulting value sequence is order-preserving, i.e. x_resource1 ≤ … ≤ x_resource(l-1) ≤ x_B ≤ x_resource(l+1) ≤ … ≤ x_resourcen, the calculation stops. Otherwise the traversal continues until an order-preserving regression sequence and its corresponding weights are obtained, and the resulting weight sequence is assigned to w_resource.
Select the vectors x_dsize and y_time in matrix A_v, where the elements of x_dsize are (x_dsize1, x_dsize2, …, x_dsizen)^T taken from matrix X_v. Let the weight sequence of the order-preserving regression functional relation to be obtained be w_dsize = (w_h)^T, where h takes the same values as k.
Traverse x_dsizek. If there exists an index l such that x_dsizel > x_dsize(l+1), let B = {l, l+1} and pool the pair:
x_B = (w_l*x_dsizel + w_(l+1)*x_dsize(l+1)) / (w_l + w_(l+1))
w_B = w_l + w_(l+1)
At this point, denote the value sequence and weight sequence obtained by this pass of the traversal as
(x_dsize1, …, x_dsize(l-1), x_B, x_dsize(l+2), …, x_dsizen)
(w_1, …, w_(l-1), w_B, w_(l+2), …, w_n).
If, after some pass, the resulting value sequence is order-preserving, i.e. x_dsize1 ≤ … ≤ x_dsize(l-1) ≤ x_B ≤ x_dsize(l+1) ≤ … ≤ x_dsizen, the calculation stops. Otherwise the traversal continues until an order-preserving regression sequence and its corresponding weights are obtained, and the resulting weight sequence is assigned to w_dsize.
For x_batchk, x_resourcek and x_dsizek, the weights w_batch, w_resource and w_dsize are thus obtained. The finally constructed functional relation is
x_batchk*w_batch + x_resourcek*w_resource + x_dsizek*w_dsize = t_round.
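As a reference point, the pool-adjacent-violators traversal described in step 2.2) can be sketched as below; the function name, the in-place pooling strategy and the default unit weights are illustrative assumptions rather than the patent's literal procedure.

```python
def pava(x, w=None):
    """Pool adjacent violators: repeatedly merge neighbouring entries x[l] > x[l+1]
    into their weighted mean until the sequence is non-decreasing.
    Returns the order-preserving sequence and the accumulated weights."""
    x = list(map(float, x))
    w = [1.0] * len(x) if w is None else list(map(float, w))
    i = 0
    while i < len(x) - 1:
        if x[i] > x[i + 1]:                       # violator found: B = {l, l+1}
            x_b = (w[i] * x[i] + w[i + 1] * x[i + 1]) / (w[i] + w[i + 1])
            w_b = w[i] + w[i + 1]
            x[i:i + 2] = [x_b]                    # replace the pair by the pooled value
            w[i:i + 2] = [w_b]
            i = max(i - 1, 0)                     # the pooled value may violate its left neighbour
        else:
            i += 1
    return x, w
```

Each of the three independent-variable dimensions (x_batch, x_resource, x_dsize) would be processed by such a traversal to obtain its weight sequence.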
2.3) Construct the relation among the batch size parameter, the number of iteration rounds and the training accuracy; the mathematical functional relation is
f(batchsize, round) = acc
In the above formula, batchsize is the batch size parameter variable, round is the total number of model training rounds, and acc is the model training accuracy. To fit this relation, the first, fourth and sixth columns of data matrix X_v are selected. The correspondence between the variables in the relation and the columns of the data matrix is: batchsize in the formula corresponds to x_batchk in the data matrix, round corresponds to x_roundk, and acc corresponds to y_acck.
The functional relation is constructed in the same way as in step 2.2). Let the order-preserving regression weights corresponding to x_batchk and x_roundk be w_batch and w_round; the finally constructed functional relation is x_batchk*w_batch + x_roundk*w_round = acc.
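A minimal sketch of how the two fitted relations could be evaluated once the weights are available; the helper names and the scalar-weight treatment are assumptions based directly on the formulas above.

```python
def predict_round_time(batchsize: float, res: float, datasize: float,
                       w_batch: float, w_resource: float, w_dsize: float) -> float:
    # t_round = x_batch*w_batch + x_resource*w_resource + x_dsize*w_dsize
    return batchsize * w_batch + res * w_resource + datasize * w_dsize

def predict_accuracy(batchsize: float, rounds: float,
                     w_batch_acc: float, w_round: float) -> float:
    # acc = x_batch*w_batch + x_round*w_round
    return batchsize * w_batch_acc + rounds * w_round
```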
(3) Optimized parameter configuration search
3.1) the problem of selecting an optimized parameter configuration and resource allocation scheme can be formalized as follows:
min(r_vm)
min(T_limit)
max(acc_predict)
subject to:
C_batch % r_vm = 0
round * t_round ≤ T_limit
In the above relations, r_vm is the number of resource nodes, T_limit is the total running time allowed for the model training job, acc_predict is the final training accuracy, C_batch is the number of groups into which the data are divided, and t_round is the running time consumed by one round of the model training job. min(r_vm) means minimizing resource usage; min(T_limit) means minimizing training time; max(acc_predict) means maximizing model training accuracy; the constraint C_batch % r_vm = 0 means that, when the data set is divided according to the batch size value, the number of data groups should be an integral multiple of the number of worker nodes; and the constraint round * t_round ≤ T_limit ensures that the total running time over all rounds of the model training job does not exceed the total time allowed for the job.
3.2) Through the calculation of step 2), form a list of the time consumed by each training round under different combinations of batch size and resource configuration, and a list of the training accuracy reachable under different batch sizes and iteration rounds, recorded as table1 and table2 respectively.
3.3) Combining the two result tables, search with the objectives of minimum running time and optimal model training accuracy, where the running time is the number of iteration rounds multiplied by the time consumed per training round.
3.4) The search proceeds in two steps:
3.4.1) in list table2, find the batch size value and number of iteration rounds corresponding to the best reachable model accuracy;
3.4.2) using that batch size value, look up in list table1 the corresponding per-round iteration time and worker-node resource configuration;
3.4.3) if several records are found, multiply each per-round iteration time by the number of iteration rounds found in the previous step to obtain the total training time, and keep the record with the smallest total training time in a list to be confirmed for subsequent use; records that do not meet the requirement are discarded directly.
3.5) Finally, the final parameter configuration combination is determined from the list to be confirmed; the list may contain several records or none:
3.5.1) if there is no record, return to table2 and search for a suboptimal accuracy value so as to continue the search;
3.5.2) if there are several records, take the constraint of minimum total training time and find the configuration that uses the fewest worker-node resources among them, and execute model training according to that configuration.
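A sketch of the two-step search over table1 and table2 described in steps 3.3)-3.5); the tuple layouts (mirroring Tables 1 and 2 of the embodiment) and the fallback-to-suboptimal loop are assumptions about one possible implementation.

```python
def search(table1, table2, t_limit):
    """table1 rows: (batchsize, res, datasize, t_round); table2 rows: (batchsize, rounds, acc)."""
    for batchsize, rounds, acc in sorted(table2, key=lambda r: -r[2]):   # 3.4.1 / 3.5.1: best accuracy first
        candidates = []
        for b, res, _datasize, t_round in table1:                         # 3.4.2: records for this batch size
            if b != batchsize:
                continue
            total = rounds * t_round                                      # 3.4.3: total training time
            if total <= t_limit:
                candidates.append((total, res, batchsize, rounds, acc))
        if candidates:
            # 3.5.2: among the fitting records, minimum total time, then fewest worker nodes
            candidates.sort(key=lambda c: (c[0], c[1]))
            return candidates[0]
    return None                                                           # no feasible configuration found
```

With the full versions of Tables 1 and 2, such a search would return a single (total time, nodes, batch size, rounds, accuracy) record; the embodiment below reports batch size 64, 2 nodes and 38 rounds as the selected configuration.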
(4) Performing model training
4.1) form a parameter configuration list according to the solution space search result of step 3);
4.2) set the model training configuration according to the parameter configuration list.
(5) End: wait for the model training job to complete.
Drawings
FIG. 1 is a general flow diagram of the process of the present invention.
FIG. 2(a) compares the achievable accuracy under coordinated adjustment of the batch size parameter and resource allocation with that under manual configuration.
FIG. 2(b) compares the number of iteration rounds under coordinated adjustment of the batch size parameter and resource allocation with that under manual configuration.
Detailed Description
The invention is described below with reference to the accompanying drawings and a specific embodiment.
The specific implementation method can be divided into the following steps:
(1) initialization
1.1) Construct the cloud environment resource node set DataResources = {dc1 = (center1, 204800MB, 2048MB), dc2 = (center2, 204800MB, 2048MB), dc3 = (center3, 204800MB, 2048MB)}; the total number of resource nodes M is 3;
1.2) Build the configuration information set TrainDataSet collected during model training execution: TrainDataSet = {td1 = (16,1,0,0,0,0), td2 = (16,2,0,0,0,0), td3 = (16,3,0,0,0,0), td4 = (32,1,0,0,0,0), td5 = (32,2,0,0,0,0), td6 = (32,3,0,0,0,0), td7 = (128,1,0,0,0,0), td8 = (128,2,0,0,0,0), td9 = (128,3,0,0,0,0), td10 = (256,1,0,0,0,0), td11 = (256,2,0,0,0,0), td12 = (256,3,0,0,0,0), td13 = (512,1,0,0,0,0), td14 = (512,2,0,0,0,0), td15 = (512,3,0,0,0,0)}; the total number of records N is 15;
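Using the structures sketched in the initialization step, this example environment could be instantiated as follows (illustrative only; the observed fields start at zero and are filled in by the observation experiments).

```python
DataResources = [ResourceNode(f"center{i}", 204800, 2048) for i in (1, 2, 3)]   # M = 3
TrainDataSet = [
    TrainRecord(batchsize=b, res=r, time_per_round=0, rounds=0, datasize_gb=0, acc=0)
    for b in (16, 32, 128, 256, 512)
    for r in (1, 2, 3)
]                                                                               # N = 15
```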
(2) performance model modeling
2.1) Run the observation experiments, collect the data, fill in the data items of the TrainDataSet and Tasks sets, and construct the data matrix X_v in the same form as in step 2.1) above (the concrete numerical matrix is given as a figure in the original publication).
2.2) Construct the mathematical relation f(batchsize, res, datasize) = t_round. Let Λ_v be the data matrix formed by selecting the first, second, third and fifth columns of data matrix X_v, and let A_v be the corresponding vector matrix, i.e. A_v = (x_batchk, x_resourcek, x_dsizek, y_timek). According to the mathematical functional relation, solve the order-preserving regression between (x_batchk, x_resourcek, x_dsizek) and y_timek in matrix A_v.
The weights w_batch, w_resource and w_dsize corresponding to x_batchk, x_resourcek and x_dsizek are thus obtained, and the finally constructed functional relation is x_batchk*w_batch + x_resourcek*w_resource + x_dsizek*w_dsize = t_round.
2.3) Construct the mathematical relation f(batchsize, round) = acc. To fit this relation, the first, fourth and sixth columns of data matrix X_v are selected; batchsize in the formula corresponds to x_batchk in the data matrix, round corresponds to x_roundk, and acc corresponds to y_acck. The relation is constructed in the same way as in step 2.2). The order-preserving regression weights corresponding to x_batchk and x_roundk are w_batch and w_round, and the finally constructed functional relation is x_batchk*w_batch + x_roundk*w_round = acc.
(3) Optimized resource configuration search
3.1) For the data set to be trained, substitute different batch size parameters, resource supply amounts and iteration round counts into the performance model established in step 2) to form the lists shown in Table 1 and Table 2.
TABLE 1 Batch size parameter, resource supply amount, training data set size, and time consumed per training round

Batch size    Resource supply    Training data set size    Time per training round
16            1                  1GB                       17727s
16            2                  1GB                       5668s
16            3                  1GB                       3521s
512           3                  1GB                       338s
TABLE 2 Batch size parameter, number of iteration rounds and training accuracy

Batch size    Iteration rounds    Training accuracy
16            4                   0.9583
32            15                  0.967
64            38                  0.9726
512           80                  0.9588
3.2) Combining the two result tables, search with the objectives of minimum running time and optimal model training accuracy, where the running time is the number of iteration rounds multiplied by the time consumed per training round. The search proceeds in two steps:
3.2.1) in list table2, find the batch size value and number of iteration rounds corresponding to the best reachable model accuracy;
3.2.2) using that batch size value, look up in list table1 the corresponding per-round iteration time and worker-node resource configuration;
3.2.3) if several records are found, multiply each per-round iteration time by the number of iteration rounds found in the previous step to obtain the total training time; if the total time meets the limited-time requirement, keep the record in a list to be confirmed for subsequent use, otherwise discard it directly;
3.2.4) finally, determine the final parameter configuration combination from the list to be confirmed: a batch size of 64 is selected, 2 nodes are selected for resource supply, and the number of iteration rounds is set to 38.
(4) Performing model training
4.1) form a parameter configuration list according to the solution space search result of step 3);
4.2) set the model training configuration according to the parameter configuration list.
(5) End: wait for the model training job to complete.
The inventors carried out tests on the effectiveness of the method described above. Two evaluation indexes are used in the tests:
(1) Average number of completed training jobs: the ratio of the number of completed model training jobs to the total number of model training jobs within a set period of time, computed as
JobRatio = n / N
where n is the number of model training jobs completed within the specified time and N is the total number of model training jobs scheduled within that period.
(2) Average resource usage: the ratio of the sum of the resource usage of each completed model training job to the number of completed model training jobs within a set period of time, computed as
AvgResource = (r_1 + r_2 + … + r_n) / n
where r_i is the resource usage of the i-th completed model training job and n is the number of model training jobs completed within the set period of time.
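The two evaluation indexes reduce to simple ratios; a short sketch with assumed helper names:

```python
def avg_completed_jobs(n_completed: int, n_total: int) -> float:
    # ratio of model training jobs finished in the period to the total number scheduled
    return n_completed / n_total

def avg_resource_usage(resources_per_job: list) -> float:
    # sum of the resource usage r_i of each completed job divided by the number of completed jobs
    return sum(resources_per_job) / len(resources_per_job)
```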
The test platform consists of 4 physical machines forming 4 worker nodes; the configuration of the experimental environment is shown in Table 3. Of these, 1 node serves as the parameter server and 3 nodes serve as computing nodes.
TABLE 3 test Environment configuration
The data set used in the tests is the SVHN data set, a real-world image data set collected and organized from house-number images in Google Street View for developing object detection algorithms; it contains more than 600,000 labelled digit images. The tests mainly compare model training jobs run with the parameter configuration computed by the performance model of the invention against model training jobs run with manually set parameter configurations and no performance model. The results are shown in FIG. 2.
As can be seen from the figures, for different convolutional neural networks, the results of model training with the parameters obtained from the performance model are improved over those obtained with manual settings. For example, when the performance model is not used and the training parameters are only set manually, the best prediction accuracy achieved is 0.967; after the corresponding parameter configuration is computed with the performance model and model training is executed, a model prediction accuracy of 0.973 is obtained, higher than the result of manually set parameters. For the performance-model objectives of minimizing running time and maximizing model training accuracy, after training with the parameters selected by the performance model, the number of iteration rounds increases and the prediction accuracy increases by 0.4%.
The above description is only one embodiment of the present invention, and the protection scope of the present invention is not limited thereto; all technical solutions and modifications thereof that do not depart from the spirit and scope of the present invention shall be covered by the claims of the present invention.

Claims (1)

1. A deep learning training resource allocation prediction method in a cloud environment, characterized by comprising the following four steps: initialization, performance model construction, heuristic search, and execution of model training;
the method is realized on a computer according to the following steps:
(1) initialization
1.1) combining the model training execution environment, construct the cloud environment resource node set DataResources = {dr_i | 1 ≤ i ≤ M}, where any resource node dr_i ∈ DataResources is represented by a triple dr_i = (center_i, capacity_i, memory_i): center_i is the identifier of the resource node, capacity_i is the storage capacity of resource node center_i, memory_i is the memory capacity of resource node center_i, and M is the total number of resource nodes;
1.2) combine the configuration information collected during model training execution into a data set TrainDataSet = {td_j | 1 ≤ j ≤ N}, where any record td_j ∈ TrainDataSet is represented by a six-tuple td_j = (batchsize_j, res_j, time_j, round_j, datasize_j, acc_j): batchsize_j is the batch size parameter used by record td_j, res_j is the number of resource nodes used, time_j is the time consumed by each training round, round_j is the number of iteration rounds used, datasize_j is the size of the corresponding training data set, acc_j is the training accuracy finally obtained by the training task, and N is the total number of records;
(2) performance model modeling
2.1) from the training data information of model training runs that have already been executed, construct a data matrix X_v of the form

X_v = [ x_batch1  x_resource1  x_dsize1  x_round1  y_time1  y_acc1 ]
      [ x_batch2  x_resource2  x_dsize2  x_round2  y_time2  y_acc2 ]
      [   ...        ...          ...       ...       ...      ... ]
      [ x_batchn  x_resourcen  x_dsizen  x_roundn  y_timen  y_accn ]

each row of matrix X_v represents one group of sample data, and each group comprises the batch size parameter, the number of resource nodes, the training data set size, the total number of training rounds, the time consumed by each training round and the model accuracy obtained by training, these being the values used in the training of one particular model; the batch size parameter is denoted x_batchk, the number of resource nodes x_resourcek, the training data set size x_dsizek, the total number of training rounds x_roundk, the time per training round y_timek and the training accuracy y_acck, where k = 1, 2, …, n and n is the total number of records used to build the training data information;
2.2) construct the relation between the time consumed by each round and the batch size parameter, the resource supply and the training data size; the mathematical functional relation is
f(batchsize, res, datasize) = t_round
where batchsize is the batch size parameter variable, res is the resource supply amount given for the current model training, datasize is the training data size, and t_round is the time consumed by each round; to fit this functional relation, the first, second, third and fifth columns of data matrix X_v are selected; the correspondence between the variables in the relation and the columns of the data matrix is: batchsize in the formula corresponds to x_batchk in the data matrix, datasize corresponds to x_dsizek, res corresponds to x_resourcek, and t_round corresponds to y_timek;
after the data items are established, the order-preserving regression can be computed; since the required regression is a multidimensional order-preserving regression, a dimension-reduction approach is adopted, i.e. the order-preserving regression between one single-dimensional independent variable and the dependent variable is considered at a time; proceeding in this way, the order-preserving regression for the independent variables of every dimension is completed; finally, the per-dimension order-preserving regressions are combined to complete the solution of the multidimensional order-preserving regression;
let Λ_v be the data matrix formed by selecting the first, second, third and fifth columns of data matrix X_v; let A_v be the vector matrix corresponding to Λ_v, i.e. A_v = (x_batchk, x_resourcek, x_dsizek, y_timek); according to the mathematical functional relation, the order-preserving regression between (x_batchk, x_resourcek, x_dsizek) and y_timek in matrix A_v is solved as follows;
select the vectors x_batchk and y_timek in matrix A_v, where the elements of x_batchk are (x_batch1, x_batch2, …, x_batchn)^T taken from matrix X_v; let the weight sequence of the order-preserving regression functional relation to be obtained be w_batch = (w_u)^T, where u takes the same values as k;
traverse x_batchk; if there exists an index l such that x_batchl > x_batch(l+1), let B = {l, l+1} and pool the pair:
x_B = (w_l*x_batchl + w_(l+1)*x_batch(l+1)) / (w_l + w_(l+1))
w_B = w_l + w_(l+1)
at this point, denote the value sequence and weight sequence obtained by this pass of the traversal in the order-preserving regression process as
(x_batch1, …, x_batch(l-1), x_B, x_batch(l+2), …, x_batchn)
(w_1, …, w_(l-1), w_B, w_(l+2), …, w_n);
if, after some pass, the resulting value sequence is order-preserving, i.e. x_batch1 ≤ … ≤ x_batch(l-1) ≤ x_B ≤ x_batch(l+1) ≤ … ≤ x_batchn, the calculation stops; otherwise the traversal continues until an order-preserving regression sequence and its corresponding weights are obtained, and the resulting weight sequence is assigned to w_batch;
order-preserving regression is then carried out in the same way for x_resourcek and x_dsizek, forming the corresponding weight sequences; let the weights corresponding to x_batchk, x_resourcek and x_dsizek be w_batch, w_resource and w_dsize; the finally constructed functional relation is
x_batchk*w_batch + x_resourcek*w_resource + x_dsizek*w_dsize = t_round;
2.3) construct the relation among the batch size parameter, the number of iteration rounds and the training accuracy; the mathematical functional relation is
f(batchsize, round) = acc
where batchsize is the batch size parameter variable, round is the total number of model training rounds, and acc is the model training accuracy; to fit this relation, the first, fourth and sixth columns of data matrix X_v are selected; the correspondence between the variables in the relation and the columns of the data matrix is: batchsize in the formula corresponds to x_batchk in the data matrix, round corresponds to x_roundk, and acc corresponds to y_acck;
the functional relation is constructed in the same way as in step 2.2); let the order-preserving regression weights corresponding to x_batchk and x_roundk be w_batch and w_round; the finally constructed functional relation is
x_batchk*w_batch + x_roundk*w_round = acc;
(3) Optimized parameter configuration search
3.1) the problem of selecting an optimized parameter configuration and resource allocation scheme can be formalized as follows:
min(r_vm)
min(T_limit)
max(acc_predict)
subject to:
C_batch % r_vm = 0
round * t_round ≤ T_limit
in the above relations, r_vm is the number of resource nodes, T_limit is the total running time allowed for the model training job, acc_predict is the final training accuracy, C_batch is the number of groups into which the data are divided, and t_round is the running time consumed by one round of the model training job; min(r_vm) means minimizing resource usage; min(T_limit) means minimizing training time; max(acc_predict) means maximizing model training accuracy; the constraint C_batch % r_vm = 0 means that, when the data set is divided according to the batch size value, the number of data groups should be an integral multiple of the number of worker nodes; and the constraint round * t_round ≤ T_limit ensures that the total running time over all rounds of the model training job does not exceed the total time allowed for the job;
3.2) through the calculation of step 2), form a list of the time consumed by each training round under different combinations of batch size and resource configuration, and a list of the training accuracy reachable under different batch sizes and iteration rounds, recorded as table1 and table2 respectively;
3.3) combining the two result tables, search with the objectives of minimum running time and optimal model training accuracy, where the running time is the time consumed per training round multiplied by the number of iteration rounds;
3.4) the search proceeds in two steps:
3.4.1) in list table2, find the batch size value and number of iteration rounds corresponding to the best reachable model accuracy;
3.4.2) using that batch size value, look up in list table1 the corresponding per-round iteration time and worker-node resource configuration;
3.4.3) if several records are found, multiply each per-round iteration time by the number of iteration rounds found in the previous step to obtain the total training time, and keep the record with the smallest total training time in a list to be confirmed for subsequent use; records that do not meet the requirement are discarded directly;
3.5) finally, the final parameter configuration combination is determined from the list to be confirmed; the list may contain several records or none:
3.5.1) if there is no record, return to table2 and search for a suboptimal accuracy value so as to continue the search;
3.5.2) if there are several records, take the constraint of minimum total training time and find the configuration that uses the fewest worker-node resources among them, and execute model training according to that configuration;
(4) performing model training
4.1) forming a parameter configuration list according to the solution space search result in the step (3);
4.2) set the model training configuration according to the parameter configuration list;
(5) end: wait for the model training job to complete.
CN202010313690.1A 2020-04-20 2020-04-20 Deep learning training resource allocation prediction method in cloud environment Pending CN111444026A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010313690.1A CN111444026A (en) 2020-04-20 2020-04-20 Deep learning training resource allocation prediction method in cloud environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010313690.1A CN111444026A (en) 2020-04-20 2020-04-20 Deep learning training resource allocation prediction method in cloud environment

Publications (1)

Publication Number Publication Date
CN111444026A true CN111444026A (en) 2020-07-24

Family

ID=71656112

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010313690.1A Pending CN111444026A (en) 2020-04-20 2020-04-20 Deep learning training resource allocation prediction method in cloud environment

Country Status (1)

Country Link
CN (1) CN111444026A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113052334A (en) * 2021-04-14 2021-06-29 中南大学 Method and system for realizing federated learning, terminal equipment and readable storage medium
CN113157413A (en) * 2021-04-16 2021-07-23 上海交通大学 Deep learning task resource optimization configuration method and system based on service quality requirement
CN113205128A (en) * 2021-04-28 2021-08-03 华东师范大学 Distributed deep learning performance guarantee method based on serverless computing
CN114816711A (en) * 2022-05-13 2022-07-29 湖南长银五八消费金融股份有限公司 Batch task processing method and device, computer equipment and storage medium
CN116954873A (en) * 2023-09-21 2023-10-27 浪潮电子信息产业股份有限公司 Heterogeneous computing system, and method, device, equipment and medium for selecting power nodes of heterogeneous computing system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107203809A (en) * 2017-04-20 2017-09-26 华中科技大学 A kind of deep learning automation parameter adjustment method and system based on Keras
CN109659933A (en) * 2018-12-20 2019-04-19 浙江工业大学 A kind of prediction technique of power quality containing distributed power distribution network based on deep learning model
US20190164268A1 (en) * 2017-11-27 2019-05-30 Nvidia Corporation Deep-learning method for separating reflection and transmission images visible at a semi-reflective surface in a computer image of a real-world scene
CN110795228A (en) * 2018-08-03 2020-02-14 伊姆西Ip控股有限责任公司 Adaptive batch dataset partitioning for distributed deep learning using accelerator mixture sets
CN110929884A (en) * 2019-11-22 2020-03-27 北京大学 Classification method and device for distributed machine learning optimization based on column division
CN110930356A (en) * 2019-10-12 2020-03-27 上海交通大学 Industrial two-dimensional code reference-free quality evaluation system and method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107203809A (en) * 2017-04-20 2017-09-26 华中科技大学 A kind of deep learning automation parameter adjustment method and system based on Keras
US20190164268A1 (en) * 2017-11-27 2019-05-30 Nvidia Corporation Deep-learning method for separating reflection and transmission images visible at a semi-reflective surface in a computer image of a real-world scene
CN110795228A (en) * 2018-08-03 2020-02-14 伊姆西Ip控股有限责任公司 Adaptive batch dataset partitioning for distributed deep learning using accelerator mixture sets
CN109659933A (en) * 2018-12-20 2019-04-19 浙江工业大学 A kind of prediction technique of power quality containing distributed power distribution network based on deep learning model
CN110930356A (en) * 2019-10-12 2020-03-27 上海交通大学 Industrial two-dimensional code reference-free quality evaluation system and method
CN110929884A (en) * 2019-11-22 2020-03-27 北京大学 Classification method and device for distributed machine learning optimization based on column division

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113052334A (en) * 2021-04-14 2021-06-29 中南大学 Method and system for realizing federated learning, terminal equipment and readable storage medium
CN113052334B (en) * 2021-04-14 2023-09-29 中南大学 Federal learning realization method, system, terminal equipment and readable storage medium
CN113157413A (en) * 2021-04-16 2021-07-23 上海交通大学 Deep learning task resource optimization configuration method and system based on service quality requirement
CN113157413B (en) * 2021-04-16 2022-04-26 上海交通大学 Deep learning task resource optimization configuration method and system based on service quality requirement
CN113205128A (en) * 2021-04-28 2021-08-03 华东师范大学 Distributed deep learning performance guarantee method based on serverless computing
CN114816711A (en) * 2022-05-13 2022-07-29 湖南长银五八消费金融股份有限公司 Batch task processing method and device, computer equipment and storage medium
CN116954873A (en) * 2023-09-21 2023-10-27 浪潮电子信息产业股份有限公司 Heterogeneous computing system, and method, device, equipment and medium for selecting power nodes of heterogeneous computing system
CN116954873B (en) * 2023-09-21 2024-01-23 浪潮电子信息产业股份有限公司 Heterogeneous computing system, and method, device, equipment and medium for selecting power nodes of heterogeneous computing system

Similar Documents

Publication Publication Date Title
CN111444026A (en) Deep learning training resource allocation prediction method in cloud environment
Alipourfard et al. {CherryPick}: Adaptively unearthing the best cloud configurations for big data analytics
CN105446979B (en) Data digging method and node
US8799916B2 (en) Determining an allocation of resources for a job
US20170228652A1 (en) Method and apparatus for evaluating predictive model
CN109634744B (en) Accurate matching method, equipment and storage medium based on cloud platform resource allocation
US20130318538A1 (en) Estimating a performance characteristic of a job using a performance model
Chen et al. $ d $ d-Simplexed: Adaptive Delaunay Triangulation for Performance Modeling and Prediction on Big Data Analytics
CN112433853B (en) Heterogeneous perception data partitioning method for supercomputer data parallel application
JP3792879B2 (en) Parallel execution system
CN108427756A (en) Personalized query word completion recommendation method and device based on same-class user model
Singh et al. A machine learning approach for modular workflow performance prediction
CN112907026A (en) Comprehensive evaluation method based on editable mesh index system
CN112000460A (en) Service capacity expansion method based on improved Bayesian algorithm and related equipment
Sukhija et al. Portfolio-based selection of robust dynamic loop scheduling algorithms using machine learning
CN112463532B (en) Method for constructing SNN workload automatic mapper and automatic mapper
CN112632615B (en) Scientific workflow data layout method based on hybrid cloud environment
CN110175172B (en) Extremely-large binary cluster parallel enumeration method based on sparse bipartite graph
CN109711555B (en) Method and system for predicting single-round iteration time of deep learning model
US11579680B2 (en) Methods and devices for power management based on synthetic machine learning benchmarks
CN116974249A (en) Flexible job shop scheduling method and flexible job shop scheduling device
Cai et al. A recommendation-based parameter tuning approach for Hadoop
CN115859016A (en) Processor-based operation method and device, computer equipment and storage medium
CN116033026A (en) Resource scheduling method
CN115601103A (en) Article information display method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination