CN111444026A - Deep learning training resource allocation prediction method in cloud environment - Google Patents

Deep learning training resource allocation prediction method in cloud environment

Info

Publication number
CN111444026A
CN111444026A (application CN202010313690.1A)
Authority
CN
China
Prior art keywords
training
round
data
model
resource
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010313690.1A
Other languages
Chinese (zh)
Inventor
梁毅
刘明洁
丁毅
丁振兴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202010313690.1A priority Critical patent/CN111444026A/en
Publication of CN111444026A publication Critical patent/CN111444026A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a resource parameter configuration method for deep learning training in a cloud environment, where the parameters comprise the batch size parameter, the resource supply amount and the number of iteration rounds. The method comprises the following steps: collecting performance indexes for each round of the model training process; modeling the mathematical relation among the batch size parameter, the resource supply amount, the number of iteration rounds, the training data scale, the training time consumption and the training accuracy with an order-preserving (isotonic) regression method; constructing constraint relations according to the training cost and accuracy requirements of the deep learning model in the cloud environment; searching for a suitable parameter configuration with an optimized search algorithm; and finally training the model with that parameter configuration. Experiments show that the method can effectively reduce the time cost of deep learning model training while reaching the preset training accuracy, thereby meeting the training requirements.

Description

Deep learning training resource allocation prediction method in cloud environment
Technical Field
The invention belongs to the technical field of cloud platform data processing, and particularly relates to a method for cooperatively setting deep learning model training parameters and resource supply on a cloud platform.
Background
Deep learning is a branch of machine learning. Taking the structure of the human brain as a reference, it builds multilayer neural networks and trains them repeatedly on data, using nonlinear operations such as feature combination and feature discretization to extract data features at higher levels of abstraction. Distributed deep learning integrates a series of algorithms and systems designed to improve model training performance and model prediction accuracy. Cloud computing environments are currently the main deployment environment for deep learning computing platforms. Distributed training in a cloud environment can cope with the growing amount of data by increasing the number of computing nodes, avoiding the performance bottleneck caused by gathering all data on a single node for centralized processing. The deep learning neural networks trained in this way retain the hierarchical structure of traditional neural networks; the difference is that they have far more network nodes and introduce more complex and efficient algorithms.
In general, training a deep learning neural network is sensitive to anomalies in computing resources, and the influence of the resource configuration (the training resources) on training efficiency is pronounced. Meanwhile, in a cloud computing environment, model training usually divides the training data into many small subsets, each containing a number of samples (the batch size parameter), and the training job performs multiple rounds of iterative computation in parallel across multiple worker nodes until the model parameters meet the requirements. Experiments show that changes in the batch size parameter and the resource configuration affect both the time consumed by the training process and the prediction accuracy of the resulting model.
Deep learning computing platforms expose settings for resource usage and the batch size parameter to improve the execution efficiency of model training. When running model training, users can set these manually based on historical experience, or find a suitable resource configuration by trial and error. However, trial and error is time-consuming and laborious. Determining the parameter configuration for a given job also requires knowing the characteristics of that job, which is difficult to estimate accurately without sufficient experience. As a training job keeps running, the drawbacks of a static setting become apparent: the resulting uncertainty not only inconveniences the user but also hinders the efficient utilization of computing resources, and improper parameter settings may prevent the computation from completing as expected. Therefore, during deep learning model training, correct parameter setting means finding the best balance between computational efficiency and resource capacity.
Disclosure of Invention
Aiming at deep learning model training jobs in a cloud computing environment, and based on an analysis of the execution process and operating mechanism of model training, the invention provides a performance prediction method for the cooperative allocation of parameter configuration and resource supply. Under constraints on prediction accuracy and running time, and taking into account the computing resource environment in which the job runs, the method predicts the resource usage, batch size value and iteration count that should be set for a given training job, improving model training efficiency and saving the cost of computing resources. The method first analyses the relations among the batch size parameter, the resource supply, the number of iteration rounds, the training data set size, the training time consumption and the training accuracy; then builds a performance model of these factors with a multidimensional order-preserving regression method; and then searches for parameter combinations with a heuristic algorithm under the constraints of minimizing time cost and maximizing training accuracy to obtain a parameter configuration scheme. In this way, the computing resources in the cloud environment can be used effectively within the limited time of a model training job, training is executed efficiently, a model accuracy that meets the requirements is obtained, and the execution efficiency of model training is improved.
The method of the invention comprises four steps: initialization, performance model modeling, heuristic search, and execution of model training.
The method is realized on a computer according to the following steps:
(1) initialization
1.1) Combining the model training execution environment, construct the cloud environment resource node set DataResources = {dr_i | 1 ≤ i ≤ M}, where any resource node dr_i ∈ DataResources is represented by a triple dr_i = (center_i, capacity_i, memory_i): center_i is the identifier of the resource node, capacity_i is the storage capacity of resource node center_i, memory_i is the memory capacity of resource node center_i, and M is the total number of resource nodes;
1.2) Combine the configuration information collected during model training execution into a data set TrainDataSet = {td_j | 1 ≤ j ≤ N}, where any record td_j ∈ TrainDataSet is represented by a six-tuple td_j = (batchsize_j, res_j, time_j, round_j, datasize_j, acc_j): batchsize_j is the batch size parameter used by record td_j, res_j is the number of resource nodes used, time_j is the time consumed by each training round, round_j is the number of iteration rounds used, datasize_j is the size of the corresponding training data set, acc_j is the training accuracy finally obtained by the training task, and N is the total number of records.
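For illustration, the two data structures above can be written out as follows. This is a minimal sketch; the class and field names are assumptions rather than part of the patent.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ResourceNode:           # dr_i = (center_i, capacity_i, memory_i)
    center: str               # resource node identifier
    capacity_mb: int          # storage capacity of the node
    memory_mb: int            # memory capacity of the node

@dataclass
class TrainRecord:            # td_j = (batchsize_j, res_j, time_j, round_j, datasize_j, acc_j)
    batchsize: int            # batch size parameter used
    res: int                  # number of resource nodes used
    time_per_round: float     # time consumed by each training round
    rounds: int               # number of iteration rounds used
    datasize_gb: float        # size of the corresponding training data set
    acc: float                # training accuracy finally obtained

DataResources: List[ResourceNode] = []   # M resource nodes
TrainDataSet: List[TrainRecord] = []     # N configuration records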
(2) Performance model modeling
2.1) From the training data information of model training runs that have already been executed, construct a data matrix X_v of the form

X_v = [ x_batch1  x_resource1  x_dsize1  x_round1  y_time1  y_acc1 ]
      [ x_batch2  x_resource2  x_dsize2  x_round2  y_time2  y_acc2 ]
      [   ...        ...          ...       ...       ...      ... ]
      [ x_batchn  x_resourcen  x_dsizen  x_roundn  y_timen  y_accn ]

Each row of matrix X_v represents one group of sample data, and each group comprises the batch size parameter, the number of resource nodes, the training data set size, the total number of training rounds, the time consumed by each training round and the model accuracy obtained by training. The batch size parameter is denoted x_batchk, the number of resource nodes x_resourcek, the training data set size x_dsizek, the total number of training rounds x_roundk, the time per training round y_timek and the training accuracy y_acck, where k = 1, 2, …, n and n is the total number of records used to build the training data information.
2.2) Construct the relation between the time consumed by each round and the batch size parameter, the resource supply and the training data size; the mathematical functional relation is
f(batchsize, res, datasize) = t_round
In the above formula, batchsize is the batch size parameter variable, res is the resource supply amount given for the current model training, datasize is the training data size, and t_round is the time consumed by each round. To fit this functional relation, the first, second, third and fifth columns of data matrix X_v are selected. The correspondence between the variables in the relation and the columns of the data matrix is: batchsize in the formula corresponds to x_batchk in the data matrix, datasize corresponds to x_dsizek, res corresponds to x_resourcek, and t_round corresponds to y_timek.
After the data items are established, the order-preserving regression can be computed. Since the required regression is a multidimensional order-preserving regression, a dimension-reduction approach is adopted: the order-preserving regression between one single-dimensional independent variable and the dependent variable is considered at a time. Proceeding in this way, the order-preserving regression for the independent variables of every dimension is completed. Finally, the per-dimension order-preserving regressions are combined to complete the solution of the multidimensional order-preserving regression.
Let Λ_v be the data matrix formed by selecting the first, second, third and fifth columns of data matrix X_v, and let A_v be the vector matrix corresponding to Λ_v, i.e. A_v = (x_batchk, x_resourcek, x_dsizek, y_timek). According to the mathematical functional relation, the order-preserving regression between (x_batchk, x_resourcek, x_dsizek) and y_timek in matrix A_v is solved as follows.
Select the vectors x_batch and y_time in matrix A_v, where the elements of x_batch are (x_batch1, x_batch2, …, x_batchn)^T taken from matrix X_v. Let the weight sequence of the order-preserving regression functional relation to be obtained be w_batch = (w_u)^T, where u takes the same values as k.
Traverse x_batchk. If there exists an index l such that x_batchl > x_batch(l+1), let B = {l, l+1} and pool the pair:
x_B = (w_l*x_batchl + w_(l+1)*x_batch(l+1)) / (w_l + w_(l+1))
w_B = w_l + w_(l+1)
At this point, denote the value sequence and weight sequence obtained by this pass of the traversal in the order-preserving regression process as
(x_batch1, …, x_batch(l-1), x_B, x_batch(l+2), …, x_batchn)
(w_1, …, w_(l-1), w_B, w_(l+2), …, w_n).
If, after some pass, the resulting value sequence is order-preserving, i.e. x_batch1 ≤ … ≤ x_batch(l-1) ≤ x_B ≤ x_batch(l+1) ≤ … ≤ x_batchn, the calculation stops. Otherwise the traversal continues until an order-preserving regression sequence and its corresponding weights are obtained, and the resulting weight sequence is assigned to w_batch.
Select the vectors x_resource and y_time in matrix A_v, where the elements of x_resource are (x_resource1, x_resource2, …, x_resourcen)^T taken from matrix X_v. Let the weight sequence of the order-preserving regression functional relation to be obtained be w_resource = (w_s)^T, where s takes the same values as k.
Traverse x_resourcek. If there exists an index l such that x_resourcel > x_resource(l+1), let B = {l, l+1} and pool the pair:
x_B = (w_l*x_resourcel + w_(l+1)*x_resource(l+1)) / (w_l + w_(l+1))
w_B = w_l + w_(l+1)
At this point, denote the value sequence and weight sequence obtained by this pass of the traversal as
(x_resource1, …, x_resource(l-1), x_B, x_resource(l+2), …, x_resourcen)
(w_1, …, w_(l-1), w_B, w_(l+2), …, w_n).
If, after some pass, the resulting value sequence is order-preserving, i.e. x_resource1 ≤ … ≤ x_resource(l-1) ≤ x_B ≤ x_resource(l+1) ≤ … ≤ x_resourcen, the calculation stops. Otherwise the traversal continues until an order-preserving regression sequence and its corresponding weights are obtained, and the resulting weight sequence is assigned to w_resource.
Select the vectors x_dsize and y_time in matrix A_v, where the elements of x_dsize are (x_dsize1, x_dsize2, …, x_dsizen)^T taken from matrix X_v. Let the weight sequence of the order-preserving regression functional relation to be obtained be w_dsize = (w_h)^T, where h takes the same values as k.
Traverse x_dsizek. If there exists an index l such that x_dsizel > x_dsize(l+1), let B = {l, l+1} and pool the pair:
x_B = (w_l*x_dsizel + w_(l+1)*x_dsize(l+1)) / (w_l + w_(l+1))
w_B = w_l + w_(l+1)
At this point, denote the value sequence and weight sequence obtained by this pass of the traversal as
(x_dsize1, …, x_dsize(l-1), x_B, x_dsize(l+2), …, x_dsizen)
(w_1, …, w_(l-1), w_B, w_(l+2), …, w_n).
If, after some pass, the resulting value sequence is order-preserving, i.e. x_dsize1 ≤ … ≤ x_dsize(l-1) ≤ x_B ≤ x_dsize(l+1) ≤ … ≤ x_dsizen, the calculation stops. Otherwise the traversal continues until an order-preserving regression sequence and its corresponding weights are obtained, and the resulting weight sequence is assigned to w_dsize.
For x_batchk, x_resourcek and x_dsizek, the weights w_batch, w_resource and w_dsize are thus obtained. The finally constructed functional relation is
x_batchk*w_batch + x_resourcek*w_resource + x_dsizek*w_dsize = t_round.
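As a reference point, the pool-adjacent-violators traversal described in step 2.2) can be sketched as below; the function name, the in-place pooling strategy and the default unit weights are illustrative assumptions rather than the patent's literal procedure.

```python
def pava(x, w=None):
    """Pool adjacent violators: repeatedly merge neighbouring entries x[l] > x[l+1]
    into their weighted mean until the sequence is non-decreasing.
    Returns the order-preserving sequence and the accumulated weights."""
    x = list(map(float, x))
    w = [1.0] * len(x) if w is None else list(map(float, w))
    i = 0
    while i < len(x) - 1:
        if x[i] > x[i + 1]:                       # violator found: B = {l, l+1}
            x_b = (w[i] * x[i] + w[i + 1] * x[i + 1]) / (w[i] + w[i + 1])
            w_b = w[i] + w[i + 1]
            x[i:i + 2] = [x_b]                    # replace the pair by the pooled value
            w[i:i + 2] = [w_b]
            i = max(i - 1, 0)                     # the pooled value may violate its left neighbour
        else:
            i += 1
    return x, w
```

Each of the three independent-variable dimensions (x_batch, x_resource, x_dsize) would be processed by such a traversal to obtain its weight sequence.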
2.3) Construct the relation among the batch size parameter, the number of iteration rounds and the training accuracy; the mathematical functional relation is
f(batchsize, round) = acc
In the above formula, batchsize is the batch size parameter variable, round is the total number of model training rounds, and acc is the model training accuracy. To fit this relation, the first, fourth and sixth columns of data matrix X_v are selected. The correspondence between the variables in the relation and the columns of the data matrix is: batchsize in the formula corresponds to x_batchk in the data matrix, round corresponds to x_roundk, and acc corresponds to y_acck.
The functional relation is constructed in the same way as in step 2.2). Let the order-preserving regression weights corresponding to x_batchk and x_roundk be w_batch and w_round; the finally constructed functional relation is x_batchk*w_batch + x_roundk*w_round = acc.
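A minimal sketch of how the two fitted relations could be evaluated once the weights are available; the helper names and the scalar-weight treatment are assumptions based directly on the formulas above.

```python
def predict_round_time(batchsize: float, res: float, datasize: float,
                       w_batch: float, w_resource: float, w_dsize: float) -> float:
    # t_round = x_batch*w_batch + x_resource*w_resource + x_dsize*w_dsize
    return batchsize * w_batch + res * w_resource + datasize * w_dsize

def predict_accuracy(batchsize: float, rounds: float,
                     w_batch_acc: float, w_round: float) -> float:
    # acc = x_batch*w_batch + x_round*w_round
    return batchsize * w_batch_acc + rounds * w_round
```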
(3) Optimized parameter configuration search
3.1) the problem of selecting an optimized parameter configuration and resource allocation scheme can be formalized as follows:
min(r_vm)
min(T_limit)
max(acc_predict)
subject to:
C_batch % r_vm = 0
round * t_round ≤ T_limit
In the above relations, r_vm is the number of resource nodes, T_limit is the total running time allowed for the model training job, acc_predict is the final training accuracy, C_batch is the number of groups into which the data are divided, and t_round is the running time consumed by one round of the model training job. min(r_vm) means minimizing resource usage; min(T_limit) means minimizing training time; max(acc_predict) means maximizing model training accuracy; the constraint C_batch % r_vm = 0 means that, when the data set is divided according to the batch size value, the number of data groups should be an integral multiple of the number of worker nodes; and the constraint round * t_round ≤ T_limit ensures that the total running time over all rounds of the model training job does not exceed the total time allowed for the job.
3.2) Through the calculation of step 2), form a list of the time consumed by each training round under different combinations of batch size and resource configuration, and a list of the training accuracy reachable under different batch sizes and iteration rounds, recorded as table1 and table2 respectively.
3.3) Combining the two result tables, search with the objectives of minimum running time and optimal model training accuracy, where the running time is the number of iteration rounds multiplied by the time consumed per training round.
3.4) The search proceeds in two steps:
3.4.1) in list table2, find the batch size value and number of iteration rounds corresponding to the best reachable model accuracy;
3.4.2) using that batch size value, look up in list table1 the corresponding per-round iteration time and worker-node resource configuration;
3.4.3) if several records are found, multiply each per-round iteration time by the number of iteration rounds found in the previous step to obtain the total training time, and keep the record with the smallest total training time in a list to be confirmed for subsequent use; records that do not meet the requirement are discarded directly.
3.5) Finally, the final parameter configuration combination is determined from the list to be confirmed; the list may contain several records or none:
3.5.1) if there is no record, return to table2 and search for a suboptimal accuracy value so as to continue the search;
3.5.2) if there are several records, take the constraint of minimum total training time and find the configuration that uses the fewest worker-node resources among them, and execute model training according to that configuration.
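A sketch of the two-step search over table1 and table2 described in steps 3.3)-3.5); the tuple layouts (mirroring Tables 1 and 2 of the embodiment) and the fallback-to-suboptimal loop are assumptions about one possible implementation.

```python
def search(table1, table2, t_limit):
    """table1 rows: (batchsize, res, datasize, t_round); table2 rows: (batchsize, rounds, acc)."""
    for batchsize, rounds, acc in sorted(table2, key=lambda r: -r[2]):   # 3.4.1 / 3.5.1: best accuracy first
        candidates = []
        for b, res, _datasize, t_round in table1:                         # 3.4.2: records for this batch size
            if b != batchsize:
                continue
            total = rounds * t_round                                      # 3.4.3: total training time
            if total <= t_limit:
                candidates.append((total, res, batchsize, rounds, acc))
        if candidates:
            # 3.5.2: among the fitting records, minimum total time, then fewest worker nodes
            candidates.sort(key=lambda c: (c[0], c[1]))
            return candidates[0]
    return None                                                           # no feasible configuration found
```

With the full versions of Tables 1 and 2, such a search would return a single (total time, nodes, batch size, rounds, accuracy) record; the embodiment below reports batch size 64, 2 nodes and 38 rounds as the selected configuration.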
(4) Performing model training
4.1) form a parameter configuration list according to the solution space search result of step 3);
4.2) set the model training configuration according to the parameter configuration list.
(5) End: wait for the model training job to complete.
Drawings
FIG. 1 is a general flow diagram of the process of the present invention.
FIG. 2(a) compares the achievable accuracy under coordinated adjustment of the batch size parameter and resource allocation with that under manual configuration.
FIG. 2(b) compares the number of iteration rounds under coordinated adjustment of the batch size parameter and resource allocation with that under manual configuration.
Detailed Description
The invention is described below with reference to the accompanying drawings and a specific embodiment.
The specific implementation method can be divided into the following steps:
(1) initialization
1.1) Construct the cloud environment resource node set DataResources = {dc1 = (center1, 204800MB, 2048MB), dc2 = (center2, 204800MB, 2048MB), dc3 = (center3, 204800MB, 2048MB)}; the total number of resource nodes M is 3;
1.2) Build the configuration information set TrainDataSet collected during model training execution: TrainDataSet = {td1 = (16,1,0,0,0,0), td2 = (16,2,0,0,0,0), td3 = (16,3,0,0,0,0), td4 = (32,1,0,0,0,0), td5 = (32,2,0,0,0,0), td6 = (32,3,0,0,0,0), td7 = (128,1,0,0,0,0), td8 = (128,2,0,0,0,0), td9 = (128,3,0,0,0,0), td10 = (256,1,0,0,0,0), td11 = (256,2,0,0,0,0), td12 = (256,3,0,0,0,0), td13 = (512,1,0,0,0,0), td14 = (512,2,0,0,0,0), td15 = (512,3,0,0,0,0)}; the total number of records N is 15;
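Using the structures sketched in the initialization step, this example environment could be instantiated as follows (illustrative only; the observed fields start at zero and are filled in by the observation experiments).

```python
DataResources = [ResourceNode(f"center{i}", 204800, 2048) for i in (1, 2, 3)]   # M = 3
TrainDataSet = [
    TrainRecord(batchsize=b, res=r, time_per_round=0, rounds=0, datasize_gb=0, acc=0)
    for b in (16, 32, 128, 256, 512)
    for r in (1, 2, 3)
]                                                                               # N = 15
```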
(2) performance model modeling
2.1) Run the observation experiments, collect the data, fill in the data items of the TrainDataSet and Tasks sets, and construct the data matrix X_v in the same form as in step 2.1) above (the concrete numerical matrix is given as a figure in the original publication).
2.2) Construct the mathematical relation f(batchsize, res, datasize) = t_round. Let Λ_v be the data matrix formed by selecting the first, second, third and fifth columns of data matrix X_v, and let A_v be the corresponding vector matrix, i.e. A_v = (x_batchk, x_resourcek, x_dsizek, y_timek). According to the mathematical functional relation, solve the order-preserving regression between (x_batchk, x_resourcek, x_dsizek) and y_timek in matrix A_v.
The weights w_batch, w_resource and w_dsize corresponding to x_batchk, x_resourcek and x_dsizek are thus obtained, and the finally constructed functional relation is x_batchk*w_batch + x_resourcek*w_resource + x_dsizek*w_dsize = t_round.
2.3) Construct the mathematical relation f(batchsize, round) = acc. To fit this relation, the first, fourth and sixth columns of data matrix X_v are selected; batchsize in the formula corresponds to x_batchk in the data matrix, round corresponds to x_roundk, and acc corresponds to y_acck. The relation is constructed in the same way as in step 2.2). The order-preserving regression weights corresponding to x_batchk and x_roundk are w_batch and w_round, and the finally constructed functional relation is x_batchk*w_batch + x_roundk*w_round = acc.
(3) Optimized resource configuration search
3.1) For the data set to be trained, substitute different batch size parameters, resource supply amounts and iteration round counts into the performance model established in step 2) to form the lists shown in Table 1 and Table 2.
TABLE 1 Batch size parameter, resource supply amount, training data set size, and time consumed per training round

Batch size    Resource supply    Training data set size    Time per training round
16            1                  1GB                       17727s
16            2                  1GB                       5668s
16            3                  1GB                       3521s
512           3                  1GB                       338s
TABLE 2 Batch size parameter, number of iteration rounds and training accuracy

Batch size    Iteration rounds    Training accuracy
16            4                   0.9583
32            15                  0.967
64            38                  0.9726
512           80                  0.9588
3.2) Combining the two result tables, search with the objectives of minimum running time and optimal model training accuracy, where the running time is the number of iteration rounds multiplied by the time consumed per training round. The search proceeds in two steps:
3.2.1) in list table2, find the batch size value and number of iteration rounds corresponding to the best reachable model accuracy;
3.2.2) using that batch size value, look up in list table1 the corresponding per-round iteration time and worker-node resource configuration;
3.2.3) if several records are found, multiply each per-round iteration time by the number of iteration rounds found in the previous step to obtain the total training time; if the total time meets the limited-time requirement, keep the record in a list to be confirmed for subsequent use, otherwise discard it directly;
3.2.4) finally, determine the final parameter configuration combination from the list to be confirmed: a batch size of 64 is selected, 2 nodes are selected for resource supply, and the number of iteration rounds is set to 38.
(4) Performing model training
4.1) form a parameter configuration list according to the solution space search result of step 3);
4.2) set the model training configuration according to the parameter configuration list.
(5) End: wait for the model training job to complete.
The inventors carried out tests on the effectiveness of the method described above. Two evaluation indexes are used in the tests:
(1) Average number of completed training jobs: the ratio of the number of completed model training jobs to the total number of model training jobs within a set period of time, computed as
JobRatio = n / N
where n is the number of model training jobs completed within the specified time and N is the total number of model training jobs scheduled within that period.
(2) Average resource usage: the ratio of the sum of the resource usage of each completed model training job to the number of completed model training jobs within a set period of time, computed as
AvgResource = (r_1 + r_2 + … + r_n) / n
where r_i is the resource usage of the i-th completed model training job and n is the number of model training jobs completed within the set period of time.
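The two evaluation indexes reduce to simple ratios; a short sketch with assumed helper names:

```python
def avg_completed_jobs(n_completed: int, n_total: int) -> float:
    # ratio of model training jobs finished in the period to the total number scheduled
    return n_completed / n_total

def avg_resource_usage(resources_per_job: list) -> float:
    # sum of the resource usage r_i of each completed job divided by the number of completed jobs
    return sum(resources_per_job) / len(resources_per_job)
```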
The test platform consists of 4 physical machines forming 4 worker nodes; the configuration of the experimental environment is shown in Table 3. Of these, 1 node serves as the parameter server and 3 nodes serve as computing nodes.
TABLE 3 test Environment configuration
The data set used in the tests is the SVHN data set, a real-world image data set collected and organized from house-number images in Google Street View for developing object detection algorithms; it contains more than 600,000 labelled digit images. The tests mainly compare model training jobs run with the parameter configuration computed by the performance model of the invention against model training jobs run with manually set parameter configurations and no performance model. The results are shown in FIG. 2.
As can be seen from the figures, for different convolutional neural networks, the results of model training with the parameters obtained from the performance model are improved over those obtained with manual settings. For example, when the performance model is not used and the training parameters are only set manually, the best prediction accuracy achieved is 0.967; after the corresponding parameter configuration is computed with the performance model and model training is executed, a model prediction accuracy of 0.973 is obtained, higher than the result of manually set parameters. For the performance-model objectives of minimizing running time and maximizing model training accuracy, after training with the parameters selected by the performance model, the number of iteration rounds increases and the prediction accuracy increases by 0.4%.
The above description is only one embodiment of the present invention, and the protection scope of the present invention is not limited thereto; all technical solutions and modifications thereof that do not depart from the spirit and scope of the present invention shall be covered by the claims of the present invention.

Claims (1)

1. A deep learning training resource allocation prediction method in a cloud environment, characterized by comprising the following four steps: initialization, performance model construction, heuristic search, and execution of model training;
the method is realized on a computer according to the following steps:
(1) initialization
1.1) combining the model training execution environment, construct the cloud environment resource node set DataResources = {dr_i | 1 ≤ i ≤ M}, where any resource node dr_i ∈ DataResources is represented by a triple dr_i = (center_i, capacity_i, memory_i): center_i is the identifier of the resource node, capacity_i is the storage capacity of resource node center_i, memory_i is the memory capacity of resource node center_i, and M is the total number of resource nodes;
1.2) combine the configuration information collected during model training execution into a data set TrainDataSet = {td_j | 1 ≤ j ≤ N}, where any record td_j ∈ TrainDataSet is represented by a six-tuple td_j = (batchsize_j, res_j, time_j, round_j, datasize_j, acc_j): batchsize_j is the batch size parameter used by record td_j, res_j is the number of resource nodes used, time_j is the time consumed by each training round, round_j is the number of iteration rounds used, datasize_j is the size of the corresponding training data set, acc_j is the training accuracy finally obtained by the training task, and N is the total number of records;
(2) performance model modeling
2.1) from the training data information of model training runs that have already been executed, construct a data matrix X_v of the form

X_v = [ x_batch1  x_resource1  x_dsize1  x_round1  y_time1  y_acc1 ]
      [ x_batch2  x_resource2  x_dsize2  x_round2  y_time2  y_acc2 ]
      [   ...        ...          ...       ...       ...      ... ]
      [ x_batchn  x_resourcen  x_dsizen  x_roundn  y_timen  y_accn ]

each row of matrix X_v represents one group of sample data, and each group comprises the batch size parameter, the number of resource nodes, the training data set size, the total number of training rounds, the time consumed by each training round and the model accuracy obtained by training, these being the values used in the training of one particular model; the batch size parameter is denoted x_batchk, the number of resource nodes x_resourcek, the training data set size x_dsizek, the total number of training rounds x_roundk, the time per training round y_timek and the training accuracy y_acck, where k = 1, 2, …, n and n is the total number of records used to build the training data information;
2.2) construct the relation between the time consumed by each round and the batch size parameter, the resource supply and the training data size; the mathematical functional relation is
f(batchsize, res, datasize) = t_round
where batchsize is the batch size parameter variable, res is the resource supply amount given for the current model training, datasize is the training data size, and t_round is the time consumed by each round; to fit this functional relation, the first, second, third and fifth columns of data matrix X_v are selected; the correspondence between the variables in the relation and the columns of the data matrix is: batchsize in the formula corresponds to x_batchk in the data matrix, datasize corresponds to x_dsizek, res corresponds to x_resourcek, and t_round corresponds to y_timek;
after the data items are established, the order-preserving regression can be computed; since the required regression is a multidimensional order-preserving regression, a dimension-reduction approach is adopted, i.e. the order-preserving regression between one single-dimensional independent variable and the dependent variable is considered at a time; proceeding in this way, the order-preserving regression for the independent variables of every dimension is completed; finally, the per-dimension order-preserving regressions are combined to complete the solution of the multidimensional order-preserving regression;
let Λ_v be the data matrix formed by selecting the first, second, third and fifth columns of data matrix X_v; let A_v be the vector matrix corresponding to Λ_v, i.e. A_v = (x_batchk, x_resourcek, x_dsizek, y_timek); according to the mathematical functional relation, the order-preserving regression between (x_batchk, x_resourcek, x_dsizek) and y_timek in matrix A_v is solved as follows;
select the vectors x_batchk and y_timek in matrix A_v, where the elements of x_batchk are (x_batch1, x_batch2, …, x_batchn)^T taken from matrix X_v; let the weight sequence of the order-preserving regression functional relation to be obtained be w_batch = (w_u)^T, where u takes the same values as k;
traverse x_batchk; if there exists an index l such that x_batchl > x_batch(l+1), let B = {l, l+1} and pool the pair:
x_B = (w_l*x_batchl + w_(l+1)*x_batch(l+1)) / (w_l + w_(l+1))
w_B = w_l + w_(l+1)
at this point, denote the value sequence and weight sequence obtained by this pass of the traversal in the order-preserving regression process as
(x_batch1, …, x_batch(l-1), x_B, x_batch(l+2), …, x_batchn)
(w_1, …, w_(l-1), w_B, w_(l+2), …, w_n);
if, after some pass, the resulting value sequence is order-preserving, i.e. x_batch1 ≤ … ≤ x_batch(l-1) ≤ x_B ≤ x_batch(l+1) ≤ … ≤ x_batchn, the calculation stops; otherwise the traversal continues until an order-preserving regression sequence and its corresponding weights are obtained, and the resulting weight sequence is assigned to w_batch;
order-preserving regression is then carried out in the same way for x_resourcek and x_dsizek, forming the corresponding weight sequences; let the weights corresponding to x_batchk, x_resourcek and x_dsizek be w_batch, w_resource and w_dsize; the finally constructed functional relation is
x_batchk*w_batch + x_resourcek*w_resource + x_dsizek*w_dsize = t_round;
2.3) construct the relation among the batch size parameter, the number of iteration rounds and the training accuracy; the mathematical functional relation is
f(batchsize, round) = acc
where batchsize is the batch size parameter variable, round is the total number of model training rounds, and acc is the model training accuracy; to fit this relation, the first, fourth and sixth columns of data matrix X_v are selected; the correspondence between the variables in the relation and the columns of the data matrix is: batchsize in the formula corresponds to x_batchk in the data matrix, round corresponds to x_roundk, and acc corresponds to y_acck;
the functional relation is constructed in the same way as in step 2.2); let the order-preserving regression weights corresponding to x_batchk and x_roundk be w_batch and w_round; the finally constructed functional relation is
x_batchk*w_batch + x_roundk*w_round = acc;
(3) Optimized parameter configuration search
3.1) the problem of selecting an optimized parameter configuration and resource allocation scheme can be formalized as follows:
min(r_vm)
min(T_limit)
max(acc_predict)
subject to:
C_batch % r_vm = 0
round * t_round ≤ T_limit
in the above relations, r_vm is the number of resource nodes, T_limit is the total running time allowed for the model training job, acc_predict is the final training accuracy, C_batch is the number of groups into which the data are divided, and t_round is the running time consumed by one round of the model training job; min(r_vm) means minimizing resource usage; min(T_limit) means minimizing training time; max(acc_predict) means maximizing model training accuracy; the constraint C_batch % r_vm = 0 means that, when the data set is divided according to the batch size value, the number of data groups should be an integral multiple of the number of worker nodes; and the constraint round * t_round ≤ T_limit ensures that the total running time over all rounds of the model training job does not exceed the total time allowed for the job;
3.2) through the calculation of step 2), form a list of the time consumed by each training round under different combinations of batch size and resource configuration, and a list of the training accuracy reachable under different batch sizes and iteration rounds, recorded as table1 and table2 respectively;
3.3) combining the two result tables, search with the objectives of minimum running time and optimal model training accuracy, where the running time is the time consumed per training round multiplied by the number of iteration rounds;
3.4) the search proceeds in two steps:
3.4.1) in list table2, find the batch size value and number of iteration rounds corresponding to the best reachable model accuracy;
3.4.2) using that batch size value, look up in list table1 the corresponding per-round iteration time and worker-node resource configuration;
3.4.3) if several records are found, multiply each per-round iteration time by the number of iteration rounds found in the previous step to obtain the total training time, and keep the record with the smallest total training time in a list to be confirmed for subsequent use; records that do not meet the requirement are discarded directly;
3.5) finally, the final parameter configuration combination is determined from the list to be confirmed; the list may contain several records or none:
3.5.1) if there is no record, return to table2 and search for a suboptimal accuracy value so as to continue the search;
3.5.2) if there are several records, take the constraint of minimum total training time and find the configuration that uses the fewest worker-node resources among them, and execute model training according to that configuration;
(4) performing model training
4.1) forming a parameter configuration list according to the solution space search result in the step (3);
4.2) set the model training configuration according to the parameter configuration list;
(5) end: wait for the model training job to complete.
CN202010313690.1A 2020-04-20 2020-04-20 Deep learning training resource allocation prediction method in cloud environment Pending CN111444026A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010313690.1A CN111444026A (en) 2020-04-20 2020-04-20 Deep learning training resource allocation prediction method in cloud environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010313690.1A CN111444026A (en) 2020-04-20 2020-04-20 Deep learning training resource allocation prediction method in cloud environment

Publications (1)

Publication Number Publication Date
CN111444026A true CN111444026A (en) 2020-07-24

Family

ID=71656112

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010313690.1A Pending CN111444026A (en) 2020-04-20 2020-04-20 Deep learning training resource allocation prediction method in cloud environment

Country Status (1)

Country Link
CN (1) CN111444026A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113052334A (en) * 2021-04-14 2021-06-29 中南大学 Method and system for realizing federated learning, terminal equipment and readable storage medium
CN113157413A (en) * 2021-04-16 2021-07-23 上海交通大学 Deep learning task resource optimization configuration method and system based on service quality requirement
CN113205128A (en) * 2021-04-28 2021-08-03 华东师范大学 Distributed deep learning performance guarantee method based on serverless computing
CN114816711A (en) * 2022-05-13 2022-07-29 湖南长银五八消费金融股份有限公司 Batch task processing method and device, computer equipment and storage medium
CN116954873A (en) * 2023-09-21 2023-10-27 浪潮电子信息产业股份有限公司 Heterogeneous computing system, and method, device, equipment and medium for selecting power nodes of heterogeneous computing system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107203809A (en) * 2017-04-20 2017-09-26 华中科技大学 A kind of deep learning automation parameter adjustment method and system based on Keras
CN109659933A (en) * 2018-12-20 2019-04-19 浙江工业大学 A kind of prediction technique of power quality containing distributed power distribution network based on deep learning model
US20190164268A1 (en) * 2017-11-27 2019-05-30 Nvidia Corporation Deep-learning method for separating reflection and transmission images visible at a semi-reflective surface in a computer image of a real-world scene
CN110795228A (en) * 2018-08-03 2020-02-14 伊姆西Ip控股有限责任公司 Adaptive batch dataset partitioning for distributed deep learning using accelerator mixture sets
CN110929884A (en) * 2019-11-22 2020-03-27 北京大学 Classification method and device for distributed machine learning optimization based on column division
CN110930356A (en) * 2019-10-12 2020-03-27 上海交通大学 Industrial two-dimensional code reference-free quality evaluation system and method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107203809A (en) * 2017-04-20 2017-09-26 华中科技大学 A kind of deep learning automation parameter adjustment method and system based on Keras
US20190164268A1 (en) * 2017-11-27 2019-05-30 Nvidia Corporation Deep-learning method for separating reflection and transmission images visible at a semi-reflective surface in a computer image of a real-world scene
CN110795228A (en) * 2018-08-03 2020-02-14 伊姆西Ip控股有限责任公司 Adaptive batch dataset partitioning for distributed deep learning using accelerator mixture sets
CN109659933A (en) * 2018-12-20 2019-04-19 浙江工业大学 A kind of prediction technique of power quality containing distributed power distribution network based on deep learning model
CN110930356A (en) * 2019-10-12 2020-03-27 上海交通大学 Industrial two-dimensional code reference-free quality evaluation system and method
CN110929884A (en) * 2019-11-22 2020-03-27 北京大学 Classification method and device for distributed machine learning optimization based on column division

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113052334A (en) * 2021-04-14 2021-06-29 中南大学 Method and system for realizing federated learning, terminal equipment and readable storage medium
CN113052334B (en) * 2021-04-14 2023-09-29 中南大学 Federal learning realization method, system, terminal equipment and readable storage medium
CN113157413A (en) * 2021-04-16 2021-07-23 上海交通大学 Deep learning task resource optimization configuration method and system based on service quality requirement
CN113157413B (en) * 2021-04-16 2022-04-26 上海交通大学 Deep learning task resource optimization configuration method and system based on service quality requirement
CN113205128A (en) * 2021-04-28 2021-08-03 华东师范大学 Distributed deep learning performance guarantee method based on serverless computing
CN114816711A (en) * 2022-05-13 2022-07-29 湖南长银五八消费金融股份有限公司 Batch task processing method and device, computer equipment and storage medium
CN116954873A (en) * 2023-09-21 2023-10-27 浪潮电子信息产业股份有限公司 Heterogeneous computing system, and method, device, equipment and medium for selecting power nodes of heterogeneous computing system
CN116954873B (en) * 2023-09-21 2024-01-23 浪潮电子信息产业股份有限公司 Heterogeneous computing system, and method, device, equipment and medium for selecting power nodes of heterogeneous computing system

Similar Documents

Publication Publication Date Title
CN111444026A (en) Deep learning training resource allocation prediction method in cloud environment
Alipourfard et al. {CherryPick}: Adaptively unearthing the best cloud configurations for big data analytics
CN105446979B (en) Data digging method and node
US8799916B2 (en) Determining an allocation of resources for a job
US20170228652A1 (en) Method and apparatus for evaluating predictive model
CN109634744B (en) Accurate matching method, equipment and storage medium based on cloud platform resource allocation
US20130318538A1 (en) Estimating a performance characteristic of a job using a performance model
Chen et al. $ d $ d-Simplexed: Adaptive Delaunay Triangulation for Performance Modeling and Prediction on Big Data Analytics
CN112433853B (en) Heterogeneous perception data partitioning method for supercomputer data parallel application
JP3792879B2 (en) Parallel execution system
CN108427756A (en) Personalized query word completion recommendation method and device based on same-class user model
Singh et al. A machine learning approach for modular workflow performance prediction
CN112907026A (en) Comprehensive evaluation method based on editable mesh index system
CN112000460A (en) Service capacity expansion method based on improved Bayesian algorithm and related equipment
Sukhija et al. Portfolio-based selection of robust dynamic loop scheduling algorithms using machine learning
CN112463532B (en) Method for constructing SNN workload automatic mapper and automatic mapper
CN112632615B (en) Scientific workflow data layout method based on hybrid cloud environment
CN110175172B (en) Extremely-large binary cluster parallel enumeration method based on sparse bipartite graph
CN109711555B (en) Method and system for predicting single-round iteration time of deep learning model
US11579680B2 (en) Methods and devices for power management based on synthetic machine learning benchmarks
CN116974249A (en) Flexible job shop scheduling method and flexible job shop scheduling device
Cai et al. A recommendation-based parameter tuning approach for Hadoop
CN115859016A (en) Processor-based operation method and device, computer equipment and storage medium
CN116033026A (en) Resource scheduling method
CN115601103A (en) Article information display method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination