WO2022001134A1 - Model parallel training task load balancing method, apparatus, device, and storage medium - Google Patents


Info

Publication number
WO2022001134A1
WO2022001134A1 PCT/CN2021/076963 CN2021076963W WO2022001134A1 WO 2022001134 A1 WO2022001134 A1 WO 2022001134A1 CN 2021076963 W CN2021076963 W CN 2021076963W WO 2022001134 A1 WO2022001134 A1 WO 2022001134A1
Authority
WO
WIPO (PCT)
Prior art keywords
scheme
initial
equalization
network layer
computing device
Prior art date
Application number
PCT/CN2021/076963
Other languages
English (en)
French (fr)
Inventor
王丽
高开
曹芳
郭振华
Original Assignee
浪潮电子信息产业股份有限公司 (Inspur Electronic Information Industry Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 浪潮电子信息产业股份有限公司 (Inspur Electronic Information Industry Co., Ltd.)
Priority to US18/010,725 priority Critical patent/US11868817B2/en
Publication of WO2022001134A1 publication Critical patent/WO2022001134A1/zh


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 - Multiprogramming arrangements
    • G06F 9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5083 - Techniques for rebalancing the load in a distributed system
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00 - Digital computers in general; Data processing equipment in general
    • G06F 15/76 - Architectures of general purpose stored program computers
    • G06F 15/78 - Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F 15/7807 - System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F 15/7817 - Specially adapted for signal processing, e.g. Harvard architectures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2209/00 - Indexing scheme relating to G06F9/00
    • G06F 2209/50 - Indexing scheme relating to G06F9/50
    • G06F 2209/5022 - Workload threshold

Definitions

  • The invention relates to the technical field of parallel training, and in particular to a model parallel training task load balancing method, a corresponding apparatus, a corresponding device, and a computer-readable storage medium.
  • DNN (deep neural network)
  • AI (artificial intelligence)
  • Model-parallel training involves partitioning the model across devices so that each computing device evaluates only a portion of the model's parameters and performs updates.
  • In the related art, the DNN model is generally divided manually by staff according to experience.
  • Manual division cannot achieve good load balancing: the amount of computation assigned to each computing device varies greatly, and the overall training efficiency is low. Therefore, the related art has the problems of unbalanced load and low overall training efficiency.
  • The purpose of the present invention is to provide a model parallel training task load balancing method, apparatus, device, and computer-readable storage medium, which solve the problems of unbalanced load and low overall training efficiency in the related art.
  • the present invention provides a load balancing method for model parallel training tasks, including:
  • the intermediate equalization scheme is adjusted according to the data traffic to obtain a final equalization scheme.
  • performing the load balancing operation by adopting a variety of device critical layer position division rules according to the initial calculation amount to obtain a plurality of initial balancing schemes, including:
  • according to the network layer sequence, dividing the network layers among the computing devices according to the initial calculation amount, and performing device critical layer detection;
  • dividing the device critical layer into the pre-order computing device to obtain a first equalization scheme; wherein the pre-order computing device is the computing device to which the network layer preceding the device critical layer belongs;
  • dividing the device critical layer into the subsequent computing device to obtain a second equalization scheme; wherein the subsequent computing device is the computing device to which the network layer following the device critical layer belongs;
  • the first equalization scheme and the second equalization scheme are determined as the initial equalization scheme.
  • performing statistics on time performance parameters corresponding to the initial equalization scheme, and determining an intermediate equalization scheme in the initial equalization scheme according to the time performance parameters including:
  • the intermediate equalization scheme is selected among the candidate equalization schemes according to a preset selection rule.
  • adjusting the intermediate equalization scheme according to the data traffic to obtain a final equalization scheme including:
  • the communication duration of each of the computing devices is obtained, and the total duration corresponding to the computing device is obtained by using the communication duration and the computing duration;
  • the intermediate equalization scheme is determined as the final equalization scheme.
  • performing network layer division and optimization processing on the target computing device corresponding to the maximum total duration in the intermediate equalization scheme to obtain an optimized equalization scheme including:
  • a candidate optimization solution is determined in the first optimization solution and the second optimization solution
  • the optimized equalization scheme is determined in the candidate optimization scheme and the intermediate equalization scheme according to the candidate time performance parameter and the time performance parameter corresponding to the candidate optimization scheme.
  • obtaining the data traffic and theoretical calculation amount of each network layer in the target model includes:
  • the theoretical calculation amount is calculated using the parameter information, and the data communication amount is calculated using the input and output information.
  • the method further includes:
  • Each of the network layer groups is sent to the corresponding computing device for training.
  • the present invention also provides a model parallel training task load balancing device, comprising:
  • the acquisition module is used to acquire the data traffic and theoretical calculation amount of each network layer in the target model
  • an initial calculation amount determination module configured to determine the theoretical calculation power of each computing device, and obtain the initial calculation amount corresponding to each of the computing devices according to the theoretical calculation power and the theoretical calculation amount;
  • an initial scheme obtaining module configured to perform load balancing operations by adopting a variety of equipment critical layer location division rules according to the initial calculation amount to obtain a plurality of initial balancing schemes
  • an intermediate scheme determination module configured to count time performance parameters corresponding to the initial equalization scheme, and determine an intermediate equalization scheme in the initial equalization scheme according to the time performance parameters
  • a final solution acquisition module configured to adjust the intermediate equalization solution according to the data traffic to obtain a final equalization solution.
  • the present invention also provides a model parallel training task load balancing device, including a memory and a processor, wherein:
  • the memory for storing computer programs
  • the processor is configured to execute the computer program to implement the above-mentioned method for load balancing of model parallel training tasks.
  • the present invention also provides a computer-readable storage medium for storing a computer program, wherein, when the computer program is executed by a processor, the above-mentioned method for balancing model parallel training tasks is implemented.
  • The model parallel training task load balancing method obtains the data traffic and theoretical calculation amount of each network layer in the target model; determines the theoretical computing power of each computing device, and obtains the initial calculation amount corresponding to each computing device according to the theoretical computing power and the theoretical calculation amount; performs load balancing operations using a variety of device critical layer position division rules according to the initial calculation amount to obtain multiple initial balancing schemes; counts the time performance parameters corresponding to the initial balancing schemes, and determines an intermediate equalization scheme among them according to the time performance parameters; and adjusts the intermediate equalization scheme according to the data traffic to obtain a final equalization scheme.
  • this method obtains the initial calculation amount corresponding to each computing device through the theoretical computing power of the computing device and the theoretical calculation amount of the target model.
  • After the load balancing operation, a variety of different balancing schemes are obtained, that is, the initial balancing schemes.
  • By counting the time performance parameters, the time performance of the multiple schemes is determined, and an initial equalization scheme with better performance is selected as the intermediate equalization scheme. Finally, considering the influence of data communication between computing devices, the intermediate equalization scheme is adjusted to obtain the final equalization scheme.
  • The time performance parameter can be used to represent the overall computing efficiency of all computing devices under an initial balancing scheme. Finally, the impact of the data communication process on each computing device is considered and the intermediate balancing scheme is adjusted accordingly to obtain the final balancing scheme. This achieves load balancing among the computing devices, ensures the overall computing efficiency, that is, the training efficiency, and solves the problems of unbalanced load and low training efficiency in the related art.
  • the present invention also provides a model parallel training task load balancing device, a model parallel training task load balancing device and a computer-readable storage medium, which also have the above beneficial effects.
  • FIG. 1 is a flowchart of a method for load balancing of model parallel training tasks provided by an embodiment of the present invention
  • FIG. 2 is a flowchart of a specific method for adjusting an intermediate equalization scheme provided by an embodiment of the present invention
  • FIG. 3 is a schematic structural diagram of a model parallel training task load balancing device provided by an embodiment of the present invention.
  • FIG. 4 is a schematic structural diagram of a model parallel training task load balancing device according to an embodiment of the present invention.
  • FIG. 1 is a flowchart of a method for load balancing of model parallel training tasks provided by an embodiment of the present invention.
  • the method includes:
  • S101 Acquire the data traffic and theoretical calculation amount of each network layer in the target model.
  • the target model is a network model that needs to be trained in parallel on multiple computing devices, which may specifically be a deep learning model or other network models with network layers.
  • the target model may specifically be an image classification model, a speech recognition model, a language translation model, and the like.
  • the specific type of computing device is not limited.
  • It can be a heterogeneous acceleration device, that is, an acceleration device constructed based on a variety of different architectures.
  • The architecture can be an FPGA (Field-Programmable Gate Array) architecture, a TPU (Tensor Processing Unit) architecture, or a GPU (Graphics Processing Unit) architecture.
  • The data traffic and theoretical calculation amount of each network layer need to be determined. Since each network layer has data input and data output, the data traffic may specifically be either the data input volume or the data output volume of the network layer, selected according to the actual situation.
  • The theoretical calculation amount is the theoretical total amount of computing resources required for the network layer to complete training. The data traffic and the theoretical calculation amount are the main factors affecting the training time of the target model: the greater they are, the more time the corresponding network layer requires for training.
  • S102 Determine the theoretical computing power of each computing device, and obtain an initial computing amount corresponding to each computing device according to the theoretical computing power and the theoretical computing amount.
  • each computing device has a corresponding theoretical computing power.
  • the theoretical computing power can represent the computing speed of the computing device, and its specific size is related to the computing device itself, which is not limited.
  • The computing power of each computing device can be determined according to its theoretical computing power. Therefore, according to the theoretical computing power and the theoretical calculation amount corresponding to each network layer, the load corresponding to the entire target network can be distributed to each computing device in a balanced manner, yielding the corresponding initial calculation amount.
  • The stronger the theoretical computing power, the larger the corresponding initial calculation amount; the weaker the theoretical computing power, the smaller the corresponding initial calculation amount. Therefore, under ideal conditions, every computing device can complete training in the same time.
  • Allocating the initial calculation amount in this way avoids the situation where some computing devices have finished computing while others have not and must be waited for, so the basic overall computing efficiency is guaranteed.
  • a performance model can be constructed and solved to obtain the initial calculation amount.
  • the construction and solving process of the performance model are not limited in this embodiment, and related technologies can be referred to;
  • The computing power is normalized so that every theoretical computing power is expressed in the same unit.
  • the theoretical calculation amount of each network layer is used to calculate the training load of the target network, and the training load is distributed according to the ratio of each theoretical computing power.
  • the initial calculation amount corresponding to each computing device can be determined.
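As a minimal sketch of this proportional split (illustrative names and values; the patent does not prescribe a concrete formula), the normalized computing powers can be used as weights over the total theoretical calculation amount:

```python
def initial_loads(layer_flops, device_powers):
    """Distribute the target network's total theoretical calculation amount
    across devices in proportion to each device's normalized computing power."""
    total = sum(layer_flops)          # training load of the target network
    power_sum = sum(device_powers)    # normalization of the computing powers
    return [total * p / power_sum for p in device_powers]

# A device with twice the computing power receives twice the initial load.
loads = initial_loads([100, 200, 300], [1.0, 2.0])  # -> [200.0, 400.0]
```

Under this split, ideal training time is identical on every device, which is exactly the balance condition the method aims for before layer granularity is taken into account.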
  • Since the network layer is the smallest division unit of the target network and the theoretical calculation amount of a network layer has a certain lower limit, it may not be possible to divide the network layers so that each computing device receives exactly its initial calculation load. Therefore, based on the initial calculation amount, a variety of different device critical layer position division rules are used to divide the network layers, that is, the load balancing operation is performed, and multiple initial balancing schemes are obtained.
  • The device critical layer is a special network layer. Suppose the load already allocated to a first device is still below its initial calculation amount, but dividing the critical layer into the first device would make the first device's load exceed that amount; the layer can then be divided into either the first device or the second device. The device critical layer thus lies between two computing devices when divided and has two optional positions. Based on these two possible positions, multiple device critical layer position division rules can be set.
  • For example, all device critical layers may be divided to the first device, or all to the second device, or some to the first device and the others to the second device. According to these various device critical layer position division rules, multiple initial division schemes can be obtained.
  • the initial partition scheme ensures the balance of computing load among various computing devices as much as possible.
  • S104 Count the time performance parameters corresponding to the initial equalization scheme, and determine an intermediate equalization scheme in the initial equalization scheme according to the time performance parameters.
  • The time performance parameters corresponding to the multiple initial equalization schemes are counted. A time performance parameter represents the time performance of an initial equalization scheme; it may specifically be the average calculation time of the computing devices, the time standard deviation, the time variance, or other similar parameters, and the number of time performance parameters can be one or more.
  • the scheme with the best time performance parameters is selected from the initial equalization schemes as the intermediate equalization scheme.
  • the method for evaluating time performance parameters is not limited in this embodiment, and may be set according to the number and type of time performance parameters.
  • The intermediate equalization scheme is adjusted according to the data traffic and the time performance parameters. Since data communication also takes a certain amount of time, and each network layer's data traffic differs, the required communication time differs as well. Therefore, the impact of data communication must be considered and used to modify the intermediate equalization scheme. After the modification, the time performance parameters are used to evaluate and adjust the intermediate equalization scheme, and the final equalization scheme is obtained.
  • the final balancing scheme combines the influences of data transmission and calculation, and realizes the balanced distribution of the load.
  • the initial calculation amount corresponding to each computing device is obtained through the theoretical computing power of the computing device and the theoretical calculation amount of the target model.
  • A variety of device critical layer position division rules are adopted to perform load balancing operations on the network layers, obtaining a variety of different balancing schemes, that is, the initial balancing schemes.
  • By counting the time performance parameters, the time performance of the multiple schemes is determined, and an initial equalization scheme with better performance is selected as the intermediate equalization scheme. Finally, considering the influence of data communication between computing devices, the intermediate equalization scheme is adjusted to obtain the final equalization scheme.
  • The time performance parameter can be used to represent the overall computing efficiency of all computing devices under an initial balancing scheme. Finally, the impact of the data communication process on each computing device is considered and the intermediate balancing scheme is adjusted accordingly to obtain the final balancing scheme. This achieves load balancing among the computing devices, ensures the overall computing efficiency, that is, the training efficiency, and solves the problems of unbalanced load and low training efficiency in the related art.
  • Step S101 may include:
  • S1011 Acquire parameter information and input and output information corresponding to each network layer.
  • a forward computing network can be constructed, and parameter information and input and output information can be obtained by using the forward computing network.
  • the parameter information is used to indicate how the network layer performs calculations, and may also be referred to as operator parameter information, the specific content of which is not limited in this embodiment, for example, may be convolution kernel size information, or may also include input and output information.
  • The input and output information can be either input information or output information; which one is set in advance and matches the direction of the data traffic. That is, when the data traffic is the data input volume, the input and output information is the input information, and when the data traffic is the data output volume, the input and output information is the output information.
  • the input and output information is used to represent the input or output of the network layer, and its specific content can be the number of features, the size of features, etc., which is not limited.
  • S1012 Calculate the theoretical calculation amount by using the parameter information, and calculate the data communication amount by using the input and output information.
  • the theoretical calculation amount can be calculated according to the parameter information. Since the parameter information indicates what kind of calculation the network layer needs to perform, the corresponding theoretical calculation amount can be accurately determined. Similarly, the corresponding data traffic can be accurately calculated by using the input and output information.
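As an illustrative sketch of step S1012 (the convolution layer type, FLOP formula, and float32 element size are assumptions of this sketch, not specified by the patent), a layer's theoretical calculation amount can be estimated from its parameter information and its data traffic from its output shape:

```python
def conv_flops(in_ch, out_ch, k_h, k_w, out_h, out_w):
    """Theoretical calculation amount of a convolution layer: one multiply
    and one add per kernel element, per output channel, per output position."""
    return 2 * in_ch * k_h * k_w * out_ch * out_h * out_w

def output_traffic(out_ch, out_h, out_w, bytes_per_elem=4):
    """Data traffic taken here as the layer's data output volume, in bytes
    (float32 elements assumed)."""
    return out_ch * out_h * out_w * bytes_per_elem

# Example: a 3x3 convolution mapping 64 to 128 channels on a 56x56 feature map.
flops = conv_flops(64, 128, 3, 3, 56, 56)
traffic = output_traffic(128, 56, 56)
```

Analogous formulas would be derived per layer type from the operator parameter information (e.g. fully connected or pooling layers), which is why the parameter information must identify the computation each layer performs.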
  • step S103 may include:
  • S1031 According to the sequence of the network layers, divide the network layers for each computing device according to the initial calculation amount, and perform device critical layer detection.
  • the network layer is divided for the corresponding computing device, and the device critical layer detection is performed when the network layer is divided.
  • the device critical layer detection can be performed by detecting the following conditions:
  • When dividing the target network layer into the current computing device would cause the device's load to exceed its initial calculation amount, the target network layer is determined to be the device critical layer, that is, the device critical layer is detected. For example, suppose computing device 1 comes first and computing device 2 follows, the initial calculation amount of computing device 1 is 1000, and its current load is 990. If assigning the target network layer to computing device 1 would raise its load to 1010, the target network layer is the device critical layer.
  • The device critical layer can be divided into the pre-order computing device, that is, all device critical layers are divided into their pre-order computing devices, and the first equalization scheme is obtained.
  • The pre-order computing device is the computing device to which the network layer preceding the device critical layer belongs; the pre-order network layer is the network layer whose position in the network layer sequence precedes the device critical layer. In the example of step S1031, the pre-order computing device is computing device 1, and its final computing load is 1010.
  • The device critical layer can also be divided into the subsequent computing device, that is, all device critical layers are divided into their subsequent computing devices, and the second equalization scheme is obtained.
  • The subsequent computing device is the computing device to which the network layer following the device critical layer belongs; the subsequent network layer is the network layer whose position in the network layer sequence follows the device critical layer. In the example of step S1031, the subsequent computing device is computing device 2, and the final computing load of computing device 1 is 990.
  • this embodiment does not limit the execution order of the two steps of S1032 and S1033.
  • S1032 may be executed first, and then S1033 may be executed; or S1033 may be executed first, and then S1032 may be executed; or S1032 and S1033 may be executed simultaneously.
  • S1034 Determine the first equalization scheme and the second equalization scheme as the initial equalization scheme.
  • the first equalization scheme and the second equalization scheme are obtained, they are determined as the initial equalization scheme, so as to subsequently determine the intermediate equalization scheme.
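Steps S1031-S1034 can be sketched as a greedy in-order split (hypothetical function names; a sketch under the assumption that per-device targets come from the initial calculation amounts and that loads are measured in FLOPs):

```python
def partition(layer_flops, targets, critical_to_pre):
    """Assign layers to devices in network order. A layer that would push the
    current device past its target is the device critical layer; it goes to
    the pre-order device (True) or the subsequent device (False)."""
    groups = [[] for _ in targets]
    dev, load = 0, 0.0
    for i, f in enumerate(layer_flops):
        if dev < len(targets) - 1 and load + f > targets[dev]:
            if critical_to_pre:
                groups[dev].append(i)    # critical layer stays on pre-order device
                dev, load = dev + 1, 0.0
                continue
            dev, load = dev + 1, 0.0     # critical layer moves to subsequent device
        groups[dev].append(i)
        load += f
    return groups

layers, targets = [400, 600, 500, 500], [1000, 1000]
first_scheme = partition(layers, targets, True)    # -> [[0, 1, 2], [3]]
second_scheme = partition(layers, targets, False)  # -> [[0, 1], [2, 3]]
```

Layer 2 is the device critical layer here: adding it to device 0 exceeds the 1000-FLOP target, so the two rules place it on different devices, yielding the first and second equalization schemes.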
  • step S104 may include:
  • S1041 Count the calculation durations corresponding to each computing device in the initial equalization scheme, and use the calculation durations to calculate the time average and time standard deviation corresponding to the initial equalization scheme to obtain time performance parameters.
  • Two parameters, the time average and the time standard deviation, are used as the time performance parameters. Specifically, after the initial equalization schemes are obtained, the calculation duration of each computing device in each initial equalization scheme is counted according to its theoretical computing power, and the calculation durations are used to compute the time average and time standard deviation corresponding to each initial equalization scheme.
  • The time average is the average computing time required by the computing devices and reflects the overall computing capability; the time standard deviation represents the difference in computing time between computing devices: the greater the difference, the lower the overall computing efficiency.
  • S1042 Determine whether the time average value is less than the first threshold and whether the time standard deviation is less than the second threshold.
  • the first threshold is used to compare with the time average value
  • the second threshold is used to compare with the time standard deviation.
  • The specific sizes of the first threshold and the second threshold are not limited in this embodiment and can be set according to the actual situation.
  • the number of candidate equalization schemes can be counted, and when there is only one candidate equalization scheme, it can be directly determined as an intermediate equalization scheme.
  • When the number of candidate equalization schemes is not one, there are two cases: the number of candidate equalization schemes is greater than 1, or the number of candidate equalization schemes is 0.
  • an intermediate equalization scheme may be selected among them according to a preset selection rule.
  • the number of candidate equalization schemes is 0, all initial equalization schemes may be determined as candidate equalization schemes, and an intermediate equalization scheme may be determined among them according to a preset selection rule.
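Steps S1041-S1042 and the candidate selection above can be sketched as follows (the threshold values and the smallest-(std, mean) tie-break are assumptions; the patent leaves the preset selection rule open):

```python
import statistics

def time_performance(durations):
    """Time average and time standard deviation of one scheme's
    per-device calculation durations."""
    return statistics.mean(durations), statistics.pstdev(durations)

def pick_intermediate(schemes, t_avg_max, t_std_max):
    """Schemes passing both thresholds become candidates; among the
    candidates (or all schemes, if none pass) pick the smallest
    (std dev, mean) as the intermediate equalization scheme."""
    perf = [time_performance(d) for d in schemes]
    cand = [i for i, (avg, std) in enumerate(perf)
            if avg < t_avg_max and std < t_std_max]
    pool = cand if cand else range(len(schemes))
    return min(pool, key=lambda i: (perf[i][1], perf[i][0]))

# Equal durations beat unbalanced ones with the same mean.
best = pick_intermediate([[10.0, 10.0], [5.0, 15.0]], 12.0, 6.0)  # -> 0
```

Falling back to all schemes when no candidate passes mirrors the zero-candidate case described above, so the selection always returns exactly one intermediate scheme.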
  • step S105 may include:
  • S1051 Obtain the communication duration of each computing device according to the data traffic and the network layer communication speed of each computing device, and obtain the total duration corresponding to the computing device by using the communication duration and the computing duration.
  • the communication speed of the network layer is the data transmission speed between the network layers in the computing device. According to the data traffic and the communication speed of the network layer, the communication duration corresponding to each computing device can be obtained.
  • the computing time is the time required for the computing device to calculate all loads. The total time required for the computing device to process the load can be obtained by adding the communication time and the computing time.
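A minimal sketch of step S1051 (units and a uniform per-device bandwidth are assumptions of this sketch):

```python
def total_durations(compute_s, traffic_bytes, bandwidth_bps):
    """Per-device total duration: computing duration plus communication
    duration (data traffic divided by the network layer communication speed)."""
    return [c + t / bandwidth_bps
            for c, t in zip(compute_s, traffic_bytes)]

# 1 s compute + 4 GB at 1 GB/s, and 2 s compute + 2 GB at 1 GB/s.
totals = total_durations([1.0, 2.0], [4e9, 2e9], 1e9)  # -> [5.0, 4.0]
```

Note that the device with less compute time can still dominate once communication is added, which is why the maximum total duration, not the maximum computing duration, drives the subsequent optimization.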
  • S1052 Determine the maximum total duration, and perform network layer division and optimization processing on the target computing device corresponding to the maximum total duration in the intermediate equalization scheme to obtain an optimized equalization scheme.
  • the maximum total duration is the maximum value among all the total durations.
  • The target computing device is determined by finding the maximum total duration, and the network layer division and optimization process is performed on it. Specifically, the last network layer in the target computing device can be divided into the subsequent computing device of the target computing device, and so on, to complete the network layer division optimization process and obtain an optimized equalization scheme; alternatively, the first network layer in the target computing device can be divided into the pre-order computing device of the target computing device to complete the optimization process and obtain an optimized equalization scheme.
  • Alternatively, the network layers may be optimized in two ways, and the optimized equalization scheme may be determined from the two results, specifically:
  • S10521 Reduce the network layer corresponding to the target computing device by one layer, and adjust the network layers corresponding to other computing devices to obtain a first optimization solution.
  • the network layer corresponding to the target computing device is reduced by one layer, and the network layers corresponding to other computing devices are adjusted at the same time to obtain the first optimization solution.
  • the reduced network layer can be the last network layer or the first network layer.
  • S10522 Reduce the network layer corresponding to the target computing device by two layers, and adjust the network layers corresponding to other computing devices to obtain a second optimization solution.
  • the network layer corresponding to the target computing device may be reduced by two layers to obtain the second optimization solution.
  • the two network layers may be the last network layer and the first network layer, or the last network layer and the penultimate network layer, or the first network layer and the second network layer.
  • S10523 Count the first time performance parameter corresponding to the first optimization solution and the second time performance parameter corresponding to the second optimization solution.
  • the first time performance parameters corresponding to the first optimization scheme and the second time performance parameters corresponding to the second optimization scheme are counted respectively.
  • for the statistical process of the first time performance parameter and the second time performance parameter, please refer to the process above, which is not repeated in this embodiment.
  • S10524 Determine a candidate optimization solution in the first optimization solution and the second optimization solution according to the first time performance parameter and the second time performance parameter.
  • the first optimization solution and the second optimization solution are evaluated according to the first time performance parameter and the second time performance parameter, and the solution with better time performance is selected as the candidate optimization solution.
  • S10525 Determine the optimized equalization scheme between the candidate optimization solution and the intermediate equalization scheme, according to the candidate time performance parameter corresponding to the candidate optimization solution and the time performance parameter.
  • S1053 Determine the optimized equalization scheme as an intermediate equalization scheme, and update the optimization times.
  • the optimized equalization scheme is determined as an intermediate equalization scheme, and the optimization times are updated. Since the optimization process cannot be performed indefinitely, the number of optimization processes that the intermediate equalization scheme has undergone is recorded by the number of optimization processes.
  • when the number of optimizations reaches the preset optimization number threshold, the intermediate balancing scheme is determined as the final balancing scheme, and the load balancing processing for the target model is completed.
  • FIG. 2 is a flowchart of a specific method for adjusting an intermediate equalization scheme provided by an embodiment of the present invention.
  • the maximum number of iterations MAX_ITR is the preset optimization number threshold
  • the optimal splitting strategy for initialization is split_index_before
  • the corresponding time performance parameter is t_before.
  • i is the number of optimizations.
  • when i < MAX_ITR holds, the device with the longest execution time, i.e., the target computing device, is found and its index max_index is recorded; by reducing the network layers processed by max_index by one layer, the first optimization scheme split_index1 is obtained, and by reducing them by two layers, the second optimization scheme split_index2 is obtained.
  • the corresponding first time performance parameters and second time performance parameters are counted, and the split strategy comparison module compares the two split results to obtain the split result split_index with better time performance, that is, the candidate optimization scheme, whose corresponding time performance parameter is t_now.
  • the candidate optimization scheme is compared with the intermediate equalization scheme, and the optimized equalization scheme split_index_last is obtained, with corresponding time performance parameter t_last.
  • after the final equalization scheme is obtained, the target model may also be split according to it and trained. Specifically, the method can also include:
  • Step 11 Split the target model according to the final equalization scheme to obtain multiple network layer groups.
  • the target network can be split according to the network layer to obtain the network layer group corresponding to each computing device.
  • Step 12 Send each network layer group to the corresponding computing device for training.
  • model parallel training task load balancing apparatus provided by the embodiments of the present invention.
  • the model parallel training task load balancing apparatus described below and the model parallel training task load balancing method described above may refer to each other correspondingly.
  • FIG. 3 is a schematic structural diagram of a model parallel training task load balancing device according to an embodiment of the present invention, including:
  • the acquisition module 310 is used to acquire the data traffic and theoretical calculation amount of each network layer in the target model
  • the initial calculation amount determination module 320 is used to determine the theoretical computing power of each computing device, and obtain the initial calculation amount corresponding to each computing device according to the theoretical computing power and the theoretical calculation amount;
  • the initial scheme obtaining module 330 is configured to perform load balancing operations by adopting a variety of equipment critical layer location division rules according to the initial calculation amount to obtain a plurality of initial balancing schemes;
  • An intermediate scheme determination module 340 configured to count time performance parameters corresponding to the initial equalization scheme, and determine an intermediate equalization scheme in the initial equalization scheme according to the time performance parameters;
  • the final scheme obtaining module 350 is configured to adjust the intermediate equalization scheme according to the data traffic to obtain the final equalization scheme.
  • the initial solution obtaining module 330 includes:
  • the device critical layer detection unit is used to divide the network layer for each computing device according to the network layer sequence and according to the initial calculation amount, and perform device critical layer detection;
  • the first equalization scheme determination unit is used to divide the equipment critical layer to the pre-order computing device when the device critical layer is detected, so as to obtain the first equalization scheme; wherein, the pre-order computing device is the pre-order network layer corresponding to the device critical layer the computing device to which it belongs;
  • the second equalization scheme determination unit is configured to divide the equipment critical layer into subsequent computing devices when the device critical layer is detected to obtain a second equalization scheme; wherein, the subsequent computing device is the subsequent network layer corresponding to the device critical layer the computing device to which it belongs;
  • An initial equalization scheme determination unit configured to determine the first equalization scheme and the second equalization scheme as the initial equalization scheme.
  • the intermediate solution determination module 340 includes:
  • the time performance parameter acquisition unit is used to count the calculation duration corresponding to each computing device in the initial equalization scheme, and use the calculation duration to calculate the time average and time standard deviation corresponding to the initial equalization scheme to obtain the time performance parameter;
  • a judging unit for judging whether the time average value is less than the first threshold and whether the time standard deviation is less than the second threshold
  • a candidate equalization solution determination unit configured to determine the initial equalization solution as a candidate equalization solution if the time average value is less than the first threshold and the time standard deviation is less than the second threshold;
  • a first determining unit configured to determine that the candidate equalization solution is an intermediate equalization solution when the number of candidate equalization solutions is one
  • the second determination unit is configured to select an intermediate equalization scheme among the candidate equalization schemes according to a preset selection rule when the number of candidate equalization schemes is not one.
  • the final solution obtaining module 350 includes:
  • the total duration calculation unit is used to obtain the communication duration of each computing device according to the data traffic and the network layer communication speed of each computing device, and obtain the total duration corresponding to the computing device by using the communication duration and the computing duration;
  • the optimization processing unit is used to determine the maximum total duration, and perform network layer division and optimization processing on the target computing device corresponding to the maximum total duration in the intermediate equalization scheme, so as to obtain the optimized equalization scheme;
  • the optimization times updating unit is used to determine the optimized equalization scheme as an intermediate equalization scheme, and update the optimization times;
  • the final balancing scheme determining unit is configured to determine the intermediate balancing scheme as the final balancing scheme when the number of optimizations reaches a preset threshold of the number of optimizations.
  • the optimization processing unit includes:
  • the first processing subunit is used to reduce the network layer corresponding to the target computing device by one layer, and adjust the network layer corresponding to other computing devices to obtain the first optimization scheme;
  • the second processing subunit is used to reduce the network layer corresponding to the target computing device by two layers, and adjust the network layers corresponding to other computing devices to obtain the second optimization scheme;
  • a time performance parameter statistics subunit configured to count the first time performance parameters corresponding to the first optimization solution and the second time performance parameters corresponding to the second optimization solution;
  • a candidate optimization solution determination subunit configured to determine a candidate optimization solution in the first optimization solution and the second optimization solution according to the first time performance parameter and the second time performance parameter;
  • the optimized equalization scheme determination subunit is used to determine the optimized equalization scheme between the candidate optimization scheme and the intermediate equalization scheme, according to the candidate time performance parameters corresponding to the candidate optimization scheme and the time performance parameters.
  • the obtaining module 310 includes:
  • an information acquisition unit used for acquiring parameter information and input and output information corresponding to each network layer
  • the calculation unit is used to calculate the theoretical calculation amount by using the parameter information, and use the input and output information to calculate the data communication amount.
  • the splitting module is used to split the target model according to the final equilibrium scheme to obtain multiple network layer groups
  • the sending module is used for sending each network layer group to the corresponding computing device for training.
  • model parallel training task load balancing device provided by the embodiments of the present invention.
  • the model parallel training task load balancing device described below and the model parallel training task load balancing method described above may refer to each other correspondingly.
  • the model parallel training task load balancing device 400 may include a processor 401 and a memory 402 , and may further include one or more of a multimedia component 403 , an information input/information output (I/O) interface 404 and a communication component 405 .
  • the processor 401 is used to control the overall operation of the model parallel training task load balancing device 400 to complete all or part of the steps in the above-mentioned model parallel training task load balancing method;
  • the memory 402 is used to store various types of data to support the operation of the model parallel training task load balancing device 400; these data may include, for example, instructions for any application or method operating on the model parallel training task load balancing device 400, as well as application-related data.
  • the memory 402 may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
  • Multimedia components 403 may include screen and audio components.
  • the screen can be, for example, a touch screen, and the audio component is used for outputting and/or inputting audio signals.
  • the audio component may include a microphone for receiving external audio signals.
  • the received audio signal may be further stored in memory 402 or transmitted through communication component 405 .
  • the audio assembly also includes at least one speaker for outputting audio signals.
  • the I/O interface 404 provides an interface between the processor 401 and other interface modules, and the above-mentioned other interface modules may be a keyboard, a mouse, a button, and the like. These buttons can be virtual buttons or physical buttons.
  • the communication component 405 is used for wired or wireless communication between the model parallel training task load balancing device 400 and other devices. The wireless communication may be, for example, Wi-Fi, Bluetooth, Near Field Communication (NFC), 2G, 3G or 4G, or a combination of one or more of them; accordingly, the communication component 405 may include a Wi-Fi part, a Bluetooth part, and an NFC part.
  • the model parallel training task load balancing device 400 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors or other electronic components, for executing the model parallel training task load balancing method given in the above embodiment.
  • the present invention also provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the above-mentioned model parallel training task load balancing method are implemented.
  • the computer-readable storage medium may include: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
  • a software module can be placed in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
  • the model parallel training task load balancing method, the model parallel training task load balancing apparatus, the model parallel training task load balancing device, and the computer-readable storage medium provided by the present invention have been described in detail above.
  • specific examples are used herein to explain the principles and implementations of the present invention, and the descriptions of the above embodiments are only used to help understand the method of the present invention and its core idea; meanwhile, for those of ordinary skill in the art, there will be changes in the specific implementation and application scope according to the idea of the present invention.
  • in summary, the content of this specification should not be construed as a limitation of the present invention.

Abstract

A model parallel training task load balancing method, apparatus, device, and computer-readable storage medium, comprising: acquiring the data communication volume and theoretical computation amount of each network layer in a target model; determining the theoretical computing power of each computing device, and obtaining the initial computation amount corresponding to each computing device according to the theoretical computing power and the theoretical computation amount; performing load balancing operations using multiple device critical layer position division rules according to the initial computation amount to obtain multiple initial balancing schemes; collecting time performance parameters corresponding to the initial balancing schemes, and determining an intermediate balancing scheme among the initial balancing schemes according to the time performance parameters; and adjusting the intermediate balancing scheme according to the data communication volume to obtain a final balancing scheme. By deriving the initial balancing schemes from theoretical computing power, selecting an intermediate scheme, and adjusting it, the method balances the load across the computing devices and improves efficiency.

Description

Model Parallel Training Task Load Balancing Method, Apparatus, Device, and Storage Medium
This application claims priority to Chinese patent application No. 202010597645.3, filed with the China National Intellectual Property Administration on June 28, 2020 and entitled "Model Parallel Training Task Load Balancing Method, Apparatus, Device, and Storage Medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the field of parallel training, and in particular to a model parallel training task load balancing method, a model parallel training task load balancing apparatus, a model parallel training task load balancing device, and a computer-readable storage medium.
Background Art
In recent years, with the rise of artificial intelligence, deep neural networks (DNNs) have been widely applied in fields such as image and video classification, speech recognition, and language translation. As training datasets grow and network architectures become increasingly complex, the cost of training deep neural networks keeps rising, placing higher computing power demands on computing platforms, and parallelizing model training has become an urgent need for keeping applications timely. AI (artificial intelligence) accelerators based on distributed training (such as FPGAs, TPUs, and AI chips) have emerged in rapid succession, providing the hardware foundation for parallel training of deep neural networks.
When a DNN model is too large to deploy on a single computing device, model parallel training is used. Model parallel training partitions the model across devices so that each computing device evaluates and updates only a portion of the model parameters. In the related art, the DNN model is generally partitioned manually by engineers based on experience. However, manual partitioning cannot achieve good load balancing: the amount of computation assigned to each computing device varies widely and overall training efficiency is low. The related art therefore suffers from unbalanced load and low overall training efficiency.
Therefore, how to solve the problems of unbalanced load and low overall training efficiency in the related art is a technical problem to be solved by those skilled in the art.
Summary of the Invention
In view of this, an object of the present invention is to provide a model parallel training task load balancing method, a model parallel training task load balancing apparatus, a model parallel training task load balancing device, and a computer-readable storage medium that solve the problems of unbalanced load and low overall training efficiency in the related art.
To solve the above technical problem, the present invention provides a model parallel training task load balancing method, comprising:
acquiring the data communication volume and theoretical computation amount of each network layer in a target model;
determining the theoretical computing power of each computing device, and obtaining the initial computation amount corresponding to each computing device according to the theoretical computing power and the theoretical computation amount;
performing load balancing operations using multiple device critical layer position division rules according to the initial computation amount to obtain multiple initial balancing schemes;
collecting time performance parameters corresponding to the initial balancing schemes, and determining an intermediate balancing scheme among the initial balancing schemes according to the time performance parameters;
adjusting the intermediate balancing scheme according to the data communication volume to obtain a final balancing scheme.
Optionally, performing load balancing operations using multiple device critical layer position division rules according to the initial computation amount to obtain multiple initial balancing schemes comprises:
dividing the network layers among the computing devices in network layer order according to the initial computation amount, and performing device critical layer detection;
when a device critical layer is detected, dividing the device critical layer to a preceding computing device to obtain a first balancing scheme, wherein the preceding computing device is the computing device to which the preceding network layer corresponding to the device critical layer belongs;
when a device critical layer is detected, dividing the device critical layer to a succeeding computing device to obtain a second balancing scheme, wherein the succeeding computing device is the computing device to which the succeeding network layer corresponding to the device critical layer belongs;
determining the first balancing scheme and the second balancing scheme as the initial balancing schemes.
Optionally, collecting the time performance parameters corresponding to the initial balancing schemes and determining the intermediate balancing scheme among the initial balancing schemes according to the time performance parameters comprises:
counting the computation duration corresponding to each computing device in each initial balancing scheme, and using the computation durations to calculate the time average and time standard deviation corresponding to the initial balancing scheme to obtain the time performance parameters;
judging whether the time average is less than a first threshold and the time standard deviation is less than a second threshold;
if so, determining the initial balancing scheme as a candidate balancing scheme;
when the number of candidate balancing schemes is one, determining the candidate balancing scheme as the intermediate balancing scheme;
when the number of candidate balancing schemes is not one, selecting the intermediate balancing scheme among the candidate balancing schemes according to a preset selection rule.
Optionally, adjusting the intermediate balancing scheme according to the data communication volume to obtain the final balancing scheme comprises:
obtaining the communication duration of each computing device according to the data communication volume and the network layer communication speed of each computing device, and obtaining the total duration corresponding to each computing device using the communication duration and computation duration;
determining the maximum total duration, and performing network layer division optimization on the target computing device corresponding to the maximum total duration in the intermediate balancing scheme to obtain an optimized balancing scheme;
determining the optimized balancing scheme as the intermediate balancing scheme, and updating the optimization count;
when the optimization count reaches a preset optimization count threshold, determining the intermediate balancing scheme as the final balancing scheme.
Optionally, performing network layer division optimization on the target computing device corresponding to the maximum total duration in the intermediate balancing scheme to obtain the optimized balancing scheme comprises:
reducing the network layers corresponding to the target computing device by one layer and adjusting the network layers corresponding to the other computing devices to obtain a first optimization scheme;
reducing the network layers corresponding to the target computing device by two layers and adjusting the network layers corresponding to the other computing devices to obtain a second optimization scheme;
collecting first time performance parameters corresponding to the first optimization scheme and second time performance parameters corresponding to the second optimization scheme;
determining a candidate optimization scheme between the first optimization scheme and the second optimization scheme according to the first time performance parameters and the second time performance parameters;
determining the optimized balancing scheme between the candidate optimization scheme and the intermediate balancing scheme according to candidate time performance parameters corresponding to the candidate optimization scheme and the time performance parameters.
Optionally, acquiring the data communication volume and theoretical computation amount of each network layer in the target model comprises:
acquiring the parameter information and input/output information corresponding to each network layer;
calculating the theoretical computation amount using the parameter information, and calculating the data communication volume using the input/output information.
Optionally, after the final balancing scheme is obtained, the method further comprises:
splitting the target model according to the final balancing scheme to obtain multiple network layer groups;
sending each network layer group to the corresponding computing device for training.
The present invention further provides a model parallel training task load balancing apparatus, comprising:
an acquisition module, configured to acquire the data communication volume and theoretical computation amount of each network layer in a target model;
an initial computation amount determination module, configured to determine the theoretical computing power of each computing device, and obtain the initial computation amount corresponding to each computing device according to the theoretical computing power and the theoretical computation amount;
an initial scheme obtaining module, configured to perform load balancing operations using multiple device critical layer position division rules according to the initial computation amount to obtain multiple initial balancing schemes;
an intermediate scheme determination module, configured to collect time performance parameters corresponding to the initial balancing schemes, and determine an intermediate balancing scheme among the initial balancing schemes according to the time performance parameters;
a final scheme obtaining module, configured to adjust the intermediate balancing scheme according to the data communication volume to obtain a final balancing scheme.
The present invention further provides a model parallel training task load balancing device, comprising a memory and a processor, wherein:
the memory is configured to store a computer program;
the processor is configured to execute the computer program to implement the above model parallel training task load balancing method.
The present invention further provides a computer-readable storage medium configured to store a computer program, wherein the computer program, when executed by a processor, implements the above model parallel training task load balancing method.
In the model parallel training task load balancing method provided by the present invention, the data communication volume and theoretical computation amount of each network layer in the target model are acquired; the theoretical computing power of each computing device is determined, and the initial computation amount corresponding to each computing device is obtained according to the theoretical computing power and the theoretical computation amount; load balancing operations are performed using multiple device critical layer position division rules according to the initial computation amount to obtain multiple initial balancing schemes; time performance parameters corresponding to the initial balancing schemes are collected, and an intermediate balancing scheme is determined among the initial balancing schemes according to the time performance parameters; and the intermediate balancing scheme is adjusted according to the data communication volume to obtain the final balancing scheme.
It can be seen that the method derives the initial computation amount of each computing device from the devices' theoretical computing power and the target model's theoretical computation amount, and, based on this initial computation amount, applies multiple device critical layer position division rules to the network layers of the target model to perform load balancing operations, obtaining multiple different balancing schemes, i.e., the initial balancing schemes. By collecting time performance parameters, the time performance of the schemes is determined, and the better-performing initial balancing scheme is selected as the intermediate balancing scheme. Finally, the influence of data communication between computing devices is considered and the intermediate balancing scheme is adjusted to obtain the final balancing scheme. Deriving the initial computation amounts from theoretical computing power and building the initial balancing schemes on them achieves a good balance in the amount of data each computing device must compute; the time performance parameters represent the overall computing efficiency of all computing devices under an initial balancing scheme; and finally, accounting for the influence of the data communication process on each computing device and adjusting the intermediate balancing scheme accordingly yields a final balancing scheme that balances the load across the computing devices and guarantees overall computing efficiency, i.e., training efficiency, thereby solving the problems of unbalanced load and low training efficiency in the related art.
In addition, the present invention further provides a model parallel training task load balancing apparatus, a model parallel training task load balancing device, and a computer-readable storage medium, which have the same beneficial effects.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are merely embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a model parallel training task load balancing method according to an embodiment of the present invention;
FIG. 2 is a flowchart of a specific method for adjusting an intermediate balancing scheme according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a model parallel training task load balancing apparatus according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a model parallel training task load balancing device according to an embodiment of the present invention.
Detailed Description of the Embodiments
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
In a possible implementation, please refer to FIG. 1, which is a flowchart of a model parallel training task load balancing method according to an embodiment of the present invention. The method comprises:
S101: Acquire the data communication volume and theoretical computation amount of each network layer in the target model.
The target model is a network model that needs to be trained in parallel on multiple computing devices; it may specifically be a deep learning model or another network model with network layers, such as an image classification model, a speech recognition model, or a language translation model. The specific type of computing device is not limited; the devices may, for example, be heterogeneous accelerators, i.e., accelerators built on more than one architecture, such as an FPGA (Field-Programmable Gate Array), TPU (Tensor Processing Unit), or GPU (Graphics Processing Unit) architecture.
The target model has multiple network layers, and when determining a balancing scheme for it, the data communication volume and theoretical computation amount of each network layer must be determined. Since every network layer has both data input and data output, the data communication volume may be the layer's data input volume or data output volume, chosen according to the actual situation. The theoretical computation amount is the total amount of computing resources theoretically required to train the layer. The data communication volume and theoretical computation amount are the factors with the greatest influence on the time needed to train the target model: the larger they are, the more time the corresponding network layer needs for training.
S102: Determine the theoretical computing power of each computing device, and obtain the initial computation amount corresponding to each computing device according to the theoretical computing power and the theoretical computation amount.
In this embodiment there are multiple computing devices, each with a corresponding theoretical computing power. The theoretical computing power represents a device's computing speed; its specific value depends on the device itself and is not limited here. Since the theoretical computing power determines each device's computing capability, the total load of the target network can be distributed among the devices in a balanced way according to the theoretical computing power and the theoretical computation amount of each network layer, yielding the corresponding initial computation amounts. Specifically, the stronger the theoretical computing power, the larger the corresponding initial computation amount, and the weaker the power, the smaller the amount, so that in the ideal case all devices finish their assigned computation in the same time. This avoids situations in which some devices have finished while others have not and must be waited for, guaranteeing a basic level of overall computing efficiency.
Specifically, after obtaining each device's theoretical computing power, a performance model may be constructed and solved to obtain the initial computation amounts; the construction and solving process is not limited in this embodiment, and reference may be made to the related art. Alternatively, the theoretical computing powers may be normalized to a common unit and representation, the training load of the target network computed from the theoretical computation amounts of the network layers, and the training load distributed to the devices in proportion to their theoretical computing power, thereby determining the initial computation amount of each device.
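The proportional allocation described above can be sketched as follows. This is a minimal illustration: the function name, the example figures, and the use of FLOPs as the unit of theoretical computation are assumptions, not taken from the patent.

```python
def initial_compute_amounts(layer_flops, device_powers):
    """Split the total theoretical computation of the target model across
    devices in proportion to their (normalized) theoretical computing power."""
    total_load = sum(layer_flops)       # training load of the whole target network
    total_power = sum(device_powers)    # normalized theoretical computing power
    return [total_load * p / total_power for p in device_powers]

# Example: three heterogeneous devices sharing a 10-layer model's load;
# the device with twice the computing power receives twice the computation.
flops = [120, 80, 200, 150, 90, 60, 300, 100, 50, 50]  # per-layer theoretical cost
powers = [2.0, 1.0, 1.0]                               # relative device speeds
print(initial_compute_amounts(flops, powers))
```

The allocation preserves the total load exactly, which is the property the subsequent layer-division step relies on.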
S103: Perform load balancing operations using multiple device critical layer position division rules according to the initial computation amount, obtaining multiple initial balancing schemes.
Since the network layer is the smallest unit into which the target network can be divided, and a layer's theoretical computation amount has a certain lower bound, the actual layer division may not match each device's initial computation amount perfectly. Therefore, based on the initial computation amounts, multiple different device critical layer position division rules are applied to divide the network layers, i.e., load balancing operations are performed, yielding multiple initial balancing schemes.
It should be noted that a device critical layer is a special network layer: the load already assigned to the corresponding first device falls short of its initial computation amount, but assigning the device critical layer to the first device would make its load exceed that amount, so the layer may be assigned either to the second device or to the first. A device critical layer thus sits between two computing devices at division time and can be assigned to either of them. Based on these two possible positions, multiple device critical layer position division rules can be defined, for example: assign all device critical layers to the first device, assign all of them to the second device, or assign some to the first device and the rest to the second. These rules yield multiple initial division schemes, which keep the computing load as balanced as possible across the devices.
S104: Collect the time performance parameters corresponding to the initial balancing schemes, and determine an intermediate balancing scheme among the initial balancing schemes according to the time performance parameters.
Because the computing load actually assigned to each device is not necessarily equal to its initial computation amount, training under an initial balancing scheme may leave some devices finished while others are still computing; in severe cases this greatly reduces the overall computing efficiency of all devices, i.e., the overall training efficiency.
Therefore, after the initial balancing schemes are obtained, their time performance parameters are collected. A time performance parameter represents a scheme's time performance; it may be the average computation time of the devices, the time standard deviation, the time variance, or a similar quantity, and there may be one or more such parameters. After collecting the time performance parameters of the initial training schemes, the scheme with the best time performance is selected among the initial balancing schemes as the intermediate balancing scheme. The evaluation method for the time performance parameters is not limited in this embodiment and can be set according to their number and type.
S105: Adjust the intermediate balancing scheme according to the data communication volume to obtain the final balancing scheme.
After the intermediate balancing scheme is determined, it is adjusted according to the data communication volume and the time performance parameters. Since data communication also takes time, network layers with different data communication volumes need different amounts of time. The influence of data communication must therefore be taken into account: the intermediate balancing scheme is modified accordingly and then evaluated with the time performance parameters, accomplishing the adjustment and yielding the final balancing scheme. The final balancing scheme accounts for both data transfer and computation, achieving a balanced distribution of load.
By applying the model parallel training task load balancing method provided by this embodiment of the present invention, the initial computation amount of each computing device is obtained from the devices' theoretical computing power and the target model's theoretical computation amount, and multiple device critical layer position division rules are applied to the network layers of the target model based on this initial computation amount, yielding multiple different balancing schemes, i.e., the initial balancing schemes. By collecting time performance parameters, the time performance of the schemes is determined, and the better-performing initial balancing scheme is selected as the intermediate balancing scheme. Finally, the influence of data communication between computing devices is considered and the intermediate balancing scheme is adjusted to obtain the final balancing scheme. Deriving the initial computation amounts from theoretical computing power and building the initial balancing schemes on them achieves a good balance in the amount of data each computing device must compute; the time performance parameters represent the overall computing efficiency of all computing devices under an initial balancing scheme; and finally, accounting for the influence of the data communication process on each computing device and adjusting the intermediate balancing scheme accordingly yields a final balancing scheme that balances the load across the computing devices and guarantees overall computing efficiency, i.e., training efficiency, solving the problems of unbalanced load and low training efficiency in the related art.
Based on the above embodiment, this embodiment elaborates several of its steps. To obtain each network layer's data communication volume and theoretical computation amount accurately, they can be calculated from the layer's parameter information and input/output information. Step S101 may comprise:
S1011: Acquire the parameter information and input/output information corresponding to each network layer.
Specifically, a forward computation network may be constructed and used to obtain the parameter information and input/output information. The parameter information indicates what computation the layer performs and may also be called operator parameter information; its specific content is not limited in this embodiment, and may for example be convolution kernel size information, or may further include input/output information. The input/output information may be input information or output information; it is set in advance and matches the direction of the data communication volume: when the data communication volume is the data input volume, the input/output information is the input information, and when the data communication volume is the data output volume, the input/output information is the output information. The input/output information describes the layer's input or output, e.g., feature count or feature size, without limitation.
S1012: Calculate the theoretical computation amount using the parameter information, and calculate the data communication volume using the input/output information.
The theoretical computation amount can be computed from the parameter information: since the parameter information specifies what computation the layer performs, its theoretical computation amount can be determined accurately. Likewise, the corresponding data communication volume can be computed accurately from the input/output information.
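As one hedged illustration of S1011–S1012, a convolutional layer's theoretical computation amount and data communication volume can be estimated from its operator parameters and output shape. The formulas below follow the standard convolution cost convention; the function name, and the choice of the output feature map as the communication direction, are assumptions for illustration rather than details from the patent.

```python
def conv_layer_stats(c_in, c_out, k, h_out, w_out):
    """Estimate a conv layer's theoretical computation (multiply-accumulates)
    from its operator parameter information, and its data communication
    volume from the output feature-map size (output direction assumed)."""
    flops = c_in * k * k * c_out * h_out * w_out  # MACs for one forward pass
    comm = c_out * h_out * w_out                  # elements passed to the next layer
    return flops, comm
```

Running the same estimate over every layer of a forward computation network yields the per-layer cost and communication lists used by the later division steps.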
Based on the above embodiment, when determining the initial balancing schemes, two device critical layer position division rules can be used for the load balancing operations in order to speed up determination of the initial balancing schemes and reduce the computing resources required, and hence the resources needed to reach the final balancing scheme. Specifically, step S103 may comprise:
S1031: Divide the network layers among the computing devices in network layer order according to the initial computation amounts, and perform device critical layer detection.
Since the target network must be trained in network layer order, layer division must also follow that order. Network layers are assigned to each computing device according to its initial computation amount, and device critical layer detection is performed during the division. Specifically, detection can be done by checking the following:
Judge whether the target device's current load is less than its initial computation amount; if so, assign the target network layer to the target device and judge whether the device's current load now exceeds its initial computation amount; if it does, the target network layer is a device critical layer, i.e., a device critical layer has been detected. For example, with computing device 1 first and computing device 2 second, if device 1's initial computation amount is 1000 and its current load is 990, then assigning the target network layer to device 1 brings its load to 1010, making that layer a device critical layer.
S1032: When a device critical layer is detected, assign it to the preceding computing device to obtain the first balancing scheme.
When a device critical layer is detected, it can be assigned to the preceding computing device; assigning all device critical layers to their preceding devices yields the first balancing scheme. It should be noted that the preceding computing device is the device to which the network layer preceding the device critical layer in layer order belongs; in the example of S1031 it is computing device 1, whose final computing load is then 1010.
S1033: When a device critical layer is detected, assign it to the succeeding computing device to obtain the second balancing scheme.
After a device critical layer is detected, it can alternatively be assigned to the succeeding computing device; assigning all device critical layers to their succeeding devices yields the second balancing scheme. The succeeding computing device is the device to which the network layer following the device critical layer in layer order belongs; in the example of S1031 it is computing device 2, and device 1's final computing load is then 990.
It should be noted that this embodiment does not limit the execution order of S1032 and S1033: S1032 may be executed before S1033, S1033 may be executed before S1032, or the two may be executed simultaneously.
S1034: Determine the first balancing scheme and the second balancing scheme as the initial balancing schemes.
After the first balancing scheme and the second balancing scheme are obtained, they are determined as the initial balancing schemes for subsequent determination of the intermediate balancing scheme.
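The two-rule division of S1031–S1034 can be sketched as follows; this is a minimal sketch under the assumption that per-layer costs and per-device initial computation amounts are given as lists, and all names are illustrative.

```python
def divide_layers(layer_flops, init_amounts, critical_to_pred=True):
    """Greedy in-order division of layers onto devices. When a layer would
    push a device past its initial computation amount while the device is
    still under quota (a 'device critical layer'), it goes to the preceding
    device if critical_to_pred is True (first balancing scheme) or to the
    succeeding device otherwise (second balancing scheme)."""
    split = [[] for _ in init_amounts]
    dev, load = 0, 0.0
    for i, f in enumerate(layer_flops):
        not_last = dev < len(init_amounts) - 1
        critical = not_last and load < init_amounts[dev] and load + f > init_amounts[dev]
        if critical and not critical_to_pred:
            dev, load = dev + 1, 0.0          # hand the critical layer to the successor
        split[dev].append(i)
        load += f
        if (critical and critical_to_pred) or (
            dev < len(init_amounts) - 1 and load >= init_amounts[dev]
        ):
            dev, load = dev + 1, 0.0          # quota reached, move to the next device
    return split

# The worked example from S1031: quota 1000, load 990, next layer costs 20.
print(divide_layers([990, 20, 30], [1000, 50], critical_to_pred=True))   # device 1 ends at 1010
print(divide_layers([990, 20, 30], [1000, 50], critical_to_pred=False))  # device 1 ends at 990
```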
Based on the above embodiment, to ensure that the best intermediate balancing scheme is selected, multiple time performance parameters and thresholds can be used to evaluate the initial balancing schemes and finally obtain the intermediate balancing scheme. Specifically, step S104 may comprise:
S1041: Count the computation duration corresponding to each computing device in each initial balancing scheme, and use the computation durations to calculate the scheme's time average and time standard deviation, obtaining the time performance parameters.
In this embodiment, two parameters, the time average and the time standard deviation, serve as the time performance parameters. Specifically, after the initial balancing schemes are obtained, the computation duration of each device in each scheme is computed from the devices' theoretical computing power, and the computation durations are used to calculate each scheme's time average and time standard deviation. The time average is the mean computation duration the devices need and reflects the overall computing capability, while the standard deviation expresses how much the devices' computation durations differ from one another: the larger the difference, the lower the overall computing efficiency.
S1042: Judge whether the time average is less than a first threshold and the time standard deviation is less than a second threshold.
The first threshold is compared against the time average and the second threshold against the time standard deviation; their specific values are not limited in this embodiment and can be set according to the actual situation.
S1043: If so, determine the initial balancing scheme as a candidate balancing scheme.
When an initial balancing scheme's time average is below the first threshold and its time standard deviation is below the second threshold, the scheme's time performance is good, so it is determined as a candidate balancing scheme. The above steps continue until all initial balancing schemes have been judged.
S1044: When the number of candidate balancing schemes is one, determine that candidate balancing scheme as the intermediate balancing scheme.
After all initial balancing schemes have been judged, the number of candidate balancing schemes can be counted; when there is only one candidate, it is directly determined as the intermediate balancing scheme.
S1045: When the number of candidate balancing schemes is not one, select the intermediate balancing scheme among the candidate balancing schemes according to a preset selection rule.
When the number of candidate balancing schemes is not one, there are two cases: the number of candidates is greater than one, or the number of candidates is zero. When there is more than one candidate, e.g., two, the intermediate balancing scheme is selected among them according to the preset selection rule. When there are zero candidates, all initial balancing schemes are determined as candidate balancing schemes, and the intermediate balancing scheme is selected among them according to the preset selection rule.
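A minimal sketch of S1041–S1043: the function and parameter names, and the use of Python's `statistics` module (population standard deviation), are illustrative choices, since the patent does not prescribe an implementation.

```python
import statistics

def time_performance(split, layer_flops, device_powers):
    """Per-scheme time performance parameters: the time average and time
    standard deviation of the devices' computation durations."""
    durations = [
        sum(layer_flops[i] for i in group) / p
        for group, p in zip(split, device_powers)
    ]
    return statistics.mean(durations), statistics.pstdev(durations)

def candidates(schemes, layer_flops, device_powers, t_mean_max, t_std_max):
    """Keep the initial balancing schemes whose time average and time
    standard deviation both fall below the two preset thresholds."""
    kept = []
    for s in schemes:
        mean, std = time_performance(s, layer_flops, device_powers)
        if mean < t_mean_max and std < t_std_max:
            kept.append(s)
    return kept
```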
Based on the above embodiment, to balance the total duration each computing device needs, the total duration can be reduced by optimizing the network layer division of the device with the maximum total duration, improving overall computing efficiency. Specifically, step S105 may comprise:
S1051: Obtain each computing device's communication duration according to the data communication volume and each device's network layer communication speed, and obtain each device's total duration using the communication duration and computation duration.
The network layer communication speed is the data transfer speed between network layers on the computing device; from the data communication volume and the network layer communication speed, each device's communication duration is obtained. The computation duration is the time the device needs to compute all of its load; adding the communication duration and the computation duration gives the total duration the device needs to process its load.
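The total-duration computation of S1051 can be sketched as follows (all names are illustrative; the per-device network layer communication speed is assumed to be given as a scalar):

```python
def total_durations(split, layer_flops, layer_comm, device_powers, comm_speeds):
    """Per-device total duration = computation duration + communication
    duration, for one balancing scheme `split` (lists of layer indices)."""
    totals = []
    for group, power, speed in zip(split, device_powers, comm_speeds):
        compute_t = sum(layer_flops[i] for i in group) / power  # computation duration
        comm_t = sum(layer_comm[i] for i in group) / speed      # communication duration
        totals.append(compute_t + comm_t)
    return totals
```

The device with the largest entry of this list is the target computing device for the optimization of S1052.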
S1052: Determine the maximum total duration, and perform network layer division optimization on the target computing device corresponding to the maximum total duration in the intermediate balancing scheme, obtaining an optimized balancing scheme.
The maximum total duration is the largest of all the total durations; the target computing device is determined by determining the maximum total duration, and network layer division optimization is performed on it. Specifically, the last network layer of the target device can be divided to the target device's succeeding computing device, and so on, completing the network layer division optimization and obtaining the optimized balancing scheme; or the first network layer of the target device can be divided to its preceding computing device, completing the network layer division optimization and obtaining the optimized balancing scheme.
To guarantee the effectiveness of the optimization, i.e., the optimization effect, in this embodiment the network layers can be optimized twice and the optimized balancing scheme determined from the results. Specifically:
S10521: Reduce the network layers corresponding to the target computing device by one layer, and adjust the network layers corresponding to the other computing devices, obtaining the first optimization scheme.
In the first optimization pass, the target device's network layers are reduced by one layer while the other devices' network layers are adjusted, yielding the first optimization scheme. The removed layer may be the last network layer or the first network layer.
S10522: Reduce the network layers corresponding to the target computing device by two layers, and adjust the network layers corresponding to the other computing devices, obtaining the second optimization scheme.
In the second optimization pass, the target device's network layers may be reduced by two layers, yielding the second optimization scheme. The two layers may be the last and first network layers, the last and second-to-last network layers, or the first and second network layers.
S10523: Collect the first time performance parameters corresponding to the first optimization scheme and the second time performance parameters corresponding to the second optimization scheme.
After the first and second optimization schemes are obtained, their respective first and second time performance parameters are collected; for the collection process, refer to the process above, which is not repeated in this embodiment.
S10524: Determine a candidate optimization scheme between the first optimization scheme and the second optimization scheme according to the first time performance parameters and the second time performance parameters.
The first and second optimization schemes are evaluated according to the first and second time performance parameters, and the scheme with the better time performance is selected as the candidate optimization scheme.
S10525: Determine the optimized balancing scheme between the candidate optimization scheme and the intermediate balancing scheme, according to the candidate time performance parameters corresponding to the candidate optimization scheme and the time performance parameters.
After the candidate optimization scheme is obtained, the scheme with the better time performance between the candidate optimization scheme and the intermediate balancing scheme is selected as the optimized balancing scheme.
S1053: Determine the optimized balancing scheme as the intermediate balancing scheme, and update the optimization count.
After this optimization pass ends, the optimized balancing scheme is determined as the intermediate balancing scheme, and the optimization count is updated. Since the optimization process cannot continue indefinitely, the optimization count records how many optimization passes the intermediate balancing scheme has undergone.
S1054: When the optimization count reaches the preset optimization count threshold, determine the intermediate balancing scheme as the final balancing scheme.
When the optimization count reaches the preset optimization count threshold, the intermediate balancing scheme is determined as the final balancing scheme, completing the load balancing of the target model.
Please refer to FIG. 2, a flowchart of a specific method for adjusting an intermediate balancing scheme according to an embodiment of the present invention. Here, the maximum iteration count MAX_ITR is the preset optimization count threshold; the initialized better split strategy is split_index_before, with corresponding time performance parameter t_before; and i is the optimization count. While i < MAX_ITR holds, the device with the longest execution time, i.e., the target computing device, is found and its index max_index is recorded. Reducing the network layers handled by max_index by one layer yields the first optimization scheme split_index1, and reducing them by two layers yields the second optimization scheme split_index2. Their first and second time performance parameters are collected, and the split strategy comparison module compares the two split results, giving the split result with the better time performance, split_index, i.e., the candidate optimization scheme, with corresponding time performance parameter t_now. The candidate optimization scheme is compared with the intermediate balancing scheme to obtain the optimized balancing scheme split_index_last, with corresponding time performance parameter t_last. The optimized balancing scheme is determined as the intermediate balancing scheme, i.e., the assignments split_index_before = split_index_last and t_before = t_last are performed, the optimization count is incremented by one, and the judgment i < MAX_ITR is made again, until i < MAX_ITR no longer holds, at which point split_index_before is returned, i.e., the intermediate balancing scheme is determined as the final balancing scheme.
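The loop of FIG. 2 can be sketched as follows; `evaluate` and `shrink` are placeholder helpers (not defined in the patent) standing in for the time performance statistics and the one-/two-layer reduction of S10521–S10522, so this is an illustration of the control flow only.

```python
def adjust_scheme(split_index_before, t_before, evaluate, shrink, MAX_ITR):
    """FIG. 2 adjustment loop (sketch). `evaluate` maps a split to its time
    performance parameter (lower is better); `shrink(split, k)` moves k
    layers off the slowest device and rebalances the remaining devices."""
    for i in range(MAX_ITR):
        split_index1 = shrink(split_index_before, 1)  # one layer fewer on max_index
        split_index2 = shrink(split_index_before, 2)  # two layers fewer on max_index
        # split strategy comparison: keep the better of the two candidates
        split_index = min((split_index1, split_index2), key=evaluate)
        t_now = evaluate(split_index)
        # the candidate replaces the intermediate scheme only if it is better
        if t_now < t_before:
            split_index_before, t_before = split_index, t_now
    return split_index_before
```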
Further, based on the above embodiments, other operations may also be performed after the final balancing scheme is obtained; for example, the target model may be split according to it and trained. Specifically, the method may further comprise:
Step 11: Split the target model according to the final balancing scheme to obtain multiple network layer groups.
Since the final balancing scheme records the network layers corresponding to each computing device, the target network can be split accordingly to obtain the network layer group corresponding to each device.
Step 12: Send each network layer group to the corresponding computing device for training.
The model parallel training task load balancing apparatus provided by the embodiments of the present invention is introduced below; the apparatus described below and the method described above may refer to each other correspondingly.
Please refer to FIG. 3, a schematic structural diagram of a model parallel training task load balancing apparatus according to an embodiment of the present invention, comprising:
an acquisition module 310, configured to acquire the data communication volume and theoretical computation amount of each network layer in a target model;
an initial computation amount determination module 320, configured to determine the theoretical computing power of each computing device, and obtain the initial computation amount corresponding to each computing device according to the theoretical computing power and the theoretical computation amount;
an initial scheme obtaining module 330, configured to perform load balancing operations using multiple device critical layer position division rules according to the initial computation amount to obtain multiple initial balancing schemes;
an intermediate scheme determination module 340, configured to collect time performance parameters corresponding to the initial balancing schemes, and determine an intermediate balancing scheme among the initial balancing schemes according to the time performance parameters;
a final scheme obtaining module 350, configured to adjust the intermediate balancing scheme according to the data communication volume to obtain a final balancing scheme.
Optionally, the initial scheme obtaining module 330 comprises:
a device critical layer detection unit, configured to divide the network layers among the computing devices in network layer order according to the initial computation amounts, and perform device critical layer detection;
a first balancing scheme determination unit, configured to, when a device critical layer is detected, divide it to the preceding computing device to obtain a first balancing scheme, wherein the preceding computing device is the computing device to which the preceding network layer corresponding to the device critical layer belongs;
a second balancing scheme determination unit, configured to, when a device critical layer is detected, divide it to the succeeding computing device to obtain a second balancing scheme, wherein the succeeding computing device is the computing device to which the succeeding network layer corresponding to the device critical layer belongs;
an initial balancing scheme determination unit, configured to determine the first balancing scheme and the second balancing scheme as the initial balancing schemes.
Optionally, the intermediate scheme determination module 340 comprises:
a time performance parameter obtaining unit, configured to count the computation duration corresponding to each computing device in each initial balancing scheme, and use the computation durations to calculate the scheme's time average and time standard deviation to obtain the time performance parameters;
a judging unit, configured to judge whether the time average is less than a first threshold and the time standard deviation is less than a second threshold;
a candidate balancing scheme determination unit, configured to determine the initial balancing scheme as a candidate balancing scheme if the time average is less than the first threshold and the time standard deviation is less than the second threshold;
a first determination unit, configured to determine the candidate balancing scheme as the intermediate balancing scheme when the number of candidate balancing schemes is one;
a second determination unit, configured to select the intermediate balancing scheme among the candidate balancing schemes according to a preset selection rule when the number of candidate balancing schemes is not one.
Optionally, the final scheme obtaining module 350 comprises:
a total duration calculation unit, configured to obtain each computing device's communication duration according to the data communication volume and each device's network layer communication speed, and obtain each device's total duration using the communication duration and computation duration;
an optimization processing unit, configured to determine the maximum total duration, and perform network layer division optimization on the target computing device corresponding to the maximum total duration in the intermediate balancing scheme to obtain an optimized balancing scheme;
an optimization count updating unit, configured to determine the optimized balancing scheme as the intermediate balancing scheme, and update the optimization count;
a final balancing scheme determination unit, configured to determine the intermediate balancing scheme as the final balancing scheme when the optimization count reaches a preset optimization count threshold.
Optionally, the optimization processing unit comprises:
a first processing subunit, configured to reduce the network layers corresponding to the target computing device by one layer and adjust the network layers corresponding to the other computing devices, obtaining a first optimization scheme;
a second processing subunit, configured to reduce the network layers corresponding to the target computing device by two layers and adjust the network layers corresponding to the other computing devices, obtaining a second optimization scheme;
a time performance parameter statistics subunit, configured to collect the first time performance parameters corresponding to the first optimization scheme and the second time performance parameters corresponding to the second optimization scheme;
a candidate optimization scheme determination subunit, configured to determine a candidate optimization scheme between the first and second optimization schemes according to the first and second time performance parameters;
an optimized balancing scheme determination subunit, configured to determine the optimized balancing scheme between the candidate optimization scheme and the intermediate balancing scheme according to the candidate time performance parameters corresponding to the candidate optimization scheme and the time performance parameters.
Optionally, the acquisition module 310 comprises:
an information obtaining unit, configured to acquire the parameter information and input/output information corresponding to each network layer;
a calculation unit, configured to calculate the theoretical computation amount using the parameter information, and calculate the data communication volume using the input/output information.
Optionally, the apparatus further comprises:
a splitting module, configured to split the target model according to the final balancing scheme to obtain multiple network layer groups;
a sending module, configured to send each network layer group to the corresponding computing device for training.
The model parallel training task load balancing device provided by the embodiments of the present invention is introduced below; the device described below and the method described above may refer to each other correspondingly.
Please refer to FIG. 4, a schematic structural diagram of a model parallel training task load balancing device according to an embodiment of the present invention. The model parallel training task load balancing device 400 may include a processor 401 and a memory 402, and may further include one or more of a multimedia component 403, an information input/information output (I/O) interface 404, and a communication component 405.
The processor 401 is used to control the overall operation of the model parallel training task load balancing device 400 to complete all or part of the steps of the above model parallel training task load balancing method; the memory 402 is used to store various types of data to support the operation of the device 400, which may include, for example, instructions for any application or method operating on the device 400 as well as application-related data. The memory 402 may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
The multimedia component 403 may include a screen and an audio component. The screen may be, for example, a touch screen, and the audio component is used to output and/or input audio signals. For example, the audio component may include a microphone for receiving external audio signals; the received audio signals may be further stored in the memory 402 or sent via the communication component 405. The audio component further includes at least one speaker for outputting audio signals. The I/O interface 404 provides an interface between the processor 401 and other interface modules, which may be a keyboard, a mouse, buttons, and the like; the buttons may be virtual or physical. The communication component 405 is used for wired or wireless communication between the device 400 and other devices. The wireless communication may be, for example, Wi-Fi, Bluetooth, Near Field Communication (NFC), 2G, 3G or 4G, or a combination of one or more of them; accordingly, the communication component 405 may include a Wi-Fi part, a Bluetooth part, and an NFC part.
The model parallel training task load balancing device 400 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for executing the model parallel training task load balancing method given in the above embodiments.
The computer-readable storage medium provided by the embodiments of the present invention is introduced below; the computer-readable storage medium described below and the method described above may refer to each other correspondingly.
The present invention further provides a computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the above model parallel training task load balancing method.
The computer-readable storage medium may include: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the others, and for the same or similar parts the embodiments may refer to one another. Since the apparatus disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is relatively brief, and reference may be made to the description of the method for relevant details.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of the examples have been described above generally in terms of function. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present invention.
The steps of the method or algorithm described with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
Finally, it should also be noted that, herein, relational terms such as first and second are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms comprise, include, or any other variant are intended to cover non-exclusive inclusion, so that a process, method, article, or device comprising a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device.
以上对本发明所提供的模型并行训练任务负载均衡方法、模型并行训练任务负载均衡装置、模型并行训练任务负载均衡设备和计算机可读存储介质进行了详细介绍,本文中应用了具体个例对本发明的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本发明的方法及其核心思想;同时,对于本领域的一般技术人员,依据本发明的思想,在具体实施方式及应用范围上均会有改变之处,综上所述,本说明书内容不应理解为对本发明的限制。

Claims (10)

  1. A load balancing method for model-parallel training tasks, comprising:
    acquiring a data communication volume and a theoretical computation amount of each network layer in a target model;
    determining a theoretical computing power of each computing device, and obtaining an initial computation amount corresponding to each computing device according to the theoretical computing power and the theoretical computation amount;
    performing, according to the initial computation amount, load balancing operations using a plurality of device critical layer position division rules to obtain a plurality of initial equalization schemes;
    collecting time performance parameters corresponding to the initial equalization schemes, and determining an intermediate equalization scheme among the initial equalization schemes according to the time performance parameters; and
    adjusting the intermediate equalization scheme according to the data communication volume to obtain a final equalization scheme.
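The second step of claim 1 (deriving each device's initial computation amount from its theoretical computing power) can be sketched in code. The following is a hypothetical Python sketch; the layer and device figures are invented for illustration and do not come from the application.

```python
# Hypothetical sketch: split the model's total theoretical computation across
# devices in proportion to each device's theoretical computing power.
def initial_compute_budget(layer_flops, device_flops):
    """Return one computation budget per device, proportional to capability."""
    total = sum(layer_flops)
    capability = sum(device_flops)
    return [total * f / capability for f in device_flops]

layer_flops = [4, 6, 2, 8, 5, 3]    # per-layer theoretical compute (GFLOPs, assumed)
device_flops = [10, 20, 10]         # per-device theoretical power (GFLOPS, assumed)
print(initial_compute_budget(layer_flops, device_flops))  # [7.0, 14.0, 7.0]
```

The budgets sum back to the model's total computation (28 GFLOPs here), so the subsequent layer division only decides where device boundaries fall.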
  2. The load balancing method for model-parallel training tasks according to claim 1, wherein the performing, according to the initial computation amount, load balancing operations using a plurality of device critical layer position division rules to obtain a plurality of initial equalization schemes comprises:
    dividing the network layers among the computing devices in network layer order according to the initial computation amount, and performing device critical layer detection;
    when a device critical layer is detected, assigning the device critical layer to a preceding computing device to obtain a first equalization scheme, wherein the preceding computing device is the computing device to which the network layer preceding the device critical layer belongs;
    when a device critical layer is detected, assigning the device critical layer to a succeeding computing device to obtain a second equalization scheme, wherein the succeeding computing device is the computing device to which the network layer succeeding the device critical layer belongs; and
    determining the first equalization scheme and the second equalization scheme as the initial equalization schemes.
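The two division rules can be sketched as one greedy pass over the layers. This is a hypothetical Python sketch that assumes a "critical layer" is the first layer whose addition would exceed the current device's computation budget; the `partition` helper and all figures are invented.

```python
def partition(layer_flops, budgets, give_boundary_to_prev):
    """Assign layers to devices in order; when a layer straddles a device's
    budget (a 'critical layer'), give it to the preceding or succeeding device."""
    plan, dev, used = [], 0, 0.0
    for f in layer_flops:
        if dev < len(budgets) - 1 and used + f > budgets[dev]:
            # critical layer: adding it would exceed the current device's budget
            if give_boundary_to_prev:
                plan.append(dev)          # keep it on the current (preceding) device
                dev, used = dev + 1, 0.0
                continue
            dev, used = dev + 1, 0.0      # push it onto the succeeding device
        plan.append(dev)
        used += f
    return plan

flops = [4, 6, 2, 8, 5, 3]
budgets = [7.0, 14.0, 7.0]
print(partition(flops, budgets, True))   # first scheme:  [0, 0, 1, 1, 1, 2]
print(partition(flops, budgets, False))  # second scheme: [0, 1, 1, 2, 2, 2]
```

Running both rules on the same budgets yields the two distinct initial equalization schemes that the next step compares.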
  3. The load balancing method for model-parallel training tasks according to claim 1, wherein the collecting time performance parameters corresponding to the initial equalization schemes and determining an intermediate equalization scheme among the initial equalization schemes according to the time performance parameters comprises:
    collecting a computation duration corresponding to each computing device in each initial equalization scheme, and calculating, using the computation durations, a time mean and a time standard deviation corresponding to the initial equalization scheme to obtain the time performance parameters;
    determining whether the time mean is less than a first threshold and the time standard deviation is less than a second threshold;
    if so, determining the initial equalization scheme as a candidate equalization scheme;
    when the number of candidate equalization schemes is one, determining the candidate equalization scheme as the intermediate equalization scheme; and
    when the number of candidate equalization schemes is not one, selecting the intermediate equalization scheme from the candidate equalization schemes according to a preset selection rule.
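The candidate filter above can be sketched with Python's standard `statistics` module. The threshold values, scheme names, and per-device timings below are invented for illustration; the application does not specify them.

```python
import statistics

def pick_candidates(schemes, mean_thresh, std_thresh):
    """Keep schemes whose per-device computation-time mean AND standard
    deviation both fall below the thresholds (the claim-3 filter)."""
    candidates = []
    for name, times in schemes.items():
        mean = statistics.mean(times)
        std = statistics.pstdev(times)   # population std over the devices
        if mean < mean_thresh and std < std_thresh:
            candidates.append(name)
    return candidates

schemes = {
    "scheme_1": [1.0, 1.1, 0.9],  # well balanced across three devices
    "scheme_2": [1.0, 2.5, 0.4],  # one device dominates
}
print(pick_candidates(schemes, mean_thresh=1.5, std_thresh=0.5))  # ['scheme_1']
```

A low mean rewards overall speed and a low standard deviation rewards balance, so a scheme must be both fast and even to survive the filter.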
  4. The load balancing method for model-parallel training tasks according to claim 1, wherein the adjusting the intermediate equalization scheme according to the data communication volume to obtain a final equalization scheme comprises:
    obtaining a communication duration of each computing device according to the data communication volume and a network layer communication speed of each computing device, and obtaining a total duration corresponding to each computing device using the communication duration and a computation duration;
    determining a maximum total duration, and performing network layer division optimization on a target computing device corresponding to the maximum total duration in the intermediate equalization scheme to obtain an optimized equalization scheme;
    determining the optimized equalization scheme as the intermediate equalization scheme, and updating an optimization count; and
    when the optimization count reaches a preset optimization count threshold, determining the intermediate equalization scheme as the final equalization scheme.
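The bottleneck-finding part of claim 4 can be sketched as follows. This is a hypothetical Python sketch: the units (seconds, MB, MB/s) and all figures are assumed, not taken from the application.

```python
def total_times(compute_times, comm_volumes, link_speeds):
    """Per-device total duration = computation time + data volume / link speed."""
    return [c + v / s for c, v, s in zip(compute_times, comm_volumes, link_speeds)]

def bottleneck(totals):
    """Index of the device with the maximum total duration (first on ties)."""
    return max(range(len(totals)), key=totals.__getitem__)

# Invented figures: compute times (s), communication volumes (MB), link speeds (MB/s).
totals = total_times([2.0, 3.0, 1.0], [8.0, 4.0, 2.0], [4.0, 4.0, 4.0])
print(totals)              # [4.0, 4.0, 1.5]
print(bottleneck(totals))  # 0
```

The optimization loop of claim 4 would repeatedly re-divide the bottleneck device's layers and recompute these totals until the preset optimization count is reached.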
  5. The load balancing method for model-parallel training tasks according to claim 4, wherein the performing network layer division optimization on the target computing device corresponding to the maximum total duration in the intermediate equalization scheme to obtain an optimized equalization scheme comprises:
    reducing the network layers corresponding to the target computing device by one layer and adjusting the network layers corresponding to the other computing devices to obtain a first optimization scheme;
    reducing the network layers corresponding to the target computing device by two layers and adjusting the network layers corresponding to the other computing devices to obtain a second optimization scheme;
    collecting a first time performance parameter corresponding to the first optimization scheme and a second time performance parameter corresponding to the second optimization scheme;
    determining a candidate optimization scheme between the first optimization scheme and the second optimization scheme according to the first time performance parameter and the second time performance parameter; and
    determining the optimized equalization scheme between the candidate optimization scheme and the intermediate equalization scheme according to a candidate time performance parameter corresponding to the candidate optimization scheme and the time performance parameters.
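The one-layer and two-layer reductions can be sketched in a few lines. This hypothetical Python sketch simplifies "adjusting the network layers corresponding to the other computing devices" to handing the removed tail layers to the next device; the claim leaves the exact adjustment open.

```python
def shrink_device(plan, target_dev, n_layers):
    """Take n_layers off the tail of target_dev's span and hand them to the
    next device -- one simple way to 'adjust the other computing devices'."""
    new_plan = plan[:]
    moved = 0
    for i in range(len(new_plan) - 1, -1, -1):
        if new_plan[i] == target_dev and moved < n_layers:
            new_plan[i] = target_dev + 1
            moved += 1
    return new_plan

plan = [0, 0, 1, 1, 1, 2]         # layer -> device assignment (assumed)
print(shrink_device(plan, 1, 1))  # first optimization scheme:  [0, 0, 1, 1, 2, 2]
print(shrink_device(plan, 1, 2))  # second optimization scheme: [0, 0, 1, 2, 2, 2]
```

The two resulting schemes would then be scored with the same time performance parameters as in claim 3, and the winner compared against the unmodified intermediate scheme.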
  6. The load balancing method for model-parallel training tasks according to claim 1, wherein the acquiring a data communication volume and a theoretical computation amount of each network layer in a target model comprises:
    acquiring parameter information and input/output information corresponding to each network layer; and
    calculating the theoretical computation amount using the parameter information, and calculating the data communication volume using the input/output information.
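For a concrete layer type, the two quantities of claim 6 can be estimated from the layer's shape. The sketch below uses a common textbook estimate for a stride-1 "same" convolution, not a formula stated in the application; the shapes and 4-byte float assumption are illustrative.

```python
def conv_layer_stats(h, w, c_in, c_out, k, batch=1, dtype_bytes=4):
    """Textbook estimates for a stride-1 'same' convolution: FLOPs from the
    layer's parameter shape, communication volume from its output tensor size."""
    out_elems = batch * h * w * c_out
    flops = 2 * out_elems * k * k * c_in   # one multiply + one add per weight tap
    comm_bytes = out_elems * dtype_bytes   # activations forwarded to the next device
    return flops, comm_bytes

print(conv_layer_stats(4, 4, 3, 8, 3))  # (6912, 512)
```

The FLOPs estimate feeds the computation-budget step, while the output-tensor size is the per-layer data communication volume used in claim 4.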
  7. The load balancing method for model-parallel training tasks according to claim 1, further comprising, after obtaining the final equalization scheme:
    splitting the target model according to the final equalization scheme to obtain a plurality of network layer groups; and
    sending each network layer group to the corresponding computing device for training.
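The final split of claim 7 amounts to collapsing the per-layer device assignment into contiguous layer groups. A minimal hypothetical Python sketch, with an invented assignment:

```python
from itertools import groupby

def layer_groups(plan):
    """Turn a per-layer device assignment into contiguous layer groups,
    one list of layer indices per device, ready to dispatch for training."""
    groups, idx = {}, 0
    for dev, run in groupby(plan):
        n = len(list(run))
        groups.setdefault(dev, []).extend(range(idx, idx + n))
        idx += n
    return groups

print(layer_groups([0, 0, 1, 1, 1, 2]))  # {0: [0, 1], 1: [2, 3, 4], 2: [5]}
```

Each group would then be serialized and sent to its computing device, which trains only its span of the model.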
  8. A load balancing apparatus for model-parallel training tasks, comprising:
    an acquisition module configured to acquire a data communication volume and a theoretical computation amount of each network layer in a target model;
    an initial computation amount determination module configured to determine a theoretical computing power of each computing device, and obtain an initial computation amount corresponding to each computing device according to the theoretical computing power and the theoretical computation amount;
    an initial scheme acquisition module configured to perform, according to the initial computation amount, load balancing operations using a plurality of device critical layer position division rules to obtain a plurality of initial equalization schemes;
    an intermediate scheme determination module configured to collect time performance parameters corresponding to the initial equalization schemes, and determine an intermediate equalization scheme among the initial equalization schemes according to the time performance parameters; and
    a final scheme acquisition module configured to adjust the intermediate equalization scheme according to the data communication volume to obtain a final equalization scheme.
  9. A load balancing device for model-parallel training tasks, comprising a memory and a processor, wherein:
    the memory is configured to store a computer program; and
    the processor is configured to execute the computer program to implement the load balancing method for model-parallel training tasks according to any one of claims 1 to 7.
  10. A computer-readable storage medium configured to store a computer program, wherein the computer program, when executed by a processor, implements the load balancing method for model-parallel training tasks according to any one of claims 1 to 7.
PCT/CN2021/076963 2020-06-28 2021-02-20 Load balancing method, apparatus and device for parallel model training task, and storage medium WO2022001134A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/010,725 US11868817B2 (en) 2020-06-28 2021-02-20 Load balancing method, apparatus and device for parallel model training task, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010597645.3A CN111752713B (zh) 2020-06-28 2020-06-28 Load balancing method, apparatus and device for parallel model training task, and storage medium
CN202010597645.3 2020-06-28

Publications (1)

Publication Number Publication Date
WO2022001134A1 true WO2022001134A1 (zh) 2022-01-06

Family

ID=72677564

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/076963 WO2022001134A1 (zh) 2020-06-28 2021-02-20 模型并行训练任务负载均衡方法、装置、设备及存储介质

Country Status (3)

Country Link
US (1) US11868817B2 (zh)
CN (1) CN111752713B (zh)
WO (1) WO2022001134A1 (zh)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115016947A (zh) * 2022-08-05 2022-09-06 Computational Aerodynamics Institute, China Aerodynamics Research and Development Center Load distribution method, apparatus, device and medium
CN115996173A (zh) * 2022-11-14 2023-04-21 University of Science and Technology of China Communication optimization method and system for operator-parallel training in distributed deep learning
CN116050499A (zh) * 2023-04-03 2023-05-02 Institute of Artificial Intelligence, Hefei Comprehensive National Science Center (Anhui Artificial Intelligence Laboratory) Adaptive model partitioning method, system and device for model-parallel training
CN116167463A (zh) * 2023-04-26 2023-05-26 Zhejiang Lab Model training method and apparatus, storage medium, and electronic device

Families Citing this family (4)

Publication number Priority date Publication date Assignee Title
CN111752713B (zh) 2020-06-28 2022-08-05 Inspur Electronic Information Industry Co., Ltd. Load balancing method, apparatus and device for parallel model training task, and storage medium
CN112650590B (zh) 2020-12-29 2024-03-15 Beijing QIYI Century Science & Technology Co., Ltd. Task processing method, apparatus and system, and allocation method and apparatus
CN113052332B (zh) 2021-04-02 2023-02-14 Zhejiang University Distributed model-parallel device allocation optimization method based on the device balancing principle
CN114021733B (zh) 2021-09-30 2023-11-14 Suzhou Inspur Intelligent Technology Co., Ltd. Model training optimization method and apparatus, computer device, and storage medium

Citations (8)

Publication number Priority date Publication date Assignee Title
CN108829517A (zh) * 2018-05-31 2018-11-16 Institute of Computing Technology, Chinese Academy of Sciences Training method and system for machine learning in a cluster environment
CN109559734A (zh) * 2018-12-18 2019-04-02 Baidu Online Network Technology (Beijing) Co., Ltd. Acceleration method and apparatus for acoustic model training
CN110018817A (zh) * 2018-01-05 2019-07-16 ZTE Corporation Distributed data execution method and apparatus, storage medium, and processor
CN110046048A (zh) * 2019-04-18 2019-07-23 Hangzhou Dianzi University Load balancing method based on workload-adaptive fast redistribution
CN110503201A (zh) * 2019-08-29 2019-11-26 Suzhou Inspur Intelligent Technology Co., Ltd. Neural network distributed parallel training method and apparatus
CN110689115A (zh) * 2019-09-24 2020-01-14 Shanghai Cambricon Information Technology Co., Ltd. Neural network model processing method and apparatus, computer device, and storage medium
CN111210019A (zh) * 2020-01-16 2020-05-29 University of Electronic Science and Technology of China Neural network inference method based on software-hardware co-acceleration
CN111752713A (zh) * 2020-06-28 2020-10-09 Inspur Electronic Information Industry Co., Ltd. Load balancing method, apparatus and device for parallel model training task, and storage medium

Family Cites Families (12)

Publication number Priority date Publication date Assignee Title
KR101609580B1 (ko) * 2010-02-10 2016-04-07 Samsung Electronics Co., Ltd. Wireless communication system and method for connecting a user terminal and a mobility management entity thereof
US20150081893A1 (en) * 2013-09-17 2015-03-19 Netapp. Inc. Fabric attached storage
US9985984B1 (en) * 2014-10-27 2018-05-29 National Technology & Engineering Solutions Of Sandia, Llc Dynamic defense and network randomization for computer systems
US10747631B2 (en) * 2018-01-19 2020-08-18 DinoplusAI Holdings Limited Mission-critical AI processor with record and replay support
CN110795228B (zh) * 2018-08-03 2023-08-25 EMC IP Holding Company LLC Method and article of manufacture for training a deep learning model, and computing system
CN109214504B (zh) * 2018-08-24 2020-09-04 Shenzhen Research Institute, Beijing University of Posts and Telecommunications FPGA-based design method for a YOLO network forward-inference accelerator
WO2020077540A1 (zh) * 2018-10-16 2020-04-23 Huawei Technologies Co., Ltd. Information processing method and electronic device
CN109598250B (zh) * 2018-12-10 2021-06-25 Beijing Megvii Technology Co., Ltd. Feature extraction method and apparatus, electronic device, and computer-readable medium
CN110619595B (zh) * 2019-09-17 2021-04-13 Huazhong University of Science and Technology Graph computing optimization method based on interconnected multi-FPGA accelerators
CN110889439B (zh) * 2019-11-08 2022-06-17 Inspur Electronic Information Industry Co., Ltd. Image feature extraction method and apparatus, electronic device, and storage medium
CN110909886B (zh) * 2019-11-20 2022-11-04 Beijing Xiaomi Mobile Software Co., Ltd. Machine learning network operation method, apparatus, and medium
US20230214370A1 (en) * 2021-06-24 2023-07-06 Beyond Aerospace Ltd. Distributed machine learning architecture with hybrid data normalization, proof of lineage and data integrity

Cited By (7)

Publication number Priority date Publication date Assignee Title
CN115016947A (zh) * 2022-08-05 2022-09-06 Computational Aerodynamics Institute, China Aerodynamics Research and Development Center Load distribution method, apparatus, device and medium
CN115016947B (zh) * 2022-08-05 2022-10-21 Computational Aerodynamics Institute, China Aerodynamics Research and Development Center Load distribution method, apparatus, device and medium
CN115996173A (zh) * 2022-11-14 2023-04-21 University of Science and Technology of China Communication optimization method and system for operator-parallel training in distributed deep learning
CN115996173B (zh) * 2022-11-14 2023-06-20 University of Science and Technology of China Communication optimization method and system for operator-parallel training in distributed deep learning
CN116050499A (zh) * 2023-04-03 2023-05-02 Institute of Artificial Intelligence, Hefei Comprehensive National Science Center (Anhui Artificial Intelligence Laboratory) Adaptive model partitioning method, system and device for model-parallel training
CN116050499B (zh) * 2023-04-03 2023-07-18 Institute of Artificial Intelligence, Hefei Comprehensive National Science Center (Anhui Artificial Intelligence Laboratory) Adaptive model partitioning method, system and device for model-parallel training
CN116167463A (zh) * 2023-04-26 2023-05-26 Zhejiang Lab Model training method and apparatus, storage medium, and electronic device

Also Published As

Publication number Publication date
US20230195537A1 (en) 2023-06-22
US11868817B2 (en) 2024-01-09
CN111752713B (zh) 2022-08-05
CN111752713A (zh) 2020-10-09

Similar Documents

Publication Publication Date Title
WO2022001134A1 (zh) Load balancing method, apparatus and device for parallel model training task, and storage medium
WO2022252456A1 (zh) Task scheduling method and apparatus, electronic device, and readable storage medium
US10679145B2 (en) System and method for balancing computation with communication in parallel learning
US11586473B2 (en) Methods and apparatus for allocating a workload to an accelerator using machine learning
KR20210050485A (ko) Method and apparatus for compressing a neural network model, corpus translation method and apparatus, electronic device, program, and recording medium
US10338963B2 (en) System and method of schedule validation and optimization of machine learning flows for cloud computing
US20200311546A1 (en) Method and apparatus for partitioning deep neural networks
WO2023024252A1 (zh) Network model training method and apparatus, electronic device, and readable storage medium
CN107239348A (zh) Multi-core processor scheduling method and apparatus, and mobile terminal
WO2021208392A1 (zh) Voice skill jumping method for human-machine dialogue, electronic device, and storage medium
US10187516B2 (en) Systems, non-transitory computer-readable media and methods for voice quality enhancement
CN114254733A (zh) Neural network weight distribution using a tree direct memory access (DMA) bus
CN112540849A (zh) Parameter configuration optimization method and system for distributed computing jobs
CN108667912B (zh) Cloud resource allocation method and apparatus
KR20210124934A (ko) Model training method and apparatus, development system, electronic device, computer-readable storage medium, and computer program
CN112988342A (zh) Methods, systems, articles of manufacture and apparatus for optimizing thread scheduling
CN110503180B (zh) Model processing method and apparatus, and electronic device
CN112418416A (zh) Neural network computing system, neural network computing method, and computer system
WO2020134547A1 (zh) Data fixed-point acceleration method and apparatus, electronic device, and storage medium
CN112990461B (zh) Method, apparatus, computer device and storage medium for constructing a neural network model
WO2023050807A1 (zh) Data processing method, apparatus and system, electronic device, and storage medium
TW202303458A (zh) Dynamic activation sparsity in neural networks
US11704562B1 (en) Architecture for virtual instructions
KR20210156538A (ko) Data processing method and data processing apparatus using a neural network
KR20210103367A (ko) Accelerator, accelerator operating method, and electronic device including the same

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21833611

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21833611

Country of ref document: EP

Kind code of ref document: A1