WO2021056390A1 - Convolutional neural network model synchronous training method, cluster, and readable storage medium - Google Patents

Convolutional neural network model synchronous training method, cluster, and readable storage medium

Info

Publication number
WO2021056390A1
WO2021056390A1 (PCT/CN2019/108442; CN2019108442W)
Authority
WO
WIPO (PCT)
Prior art keywords
training
gpu
neural network
convolutional neural
gpus
Prior art date
Application number
PCT/CN2019/108442
Other languages
English (en)
French (fr)
Inventor
曹芳
郭振华
刘海威
Original Assignee
浪潮电子信息产业股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 浪潮电子信息产业股份有限公司 filed Critical 浪潮电子信息产业股份有限公司
Publication of WO2021056390A1 publication Critical patent/WO2021056390A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources
    • G06F9/5066Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Definitions

  • This application relates to the field of computer application technology, and in particular to a method for synchronous training of a convolutional neural network model, a cluster, and a readable storage medium.
  • the CNN model is composed of a series of different types of layers (for example, convolutional layer, fully connected layer, etc.), and the CNN model is usually trained using a dataset of labeled images.
  • the goal of CNN training is to obtain a high-precision model in the shortest possible time.
  • as the convolutional neural network (CNN) is developed and used more and more widely, the size of the model becomes larger and larger, such as having hundreds of layers with a total of 10 to 20 million parameters.
  • the increase in model size makes efficient model training more important. How to train in a shorter time to make the model converge and achieve higher accuracy has always been a subject of extensive research.
  • synchronous data parallelism is the most commonly used and widely used distributed model training method.
  • This method can reduce the obsolescence of the weights used to calculate the gradient, so that the model can finally achieve a higher convergence accuracy.
  • This method requires that the GPU models be exactly the same to ensure that the training speed of each GPU is the same, thereby reducing mutual waiting time.
  • due to the high price of GPUs and their extremely fast update rate, almost every research group has GPU boards of many different types. If only one type of GPU is used for training, GPUs of the other types sit idle, causing a great waste of resources.
  • the purpose of this application is to provide a synchronous training method, cluster, and readable storage medium for a convolutional neural network model to quickly train a convolutional neural network on GPUs with different performances.
  • a method for synchronous training of convolutional neural network models including:
  • the training samples are samples for training a convolutional neural network model
  • the average gradient value of the gradient value is calculated, and the model parameter is updated by using the average gradient value, so that each GPU obtains the model parameter.
  • obtaining the processing performance parameters corresponding to each different model of GPU includes:
  • the processing time is used to determine the processing performance parameter.
  • determining the processing performance parameter using the processing times includes:
  • the target constant coefficient closest to the ratio is selected from the preset constant coefficient set as the processing performance parameter.
  • obtaining the processing performance parameters corresponding to each different model of GPU includes:
  • the processing performance parameter corresponding to each GPU of the different model is obtained from the storage device.
  • calculating the average gradient value of the gradient value includes:
  • the average gradient value is calculated.
  • when the training round reaches a specified value or the convolutional neural network model reaches a specified loss value, the method includes:
  • the learning rate for training the convolutional neural network model in each GPU is adjusted.
  • assigning a corresponding amount of training data to each GPU according to the processing performance parameter includes:
  • the amount of training data corresponding to each GPU is determined.
  • before inputting training samples to each of the GPUs according to the amount of training data, the method further includes:
  • the total set of training samples is divided into training sample subsets corresponding to each of the GPUs; the training sample size of the training sample subsets matches the training data amount.
  • inputting training samples to each of the GPUs according to the amount of training data includes:
  • the corresponding training sample subset is input to each of the GPUs.
  • before inputting training samples to each of the GPUs according to the amount of training data, the method further includes:
  • the total set of training samples is divided into sample batches of various data sizes corresponding to each of the training data amounts.
  • inputting training samples to each of the GPUs according to the amount of training data includes:
  • the corresponding sample batch is input to each of the GPUs.
  • a synchronous training cluster of a convolutional neural network model includes:
  • a processor, a plurality of GPUs of different models, and a storage device; the processor has a communication connection with each of the GPUs;
  • Each GPU has the convolutional neural network model
  • the processor is configured to obtain the processing performance parameters corresponding to the GPUs of different models; determine the corresponding amount of training data for each GPU according to the processing performance parameters; allocate the training samples to each of the GPUs according to the amounts of training data; obtain the gradient value of each of the GPUs used to adjust the convolutional neural network model; and calculate the average gradient value of the gradient values and use the average gradient value to update the model parameters, so that each of the GPUs obtains the model parameters;
  • the GPU is configured to obtain the training samples from the storage device and use the training samples to train the convolutional neural network model; each GPU feeds back the gradient value to the processor and obtains the model parameters from the processor.
  • the processor is specifically configured to simultaneously issue the same data processing task to each of the GPUs; monitor each of the GPUs to obtain the processing time for each of the GPUs to complete the data processing task; and use the processing times to determine the processing performance parameters;
  • Each GPU is specifically configured to execute the data processing task after receiving the data processing task.
  • the processor is specifically configured to determine the weighting coefficient of the gradient value corresponding to each GPU according to the processing performance parameter; after combining the gradient value corresponding to each GPU with the corresponding weighting coefficient, calculate the average gradient value.
  • the processor is specifically configured to, before inputting training samples to each GPU according to the amount of training data, divide the total set of training samples into training sample subsets corresponding to each of the GPUs, the sample size of each subset matching the training data amount; or divide the total set of training samples into sample batches of multiple data sizes corresponding to the respective training data amounts.
  • a readable storage medium having a computer program stored on the readable storage medium, and when the computer program is executed by a processor, realizes the steps of the synchronization training method for the convolutional neural network model.
  • the processing performance parameters corresponding to each GPU of different models can be obtained, and then the amount of training data corresponding to each GPU can be determined based on the processing performance parameters.
  • training samples are allocated to each GPU according to the amount of training data. Since the amount of training data input to each GPU corresponds to its processing performance parameters, the differences between the times the GPUs take to train the same convolutional neural network on their allocated amounts of training data are reduced, the waiting time between GPUs can be shortened, and training efficiency can be improved.
  • the model of the GPU used for training the convolutional neural network can be different, so the idle GPU can be reduced, and the hardware cost can be reduced.
  • the embodiment of the present application also provides a convolutional neural network model synchronization training cluster and a readable storage medium corresponding to the aforementioned convolutional neural network model synchronization training method, which has the above technical effects, and will not be repeated here.
  • FIG. 1 is a flowchart of a method for synchronous training of a convolutional neural network model in an embodiment of the application
  • FIG. 2 is a schematic diagram of a process for determining processing performance parameters in an embodiment of the application
  • FIG. 3 is a schematic diagram of a specific process for determining processing performance parameters in an embodiment of the application
  • FIG. 4 is a schematic diagram of a process for determining an average gradient value in an embodiment of the application
  • FIG. 5 is an implementation flowchart of a method for synchronous training of a convolutional neural network model in an embodiment of the application
  • FIG. 6 is an implementation flowchart of another method for synchronous training of a convolutional neural network model in an embodiment of the application
  • FIG. 7 is a schematic diagram of the composition structure of a synchronous training cluster of a convolutional neural network model in an embodiment of the application.
  • Figure 8 is a schematic diagram of a traditional data synchronization parallel training mode
  • FIG. 9 is a schematic diagram of a performance analysis flow in an embodiment of the application.
  • FIG. 10 is a schematic diagram of a distributed training mode in an embodiment of this application.
  • FIG. 7 is a schematic diagram of a convolutional neural network model synchronization training cluster to which a convolutional neural network model synchronization training method in an embodiment of the application is applicable.
  • the cluster includes: one processor, multiple GPUs of different models, and a storage device; the processor has a communication connection with each GPU.
  • multiple GPUs of different models mean that there are at least two or more types of GPUs in the cluster.
  • the processor may be a central processing unit (CPU), an application-specific integrated circuit, a digital signal processor, a field-programmable gate array, or another programmable logic device.
  • FIG. 1 is a flowchart of a method for synchronous training of a convolutional neural network model in an embodiment of the application. The method includes the following steps:
  • the processing performance parameter is a parameter that characterizes the processing performance of the GPU.
  • the number of GPUs may be multiple, such as five, and there are at least two types of GPUs.
  • the processing performance parameter corresponding to each GPU of the different model is obtained from the storage device.
  • the processing performance parameter can be represented directly by existing GPU performance parameters. At present, two important characteristics reflect GPU computing capability: the number of CUDA cores and the memory size; and two important indicators describe GPU performance: peak compute performance and memory bandwidth.
  • the processing performance parameters can also be obtained by recording in advance the different processing times of the same task, using them to characterize processing performance, and storing them in a storage device.
  • the data processing task can be a training task of training a convolutional neural network, or can be a common GPU processing task such as rendering an image.
  • to obtain the processing performance parameters quickly, relatively short processing tasks may be preferred.
  • the same data processing task can be issued to each GPU.
  • After a GPU receives the data processing task, it can execute the data processing task.
  • the processor can monitor the task execution status of each GPU and obtain the processing time for each GPU to complete the data processing task. Since better processing performance means shorter processing time, the processing performance parameters can be determined based on the processing times.
  • the least common multiple of all processing times can be calculated, followed by the ratio of the least common multiple to each processing time. Considering that computers usually compute in binary, in this embodiment the target constant coefficient closest to each ratio can be selected from the preset constant coefficient set as the processing performance parameter.
  • for example, with GPU1 through GPU4, suppose the processing times for the same task are 5 seconds for GPU1, 50 seconds for GPU2, 20 seconds for GPU3, and 10 seconds for GPU4. The least common multiple of all processing times is then 100 seconds, giving ratios of 20, 2, 5, and 10, respectively. The processing performance parameter may then be taken as 16 for GPU1, 2 for GPU2, 4 for GPU3, and 8 for GPU4.
  • Method 3: Monitor each GPU to determine the amount of tasks completed per unit time.
  • the processing performance of the GPU can also be characterized based on the number of tasks completed per unit time.
  • the processing performance parameters can be determined based on the amount of tasks completed per unit time.
  • S102 Determine a corresponding amount of training data for each GPU according to the processing performance parameter.
  • the processing performance parameters can represent the respective processing performance of each GPU.
  • the corresponding training data amount is determined for different GPUs based on the processing performance parameters.
  • the amount of training data corresponding to each GPU can be determined according to the positive correlation between processing performance and amount of training data. For example, when better processing performance corresponds to a larger processing performance parameter, the GPU with the larger parameter can be allocated more training data, that is, a larger amount of training data; when better processing performance corresponds to a smaller parameter, the GPU with the smaller parameter can be allocated more training data, that is, a larger amount of training data.
  • linear division can be performed, that is, the ratio of the processing performance parameter to the corresponding amount of training data is the same.
  • S103 Allocate training samples to each GPU according to the amount of training data.
  • the training samples are samples for training the convolutional neural network model.
  • training samples for training the convolutional neural network model can be obtained in advance. After determining the amount of training data corresponding to each GPU, the training samples can be allocated to each GPU according to the amount of training data.
  • After each GPU obtains its corresponding training samples, it trains the convolutional neural network model. It should be noted that, in this embodiment, the convolutional neural network model trained by all of the GPUs in the same time period is the same model.
  • the training process of the convolutional neural network model can follow the common model training process, which is not repeated here.
  • After completing one round of training, the processor can determine the model parameters resulting from the current round of training of the convolutional neural network model according to the training results.
  • devices other than the processor, such as any GPU participating in the model training or a dedicated model-parameter device, may also determine the model parameters based on the gradient values fed back by each GPU.
  • obtaining the gradient value corresponding to each GPU for adjusting the convolutional neural network model may be obtained by monitoring the GPU, or may be obtained by receiving training results fed back by each GPU.
  • S105 Calculate the average gradient value of the gradient value, and use the average gradient value to update the model parameters, so that each GPU can obtain the model parameters.
  • the gradient values can be averaged. Then, the average gradient value is used to update the model parameters, that is, the model parameters are adjusted and updated according to the average gradient value.
  • the processor can directly feed the updated model parameters back to each GPU, or each GPU can obtain them by itself. After a GPU obtains the updated model parameters, it uses them as the model parameters of the current training round of the convolutional neural network. In this way, the different GPUs can train the convolutional neural network model in parallel using the model parameters corresponding to the current training round.
  • the learning rate can directly affect the convergence of the model.
  • Different learning rate change strategies will also affect the final iteration results. Therefore, when the training round reaches the specified value or the convolutional neural network model reaches the specified loss value, the learning rate of the training convolutional neural network model in each GPU can be adjusted.
  • processing performance parameters corresponding to various GPUs of different models can be obtained, and then the amount of training data corresponding to each GPU can be determined based on the processing performance parameters.
  • training samples are allocated to each GPU according to the amount of training data. Since the amount of training data input to each GPU corresponds to its processing performance parameters, the differences between the times the GPUs take to train the same convolutional neural network on their allocated amounts of training data are reduced, the waiting time between GPUs can be shortened, and training efficiency can be improved.
  • the model of the GPU used for training the convolutional neural network can be different, so the idle GPU can be reduced, and the hardware cost can be reduced.
  • the embodiments of the present application also provide corresponding improvement solutions.
  • the same steps as those in the above-mentioned embodiments or corresponding steps can be referred to each other, and the corresponding beneficial effects can also be referred to each other, which will not be repeated in the preferred/improved embodiments herein.
  • considering that the processing performance of the GPUs differs, the gradient value obtained by a GPU allocated more training samples should be more in line with actual requirements than the gradient value obtained by a GPU allocated fewer training samples.
  • different weighting coefficients can be set for different GPUs.
  • the calculation of the average gradient value used to adjust the model parameters in the above S105 may also specifically include:
  • when the processing performance parameter is positively correlated with processing performance, the larger the processing performance parameter, the larger the corresponding weighting coefficient; when the processing performance parameter is negatively correlated with processing performance, the smaller the processing performance parameter, the smaller the corresponding weighting coefficient.
  • the average gradient value is calculated according to the weighted calculation method.
  • FIG. 5 is an implementation flowchart of a method for synchronous training of a convolutional neural network model in an embodiment of the application, and the method includes:
  • S302 Determine a corresponding amount of training data for each GPU according to the processing performance parameters.
  • all training samples are divided among the GPUs in a manner that matches the amounts of training data. For example, when there are 6 GPUs, all training samples are divided into 6 training sample subsets, and the sample size of each training sample subset matches the corresponding training data amount.
  • the storage addresses of the different training sample subsets can be sent to the GPUs, so that each GPU reads its corresponding training sample subset by itself.
  • alternatively, the corresponding training sample subset can be sent to each GPU.
  • S306 Calculate the average gradient value of the gradient value, and use the average gradient value to update the model parameters, so that each GPU can obtain the model parameters.
  • FIG. 6 is an implementation flowchart of another method for synchronous training of a convolutional neural network model in an embodiment of the application, and the method includes:
  • S402 Determine a corresponding amount of training data for each GPU according to the processing performance parameter.
  • the total set of training samples can be directly divided into sample batches of different data sizes, that is, one type of GPU corresponds to one size of sample batches.
  • the corresponding sample batches can be input to each GPU according to the corresponding relationship between the sample batches and the GPU.
  • S406 Calculate the average gradient value of the gradient value, and use the average gradient value to update the model parameters, so that each GPU can obtain the model parameters.
  • the embodiment of the present application also provides a convolutional neural network model synchronization training cluster.
  • the convolutional neural network model synchronization training cluster described below and the convolutional neural network model synchronization training method described above may be cross-referenced.
  • FIG. 7 is a schematic diagram of the composition structure of a synchronous training cluster of a convolutional neural network model in an embodiment of the application.
  • the cluster includes:
  • a processor 100, a plurality of GPUs 200 of different models, and a storage device 300; the processor has a communication connection with each GPU;
  • the storage device stores the training samples of the convolutional neural network model
  • Each GPU has a convolutional neural network model
  • the processor is used to obtain the processing performance parameters corresponding to the GPUs of different models; determine the corresponding amount of training data for each GPU according to the processing performance parameters; allocate training samples to each GPU according to the amounts of training data; obtain from each GPU the gradient value used to adjust the convolutional neural network model; and calculate the average gradient value of the gradient values and use it to update the model parameters, so that each GPU can obtain the model parameters;
  • the GPU is used to obtain training samples from the storage device, and use the training samples to train the convolutional neural network model.
  • Each GPU feeds back the gradient value to the processor; and obtains the model parameters from the processor.
  • the processing performance parameters corresponding to each GPU of different models can be obtained, and then the amount of training data corresponding to each GPU can be determined based on the processing performance parameters.
  • training samples are allocated to each GPU according to the amount of training data. Since the amount of training data input to each GPU corresponds to its processing performance parameters, the differences between the times the GPUs take to train the same convolutional neural network on their allocated amounts of training data are reduced, the waiting time between GPUs can be shortened, and training efficiency can be improved.
  • the models of GPUs used to train convolutional neural networks can be different, so idle GPUs can be reduced, and hardware costs can be reduced.
  • the processor is specifically used to issue the same data processing task to each GPU at the same time; monitor each GPU to obtain the processing time for each GPU to complete the data processing task; and use the processing times to determine the processing performance parameters;
  • Each GPU is specifically used to execute the data processing task after receiving the data processing task.
  • using the processing times to determine the processing performance parameters may specifically include:
  • calculating the least common multiple of all processing times and the ratio of the least common multiple to each processing time; then selecting from the preset constant coefficient set the target constant coefficient closest to the ratio as the processing performance parameter.
  • the processor is specifically configured to determine the weighting coefficient of the gradient value corresponding to each GPU according to the processing performance parameter; after combining the gradient value of each GPU and the corresponding weighting coefficient, calculate the average gradient value.
  • the processor is specifically configured to, before inputting training samples to each GPU according to the amount of training data, divide the total set of training samples into training sample subsets corresponding to each GPU, the sample size of each subset matching the training data amount; or divide the total set of training samples into sample batches of multiple data sizes corresponding to the respective training data amounts.
  • the processor inputs training samples to each GPU according to the amount of training data specifically by inputting the corresponding training sample subset to each GPU according to the correspondence between training sample subsets and GPUs, or by inputting the corresponding sample batch to each GPU according to the correspondence between sample batches and GPUs.
  • the processor acquiring the processing performance parameters corresponding to each different model of GPU includes: acquiring the processing corresponding to each different model of GPU from the storage device according to the corresponding relationship between the GPU model and the processing performance parameter Performance parameters.
  • the processor is specifically configured to adjust the learning rate for training the convolutional neural network model in each GPU when the training round reaches the specified value or the convolutional neural network model reaches the specified loss value.
  • when allocating the corresponding amount of training data to each GPU according to the processing performance parameters, the processor can determine the amount of training data corresponding to each GPU according to the positive correlation between processing performance and amount of training data.
  • Embodiment 6:
  • the convolutional neural network model synchronization training method provided in the embodiments of the application can be applied to the convolutional neural network model synchronization training cluster; in other words, the convolutional neural network model synchronization training cluster provided in the embodiments of the application can implement the synchronization training method.
  • in the traditional mode, at the start of each iteration each GPU obtains training data of the same batch_size from the training data set, and then each GPU uses the training data to start the calculations required in the current iteration;
  • after all GPUs complete the forward and backward calculations, the gradients calculated by each GPU are added and averaged, and the resulting average gradient value is used to update the model parameters.
  • in this process, if the GPUs' computing speeds differ even slightly, the faster GPU will stop after finishing its calculation and wait for the slower GPUs that have not yet completed their calculation, thereby reducing training efficiency.
  • the above method embodiments propose a method for synchronous training of convolutional neural network models and a synchronous training cluster for convolutional neural network models.
  • The following describes how the synchronous training method is implemented on the synchronous training cluster.
  • the performance analysis process can be set to analyze the performance of each model of GPU, and then find the appropriate batch_size number that enables each GPU to complete the training at the same time in a single iterative training. Then in the formal training process, according to the analysis results of the above analysis process, different sizes of batch_size training data are set for different models of GPUs, so that each GPU can complete the training in the same time during each round of iterative training. The idle waiting time of each GPU during the training process is effectively avoided, and the training efficiency is improved. At the same time, various GPUs of different models can be deployed to the same training cluster for effective use, avoiding waste of hardware resources.
  • the analysis process includes the following steps:
  • Step 1: It can be assumed that the size unit of the training data obtained from the training data set is a minibatch. For each GPU of a different model, GPU0 to GPUn, obtain 1000 minibatches of training data respectively;
  • Step 2: Each GPU uses the same network structure to perform 1000 iterations of computation on the 1000 minibatches (including forward and backward calculations), and the time spent in each iteration is recorded;
  • Step 3: Average the 1000 iteration times of each GPU to obtain t0, t1, ... tn, the single-minibatch single-iteration training times of GPU0, GPU1, ... GPUn;
  • Step 4: Obtain the least common multiple T of t0, t1, ... tn;
  • Step 5: From T, obtain the amount of training data each GPU requires in a single iteration: batch_size_i = T/ti*N, where N is a constant coefficient for adjusting the batch_size.
  • the distributed training process of the convolutional neural network model includes the following steps:
  • Step 1: According to the batch_size_i obtained from the performance analysis process, configure for each GPU of a different model the amount of data it should obtain in each iteration;
  • Step 2: Each GPUi obtains training data of size batch_size_i from the training data set according to its own configuration, and at the same time obtains the latest model parameters;
  • Step 3 Each GPUi performs forward calculation and backward calculation respectively to obtain the gradient value
  • Step 4 After each GPU completes a single iteration operation at the same time, use the average gradient value to update the model parameters;
  • Step 5 Return to step 2, and loop training until the model converges.
  • the training process applying the convolutional neural network model synchronization training method on the synchronization training cluster, i.e., a synchronous data-parallel CNN model training method for clusters of GPUs of different models, improves the utilization of existing resources by adding GPUs of various models to the same training cluster; and, by adding the analysis process, i.e., obtaining a batch matched to performance, it minimizes the idle waiting time of the GPUs in each round of iterative training, thereby improving training efficiency.
  • the embodiment of the present application also provides a readable storage medium.
  • the readable storage medium described below and the method for synchronous training of a convolutional neural network model described above may be cross-referenced.
  • a readable storage medium in which a computer program is stored, and when the computer program is executed by a processor, the steps of the method for synchronous training of a convolutional neural network model in the foregoing method embodiment are implemented.
  • the readable storage medium may specifically be any readable storage medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Abstract

A convolutional neural network model synchronous training method, cluster, and readable storage medium. When training the convolutional neural network, the method allocates training samples to each GPU according to its amount of training data. The differences between the times the GPUs take to train the same convolutional neural network on their allocated amounts of training data are thereby reduced, the waiting time between GPUs can be shortened, and training efficiency is improved. At the same time, in this method, the models of the GPUs used to train the convolutional neural network may differ, so idle GPUs can be reduced and hardware costs lowered.

Description

Convolutional neural network model synchronous training method, cluster, and readable storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on September 25, 2019, with application number 201910912956.1 and invention title "Convolutional neural network model synchronous training method, cluster, and readable storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of computer application technology, and in particular to a convolutional neural network model synchronous training method, cluster, and readable storage medium.
Background
A CNN model is composed of a series of layers of different types (e.g., convolutional layers, fully connected layers), and a CNN model is usually trained with a dataset of labeled images. The goal of CNN training is to obtain a high-accuracy model in as short a time as possible. As convolutional neural networks (CNNs) are developed and used ever more widely, model sizes become larger and larger, e.g., hundreds of layers with a total of 10 to 20 million parameters. The growth in model size makes efficient model training all the more important. How to train a model to convergence in less time while reaching higher accuracy has long been a widely studied topic.
In recent years, great breakthroughs have been made in GPU hardware technology, network model structures, and training methods, yet the fact that single-machine training takes too long remains unavoidable. A great deal of work and research has been devoted to improving the efficiency of distributed neural network model training.
At present, synchronous data parallelism is the most commonly and widely used distributed model training method. This method can reduce the staleness of the weights used to compute the gradients, so that the model can ultimately reach a higher convergence accuracy. This method requires that the GPU models be exactly identical, to guarantee that all GPUs train at the same speed and thereby reduce mutual waiting time. In practice, because GPUs are expensive and updated extremely quickly, almost every research group owns GPU boards of several different models; training with only one model of GPU leaves GPUs of the other models idle, causing a great waste of resources.
In summary, how to train a convolutional neural network quickly on GPUs of different performance is a technical problem urgently to be solved by those skilled in the art.
Summary
The purpose of this application is to provide a convolutional neural network model synchronous training method, cluster, and readable storage medium, so as to train a convolutional neural network quickly on GPUs of different performance.
To solve the above technical problem, this application provides the following technical solutions:
A convolutional neural network model synchronous training method, including:
obtaining the processing performance parameter corresponding to each GPU of a different model;
determining a corresponding amount of training data for each of the GPUs according to the processing performance parameters;
allocating training samples to each of the GPUs according to the amounts of training data, the training samples being samples for training a convolutional neural network model;
obtaining from each of the GPUs a gradient value for adjusting the convolutional neural network model;
calculating the average gradient value of the gradient values, and updating model parameters with the average gradient value, so that each of the GPUs obtains the model parameters.
Preferably, obtaining the processing performance parameter corresponding to each GPU of a different model includes:
issuing the same data processing task to each of the GPUs simultaneously;
monitoring each of the GPUs to obtain the processing time each of the GPUs takes to complete the data processing task;
determining the processing performance parameters from the processing times.
Preferably, determining the processing performance parameters from the processing times includes:
calculating the least common multiple of all the processing times, and calculating the ratio of the least common multiple to each of the processing times;
selecting, from a preset constant coefficient set, the target constant coefficient closest to the ratio as the processing performance parameter.
Preferably, obtaining the processing performance parameter corresponding to each GPU of a different model includes:
obtaining the processing performance parameter corresponding to each GPU of a different model from a storage device according to the correspondence between GPU models and processing performance parameters.
Preferably, calculating the average gradient value of the gradient values includes:
determining a weighting coefficient for the gradient value of each GPU according to the processing performance parameters;
combining the gradient value of each of the GPUs with the corresponding weighting coefficient, and then calculating the average gradient value.
Preferably, when the number of training rounds reaches a specified value or the convolutional neural network model reaches a specified loss value, the method includes:
adjusting the learning rate for training the convolutional neural network model in each of the GPUs.
Preferably, allocating the corresponding amount of training data to each of the GPUs according to the processing performance parameters includes:
determining the amount of training data corresponding to each of the GPUs according to the positive correlation between processing performance and amount of training data.
Preferably, before inputting training samples to each of the GPUs according to the amounts of training data, the method further includes:
dividing a training sample total set into training sample subsets corresponding to the respective GPUs, the training sample size of each training sample subset matching the amount of training data.
Preferably, inputting training samples to each of the GPUs according to the amounts of training data includes:
inputting the corresponding training sample subset to each of the GPUs according to the correspondence between training sample subsets and GPUs.
Preferably, before inputting training samples to each of the GPUs according to the amounts of training data, the method further includes:
dividing the training sample total set into sample batches of multiple data sizes corresponding to the respective amounts of training data.
Preferably, inputting training samples to each of the GPUs according to the amounts of training data includes:
inputting the corresponding sample batch to each of the GPUs according to the correspondence between sample batches and GPUs.
A convolutional neural network model synchronous training cluster, including:
a processor, a plurality of GPUs of different models, and a storage device, the processor having a communication connection with each of the GPUs;
the storage device storing training samples of a convolutional neural network model;
each of the GPUs holding the convolutional neural network model;
wherein the processor is configured to obtain the processing performance parameter corresponding to each GPU of a different model; determine a corresponding amount of training data for each of the GPUs according to the processing performance parameters; allocate the training samples to each of the GPUs according to the amounts of training data; obtain from each of the GPUs a gradient value for adjusting the convolutional neural network model; and calculate the average gradient value of the gradient values and update model parameters with the average gradient value, so that each of the GPUs obtains the model parameters;
and the GPUs are configured to obtain the training samples from the storage device, train the convolutional neural network model with the training samples, feed the gradient values back to the processor, and obtain the model parameters from the processor.
Preferably, the processor is specifically configured to issue the same data processing task to each of the GPUs simultaneously; monitor each of the GPUs to obtain the processing time each of the GPUs takes to complete the data processing task; and determine the processing performance parameters from the processing times;
and each of the GPUs is specifically configured to execute the data processing task upon receiving it.
Preferably, the processor is specifically configured to determine a weighting coefficient for the gradient value of each GPU according to the processing performance parameters; and, after combining the gradient value of each of the GPUs with the corresponding weighting coefficient, calculate the average gradient value.
Preferably, the processor is specifically configured, before inputting training samples to each of the GPUs according to the amounts of training data, to:
divide a training sample total set into training sample subsets corresponding to the respective GPUs, the training sample size of each subset matching the amount of training data;
or divide the training sample total set into sample batches of multiple data sizes corresponding to the respective amounts of training data.
A readable storage medium storing a computer program that, when executed by a processor, implements the steps of the above convolutional neural network model synchronous training method.
With the method provided by the embodiments of this application, the processing performance parameter corresponding to each GPU of a different model is obtained; a corresponding amount of training data is determined for each GPU according to the processing performance parameters; training samples, i.e., samples for training a convolutional neural network model, are allocated to the GPUs according to those amounts; a gradient value for adjusting the convolutional neural network model is obtained from each GPU; and the average gradient value of the gradient values is calculated and used to update the model parameters, so that each GPU obtains the model parameters.
In this method, before the convolutional neural network is trained with synchronous data parallelism, the processing performance parameter corresponding to each GPU of a different model can be obtained, and the amount of training data corresponding to each GPU can then be determined from the processing performance parameters. When training the convolutional neural network, training samples are allocated to the GPUs according to these amounts. Because the amount of training data input to each GPU matches its processing performance parameter, the differences between the times the GPUs take to train the same convolutional neural network on their allocated amounts of data are reduced, the waiting time between GPUs is shortened, and training efficiency is improved. At the same time, in this method the models of the GPUs used to train the convolutional neural network may differ, so idle GPUs can be reduced and hardware costs lowered.
Correspondingly, the embodiments of this application also provide a convolutional neural network model synchronous training cluster and a readable storage medium corresponding to the above synchronous training method, with the above technical effects, which are not repeated here.
Brief Description of the Drawings
To explain the technical solutions in the embodiments of this application or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of this application; those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a convolutional neural network model synchronous training method in an embodiment of this application;
FIG. 2 is a schematic diagram of a flow for determining processing performance parameters in an embodiment of this application;
FIG. 3 is a schematic diagram of a specific flow for determining processing performance parameters in an embodiment of this application;
FIG. 4 is a schematic diagram of a flow for determining an average gradient value in an embodiment of this application;
FIG. 5 is an implementation flowchart of a convolutional neural network model synchronous training method in an embodiment of this application;
FIG. 6 is an implementation flowchart of another convolutional neural network model synchronous training method in an embodiment of this application;
FIG. 7 is a schematic diagram of the composition of a convolutional neural network model synchronous training cluster in an embodiment of this application;
FIG. 8 is a schematic diagram of a traditional synchronous data-parallel training mode;
FIG. 9 is a schematic diagram of a performance analysis flow in an embodiment of this application;
FIG. 10 is a schematic diagram of a distributed training mode in an embodiment of this application.
Detailed Description
To help those skilled in the art better understand the solutions of this application, this application is further described in detail below with reference to the drawings and specific embodiments. Obviously, the described embodiments are only some rather than all of the embodiments of this application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of this application without creative effort fall within the scope of protection of this application.
Embodiment 1:
For ease of understanding, the hardware framework to which the convolutional neural network model synchronous training method provided by the embodiments of this application applies is introduced first. See FIG. 7, which is a schematic diagram of a convolutional neural network model synchronous training cluster to which the method is applicable. The cluster includes one processor, multiple GPUs of different models, and a storage device; the processor has a communication connection with each GPU. Here, "multiple GPUs of different models" means that at least two GPU model types exist in the cluster.
The processor may be a central processing unit (CPU), an application-specific integrated circuit, a digital signal processor, a field-programmable gate array, or another programmable logic device.
Please refer to FIG. 1, a flowchart of a convolutional neural network model synchronous training method in an embodiment of this application. The method includes the following steps:
S101: Obtain the processing performance parameter corresponding to each GPU of a different model.
Here, a processing performance parameter is a parameter that characterizes the processing performance of a GPU.
In this embodiment, there may be multiple GPUs, e.g., five, and there are at least two GPU models among them.
The processing performance parameters corresponding to the GPUs of different models may be obtained in any of the following ways:
Way 1: direct readout.
Obtain the processing performance parameter corresponding to each GPU of a different model from the storage device according to the correspondence between GPU models and processing performance parameters. The processing performance parameter can be expressed directly by existing GPU performance parameters: at present, two important characteristics reflect GPU computing capability, namely the number of CUDA cores and the memory size, and two important indicators describe GPU performance, namely peak compute performance and memory bandwidth. The processing performance parameter can also be obtained by recording in advance the different processing times of the same task, using them to characterize processing performance, and storing them in the storage device.
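For illustration only (not part of the patent), one way to read such capability indicators programmatically is sketched below; it assumes a CUDA-enabled PyTorch build and uses the multiprocessor count as a proxy for CUDA core count, since CUDA cores scale with the number of streaming multiprocessors.

```python
# Hypothetical sketch: listing per-GPU capability indicators with PyTorch.
import torch

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    # multi_processor_count: number of SMs; total_memory: device memory in bytes.
    print(f"GPU{i}: {props.name}, SMs={props.multi_processor_count}, "
          f"memory={props.total_memory / 2**30:.1f} GiB")
```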
Way 2: computed from the processing times the GPUs take to complete the same task.
Please refer to FIG. 2; the acquisition process includes:
S11: Issue the same data processing task to all GPUs simultaneously;
S12: Monitor each GPU to obtain the processing time each GPU takes to complete the data processing task;
S13: Determine the processing performance parameters from the processing times.
The data processing task may be a training task for a convolutional neural network, or a common GPU processing task such as rendering an image. To obtain the processing performance parameters quickly, a task with a relatively short duration is preferable.
In this embodiment, the same data processing task can be issued to all GPUs. After receiving the task, each GPU executes it. The processor can then monitor the task execution of each GPU and obtain the processing time each GPU takes to complete the task. Since better processing performance means shorter processing time, the processing performance parameters can be determined from the processing times.
Please refer to FIG. 3; the specific implementation of S13 includes:
S131: Calculate the least common multiple of all the processing times, and calculate the ratio of the least common multiple to each processing time;
S132: Select from a preset constant coefficient set the target constant coefficient closest to the ratio as the processing performance parameter.
For ease of description, S131 and S132 are explained together below.
In this embodiment, the least common multiple of all processing times can be calculated, followed by the ratio of the least common multiple to each processing time. Considering that computers usually compute in binary, in this embodiment the target constant coefficient closest to each ratio can be selected from a preset constant coefficient set as the processing performance parameter.
For example, with GPU1, GPU2, GPU3, and GPU4, suppose that for data processing task A the processing time is 5 seconds for GPU1, 50 seconds for GPU2, 20 seconds for GPU3, and 10 seconds for GPU4. The least common multiple of all the processing times is then 100 seconds; the ratio of the least common multiple to GPU1's processing time is 100:5 = 20, to GPU2's is 100:50 = 2, to GPU3's is 100:20 = 5, and to GPU4's is 100:10 = 10. The processing performance parameter may then be taken as 16 for GPU1, 2 for GPU2, 4 for GPU3, and 8 for GPU4.
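A minimal sketch of S131/S132 follows, reproducing the example above. The power-of-two coefficient set is an assumption consistent with the binary-friendly values in the example, not a set mandated by the patent.

```python
from functools import reduce
from math import gcd

def lcm(values):
    """Least common multiple of integer processing times."""
    return reduce(lambda a, b: a * b // gcd(a, b), values)

def performance_parameters(times, coefficients=(1, 2, 4, 8, 16, 32, 64)):
    T = lcm(times)                         # S131: LCM of all processing times
    ratios = [T / t for t in times]        # S131: LCM / per-GPU time
    # S132: nearest preset constant coefficient to each ratio
    return [min(coefficients, key=lambda c: abs(c - r)) for r in ratios]

print(performance_parameters([5, 50, 20, 10]))  # -> [16, 2, 4, 8]
```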
Way 3: monitor each GPU and determine the amount of tasks it completes per unit time.
Since better processing performance corresponds to more tasks completed, the amount of tasks completed per unit time can also characterize a GPU's processing performance, and the processing performance parameters can be determined from it. The specific implementation can follow Way 2 above and is not repeated here.
S102: Determine a corresponding amount of training data for each GPU according to the processing performance parameters.
The processing performance parameters characterize the processing performance of each GPU. In this embodiment, to reduce mutual waiting time, a corresponding amount of training data is determined for each different GPU based on the processing performance parameters.
Specifically, the amount of training data corresponding to each GPU can be determined according to the positive correlation between processing performance and amount of training data. For example, if better processing performance corresponds to a larger processing performance parameter, a GPU with a larger parameter can be allocated more training data, i.e., a larger amount of training data; if better processing performance corresponds to a smaller parameter, a GPU with a smaller parameter can be allocated more training data, i.e., a larger amount of training data. When determining the amounts of training data, a linear division can be used, i.e., the ratio of the processing performance parameter to the corresponding amount of training data is the same for every GPU.
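The linear division can be sketched as follows. Splitting a global batch so that every GPU's batch size divided by its performance parameter is equal is one straightforward reading; the rounding policy is an assumption.

```python
def allocate_training_data(global_batch, perf_params):
    """Split global_batch so batch_size_i / perf_params[i] is constant."""
    total = sum(perf_params)
    sizes = [global_batch * p // total for p in perf_params]
    sizes[0] += global_batch - sum(sizes)  # hand any rounding remainder to GPU0
    return sizes

print(allocate_training_data(300, [16, 2, 4, 8]))  # -> [160, 20, 40, 80]
```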
S103: Allocate training samples to each GPU according to the amounts of training data.
The training samples are samples for training the convolutional neural network model.
In this embodiment, training samples for training the convolutional neural network model can be obtained in advance. After the amount of training data corresponding to each GPU is determined, the training samples can be allocated to the GPUs according to those amounts.
After each GPU obtains its corresponding training samples, it trains the convolutional neural network model. It should be noted that, in this embodiment, the convolutional neural network model trained by all GPUs during the same period is the same model. For the training process of the convolutional neural network model, see common model training procedures, which are not repeated here. After one round of training is completed, the processor can determine from the training results the model parameters established by the current round of training. Of course, in other specific embodiments, a device other than the processor (such as any GPU participating in the model training, or another dedicated model-parameter device) may determine the model parameters based on the gradient values fed back by the GPUs.
S104: Obtain from each GPU a gradient value for adjusting the convolutional neural network model.
In this embodiment, the gradient values for adjusting the convolutional neural network model can be obtained by monitoring the GPUs, or by receiving the training results fed back by the GPUs.
S105: Calculate the average gradient value of the gradient values, and update the model parameters with the average gradient value, so that each GPU obtains the model parameters.
After the gradient values used by the GPUs to adjust the convolutional neural network model are obtained, they can be averaged. The average gradient value is then used to update the model parameters, i.e., the model parameters are adjusted and updated according to the average gradient value. The processor can feed the updated model parameters back to the GPUs directly, or each GPU can fetch them itself. After a GPU obtains the updated model parameters, it uses them as the model parameters of the convolutional neural network for the current training round. In this way, the different GPUs can train the convolutional neural network model in parallel with the model parameters corresponding to the current training round.
Preferably, considering that the learning rate is a very important parameter in model training, which can directly affect whether the model converges, and that different learning-rate change strategies also affect the final iteration results, the learning rate for training the convolutional neural network model in each GPU can be adjusted when the number of training rounds reaches a specified value or the convolutional neural network model reaches a specified loss value.
With the method provided by the embodiments of this application, the processing performance parameter corresponding to each GPU of a different model is obtained; a corresponding amount of training data is determined for each GPU according to the processing performance parameters; training samples, i.e., samples for training a convolutional neural network model, are allocated to the GPUs according to those amounts; a gradient value for adjusting the convolutional neural network model is obtained from each GPU; and the average gradient value of the gradient values is calculated and used to update the model parameters, so that each GPU obtains the model parameters.
In this method, before the convolutional neural network is trained with synchronous data parallelism, the processing performance parameter corresponding to each GPU of a different model can be obtained, and the amount of training data corresponding to each GPU can then be determined from the processing performance parameters. When training the convolutional neural network, training samples are allocated to the GPUs according to these amounts. Because the amount of training data input to each GPU matches its processing performance parameter, the differences between the times the GPUs take to train the same convolutional neural network on their allocated amounts of data are reduced, the waiting time between GPUs is shortened, and training efficiency is improved. At the same time, in this method the models of the GPUs used to train the convolutional neural network may differ, so idle GPUs can be reduced and hardware costs lowered.
It should be noted that, based on the above embodiment, the embodiments of this application also provide corresponding improvements. Steps in the preferred/improved embodiments that are the same as or correspond to steps in the above embodiment can be cross-referenced, as can the corresponding beneficial effects; they are not repeated one by one in the preferred/improved embodiments below.
Embodiment 2:
Preferably, considering that the GPUs differ in processing performance, for a GPU allocated more training samples, the gradient value obtained by its training should be more in line with actual requirements than the gradient value obtained by a GPU allocated fewer training samples. On this basis, in this embodiment, different weighting coefficients can be set for different GPUs when calculating the average gradient value.
Please refer to FIG. 4. Calculating the average gradient value used to adjust the model parameters in S105 above may specifically include:
S251: Determine a weighting coefficient for the gradient value of each GPU according to the processing performance parameters.
Specifically, when the processing performance parameter is positively correlated with processing performance, the larger the processing performance parameter, the larger the corresponding weighting coefficient; when the processing performance parameter is negatively correlated with processing performance, the smaller the processing performance parameter, the smaller the corresponding weighting coefficient.
S252: Combine the gradient value of each GPU with its corresponding weighting coefficient, and then calculate the average gradient value.
That is, the average gradient value is calculated by weighted averaging.
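A sketch of S251/S252 follows. Normalizing the performance parameters into weights that sum to 1 is one plausible weighting scheme, not one the patent fixes.

```python
import numpy as np

def weighted_average_gradient(gradients, perf_params):
    """gradients: per-GPU gradient arrays of identical shape."""
    weights = np.asarray(perf_params, dtype=np.float64)
    weights /= weights.sum()               # S251: weights from perf parameters
    # S252: combine each gradient with its coefficient, then sum
    return sum(w * g for w, g in zip(weights, gradients))

grads = [np.full(3, v) for v in (1.0, 2.0, 3.0, 4.0)]
print(weighted_average_gradient(grads, [16, 2, 4, 8]))
```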
Embodiment 3:
To help those skilled in the art better understand how different training samples are allocated to the GPUs, the specific allocation of training samples is described below.
Please refer to FIG. 5, an implementation flowchart of a convolutional neural network model synchronous training method in an embodiment of this application. The method includes:
S301: Obtain the processing performance parameter corresponding to each GPU of a different model.
S302: Determine a corresponding amount of training data for each GPU according to the processing performance parameters.
S303: Divide the training sample total set into training sample subsets corresponding to the respective GPUs; the sample size of each subset matches the corresponding amount of training data.
Specifically, all training samples are divided among the GPUs in a way that matches the amounts of training data. For example, with 6 GPUs, all training samples are divided into 6 training sample subsets, the sample size of each subset matching the corresponding amount of training data.
S304: Input the corresponding training sample subset to each GPU according to the correspondence between training sample subsets and GPUs.
Specifically, the storage addresses of the different training sample subsets can be sent to the GPUs, so that each GPU reads its own subset. Alternatively, the CPU can read the subsets into memory and then send each GPU its corresponding subset.
S305: Obtain from each GPU a gradient value for adjusting the convolutional neural network model.
S306: Calculate the average gradient value of the gradient values, and update the model parameters with the average gradient value, so that each GPU obtains the model parameters.
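A minimal sketch of the subset division in S303 follows; the helper and the sizes are illustrative assumptions.

```python
import numpy as np

def split_into_subsets(samples, data_amounts):
    """Divide the sample total set into per-GPU subsets of the given sizes."""
    boundaries = np.cumsum(data_amounts)[:-1]
    return np.split(samples, boundaries)

samples = np.arange(300)                       # stand-in for the total set
subsets = split_into_subsets(samples, [160, 20, 40, 80])
print([len(s) for s in subsets])               # -> [160, 20, 40, 80]
```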
Embodiment 4:
To help those skilled in the art better understand how different training samples are allocated to the GPUs, another way of allocating training samples is described below.
Please refer to FIG. 6, an implementation flowchart of another convolutional neural network model synchronous training method in an embodiment of this application. The method includes:
S401: Obtain the processing performance parameter corresponding to each GPU of a different model.
S402: Determine a corresponding amount of training data for each GPU according to the processing performance parameters.
S403: Divide the training sample total set into sample batches of multiple data sizes corresponding to the respective amounts of training data.
To improve sample input efficiency, samples can be allocated to the GPUs in batches. The training sample total set can be divided directly into sample batches of different data sizes, i.e., one GPU model corresponds to one batch size.
S404: Input the corresponding sample batch to each GPU according to the correspondence between sample batches and GPUs.
After the training sample total set is divided into batches of different sizes, the corresponding batch can be input to each GPU according to the correspondence between sample batches and GPUs.
S405: Obtain from each GPU a gradient value for adjusting the convolutional neural network model.
S406: Calculate the average gradient value of the gradient values, and update the model parameters with the average gradient value, so that each GPU obtains the model parameters.
Embodiment 5:
Corresponding to the above method embodiments, the embodiments of this application also provide a convolutional neural network model synchronous training cluster. The cluster described below and the method described above can be cross-referenced.
Please refer to FIG. 7, a schematic diagram of the composition of a convolutional neural network model synchronous training cluster in an embodiment of this application. The cluster includes:
a processor 100, multiple GPUs 200 of different models, and a storage device 300, the processor having a communication connection with each GPU;
the storage device storing the training samples of the convolutional neural network model;
each GPU holding the convolutional neural network model;
wherein the processor is configured to obtain the processing performance parameter corresponding to each GPU of a different model; determine a corresponding amount of training data for each GPU according to the processing performance parameters; allocate training samples to each GPU according to the amounts of training data; obtain from each GPU a gradient value for adjusting the convolutional neural network model; and calculate the average gradient value of the gradient values and update the model parameters with it, so that each GPU obtains the model parameters;
and the GPUs are configured to obtain the training samples from the storage device, train the convolutional neural network model with them, feed the gradient values back to the processor, and obtain the model parameters from the processor.
With the cluster provided by the embodiments of this application, the processing performance parameter corresponding to each GPU of a different model is obtained; a corresponding amount of training data is determined for each GPU according to the processing performance parameters; training samples, i.e., samples for training a convolutional neural network model, are allocated to the GPUs according to those amounts; a gradient value for adjusting the convolutional neural network model is obtained from each GPU; and the average gradient value of the gradient values is calculated and used to update the model parameters, so that each GPU obtains the model parameters.
In this cluster, before the convolutional neural network is trained with synchronous data parallelism, the processing performance parameter corresponding to each GPU of a different model can be obtained, and the amount of training data corresponding to each GPU can then be determined from the processing performance parameters. When training the convolutional neural network, training samples are allocated to the GPUs according to these amounts. Because the amount of training data input to each GPU matches its processing performance parameter, the differences between the times the GPUs take to train the same convolutional neural network on their allocated amounts of data are reduced, the waiting time between GPUs is shortened, and training efficiency is improved. At the same time, in this cluster the models of the GPUs used to train the convolutional neural network may differ, so idle GPUs can be reduced and hardware costs lowered.
In a specific embodiment of this application, the processor is specifically configured to issue the same data processing task to all GPUs simultaneously; monitor each GPU to obtain the processing time each GPU takes to complete the data processing task; and determine the processing performance parameters from the processing times;
and each GPU is specifically configured to execute the data processing task upon receiving it.
Determining the processing performance parameters from the processing times may specifically include:
calculating the least common multiple of all the processing times, and calculating the ratio of the least common multiple to each processing time;
selecting from a preset constant coefficient set the target constant coefficient closest to the ratio as the processing performance parameter.
In a specific embodiment of this application, the processor is specifically configured to determine a weighting coefficient for the gradient value of each GPU according to the processing performance parameters, combine the gradient value of each GPU with its corresponding weighting coefficient, and then calculate the average gradient value.
In a specific embodiment of this application, the processor is specifically configured, before inputting training samples to each GPU according to the amounts of training data, to:
divide the training sample total set into training sample subsets corresponding to the respective GPUs, the sample size of each subset matching the corresponding amount of training data;
or divide the training sample total set into sample batches of multiple data sizes corresponding to the respective amounts of training data.
Correspondingly, the processor inputs training samples to each GPU according to the amounts of training data specifically by inputting the corresponding training sample subset to each GPU according to the correspondence between training sample subsets and GPUs, or by inputting the corresponding sample batch to each GPU according to the correspondence between sample batches and GPUs.
In a specific embodiment of this application, the processor obtains the processing performance parameter corresponding to each GPU of a different model from the storage device according to the correspondence between GPU models and processing performance parameters.
In a specific embodiment of this application, the processor is specifically configured to adjust the learning rate for training the convolutional neural network model in each GPU when the number of training rounds reaches a specified value or the convolutional neural network model reaches a specified loss value.
In a specific embodiment of this application, when allocating the corresponding amounts of training data to the GPUs according to the processing performance parameters, the processor can determine the amount of training data corresponding to each GPU according to the positive correlation between processing performance and amount of training data.
Embodiment 6:
The convolutional neural network model synchronous training method provided by the embodiments of this application can be applied to the convolutional neural network model synchronous training cluster; in other words, the cluster provided by the embodiments of this application can implement the method. To help those skilled in the art better understand how the method is applied to the cluster, a specific application scenario is described in detail below with reference to the existing training approach.
The traditional synchronous data-parallel training mode is shown in FIG. 8. At the start of each iteration, every GPU obtains training data of the same batch_size from the training dataset, and then each GPU uses the training data to start computing. The current iteration requires all GPUs to complete their forward and backward computations and obtain the gradient values; the gradients computed by the GPUs are then summed and averaged, and the resulting average gradient value is used to update the model parameters. In this process, if the GPUs' computing speeds differ even slightly, a faster GPU will stop after finishing its computation and wait for the slower GPUs that have not yet finished, reducing training efficiency.
To make full use of GPU resources of different models in the same cluster, avoid wasting resources, and at the same time improve training efficiency, the above method embodiments propose a convolutional neural network model synchronous training method and a convolutional neural network model synchronous training cluster. The implementation of the method on the cluster is described below.
A performance analysis flow can be set up to analyze the performance of each GPU model and find suitable batch_size values that allow all GPUs to complete a single training iteration at the same time. Then, in the formal training process, training data of different batch_size values is assigned to the different GPU models according to the results of the analysis flow, so that in each round of iterative training all GPUs finish within the same time. This effectively avoids idle waiting time among the GPUs during training and improves training efficiency. Meanwhile, GPUs of various models can all be deployed to the same training cluster and used effectively, avoiding waste of hardware resources.
The specific implementation is as follows:
First, perform the performance analysis. Please refer to FIG. 9; the analysis flow includes the following steps:
Step 1. Suppose the unit of training data obtained from the training dataset is a minibatch. For each GPU of a different model, GPU0 to GPUn, obtain 1000 minibatches of training data respectively;
Step 2. Each GPU performs 1000 iterations of computation on the 1000 minibatches (including forward and backward computation) using the same network structure, and the time of each iteration is recorded;
Step 3. Average the 1000 iteration times of each GPU to obtain t0, t1, ..., tn, the single-minibatch single-iteration training times of GPU0, GPU1, ..., GPUn;
Step 4. Obtain the least common multiple T of t0, t1, ..., tn;
Step 5. From the least common multiple T, obtain the amount of training data each GPU requires in a single iteration: batch_size_i = T/ti*N, where N is a constant coefficient for adjusting the batch_size.
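A condensed sketch of Steps 1 to 5 under stated assumptions: train_one_minibatch is a hypothetical stand-in for one forward+backward pass, and because measured times are floats (for which a least common multiple is not defined), the maximum time is substituted for T; only the ratios T/ti matter for batch_size_i.

```python
import time

def profile_batch_sizes(gpus, train_one_minibatch, N=1, iters=1000):
    avg_times = []
    for gpu in gpus:                        # Steps 1-3: time each GPU model
        start = time.perf_counter()
        for _ in range(iters):
            train_one_minibatch(gpu)
        avg_times.append((time.perf_counter() - start) / iters)  # t_i
    T = max(avg_times)                      # Step 4 (float-friendly stand-in)
    # Step 5: batch_size_i = T / t_i * N
    return [max(1, round(T / t * N)) for t in avg_times]
```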
Please refer to FIG. 10. The distributed training process of the convolutional neural network model specifically includes the following steps:
Step 1. According to the batch_size_i obtained from the performance analysis flow, configure for each GPU of a different model the amount of data it should obtain in each iteration;
Step 2. Each GPUi obtains training data of size batch_size_i from the training dataset according to its own configuration, and at the same time obtains the latest model parameters;
Step 3. Each GPUi performs forward and backward computation respectively to obtain its gradient value;
Step 4. After all GPUs complete the single iteration at the same time, update the model parameters with the average gradient value;
Step 5. Return to Step 2 and train in a loop until the model converges.
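A sketch of the loop in Steps 1 to 5. The compute_gradient, update_model, and converged callables are hypothetical helpers, and the per-GPU work is written sequentially for clarity although on the cluster it runs in parallel.

```python
def train_until_converged(gpus, batch_sizes, dataset, params,
                          compute_gradient, update_model, converged):
    while not converged(params):
        gradients = []
        for gpu, bs in zip(gpus, batch_sizes):   # Step 2: each GPUi fetches
            batch = dataset.next_batch(bs)       # batch_size_i training data
            gradients.append(compute_gradient(gpu, params, batch))  # Step 3
        avg_grad = sum(gradients) / len(gradients)
        params = update_model(params, avg_grad)  # Step 4
    return params                                # Step 5: loop to convergence
```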
It can be seen that the training process applying the convolutional neural network model synchronous training method on the synchronous training cluster, i.e., a synchronous data-parallel CNN model training method for clusters of GPUs of different models, improves the utilization of existing resources by adding GPUs of various models to the same training cluster; and, by adding the analysis flow, i.e., obtaining a batch matched to performance, it minimizes the idle waiting time of the GPUs in each round of iterative training, thereby improving training efficiency.
Embodiment 7:
Corresponding to the above method embodiments, the embodiments of this application also provide a readable storage medium. The readable storage medium described below and the convolutional neural network model synchronous training method described above can be cross-referenced.
A readable storage medium stores a computer program that, when executed by a processor, implements the steps of the convolutional neural network model synchronous training method of the above method embodiments.
The readable storage medium may specifically be any readable storage medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
Professionals may further realize that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above in general terms of their functions. Whether these functions are executed in hardware or in software depends on the specific application and the design constraints of the technical solution. Professionals may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this application.

Claims (16)

  1. A convolutional neural network model synchronous training method, characterized by comprising:
    obtaining the processing performance parameter corresponding to each GPU of a different model;
    determining a corresponding amount of training data for each of the GPUs according to the processing performance parameters;
    allocating training samples to each of the GPUs according to the amounts of training data, the training samples being samples for training a convolutional neural network model;
    obtaining from each of the GPUs a gradient value for adjusting the convolutional neural network model;
    calculating the average gradient value of the gradient values, and updating model parameters with the average gradient value, so that each of the GPUs obtains the model parameters.
  2. The convolutional neural network model synchronous training method according to claim 1, characterized in that obtaining the processing performance parameter corresponding to each GPU of a different model comprises:
    issuing the same data processing task to each of the GPUs simultaneously;
    monitoring each of the GPUs to obtain the processing time each of the GPUs takes to complete the data processing task;
    determining the processing performance parameters from the processing times.
  3. The convolutional neural network model synchronous training method according to claim 2, characterized in that determining the processing performance parameters from the processing times comprises:
    calculating the least common multiple of all the processing times, and calculating the ratio of the least common multiple to each of the processing times;
    selecting, from a preset constant coefficient set, the target constant coefficient closest to the ratio as the processing performance parameter.
  4. The convolutional neural network model synchronous training method according to claim 1, characterized in that obtaining the processing performance parameter corresponding to each GPU of a different model comprises:
    obtaining the processing performance parameter corresponding to each GPU of a different model from a storage device according to the correspondence between GPU models and processing performance parameters.
  5. The convolutional neural network model synchronous training method according to claim 1, characterized in that calculating the average gradient value of the gradient values comprises:
    determining a weighting coefficient for the gradient value of each GPU according to the processing performance parameters;
    combining the gradient value of each of the GPUs with the corresponding weighting coefficient, and then calculating the average gradient value.
  6. The convolutional neural network model synchronous training method according to claim 1, characterized in that, when the number of training rounds reaches a specified value or the convolutional neural network model reaches a specified loss value, the method comprises:
    adjusting the learning rate for training the convolutional neural network model in each of the GPUs.
  7. The convolutional neural network model synchronous training method according to claim 1, characterized in that allocating the corresponding amount of training data to each of the GPUs according to the processing performance parameters comprises:
    determining the amount of training data corresponding to each of the GPUs according to the positive correlation between processing performance and amount of training data.
  8. The convolutional neural network model synchronous training method according to claim 1, characterized in that, before inputting training samples to each of the GPUs according to the amounts of training data, the method further comprises:
    dividing a training sample total set into training sample subsets corresponding to the respective GPUs, the training sample size of each training sample subset matching the amount of training data.
  9. The convolutional neural network model synchronous training method according to claim 8, characterized in that inputting training samples to each of the GPUs according to the amounts of training data comprises:
    inputting the corresponding training sample subset to each of the GPUs according to the correspondence between training sample subsets and GPUs.
  10. The convolutional neural network model synchronous training method according to claim 1, characterized in that, before inputting training samples to each of the GPUs according to the amounts of training data, the method further comprises:
    dividing the training sample total set into sample batches of multiple data sizes corresponding to the respective amounts of training data.
  11. The convolutional neural network model synchronous training method according to claim 10, characterized in that inputting training samples to each of the GPUs according to the amounts of training data comprises:
    inputting the corresponding sample batch to each of the GPUs according to the correspondence between sample batches and GPUs.
  12. A convolutional neural network model synchronous training cluster, characterized by comprising:
    a processor, a plurality of GPUs of different models, and a storage device, the processor having a communication connection with each of the GPUs;
    the storage device storing training samples of a convolutional neural network model;
    each of the GPUs holding the convolutional neural network model;
    wherein the processor is configured to obtain the processing performance parameter corresponding to each GPU of a different model; determine a corresponding amount of training data for each of the GPUs according to the processing performance parameters; allocate the training samples to each of the GPUs according to the amounts of training data; obtain from each of the GPUs a gradient value for adjusting the convolutional neural network model; and calculate the average gradient value of the gradient values and update model parameters with the average gradient value, so that each of the GPUs obtains the model parameters;
    and the GPUs are configured to obtain the training samples from the storage device, train the convolutional neural network model with the training samples, feed the gradient values back to the processor, and obtain the model parameters from the processor.
  13. The convolutional neural network model synchronous training cluster according to claim 12, characterized in that the processor is specifically configured to issue the same data processing task to each of the GPUs simultaneously; monitor each of the GPUs to obtain the processing time each of the GPUs takes to complete the data processing task; and determine the processing performance parameters from the processing times;
    and each of the GPUs is specifically configured to execute the data processing task upon receiving it.
  14. The convolutional neural network model synchronous training cluster according to claim 12, characterized in that the processor is specifically configured to determine a weighting coefficient for the gradient value of each GPU according to the processing performance parameters; and, after combining the gradient value of each of the GPUs with the corresponding weighting coefficient, calculate the average gradient value.
  15. The convolutional neural network model synchronous training cluster according to claim 12, characterized in that the processor is specifically configured, before inputting training samples to each of the GPUs according to the amounts of training data, to:
    divide a training sample total set into training sample subsets corresponding to the respective GPUs, the training sample size of each subset matching the amount of training data;
    or divide the training sample total set into sample batches of multiple data sizes corresponding to the respective amounts of training data.
  16. A readable storage medium, characterized in that a computer program is stored on the readable storage medium, and when executed by a processor, the computer program implements the steps of the convolutional neural network model synchronous training method according to any one of claims 1 to 11.
PCT/CN2019/108442 2019-09-25 2019-09-27 Convolutional neural network model synchronous training method, cluster, and readable storage medium WO2021056390A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910912956.1 2019-09-25
CN201910912956.1A CN110705705B (zh) 2019-09-25 Convolutional neural network model synchronous training method, cluster, and readable storage medium

Publications (1)

Publication Number Publication Date
WO2021056390A1 (zh) 2021-04-01

Family

ID=69197652

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/108442 WO2021056390A1 (zh) 2019-09-25 2019-09-27 Convolutional neural network model synchronous training method, cluster, and readable storage medium

Country Status (2)

Country Link
CN (1) CN110705705B (zh)
WO (1) WO2021056390A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114356540A (zh) * 2021-10-30 2022-04-15 腾讯科技(深圳)有限公司 Parameter update method and apparatus, electronic device, and storage medium

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111722923A (zh) * 2020-05-29 2020-09-29 浪潮电子信息产业股份有限公司 Heterogeneous resource invocation method and apparatus, and computer-readable storage medium
CN113743570A (zh) * 2020-05-29 2021-12-03 华为技术有限公司 Neural network training method and related device
CN111738415B (zh) * 2020-06-17 2023-07-04 北京字节跳动网络技术有限公司 Model synchronous update method and apparatus, and electronic device
CN111738416B (zh) * 2020-06-17 2023-07-18 北京字节跳动网络技术有限公司 Model synchronous update method and apparatus, and electronic device
CN111860867B (zh) * 2020-07-24 2023-01-10 苏州浪潮智能科技有限公司 Model training method and system for a hybrid heterogeneous system, and related apparatus
CN112508191A (zh) * 2020-12-14 2021-03-16 北京地平线信息技术有限公司 Method and apparatus for training a deep learning model, electronic device, and storage medium
CN113011563A (zh) * 2021-03-19 2021-06-22 北京大学 GPU-based batch normalization method for convolutional neural networks
CN113327598B (zh) * 2021-06-30 2023-11-14 北京有竹居网络技术有限公司 Model training method, speech recognition method, apparatus, medium, and device
CN114707532B (zh) * 2022-01-11 2023-05-19 中铁隧道局集团有限公司 Ground-penetrating-radar tunnel defect target detection method based on an improved Cascade R-CNN
CN114492801A (zh) * 2022-04-18 2022-05-13 中国科学院自动化研究所 Neural network training method, apparatus, and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180121806A1 (en) * 2016-10-27 2018-05-03 International Business Machines Corporation Efficient parallel training of a network model on multiple graphics processing units
CN108021395A (zh) * 2017-12-27 2018-05-11 北京金山安全软件有限公司 Data parallel processing method and system for neural networks
CN109657793A (zh) * 2018-12-26 2019-04-19 广州小狗机器人技术有限公司 Model training method and apparatus, storage medium, and electronic device
CN109902818A (zh) * 2019-01-15 2019-06-18 中国科学院信息工程研究所 Distributed acceleration method and system for deep learning training tasks

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107018184B (zh) * 2017-03-28 2019-08-30 华中科技大学 Grouped synchronization optimization method and system for distributed deep neural network clusters
CN108182469A (zh) * 2017-12-27 2018-06-19 郑州云海信息技术有限公司 Neural network model training method, system, apparatus, and storage medium
CN108460457A (zh) * 2018-03-30 2018-08-28 苏州纳智天地智能科技有限公司 Multi-machine multi-card hybrid-parallel asynchronous training method for convolutional neural networks
CN108829517B (zh) * 2018-05-31 2021-04-06 中国科学院计算技术研究所 Training method and system for machine learning in a cluster environment


Also Published As

Publication number Publication date
CN110705705A (zh) 2020-01-17
CN110705705B (zh) 2022-04-22

Similar Documents

Publication Publication Date Title
WO2021056390A1 (zh) Convolutional neural network model synchronous training method, cluster, and readable storage medium
Le et al. Allox: compute allocation in hybrid clusters
EP2894564A1 (en) Job scheduling based on historical job data
Pastorelli et al. HFSP: size-based scheduling for Hadoop
US20200210228A1 (en) Scheduling Applications in CPU and GPU Hybrid Environments
US11868901B1 (en) Compiler for optimizing memory allocations within cores
CN112463390A (zh) Distributed task scheduling method and apparatus, terminal device, and storage medium
US20220300323A1 (en) Job Scheduling Method and Job Scheduling Apparatus
CN115586961A (zh) AI platform computing resource task scheduling method, apparatus, and medium
US20210390405A1 (en) Microservice-based training systems in heterogeneous graphic processor unit (gpu) cluster and operating method thereof
TWI758223B (zh) Computing method with dynamic minibatch sizes, and computing system and computer-readable storage media for performing the same
CN117061365A (zh) Node selection method, apparatus, device, and readable storage medium
CN111756802A (zh) Method and system for scheduling dataflow tasks on a NUMA platform
WO2020155083A1 (zh) Distributed training method and apparatus for neural networks
CN110222410A (zh) Electromagnetic environment simulation method based on Hadoop MapReduce
JP2020003860A (ja) Learning system, processing apparatus, processing method, and program
TWI770534B (zh) Performance tuning method, apparatus, device, and medium for an automated machine learning system
KR20220136426A (ko) Queue allocation in machine learning accelerators
CN113946274A (zh) Data processing method, apparatus, device, and medium
TWI831159B (zh) Storage capacity expansion method and apparatus, storage medium, and electronic device
WO2024001870A1 (zh) Artificial intelligence model training method and related device
CN113485805B (zh) Distributed computing adjustment method, apparatus, and device based on a heterogeneous acceleration platform
US11811862B1 (en) System and method for management of workload distribution
WO2024066847A1 (zh) Multi-die-based computing method and related device
CN114546279B (zh) IO request prediction method, apparatus, storage node, and readable storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19946305

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19946305

Country of ref document: EP

Kind code of ref document: A1


32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 16/09/2022)
